Decision Making

Planning Under Uncertainty

Bayesian Decision Making

We can also add context, so that V(decision, context) is the value of the decision in situation context
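
A sketch of the usual expected-value computation behind these definitions (the notation in the original figures may differ):

$$V(d) = \sum_{o} P(o \mid d)\, \mathrm{value}(o), \qquad V(d, c) = \sum_{o} P(o \mid d, c)\, \mathrm{value}(o)$$

where $o$ ranges over the possible outcomes of the decision.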

Preferences

Preferences Over Outcomes

If $o_1$ and $o_2$ are outcomes, $o_1 \succeq o_2$ means $o_1$ is at least as desirable as $o_2$; $o_1 \sim o_2$ means the agent is indifferent between them; and $o_1 \succ o_2$ means $o_1$ is strictly preferred.

Lotteries

An agent may not know the outcomes of its actions, but only have a probability distribution over the outcomes. A lottery is a probability distribution over outcomes, written $[p_1 : o_1, p_2 : o_2, \ldots, p_k : o_k]$, where the $o_i$ are outcomes and the $p_i$ are non-negative probabilities that sum to 1.

Properties of Preferences

Completeness

Theorem

If preferences follow the preceding properties, then the preferences can be measured by a function

$$\text{utility} : \text{outcomes} \rightarrow [0,1]$$

such that $o_1 \succeq o_2$ if and only if $\text{utility}(o_1) \ge \text{utility}(o_2)$, and utilities are linear in probabilities: the utility of a lottery $[p_1 : o_1, \ldots, p_k : o_k]$ is $\sum_i p_i\, \text{utility}(o_i)$.

Rationality and Irrationality

Rational agents act so as to maximize expected utility: they choose the action $a$ with the highest value of $\sum_{o} P(o \mid a)\, \text{utility}(o)$.

Prospect Theory - Tversky and Kahneman

Humans weight value differently for gains versus losses (loss aversion): losses loom larger than gains of the same size.
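
A common parametric form of the prospect-theory value function (a sketch of Tversky and Kahneman's formulation; the curves in the original figures may differ):

$$v(x) = \begin{cases} x^{\alpha} & x \ge 0 \\ -\lambda\,(-x)^{\beta} & x < 0 \end{cases}$$

with $\alpha \approx \beta \approx 0.88$ and loss-aversion coefficient $\lambda \approx 2.25$ in their estimates: concave for gains, convex for losses, and steeper for losses than for gains.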

Ultimatum Game
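
For reference: one player proposes how to split a sum of money, and the other either accepts the split or rejects it, in which case both players get nothing. People routinely reject low offers even though accepting any positive amount would maximize their monetary payoff, which is often cited as evidence against simple expected-utility maximization.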

Making Decisions Under Uncertainty

What an agent should do depends on:

Single Decisions

Decision variables are like random variables that an agent gets to choose the value of

Decision Networks

A decision network is a graphical representation of a finite sequential decision problem

Example

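A classic example of such a network (assuming the figure showed something like the standard weather example; the actual network may differ): random variables Weather and Forecast, with Forecast depending on Weather; a decision variable Umbrella whose parent is Forecast, since the agent sees the forecast before deciding; and a utility node that depends on Weather and Umbrella.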

Finding the Optimal Decision

Suppose the random variables are $X_1, \ldots, X_n$, the decision variable is $D$, and the utility depends on $X_{i_1}, \ldots, X_{i_k}$ and $D$. The expected utility of choosing $D = d$ is

$$E(u \mid D = d) = \sum_{X_{i_1}, \ldots, X_{i_k}} P(X_{i_1}, \ldots, X_{i_k} \mid D = d)\; u(X_{i_1}, \ldots, X_{i_k}, d)$$

and an optimal single decision is a value of $d$ that maximizes this expected utility.

Sequential Decisions

An intelligent agent doesn’t commit to a multi-step decision and carry it out blindly; it should be prepared to revise its decisions based on the information it receives along the way

Sequential decision problems

A sequential decision problem consists of a sequence of decision variables $D_1, \ldots, D_n$

Policies

A policy specifies what an agent should do under each circumstance

Expected Utility of a Policy

The expected utility of policy $\delta$ is

$$E(u \mid \delta) = \sum_{\omega} P(\omega \mid \delta)\, u(\omega)$$

where the sum is over all worlds $\omega$ (assignments of values to all the variables). An optimal policy is one with the highest expected utility.

Finding the Optimal Policy

  1. Create a factor for each conditional probability table and a factor for the utility
  2. Set remaining decision nodes ← all decision nodes
  3. Multiply factors and sum out variables that are not parents of a remaining decision node
  4. Select and remove a decision variable D from the list of remaining decision nodes:
    1. pick one that is in a factor with only itself and some of its parents (no children)
  5. Eliminate D by maximizing. This returns:
    1. the optimal decision function for D, $\arg\max_D f$
    2. a new factor to use, $\max_D f$
  6. Repeat 3-5 until there are no more remaining decision nodes
  7. Eliminate the remaining random variables
    1. Multiply the factors: this is the expected utility of the optimal policy
  8. If any nodes were in evidence, divide by $P(\text{evidence})$
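
The optimal policy consists of the decision functions returned in step 5, one per decision variable; the number obtained in step 7 (after the division in step 8, if needed) is its expected utility.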

Agents as Processes

Agents carry out actions:

Decision-Theoretic Planning

What should an agent do when

World State

The world state is the information such that, if you knew the world state, no information about the past would be relevant to the future (the Markov assumption)

Decision Processes

A Markov decision process augments a Markov chain with actions and values

Markov Decision Processes

For an MDP you specify:
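(typically) a set $S$ of states, a set $A$ of actions, a transition model $P(s' \mid s, a)$ giving the probability of ending up in state $s'$ after doing action $a$ in state $s$, and a reward function $R(s, a, s')$; infinite-horizon formulations also include a discount factor $\gamma$.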

Information Availability

What information is available when the agent decides what to do?

Rewards and Values

Suppose the agent receives the sequence of rewards $r_1, r_2, r_3, r_4, \ldots$ What value should be assigned to it?
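
Standard options include the total reward $\sum_i r_i$, the average reward $\lim_{n \to \infty}(r_1 + \cdots + r_n)/n$, and the discounted return with discount factor $0 \le \gamma < 1$ (used in the sketches below):

$$V = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots = \sum_{i=1}^{\infty} \gamma^{i-1} r_i$$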

Policies

A stationary policy is a function:

$$\pi : S \rightarrow A$$

Given a state $s$, $\pi(s)$ specifies what action an agent following $\pi$ will do

Value of a Policy
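
For a stationary policy $\pi$, a standard formulation (assuming transition model $P(s' \mid s, a)$, rewards $R(s, a, s')$, and discount $\gamma$) is

$$V^{\pi}(s) = \sum_{s'} P(s' \mid s, \pi(s))\,\bigl(R(s, \pi(s), s') + \gamma\, V^{\pi}(s')\bigr)$$
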
Value of the Optimal Policy
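
In the same formulation, the optimal value function satisfies the Bellman optimality equation, and an optimal policy acts greedily with respect to it:

$$V^{*}(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl(R(s, a, s') + \gamma\, V^{*}(s')\bigr), \qquad \pi^{*}(s) = \arg\max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl(R(s, a, s') + \gamma\, V^{*}(s')\bigr)$$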

Value Iteration

The $t$-step lookahead value function, $V_t$, is the expected value with $t$ steps to go

Algorithm
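
A minimal sketch in Python, assuming a known model stored as nested dicts `P[s][a][s2]` (transition probabilities) and `R[s][a][s2]` (rewards); these names are chosen for illustration, not taken from the original notes:

```python
def value_iteration(S, A, P, R, gamma=0.9, n_iters=100):
    """Synchronous value iteration: V_t is computed from V_{t-1} for every state."""
    V = {s: 0.0 for s in S}  # V_0: arbitrary initial values
    for _ in range(n_iters):
        V = {s: max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in S)
                    for a in A)
             for s in S}  # one full sweep = one more step of lookahead
    # greedy policy with respect to the final value function
    pi = {s: max(A, key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                      for s2 in S))
          for s in S}
    return V, pi
```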

Asynchronous Value Iteration

You don’t need to sweep through all the states: you can update states (or state-action pairs) one at a time, in any order, as long as every state keeps being updated

Storing V[s]

Repeat forever:

Storing Q[s,a]

Repeat forever:
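
A sketch of both variants, assuming the same model `P`, `R` and discount `gamma` as above, with `V` and `Q` dictionaries initialised to 0 for every state and state-action pair; each call performs one update, and "repeat forever" means calling it again and again:

```python
import random

def async_update_V(S, A, P, R, V, gamma=0.9):
    """Update V[s] for one randomly chosen state."""
    s = random.choice(list(S))
    V[s] = max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in S)
               for a in A)

def async_update_Q(S, A, P, R, Q, gamma=0.9):
    """Update Q[(s, a)] for one randomly chosen state-action pair."""
    s, a = random.choice(list(S)), random.choice(list(A))
    Q[(s, a)] = sum(P[s][a][s2] * (R[s][a][s2] + gamma * max(Q[(s2, a2)] for a2 in A))
                    for s2 in S)
```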

Markov Decision Processes: Factored State

Represent $S = \{X_1, X_2, \ldots, X_n\}$

$R(X_1, X_2, \ldots, X_n) = \sum_i R(X_i)$

Value iteration proceeds as usual but can be done one variable at a time

Partially Observable Markov Decision Processes (POMDPs)

A POMDP is like an MDP, but some variables are not observed

Value Functions and Conditional Plans

V(b) can be represented with a piecewise linear function over the belief space
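
Concretely, the finite-horizon optimal value function can be written as a maximum over a finite set $\Gamma$ of linear functions ($\alpha$-vectors), one per conditional plan, which makes it piecewise linear and convex in the belief $b$:

$$V(b) = \max_{\alpha \in \Gamma} \sum_{s} \alpha(s)\, b(s)$$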

Monte Carlo Tree Search (MCTS)

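As a reminder of the standard MCTS loop (a sketch; the figure may have shown a different variant): repeat (1) selection: walk down the tree using a tree policy; (2) expansion: add a new leaf node; (3) simulation: run a rollout from it with a default policy; (4) backpropagation: propagate the result back up the visited path. A common tree policy is UCT, which selects the child maximizing

$$\frac{W_i}{N_i} + c\,\sqrt{\frac{\ln N}{N_i}}$$

where $W_i/N_i$ is the child's average return, $N_i$ its visit count, $N$ the parent's visit count, and $c$ an exploration constant.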

Reinforcement Learning

What should an agent do given:

Experiences

We assume there is a sequence of experiences of the form $\langle s, a, r, s' \rangle$: the agent was in state $s$, did action $a$, received reward $r$, and ended up in state $s'$.

Why is reinforcement learning hard?

Credit Assignment Problem: What actions are responsible for the reward may have occurred a long time before the reward was received

Main Approaches

Search through a space of policies (controllers)

Temporal Differences

Suppose we have a sequence of values: $$v_1, v_2, v_3, \cdots$$ And want a running estimate of the average of the first k values: $$A_k = \frac{v_1+\cdots + v_k}{k}$$
When a new value $v_k$ arrives, $k A_k = (k-1)A_{k-1} + v_k$, so
$$A_k = A_{k-1} + \frac{1}{k}\bigl(v_k - A_{k-1}\bigr)$$
Replacing $\frac{1}{k}$ with a fixed learning rate $\alpha$ gives the "TD formula":
$$A_k = A_{k-1} + \alpha\,\bigl(v_k - A_{k-1}\bigr)$$

Q-Learning

Idea: store Q[State,Action]

Pseudocode

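A minimal sketch in Python, assuming a hypothetical environment interface `env.reset()` / `env.step(a)` that returns the reward and next state (names chosen for illustration), with ε-greedy exploration:

```python
import random
from collections import defaultdict

def q_learning(env, A, alpha=0.1, gamma=0.9, epsilon=0.1, n_steps=100_000):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # unseen (s, a) pairs default to 0
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(A)
        else:
            a = max(A, key=lambda a_: Q[(s, a_)])
        r, s2 = env.step(a)  # carry out a, observe reward and next state
        # move Q[(s, a)] toward r + gamma * max_a' Q[(s', a')]
        target = r + gamma * max(Q[(s2, a_)] for a_ in A)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q
```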

Properties of Q-learning

Q-learning converges to the optimal policy, no matter what the agent does, as long as it tries each action in each state enough

Exploration strategies:

The $\epsilon$-greedy strategy: with probability $\epsilon$ choose an action at random; otherwise choose the action with the highest estimated $Q[s,a]$

Model-Based Reinforcement Learning

Model-based reinforcement learning uses the experiences more effectively: it learns a model of the transition probabilities and rewards and then plans with that model

Model-based learner

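A sketch of one way to implement this, assuming the learner keeps transition counts and average rewards and then runs asynchronous value-iteration-style updates on the learned model (class and method names are illustrative, not from the original notes):

```python
from collections import defaultdict

class ModelBasedLearner:
    """Learn P(s' | s, a) and R(s, a) from experiences, then plan with the learned model."""

    def __init__(self, A, gamma=0.9):
        self.A, self.gamma = A, gamma
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s2]
        self.r_sum = defaultdict(float)                      # summed reward for (s, a)
        self.Q = defaultdict(float)

    def observe(self, s, a, r, s2):
        """Record one experience (s, a, r, s')."""
        self.counts[(s, a)][s2] += 1
        self.r_sum[(s, a)] += r

    def plan(self, n_sweeps=10):
        """Asynchronous updates of Q using the estimated model."""
        for _ in range(n_sweeps):
            for (s, a), succ in self.counts.items():
                n = sum(succ.values())
                exp_r = self.r_sum[(s, a)] / n               # estimated R(s, a)
                exp_next = sum(c / n * max(self.Q[(s2, a2)] for a2 in self.A)
                               for s2, c in succ.items())    # estimated E[max_a' Q(s', a')]
                self.Q[(s, a)] = exp_r + self.gamma * exp_next
```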

Off/On-policy Learning

Q-learning does off-policy learning: it learns the value of the optimal policy independently of the policy the agent is actually following (as long as that policy explores enough). On-policy methods, such as SARSA below, learn the value of the policy the agent is actually executing, including its exploration steps.

SARSA

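The update rule (a sketch of the standard form; $a'$ is the action the agent actually takes in $s'$):

$$Q[s,a] \leftarrow Q[s,a] + \alpha\,\bigl(r + \gamma\, Q[s',a'] - Q[s,a]\bigr)$$

Because the target uses the action actually taken rather than the greedy one, SARSA learns the value of the policy the agent is following, exploration included.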

Q-function Approximations

Let $s = (x_1, x_2, \ldots, x_N)^{T}$

Approximating the Q-function

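Common choices (a sketch of the standard options): a linear approximation $Q_{\mathbf w}(s,a) = \sum_i w_{a,i}\, x_i$ with one weight vector per action, or a non-linear approximation such as a neural network $g(\mathbf x; \mathbf w)$ that maps the feature vector to a value for each action.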

SARSA with linear function approximation

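A sketch of the standard update for the linear case: compute the TD error $\delta = r + \gamma\, Q_{\mathbf w}(s',a') - Q_{\mathbf w}(s,a)$ and adjust the weights of the action taken,

$$w_{a,i} \leftarrow w_{a,i} + \eta\, \delta\, x_i$$

for each feature $x_i$ of $s$ and some learning rate $\eta$.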

Convergence

Linear Q-learning ($Q_{\mathbf w}(s,a) \approx \sum_i w_{a,i}\, x_i$) converges under the same conditions as Q-learning
Non-linear Q-learning ($Q_{\mathbf w}(s,a) \approx g(\mathbf x; \mathbf w)$) may diverge

Mitigating Divergence

Two tricks used in practice:

  1. Experience Replay
  2. Use two Q-functions (two networks):
    1. Q network (currently being updated)
    2. Target network (occasionally updated)

Experience Replay

Idea: store previous experiences $(s, a, r, s', a')$ in a buffer and, at each step, sample a mini-batch of them to learn from by Q-learning
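
A minimal sketch of such a buffer (hypothetical class; `q_update` stands for whatever update rule is applied to each sampled experience):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of past experiences, sampled uniformly for learning."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def add(self, experience):
        self.buffer.append(experience)        # e.g. the tuple (s, a, r, s2, a2)

    def sample(self, batch_size=32):
        k = min(batch_size, len(self.buffer))
        return random.sample(list(self.buffer), k)

# At each step: buffer.add((s, a, r, s2, a2)), then
# for (s, a, r, s2, a2) in buffer.sample():
#     q_update(s, a, r, s2)                   # Q-learning update on the replayed experience
```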

Target Network

Idea: use a separate target network that is updated only periodically

Deep Q-Network

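A Deep Q-Network combines the pieces above (a sketch of the standard setup, not necessarily the exact architecture in the original figure): a neural network $Q_{\mathbf w}(s, a)$ is trained on mini-batches drawn from the replay buffer to minimise the loss

$$L(\mathbf w) = \mathbb{E}\Bigl[\bigl(r + \gamma \max_{a'} Q_{\mathbf w^{-}}(s', a') - Q_{\mathbf w}(s, a)\bigr)^{2}\Bigr]$$

where $\mathbf w^{-}$ are the parameters of the periodically updated target network.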

Bayesian Reinforcement Learning