Probability and Bayesian Networks

Uncertainty

Why is uncertainty important?

Agents and humans don’t know everything
- but need to make decisions anyways!
Decisions are made in the absence of information
The best an agent can do:
Know how uncertain it is, and act accordingly

Probability

Frequentist vs. Bayesian

Frequentist view:

Probability of heads = # of heads / # of flips
probability of heads this time = probability of heads (history)
Uncertainty is ontological:
- pertaining to the world

Bayesian view:

Probability of heads this time = agent’s belief about flip
Belief of agent A:
- based on previous experience of agent A
Uncertainty is epistemological:
- pertaining to knowledge

Probability Measure

If $X$ is a random variable (feature, attribute)

it can take on values $x$, where $x \in \mathit{Domain} (X)$
assume $x$ is discrete
$P (x)$ is the probability that $X = x$
Joint probability $P (x, y)$ is the probability that $X = x$ and $Y = y$ at the same time

Axioms of Probability

Axioms are things we have to assume about probability

$P (X) \geq 0$
$\sum_{x} P (X = x) = 1.0$
$P (a \lor b) = P (a) + P (b)$ if $a$ and $b$ are mutually exclusive
- can’t both be true at the same time
- $P (\mathit{win} \lor \mathit{lose}) = P (\mathit{win}) + P (\mathit{lose}) = 1.0$

Some notes:
probability between 0-1 is purely convention
$P (a) = 0$ means you think a is definitely false
$P (a) = 1$ means you think a is definitely true
$0 < P (a) < 1$ means you hold a partial belief in the truth of $a$
- It does not mean that $a$ is true to some degree, just that you are ignorant of its truth value
Probability = measure of ignorance

Independence

Describe a system with $n$ binary features: $2^{n} - 1$ probabilities
Pasted image 20250418111454.png

Use independence to reduce number of probabilities
- e.g. radially symmetric dartboard, P(hit a sector)
  - $P (\mathit{sector}) = P (r, \theta)$ where $r = 1, \dots, 4$ and $\theta = 1, \dots, 8$
  - 32 sectors in total - need to give 31 numbers
- Assume radial independence: $P (r, \theta) = P (r) P (\theta)$
  - only need 7 + 3 = 10 numbers
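
A minimal sketch of the dartboard count, assuming made-up marginals for $r$ and $\theta$ (the numbers are illustrative, not from the notes). It rebuilds the 32-entry joint from the two marginals and counts the free parameters in each representation.

```python
from itertools import product

# Hypothetical marginals (illustrative numbers only).
p_r = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}      # 4 rings  -> 3 free numbers
p_theta = {t: 1 / 8 for t in range(1, 9)}   # 8 wedges -> 7 free numbers

# Under radial independence, P(r, theta) = P(r) * P(theta).
joint = {(r, t): p_r[r] * p_theta[t] for r, t in product(p_r, p_theta)}

# One number is always implied by the sum-to-one constraint.
free_params_full = len(joint) - 1
free_params_indep = (len(p_r) - 1) + (len(p_theta) - 1)
total = sum(joint.values())
```

The full table needs `free_params_full` numbers, while the factored form needs only `free_params_indep`; the saving grows quickly as more independent features are added.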

Conditional Probability

If $X$ and $Y$ are random variables, then

$P (x | y)$ is the probability that $X = x$ given that $Y = y$
Incorporate independence:
$P (\mathit{flies} | \mathit{is\_bird}, \mathit{has\_feathers}) = P (\mathit{flies} | \mathit{is\_bird})$
- assuming all birds have feathers

Product Rule (Chain Rule):
$P (\mathit{flies}, \mathit{is\_bird}) = P (\mathit{flies} | \mathit{is\_bird}) P (\mathit{is\_bird}) = P (\mathit{is\_bird} | \mathit{flies}) P (\mathit{flies})$
Leads to: Bayes’ Rule

$$P (\mathit{is\_bird} | \mathit{flies}) = \frac{P (\mathit{flies} | \mathit{is\_bird}) P (\mathit{is\_bird})}{P (\mathit{flies})}$$
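
A small numeric check of the product rule and Bayes' rule, assuming a hypothetical joint distribution over `flies` and `is_bird` (the four numbers are invented for illustration). The conditional computed directly from the joint should agree with the one obtained through Bayes' rule.

```python
# Hypothetical joint P(flies, is_bird); keys are (flies, is_bird).
joint = {
    (True, True): 0.18,    # a bird that flies
    (True, False): 0.02,   # something else that flies
    (False, True): 0.02,   # a bird that does not fly
    (False, False): 0.78,
}

# Marginals obtained by summing the joint.
p_flies = sum(p for (f, _), p in joint.items() if f)
p_bird = sum(p for (_, b), p in joint.items() if b)

# Conditionals straight from the definition P(x | y) = P(x, y) / P(y).
p_flies_given_bird = joint[(True, True)] / p_bird
p_bird_given_flies_direct = joint[(True, True)] / p_flies

# Bayes' rule: P(is_bird | flies) = P(flies | is_bird) P(is_bird) / P(flies).
p_bird_given_flies = p_flies_given_bird * p_bird / p_flies
```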

Sum Rule

We know

$\sum_{x} P (X = x) = 1.0$ and therefore that $\sum_{x} P (X = x | Y) = 1.0$
Multiplying both sides by $P (Y)$ gives the Sum Rule:

$$\sum_{x} P (X = x, Y) = P (Y)$$

We call $P (Y)$ the marginal distribution over $Y$

Conditional Independence

$X$ and $Y$ are independent iff

$P (X) = P (X | Y)$
$P (Y) = P (Y | X)$
$P (X, Y) = P (X) P (Y)$
So learning $Y$ doesn’t influence beliefs about $X$
$X$ and $Y$ are conditionally independent given $Z$ iff
$P (X | Z) = P (X | Y, Z)$
$P (Y | Z) = P (Y | X, Z)$
$P (X, Y | Z) = P (X | Z) P (Y | Z)$
So learning $Y$ doesn’t influence beliefs about $X$ if you already know $Z$
- does not mean $X$ and $Y$ are independent
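
To illustrate that conditional independence does not imply independence, here is a sketch with a hypothetical common cause $Z$ and two effects $X$ and $Y$ (all CPT numbers are assumptions). The joint is built so that $X$ and $Y$ are conditionally independent given $Z$, and the code then compares both sides of each definition.

```python
from itertools import product

# Hypothetical CPTs: Z is a common cause of X and Y.
p_z = {True: 0.5, False: 0.5}
p_x_given_z = {True: 0.9, False: 0.1}   # P(X = true | Z = z)
p_y_given_z = {True: 0.8, False: 0.2}   # P(Y = true | Z = z)

def bern(p_true, value):
    """Probability of a Boolean value given P(true)."""
    return p_true if value else 1.0 - p_true

# Joint P(x, y, z) = P(z) P(x | z) P(y | z).
joint = {
    (x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
    for x, y, z in product([True, False], repeat=3)
}

def prob(**fixed):
    """Marginal probability of a partial assignment, e.g. prob(x=True, z=False)."""
    return sum(
        p for (x, y, z), p in joint.items()
        if all(dict(x=x, y=y, z=z)[name] == val for name, val in fixed.items())
    )

# Conditional independence: P(x, y | z) versus P(x | z) P(y | z).
lhs_cond = prob(x=True, y=True, z=True) / prob(z=True)
rhs_cond = (prob(x=True, z=True) / prob(z=True)) * (prob(y=True, z=True) / prob(z=True))

# Marginal (in)dependence: P(x, y) versus P(x) P(y).
lhs_marg = prob(x=True, y=True)
rhs_marg = prob(x=True) * prob(y=True)
```

With these numbers the conditional quantities match, while `lhs_marg` and `rhs_marg` differ: learning $Y$ still changes beliefs about $X$ when $Z$ is unknown.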

Expected Values

Expected value of a function on $X$ , $V (X)$ :

$$E (V) = \sum_{x \in \mathit{Dom} (X)} P (x) V (x)$$

where $P (x)$ is the probability of $X = x$
This is useful in decision making, where $V (X)$ is the utility of situation $X$
Bayesian decision making then picks the decision with the highest expected utility:

$$E (V (\mathit{decision})) = \sum_{\mathit{outcome}} P (\mathit{outcome} | \mathit{decision}) V (\mathit{outcome})$$
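
A sketch of the expected-utility calculation, assuming a hypothetical umbrella decision (outcome names, probabilities and utilities are all invented for illustration).

```python
# Hypothetical P(outcome | decision).
p_outcome_given = {
    "take_umbrella": {"dry_burdened": 1.0},
    "leave_umbrella": {"dry_unburdened": 0.7, "wet": 0.3},
}

# Hypothetical utilities V(outcome).
utility = {"dry_unburdened": 100.0, "dry_burdened": 80.0, "wet": 0.0}

def expected_value(decision):
    """E(V(decision)) = sum over outcomes of P(outcome | decision) * V(outcome)."""
    return sum(p * utility[o] for o, p in p_outcome_given[decision].items())

# Pick the decision that maximizes expected utility.
best = max(p_outcome_given, key=expected_value)
```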

Value of Independence

Complete independence reduces both representation and inference from $O (2^{n})$ to $O (n)$
Unfortunately, complete mutual independence is rare
Fortunately, most domains do exhibit a fair amount of conditional independence
Bayesian Networks or Belief Networks encode this information

Bayesian Networks

Bayesian Networks or Belief Networks

Directed Acyclic Graph
Encodes independencies in a graphical format
Edges give $P (X_{i} | \mathit{parents} (X_{i}))$

Example

Pasted image 20250418114123.png

Correlation and Causality

Directed links in Bayes’ net $\approx$ causal

However, not always the case
In a Bayes net, it doesn’t matter
But, some structures will be easier to specify

Example

Pasted image 20250418115151.png
Pasted image 20250418115214.png
Pasted image 20250418115228.png
Pasted image 20250418115236.png

A Bayesian Network or BN over variables $\{X_{1}, X_{2}, \dots, X_{N}\}$ consists of:

a DAG whose nodes are the variables
a set of Conditional Probability Tables (CPTs) giving $P (X_{i} | \mathit{Parents} (X_{i}))$ for each $X_{i}$
Example probability tables for the Coffee Bayes Net:

Another Example

Pasted image 20250418120727.png

Semantics of a Bayes’ Net

The structure of the BN means that:

every $X_{i}$ is conditionally independent of all its non-descendants given its parents:

$$P (X_{i} | S, \mathit{Parents} (X_{i})) = P (X_{i} | \mathit{Parents} (X_{i}))$$

for any subset $S \subseteq \mathit{NonDescendants} (X_{i})$

The BN defines a factorization of the joint probability distribution

The joint distribution is formed by multiplying the conditional probability tables together

$$P (X_{1}, X_{2}, \dots, X_{n}) = \prod_{i} P (X_{i} | \mathit{parents} (X_{i}))$$
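
The factorization can be sketched on a tiny hypothetical network $C \rightarrow B \leftarrow M$ with Boolean variables. The structure and every CPT entry below are assumptions made for illustration, not the tables from the figures.

```python
from itertools import product

# Hypothetical structure: B has parents M and C; M and C are roots.
parents = {"C": (), "M": (), "B": ("M", "C")}

# cpt[var] maps a tuple of parent values to P(var = True | parents).
cpt = {
    "C": {(): 0.1},
    "M": {(): 0.05},
    "B": {(True, True): 0.5, (True, False): 0.5,
          (False, True): 0.9, (False, False): 0.02},
}

def joint_probability(assignment):
    """P(x_1, ..., x_n) as the product of P(x_i | parents(x_i))."""
    p = 1.0
    for var, pars in parents.items():
        p_true = cpt[var][tuple(assignment[q] for q in pars)]
        p *= p_true if assignment[var] else 1.0 - p_true
    return p

# Summing the product over all assignments should give 1.
total = sum(
    joint_probability(dict(zip(parents, values)))
    for values in product([True, False], repeat=len(parents))
)
example = joint_probability({"C": True, "M": False, "B": True})
```

Only 4 + 1 + 1 = 6 numbers are stored here, compared with 7 for an unstructured joint over three Boolean variables.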

Constructing Belief Networks

To represent a domain in a belief network, you need to consider:

What are the relevant variables?
- What will you observe?
  - this is the evidence
- What would you like to find out?
  - this is the query
- What other features make the model simpler?
  - these are the other variables
What values should these variables take?
What is the relationship between them?
- this should be expressed in terms of local influence
How does the value of each variable depend on its parents?
- this is expressed in terms of the conditional probabilities

Three Basic Bayesian Networks

Pasted image 20250418134600.png

Database and Test B independent if Report is observed
Test B and Test A independent if COVID is observed
Malfunction and COVID are independent if Test B is not observed

Updating Belief: Bayes’ Rule

Agent has a prior belief in a hypothesis $h$, $P (h)$

Agent observes some evidence $e$ that has a likelihood given the hypothesis: $P (e | h)$
The agent’s posterior belief about $h$ after observing $e$, $P (h | e)$, is given by Bayes’ Rule:

$$P (h | e) = \frac{P (e | h) P (h)}{P (e)} = \frac{P (e | h) P (h)}{\sum_{h'} P (e | h') P (h')}$$

Useful when we have causal knowledge and want to do evidential reasoning
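
A minimal sketch of the belief update over a set of competing hypotheses. The function name `posterior` and the test numbers are hypothetical; the point is that $P (e)$ never has to be supplied separately because it is the sum of the unnormalized terms.

```python
def posterior(prior, likelihood):
    """Return P(h | e) for every hypothesis h, given P(h) and P(e | h)."""
    unnormalized = {h: likelihood[h] * prior[h] for h in prior}
    p_e = sum(unnormalized.values())   # P(e) = sum over h of P(e | h) P(h)
    return {h: u / p_e for h, u in unnormalized.items()}

# Hypothetical numbers: a rare condition and an imperfect test.
prior = {"covid": 0.01, "no_covid": 0.99}
likelihood = {"covid": 0.95, "no_covid": 0.05}   # P(positive test | h)
post = posterior(prior, likelihood)
```

Even with a fairly accurate test, the posterior for the rare hypothesis stays well below the likelihood, because the prior is small.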

Probabilistic Inference

Simple Forward Inference (Chain)

Pasted image 20250418143030.png

Computing marginal requires simple forward propagation of probabilities

$P (B) = \sum_{m, c} P (M = m, C = c, B)$
- marginalization - sum rule
$P (B) = \sum_{m, c} P (B | m, c) P (m | c) P (c)$
- chain rule
$P (B) = \sum_{m, c} P (B | m, c) P (m) P (c)$
- independence
$P (B) = \sum_{m} P (m) \sum_{c} P (c) P (B | m, c)$
- distribution of product over sum
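
The last two lines of the derivation can be checked numerically. The sketch below assumes hypothetical CPTs for $M$, $C$ and $B$ and computes $P (B = \mathit{true})$ once as a flat sum over the joint and once with the sums pushed inward.

```python
from itertools import product

# Hypothetical priors and CPT (illustrative numbers only).
p_m = {True: 0.05, False: 0.95}
p_c = {True: 0.1, False: 0.9}
p_b_true = {  # P(B = true | m, c), keyed by (m, c)
    (True, True): 0.5, (True, False): 0.5,
    (False, True): 0.9, (False, False): 0.02,
}

# Flat form: sum over (m, c) of P(B | m, c) P(m) P(c).
p_b_flat = sum(
    p_b_true[(m, c)] * p_m[m] * p_c[c]
    for m, c in product([True, False], repeat=2)
)

# Nested form: sum over m of P(m) * (sum over c of P(c) P(B | m, c)).
p_b_nested = sum(
    p_m[m] * sum(p_c[c] * p_b_true[(m, c)] for c in (True, False))
    for m in (True, False)
)
```

Both forms give the same number; the nested form simply reuses the inner sum instead of re-multiplying by $P (m)$ for every value of $c$.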

Same idea when evidence $\mathit{COVID} = \mathit{true}$ is “upstream”

$P (R | c) = \sum_{m, b} P (R, b, m | c)$
- marginalization
$P (R | c) = \sum_{m, b} P (R | b, m, c) P (b | m, c) P (m | c)$
- chain rule
$P (R | c) = \sum_{m, b} P (R | b) P (b | m, c) P (m)$
- independence and conditional independence

With multiple parents the evidence is “pooled”

Pasted image 20250418143958.png

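$$\begin{aligned}
P (\mathit{Fev}) & = \sum_{\mathit{Flu}, M, \mathit{TS}, \mathit{ET}} P (\mathit{Fev}, \mathit{Flu}, M, \mathit{TS}, \mathit{ET}) \\
& = \sum_{\mathit{Flu}, M} P (\mathit{Fev} | M, \mathit{Flu}) \left[\sum_{\mathit{TS}} P (\mathit{Flu} | \mathit{TS}) P (\mathit{TS})\right] \left[\sum_{\mathit{ET}} P (M | \mathit{ET}) P (\mathit{ET})\right]
\end{aligned}$$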

Also works with “upstream” evidence

$$\begin{aligned}
P (\mathit{Fev} | \mathit{ts}, \overline{m}) & = \sum_{\mathit{Flu}} P (\mathit{Fev}, \mathit{Flu} | \overline{m}, \mathit{ts}) \\
& = \sum_{\mathit{Flu}} P (\mathit{Fev} | \mathit{Flu}, \overline{m}, \mathit{ts}) P (\mathit{Flu} | \overline{m}, \mathit{ts}) \\
& = \sum_{\mathit{Flu}} P (\mathit{Fev} | \mathit{Flu}, \overline{m}) P (\mathit{Flu} | \mathit{ts})
\end{aligned}$$

Simple Backward Inference

When evidence is downstream of query, then we must reason “backwards”.

This requires Bayes’ rule

$$\begin{aligned}
P (B | r) & = P (r | B) P (B) / P (r) \propto P (r, B) && \text{(proportional to the joint probability)} \\
P (r, B) & = P (r | B) P (B) && \text{(chain rule)} \\
& = \sum_{m, c} P (m, c, B, r) && \text{(marginalization)} \\
& = \sum_{m, c} P (m) P (c | m) P (B | m, c) P (r | B, m, c) && \text{(chain rule)} \\
& = \sum_{m, c} P (m) P (c) P (B | m, c) P (r | B) && \text{(independence and conditional independence)}
\end{aligned}$$

Normalizing constant is $\frac{1}{P (r)}$, but this can be computed as

$$P (r) = \sum_{b} P (r, b)$$
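
A sketch of the backward computation on a hypothetical chain $M, C \rightarrow B \rightarrow R$ (all CPT entries are assumptions). It computes the unnormalized $P (r, B = b)$ for each $b$ and then normalizes, so $P (r)$ is never looked up directly.

```python
from itertools import product

# Hypothetical CPTs (illustrative numbers only).
p_m = {True: 0.05, False: 0.95}
p_c = {True: 0.1, False: 0.9}
p_b_true = {  # P(B = true | m, c), keyed by (m, c)
    (True, True): 0.5, (True, False): 0.5,
    (False, True): 0.9, (False, False): 0.02,
}
p_r_true_given_b = {True: 0.95, False: 0.01}   # P(R = true | B = b)

def p_b(b):
    """P(B = b) by summing out M and C."""
    return sum(
        p_m[m] * p_c[c] * (p_b_true[(m, c)] if b else 1.0 - p_b_true[(m, c)])
        for m, c in product([True, False], repeat=2)
    )

# Unnormalized posterior: P(r, B = b) = P(r | B = b) P(B = b).
unnormalized = {b: p_r_true_given_b[b] * p_b(b) for b in (True, False)}

# Normalizing constant: P(r) = sum over b of P(r, b).
p_r = sum(unnormalized.values())
posterior_b = {b: u / p_r for b, u in unnormalized.items()}
```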

Variable Elimination

More general algorithm:

applies sum-out rule repeatedly
distributes sums

Factors

a factor is a representation of a function from a tuple of random variables into a number

We will write factor $f$ on variables $X_{1}, \dots, X_{j}$ as $f (X_{1}, \dots, X_{j})$
We can assign some or all of the variables of a factor
- this is restricting a factor:
  - $f (X_{1} = v_{1}, X_{2}, \dots, X_{j})$, where $v_{1} \in \mathit{dom} (X_{1})$, is a factor on $X_{2}, \dots, X_{j}$
  - $f (X_{1} = v_{1}, X_{2} = v_{2}, \dots, X_{j} = v_{j})$ is a number that is the value of $f$ when each $X_{i}$ has value $v_{i}$

The former is also written as $f (X_{1}, X_{2}, \dots, X_{j})_{X_{1} = v_{1}}$, etc.

Multiplying Factors

The product of factor $f_{1} (X, Y)$ and $f_{2} (Y, Z)$ , where $Y$ are the variables in common, is the factor $(f_{1} \times f_{2}) (X, Y, Z)$ defined by:

$$(f_{1} \times f_{2}) (X, Y, Z) = f_{1} (X, Y) f_{2} (Y, Z)$$

Summing out variables

We can sum out a variable, say $X_{1}$ with domain $\{v_{1}, \dots, v_{k}\}$, from factor $f (X_{1}, \dots, X_{j})$, resulting in a factor on $X_{2}, \dots, X_{j}$ defined by:

$$\sum_{X_{1}} f (X_{1}, X_{2}, \dots, X_{j}) = f (X_{1} = v_{1}, \dots, X_{j}) + \dots + f (X_{1} = v_{k}, \dots, X_{j})$$
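
The three factor operations (restricting, multiplying, summing out) can be sketched with a plain table representation. The representation chosen here, a pair of a variable tuple and a dictionary of rows, and the function names are assumptions for illustration; the factor values are invented.

```python
# A factor is a pair (variables, table): `variables` is a tuple of names and
# `table` maps each tuple of values (in the same order) to a number.

def restrict(factor, var, value):
    """Fix var = value; the result is a factor on the remaining variables."""
    variables, table = factor
    i = variables.index(var)
    new_table = {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value}
    return variables[:i] + variables[i + 1:], new_table

def multiply(f1, f2):
    """Pointwise product; rows combine when they agree on shared variables."""
    (vars1, t1), (vars2, t2) = f1, f2
    new_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for k1, p1 in t1.items():
        a1 = dict(zip(vars1, k1))
        for k2, p2 in t2.items():
            a2 = dict(zip(vars2, k2))
            if all(a1.get(v, a2[v]) == a2[v] for v in vars2):
                merged = {**a1, **a2}
                table[tuple(merged[v] for v in new_vars)] = p1 * p2
    return new_vars, table

def sum_out(factor, var):
    """Add up the rows that differ only in the value of var."""
    variables, table = factor
    i = variables.index(var)
    new_table = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return variables[:i] + variables[i + 1:], new_table

# Hypothetical factors f1(X, Y) and f2(Y, Z) over Boolean variables.
f1 = (("X", "Y"), {(True, True): 0.1, (True, False): 0.9,
                   (False, True): 0.2, (False, False): 0.8})
f2 = (("Y", "Z"), {(True, True): 0.3, (True, False): 0.7,
                   (False, True): 0.6, (False, False): 0.4})

f12 = multiply(f1, f2)          # factor on (X, Y, Z)
f_xz = sum_out(f12, "Y")        # factor on (X, Z)
f_y = restrict(f1, "X", True)   # factor on (Y,)
```

Tables stored this way grow exponentially with the number of variables in a factor, which is why the size of intermediate factors matters for inference cost.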

Evidence

If we want to compute the posterior probability of $Z$ given evidence $Y_{1} = v_{1} \land \dots \land Y_{j} = v_{j}$ :

$$\begin{aligned}
P (Z | Y_{1} = v_{1}, \dots, Y_{j} = v_{j}) & = \frac{P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j})}{P (Y_{1} = v_{1}, \dots, Y_{j} = v_{j})} \\
& = \frac{P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j})}{\sum_{Z} P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j})}
\end{aligned}$$

The computation reduces to computing the joint probability $P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j})$ and normalizing at the end

We can also restrict the query variable
- e.g. compute: $P (Z = z | Y_{1} = v_{1}, \dots, Y_{j} = v_{j})$

Probability of a conjunction

Suppose the variables of the belief network are $X_{1}, \dots, X_{n}$

To compute $P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j})$ , we sum out the variables other than query $Z$ and evidence $Y$
- $\{Z_{1}, \dots, Z_{k}\} = \{X_{1}, \dots, X_{n}\} - \{Z\} - \{Y_{1}, \dots, Y_{j}\}$
We order the $Z_{i}$ into an elimination ordering $Z_{1}, \dots, Z_{k}$

$$\begin{aligned}
P (Z, Y_{1} = v_{1}, \dots, Y_{j} = v_{j}) & = \sum_{Z_{k}} \dots \sum_{Z_{1}} P (X_{1}, \dots, X_{n})_{Y_{1} = v_{1}, \dots, Y_{j} = v_{j}} \\
& = \sum_{Z_{k}} \dots \sum_{Z_{1}} \prod_{i = 1}^{n} P (X_{i} | \mathit{parents} (X_{i}))_{Y_{1} = v_{1}, \dots, Y_{j} = v_{j}}
\end{aligned}$$

Computing sums of products

Computation in belief networks reduces to computing the sums of products

How can we compute $a b + a c$ efficiently?
- Distribute out the $a$ giving $a (b + c)$
How can we compute $\sum_{Z_{1}} \prod_{i = 1}^{n} P (X_{i} | \mathit{parents} (X_{i}))$ efficiently?
- Distribute out those factors that don’t involve $Z_{1}$

Variable Elimination Algorithm

To compute $P (Z | Y_{1} = v_{1} \land \dots \land Y_{j} = v_{j})$ :

Construct a factor for each conditional probability
Restrict the observed variables to their observed values
Sum out each of the other variables according to some elimination ordering:
- for each $Z_{i}$ in order starting from $i = 1$ :
  - collect all factors that contain $Z_{i}$
  - multiply together and sum out $Z_{i}$
  - add resulting new factor back to the pool
Multiply the remaining factors
Normalize by dividing the resulting factor $f (z)$ by $\sum_{Z} f (Z)$
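
The steps above can be sketched end to end. This is a minimal illustration, not an optimized implementation: factors are assumed to be `(variables, table)` pairs, the elimination ordering is supplied by the caller, and the network $M, C \rightarrow B \rightarrow R$ with its CPT numbers is hypothetical.

```python
def restrict(factor, var, value):
    variables, table = factor
    i = variables.index(var)
    return (variables[:i] + variables[i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in table.items() if k[i] == value})

def multiply(f1, f2):
    (vars1, t1), (vars2, t2) = f1, f2
    new_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for k1, p1 in t1.items():
        a1 = dict(zip(vars1, k1))
        for k2, p2 in t2.items():
            a2 = dict(zip(vars2, k2))
            if all(a1.get(v, a2[v]) == a2[v] for v in vars2):
                merged = {**a1, **a2}
                table[tuple(merged[v] for v in new_vars)] = p1 * p2
    return new_vars, table

def sum_out(factor, var):
    variables, table = factor
    i = variables.index(var)
    new_table = {}
    for k, p in table.items():
        key = k[:i] + k[i + 1:]
        new_table[key] = new_table.get(key, 0.0) + p
    return variables[:i] + variables[i + 1:], new_table

def variable_elimination(factors, query, evidence, ordering):
    """Return P(query | evidence) as a dict from query value to probability."""
    # Restrict the observed variables to their observed values.
    pool = []
    for f in factors:
        for var, value in evidence.items():
            if var in f[0]:
                f = restrict(f, var, value)
        pool.append(f)
    # Sum out each remaining variable in the given order.
    for z in ordering:
        containing = [f for f in pool if z in f[0]]
        if not containing:
            continue
        pool = [f for f in pool if z not in f[0]]
        prod = containing[0]
        for f in containing[1:]:
            prod = multiply(prod, f)
        pool.append(sum_out(prod, z))
    # Multiply the remaining factors, then normalize.
    result = pool[0]
    for f in pool[1:]:
        result = multiply(result, f)
    assert result[0] == (query,)
    total = sum(result[1].values())
    return {key[0]: p / total for key, p in result[1].items()}

# Hypothetical CPTs as factors; p_b holds P(B = true | m, c) keyed by (m, c).
T, F = True, False
p_b = {(T, T): 0.5, (T, F): 0.5, (F, T): 0.9, (F, F): 0.02}
factors = [
    (("M",), {(T,): 0.05, (F,): 0.95}),
    (("C",), {(T,): 0.1, (F,): 0.9}),
    (("B", "M", "C"), {(b, m, c): (p if b else 1 - p)
                       for (m, c), p in p_b.items() for b in (T, F)}),
    (("R", "B"), {(T, T): 0.95, (F, T): 0.05, (T, F): 0.01, (F, F): 0.99}),
]
posterior = variable_elimination(factors, "B", {"R": T}, ["M", "C"])
```

Here `posterior` holds $P (B | R = \mathit{true})$. A different ordering gives the same answer but may create larger intermediate factors.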

Summing out a Variable

To sum out a variable $Z_{j}$ from a product $f_{1}, \dots, f_{k}$ of factors:

Partition the factors into
- those that don’t contain $Z_{j}$ , say $f_{1}, \dots, f_{i}$ ,
- those that contain $Z_{j}$ , say $f_{i + 1}, \dots, f_{k}$

We know:

$$\sum_{Z_{j}} f_{1} \times \dots \times f_{k} = f_{1} \times \dots \times f_{i} \times \left(\sum_{Z_{j}} f_{i + 1} \times \dots \times f_{k}\right)$$

Explicitly construct a representation of the rightmost factor $(\sum_{Z_{j}} f_{i + 1} \times \dots \times f_{k})$
Replace the factors $f_{i + 1}, \dots, f_{k}$ by the new factor

Notes on Variable Elimination

Complexity is linear in the number of variables, and exponential in the number of variables in the largest factor
When we create new factors: sometimes this blows up
Depends on the elimination ordering
For polytrees:
- work outside in
For general BNs this can be hard
simply finding the optimal elimination ordering is NP-hard for general BNs
inference in general is NP-hard

Variable Ordering

Polytrees

Pasted image 20250418165015.png

eliminate singly-connected nodes $(D, A, C, X_{1}, \dots, X_{k})$ first
Then no factor is ever larger than original CPTs
- if you eliminate $B$ first, a large factor is created that includes $A, B, C, X_{1}, \dots, X_{k}$

Relevance

Pasted image 20250418184906.png
Certain variables have no impact

In ABC network above, computing $P (A)$ does not require summing over $B$ and $C$

Can restrict attention to relevant variables:
Given query $Q$ and evidence $E$ , complete approximation is:
- $Q$ is relevant
- if any node is relevant, its parents are relevant
- if $e \in E$ is a descendant of a relevant variable, then $e$ is relevant

Irrelevant variable: a node that is not an ancestor of a query or evidence variable
This will only remove irrelevant variables, but may not remove them all
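
A sketch of the simpler ancestor-based rule stated above: keep the query and evidence variables and all of their ancestors, and prune everything else. The function name and the chain $A \rightarrow B \rightarrow C$ used as input are hypothetical.

```python
def relevant_variables(parents, query, evidence):
    """Return the query and evidence variables together with all their ancestors."""
    relevant = set()
    stack = [query, *evidence]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents[node])   # parents of a kept node are kept too
    return relevant

# Hypothetical chain A -> B -> C, given as a map from node to its parents.
chain = {"A": [], "B": ["A"], "C": ["B"]}
only_a = relevant_variables(chain, "A", [])
a_and_b = relevant_variables(chain, "B", [])
with_evidence = relevant_variables(chain, "A", ["C"])
```

For the query on `A` with no evidence, `B` and `C` are pruned before any factor is built; once `C` is observed, the whole chain is kept.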

Probability and Time

Pasted image 20250418214157.png

A node repeats over time
Explicit encoding of time
chain has length = amount of time you want to model
event-driven times or clock-driven times
- e.g. Markov chain

Markov Assumption

$P (S_{t + 1} | S_{1}, \dots, S_{t}) = P (S_{t + 1} | S_{t})$
This distribution gives the dynamics of the Markov Chain
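
Under the Markov assumption, the state distribution can be pushed forward one step at a time using only the dynamics. The two-state weather chain below is a hypothetical example with invented transition probabilities.

```python
def step(belief, transition):
    """P(S_{t+1} = j) = sum over i of P(S_t = i) * P(S_{t+1} = j | S_t = i)."""
    return {j: sum(belief[i] * transition[i][j] for i in belief) for j in belief}

# Hypothetical dynamics: transition[i][j] = P(S_{t+1} = j | S_t = i).
transition = {
    "sun": {"sun": 0.9, "rain": 0.1},
    "rain": {"sun": 0.5, "rain": 0.5},
}

belief = {"sun": 1.0, "rain": 0.0}   # P(S_1)
for _ in range(2):                   # advance to P(S_3)
    belief = step(belief, transition)
```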

Hidden Markov Models (HMMs)

Pasted image 20250418214404.png
Add: observations $O_{t}$

always observed, so the node is square
Observation function $P (O_{t} | s_{t})$
Given a sequence of observations $O_{1}, \dots, O_{t}$, can estimate by filtering:
- $P (S_{t} | O_{1}, \dots, O_{t})$
Or smoothing, for $k < t$
- $P (S_{k} | O_{1}, \dots, O_{t})$
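
Filtering can be sketched as a repeated predict-then-update loop. The function name `filtering`, the convention that `prior` is $P (S_{1})$, and the umbrella-style numbers are all assumptions for illustration.

```python
def filtering(prior, observations, transition, emission):
    """Return P(S_t | o_1, ..., o_t) for the last time step t."""
    belief = dict(prior)   # P(S_1)
    for t, obs in enumerate(observations):
        if t > 0:
            # Predict: apply the dynamics P(S_{t+1} | S_t).
            belief = {j: sum(belief[i] * transition[i][j] for i in belief)
                      for j in belief}
        # Update: weight by the observation function P(o | s) and normalize.
        unnormalized = {s: emission[s][obs] * p for s, p in belief.items()}
        z = sum(unnormalized.values())
        belief = {s: u / z for s, u in unnormalized.items()}
    return belief

# Hypothetical model: hidden weather, observed umbrella.
transition = {"rain": {"rain": 0.7, "sun": 0.3},
              "sun": {"rain": 0.3, "sun": 0.7}}
emission = {"rain": {"umbrella": 0.9, "none": 0.1},
            "sun": {"umbrella": 0.2, "none": 0.8}}
prior = {"rain": 0.5, "sun": 0.5}

belief = filtering(prior, ["umbrella", "umbrella"], transition, emission)
```

Each step costs time quadratic in the number of states and does not depend on how long the observation sequence already is, which is what makes filtering practical online.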

Speech Recognition

Pasted image 20250418214659.png

Observations: audio features
States: phonemes
Dynamics: models
- e.g. co-articulation
HMMs: words
Can build hierarchical models (e.g. sentences)

Dynamic Bayesian Networks (DBNs)

In general, any Bayesian network can repeat over time: DBN

Many examples can be solved with variable elimination
may become too complex with enough variables
event-driven times or clock-driven times

Stochastic Simulation

Idea: probabilities $⟺$ samples

Get probabilities from samples:
If we could sample from a variable’s (posterior) probability, we could estimate its (posterior) probability

Generating Samples from a distribution

Pasted image 20250418222206.png
For a variable $X$ with a discrete domain or a one-dimensional real domain:

Totally order the values of the domain of $X$
Generate the cumulative probability distribution:
- $f (x) = P (X \leq x)$
Select a value $y$ uniformly in the range $[0, 1]$
Select $x$ such that $f (x) = y$ (for a discrete domain, the smallest $x$ with $f (x) \geq y$)
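
For a discrete domain, the procedure above can be sketched with a cumulative table and a binary search. The function name `sample`, the example distribution, and the fixed seed are assumptions for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample(distribution, rng=random):
    """Draw one value from a discrete distribution given as {value: probability}."""
    values = sorted(distribution)                             # total order on the domain
    cdf = list(accumulate(distribution[v] for v in values))   # f(x) = P(X <= x)
    y = rng.random()                                          # uniform in [0, 1)
    index = bisect_left(cdf, y)                               # smallest x with f(x) >= y
    return values[min(index, len(values) - 1)]                # guard against rounding in cdf

# Hypothetical distribution; a seeded generator keeps the run reproducible.
dist = {"a": 0.2, "b": 0.5, "c": 0.3}
rng = random.Random(0)
draws = [sample(dist, rng) for _ in range(10000)]
freq_a = draws.count("a") / len(draws)
```

With enough draws, `freq_a` should be close to the stated probability of `"a"`, which is the sense in which samples and probabilities are interchangeable.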

Forward Sampling in a Belief Network

Sample the variables one at a time;
Sample parents of $X$ before you sample $X$
- Given values for the parents of $X$ , sample from the probability of $X$ given its parents
for samples $s_{i}, i = 1, \dots, N$:

$$P (X = x_{i}) \propto \sum_{s_{i}} \delta (x_{i}) = N_{X = x_{i}}$$

where $\delta (x_{i}) = 1$ if $X = x_{i}$ in $s_{i}$ and 0 otherwise
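
A sketch of forward sampling on a hypothetical network $C \rightarrow B \leftarrow M$ with Boolean variables (structure, CPT numbers, seed and sample count are all assumptions). Each sample assigns parents before children, and the estimate is the fraction of samples in which the variable is true.

```python
import random

# Hypothetical structure, listed parents-first so dict order is a valid sampling order.
parents = {"C": (), "M": (), "B": ("M", "C")}

# cpt[var] maps a tuple of parent values to P(var = True | parents).
cpt = {
    "C": {(): 0.1},
    "M": {(): 0.05},
    "B": {(True, True): 0.5, (True, False): 0.5,
          (False, True): 0.9, (False, False): 0.02},
}

def forward_sample(rng):
    """Sample every variable once, parents before children."""
    values = {}
    for var, pars in parents.items():
        p_true = cpt[var][tuple(values[q] for q in pars)]
        values[var] = rng.random() < p_true
    return values

def estimate(var, n, rng):
    """Estimate P(var = true) as N_{var = true} / N."""
    return sum(forward_sample(rng)[var] for _ in range(n)) / n

rng = random.Random(0)
p_b_hat = estimate("B", 20000, rng)
```

With these CPTs the exact value of $P (B = \mathit{true})$ is 0.1276, so `p_b_hat` should land near that, with error shrinking roughly as $1 / \sqrt{N}$.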

Example

Pasted image 20250418223034.png

A: 2/3 based on first 7 samples
