Machine Learning

Bayesian Learning

Basic premise:

Suppose $X$ is the input features and $Y$ is the target feature, $d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ is the evidence (data), $x$ is a new input, and we want to know the corresponding output $y$.

$$P(Y|x,d) = \sum_{m\in M} P(Y,m|x,d) = \sum_{m\in M} P(Y|m,x,d)\,P(m|x,d) = \sum_{m\in M} P(Y|m,x)\,P(m|d)$$

Bayesian Learning

$$P(H|d) \propto P(d|H)\,P(H)$$

Bayesian Prediction

Want to predict $X$ (e.g. the next candy):

$$P(X|d) = \sum_i P(X|d,h_i)\,P(h_i|d) = \sum_i P(X|h_i)\,P(h_i|d)$$
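
A minimal numpy sketch of this full Bayesian prediction using the candy example; the five hypotheses, their lime proportions and the prior below are illustrative assumptions, not values from these notes:

```python
import numpy as np

# Hypotheses about the proportion of lime candies in a bag (illustrative values).
theta_lime = np.array([0.0, 0.25, 0.5, 0.75, 1.0])   # P(lime | h_i)
prior      = np.array([0.1, 0.2, 0.4, 0.2, 0.1])     # P(h_i)

# Evidence: the candies unwrapped so far (1 = lime, 0 = cherry).
d = [1, 1, 1, 1, 1]

# Likelihood of the data under each hypothesis: P(d | h_i) = prod_j P(d_j | h_i)
likelihood = np.array([np.prod([t if c else 1 - t for c in d]) for t in theta_lime])

# Posterior over hypotheses: P(h_i | d) proportional to P(d | h_i) P(h_i)
posterior = likelihood * prior
posterior /= posterior.sum()

# Full Bayesian prediction: P(next = lime | d) = sum_i P(lime | h_i) P(h_i | d)
p_next_lime = np.sum(theta_lime * posterior)
print(posterior, p_next_lime)
```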

Bayesian Learning Properties

Optimal: given the prior, no other prediction is correct more often than the Bayesian prediction.

Price to pay: the hypothesis space is usually very large or infinite, so summing over all hypotheses is often intractable.

Maximum a Posteriori

Idea: make a prediction based on the most probable hypothesis, $h_{MAP}$.

MAP Properties

MAP prediction is less accurate than full Bayesian prediction since it relies on only one hypothesis.

$$h_{MAP} = \arg\max_h P(h|d) = \arg\max_h P(h)\,P(d|h) = \arg\max_h P(h)\prod_i P(d_i|h)$$

* the product induces a non-linear optimisation; taking the log turns it into a sum:

$$h_{MAP} = \arg\max_h \left[\log P(h) + \sum_i \log P(d_i|h)\right]$$

Maximum Likelihood (ML)

Idea: simplify MAP by assuming a uniform prior.

$$h_{ML} = \arg\max_h P(d|h)$$

Make the prediction based on $h_{ML}$ only:

$$P(X|d) \approx P(X|h_{ML})$$

ML Properties

ML prediction is less accurate than Bayesian or MAP prediction since it ignores the prior and relies on only one hypothesis.

$$h_{ML} = \arg\max_h \sum_i \log P(d_i|h)$$

Binomial Distribution

Generalise the hypothesis space to a continuous quantity, e.g. the parameter $\theta$ of a binomial distribution (the probability of a positive outcome).
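
For example, with $n_1$ positive and $n_0$ negative observations (counts introduced here for illustration), maximising the log-likelihood of the binomial parameter $\theta$ gives the familiar frequency estimate:

$$\log P(d|\theta) = n_1\log\theta + n_0\log(1-\theta)
\quad\Rightarrow\quad
\frac{n_1}{\theta} - \frac{n_0}{1-\theta} = 0
\quad\Rightarrow\quad
\theta_{ML} = \frac{n_1}{n_1+n_0}$$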

Priors on Binomials

The Beta distribution: $B(\theta; a, b) \propto \theta^{a-1}(1-\theta)^{b-1}$
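
As a worked step (again with illustrative counts $n_1$, $n_0$), the Beta prior is conjugate to the binomial likelihood, so the posterior is again a Beta distribution:

$$P(\theta|d) \propto \theta^{n_1}(1-\theta)^{n_0}\,\theta^{a-1}(1-\theta)^{b-1} = \theta^{a+n_1-1}(1-\theta)^{b+n_0-1}$$

i.e. $B(\theta;\, a+n_1,\, b+n_0)$: the prior's $a$ and $b$ act as pseudo-counts.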

Bayesian Classifiers

Idea: if you knew the classification you could predict the values of features

$$P(Class|X_1,\ldots,X_n) \propto P(X_1,\ldots,X_n|Class)\,P(Class)$$

Naïve Bayesian Classifier

Assume the $X_i$ are independent of each other given the class:

$$P(Class|X_1,\ldots,X_n) \propto \left[\prod_i P(X_i|Class)\right] P(Class)$$

Predict class $C$ based on attributes $A_i$.

Laplace Correction

If a feature value never occurs in the training set but does in the test set, the ML estimate assigns it probability zero, which zeroes out the whole product; the Laplace correction adds a pseudo-count to every value's count to avoid this.
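
A minimal sketch of a naïve Bayesian classifier with Laplace (add-$\alpha$) smoothing; it assumes discrete features encoded as integers $0..k_i-1$, and the function and argument names are illustrative:

```python
import numpy as np

def train_naive_bayes(X, y, n_values, alpha=1):
    """Naive Bayes with Laplace (add-alpha) smoothing.
    X: (N, n_features) array of integer feature values, y: (N,) class labels,
    n_values: number of possible values for each feature."""
    classes = np.unique(y)
    priors = {c: (np.sum(y == c) + alpha) / (len(y) + alpha * len(classes)) for c in classes}
    cond = {}  # cond[(feature i, class c)][v] = P(X_i = v | c)
    for c in classes:
        Xc = X[y == c]
        for i, k in enumerate(n_values):
            counts = np.bincount(Xc[:, i], minlength=k) + alpha  # Laplace correction
            cond[(i, c)] = counts / counts.sum()
    return classes, priors, cond

def predict(x, classes, priors, cond):
    # argmax_c P(c) * prod_i P(x_i | c), computed in log space for stability
    scores = {c: np.log(priors[c]) + sum(np.log(cond[(i, c)][v]) for i, v in enumerate(x))
              for c in classes}
    return max(scores, key=scores.get)
```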

Bayesian Network Parameter Learning (ML)

For fully observed data, the maximum-likelihood parameters of each conditional probability table are the observed relative frequencies in the data.

Occam’s Razor

Simplicity is encouraged in the likelihood function: a simpler hypothesis spreads its probability over fewer possible data sets, so the data sets it does predict receive higher likelihood.

Supervised Machine Learning

Linear Regression

Linear regression is a model in which the output is a linear function of the input features

$$\hat Y^{\vec w}(e) = w_0 + w_1 X_1(e) + \cdots + w_n X_n(e) = \sum_{i=0}^n w_i X_i(e)$$

where $\vec w = \langle w_0, w_1, w_2, \ldots, w_n\rangle$. We invent a new feature $X_0 \equiv 1$ so that $w_0$ is not a special case.
The sum of squares error on examples $E$ for output $Y$ is: $$Error(E, \vec{w}) = \sum_{e\in E}(Y(e)-\hat Y^{\vec w}(e))^2 = \sum_{e\in E}\Big(Y(e)-\sum_{i=0}^nw_iX_i(e)\Big)^2$$
Goal: find the weights $\vec w$ that minimize $Error(E, \vec w)$.

Finding weights that minimize Error

Find the minimum analytically

$$y^T = \vec{w}X \quad\Rightarrow\quad y^T X^T (XX^T)^{-1} = \vec{w}$$

$X^T(XX^T)^{-1}$ is the pseudo-inverse of $X$.
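
A minimal numpy sketch of the analytic solution, assuming examples are stored as columns of $X$ with the constant feature $X_0 = 1$ as the first row (the synthetic data is illustrative):

```python
import numpy as np

# X: (n+1, N) matrix, one example per column, first row is the constant feature X_0 = 1.
# y: (N,) vector of target values.
rng = np.random.default_rng(0)
N = 50
X = np.vstack([np.ones(N), rng.normal(size=(2, N))])
true_w = np.array([1.0, 2.0, -0.5])
y = true_w @ X + 0.1 * rng.normal(size=N)

# w = y^T X^T (X X^T)^{-1}  (using solve instead of an explicit inverse for stability)
w = np.linalg.solve(X @ X.T, X @ y)
print(w)  # should be close to true_w
```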

Find the minimum iteratively

Gradient Descent

Gradient descent repeatedly moves the weights a small step in the direction that decreases the error:
$$w_i \leftarrow w_i - \eta\,\frac{\partial\, Error(E, \vec w)}{\partial w_i}$$
η is the gradient descent step size, the learning rate
If
$$Error(E, \vec w) = \sum_{e\in E}\Big(Y(e) - \sum_{i=0}^n w_i X_i(e)\Big)^2$$
then update rule:
$$w_i \leftarrow w_i + \eta \sum_{e\in E} \delta(e)\, X_i(e), \qquad \delta(e) = Y(e) - \hat Y^{\vec w}(e)$$
where we have absorbed the factor of 2 from the derivative into $\eta$ (an arbitrary scale).
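
A minimal numpy sketch of this batch update rule (the learning rate and step count are illustrative):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, steps=1000):
    """Batch gradient descent for linear regression.
    X: (n+1, N) with the X_0 = 1 row included; y: (N,) targets."""
    w = np.zeros(X.shape[0])
    for _ in range(steps):
        delta = y - w @ X        # delta(e) = Y(e) - Yhat(e), for every example
        w = w + eta * X @ delta  # w_i += eta * sum_e delta(e) * X_i(e)
    return w
```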

Stochastic and Batched Gradient Descent

Linear Classifier

Assume we are doing binary classification, with classes $\{0, 1\}$. The prediction applies an activation function $f$ to the linear function of the input features: $\hat Y^{\vec w}(e) = f\big(\sum_i w_i X_i(e)\big)$.

Gradient Descent for Linear Classifiers

If the activation function is differentiable, we can use gradient descent to update the weights. The sum of squares error:
$$Error(E, \vec w) = \sum_{e\in E}\Big(Y(e) - f\Big(\sum_i w_i X_i(e)\Big)\Big)^2$$
The partial derivative of the error on example $e$ with respect to weight $w_i$ is:
$$\frac{\partial}{\partial w_i}\Big(Y(e) - f\Big(\sum_i w_i X_i(e)\Big)\Big)^2 = -2\,\delta(e)\, f'\Big(\sum_i w_i X_i(e)\Big)\, X_i(e)$$
where $\delta(e) = Y(e) - f\Big(\sum_i w_i X_i(e)\Big)$.
Thus, each example $e$ updates each weight $w_i$ by
$$w_i \leftarrow w_i + \eta\,\delta(e)\, f'\Big(\sum_i w_i X_i(e)\Big)\, X_i(e)$$

The sigmoid or logistic activation function

$$f(x) = \frac{1}{1+e^{-x}}, \qquad f'(x) = f(x)\,\big(1 - f(x)\big)$$
So $f'(x)$ can be computed from $f(x)$.
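
A minimal sketch of stochastic gradient descent for a linear classifier with the sigmoid activation, using the update rule above (learning rate and epoch count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_linear_classifier(X, y, eta=0.1, epochs=100):
    """Stochastic gradient descent for a linear classifier with sigmoid activation.
    X: (N, n+1) with the constant feature included; y: (N,) labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_e, y_e in zip(X, y):
            p = sigmoid(w @ x_e)                  # f(sum_i w_i X_i(e))
            delta = y_e - p                       # delta(e)
            w += eta * delta * p * (1 - p) * x_e  # uses f'(z) = f(z)(1 - f(z))
    return w
```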

Neural Networks


Neural Networks Basics

Activation Functions


Connecting the neurons together into a network

Feedforward Network

Learning Weights

Back-propagation implements stochastic gradient descent

The Backpropagation Algorithm

An efficient method of calculating the gradients in a multi-layer neural network
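
A minimal numpy sketch of one backpropagation step for a single-hidden-layer network with sigmoid activations and squared error (shapes and the learning rate are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.1):
    """One stochastic-gradient step for a 1-hidden-layer network.
    Error = 0.5 * ||out - y||^2. Shapes assumed:
    x: (n,), y: (m,), W1: (h, n), W2: (m, h)."""
    # Forward pass
    h = sigmoid(W1 @ x)    # hidden activations
    out = sigmoid(W2 @ h)  # network output
    # Backward pass: propagate error derivatives layer by layer
    d_out = (out - y) * out * (1 - out)   # dError/d(pre-activation of output)
    d_hid = (W2.T @ d_out) * h * (1 - h)  # dError/d(pre-activation of hidden)
    # Gradient-descent weight updates
    W2 -= eta * np.outer(d_out, h)
    W1 -= eta * np.outer(d_hid, x)
    return W1, W2
```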

Improving Optimization

Improving Generalization: Regularization

Regularized neural nets: prevent overfitting by trading increased bias for reduced variance.
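
For example, an L2 (weight-decay) penalty $\lambda\sum_i w_i^2$ added to the error changes each gradient step as in this sketch ($\lambda$ and $\eta$ are illustrative):

```python
def regularized_update(w, grad_error, eta=0.01, lam=0.001):
    """Gradient step on Error(E, w) + lam * ||w||^2 (L2 weight decay).
    The extra 2 * lam * w term shrinks the weights toward zero."""
    return w - eta * (grad_error + 2 * lam * w)
```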

Sequence Modeling

Composite models and other learning methods

Unsupervised Machine Learning

Incomplete data

Many real-world problems have hidden variables (AKA latent variables)

How to Deal with Missing Data

  1. Ignore hidden variables
    • Complexity increases
  2. Ignore records with missing values
    • does not work with true latent variables
      • e.g. always missing
You cannot ignore missing data unless you know it is missing at random

Often data is missing because of something correlated with a variable of interest

  3. maximize likelihood directly
    • suppose Z is hidden and E is observable with values e
      $$\log P(e \mid h) = \log \sum_{Z} P(e, Z \mid h)$$
Problem: can’t push log inside the sum to linearize

Expectation-Maximization (EM)

If we knew the missing values, computing $h_{ML}$ would be easy again!

  1. Guess $h_{ML}$
  2. Iterate:
    • expectation: based on $h_{ML}$, compute the expectation of the missing values, $P(X|h_{ML}, e)$
    • maximization: based on the expected missing values, compute a new estimate of $h_{ML}$
Really simple version (K-means algorithm)

Expectation: based on $h_{ML}$, compute the most likely missing values, $\arg\max_Z P(Z|h_{ML}, e)$
Maximization: based on those missing values, you now have complete data

K-Means Algorithm

The K-means algorithm can be used for clustering; a code sketch follows the steps below:

  1. pick k means in X, one per class, C
  2. iterate until means stop changing:
    1. assign examples to k classes (e.g. as closest to current means)
    2. re-estimate k-means based on assignment
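
A minimal numpy sketch of the steps above (initialisation by sampling $k$ data points and the iteration cap are illustrative choices):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """K-means clustering. X: (N, d) data matrix, k: number of clusters."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # 1. pick k initial means
    for _ in range(iters):
        # Expectation-like step: assign each example to the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        # Maximization-like step: re-estimate each mean from its assigned examples
        new_means = np.array([X[assign == c].mean(axis=0) if np.any(assign == c) else means[c]
                              for c in range(k)])
        if np.allclose(new_means, means):  # 2. stop when means stop changing
            break
        means = new_means
    return means, assign
```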

Expectation Maximization

Approximate the maximum likelihood

General Bayes Network EM

Complete Data: Bayes Net Maximum Likelihood
$$P(V = v \mid parents(V) = \pi) = \frac{count(V = v,\ parents(V) = \pi)}{count(parents(V) = \pi)}$$
where $parents(V)$ denotes the parents of $V$ in the network.
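
A minimal sketch of this counting estimate for one variable's conditional probability table, assuming fully observed records stored as dictionaries (the function and argument names are illustrative):

```python
from collections import Counter, defaultdict

def ml_cpt(data, var, parents):
    """Maximum-likelihood CPT for `var` given `parents` from fully observed records.
    data: list of dicts mapping variable name -> value."""
    joint = Counter((tuple(rec[p] for p in parents), rec[var]) for rec in data)
    parent_counts = Counter(tuple(rec[p] for p in parents) for rec in data)
    cpt = defaultdict(dict)
    for (pa, v), n in joint.items():
        # P(var = v | parents = pa) = count(var = v, parents = pa) / count(parents = pa)
        cpt[pa][v] = n / parent_counts[pa]
    return dict(cpt)

records = [{"A": 1, "B": 0}, {"A": 1, "B": 1}, {"A": 0, "B": 1}]
print(ml_cpt(records, var="B", parents=["A"]))
```
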
Incomplete data: Bayes Net Expectation Maximization

Belief Network Structure Learning


Autoencoders

A representation learning algorithm

Components

2 main components:

  1. Encoder $e(x)$: maps $x$ to a low-dimensional representation $\hat z$
  2. Decoder $d(\hat z)$: maps $\hat z$ back to (an approximation of) its original representation $x$
    The autoencoder implements $\hat x = d(e(x))$
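
A minimal numpy sketch of a linear autoencoder trained by gradient descent on the reconstruction error $\lVert x - d(e(x))\rVert^2$ (layer sizes, learning rate, and absorbing the factor of 2 into $\eta$ are illustrative choices):

```python
import numpy as np

def train_autoencoder(X, z_dim, eta=0.01, epochs=200, seed=0):
    """X: (N, d) data. Encoder e(x) = We @ x, decoder d(z) = Wd @ z.
    Minimizes the reconstruction error ||x - d(e(x))||^2 by gradient descent."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    We = rng.normal(scale=0.1, size=(z_dim, d))  # encoder weights
    Wd = rng.normal(scale=0.1, size=(d, z_dim))  # decoder weights
    for _ in range(epochs):
        for x in X:
            z = We @ x       # low-dimensional representation z_hat
            x_hat = Wd @ z   # reconstruction
            err = x_hat - x
            # Gradients of the reconstruction error w.r.t. Wd and We (factor 2 absorbed in eta)
            grad_Wd = np.outer(err, z)
            grad_We = np.outer(Wd.T @ err, x)
            Wd -= eta * grad_Wd
            We -= eta * grad_We
    return We, Wd
```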

Deep Neural Network Autoencoders

Generative Adversarial Networks

A generative unsupervised learning algorithm

Components

GANs are actually a pair of neural networks: a generator, which maps random noise to synthetic examples, and a discriminator, which tries to distinguish real examples from the generator's output.

Training

GANs are trained with a minimax error:
$$\min_G \max_D \; \mathbb{E}_{x\sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{z\sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$