Supervised Machine Learning

Foundations

Learning

The ability to improve behaviour based on experience

Components of a Learning Problem

The following components are part of any learning problem:

Types of Learning

Common Learning Tasks

Feedback

Learning Tasks can be characterized by the feedback given to the learner

Measuring Success

The measure of success is not how well the agent performs on the training examples, but how well the agent performs for new (unseen) examples

Bias

The tendency to prefer one hypothesis over another is called a bias

Supervised Learning

Given:
- a set of input features
- a set of target features
- a set of training examples, where the values of both the input and the target features are given

Goal: predict the values of the target features for a new example, given only its input features.

Noise

Data isn’t perfect

Evaluating Predictions

Suppose Y is a feature and e is an example:

Measure of Error

E is the set of examples. T is the set of target features

Absolute error: $$\sum_{e\in E}\sum_{Y\in T}\left|Y(e)-\hat{Y}(e)\right|$$
Sum-of-squares error: $$\sum_{e\in E}\sum_{Y\in T}\left(Y(e)-\hat{Y}(e)\right)^2$$
Worst-case error: $$\max_{e\in E}\max_{Y\in T}\left|Y(e)-\hat{Y}(e)\right|$$
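A minimal sketch of these three measures in Python, assuming each example's actual and predicted target values are stored as dicts keyed by feature name (this data layout is an assumption, not from the notes):

```python
# Sketch: the three error measures above; the dict-based data layout is illustrative.

def absolute_error(actual, predicted, targets):
    return sum(abs(a[Y] - p[Y]) for a, p in zip(actual, predicted) for Y in targets)

def sum_of_squares_error(actual, predicted, targets):
    return sum((a[Y] - p[Y]) ** 2 for a, p in zip(actual, predicted) for Y in targets)

def worst_case_error(actual, predicted, targets):
    return max(abs(a[Y] - p[Y]) for a, p in zip(actual, predicted) for Y in targets)

# Example usage with a single target feature "y"
actual    = [{"y": 1.0}, {"y": 0.0}, {"y": 1.0}]
predicted = [{"y": 0.8}, {"y": 0.3}, {"y": 0.9}]
print(absolute_error(actual, predicted, ["y"]))        # ≈ 0.6
print(sum_of_squares_error(actual, predicted, ["y"]))  # ≈ 0.14
print(worst_case_error(actual, predicted, ["y"]))      # 0.3
```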

Precision and Recall

Not all errors are equal, e.g. predict:
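For reference, the standard definitions, in terms of the counts of true positives (tp), false positives (fp), and false negatives (fn) (these symbols are introduced here, not in the notes):
$$\text{precision}=\frac{tp}{tp+fp}\qquad\text{recall}=\frac{tp}{tp+fn}$$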

Receiver Operating Characteristic (ROC) Curve

Pasted image 20250416154648.png
The ROC curve gives the full range of performance of an algorithm across different biases (e.g. different thresholds for predicting the positive class)
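A minimal sketch of how the points on an ROC curve can be computed by sweeping a decision threshold over predicted scores (the scoring setup and names are assumptions; the notes only state the idea):

```python
# Sketch: compute (false-positive-rate, true-positive-rate) points for an ROC
# curve by sweeping a decision threshold over predicted scores.
# Assumes both classes are present in the labels.

def roc_points(scores, labels):
    """scores: predicted scores/probabilities; labels: 1 = positive, 0 = negative."""
    points = []
    positives = sum(labels)
    negatives = len(labels) - positives
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

print(roc_points([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```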

Basic Models for Supervised Learning

Many learning algorithms can be seen as deriving from:

Decision Trees

Simple, successful technique for supervised learning from discrete data

Properties of Decision Trees

Examples

Pasted image 20250416160120.png

Learning a Decision Tree

Pseudocode
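A minimal Python sketch of the standard top-down learner (not the original pseudocode), assuming discrete features and using the stopping criteria, information-gain feature selection, and point-estimate return value described in the sections below; all function and variable names are my own:

```python
# Sketch of top-down decision tree learning (ID3-style) for discrete features.
import math
from collections import Counter

def entropy(examples, target):
    """Information content I(E) of a set of examples w.r.t. the target feature."""
    counts = Counter(e[target] for e in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def info_gain(examples, feature, target):
    """I(E) minus the size-weighted information content after splitting on feature."""
    total = len(examples)
    split_info = 0.0
    for value in set(e[feature] for e in examples):
        subset = [e for e in examples if e[feature] == value]
        split_info += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - split_info

def learn_tree(examples, features, target):
    labels = [e[target] for e in examples]
    # Stopping criteria: all examples agree on the target, or no features remain.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]   # leaf: point estimate (mode)
    best = max(features, key=lambda f: info_gain(examples, f, target))
    children = {}
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]
        children[value] = learn_tree(subset, [f for f in features if f != best], target)
    return (best, children)   # internal node: (feature to test, one child per value)

def predict(tree, example):
    while isinstance(tree, tuple):
        feature, children = tree
        tree = children[example[feature]]
    return tree

# Tiny usage example
data = [{"outlook": "sunny", "play": "no"},  {"outlook": "rain", "play": "yes"},
        {"outlook": "sunny", "play": "no"},  {"outlook": "rain", "play": "yes"}]
tree = learn_tree(data, ["outlook"], "play")
print(predict(tree, {"outlook": "sunny"}))   # "no"
```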

Remaining Issues

Stopping Criteria

How do we decide to stop splitting?

Feature Selection

Ideal: choose the sequence of features that results in the smallest tree
Actual: split myopically, choosing the locally best feature at each step (see Information Gain below), which may not give the smallest overall tree

Good Feature Selection

Pasted image 20250416162126.png

Bad Feature Selection

Pasted image 20250416162114.png

Information Theory

In general, an outcome x with probability P(x) needs $$-\log_2P(x)$$ bits to encode (e.g. a fair coin flip, with probability 1/2, needs 1 bit)

Information Gain

Given a set E of N training examples, if the number of examples with output feature Y=yi is Ni, then $$P(Y=y_i)=P(y_i)=\frac{N_i}{N}$$ (the point estimate)
The total information content for the set E is $$I(E)=-\sum_{y_i\in Y}P(y_i)\log_2P(y_i)$$
So, after splitting E up into E1 and E2 with sizes N1 and N2 based on input attribute Xi, the information content is $$I(E_{split}) = \frac{N_1}{N}I(E_1) + \frac{N_2}{N}I(E_2)$$
and we want the Xi that maximizes the information gain: $$I(E)-I(E_{split})$$
Information gain is always non-negative: the weighted average of the subsets' information contents never exceeds the information content of the whole set (by the concavity of entropy), so I(E_split) is at most I(E).
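A small worked example (the numbers are chosen here purely for illustration): suppose E has 8 examples, 4 with Y = true and 4 with Y = false, and splitting on Xi gives E1 with 3 true / 1 false and E2 with 1 true / 3 false. Then
$$I(E)=-\tfrac{1}{2}\log_2\tfrac{1}{2}-\tfrac{1}{2}\log_2\tfrac{1}{2}=1\text{ bit}$$
$$I(E_1)=I(E_2)=-\tfrac{3}{4}\log_2\tfrac{3}{4}-\tfrac{1}{4}\log_2\tfrac{1}{4}\approx 0.811\text{ bits}$$
$$I(E_{split})=\tfrac{4}{8}I(E_1)+\tfrac{4}{8}I(E_2)\approx 0.811\text{ bits}$$
so the information gain is about 0.189 bits.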

Final Return Value

The point estimate of Y (the output features) over all the examples at the leaf, e.g. the mode for classification or the mean/median for regression

Using a Priority Queue to Learn the DT

Some branches might be more worthwhile to expand
Idea: sort the leaves using a priority queue ranked by how much information can be gained with the best feature at the leaf

Pseudocode
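A minimal sketch of the best-first idea, reusing the entropy and info_gain helpers from the sketch above and Python's heapq as the priority queue (gains are negated so the largest gain is popped first); the names and the max_splits budget are illustrative:

```python
# Sketch: grow the tree best-first, always expanding the leaf whose best split
# yields the largest information gain (reuses info_gain from the sketch above).
import heapq, itertools

def learn_tree_best_first(examples, features, target, max_splits=10):
    counter = itertools.count()          # tie-breaker so heapq never compares dicts
    tree = {"leaf": examples}            # mutable node: the tree starts as one leaf
    frontier = []                        # max-priority queue via negated gains

    def push(node, exs, feats):
        if feats and len(set(e[target] for e in exs)) > 1:
            best = max(feats, key=lambda f: info_gain(exs, f, target))
            gain = info_gain(exs, best, target)
            heapq.heappush(frontier, (-gain, next(counter), node, exs, feats, best))

    push(tree, examples, features)
    for _ in range(max_splits):          # stop after a budget of splits,
        if not frontier:                 # or when no leaf is worth expanding
            break
        _, _, node, exs, feats, best = heapq.heappop(frontier)
        del node["leaf"]
        node["split_on"], node["children"] = best, {}
        for value in set(e[best] for e in exs):
            subset = [e for e in exs if e[best] == value]
            child = {"leaf": subset}
            node["children"][value] = child
            push(child, subset, [f for f in feats if f != best])
    return tree                          # leaves keep their examples; predict with their mode
```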

Overfitting

Pasted image 20250416173403.png
Sometimes the decision tree is too good at classifying the training data, and will not generalise very well.

Some methods to avoid overfitting:

Test set errors caused by:

Cross Validation

Split the training data into a training set and a validation set, and use error on the validation set (rather than on the training set) to make choices such as when to stop splitting
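A minimal sketch of this idea (the helper names, the 80/20 split, and the candidate_learners / error arguments are all illustrative assumptions):

```python
# Sketch: hold out part of the training data as a validation set and use it
# to choose between candidate models; all names here are illustrative.
import random

def train_validation_split(examples, validation_fraction=0.2, seed=0):
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

def choose_model(examples, candidate_learners, error):
    training, validation = train_validation_split(examples)
    models = [learn(training) for learn in candidate_learners]
    # Pick the model with the lowest error on the held-out validation set;
    # the test set is never used for this choice.
    return min(models, key=lambda m: error(m, validation))
```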