Fully visible belief network (FVBN)
Explicit density model: the likelihood of an image x is the joint likelihood of its pixels
Use chain rule to decompose likelihood of an image x into a product of 1-d conditional distributions:
p(x) = p(x_1) p(x_2 | x_1) … p(x_n | x_1, …, x_{n-1})
Then, maximize the likelihood of the training data
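The chain-rule decomposition can be sketched numerically. In the toy example below, `toy_conditional` is a hypothetical stand-in for the neural network: it returns a categorical distribution over pixel values given all previously generated pixels, and the log-likelihood is the sum of the per-pixel conditional log-probabilities.

```python
import numpy as np

def toy_conditional(prev_pixels, n_values=4):
    # Hypothetical stand-in for the network: a categorical
    # distribution over pixel values given the previous pixels.
    rng = np.random.default_rng(len(prev_pixels))  # deterministic toy
    logits = rng.normal(size=n_values)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def log_likelihood(pixels, n_values=4):
    # Chain rule: log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, v in enumerate(pixels):
        probs = toy_conditional(pixels[:i], n_values)
        total += np.log(probs[v])
    return total

x = [0, 3, 1, 2]
print(log_likelihood(x))  # maximizing this over the training set fits the model
```

Training maximizes this quantity (equivalently, minimizes the summed softmax cross-entropy) over all images in the training set.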
PixelRNN
- Generate image pixels one at a time, starting at the upper left corner
- Compute a hidden state for each pixel that depends on hidden states and RGB values from the left and from above
- At each pixel, predict red, then green, then blue (each channel conditioned on the channels already predicted)
- Each channel is a softmax over the 256 values [0, 1, …, 255]
Sequential generation is slow!
Each pixel depends implicitly on all pixels above and to the left
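The generation loop above can be sketched as follows. `predict_pixel` is a hypothetical placeholder for the recurrent network (a real PixelRNN computes hidden states over rows with LSTMs); what matters is that each pixel's distribution depends only on pixels above and to the left, and that sampling is one pixel at a time:

```python
import numpy as np

def predict_pixel(image, r, c, n_values=256):
    # Hypothetical placeholder for the network: a distribution over
    # [0..255] that depends on already-generated pixels (above / left).
    context = image[:r, :].sum() + image[r, :c].sum()
    rng = np.random.default_rng(int(context) + 31 * r + c)
    logits = rng.normal(size=n_values)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(h, w, seed=0, n_values=256):
    rng = np.random.default_rng(seed)
    img = np.zeros((h, w), dtype=np.int64)
    # Raster-scan order: start at the upper-left corner and
    # sample each pixel from its conditional distribution.
    for r in range(h):
        for c in range(w):
            probs = predict_pixel(img, r, c, n_values)
            img[r, c] = rng.choice(n_values, p=probs)
    return img

img = generate(4, 4)
print(img.shape)  # h * w sequential steps -> slow generation
```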
PixelCNN
- Still generate image pixels starting from corner
- Dependency on previous pixels now modelled using a CNN over context region (masked convolution)
Training is faster than PixelRNN (the full context region is known at training time, so the convolutions can be evaluated in parallel)
Generation is still sequential → slow
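The masked convolution can be sketched by building a binary mask that is multiplied into the kernel weights before each forward pass. Following the PixelCNN convention, a type "A" mask (first layer) hides the centre pixel itself, while a type "B" mask (later layers) allows it:

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    # Mask for a k x k convolution kernel so each output position only
    # sees pixels above it and to its left (raster-scan causality).
    m = np.zeros((k, k), dtype=np.float32)
    m[: k // 2, :] = 1.0           # all rows above the centre
    m[k // 2, : k // 2] = 1.0      # left of the centre in the centre row
    if mask_type == "B":
        m[k // 2, k // 2] = 1.0    # centre pixel allowed in later layers
    return m

print(causal_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

In a real implementation the mask is applied to the conv weights (e.g. `weight * mask`) at every forward pass, so gradients for the masked positions stay zero.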
Pros
- Can explicitly compute likelihood p(x)
- Explicit likelihood of training data gives good evaluation metric
- Good samples
Cons
- Sequential generation → slow
Improving PixelCNN performance
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc…
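Of the improvements above, the discretized logistic loss replaces the 256-way softmax with a continuous logistic density integrated over each pixel bin (as in PixelCNN++). A minimal sketch of the per-pixel negative log-likelihood, assuming a single logistic component with illustrative parameter names `mu` and `scale` in the rescaled [-1, 1] coordinate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discretized_logistic_nll(x, mu, scale):
    # x: integer pixel value in [0, 255]; mu, scale: predicted
    # logistic parameters in the rescaled [-1, 1] coordinate.
    xt = x / 127.5 - 1.0          # rescale {0..255} -> [-1, 1]
    half = 1.0 / 255.0            # half of one pixel bin
    if x == 0:                    # open lower edge: integrate from -inf
        p = sigmoid((xt + half - mu) / scale)
    elif x == 255:                # open upper edge: integrate to +inf
        p = 1.0 - sigmoid((xt - half - mu) / scale)
    else:                         # probability mass of the pixel's bin
        p = (sigmoid((xt + half - mu) / scale)
             - sigmoid((xt - half - mu) / scale))
    return -np.log(np.maximum(p, 1e-12))

print(discretized_logistic_nll(128, mu=0.0, scale=0.1))
```

Pixels near the predicted `mu` get low loss; pixels far from it get high loss, which gives smoother gradients than a softmax that treats 127 and 128 as unrelated classes.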