Fully visible belief network (FVBN)
Explicit density model: the likelihood of an image x is the joint likelihood of its pixels
Use chain rule to decompose likelihood of an image x into a product of 1-d conditional distributions:
p(x) = p(x_1) p(x_2 | x_1) … p(x_n | x_1, …, x_{n-1})
Then, maximize the likelihood of the training data
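The chain-rule decomposition can be sketched numerically. In the toy example below, `toy_conditional` is a hypothetical stand-in for the neural network: it returns a categorical distribution over pixel values given all previously generated pixels, and the log-likelihood is the sum of the per-pixel conditional log-probabilities.

```python
import numpy as np

def toy_conditional(prev_pixels, n_values=4):
    # Hypothetical stand-in for the network: a categorical
    # distribution over pixel values given the previous pixels.
    rng = np.random.default_rng(len(prev_pixels))  # deterministic toy
    logits = rng.normal(size=n_values)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def log_likelihood(pixels, n_values=4):
    # Chain rule: log p(x) = sum_i log p(x_i | x_1, ..., x_{i-1})
    total = 0.0
    for i, v in enumerate(pixels):
        probs = toy_conditional(pixels[:i], n_values)
        total += np.log(probs[v])
    return total

x = [0, 3, 1, 2]
print(log_likelihood(x))  # maximizing this over the training set fits the model
```

Training maximizes this quantity (equivalently, minimizes the summed softmax cross-entropy) over all images in the training set.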
PixelRNN
- Generate image pixels one at a time, starting at the upper left corner
- Compute a hidden state for each pixel that depends on hidden states and RGB values from the left and from above
- At each pixel, predict red, then green, then blue (each channel conditioned on the channels already predicted)
- Each channel is a softmax over the 256 values [0, 1, …, 255]
Sequential generation is slow!
Each pixel depends implicitly on all pixels above and to the left
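The generation loop above can be sketched as follows. `predict_pixel` is a hypothetical placeholder for the recurrent network (a real PixelRNN computes hidden states over rows with LSTMs); what matters is that each pixel's distribution depends only on pixels above and to the left, and that sampling is one pixel at a time:

```python
import numpy as np

def predict_pixel(image, r, c, n_values=256):
    # Hypothetical placeholder for the network: a distribution over
    # [0..255] that depends on already-generated pixels (above / left).
    context = image[:r, :].sum() + image[r, :c].sum()
    rng = np.random.default_rng(int(context) + 31 * r + c)
    logits = rng.normal(size=n_values)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(h, w, seed=0, n_values=256):
    rng = np.random.default_rng(seed)
    img = np.zeros((h, w), dtype=np.int64)
    # Raster-scan order: start at the upper-left corner and
    # sample each pixel from its conditional distribution.
    for r in range(h):
        for c in range(w):
            probs = predict_pixel(img, r, c, n_values)
            img[r, c] = rng.choice(n_values, p=probs)
    return img

img = generate(4, 4)
print(img.shape)  # h * w sequential steps -> slow generation
```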
PixelCNN
- Still generate image pixels starting from corner
- Dependency on previous pixels now modelled using a CNN over context region (masked convolution)
Training is faster than PixelRNN (the full context region is known at training time, so the convolutions can be evaluated in parallel)
Generation is still sequential → slow
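The masked convolution can be sketched by building a binary mask that is multiplied into the kernel weights before each forward pass. Following the PixelCNN convention, a type "A" mask (first layer) hides the centre pixel itself, while a type "B" mask (later layers) allows it:

```python
import numpy as np

def causal_mask(k, mask_type="A"):
    # Mask for a k x k convolution kernel so each output position only
    # sees pixels above it and to its left (raster-scan causality).
    m = np.zeros((k, k), dtype=np.float32)
    m[: k // 2, :] = 1.0           # all rows above the centre
    m[k // 2, : k // 2] = 1.0      # left of the centre in the centre row
    if mask_type == "B":
        m[k // 2, k // 2] = 1.0    # centre pixel allowed in later layers
    return m

print(causal_mask(3, "A"))
# [[1. 1. 1.]
#  [1. 0. 0.]
#  [0. 0. 0.]]
```

In a real implementation the mask is applied to the conv weights (e.g. `weight * mask`) at every forward pass, so gradients for the masked positions stay zero.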
Pros
- Can explicitly compute likelihood p(x)
- Explicit likelihood of training data gives good evaluation metric
- Good samples
Cons
- Sequential generation → slow
Improving PixelCNN performance
- Gated convolutional layers
- Short-cut connections
- Discretized logistic loss
- Multi-scale
- Training tricks
- Etc…
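Of the improvements above, the discretized logistic loss replaces the 256-way softmax with a continuous logistic density integrated over each pixel bin (as in PixelCNN++). A minimal sketch of the per-pixel negative log-likelihood, assuming a single logistic component with illustrative parameter names `mu` and `scale` in the rescaled [-1, 1] coordinate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discretized_logistic_nll(x, mu, scale):
    # x: integer pixel value in [0, 255]; mu, scale: predicted
    # logistic parameters in the rescaled [-1, 1] coordinate.
    xt = x / 127.5 - 1.0          # rescale {0..255} -> [-1, 1]
    half = 1.0 / 255.0            # half of one pixel bin
    if x == 0:                    # open lower edge: integrate from -inf
        p = sigmoid((xt + half - mu) / scale)
    elif x == 255:                # open upper edge: integrate to +inf
        p = 1.0 - sigmoid((xt - half - mu) / scale)
    else:                         # probability mass of the pixel's bin
        p = (sigmoid((xt + half - mu) / scale)
             - sigmoid((xt - half - mu) / scale))
    return -np.log(np.maximum(p, 1e-12))

print(discretized_logistic_nll(128, mu=0.0, scale=0.1))
```

Pixels near the predicted `mu` get low loss; pixels far from it get high loss, which gives smoother gradients than a softmax that treats 127 and 128 as unrelated classes.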