Computer Vision W45 Generative models
Unsupervised learning
regression vs. classification
why unsupervised learning?
- little of the available data is labeled
- pre-train an unsupervised model
- anomaly detection - learn what data distribution typically looks like
- cheaper than supervised learning
Generative models
a generative model is a class of statistical model that contrasts with discriminative models.
informally
- discriminative model - classifier
- generative model - generate new data instances
a generative model includes the distribution of the data itself and tells you how likely a given example is.
for example, models that predict the next word in a sequence are typically generative models.
many kinds
Generative models are hard
- have to model more - a generative model tries to model how the data is distributed throughout the space
- a generative model for images might capture some difficult correlations.
Autoencoder
Traditional Autoencoder (AE)
an AE is a neural network that is trained to attempt to copy its input to its output
given data $x$ (no labels) we would like to learn functions $f$ (encoder) and $g$ (decoder) s.t.:
$z = f(x) = s(Wx + b)$, $\hat{x} = g(z) = s(W'z + b')$
where $g \circ f$ is an approximation of the identity function
$z$ is the latent or hidden representation or code and $s$ is a non-linear function.
$\hat{x}$ is $x$'s reconstruction.
Motivation
- learning the identity function everywhere is not especially useful
- autoencoders are restricted in ways that allow them to copy only approximately.
- traditionally an autoencoder is used for dimensionality reduction and feature learning
Training an autoencoder
the training set is unlabeled.
using SGD
loss function: L2 norm of the reconstruction error, e.g. $L(x, \hat{x}) = \|x - \hat{x}\|_2^2$
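The training recipe above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the toy data, the layer sizes, the tanh encoder, the linear decoder, the learning rate and the `params` dictionary are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unlabeled data (assumption): 200 points in 4-D lying on a 2-D subspace.
basis = rng.normal(size=(2, 4))
X = rng.uniform(-0.5, 0.5, size=(200, 2)) @ basis

n_in, n_hidden = 4, 2  # undercomplete: the code is smaller than the input
params = {
    "W1": rng.normal(scale=0.1, size=(n_in, n_hidden)),
    "b1": np.zeros(n_hidden),
    "W2": rng.normal(scale=0.1, size=(n_hidden, n_in)),
    "b2": np.zeros(n_in),
}

def encode(x, p):
    """Encoder f: input -> latent code (tanh non-linearity)."""
    return np.tanh(x @ p["W1"] + p["b1"])

def decode(z, p):
    """Decoder g: latent code -> reconstruction (linear, an assumption)."""
    return z @ p["W2"] + p["b2"]

def reconstruction_loss(x, p):
    """Mean squared L2 norm of the reconstruction error."""
    x_hat = decode(encode(x, p), p)
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))

initial_loss = reconstruction_loss(X, params)

lr = 0.02
for epoch in range(200):
    for i in rng.permutation(len(X)):       # SGD: one example per update
        x = X[i:i + 1]                      # shape (1, n_in)
        z = encode(x, params)
        x_hat = decode(z, params)
        d_xhat = 2.0 * (x_hat - x)          # gradient of ||x_hat - x||^2
        d_pre = (d_xhat @ params["W2"].T) * (1.0 - z ** 2)  # back through tanh
        params["W2"] -= lr * (z.T @ d_xhat)
        params["b2"] -= lr * d_xhat.sum(axis=0)
        params["W1"] -= lr * (x.T @ d_pre)
        params["b1"] -= lr * d_pre.sum(axis=0)

final_loss = reconstruction_loss(X, params)
print(initial_loss, final_loss)
```

Note that no labels appear anywhere: the input itself is the training target.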
undercomplete vs overcomplete AE
AE is undercomplete if hidden representation has smaller dimensionality than the input
- good feature for the training distribution
- bad for other types of input
AE is overcomplete if hidden representation has larger dimensionality than the input
- no guarantee that the hidden units will extract meaningful structure
- a higher-dimensional code helps model a more complex distribution, which is good for training a linear classifier.
Stacked AE - deep
deep is better
- more representational power
- easier for similar datapoints to group together.
- better separation of classes
Latent space
the space in which the encoding exists is called the latent space or the hidden space
if AE is undercomplete, the compressed latent representation is often called the bottleneck
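what if we try to interpolate between two points in the latent space?
we can observe that as the interpolation weight $\alpha$ goes from 0 to 1, the image decoded from $(1 - \alpha) z_1 + \alpha z_2$ looks less like what $z_1$ represents and more and more like what $z_2$ represents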
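A minimal sketch of linear interpolation between two latent codes. The function name `interpolate`, the weight `alpha` and the 2-D example codes are assumptions for illustration; in practice each interpolated code would be passed through the trained decoder to obtain an image.

```python
import numpy as np

def interpolate(z1, z2, num_steps=5):
    """Return the codes (1 - alpha) * z1 + alpha * z2 for alpha evenly spaced in [0, 1]."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z1 + a * z2 for a in alphas])

# Hypothetical latent codes of two encoded inputs.
z1 = np.array([0.0, 2.0])
z2 = np.array([4.0, 0.0])

path = interpolate(z1, z2, num_steps=5)
# path[0] is z1, path[-1] is z2, and the rows in between move gradually from one to the other.
print(path)
```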
Problem with simple autoencoders
- latent space may not be continuous
- the network easily ends up just replicating (memorizing) its inputs
solution: variational autoencoders
Regularization
motivation: learn meaningful features without altering the code’s dimensions
solution: imposing other constraints on the network
Sparse AE
suppose the AE has learned these hidden features.
It has just remembered the data, not learned it. This will not generalize well to unseen data
the image is decomposed into a combination of sparse features:
- if our learned features are sparse, we can generalize better
- sparse means: most activations are 0 except a few that are close to 1.
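Let $a_j$ denote the activation of the $j$-th hidden unit of the autoencoder.
Let $a_j(x)$ be the activation of this specific node on a given input $x$.
Let $\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j(x^{(i)})$ be the average activation of hidden unit $j$ over the $m$ training examples.
we would like to enforce the constraint $\hat{\rho}_j = \rho$ where $\rho$ is a sparsity parameter, typically small.
in other words, we want the average activation of each neuron $j$ to be close to $\rho$.
we penalize $\hat{\rho}_j$ for deviating from $\rho$
penalty term: $\sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{k} \left[ \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j} \right]$
$\mathrm{KL}$ is the Kullback-Leibler divergence function, $k$ is the number of units in the hidden layer
$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = 0$ if $\hat{\rho}_j = \rho$
overall loss function: $J_{sparse} = J + \beta \sum_{j=1}^{k} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j)$, where $J$ is the reconstruction loss and $\beta$ weights the sparsity penalty
we need to know $\hat{\rho}_j$ beforehand, so we have to compute a forward pass on the whole training set.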
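The sparsity penalty can be sketched as below, assuming hidden activations in (0, 1) (for example from a sigmoid) collected for the whole training set in one forward pass. The function names, the clipping constant `eps` and the default `rho=0.05` are assumptions for illustration.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(activations, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || average activation of that unit).

    activations: array of shape (m, k), m training examples and k hidden units.
    """
    rho_hat = np.clip(activations.mean(axis=0), eps, 1.0 - eps)
    return float(kl_bernoulli(rho, rho_hat).sum())

# Units whose average activation equals rho incur (almost) no penalty ...
sparse_acts = np.full((10, 3), 0.05)
# ... while units that are active half of the time are penalized.
dense_acts = np.full((10, 3), 0.5)

print(sparsity_penalty(sparse_acts), sparsity_penalty(dense_acts))
```

The overall loss would then be the reconstruction loss plus a weight (a hyperparameter) times this penalty.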
Recap
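cross-entropy: $H(p, q) = -\sum_x p(x) \log q(x)$
KL divergence: $D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p)$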
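A small numerical check of how the two quantities relate for discrete distributions: the KL divergence equals the cross-entropy minus the entropy of the first distribution. The two example distributions `p` and `q` are arbitrary choices made for this sketch.

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.25, 0.5])

print(kl_divergence(p, q), cross_entropy(p, q) - entropy(p))
```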
Denoising AE
aim to encode the input to learn and describe latent attributes of the data
try to undo the effect of a corruption process stochastically applied to the input
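model isn’t able to simply develop a mapping which memorizes the training data, because input and target output are not the same
a more robust model: noise is added to the input, while the target output is still the original image
the DAE minimizes $L(x, g(f(\tilde{x})))$ (encoder $f$, decoder $g$), where $\tilde{x}$ is a copy of $x$ that has been corrupted by some noise process
example of noise
random assignment of a subset of inputs to 0 with probability $p$.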
Gaussian additive noise.
training
reconstruction $\hat{x}$ computed from the corrupted input $\tilde{x}$
loss function compares the reconstruction $\hat{x}$ with the noiseless $x$
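The two noise processes and the denoising loss can be sketched as follows. The function names and the `reconstruct` callable (standing in for decoder applied to encoder) are assumptions for illustration. The key point is that the loss is computed against the clean input, not the corrupted one.

```python
import numpy as np

def mask_noise(x, p, rng):
    """Set each input entry to 0 independently with probability p."""
    keep = rng.random(x.shape) >= p
    return x * keep

def gaussian_noise(x, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    return x + rng.normal(scale=sigma, size=x.shape)

def denoising_loss(x, x_tilde, reconstruct):
    """Reconstruct from the corrupted input, compare with the noiseless input."""
    x_hat = reconstruct(x_tilde)
    return float(np.mean(np.sum((x_hat - x) ** 2, axis=1)))

rng = np.random.default_rng(0)
x = np.ones((4, 6))
x_tilde = mask_noise(x, p=0.5, rng=rng)

# With an identity "autoencoder", any corrupted entry shows up in the loss,
# so simply copying the input is no longer an optimal solution.
identity = lambda v: v
print(denoising_loss(x, x_tilde, identity))
```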
what does it learn?
the correlations of $x$'s features
based on those relations we can learn a model that is less prone to changes in the input
we are forcing the hidden layer to learn a generalized structure of the data
The DAE is trained to map a corrupted data point back to the manifold.
Going convolutional
we can replace the fully connected layers with convolutional layers
Variational autoencoders
very similar to the regular autoencoder
unique property: its latent space is continuous (by design), allowing random sampling and interpolation
how?
by making its encoder not output an encoding vector of size $n$, but rather output two vectors of size $n$:
a vector of means $\mu$, and another vector of standard deviations $\sigma$
probabilistic nature using a sampling layer
Random sampling layer
the stochastic generation means that, even for the same input, while the mean and standard deviations remain the same, the actual encoding will vary somewhat on every single pass simply due to sampling.
the output becomes a probability distribution rather than a deterministic value.
- the hidden layer outputs the parameters of a vector of random variables of length $n$
- the $i$-th elements of $\mu$ and $\sigma$ represent the mean and standard deviation of the $i$-th random variable $X_i$, from which we sample to obtain the sampled encoding, which we pass onward to the decoder
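A minimal sketch of such a sampling layer. It draws the encoding as mean plus standard deviation times standard normal noise, which is the usual reparameterized way of sampling from a Gaussian; the function name `sample_encoding` and the example values are assumptions for illustration.

```python
import numpy as np

def sample_encoding(mu, sigma, rng):
    """Sample each component from N(mu_i, sigma_i^2) as mu + sigma * eps, eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)

# Hypothetical encoder outputs for one input (latent size n = 3).
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.1, 0.2, 0.3])

# Same input, same mu and sigma, but a different encoding on every pass.
z_a = sample_encoding(mu, sigma, rng)
z_b = sample_encoding(mu, sigma, rng)
print(z_a, z_b)
```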
Probabilistic nature of VAE
not only is a single point in latent space referring to a sample of that class, but all nearby points refer to the same as well
the decoder is exposed to a range of variations
results in a smooth latent space
Local smoothness is not enough
if the encoder learns to cluster samples apart, the decoder will be able to reconstruct the training data better
we want encodings all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.
ideally these smooth per-class distributions sit tightly next to each other; it is bad if they are far apart
solution:
- KL divergence in the loss function
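- for VAEs, the KL loss is equivalent to the sum of all the KL divergences between the latent component $X_i$ in $X$ and the standard normal, where $X_i \sim \mathcal{N}(\mu_i, \sigma_i^2)$
- KL loss term: $\frac{1}{2} \sum_{i=1}^{n} \left( \sigma_i^2 + \mu_i^2 - \log \sigma_i^2 - 1 \right)$
- we still minimize the reconstruction loss (i.e. keep the output similar to the input) as well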
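The KL loss term has a closed form for a diagonal Gaussian measured against the standard normal. The sketch below evaluates it; the function name and the example vectors are assumptions for illustration. The term is zero exactly when every mean is 0 and every standard deviation is 1.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """Sum over components of KL( N(mu_i, sigma_i^2) || N(0, 1) )."""
    return float(0.5 * np.sum(sigma ** 2 + mu ** 2 - np.log(sigma ** 2) - 1.0))

# An encoding that already matches the standard normal costs nothing.
print(kl_to_standard_normal(np.zeros(2), np.ones(2)))

# Shifting one mean away from the origin is penalized.
print(kl_to_standard_normal(np.array([1.0, 0.0]), np.ones(2)))
```

The total VAE loss would add this term to the reconstruction loss, usually with a weighting that is treated as a hyperparameter.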
Effect of KL loss term
encourages the encoder to distribute all encodings evenly around the center of the latent space
Optimizing the similarity and KL losses together results in a latent space which maintains the similarity of nearby encodings on the local scale via clustering, yet globally is very densely packed near the latent space origin.
Generative Adversarial Network (GAN)
Density estimation
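goal of probabilistic generative models: find a model $p_{model}(x)$ s.t. $p_{model}(x) \approx p_{data}(x)$
explicit: model the density function (distribution) directly
- e.g. Gaussian distributions, the VAE's latent computation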
implicit: GAN - just learn to generate samples from distribution
motivation of GAN
problem: want to sample from complex, high-dimensional training distribution. no direct way to do this
solution: sample from a simple distribution (e.g. random noise) and learn a transformation to the training distribution.
a system of two neural networks competing against each other in a zero-sum game framework
discriminator network: try to distinguish between real and fake images
generator network: try to fool the discriminator by generating real-looking images
Training GAN
$D(x)$ represents the probability that input $x$ came from the real data rather than the generator
$G(z)$ takes a random input noise vector $z$ and transforms it into an image
minibatch consists of
- examples $x$ from the real distribution $p_{data}$
- samples $z$ from a prior distribution $p_z$ (like Gaussian)
then, from the discriminator’s perspective, we would like to have
$D(x) = 1$ (real), $D(G(z)) = 0$ (fake)
but from the generator’s perspective, we would like to have
$D(G(z)) = 1$ (fool the discriminator)
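The two perspectives can be expressed as losses over a minibatch of discriminator outputs. The sketch below assumes the original log-based formulation of the GAN objective; the function names and the example probabilities are assumptions for illustration, and no networks are trained here.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Low when D outputs ~1 on real examples and ~0 on generated ones."""
    return float(-np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    """Low when D outputs ~1 on generated examples (the discriminator is fooled)."""
    return float(np.mean(np.log(1.0 - d_fake)))

# Hypothetical discriminator outputs on a minibatch of two real and two fake samples.
confident = discriminator_loss(np.array([0.9, 0.9]), np.array([0.1, 0.1]))
confused = discriminator_loss(np.array([0.5, 0.5]), np.array([0.5, 0.5]))

fooled = generator_loss(np.array([0.9, 0.9]))
caught = generator_loss(np.array([0.1, 0.1]))

print(confident, confused, fooled, caught)
```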
Loss function
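minimax game: $\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$
for a fixed generator the optimal discriminator is $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$, where $p_g$ is the distribution of generated samples, and it can be shown that this minimax game has a global optimum for $p_g = p_{data}$
in that case, the discriminator can’t distinguish the real from the fake, because $D^*(x) = \frac{1}{2}$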
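The optimum can be checked numerically on a toy discrete example, where the data and generator distributions are probability vectors over three outcomes. The distributions and function names below are assumptions for illustration. The best discriminator response for a fixed generator is the ratio of the data probability to the sum of data and generator probabilities, and when both distributions coincide it outputs 1/2 everywhere and the value equals $-\log 4$.

```python
import numpy as np

def value(D, p_data, p_g):
    """V(D, G) for discrete distributions: E_data[log D] + E_g[log(1 - D)]."""
    return float(np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D)))

def optimal_discriminator(p_data, p_g):
    """Best response of the discriminator to a fixed generator distribution."""
    return p_data / (p_data + p_g)

p_data = np.array([0.6, 0.3, 0.1])
p_g = np.array([0.2, 0.3, 0.5])

# Generator differs from the data: the discriminator can still tell them apart.
D_star = optimal_discriminator(p_data, p_g)
v_star = value(D_star, p_data, p_g)

# Generator matches the data: the best discriminator can only guess.
D_equal = optimal_discriminator(p_data, p_data)
v_equal = value(D_equal, p_data, p_data)

print(D_star, v_star, D_equal, v_equal)
```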
it works!
- $G$ has a reinforcement learning task
- it knows when it does well but it is not given a supervised signal
- RL is hard
- back propagation through $D$ provides $G$ with a supervised signal; the better $D$ is, the better this signal will be.
- Can’t describe optimum via a single loss
- $D$ is seldom fooled
Common problems in training GAN
hard to train!
non-convergence: the model parameters oscillate, destabilize and never converge
mode collapse: the generator collapses and produces only limited varieties of samples
symptom: the generator tends to produce only a single type of data
diminished gradient: the discriminator gets too successful, so the generator gradient vanishes
imbalance between the generator and discriminator, causing overfitting
highly sensitive to the hyperparameter selections
Some variations of GAN
Deep Convolutional GAN
aka DCGAN
a GAN that uses convolutional layers
Conditional GAN
aka CGAN
the original GAN generates data from random noise, but has no knowledge about class labels
CGAN aims to solve this by telling the generator to generate images of only one particular class, like a cat or dog
specifically, CGAN concatenates a one-hot class label vector to the random noise vector, so the generator's input carries the class information.
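The concatenation step can be sketched as below. The noise size of 100, the two-class labeling (0 = cat, 1 = dog) and the function names are assumptions for illustration; only the construction of the generator input is shown, not the generator itself.

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a one-hot vector with a 1 at position `label`."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def generator_input(z, label, num_classes):
    """Concatenate the noise vector with the one-hot class label."""
    return np.concatenate([z, one_hot(label, num_classes)])

rng = np.random.default_rng(0)
z = rng.standard_normal(100)                         # random noise vector
g_in = generator_input(z, label=1, num_classes=2)    # ask for class 1 (e.g. dog)

print(g_in.shape)
```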
Pix2Pix
CycleGAN - cycle consistency loss
ProGAN - progressive growing of GAN
WGAN - wasserstein GAN
SAGAN - self attention GAN
BigGAN
Summary
Autoencoders
- best for dimensionality reduction / feature learning.
- Not good at generating new samples due to discontinuities in latent space
variational autoencoders VAE
- useful latent representation. better at generating new samples, but current sample quality not the best.
generative adversarial networks GAN
- game-theoretic approach, best samples
- but can be tricky and unstable to train