Unsupervised learning

regression vs. classification

why unsupervised learning?

  • only a small fraction of data is labeled
  • pre-train an unsupervised model
  • anomaly detection - learn what data distribution typically looks like
  • cheaper than supervised learning

Generative models

a generative model is a class of statistical models that contrasts with discriminative models.

informally

  • discriminative model - a classifier: $\text{data}\sim p_{data}(x)$
  • generative model - generates new data instances: $\text{generated samples}\sim p_{model}(x)$

a generative model includes the distribution of the data itself and tells you how likely a given example is.

for example, models that predict the next word in a sequence are typically generative models.

many kinds

Generative models are hard

  • have to model more - a generative model tries to model how data is placed throughout the space
  • a generative model for images has to capture difficult correlations (e.g., between nearby pixels).

Autoencoder

Traditional Autoencoder (AE)

an AE is a neural network that is trained to attempt to copy its input to its output

given data $x$ (no labels) we would like to learn functions $f$ (encoder) and $g$ (decoder) s.t.:

$$h=f(x)\\ \hat x=g(h)\\ \hat x=g(f(x))=x$$

where $g\circ f$ is an approximation of the identity function

$h=f(x)=s(Wx+b)$ is the latent or hidden representation (the code), and $s$ is a non-linearity function.

$\hat x=g(h)=s(W^Th+b')$ is $x$'s reconstruction.

Motivation

  • learning the identity function everywhere is not especially useful
  • autoencoders are restricted in ways that allow them to copy only approximately.
  • traditionally an autoencoder is used for dimensionality reduction and feature learning

Training an autoencoder

the training set is unlabeled.

using SGD

loss function: L2 Norm
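A minimal sketch of this setup in PyTorch; the layer sizes, the single-linear-layer encoder/decoder, and the stand-in `data_loader` are illustrative assumptions, not part of the notes:

```python
import torch
import torch.nn as nn

# Undercomplete AE: 784-d input -> 64-d code (sizes are illustrative).
class AE(nn.Module):
    def __init__(self, d_in=784, d_hidden=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Sigmoid())  # h = s(Wx + b)
        self.decoder = nn.Sequential(nn.Linear(d_hidden, d_in), nn.Sigmoid())  # x_hat = s(W'h + b')

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()  # L2 reconstruction loss

data_loader = [torch.rand(32, 784) for _ in range(10)]  # stand-in for unlabeled minibatches

for x in data_loader:
    x_hat = model(x)
    loss = loss_fn(x_hat, x)  # compare the reconstruction with the input itself (no labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
```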

undercomplete vs overcomplete AE

an AE is undercomplete if the hidden representation has smaller dimensionality than the input

  • learns good features for the training distribution
  • but performs poorly on other types of input

an AE is overcomplete if the hidden representation has larger dimensionality than the input

  • no guarantee that the hidden units will extract meaningful structure
  • a higher-dimensional code helps model a more complex distribution - good for training a linear classifier on top

Stacked AE - deep

deep is better

  • more representational power
  • easier for similar datapoints to group together.
  • better separation of classes

Latent space

the space in which the encoding exists is called the latent space or the hidden space

if AE is undercomplete, the compressed latent representation hh is often called the bottleneck

what if we try to interpolate between two points in latent space? $h_i=\alpha h_1+(1-\alpha)h_2$

We find that as $\alpha$ varies between 0 and 1, the decoded output of $h_i$ gradually morphs between what $h_2$ represents ($\alpha=0$) and what $h_1$ represents ($\alpha=1$).
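A sketch of that interpolation, reusing the hypothetical `model` from the AE sketch above; the two inputs are random stand-ins for real data points:

```python
import torch

x1, x2 = torch.rand(1, 784), torch.rand(1, 784)    # stand-ins for two real inputs
with torch.no_grad():
    h1, h2 = model.encoder(x1), model.encoder(x2)
    for alpha in torch.linspace(0.0, 1.0, steps=5):
        h_i = alpha * h1 + (1 - alpha) * h2         # h_i = alpha*h1 + (1-alpha)*h2
        x_i = model.decoder(h_i)                    # decoded output morphs between the two inputs
```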

Problem with simple autoencoders

  • the latent space may not be continuous
  • the model easily degenerates into simply replicating (memorizing) its inputs

solution: variational autoencoders

Regularization

motivation: learn meaningful features without altering the code’s dimensions

solution: imposing other constraints on the network

Sparse AE

suppose the AE has learned a hidden representation whose features simply memorize the training examples.

It has just remembered the data, not learned it. This will not generalize well to unseen data

the image is decomposed into a combination of sparse features:

  • if our learned features are sparse, we can generalize better
  • sparse means: most activations are 0 except a few that are close to 1.

Let $h_j^{(l_{Bn})}$ denote the activation of the $j$-th hidden unit of the autoencoder.

Let $h_j^{(l_{Bn})}(x)$ be the activation of this specific node on a given input $x$.

Let $\hat\rho_j=\frac{1}{n}\sum_{i=1}^n\big[h_j^{(l_{Bn})}(x^{(i)})\big]$ be the average activation of hidden unit $j$.

we would like to enforce the constraint $\hat\rho_j=\rho$, where $\rho$ is a sparsity parameter, typically small.

in other words, we want the average activation of each neuron $j$ to be close to $\rho$.

we penalize $\hat\rho_j$ for deviating from $\rho$

penalty term: $\sum_{j=1}^{S_{Bn}}KL(\rho\,\|\,\hat\rho_j)$

$KL$ is the Kullback–Leibler divergence and $S_{Bn}$ is the number of units in the hidden layer

$KL(\rho\,\|\,\hat\rho_j)=0$ if $\hat\rho_j=\rho$

overall loss function: $J_S(W)=J(W)+\beta\sum_{j=1}^{S_{Bn}}KL(\rho\,\|\,\hat\rho_j)$

$J(W)=\frac{1}{2}\sum_{i=1}^n\|\hat x^{(i)}-x^{(i)}\|^2$

we need to know $\hat\rho_j$ beforehand, so we have to compute a forward pass over the whole training set before applying the penalty.
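A sketch of the penalty term; the expanded Bernoulli form $KL(\rho\,\|\,\hat\rho_j)=\rho\log\frac{\rho}{\hat\rho_j}+(1-\rho)\log\frac{1-\rho}{1-\hat\rho_j}$ is assumed from the standard sparse-AE formulation, and the `rho`/`beta` values are illustrative:

```python
import torch

def sparsity_penalty(h, rho=0.05, beta=1.0):
    """h: hidden activations in (0, 1), shape (n, S_Bn)."""
    rho_hat = h.mean(dim=0)                     # average activation of each hidden unit j
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()                      # beta * sum_j KL(rho || rho_hat_j)

# overall loss: J_S(W) = J(W) + sparsity_penalty(h)
# (in practice rho_hat is often estimated per minibatch rather than over the full training set)
```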

Recap

cross-entropy: $H(p,q)=-\sum_{i=1}^n p_i\log_2(q_i)$

KL divergence: $D_{KL}(p\,\|\,q)=H(p,q)-H(p)=-\sum_{i=1}^n p_i\log_2 q_i-\left(-\sum_{i=1}^n p_i\log_2 p_i\right)=-\sum_{i=1}^n p_i\log_2\frac{q_i}{p_i}$
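A tiny numeric check of this identity with an arbitrary pair of made-up distributions:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([0.25, 0.5, 0.25])

H_p  = -np.sum(p * np.log2(p))       # entropy H(p)
H_pq = -np.sum(p * np.log2(q))       # cross-entropy H(p, q)
D_kl =  np.sum(p * np.log2(p / q))   # KL(p || q)

assert np.isclose(D_kl, H_pq - H_p)  # KL(p||q) = H(p,q) - H(p)
```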

Denoising AE

aim to encode the input to learn and describe latent attributes of the data

try to undo the effect of a corruption process stochastically applied to the input

the model isn't able to simply develop a mapping that memorizes the training data, because the input and target output are not the same

a more robust model: add noise to the input, while the target output is still the original image

$\hat x=g(f(\widetilde x))=x$ where $\widetilde x$ is a copy of $x$ that has been corrupted by some noise process

examples of noise

random assignment of a subset of inputs to 0 with probability $v$.

additive Gaussian noise.

training

the reconstruction $\hat x$ is computed from the corrupted input $\widetilde x$

the loss function compares the reconstruction $\hat x$ with the noiseless $x$
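A sketch of this training loop, reusing the hypothetical `model`, `opt`, and `data_loader` from the earlier AE sketch; the corruption parameters are illustrative:

```python
import torch

def corrupt(x, v=0.3, sigma=0.0):
    mask = (torch.rand_like(x) > v).float()         # zero out each input with probability v
    return x * mask + sigma * torch.randn_like(x)   # plus optional additive Gaussian noise

for x in data_loader:
    x_tilde = corrupt(x)               # corrupted copy of x
    x_hat = model(x_tilde)             # reconstruction computed from the corrupted input
    loss = ((x_hat - x) ** 2).mean()   # loss compares x_hat with the noiseless x
    opt.zero_grad()
    loss.backward()
    opt.step()
```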

what does it learn?

the correlations among $x$'s features

based on those correlations we can learn a model that is more robust (less prone to changes in the input)

we are forcing the hidden layer to learn a generalized structure of the data

The DAE is trained to map a corrupted data point back to the data manifold.

Going convolutional

we can replace the fully connected layers with convolutional layers

Variational autoencoders

very similar to the regular autoencoder

unique property: the latent space is continuous (by design), allowing random sampling and interpolation

how?

by making its encoder output not a single encoding vector of size $n$, but rather two vectors of size $n$:

a vector of means $\mu$, and another vector of standard deviations $\sigma$

probabilistic nature using a sampling layer

Random sampling layer

the stochastic generation means that, even for the same input, while the mean and standard deviation remain the same, the actual encoding will vary somewhat on every single pass, simply due to sampling.

The output becomes a probability distribution rather than a single deterministic value.

  • the hidden layer outputs the parameters of a vector of random variables of length $n$
  • the $i$-th elements of $\mu$ and $\sigma$ represent the mean and standard deviation of the $i$-th random variable $x_i$, from which we sample to obtain the sampled encoding, which we pass onward to the decoder (a sketch of such a layer follows)
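A sketch of a sampling layer using the reparameterization trick; as is common in implementations, it takes the log-variance rather than $\sigma$ directly, which is an assumption rather than something stated in the notes:

```python
import torch
import torch.nn as nn

class SamplingLayer(nn.Module):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    def forward(self, mu, log_var):
        eps = torch.randn_like(mu)                   # fresh noise on every pass, so the
        return mu + torch.exp(0.5 * log_var) * eps   # encoding varies even for the same input

# usage sketch: mu, log_var = encoder(x); z = SamplingLayer()(mu, log_var); x_hat = decoder(z)
```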

Probabilistic nature of VAE

not only does a single point in latent space refer to a sample of that class, but all nearby points refer to the same class as well

the decoder is exposed to a range of variations

results in a smooth latent space

Local smoothness is not enough

  • if the encoder learns to place clusters of samples far apart, the decoder will be able to reconstruct the training data better

  • we want encodings all of which are as close as possible to each other while still being distinct, allowing smooth interpolation, and enabling the construction of new samples.

    ideally these smooth class distributions sit close together; having them far apart is undesirable

solution:

  • KL divergence in the loss function
  • for VAEs, the KL loss is equivalent to the sum of the KL divergences between each latent component $Z_i\sim N(\mu_i,\sigma_i^2)$ of $Z$ and the standard normal, i.e. $\mu_i=0,\sigma_i^2=1$
  • KL loss term: $\frac{1}{2}\sum_{i=1}^n\left(\sigma_i^2+\mu_i^2-\log\sigma_i^2-1\right)$
  • we still minimize similarity as well: $\frac{1}{2}\sum\|\hat x-x\|^2$ (see the sketch below)
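A sketch of the combined VAE loss under the same log-variance convention as the sampling-layer sketch above:

```python
import torch

def vae_loss(x, x_hat, mu, log_var):
    # similarity term: 1/2 * sum ||x_hat - x||^2
    recon = 0.5 * ((x_hat - x) ** 2).sum()
    # KL term: sum_i KL( N(mu_i, sigma_i^2) || N(0, 1) )
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - log_var - 1).sum()
    return recon + kl
```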

Effect of KL loss term

encourages the encoder to distribute all encodings evenly around the center of the latent space

Optimizing similarity and the KL loss together results in a latent space that maintains the similarity of nearby encodings on the local scale (via clustering), yet globally is very densely packed near the latent space origin.

Generative Adversarial Network (GAN)

Density estimation

goal of probabilistic generative models: find a model s.t. $p_{data}(x)=p_{model}(x)$

explicitly model density function (distribution)

  • e.g., Gaussian distributions, the latent-variable computation in a VAE

implicit: GAN - just learn to generate samples from distribution

motivation of GAN

problem: we want to sample from a complex, high-dimensional training distribution, and there is no direct way to do this

solution: sample from a simple distribution (e.g., random noise) and learn a transformation to the training distribution.

a system of two neural networks competing against each other in a zero-sum game framework

discriminator network: tries to distinguish between real and fake images

generator network: tries to fool the discriminator by generating real-looking images

Training GAN

$D(x)$ represents the probability that input $x$ came from the real data rather than the generator

$G(z)$ takes a random input noise vector $z$ and transforms it into an image

minibatch consists of

  • $\{x^{(i)}\}_{i=1}^m$ examples from the real data distribution
  • $\{z^{(i)}\}_{i=1}^m$ samples from a prior distribution (e.g., Gaussian)

then, from the discriminator’s perspective, we would like to have

$D(x^{(i)})=1$ (real), $D(G(z^{(i)}))=0$ (fake) $\Rightarrow\max_{W_D}\sum_{i=1}^m\log(D(x^{(i)}))+\log(1-D(G(z^{(i)})))$

but from the generator's perspective, we would like to have

$D(G(z^{(i)}))=1$ (fool the discriminator) $\Rightarrow\min_{W_G}\sum_{i=1}^m\log(1-D(G(z^{(i)})))$
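A sketch of one alternating training step following the two objectives above; `D`, `G`, their optimizers, and `z_dim` are assumed to exist, and practical implementations often replace minimizing $\log(1-D(G(z)))$ with maximizing $\log D(G(z))$ to avoid early vanishing gradients:

```python
import torch

def gan_step(x_real, D, G, opt_D, opt_G, z_dim=100):
    m = x_real.size(0)

    # discriminator update: push D(x) -> 1 and D(G(z)) -> 0
    z = torch.randn(m, z_dim)
    x_fake = G(z).detach()   # do not backprop into G here
    loss_D = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake)).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # generator update: push D(G(z)) -> 1, i.e. minimize log(1 - D(G(z)))
    z = torch.randn(m, z_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```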

Loss function

minimax game

$$\min_{W_G}\max_{W_D}\ \mathbb E_{x\sim p_{data}}\log(D(x))+\mathbb E_{z\sim p_z}\log(1-D(G(z)))=\int p_{data}(x)\log(D(x))\,\text{d}x+\int p_z(z)\log(1-D(G(z)))\,\text{d}z$$

for a fixed generator, it can be shown that this minimax game has a global optimum at $p_{data}(x)=p_{generator}(x)$

in that case, the discriminator can’t distinguish the real from the fake, because

$D^*(x)=\Pr(\text{real}\mid x)=\frac{p_{data}(x)}{p_{data}(x)+p_{generator}(x)}=\frac{1}{2}$

it works!

  • $G$ has a reinforcement learning task
    • it knows when it does well but it is not given a supervised signal
    • RL is hard
    • backpropagation through $D$ provides $G$ with a supervised signal; the better $D$ is, the better this signal will be
  • we can't describe the optimum via a single loss
  • $D$ is seldom fooled

Common problems in training GAN

hard to train!

  • non-convergence: the model parameters oscillate, destabilize and never converge

  • mode collapse: the generator collapses and produces only a limited variety of samples

    symptom: the generator tends to generate only a single type of output

  • diminished gradient: the discriminator gets too successful, so the generator gradient vanishes

  • imbalance between the generator and discriminator, causing overfitting

  • highly sensitive to hyperparameter selection

Some variation of GAN

Deep Convolutional GAN

aka DCGAN

a GAN built with convolutional layers

Conditional GAN

aka CGAN

the original GAN generates data from random noise, but has no knowledge about class labels

CGAN aims to solve this by telling the generator to generate images of only one particular class, like a cat or dog

specifically, CGAN concatenates a one-hot label vector $y$ to the random noise vector $z$ at the generator's input (a sketch follows).
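A sketch of that conditioning step; the generator `G` is assumed to exist, and the class count, `z_dim`, and the stand-in `labels` batch are illustrative:

```python
import torch
import torch.nn.functional as F

z_dim = 100
labels = torch.randint(0, 10, (16,))             # stand-in batch of class labels
z = torch.randn(labels.size(0), z_dim)           # random noise, one vector per label
y = F.one_hot(labels, num_classes=10).float()    # class labels as one-hot vectors
x_fake = G(torch.cat([z, y], dim=1))             # generator input has size z_dim + 10
```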

Pix2Pix

CycleGAN - cycle consistency loss

ProGAN - progressive growing of GAN

WGAN - Wasserstein GAN

SAGAN - self attention GAN

BigGAN

Summary

Autoencoders (AE)

  • best for dimensionality reduction / feature learning.
  • Not good at generating new samples due to discontinuities in latent space

variational autoencoders VAE

  • useful latent representation. better at generating new samples, but current sample quality not the best.

generative adversarial networks GAN

  • game-theoretic approach, best samples
  • but can be tricky and unstable to train