
Disadvantage of a fully connected MLP: it is difficult to detect local features.

In a fully connected neural network, we need a separate neuron to detect the same feature at each position where it may appear, which wastes network parameters.

Solution: using convolution to identify local features


  • fewer trainable params (sparse interactions)
  • Same params reused multiple times (param sharing)
  • translation equivariance (often loosely called translation invariance)
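
To make the parameter savings concrete, here is a small counting sketch. The sizes (a 32*32 single-channel image, one output unit per pixel, a 3*3 kernel) are assumptions chosen only for illustration.

```python
# Hypothetical sizes, chosen only for illustration.
height, width = 32, 32        # single-channel input image
hidden_units = 32 * 32        # one output unit per pixel position
kernel_size = 3               # 3x3 convolution kernel

# Fully connected: every output unit has its own weight for every input pixel.
fc_params = (height * width) * hidden_units + hidden_units  # weights + biases

# Convolution: one small kernel (plus one bias) is reused at every position.
conv_params = kernel_size * kernel_size + 1

print(fc_params)    # 1049600
print(conv_params)  # 10
```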

For continuous functions:

$$(f*g)(t)=\int_{-\infty}^\infty f(\tau)g(t-\tau)\,d\tau=\int_{-\infty}^\infty f(t-\tau)g(\tau)\,d\tau$$

For discrete sequences:

$$(f*g)[t]=\sum_{k=-\infty}^\infty f[k]g[t-k]=\sum_{k=-\infty}^\infty f[t-k]g[k]$$
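
The discrete sum can be checked numerically. The sketch below assumes finite sequences that are zero outside their support, implements the sum directly, and compares it with NumPy's `np.convolve`; the helper name `conv1d` is made up for this example.

```python
import numpy as np

def conv1d(f, g):
    """Full discrete convolution of two finite sequences (zero outside their support)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for t in range(len(out)):
        for k in range(len(f)):
            if 0 <= t - k < len(g):        # g[t - k] is zero outside this range
                out[t] += f[k] * g[t - k]
    return out

f = [1, 2, 3]
g = [0, 1, 0.5]
direct = conv1d(f, g)     # sum over f[k] g[t-k]
swapped = conv1d(g, f)    # sum over f[t-k] g[k], i.e. the roles exchanged

print(direct)                                  # [0.  1.  2.5 4.  1.5]
print(np.allclose(direct, swapped))            # True: convolution is commutative
print(np.allclose(direct, np.convolve(f, g)))  # True
```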

Convolution in images

$g(x)$ is referred to as the filter or kernel.


Equivalently, each output value is the inner product between the filter rotated by 180 degrees and the input patch under it.
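
A minimal NumPy sketch of this view of 2D convolution, assuming no padding and stride 1. The helper names `correlate2d_valid` and `convolve2d_valid` are made up for this example.

```python
import numpy as np

def correlate2d_valid(image, kernel):
    """Slide the kernel over the image and take an inner product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def convolve2d_valid(image, kernel):
    """True convolution: rotate the kernel by 180 degrees, then take inner products."""
    return correlate2d_valid(image, np.rot90(kernel, 2))

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 2.0],
                   [3.0, 4.0]])

conv_out = convolve2d_valid(image, kernel)
print(conv_out.shape)   # (3, 3)
print(conv_out[0, 0])   # 16.0 = 0*4 + 1*3 + 4*2 + 5*1
```

For a symmetric kernel the rotation changes nothing, so convolution and plain sliding inner products give the same result.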


Gaussian filter: acts as a smoothing filter; it makes the image smoother.

Sobel filter: acts as an edge detector.
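
The two filters can be tried on a toy image. This sketch assumes the common 3*3 binomial approximation of a Gaussian kernel and the standard 3*3 Sobel kernel for horizontal intensity changes; the test image and the helper `filter2d_valid` are made up for this example.

```python
import numpy as np

def filter2d_valid(image, kernel):
    """Convolve without padding (kernel rotated by 180 degrees, then slid over the image)."""
    flipped = np.rot90(kernel, 2)
    kh, kw = flipped.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# 3x3 binomial approximation of a Gaussian kernel; the weights sum to 1.
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]], dtype=float) / 16.0

# Sobel kernel that responds to horizontal intensity changes (vertical edges).
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Toy image: dark left half, bright right half (a vertical step edge).
image = np.zeros((5, 6))
image[:, 3:] = 1.0

smoothed = filter2d_valid(image, gaussian)
edges = filter2d_valid(image, sobel_x)

print(smoothed[0])       # [0.   0.25 0.75 1.  ]  the step becomes a gradual ramp
print(np.abs(edges[0]))  # [0. 4. 4. 0.]          response only around the edge
```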

Conv in NN

each output neuron is connected only to the pixels in a certain region of the input

all neurons in the same feature map share the same weights


  • sparse interactions
    • input images have many pixels, but we detect small meaningful features with kernels that occupy only a few pixels.
    • reduces the memory requirements of the model, improves statistical efficiency, and requires fewer operations to compute the output
  • parameter sharing
    • aka tied weights
    • the value of the weight applied to one input is tied to the value of a weight applied elsewhere
    • learn a single set of parameters that works for all locations
  • equivariant representation
    • equivariance to translation: if the input shifts in some way, the output shifts in the same way.
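
Translation equivariance can be verified on a 1D signal: shifting the input and then convolving gives the same result as convolving and then shifting the output. The signal, kernel, and shift below are arbitrary choices for illustration; the trailing zeros in the signal make sure `np.roll` does not wrap any non-zero values around.

```python
import numpy as np

signal = np.array([0, 0, 1, 2, 1, 0, 0, 0, 0, 0], dtype=float)
kernel = np.array([1, 0, -1], dtype=float)

shift = 3
shifted_signal = np.roll(signal, shift)   # move the pattern 3 steps to the right

out = np.convolve(signal, kernel, mode="full")
out_shifted = np.convolve(shifted_signal, kernel, mode="full")

# The output for the shifted input is the original output shifted by the same amount.
print(np.allclose(np.roll(out, shift), out_shifted))  # True
```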

Architecture of CNN

The layers of a CNN are arranged in 3 dimensions: height*width*channels

usually use local features, i.e. locally-connected

output is the dot product between the filter $w$ and a small chunk $x$ of the image, plus a bias: $w^Tx+b$


  • local in space
  • full in depth (3 RGB channels)
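
A single output value can be computed as follows. The image size (32*32*3), the filter size (5*5*3), the bias, and the patch position are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))   # height x width x channels (RGB)
w = rng.standard_normal((5, 5, 3))         # filter: local in space, full in depth
b = 0.1                                    # bias

# Small chunk of the image under the filter, with its top-left corner at (row, col).
row, col = 10, 4
x = image[row:row + 5, col:col + 5, :]

# Flatten both to vectors so the output is literally w^T x + b.
activation = w.ravel() @ x.ravel() + b

print(x.shape)   # (5, 5, 3)
print(w.size)    # 75 weights per filter
```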

Replication at the same area: usually we use many filters

if we have n filters, the output at each location of the input image is a 1*1*n tensor.

think of output as a multi-dimensional image with n channels: each channel corresponds to a feature map or activation map.
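
The sketch below stacks the responses of n filters into an output with n channels. It assumes stride 1, no padding, and the deep-learning convention of not rotating the filter; the function name `conv_layer` and all sizes are made up for this example.

```python
import numpy as np

def conv_layer(image, filters, biases):
    """Apply n filters (each F x F x D) to an H x W x D image, stride 1, no padding."""
    n, f, _, _ = filters.shape
    out_h = image.shape[0] - f + 1
    out_w = image.shape[1] - f + 1
    out = np.zeros((out_h, out_w, n))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + f, j:j + f, :]
            # One dot product per filter gives the 1 x 1 x n vector at this location.
            out[i, j, :] = np.tensordot(filters, patch, axes=3) + biases
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32, 3))
filters = rng.standard_normal((6, 5, 5, 3))   # n = 6 filters of size 5 x 5 x 3
biases = np.zeros(6)

feature_maps = conv_layer(image, filters, biases)
print(feature_maps.shape)        # (28, 28, 6): one feature map per filter
print(feature_maps[0, 0].shape)  # (6,): the 1 x 1 x n output at one location
```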

note that the feature maps get smaller and smaller as we move deeper into the network.

receptive field: the region of the input that a given neuron is connected to

Param sharing

  • all positions within a feature map share the same weight/filter params

Stride

output size: $\frac{N-W}{\text{stride}}+1$, where $N$ is the input size and $W$ is the filter size

zero-pad: add zeros at the border of the image

Pad: $\frac{W-1}{2}$ zeros on each side keep the output the same size as the input (at stride 1)
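
A small helper makes the stride and padding arithmetic easy to check. It assumes the usual extension of the formula to padding, (N + 2*pad - W)/stride + 1; the function name and the 7/3 sizes are only for illustration.

```python
def conv_output_size(n, w, stride=1, pad=0):
    """Output size along one dimension for input size n and filter size w."""
    span = n + 2 * pad - w
    if span % stride != 0:
        raise ValueError("filter does not fit the input evenly for this stride")
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))                    # 5
print(conv_output_size(7, 3, stride=2))                    # 3
print(conv_output_size(7, 3, stride=1, pad=(3 - 1) // 2))  # 7: size is preserved
```

With stride 3 the same 7-wide input and 3-wide filter do not fit, and the helper raises an error instead of silently rounding.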

summary of conv layer

input $W_1\times H_1\times D_1$

num of filters $K$, spatial extent $F$, stride $S$, zero padding $P$

output $W_2\times H_2\times D_2$
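
The notes do not spell out how the output volume follows from the hyperparameters. The sketch below assumes the commonly used relations $W_2=(W_1-F+2P)/S+1$, $H_2=(H_1-F+2P)/S+1$, $D_2=K$, with $F\cdot F\cdot D_1$ weights per filter plus one bias per filter. The function name and the example numbers are made up.

```python
def conv_layer_summary(w1, h1, d1, k, f, s, p):
    """Return the output volume and the number of trainable parameters of a conv layer."""
    w2 = (w1 - f + 2 * p) // s + 1
    h2 = (h1 - f + 2 * p) // s + 1
    d2 = k                              # one output channel per filter
    weights = f * f * d1 * k            # each filter spans the full input depth
    biases = k                          # one bias per filter
    return (w2, h2, d2), weights + biases

shape, n_params = conv_layer_summary(w1=32, h1=32, d1=3, k=10, f=5, s=1, p=2)
print(shape)     # (32, 32, 10)
print(n_params)  # 760
```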




Activation layers

ReLU by default


the derivative is trivial to calculate, which speeds up computation.

avoids problems with small (vanishing) gradients, which speeds up the optimization.
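
For reference, ReLU is max(0, z), and its derivative is 1 for positive inputs and 0 for negative ones. The value of the derivative at exactly z = 0 is a convention; this sketch assumes 0.

```python
import numpy as np

def relu(z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, z)

def relu_grad(z):
    """1 where z > 0, else 0 (the value at z == 0 is a convention; 0 is used here)."""
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(z))  # [0. 0. 0. 1. 1.]
```

For positive inputs the gradient is exactly 1, so it does not shrink as it is passed back through many layers.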

other layer types in CNN

Pool layer

conv layers are often followed by pool layers

reduce the spatial size of the feature maps (and hence the number of weights in later layers) by keeping only the most important information.

max-pooling: choose the maximum

  • adds some spatial invariance to the exact feature locations

average-pooling: take the average

window size is the pooling range.
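
Both pooling variants on a small feature map, assuming a 2*2 window at stride 2. The function name `pool2d` and the input values are made up for this example.

```python
import numpy as np

def pool2d(feature_map, window=2, stride=2, mode="max"):
    """Pool a 2D feature map with a square window; mode is "max" or "mean"."""
    reduce = np.max if mode == "max" else np.mean
    out_h = (feature_map.shape[0] - window) // stride + 1
    out_w = (feature_map.shape[1] - window) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = reduce(feature_map[r:r + window, c:c + window])
    return out

fm = np.array([[1, 1, 2, 4],
               [5, 6, 7, 8],
               [3, 2, 1, 0],
               [1, 2, 3, 4]], dtype=float)

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="mean")
print(max_pooled)  # [[6. 8.] [3. 4.]]
print(avg_pooled)  # [[3.25 5.25] [2.   2.  ]]
```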

fully connected layer

the same as layers in MLP

aka dense/linear layer

often used at the end of CNNs

output of conv is 3D tensor. reshape/flatten it to a vector as input of fully connected layer.
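
The flatten step in NumPy, assuming a 5*5*16 conv output and 10 output units (both sizes are illustrative, with random weights standing in for trained ones):

```python
import numpy as np

rng = np.random.default_rng(0)
conv_out = rng.standard_normal((5, 5, 16))   # 3D tensor from the last conv/pool layer

flat = conv_out.reshape(-1)                  # flatten to a vector of 5*5*16 = 400 values

# Fully connected layer: one weight per (output unit, input value) pair.
W = rng.standard_normal((10, flat.size))
b = np.zeros(10)
scores = W @ flat + b

print(flat.shape)    # (400,)
print(scores.shape)  # (10,)
```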

1*1 convolution

a 1*1*m convolution can mimic a fully connected layer applied across the m channels at each position
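
The equivalence can be checked directly: a 1*1 convolution over m channels applies the same weight matrix to the channel vector of every pixel. The sizes (4*4 spatial, m = 8 input channels, 3 output channels) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.standard_normal((4, 4, 8))   # H x W x m
weights = rng.standard_normal((8, 3))           # m input channels -> 3 output channels
bias = rng.standard_normal(3)

# 1x1 convolution: mixes the channels at each pixel independently of its neighbours.
conv_1x1 = feature_maps @ weights + bias        # shape 4 x 4 x 3

# The same weights used as a fully connected layer on one pixel's channel vector.
pixel = feature_maps[2, 1, :]
fc_out = weights.T @ pixel + bias

print(conv_1x1.shape)                      # (4, 4, 3)
print(np.allclose(conv_1x1[2, 1], fc_out)) # True
```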

Example of CNNs


  • LeNet-5 (1998), one of the first CNNs

  • conv filter 5*5, stride = 1

  • subsampling (pooling) layers were 2*2 at stride 2

  • architecture: conv - pool - conv - pool - fc - fc
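
The spatial sizes through the conv and pool stages can be traced with the output-size formula. The 32*32 input is an assumption (the notes do not give the input size); no padding is used.

```python
def conv_size(n, f, stride=1):
    """Output size along one dimension, no padding."""
    return (n - f) // stride + 1

size = 32                      # assumed input width/height
trace = [("input", size)]
layers = [("conv1", 5, 1), ("pool1", 2, 2), ("conv2", 5, 1), ("pool2", 2, 2)]
for name, f, stride in layers:
    size = conv_size(size, f, stride)
    trace.append((name, size))

for name, s in trace:
    print(name, s)
# input 32
# conv1 28
# pool1 14
# conv2 10
# pool2 5
```

After the second pooling stage the 5*5 feature maps are flattened and passed to the fully connected layers.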



Training CNN


need large training data sets to avoid overfitting

A possible solution: transfer learning

  • use a pre-trained encoder (trained on ImageNet)
  • fine-tune the weights of the decoder
  • training time: typically less than 1 hour


Object classification and image retrieval

Object detection and segmentation

Human pose estimation

super resolution


  1. CNNs are based on sparse interaction and parameter sharing.
  2. note: training is time-consuming and prone to overfitting
  3. use transfer learning instead of training from scratch