Computer Vision W38: Convolutional Neural Networks
Convolution
disadvantage of a fully connected MLP: it is difficult to detect local features.
In a fully connected network, we would need a separate neuron to detect the same feature at every position where it may appear: a waste of network parameters.
Solution: use convolution to identify local features.
Advantages:
- fewer trainable params (sparse interactions)
- Same params reused multiple times (param sharing)
- equivariance to translation (the same feature is detected wherever it appears)
for continuous signals: $(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$
for discrete signals: $(f * g)(n) = \sum_{m=-\infty}^{\infty} f(m)\, g(n - m)$
Convolution in images
The weight function $w$ is referred to as the filter or kernel.
The output at each location is the inner product between the filter rotated by 180 degrees and the corresponding input patch.
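A minimal numpy sketch of this definition, assuming a single-channel image, stride 1, and no padding (output covers only the "valid" region); `conv2d` is a hypothetical helper name:

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D convolution: rotate the kernel 180 degrees, then slide it."""
    flipped = np.flip(kernel)  # flipping both axes = 180-degree rotation
    kh, kw = flipped.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * flipped)  # inner product per location
    return out
```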
Example
Gaussian filter: acts as a smoothing filter, making the image smoother.
Sobel filter: acts as an edge detector.
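For concreteness, these are the standard 3*3 versions of the two filters (values are the usual textbook kernels; the 1/16 factor normalizes the Gaussian weights):

```python
import numpy as np

# Gaussian filter: center-weighted average -> smooths/blurs the image.
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]]) / 16.0

# Sobel filters: approximate horizontal/vertical gradients -> detect edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
sobel_y = sobel_x.T
```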
Conv in NN
each output neuron is connected only to pixels in a certain region of the input
all neurons computing the same feature map share the same weights
motivation:
- sparse interactions
- the input image has many pixels, but we detect small, meaningful features with kernels that occupy only a few pixels
- this reduces the memory requirements of the model, improves statistical efficiency, and requires fewer operations to compute the output
- parameter sharing
- aka tied weights
- the weight applied to one input is tied to the value of a weight applied elsewhere
- we learn a single set of weights that works for all locations
- equi-variant representation
- equivariance to translation: if the input shifts, the output shifts the same way (a small numerical check follows this list)
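A small numerical check of translation equivariance, assuming scipy is available; circular padding (mode='wrap') is used so the shift commutes exactly, whereas zero-padded convolutions are equivariant only up to border effects:

```python
import numpy as np
from scipy.ndimage import convolve

rng = np.random.default_rng(0)
img = rng.random((8, 8))
kernel = rng.random((3, 3))

# convolve(shift(img)) == shift(convolve(img)): shifting the input
# shifts the output by the same amount.
a = convolve(np.roll(img, 2, axis=0), kernel, mode='wrap')
b = np.roll(convolve(img, kernel, mode='wrap'), 2, axis=0)
assert np.allclose(a, b)
```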
Architecture of CNN
CNN layers are 3-dimensional: height*width*channels
usually use local features, i.e., locally connected
output is the dot product between the filter $w$ and a small chunk $x$ of the input, plus a bias: $w^\top x + b$
connectivity:
- local in space
- full in depth (e.g., all 3 RGB channels at the input)
Replication over the same area: usually we apply many filters in parallel
with n filters, the output at each location of the input image is a 1*1*n tensor
think of the output as a multi-dimensional image with n channels: each channel corresponds to a feature map (activation map).
note that the feature maps get smaller and smaller as we move deeper into the network.
receptive field: the region of the input that can influence a given output neuron
Param sharing
- all positions within a CNN feature map share the same weight/filter parameters
Stride: the step size with which the filter slides across the input
output size: (W − F + 2P) / S + 1, where W is the input size, F the filter's spatial extent, S the stride, and P the zero padding
zero-pad: add zeros at the border of the image, e.g., so the output keeps the same spatial size as the input
summary of conv layer:
- input: a volume of size W1*H1*D1
- hyperparameters: number of filters K, spatial extent F, stride S, zero padding P
- output: a volume of size W2*H2*K, where W2 = (W1 − F + 2P)/S + 1 and H2 = (H1 − F + 2P)/S + 1
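A tiny sketch of the output-size formula above (`conv_output_size` is a hypothetical helper name):

```python
def conv_output_size(w_in, f, s, p):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    assert (w_in - f + 2 * p) % s == 0, "filter does not tile the input evenly"
    return (w_in - f + 2 * p) // s + 1

# Example: 32-pixel input, 5*5 filter, stride 1, padding 2 -> size preserved.
print(conv_output_size(32, 5, 1, 2))  # 32
```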
Activation layers
ReLU by default
why?
- the derivative is trivial to calculate, which speeds up computation
- it avoids problems with small gradients, which speeds up the optimization
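A minimal numpy sketch of ReLU and its (sub)derivative, which is why the gradient is so cheap to compute:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # 1 where x > 0, else 0 (the subgradient at x = 0 is chosen as 0).
    return (x > 0).astype(x.dtype)
```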
Other layer types in CNNs
Pool layer
conv layers are often followed by pool layers
reduce the spatial size of the representation by keeping only the most important information (fewer activations also means fewer weights in later layers)
max-pooling: choose the maximum
- adds some spatial invariance with respect to exact feature locations
average-pooling: take the average over the window
the window size determines the pooling range (a pooling sketch follows)
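A minimal numpy sketch of 2*2 max-pooling with stride 2, assuming a single channel and even input size:

```python
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # Group pixels into non-overlapping 2*2 windows, then take each maximum.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(x))  # [[ 5.  7.] [13. 15.]]
```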
fully connected layer
the same as layers in MLP
aka dense/linear layer
often used at the end of CNNs
the output of the conv layers is a 3D tensor; reshape/flatten it into a vector to use as input to the fully connected layer.
1*1 convolution
a 1*1*m convolution can mimic a fully connected layer applied at every spatial location (a sketch follows)
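A sketch of this equivalence in PyTorch (assumed here; any framework works): a 1*1 convolution carrying the same weights as a linear layer produces identical outputs at every location:

```python
import torch
import torch.nn as nn

m, k = 8, 4  # m input channels, k output channels
conv1x1 = nn.Conv2d(m, k, kernel_size=1)
linear = nn.Linear(m, k)
linear.weight.data = conv1x1.weight.data.view(k, m)  # share the same weights
linear.bias.data = conv1x1.bias.data

x = torch.randn(1, m, 5, 5)             # a 5*5 feature map with m channels
out_conv = conv1x1(x)                   # shape (1, k, 5, 5)
out_fc = linear(x.permute(0, 2, 3, 1))  # linear layer applied per location
assert torch.allclose(out_conv, out_fc.permute(0, 3, 1, 2), atol=1e-6)
```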
Examples of CNNs
LeNet-5
one of the first CNNs (LeCun et al., 1998)
conv filter 5*5, stride = 1
subsampling (pooling) layers were 2*2 at stride 2
architecture: conv - pool - conv - pool - fc - fc
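A minimal PyTorch sketch of a LeNet-5-style network following this layout; channel counts (6, 16) and hidden sizes (120, 84) are the classic values from the 1998 paper, and tanh stands in for the original's saturating activations:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5, stride=1),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),      # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                      # 16*5*5 = 400 features
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LeNet5()(torch.randn(1, 1, 32, 32)).shape)  # torch.Size([1, 10])
```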
AlexNet
VGG Net
Training CNN
time-consuming
need large training datasets; prone to overfitting
A possible solution: transfer learning (see the sketch after this list)
- use a pre-trained encoder trained on ImageNet
- fine-tune the weights of the decoder
- training time: typically less than 1 hour
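A sketch of this recipe with torchvision (assumptions: ResNet-18 as the pre-trained encoder, 10 target classes, torchvision >= 0.13 for the weights API):

```python
import torch.nn as nn
from torchvision import models

# Load an encoder pre-trained on ImageNet and freeze its weights.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with a fresh head for our task; only this part
# is fine-tuned, which is why training is fast.
num_classes = 10  # assumption: 10 target classes
model.fc = nn.Linear(model.fc.in_features, num_classes)
```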
Applications
Object classification and image retrieval
Object detection and segmentation
Human pose estimation
Super-resolution
Summary
- CNNs are based on sparse interactions and parameter sharing.
- note: training is time-consuming and prone to overfitting
- use transfer learning instead of training from scratch