This class will introduce some typical CNNs.

Transfer learning

pre-train, then fine-tune

replace the decoder (classifier head) and train it

saves training time

Examples

reinitialize and train only the last few fully connected layers.

encoder: more generic features (edges, blobs, textures)

decoder: more specific features (object parts)

Transfer learning freezes the encoder, whereas fine-tuning unfreezes it.
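As an illustration, a minimal PyTorch sketch of both regimes, assuming a torchvision ResNet-18 backbone (the layer names follow torchvision; the 10-class head is a hypothetical downstream task):

```python
import torch.nn as nn
from torchvision import models

# Load an encoder pre-trained on ImageNet (torchvision ResNet-18 as an example).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Transfer learning: freeze the encoder so its generic features are kept as-is.
for param in model.parameters():
    param.requires_grad = False

# Replace the "decoder" (here the final FC classifier) and train only it.
num_classes = 10  # hypothetical downstream task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new layer is trainable

# Fine-tuning instead: unfreeze the encoder and train everything,
# usually with a smaller learning rate.
# for param in model.parameters():
#     param.requires_grad = True
```

Training only the new head is fast; fine-tuning typically uses a smaller learning rate so the pre-trained features are not destroyed.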

AlexNet

2012, the first CNN to win the ImageNet challenge (ILSVRC 2012). Its predecessor, LeNet-5 (1998), was the first CNN:

conv filters 5*5, stride 1

subsampling layers 2*2, applied at stride 2

architecture: conv-pool-conv-pool-fc-fc

Number of parameters

4 bytes per parameter

294 MiB

Details

  • first use of ReLU
  • local response normalization layers
  • heavy data augmentation
  • dropout 0.5 in FC layers
  • batch size 128, SGD with momentum 0.9, learning rate 1e-2
  • L2 weight decay 5e-4 (see the training sketch below)
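A hedged PyTorch sketch of how these hyperparameters map onto standard calls (the model comes from torchvision; the random batch stands in for real data):

```python
import torch
from torchvision import models

model = models.alexnet(weights=None)  # dropout 0.5 is built into its FC classifier

# SGD with momentum 0.9, learning rate 1e-2, L2 weight decay 5e-4
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

# One training step on a (random) batch of 128 images
images = torch.randn(128, 3, 224, 224)
labels = torch.randint(0, 1000, (128,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```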

VGGNet

small filters, deeper networks

from 8 layers (AlexNet) to 16-19 layers (VGG16/VGG19).

only 3*3 conv with stride 1, pad 1, and 2*2 max pooling with stride 2.

Why use smaller filters?

A stack of three 3*3 conv (stride 1) layers has the same effective receptive field as one 7*7 conv layer,

so large filters are not necessary.

Number of parameters of one 7*7 filter vs three 3*3 filters:

assuming 3 channels (counting the weights of a single output filter, keeping the depth at 3):

one 7*7 filter: 7*7*3 = 147 weights

three stacked 3*3 filters: 3*(3*3*3) = 81 weights

fewer parameters

deeper, more non-linearities.
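A quick numeric check of this comparison, generalized from a single filter to C channels in and out per layer (49*C^2 vs 27*C^2 weights; the helper functions are just for illustration):

```python
# Weight counts for C channels in and C channels out per layer (biases ignored):
# one 7*7 conv layer: 49 * C^2, three stacked 3*3 conv layers: 27 * C^2.
def params_one_7x7(C):
    return C * (7 * 7 * C)

def params_three_3x3(C):
    return 3 * C * (3 * 3 * C)

for C in (3, 64, 256):
    print(C, params_one_7x7(C), params_three_3x3(C))
# C=3: 441 vs 243; C=64: 200704 vs 110592 -- fewer parameters, same receptive field
```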

Memory and number of parameters

total memory: ~24M activations * 4 bytes ≈ 96 MB per image

total params: 138M parameters ≈ 553 MB (at 4 bytes each).
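The 138M figure can be reproduced from torchvision's VGG-16 definition (a sketch; weights=None builds the architecture without downloading anything):

```python
from torchvision import models

vgg16 = models.vgg16(weights=None)            # architecture only, no download
n_params = sum(p.numel() for p in vgg16.parameters())
print(n_params)                               # 138357544
print(f"{n_params * 4 / 1e6:.0f} MB")         # ~553 MB at 4 bytes per parameter
```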

Details

no local response normalization (LRN)

GoogLeNet

aka Inception V1

  • architecture different from VGG and AlexNet
  • 1*1 convs in the middle of the network
  • global average pooling at the end of the network instead of FC layers
  • inception module: convs of different sizes/types applied to the same input, with all outputs concatenated

1*1 convolution

introduces more non-linearity, increasing the representational power.

used as a dimension-reduction module to reduce computation.

by reducing this computational bottleneck, the depth and width of the network can be increased.
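A sketch of the bottleneck effect; the 28*28*256 input and the channel counts are illustrative, not taken from the lecture:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)               # feature map: 256 channels, 28*28

# Without a bottleneck: a 5*5 conv applied directly to 256 channels
conv5 = nn.Conv2d(256, 128, kernel_size=5, padding=2)

# With a bottleneck: a 1*1 conv reduces 256 -> 64 channels, then the 5*5 conv
reduce = nn.Sequential(nn.Conv2d(256, 64, kernel_size=1), nn.ReLU(),
                       nn.Conv2d(64, 128, kernel_size=5, padding=2))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5))    # 819,328 parameters
print(count(reduce))   # 221,376 parameters: ~3.7x fewer, plus an extra non-linearity
```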

Inception module

design a good local network topology and then stack these modules on top of each other.

different kinds of features are extracted.

the naive version has high computational complexity, which the 1*1 bottleneck convs reduce (sketched below).
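A minimal PyTorch sketch of such a module; the channel counts are illustrative (roughly those of GoogLeNet's first inception block):

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1*1, 3*3, 5*5 convs and 3*3 max pooling on the same input,
    with 1*1 bottlenecks; the outputs are concatenated along the channel axis."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, 1)
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                     nn.Conv2d(96, 128, 3, padding=1))
        self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, padding=2))
        self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                         nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionModule(192)(x).shape)   # torch.Size([1, 256, 28, 28])
```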

Global average pooling

traditionally, FC layers are used at the end of the network,

where every input is connected to every output (many parameters).

instead, global average pooling is used near the end of the network, averaging each feature map from 7*7 down to 1*1,

so its number of weights = 0.

global average pooling improved top-1 accuracy and is less prone to overfitting.
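A sketch contrasting the two heads for a 7*7*1024 feature map and 1000 classes (sizes as in GoogLeNet; the head definitions themselves are illustrative):

```python
import torch
import torch.nn as nn

# FC head: flatten 7*7*1024 and connect every input to each of the 1000 outputs
fc_head = nn.Sequential(nn.Flatten(), nn.Linear(7 * 7 * 1024, 1000))

# GAP head: average each 7*7 map down to 1*1 (no weights), then one small classifier
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(1024, 1000))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc_head))    # 50,177,000 parameters
print(count(gap_head))   #  1,025,000 parameters (the pooling itself has 0)

x = torch.randn(1, 1024, 7, 7)                 # final feature maps
print(fc_head(x).shape, gap_head(x).shape)     # both torch.Size([1, 1000])
```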

Auxiliary classifier

intermediate softmax branches (auxiliary classifiers)

used during training only

each branch consists of:

  • 5*5 average pooling stride 3
  • 1*1 conv
  • 1024 FC
  • 1000 FC
  • softmax

helps combat the vanishing-gradient problem and provides extra regularization (one branch is sketched below).
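A sketch of one such branch, assuming it is attached to an intermediate 14*14*512 feature map as in GoogLeNet; during training its loss is added to the main loss with a small weight (0.3 in the original paper):

```python
import torch
import torch.nn as nn

aux_classifier = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),   # 14*14 -> 4*4 average pooling
    nn.Conv2d(512, 128, kernel_size=1),      # 1*1 conv
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024),            # 1024-unit FC
    nn.ReLU(),
    nn.Linear(1024, 1000),                   # 1000-way FC
    # softmax is applied inside the cross-entropy loss during training
)

x = torch.randn(1, 512, 14, 14)              # intermediate feature map
print(aux_classifier(x).shape)               # torch.Size([1, 1000])
```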

Details

deeper network, with computational efficiency

22 layers

inception module

avoids expensive FC layers

only 5M params

ResNet

deep residual networks

very deep networks using residual connections

152-layer model for ImageNet

motivation for going deeper: deeper networks have more model capacity

core idea: skip connections that bypass one or more layers.

residual mapping

Residual block

residual function: $F(x) = h_W(x) - x$, so the block computes $h_W(x) = F(x) + x$

the identity shortcut makes it easy to push the residual F(x) to 0

Skip connections and shapes

There are two kinds of skip connections: identity shortcuts that skip directly, and projection shortcuts that go through an extra conv layer (e.g. a 1*1 conv) to match shapes.
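A minimal residual-block sketch showing both shortcut types (identity when shapes match, a 1*1 projection conv when they do not); the channel counts are illustrative:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Residual function F(x): two 3*3 convs with BatchNorm
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        # Shortcut: identity if shapes match, otherwise a 1*1 projection conv
        if stride == 1 and in_ch == out_ch:
            self.shortcut = nn.Identity()
        else:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch))

    def forward(self, x):
        # h_W(x) = F(x) + x : the conv layers only need to learn the residual
        return torch.relu(self.f(x) + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)   # torch.Size([1, 128, 28, 28])
```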

Training in practice

  • Batch Norm after every conv layer
  • Xavier/2 initialization from He et al. (i.e. Kaiming/He initialization)
  • SGD + Momentum 0.9
  • learning rate 0.1, divided by 10 when validation error plateaus.
  • mini-batch size 256
  • weight decay 1e-5
  • no dropout (the schedule is sketched below)
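A hedged sketch of how these settings map onto PyTorch; the model and the constant validation error are placeholders just to exercise the ReduceLROnPlateau schedule:

```python
import torch

model = torch.nn.Linear(10, 10)   # placeholder standing in for a ResNet

# SGD + momentum 0.9, lr 0.1, weight decay 1e-5 (as listed above)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)

# Divide the learning rate by 10 when the monitored validation error plateaus
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=5)

for epoch in range(30):
    val_error = 1.0               # stand-in for the real validation error
    scheduler.step(val_error)     # lr: 0.1 -> 0.01 -> 0.001 -> ...
print(optimizer.param_groups[0]['lr'])
```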

MobileNet

Basic idea

  • depthwise separable convolution is used to reduce the model size and complexity
  • MobileNet is particularly useful for mobile and embedded vision applications
  • smaller model size: fewer parameters
  • lower complexity: fewer multiplications and additions

Depthwise separable conv

depthwise conv is the channel-wise $D_k \times D_k$ spatial conv.

pointwise conv is a 1*1 conv to change the number of channels.

Network architectures

Batch Norm and ReLU are applied after each conv (sketched below)
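A sketch of one depthwise-separable block as described above, with BatchNorm and ReLU after each conv; the 128 -> 256 channel counts are illustrative, and a parameter comparison against a standard 3*3 conv is included:

```python
import torch
import torch.nn as nn

in_ch, out_ch = 128, 256

# Standard 3*3 conv: every filter spans all input channels
standard = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)

# Depthwise separable conv: channel-wise 3*3 spatial conv (groups=in_ch),
# then a 1*1 pointwise conv to change the number of channels,
# with BatchNorm + ReLU after each conv.
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),  # depthwise
    nn.BatchNorm2d(in_ch), nn.ReLU(),
    nn.Conv2d(in_ch, out_ch, 1, bias=False),                          # pointwise
    nn.BatchNorm2d(out_ch), nn.ReLU())

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))    # 294,912 parameters
print(count(separable))   #  34,688 parameters (incl. BatchNorm) -- ~8.5x fewer
```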

Summary

depthwise separable convs replace standard convs by factorizing them into a depthwise conv and a 1*1 pointwise conv

much more efficient with little loss in accuracy

MobileNet V2

adds ResNet-style residual connections (inverted residual blocks)

Other types of architectures

Autoencoder

unsupervised learning

the decoder reconstructs the image; the network learns a latent-space representation

Convolutional autoencoder

  • encoder + decoder
  • transposed convolutions in the decoder for upsampling (see the sketch below)
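A minimal convolutional-autoencoder sketch (the 28*28 grayscale input and layer sizes are illustrative; the target is the input image itself):

```python
import torch
import torch.nn as nn

# Encoder: conv layers compress the image into a small latent representation
encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # 28*28 -> 14*14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())   # 14*14 -> 7*7

# Decoder: transposed convs upsample the latent code back to an image
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid())

x = torch.rand(8, 1, 28, 28)                  # a batch of (fake) grayscale images
recon = decoder(encoder(x))                   # reconstruction, same shape as x
loss = nn.functional.mse_loss(recon, x)       # unsupervised reconstruction loss
print(recon.shape, loss.item())
```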

U-Net

fine image details are lost at the bottleneck

skip connections: reuse the encoder's feature maps when expanding from the bottleneck (in the decoder)

designed for image segmentation
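A sketch of the skip-connection idea only, not the full U-Net; the layer sizes are illustrative:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 64, 64)

enc = nn.Conv2d(3, 32, 3, padding=1)                # encoder stage, full resolution
down = nn.Conv2d(32, 64, 3, stride=2, padding=1)    # down to the bottleneck
up = nn.ConvTranspose2d(64, 32, 2, stride=2)        # decoder: upsample back

f1 = torch.relu(enc(x))              # 32 x 64 x 64, kept for the skip connection
bottleneck = torch.relu(down(f1))    # 64 x 32 x 32
u = up(bottleneck)                   # 32 x 64 x 64

# Skip connection: concatenate encoder features with the upsampled decoder features
merged = torch.cat([u, f1], dim=1)
print(merged.shape)                  # torch.Size([1, 64, 64, 64])
```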

Siamese networks

the branches share one (often pre-trained) encoder; a single pre-trained model can be connected to multiple downstream task models
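A minimal sketch of the weight-sharing idea: the same encoder embeds both inputs and the embeddings are compared (the tiny encoder and cosine similarity are illustrative choices):

```python
import torch
import torch.nn as nn

# One shared encoder (in practice often a pre-trained backbone)
encoder = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                        nn.Linear(16, 64))

x1 = torch.randn(4, 3, 32, 32)       # first batch of images
x2 = torch.randn(4, 3, 32, 32)       # second batch of images

# The *same* weights embed both inputs (the twin branches)
z1, z2 = encoder(x1), encoder(x2)

# Compare the embeddings, e.g. with cosine similarity
similarity = nn.functional.cosine_similarity(z1, z2, dim=1)
print(similarity.shape)              # torch.Size([4])
```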

Summary

ResNet and MobileNet are currently good defaults to use

Networks have gotten increasingly deep over time

many other aspects of network architectures are also continuously being investigated and improved

more recent trend: meta-learning…