Computer Vision W43: CNN architectures
This class will introduce some typical CNNs.
Transfer learning
pre-train, then fine-tune
replace the decoder (the last layers) and train it on the new task
saves training time
Examples
reinitialize and train only the last few fully connected layers.
encoder: more generic features (edge, blob, texture)
decoder: more specific features (object parts)
transfer learning freezes the encoder, whereas fine-tuning also unfreezes (part of) the encoder; see the sketch below.
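A minimal PyTorch sketch of the difference, assuming a recent torchvision ResNet-18 as the pretrained encoder and a hypothetical 10-class downstream task (neither is specified in these notes):

```python
# Transfer learning vs fine-tuning sketch (PyTorch + torchvision assumed).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained encoder

# Transfer learning: freeze the encoder, replace and train only the head.
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 10)  # new 10-class head, trained from scratch

# Fine-tuning instead: also unfreeze part of the encoder, e.g. the last stage.
for p in model.layer4.parameters():
    p.requires_grad = True
```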
AlexNet
LeNet-5 (1998) was the first CNN: 5*5 conv filters at stride 1, 2*2 subsampling (pooling) layers at stride 2, architecture conv-pool-conv-pool-fc-fc.
AlexNet (2012) scaled this design up and was the first CNN to win the ImageNet challenge.
Number of parameters
4 bytes per parameter
294 MiB
Details
- first use of ReLU
- local response normalization layers
- heavy data augmentation
- dropout 0.5 in FC layers
- batch size 128, SGD Momentum 0.9, learning rate 1e-2.
- L2 weight decay 5e-4
VGGNet
small filters, deeper networks
from 8 layers (AlexNet) to 16-19 layers (VGG16/VGG19).
only 3*3 conv with stride 1, pad 1, and 2*2 max pooling with stride 2.
Why use smaller filters?
a stack of three 3*3 conv (stride 1) layers has the same effective receptive field as one 7*7 conv layer,
so large filters are not necessary.
Number of parameters of one 7*7 filter vs. three stacked 3*3 filters:
assuming 3 input channels
one 7*7 filter: 7*7*3 = 147
three 3*3 filters: 3*(3*3*3) = 81
fewer parameters (verified in the snippet below)
deeper, more non-linearities.
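A quick sanity check of the comparison above; counting all input/output channels (C = 3) instead of a single filter gives 49C² = 441 vs 27C² = 243, the same 49:27 ratio as 147 vs 81:

```python
# One 7*7 conv vs three stacked 3*3 convs, 3 channels in and out (illustrative).
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

big = nn.Conv2d(3, 3, kernel_size=7, padding=3, bias=False)
small = nn.Sequential(
    nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(3, 3, kernel_size=3, padding=1, bias=False),
)
print(n_params(big), n_params(small))  # 441 vs 243, i.e. 49*C*C vs 27*C*C
```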
Memory and number of parameters
total memory: ~24M activations * 4 bytes ≈ 96 MB per image (forward pass)
total params: 138M parameters ≈ 552 MB
Details
no local response normalization (LRN) layers
GoogLeNet
aka Inception V1
- architecture different from VGG, AlexNet
- 1*1 convolutions in the middle of the network
- global average pooling at the end of the network instead of FC layers
- inception module: apply different sizes/types of conv to the same input and concatenate all the outputs along the channel dimension.
1*1 convolution
for introducing more non-linearity to increase the representational power.
used as a dimension-reduction module to reduce computation.
by removing this computational bottleneck, the depth and width of the network can be increased (see the worked example below).
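A worked example of the savings; the sizes here are illustrative assumptions, not slide numbers: a 5*5 conv on a 28x28x192 input producing 32 channels, with and without a 1*1 bottleneck down to 16 channels.

```python
# Multiply counts with and without a 1*1 dimension-reduction conv (illustrative sizes).
H = W = 28
c_in, c_mid, c_out = 192, 16, 32

direct = H * W * c_out * 5 * 5 * c_in              # 5*5 conv applied directly
bottleneck = (H * W * c_mid * 1 * 1 * c_in         # 1*1 conv to reduce channels
              + H * W * c_out * 5 * 5 * c_mid)     # then the 5*5 conv
print(f"direct: {direct/1e6:.1f}M multiplies")                  # ~120.4M
print(f"with 1*1 bottleneck: {bottleneck/1e6:.1f}M multiplies")  # ~12.4M
```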
Inception module
design a good local network topology and then stack these modules on top of each other.
different kinds of features are extracted.
the naive version has high computational complexity, hence the 1*1 reduction layers (sketched below).
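A minimal Inception-style module sketch; the branch channel counts are hypothetical, not the exact GoogLeNet configuration:

```python
# Inception-style module: parallel 1*1, 3*3, 5*5 and pooling branches,
# with 1*1 convs for dimension reduction, outputs concatenated channel-wise.
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    def __init__(self, c_in):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, 64, 1)                        # 1*1 branch
        self.b2 = nn.Sequential(nn.Conv2d(c_in, 32, 1),         # 1*1 reduce
                                nn.Conv2d(32, 64, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(c_in, 16, 1),         # 1*1 reduce
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(c_in, 32, 1))         # pool projection

    def forward(self, x):
        # every branch keeps the spatial size, so outputs can be concatenated
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

y = InceptionBlock(192)(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 192, 28, 28]); channels 64+64+32+32 = 192
```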
Global average pooling
traditionally, FC layers are used at the end of the network
every input is connected to every output, which costs many weights
instead, global average pooling is used near the end of the network, averaging each feature map from 7*7 down to 1*1
number of weights = 0
average pooling improved top-1 accuracy and is less prone to overfitting (see the snippet below).
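A small sketch of the replacement, assuming a 7*7*1024 final feature map as in GoogLeNet:

```python
# Global average pooling: 7*7*1024 -> 1*1*1024 with zero learnable weights.
import torch
import torch.nn as nn

feat = torch.randn(1, 1024, 7, 7)            # final conv feature map
gap = nn.AdaptiveAvgPool2d(1)                # no parameters
pooled = gap(feat).flatten(1)                # shape (1, 1024), fed to one linear classifier

# compare: flattening 7*7*1024 into a 1024-unit FC layer would need
# 7*7*1024*1024 ≈ 51M weights, vs 0 for the pooling itself.
print(pooled.shape, sum(p.numel() for p in gap.parameters()))  # torch.Size([1, 1024]) 0
```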
Auxiliary classifier
intermediate softmax branches
used for training only
each branch consists of:
- 5*5 average pooling stride 3
- 1*1 conv
- 1024 FC
- 1000 FC
- softmax
used to combat the vanishing-gradient problem and to provide regularization (see the sketch below).
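A sketch of such a branch, assuming it is attached to a 14x14x512 intermediate feature map and that the 1*1 conv has 128 filters (both are assumptions, not stated in these notes):

```python
# Auxiliary classifier branch following the list above.
import torch
import torch.nn as nn

aux = nn.Sequential(
    nn.AvgPool2d(kernel_size=5, stride=3),   # 5*5 average pooling, stride 3 -> 4x4
    nn.Conv2d(512, 128, kernel_size=1),      # 1*1 conv (128 filters assumed)
    nn.ReLU(inplace=True),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 1024),            # 1024-unit FC
    nn.ReLU(inplace=True),
    nn.Linear(1024, 1000),                   # 1000-way FC
    # softmax is applied by the training loss (e.g. CrossEntropyLoss)
)
print(aux(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])
```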
Details
deeper network, with computational efficiency
22 layers
inception module
avoids expensive FC layers
only 5M params
ResNet
deep residual networks
very deep networks using residual connections
152-layer model for ImageNet
motivation for going deeper: deeper networks have more model capacity
core idea: skip connections that bypass one or more layers (see the figure in the slides).
residual mapping: instead of learning the desired mapping H(x) directly, the layers learn a residual function F(x) = H(x) - x
Residual block
the block outputs F(x) + x
if the identity mapping is close to optimal, it is easy to push the residual F(x) to 0
skip connections and shapes
there are two kinds of skip connections: identity shortcuts that skip directly, and projection shortcuts that go through an extra conv layer (used when the shapes differ); a minimal block is sketched below.
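A simplified residual block along the lines of the basic block in the ResNet paper (the exact layer choices here are illustrative):

```python
# Basic residual block: the conv layers learn F(x); the block outputs F(x) + x.
# A 1*1 projection shortcut is used when the shapes differ.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.relu = nn.ReLU(inplace=True)
        # identity shortcut if shapes match, otherwise a 1*1 projection
        self.shortcut = nn.Identity()
        if stride != 1 or c_in != c_out:
            self.shortcut = nn.Sequential(
                nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
                nn.BatchNorm2d(c_out))

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))            # F(x)
        return self.relu(out + self.shortcut(x))   # F(x) + x

print(ResidualBlock(64, 128, stride=2)(torch.randn(1, 64, 56, 56)).shape)
# torch.Size([1, 128, 28, 28])
```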
training in practice (see the config sketch after this list)
- Batch Norm after every conv layer
- Xavier/2 initialization from He et al.
- SGD + Momentum 0.9
- learning rate 0.1, divided by 10 when validation error plateaus.
- mini-batch size 256
- weight decay 1e-5
- no dropout
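The same recipe expressed as a PyTorch configuration sketch; `model` is a placeholder, and the ReduceLROnPlateau scheduler is my assumption for "divide by 10 when validation error plateaus":

```python
# Training setup sketch matching the hyperparameters listed above.
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the actual network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

# after each validation pass: scheduler.step(val_error)
```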
MobileNet
Basic idea
- depthwise separable convolution is used to reduce the model size and complexity
- MobileNet is particularly useful for mobile and embedded vision applications
- smaller model size: fewer number of parameters
- lower complexity: fewer multiplications and additions
Depthwise separable conv
depthwise conv is a channel-wise spatial convolution (one filter per input channel).
pointwise conv is a 1*1 conv that changes the number of channels.
Network architectures
Batch Norm and ReLU are applied after each conv (see the block sketched below).
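A sketch of one such block, using `groups=in_channels` for the depthwise step; the 32 -> 64 channel counts are illustrative:

```python
# Depthwise separable conv block: depthwise 3*3 conv + pointwise 1*1 conv,
# each followed by BatchNorm and ReLU, compared against a standard 3*3 conv.
import torch.nn as nn

def depthwise_separable(c_in, c_out, stride=1):
    return nn.Sequential(
        # depthwise: one 3*3 filter per input channel
        nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in, bias=False),
        nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
        # pointwise: 1*1 conv to change the number of channels
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

block = depthwise_separable(32, 64)
standard = nn.Conv2d(32, 64, 3, padding=1, bias=False)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(block), count(standard))  # far fewer weights: 2528 vs 18432
```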
summary
depthwise separable convs replace standard convs by factorizing them into a depthwise conv and a 1*1 pointwise conv
much more efficient with little loss in accuracy
MobileNet V2
adds ResNet-style shortcut connections (inverted residual blocks)
Other types of architectures
Autoencoder
unsupervised learning
the decoder reconstructs the image; the network learns a latent-space representation
convolutional autoencoder
- encoder + decoder
- transposed convolutions in the decoder (see the sketch below)
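A tiny convolutional autoencoder sketch; the layer sizes and 28x28 single-channel input are assumptions:

```python
# Convolutional autoencoder: strided convs downsample, transposed convs reconstruct.
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 28x28 -> 14x14
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 14x14 -> 7x7 (bottleneck)
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 7x7 -> 14x14
    nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),  # 14x14 -> 28x28
)

x = torch.randn(1, 1, 28, 28)
recon = decoder(encoder(x))   # trained with a reconstruction loss, e.g. MSE
print(recon.shape)            # torch.Size([1, 1, 28, 28])
```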
U-Net
fine image details are lost at the bottleneck
skip connections: the encoder's feature maps are reused in the expanding (decoder) path to recover detail, as sketched below
designed for image segmentation
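A minimal sketch of one such skip connection, with illustrative sizes: an encoder feature map is concatenated with the upsampled decoder feature map before further convolution.

```python
# One U-Net-style skip connection (channel/spatial sizes are illustrative).
import torch
import torch.nn as nn

enc_feat = torch.randn(1, 64, 56, 56)       # saved from the encoder path
bottleneck = torch.randn(1, 128, 28, 28)

up = nn.ConvTranspose2d(128, 64, 2, stride=2)(bottleneck)   # 28x28 -> 56x56
merged = torch.cat([enc_feat, up], dim=1)                   # 64 + 64 = 128 channels
out = nn.Conv2d(128, 64, 3, padding=1)(merged)              # decoder conv
print(out.shape)  # torch.Size([1, 64, 56, 56])
```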
Siamese networks
one pretrained model (the branches share weights) that can be attached to multiple downstream task models
summary
ResNet and MobileNet are currently good defaults to use
Networks have gotten increasingly deep over time
many other aspects of network architectures are also continuously being investigated and improved
more recent trend: meta-learning…