Back propagation with manual derivation
Let’s consider a neural network
$L$ is the total number of layers.
$s_l$ is the number of units (excluding the bias unit) in layer $l$.
$(W,b) = \big(W^{(l)}_{ij},\, b^{(l)}_i\big)_{l=1}^{L-1}$ are the model parameters, where $W^{(l)}_{ij}$ denotes the weight associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$, and $b^{(l)}_i$ is the bias associated with unit $i$ in layer $l+1$. $W^{(l)}$ has size $s_{l+1} \times s_l$ and $b^{(l)}$ has length $s_{l+1}$.
Forward propagation
$$a^{(i)} = \sigma\big(z^{(i)}\big)$$
$$z^{(i)} = W^{(i-1)} a^{(i-1)} + b^{(i-1)}$$
Goal of training: learn parameters such that $h_{W,b}(x^{(k)}) = y^{(k)}$.
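As a rough illustration of the forward pass, here is a minimal NumPy sketch; the function names (`sigmoid`, `forward`) and the list-of-matrices layout are my own assumptions, with `Ws[l]` and `bs[l]` standing for $W^{(l)}$ and $b^{(l)}$ and the sigmoid used as the activation $\sigma$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Ws, bs):
    """Forward propagation: z^(i) = W^(i-1) a^(i-1) + b^(i-1), a^(i) = sigma(z^(i)).

    Ws[l] has shape (s_{l+1}, s_l) and bs[l] has length s_{l+1}.
    Returns all activations a^(1..L) and pre-activations z^(2..L),
    which back propagation will need later.
    """
    a = x
    activations = [a]        # a^(1) = x
    pre_activations = []     # z^(2), ..., z^(L)
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        pre_activations.append(z)
        activations.append(a)
    return activations, pre_activations
```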
Back propagation
Intuition: $\delta^{(l)}_j$ is the error of unit $j$ in layer $l$.
$$\delta^{(4)} = \big(a^{(4)} - y^{(k)}\big) \odot \sigma'\big(z^{(4)}\big) \quad \text{(output layer, here for a network with } L = 4 \text{ layers)}$$
$$\delta^{(i)} = \big(W^{(i)}\big)^T \delta^{(i+1)} \odot \sigma'\big(z^{(i)}\big)$$
$$\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}} = a^{(l)}_j\,\delta^{(l+1)}_i$$
$$\frac{\partial J(W,b)}{\partial b^{(l)}_i} = \delta^{(l+1)}_i$$
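These four formulas translate almost line by line into code. Below is a rough NumPy sketch of the backward pass (my own naming, assuming the squared-error loss used in the proof below and the sigmoid activation from the forward sketch above):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backward(y, Ws, activations, pre_activations):
    """Back propagation for a single example.

    Uses:
        delta^(L) = (a^(L) - y) * sigma'(z^(L))
        delta^(l) = (W^(l))^T delta^(l+1) * sigma'(z^(l))
        dJ/dW^(l)_{ij} = a^(l)_j delta^(l+1)_i
        dJ/db^(l)_i    = delta^(l+1)_i
    """
    grads_W = [None] * len(Ws)
    grads_b = [None] * len(Ws)

    # Error of the output layer.
    delta = (activations[-1] - y) * sigmoid_prime(pre_activations[-1])

    # Walk backwards through the layers.
    for l in reversed(range(len(Ws))):
        grads_W[l] = np.outer(delta, activations[l])  # delta_i^(l+1) * a_j^(l)
        grads_b[l] = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * sigmoid_prime(pre_activations[l - 1])
    return grads_W, grads_b
```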
Proof
Last layer
Using the chain rule:
$$\delta^{(L)}_j = \big(a^{(L)}_j - y_j\big)\,\sigma'\big(z^{(L)}_j\big)$$
$$J(W,b) = \frac{1}{2}\sum_k \big(h_{W,b}(x)_k - y_k\big)^2 = \frac{1}{2}\sum_k \big(a^{(L)}_k - y_k\big)^2$$
Derivative w.r.t. $a^{(L)}_j$: $\frac{\partial J(W,b)}{\partial a^{(L)}_j} = a^{(L)}_j - y_j$
Define the error of unit $j$ in layer $l$ as:
$$\delta^{(l)}_j = \frac{\partial J(W,b)}{\partial z^{(l)}_j} = \sum_k \frac{\partial J}{\partial a^{(l)}_k} \times \frac{\partial a^{(l)}_k}{\partial z^{(l)}_j} = \frac{\partial J}{\partial a^{(l)}_j}\,\sigma'\big(z^{(l)}_j\big) = \big(a^{(l)}_j - y_j\big)\,\sigma'\big(z^{(l)}_j\big)$$
Only the $k = j$ term survives because $\sigma$ is applied element-wise, and the last equality uses the derivative above, which holds for the output layer $l = L$.
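For the last-layer case, this chain-rule computation can also be verified symbolically; the short SymPy sketch below (my own check, for a single output unit with the sigmoid activation) confirms that $\frac{d}{dz}\,\frac{1}{2}\big(\sigma(z)-y\big)^2 = \big(\sigma(z)-y\big)\,\sigma'(z)$.

```python
import sympy as sp

z, y = sp.symbols('z y')
a = 1 / (1 + sp.exp(-z))           # sigmoid activation a = sigma(z)
J = sp.Rational(1, 2) * (a - y) ** 2

delta = sp.diff(J, z)              # dJ/dz for the output unit
expected = (a - y) * a * (1 - a)   # (a - y) * sigma'(z), using sigma' = a(1 - a)

print(sp.simplify(delta - expected))  # prints 0: the two expressions agree
```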
The proof for the intermediate layers is analogous: expand $\frac{\partial J}{\partial z^{(l)}_j}$ through the units of layer $l+1$ with the chain rule, which yields $\delta^{(l)} = \big(W^{(l)}\big)^T \delta^{(l+1)} \odot \sigma'\big(z^{(l)}\big)$.
Weight update
$$\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}} = a^{(l)}_j\,\delta^{(l+1)}_i$$
$$\frac{\partial J}{\partial W} = \frac{\partial J}{\partial z} \times \frac{\partial z}{\partial W}$$
Since $z^{(l+1)}_i = \sum_j W^{(l)}_{ij} a^{(l)}_j + b^{(l)}_i$, we have $\frac{\partial z^{(l+1)}_i}{\partial W^{(l)}_{ij}} = a^{(l)}_j$, which gives the result above.
Bias update
$$\frac{\partial J(W,b)}{\partial b^{(l)}_i} = \delta^{(l+1)}_i$$
$$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z} \times \frac{\partial z}{\partial b}$$
Here $\frac{\partial z^{(l+1)}_i}{\partial b^{(l)}_i} = 1$, so the derivative is just the error term.
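To double-check the whole derivation numerically, a finite-difference gradient check is useful. The sketch below reuses the `forward` and `backward` sketches from above and compares the analytic $\frac{\partial J}{\partial W^{(l)}_{ij}}$ with a central-difference estimate; the helper names and the tiny 2-3-1 network in the usage lines are my own.

```python
import numpy as np

def loss(x, y, Ws, bs):
    """Squared-error loss J = 1/2 * sum_k (a_k^(L) - y_k)^2 for one example."""
    activations, _ = forward(x, Ws, bs)
    return 0.5 * np.sum((activations[-1] - y) ** 2)

def gradient_check(x, y, Ws, bs, eps=1e-5):
    """Compare analytic dJ/dW^(l) with a central finite difference."""
    activations, pre_activations = forward(x, Ws, bs)
    grads_W, _ = backward(y, Ws, activations, pre_activations)
    for l, W in enumerate(Ws):
        numeric = np.zeros_like(W)
        for i in range(W.shape[0]):
            for j in range(W.shape[1]):
                W[i, j] += eps
                plus = loss(x, y, Ws, bs)
                W[i, j] -= 2 * eps
                minus = loss(x, y, Ws, bs)
                W[i, j] += eps                      # restore original value
                numeric[i, j] = (plus - minus) / (2 * eps)
        print(f"layer {l}: max |analytic - numeric| =",
              np.max(np.abs(grads_W[l] - numeric)))

# Example usage on a small random 2-3-1 network (hypothetical sizes).
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
gradient_check(rng.normal(size=2), np.array([1.0]), Ws, bs)
```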
Training NNs in practice
Saturation
Take the sigmoid function as an example: when the input is large (in absolute value), the activation value barely changes, so the computed gradient is very small and the network updates slowly.
Choosing a non-sigmoid activation function mitigates this problem, e.g.:
leaky ReLU, Maxout, ELU, ReLU, tanh…
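As a quick numeric illustration of saturation (a sketch of my own, not from the notes): the sigmoid derivative $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ is essentially zero for large $|z|$, whereas the ReLU derivative stays at 1 for positive inputs.

```python
import numpy as np

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

sig = 1.0 / (1.0 + np.exp(-z))
sigmoid_grad = sig * (1.0 - sig)        # ~4.5e-5 at |z| = 10: saturated
relu_grad = (z > 0).astype(float)       # 1 for positive inputs, no saturation

print("sigmoid'(z):", sigmoid_grad)
print("relu'(z):   ", relu_grad)
```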
Vanishing gradient problem
Saturation can lead to vanishing gradients.
The gradient of the loss function can approach zero; as it is propagated backwards it keeps shrinking from layer to layer, so the parameters of the earlier layers barely get updated.
Possible solutions: batch normalization and residual connections.
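To see why depth makes this worse, note that the backward recursion multiplies by $\sigma'(z^{(l)})$ at every layer; since $\sigma'(z) \le 0.25$, the error signal can shrink geometrically with depth. A rough illustration of this effect (a toy bound of my own, ignoring the effect of the weight matrices):

```python
# Upper bound on the per-layer scaling from the sigmoid derivative alone
# (sigma'(z) <= 0.25), ignoring the effect of the weight matrices.
for depth in [1, 5, 10, 20]:
    print(f"{depth:2d} layers: error signal scaled by at most {0.25 ** depth:.2e}")
```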
Over-fitting
Too many parameters and too little training data cause the model to fail to generalize to new data.
L2 regularization
weight decay
$$J(W,b) = \frac{1}{2n}\sum_i\sum_k \big(h(x^{(i)})_k - y^{(i)}_k\big)^2 + \frac{\lambda}{2n}\sum_l\sum_j\sum_i \big(W^{(l)}_{ij}\big)^2$$
The partial derivative gets an extra term:
$$\frac{\partial J(W,b)}{\partial W^{(l)}_{ij}} = \frac{1}{n}\Delta W^{(l)}_{ij} + \lambda W^{(l)}_{ij}$$
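As a sketch of how the extra term enters the update rule (hypothetical names; `grads_W[l]` is assumed to hold the accumulated gradient $\Delta W^{(l)}$ over $n$ examples):

```python
def l2_update(Ws, grads_W, n, lr=0.1, lam=1e-3):
    """One gradient step with L2 regularization ("weight decay"):
    dJ/dW^(l) = (1/n) * Delta W^(l) + lambda * W^(l).
    """
    for l in range(len(Ws)):
        Ws[l] -= lr * (grads_W[l] / n + lam * Ws[l])
    return Ws
```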
Other types of regularization
dropout
early stopping
- Up to a certain point, gradient descent improves the model's performance on data outside of the training set.
- Beyond that point, continuing to train hurts generalization, so validation performance provides guidance as to how many iterations can be run before the model begins to overfit (see the sketch below).
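A minimal early-stopping loop might look like the following sketch; `train_one_epoch` and `validation_loss` are hypothetical placeholders for the actual training and evaluation code.

```python
def train_with_early_stopping(model, max_epochs=100, patience=5):
    """Stop once the validation loss has not improved for `patience` epochs."""
    best_val = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)            # hypothetical training step
        val = validation_loss(model)      # hypothetical validation metric
        if val < best_val:
            best_val = val
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # further training would start to overfit
    return model
```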
batch normalization
data augmentation
stochastic gradient descent
- For a large training set, computing the loss and gradient over the entire training set is very slow.
- Generally, each parameter update in SGD is computed w.r.t. a few training examples (a minibatch) as opposed to a single example (see the sketch below).
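A minibatch SGD loop could be sketched as follows; the parameter layout and the `loss_gradient` helper are assumptions for illustration only.

```python
import numpy as np

def sgd(params, X, Y, loss_gradient, lr=0.01, batch_size=32, epochs=10):
    """Minibatch SGD: each parameter update uses only `batch_size` examples."""
    n = X.shape[0]
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)                         # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grads = loss_gradient(params, X[idx], Y[idx])  # hypothetical helper
            for p, g in zip(params, grads):
                p -= lr * g                                # in-place parameter update
    return params
```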
using other activation functions
Detecting over-fitting
If the loss on the training set keeps decreasing over time with no sign of convergence, while the loss on the validation set increases, the model is over-fitting.