Back propagation with manual derivation

Let’s consider a feed-forward neural network with the following notation:

$L$ is the total number of layers.

$s_l$ is the number of units (excluding the bias unit) in layer $l$.

$(W,b)=(W_{ij}^{(l)},b_i^{(l)})_{l=1}^{L-1}$ are the model parameters, where $W_{ij}^{(l)}$ denotes the weight associated with the connection between unit $j$ in layer $l$ and unit $i$ in layer $l+1$, and $b_i^{(l)}$ is the bias associated with unit $i$ in layer $l+1$. $W^{(l)}$ has size $s_{l+1}\times s_l$ and $b^{(l)}$ has length $s_{l+1}$.

Forward propagation

$$a^{(i)}=\sigma(z^{(i)})$$

$$z^{(i)}=W^{(i-1)}a^{(i-1)}+b^{(i-1)}$$

Goal of training: learn parameters such that $h_{W,b}(x^{(k)})\approx y^{(k)}$ for every training example $(x^{(k)},y^{(k)})$.
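
As a concrete illustration, here is a minimal NumPy sketch of the forward pass. The layer sizes, the weight initialization, and the `sigmoid` helper are hypothetical choices made only to match the notation above, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 4-layer network: s_1 = 3 inputs, s_2 = s_3 = 4 hidden units, s_4 = 2 outputs.
rng = np.random.default_rng(0)
sizes = [3, 4, 4, 2]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]  # W^{(l)}: s_{l+1} x s_l
b = [np.zeros(sizes[l + 1]) for l in range(len(sizes) - 1)]                         # b^{(l)}: length s_{l+1}

def forward(x):
    """Return the lists of pre-activations z^{(i)} and activations a^{(i)}."""
    a = [x]          # a^{(1)} is the input
    zs = []
    for Wl, bl in zip(W, b):
        z = Wl @ a[-1] + bl   # z^{(i)} = W^{(i-1)} a^{(i-1)} + b^{(i-1)}
        zs.append(z)
        a.append(sigmoid(z))  # a^{(i)} = sigma(z^{(i)})
    return zs, a

zs, activations = forward(np.array([0.5, -1.0, 2.0]))
print(activations[-1])  # h_{W,b}(x)
```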

Back propagation

Intuition: $\delta_j^{(l)}$ = error of unit $j$ in layer $l$.

The error of the output layer (layer 4 in a four-layer example) is

$$\delta^{(4)}=(a^{(4)}-y^{(k)})\odot\sigma'(z^{(4)})$$

and the error of an intermediate layer $i$ is

$$\delta^{(i)}=(W^{(i)})^T\delta^{(i+1)}\odot\sigma'(z^{(i)})$$

$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}$$

$$\frac{\partial J(W,b)}{\partial b_i^{(l)}}=\delta_i^{(l+1)}$$
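
A minimal sketch of these rules in NumPy, reusing the hypothetical `forward`, `sigmoid`, `W`, and `b` from the previous snippet and assuming the quadratic loss used in the proof below:

```python
def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(x, y):
    """Return gradients dJ/dW^{(l)} and dJ/db^{(l)} for one example (x, y)."""
    zs, a = forward(x)
    # Output-layer error: delta^{(L)} = (a^{(L)} - y) ⊙ sigma'(z^{(L)})
    delta = (a[-1] - y) * sigmoid_prime(zs[-1])
    grad_W = [None] * len(W)
    grad_b = [None] * len(b)
    for l in range(len(W) - 1, -1, -1):
        # dJ/dW^{(l)}_{ij} = a^{(l)}_j * delta^{(l+1)}_i  ->  outer product
        grad_W[l] = np.outer(delta, a[l])
        # dJ/db^{(l)}_i = delta^{(l+1)}_i
        grad_b[l] = delta
        if l > 0:
            # delta^{(l)} = (W^{(l)})^T delta^{(l+1)} ⊙ sigma'(z^{(l)})
            delta = (W[l].T @ delta) * sigmoid_prime(zs[l - 1])
    return grad_W, grad_b
```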

Proof

last layer

Using the chain rule, we want to show:

$$\delta_j^{(L)}=(a_j^{(L)}-y_j)\,\sigma'(z_j^{(L)})$$

The quadratic loss for a single training example (the sum runs over the output units $k$) is

$$J(W,b)=\frac{1}{2}\sum_k(h_{W,b}(x)_k-y_k)^2=\frac{1}{2}\sum_k(a_k^{(L)}-y_k)^2$$

Its derivative w.r.t. $a_j^{(L)}$: $\frac{\partial J(W,b)}{\partial a_j^{(L)}}=a_j^{(L)}-y_j$

Define the error of unit $j$ in layer $l$ as $\delta_j^{(l)}=\frac{\partial J(W,b)}{\partial z_j^{(l)}}$. For the last layer $L$, the pre-activation $z_j^{(L)}$ affects $J$ only through $a_j^{(L)}$, so

$$\delta_j^{(L)}=\frac{\partial J(W,b)}{\partial z_j^{(L)}}=\frac{\partial J}{\partial a_j^{(L)}}\cdot\frac{\partial a_j^{(L)}}{\partial z_j^{(L)}}=(a_j^{(L)}-y_j)\,\sigma'(z_j^{(L)})$$

The proof for the intermediate layers uses the same chain rule, except that $z_j^{(l)}$ now influences $J$ through every unit of layer $l+1$ (see the sketch below).
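
Filling in that step (a sketch using only the definitions above):

$$\delta_j^{(l)}=\frac{\partial J(W,b)}{\partial z_j^{(l)}}=\sum_k\frac{\partial J}{\partial z_k^{(l+1)}}\cdot\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}\,W_{kj}^{(l)}\,\sigma'(z_j^{(l)})=\Big(\sum_k W_{kj}^{(l)}\delta_k^{(l+1)}\Big)\sigma'(z_j^{(l)})$$

since $z_k^{(l+1)}=\sum_j W_{kj}^{(l)}\sigma(z_j^{(l)})+b_k^{(l)}$; in vector form this is exactly $\delta^{(l)}=(W^{(l)})^T\delta^{(l+1)}\odot\sigma'(z^{(l)})$.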

Weight update

$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial W_{ij}^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l)}}=\delta_i^{(l+1)}\,a_j^{(l)}$$

since $z_i^{(l+1)}=\sum_j W_{ij}^{(l)}a_j^{(l)}+b_i^{(l)}$ gives $\frac{\partial z_i^{(l+1)}}{\partial W_{ij}^{(l)}}=a_j^{(l)}$.

Bias update

$$\frac{\partial J(W,b)}{\partial b_{i}^{(l)}}=\delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial b_i^{(l)}}=\frac{\partial J}{\partial z_i^{(l+1)}}\cdot\frac{\partial z_i^{(l+1)}}{\partial b_i^{(l)}}=\delta_i^{(l+1)}\cdot 1$$
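
A quick way to sanity-check the manual derivation is a numerical gradient check. Here is a sketch comparing the `backprop` from the earlier snippet against a central finite difference; the probed weight entry and the step size are arbitrary choices:

```python
def loss(x, y):
    _, a = forward(x)
    return 0.5 * np.sum((a[-1] - y) ** 2)   # J(W,b) for one example

def check_one_weight(x, y, l, i, j, eps=1e-5):
    """Compare the analytic dJ/dW^{(l)}_{ij} with a central finite difference."""
    grad_W, _ = backprop(x, y)
    old = W[l][i, j]
    W[l][i, j] = old + eps
    plus = loss(x, y)
    W[l][i, j] = old - eps
    minus = loss(x, y)
    W[l][i, j] = old                        # restore the weight
    numeric = (plus - minus) / (2 * eps)
    return grad_W[l][i, j], numeric

x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0, 0.0])
print(check_one_weight(x, y, l=0, i=1, j=2))  # the two numbers should agree closely
```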

Training NNs in practice

Saturation

Taking the sigmoid function as an example: when the magnitude of the input is large, the activation value barely changes, so the computed gradient $\sigma'(z)$ is very small and the network updates slowly (see the numerical sketch below).

Choosing a non-sigmoid activation function mitigates this problem, for example:

leaky ReLU, Maxout, ELU, ReLU, tanh…
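
A tiny self-contained numerical illustration of saturation (the printed values are simply what NumPy computes for these hand-picked inputs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in [0.0, 2.0, 5.0, 10.0]:
    # For large |z| the sigmoid is nearly flat, so its derivative (the local
    # gradient) is close to zero and learning through this unit stalls.
    print(f"z={z:5.1f}  sigma(z)={sigmoid(z):.4f}  sigma'(z)={sigmoid_prime(z):.6f}")
```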

Vanishing gradient problem

Saturation leads to vanishing gradients.

The gradient of the loss function can approach zero; as the error is propagated backwards it gets smaller and smaller, so the parameters of the earlier layers barely get updated.

Possible solutions: batch normalization and residual connections.
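
A minimal sketch of why a residual (skip) connection helps, assuming equal layer widths; the matrix `Wl`, bias `bl`, and input `a` below are hypothetical and standalone from the earlier snippets:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
Wl = rng.standard_normal((4, 4)) * 0.1   # hypothetical square weight matrix
bl = np.zeros(4)
a = rng.standard_normal(4)

# Plain layer: a_next = sigma(Wl a + bl). Its Jacobian w.r.t. a is diag(sigma'(z)) @ Wl,
# whose entries can be tiny when units saturate, so gradients shrink layer by layer.
z = Wl @ a + bl
jac_plain = np.diag(sigmoid_prime(z)) @ Wl

# Residual layer: a_next = a + sigma(Wl a + bl). The Jacobian gains an identity term,
# so the backward signal always has a path that is not scaled down.
jac_residual = np.eye(4) + jac_plain

print(np.linalg.norm(jac_plain), np.linalg.norm(jac_residual))
```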

Over-fitting

Too many parameters, too little training data.

The model fails to generalize to new data.

L2 regularization

weight decay

$$J(W,b)=\frac{1}{2n}\sum_i\sum_k\big(h(x^{(i)})_k-y_k^{(i)}\big)^2+\frac{\lambda}{2n}\sum_l\sum_j\sum_i\big(W_{ij}^{(l)}\big)^2$$

The partial derivative w.r.t. each weight gets an extra weight-decay term:

$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=\frac{1}{n}\Delta W_{ij}^{(l)}+\frac{\lambda}{n}W_{ij}^{(l)}$$

where $\Delta W_{ij}^{(l)}$ denotes the gradient of the unregularized loss summed over the $n$ training examples.
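
As a sketch, the corresponding gradient-descent step with weight decay; the function name, `eta`, `lam`, and the summed-gradient arguments are placeholder names, not part of the original notes:

```python
def step_with_weight_decay(W, b, sum_grad_W, sum_grad_b, n, eta=0.1, lam=0.01):
    """One gradient-descent step on the L2-regularized cost.

    sum_grad_W[l] / sum_grad_b[l]: gradients of the unregularized loss summed
    over the n training examples (the Delta W^{(l)} above).
    """
    for l in range(len(W)):
        # dJ/dW^{(l)} = (1/n) * Delta W^{(l)} + (lambda/n) * W^{(l)}
        W[l] -= eta * (sum_grad_W[l] / n + (lam / n) * W[l])
        # The bias term is usually not regularized.
        b[l] -= eta * (sum_grad_b[l] / n)
```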

other types of regularization

dropout

early stopping

  • Up to a certain point, gradient descent improves the model's performance on data outside of the training set.
  • It provides guidance as to how many iterations can be run before the model begins to overfit; a minimal loop is sketched after this list.
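
A minimal early-stopping loop; `train_one_epoch` and `validation_loss` are hypothetical helpers not defined in these notes:

```python
import copy

def train_with_early_stopping(model, max_epochs=200, patience=10):
    """Stop training once the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_model = copy.deepcopy(model)
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)               # hypothetical: one pass over the training set
        val_loss = validation_loss(model)    # hypothetical: loss on the held-out validation set
        if val_loss < best_loss:
            best_loss = val_loss
            best_model = copy.deepcopy(model)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                        # validation loss stopped improving: stop early
    return best_model
```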

batch normalization

data augmentation

stochastic gradient descent

  • For a large training set, computing the loss and gradient over the entire training set is very slow.
  • Generally, each parameter update in SGD is computed w.r.t. a few training examples (a minibatch) as opposed to a single example; see the sketch after this list.
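
A sketch of a minibatch SGD loop on top of the earlier hypothetical `backprop`, `W`, and `b`; the batch size and learning rate are arbitrary:

```python
def sgd(train_x, train_y, epochs=10, batch_size=32, eta=0.1):
    """Minibatch stochastic gradient descent over (train_x, train_y)."""
    n = len(train_x)
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        order = rng.permutation(n)           # shuffle the examples each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            sum_gW = [np.zeros_like(Wl) for Wl in W]
            sum_gb = [np.zeros_like(bl) for bl in b]
            for idx in batch:
                gW, gb = backprop(train_x[idx], train_y[idx])
                for l in range(len(W)):
                    sum_gW[l] += gW[l]
                    sum_gb[l] += gb[l]
            # Update with the average gradient over the minibatch.
            for l in range(len(W)):
                W[l] -= eta * sum_gW[l] / len(batch)
                b[l] -= eta * sum_gb[l] / len(batch)
```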

using other activation functions

detecting over-fitting

If the loss on the training set keeps decreasing over time, showing no sign of converging, while the loss on the validation set starts to increase, the model is over-fitting.