Back propagation with manual derivation

Let’s consider a neural network

$L$ is the total number of layers.

$s_l$ is the number of units (excluding the bias unit) in layer $l$.

$(W,b)=(W_{ij}^{(l)},b_i^{(l)})_{l=1}^{L-1}$ are the model parameters, where $W_{ij}^{(l)}$ denotes the weight of the connection from unit $j$ in layer $l$ to unit $i$ in layer $l+1$, and $b_i^{(l)}$ is the bias associated with unit $i$ in layer $l+1$. $W^{(l)}$ has size $s_{l+1}\times s_l$ and $b^{(l)}$ has length $s_{l+1}$.

Forward propagation



Goal of training: learn parameters such that $h_{Wb}(x^{(k)})=y^{(k)}$ for the training examples $(x^{(k)},y^{(k)})$.
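The forward pass is not written out in these notes. A minimal sketch, assuming the usual recurrence $z^{(l+1)}=W^{(l)}a^{(l)}+b^{(l)}$, $a^{(l+1)}=\sigma(z^{(l+1)})$ with a sigmoid $\sigma$ and $a^{(1)}=x$; the function names and the 2-3-1 example network are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    """Return (pre_activations, activations) for every layer.

    weights[l][i][j] is the weight from unit j of one layer to unit i of the next;
    biases[l][i] is the bias of unit i in the next layer.
    """
    activations = [list(x)]          # activations[0] is the input itself
    pre_activations = []
    for W, b in zip(weights, biases):
        a_prev = activations[-1]
        # z_i = sum_j W_ij * a_j + b_i
        z = [sum(w_ij * a_j for w_ij, a_j in zip(row, a_prev)) + b_i
             for row, b_i in zip(W, b)]
        pre_activations.append(z)
        activations.append([sigmoid(z_i) for z_i in z])
    return pre_activations, activations

# Hypothetical network with 2 inputs, 3 hidden units and 1 output.
weights = [
    [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]],   # 3 x 2
    [[0.3, -0.1, 0.2]],                       # 1 x 3
]
biases = [[0.0, 0.1, -0.1], [0.05]]

zs, acts = forward([1.0, 2.0], weights, biases)
h = acts[-1]   # network output h_{Wb}(x)
print(h)
```

Each weight matrix has one row per unit of the next layer, which matches the $s_{l+1}\times s_l$ shape given above.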

Back propagation

Intuition: $\delta_j^{(l)}$ is the error of unit $j$ in layer $l$.



$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}$$

$$\frac{\partial J(W,b)}{\partial b_i^{(l)}}=\delta_i^{(l+1)}$$


last layer

using chain rule!



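Derivative w.r.t. $a_j^{(L)}$: $\frac{\partial J(W,b)}{\partial a_j^{(L)}}=a_j^{(L)}-y_j$

Define the error of unit $j$ in layer $l$ as $\delta_j^{(l)}=\frac{\partial J(W,b)}{\partial z_j^{(l)}}$. For the last layer ($l=L$):

$$
\begin{aligned}
\delta_j^{(L)} &= \frac{\partial J(W,b)}{\partial z_j^{(L)}} \\
&= \sum_k\frac{\partial J}{\partial a_k^{(L)}}\times\frac{\partial a_k^{(L)}}{\partial z_j^{(L)}} \\
&= (a_j^{(L)}-y_j)\,\sigma'(z_j^{(L)})
\end{aligned}
$$

Only the $k=j$ term of the sum is nonzero, because $a_k^{(L)}=\sigma(z_k^{(L)})$ depends only on $z_k^{(L)}$.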

The derivation for intermediate layers uses the same chain-rule argument.
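For reference, a sketch of the intermediate-layer case. It assumes the forward recurrence $z_i^{(l+1)}=\sum_j W_{ij}^{(l)}a_j^{(l)}+b_i^{(l)}$ and $a_j^{(l)}=\sigma(z_j^{(l)})$, which these notes do not state explicitly:

$$
\begin{aligned}
\delta_j^{(l)} &= \frac{\partial J(W,b)}{\partial z_j^{(l)}} \\
&= \sum_i \frac{\partial J}{\partial z_i^{(l+1)}}\times\frac{\partial z_i^{(l+1)}}{\partial a_j^{(l)}}\times\frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \\
&= \Big(\sum_i W_{ij}^{(l)}\,\delta_i^{(l+1)}\Big)\,\sigma'(z_j^{(l)})
\end{aligned}
$$

Here the sum runs over all units $i$ in layer $l+1$, since unit $j$ in layer $l$ feeds every one of them.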

Weight update

$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=a_j^{(l)}\delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial W}=\frac{\partial J}{\partial z}\times\frac{\partial z}{\partial W}$$

Bias update

$$\frac{\partial J(W,b)}{\partial b_{i}^{(l)}}=\delta_i^{(l+1)}$$

$$\frac{\partial J}{\partial b}=\frac{\partial J}{\partial z}\times\frac{\partial z}{\partial b}$$
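The weight and bias formulas can be checked numerically. The sketch below assumes a sigmoid activation, the squared-error loss $J=\frac{1}{2}\sum_j (a_j^{(L)}-y_j)^2$ (consistent with $\partial J/\partial a_j^{(L)}=a_j^{(L)}-y_j$ above), and the hidden-layer recursion $\delta_j^{(l)}=\big(\sum_i W_{ij}^{(l)}\delta_i^{(l+1)}\big)\sigma'(z_j^{(l)})$. All function names and the example network are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, weights, biases):
    activations = [list(x)]
    for W, b in zip(weights, biases):
        z = [sum(w * a for w, a in zip(row, activations[-1])) + b_i
             for row, b_i in zip(W, b)]
        activations.append([sigmoid(v) for v in z])
    return activations

def loss(x, y, weights, biases):
    out = forward(x, weights, biases)[-1]
    return 0.5 * sum((a - t) ** 2 for a, t in zip(out, y))

def backprop(x, y, weights, biases):
    acts = forward(x, weights, biases)
    # Last layer: delta_j = (a_j - y_j) * sigma'(z_j), using sigma'(z) = a * (1 - a).
    delta = [(a - t) * a * (1.0 - a) for a, t in zip(acts[-1], y)]
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    for l in reversed(range(len(weights))):
        a_prev = acts[l]
        # dJ/dW_ij = a_j * delta_i and dJ/db_i = delta_i
        grads_W[l] = [[d_i * a_j for a_j in a_prev] for d_i in delta]
        grads_b[l] = list(delta)
        if l > 0:
            W = weights[l]
            # Propagate the error one layer back.
            delta = [sum(W[i][j] * delta[i] for i in range(len(delta)))
                     * a_j * (1.0 - a_j)
                     for j, a_j in enumerate(a_prev)]
    return grads_W, grads_b

weights = [[[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2]], [[0.3, -0.1, 0.2]]]
biases = [[0.0, 0.1, -0.1], [0.05]]
x, y = [1.0, 2.0], [1.0]
grads_W, grads_b = backprop(x, y, weights, biases)

# Central finite difference on a single weight as an independent check.
eps = 1e-6
weights[0][1][0] += eps
loss_plus = loss(x, y, weights, biases)
weights[0][1][0] -= 2 * eps
loss_minus = loss(x, y, weights, biases)
weights[0][1][0] += eps   # restore the original value
numeric = (loss_plus - loss_minus) / (2 * eps)
print(grads_W[0][1][0], numeric)
```

The two printed numbers should agree to several decimal places; a mismatch usually points to an indexing error in the $\delta$ recursion.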

Training NNs in practice


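aka saturation

Take the sigmoid function as an example: when the input is large, the value of the activation function changes very little. The computed derivative is therefore very small, and the network updates slowly.

Choosing a non-sigmoid activation function addresses this problem.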

leaky ReLU, Maxout, ELU, ReLU, tanh…
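The saturation effect can be seen by evaluating the sigmoid derivative $\sigma'(z)=\sigma(z)(1-\sigma(z))$ at growing inputs. This is a small illustration, with ReLU included only as one example of a non-saturating alternative for positive inputs; the helper names are hypothetical:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_prime(z):
    # Derivative of max(0, z); taken as 0 at z == 0 by convention here.
    return 1.0 if z > 0 else 0.0

for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z={z:5.1f}  sigmoid'={sigmoid_prime(z):.6f}  relu'={relu_prime(z):.1f}")
```

The sigmoid derivative peaks at 0.25 for $z=0$ and falls toward zero as $z$ grows, while the ReLU derivative stays at 1 for every positive input.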

Vanishing gradient problem


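The gradient of the loss function can approach zero. As a result, the gradient gets smaller and smaller as it is propagated backward, and the network parameters barely update.

Possible solutions: batch normalization and residual networks (residual connections).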
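A rough illustration of why the gradient shrinks: in a chain of single-unit sigmoid layers, the derivative of the last activation with respect to the input is a product of one factor $w\,\sigma'(z)$ per layer, and $\sigma'(z)\le 0.25$. The single-unit chain, the shared weight, and the function name are simplifying assumptions made for this sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_factor_through_chain(x, weight, num_layers):
    """Product of weight * sigma'(z) over a chain of single-unit sigmoid layers."""
    a, factor = x, 1.0
    for _ in range(num_layers):
        z = weight * a
        a = sigmoid(z)
        factor *= weight * a * (1.0 - a)   # sigma'(z) = a * (1 - a)
    return factor

for n in (1, 5, 10, 20):
    print(n, gradient_factor_through_chain(0.5, 1.0, n))
```

With a weight of 1 the factor is bounded by $0.25^n$, so it decays geometrically with depth.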


With too many parameters and too little training data, the model can overfit:

it fails to generalize to new data.

L2 regularization

weight decay


the partial derivative gets an extra term:

$$\frac{\partial J(W,b)}{\partial W_{ij}^{(l)}}=\frac{1}{n}\Delta W_{ij}^{(l)}+\lambda W_{ij}^{(l)}$$
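A small sketch of the regularized gradient for a single weight. It assumes that $\Delta W_{ij}^{(l)}$ is the sum of the per-example gradients over the $n$ training examples and that $\lambda$ is the regularization strength; neither is defined explicitly in these notes, and the function names and learning rate are hypothetical:

```python
def l2_regularized_gradient(sum_grad, weight, n, lam):
    """(1/n) * DeltaW + lambda * W for one weight."""
    return sum_grad / n + lam * weight

def gradient_step(weight, sum_grad, n, lam, lr):
    """One gradient descent step using the regularized gradient."""
    return weight - lr * l2_regularized_gradient(sum_grad, weight, n, lam)

# Data term 4.0 summed over n = 8 examples, current weight 2.0, lambda = 0.1.
grad = l2_regularized_gradient(4.0, 2.0, 8, 0.1)

# With a zero data gradient the weight still shrinks: this is the "decay".
decayed = gradient_step(2.0, 0.0, 8, 0.1, lr=0.5)
print(grad, decayed)
```

When the data term is zero, each step multiplies the weight by $(1-\text{lr}\cdot\lambda)$, which is where the name weight decay comes from.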

other types of regularization


early stopping

  • Up to a certain point, gradient descent improves the model's performance on data outside of the training set.
  • Early stopping rules provide guidance as to how many iterations can be run before the model begins to overfit.
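One common way to turn this into a rule is a patience counter on the validation loss: stop once the loss has not improved for a fixed number of epochs and keep the best epoch seen so far. The sketch below is one such rule, not the only one; the function name, the `patience` parameter, and the loss values are made up for illustration:

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the best epoch seen before training would stop.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Returns -1 for an empty history.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break   # no improvement for `patience` epochs
    return best_epoch

# Made-up validation losses: improvement until epoch 3, then degradation.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.40]
stop = early_stopping_epoch(val_losses, patience=2)
print(stop)
```

Note that the rule never sees the last value in this history, because training has already stopped by then; a larger `patience` trades extra epochs for robustness to such late improvements.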

batch normalization

data augmentation

stochastic gradient descent

  • For a large training set, computing the loss and gradient over the entire training set is very slow.
  • In practice, each parameter update in SGD is usually computed w.r.t. a few training examples (a minibatch) rather than a single example.
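A minimal minibatch SGD sketch on a toy problem. A one-dimensional linear model with squared error stands in for the network so the example stays short; the data, learning rate, batch size, and function name are all illustrative assumptions:

```python
import random

def minibatch_sgd(data, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x + b with squared error, one update per minibatch."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)   # new minibatch composition every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient averaged over the minibatch only, not the full set.
            grad_w = sum((w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum((w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(20)]
w, b = minibatch_sgd(data)
print(w, b)
```

Each update touches only `batch_size` examples, so the cost per step does not grow with the size of the training set.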

using other activation functions

detecting over-fitting

Overfitting is indicated when the loss on the training set keeps decreasing over time, showing no tendency to converge,

while the loss on the validation set increases.
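This check can be automated by comparing the two loss curves epoch by epoch. The sketch below flags the first epoch from which training loss falls while validation loss rises for several epochs in a row; the `window` length, the function name, and the loss values are arbitrary choices for illustration:

```python
def overfitting_start(train_losses, val_losses, window=3):
    """Return the first epoch of a run of `window` consecutive epochs in which
    training loss decreased while validation loss increased, or None."""
    run = 0
    for epoch in range(1, min(len(train_losses), len(val_losses))):
        train_down = train_losses[epoch] < train_losses[epoch - 1]
        val_up = val_losses[epoch] > val_losses[epoch - 1]
        run = run + 1 if (train_down and val_up) else 0
        if run >= window:
            return epoch - window + 1
    return None

# Made-up curves: validation loss turns upward at epoch 3.
train = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2]
val = [1.1, 0.9, 0.8, 0.85, 0.9, 0.95, 1.0]
print(overfitting_start(train, val))
```

Requiring several consecutive epochs avoids reacting to a single noisy validation measurement.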