Computer Vision W36: Machine Learning Fundamentals
Linear regression
goal: build a model that takes a vector x as input and predicts the output y. The output is a linear function of the input.
Model and loss function
model: $\hat{y}_i = w^T x_i$
find a choice of $w$ such that $\hat{y}_i$ is as close as possible to $y_i$
search for $w$ that minimizes the L2 norm: $\lVert \hat{y} - y \rVert_2$
loss function: $J(w) = \frac{1}{2} \sum_i (w^T x_i - y_i)^2$
Why multiply by the coefficient 1/2? Because differentiating the squared term produces a factor of 2, which the 1/2 cancels.
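A minimal NumPy sketch of the model and loss above; the array shapes and function names are my own assumptions, not from the slides.

```python
import numpy as np

# X: (N, D) data matrix, y: (N,) targets, w: (D,) weight vector (assumed shapes)

def predict(X, w):
    # Linear model: y_hat_i = w^T x_i for every sample
    return X @ w

def loss(X, y, w):
    # L2 loss with the 1/2 factor: J(w) = 1/2 * sum_i (w^T x_i - y_i)^2
    residual = predict(X, w) - y
    return 0.5 * np.sum(residual ** 2)
```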
Gradient descent
how to minimize the value of the loss function?
Random search - very bad idea ❎
gradient descent - progress is slow but steady ✅
update rule: $w \leftarrow w - \alpha \nabla_w J(w)$
α: learning rate / step size
j-th element of the gradient: $\frac{\partial J}{\partial w_j} = \sum_i (w^T x_i - y_i)\, x_{ij}$ (according to the chain rule)
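A minimal sketch of batch gradient descent for the linear-regression loss above; the learning rate and iteration count are arbitrary illustration values.

```python
import numpy as np

def gradient(X, y, w):
    # dJ/dw_j = sum_i (w^T x_i - y_i) * x_ij  (chain rule), vectorized over j
    return X.T @ (X @ w - y)

def gradient_descent(X, y, alpha=1e-3, num_iters=1000):
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        w -= alpha * gradient(X, y, w)   # w <- w - alpha * grad J(w)
    return w
```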
limitations of gradient descent
- slow convergence (can be sped up with momentum, …)
- local minima (may fail to find the global minimum)
Learning rate
- very high lr: the loss grows without bound
- high lr: the loss falls, but plateaus at a fairly high value
- low lr: the loss falls very slowly; hard to converge in reasonable time
- good lr: the loss falls and converges to a relatively low value
under-fitting & over-fitting
Hyper-parameters
a param whose val is set before the learning process.
e.g. degree of the polynomial, learning rate, …
Hyper-param search
idea 1: choose the hyper-parameters that work best on all of the data
idea 2: split the data into train and test sets; choose the hyper-parameters that work best on the test data
idea 3: split the data into train, validation and test sets; choose the hyper-parameters that work best on the validation set and evaluate on the test set at the end (see the sketch after this list)
idea 4: cross-validation: split the data into folds, try each fold as the validation set and average the results. (useful for small datasets, but not used very often in deep learning)
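A minimal sketch of idea 3 (train/validation/test split and picking the best hyper-parameter on the validation set). The split ratios, the `train_and_evaluate` callback and the candidate values are illustrative assumptions, not from the slides.

```python
import numpy as np

def split(X, y, train=0.7, val=0.15, seed=0):
    # Shuffle once, then cut into train / validation / test portions
    idx = np.random.default_rng(seed).permutation(len(X))
    n_tr, n_val = int(train * len(X)), int(val * len(X))
    tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def select_hyperparam(candidates, train_and_evaluate, train_set, val_set):
    # Pick the hyper-parameter with the best validation score;
    # the test set is only touched once, at the very end.
    scores = {h: train_and_evaluate(h, train_set, val_set) for h in candidates}
    return max(scores, key=scores.get)
```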
Logistic Regression
classifiers: use features to distinguish between 2 or more classes
regressors: use features to predict some functional relationship
logistic regression is a classification algorithm.
Model
goal: to predict one of two possible labels, s.t. $y \in \{0, 1\}$
candidate solution: $\hat{y} = \sigma(w^T x)$
sigmoid (or logistic) function: $\sigma(z) = \frac{1}{1 + e^{-z}}$; interpret $\hat{y}$ as a probability.
convention: if $\hat{y} > 0.5$, let y be 1, else let y be 0.
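A minimal sketch of the logistic-regression model above; the function names and shapes are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w):
    # Probability that each sample belongs to class 1
    return sigmoid(X @ w)

def predict_label(X, w, threshold=0.5):
    # Convention from the notes: y = 1 if the probability exceeds 0.5, else 0
    return (predict_proba(X, w) > threshold).astype(int)
```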
Loss function
cross-entropy: $J(w) = -\sum_i \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$
we need to minimize this loss
why the negative sign?
We want the loss to be positive. The predicted probabilities lie in the range (0, 1), where the log is negative, so negating makes the loss non-negative.
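A minimal NumPy sketch of this binary cross-entropy loss; the small clipping epsilon is my own addition to avoid log(0).

```python
import numpy as np

def cross_entropy(y_true, y_prob, eps=1e-12):
    # J(w) = -sum_i [ y_i*log(y_hat_i) + (1 - y_i)*log(1 - y_hat_i) ]
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.sum(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
```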
Optimization
the same as linear regression
Loss function derivation
Loss function
(I couldn't quite follow these two slides…)
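A hedged reconstruction of what those slides most likely derive: the gradient of the cross-entropy loss with respect to $w$, using the sigmoid derivative $\sigma'(z) = \sigma(z)(1 - \sigma(z))$.

```latex
% Assuming \hat{y}_i = \sigma(w^\top x_i):
\begin{align*}
\frac{\partial J}{\partial w_j}
  &= -\sum_i \left( \frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i} \right)
     \hat{y}_i \bigl(1 - \hat{y}_i\bigr)\, x_{ij} \\
  &= -\sum_i \bigl( y_i (1 - \hat{y}_i) - (1 - y_i)\,\hat{y}_i \bigr)\, x_{ij} \\
  &= \sum_i \bigl( \hat{y}_i - y_i \bigr)\, x_{ij}
\end{align*}
```

The end result has the same form as the linear-regression gradient, which is why the optimization is "the same as linear regression".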
Regularization
Weight Decay
in order to solve an ill-posed problem or to prevent over-fitting
several different approaches to solving these problems:
- data augmentation
- early stopping
- dropout
- batch normalization
- weight decay
- sparsity
the loss gets an extra term: $J_{\text{reg}}(w) = J(w) + \lambda R(w)$, where R(w) is either the L1-norm or the L2-norm of w.
𝜆 determines the trade-off between minimizing the data loss and minimizing the model parameters 𝑤.
avoids over-fitting: we don't fit the noise in the data.
by keeping some weights small, the regularization term makes the model simpler, thereby avoiding over-fitting.
L1 vs L2
L1 regularization:
- sparse solutions (drives many entries of w to 0)
- feature selection
- [1,1,1,1] → [1,0,0,0] or [0,1,0,0] or [0,0,1,0] or [0,0,0,1] (prefers many zero weights)
L2 regularization:
- faster than L1 regularization
- aka weight decay
- non-sparse
- [1,1,1,1] → [0.25,0.25,0.25,0.25] (prefers to spread the weights; see the sketch below)
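A minimal sketch of adding a regularization term to the data loss; `lam` stands for the trade-off coefficient λ from the notes, and the function names are my own.

```python
import numpy as np

def l1_penalty(w, lam):
    # R(w) = ||w||_1: pushes many weights exactly to zero (sparse solutions)
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    # R(w) = ||w||_2^2: shrinks all weights toward zero ("weight decay")
    return lam * np.sum(w ** 2)

def regularized_loss(data_loss, w, lam, kind="l2"):
    penalty = l1_penalty(w, lam) if kind == "l1" else l2_penalty(w, lam)
    return data_loss + penalty
```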
Softmax regression
multi-class classification
idea: how to convert the feature vector x' into class probabilities? use the softmax function
search for weights that minimize the difference between
- the output vector of predicted probabilities: $\hat{y}_k = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}$
- the target vector of true probabilities (a one-hot vector)
where there are K classes
why $e^{w_k^T x}$ instead of $w_k^T x$?
Probabilities must be non-negative. Applying exp ensures that the numbers are positive.
what's the purpose of the denominator $\sum_{j=1}^{K} e^{w_j^T x}$?
It normalizes the probability distribution, so that it sums to 1.
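A minimal sketch of the softmax function above; subtracting the maximum is a standard numerical-stability trick, not something stated in the notes.

```python
import numpy as np

def softmax(z):
    # z: vector of K scores z_k = w_k^T x; returns K class probabilities
    z = z - np.max(z)        # stabilize exp without changing the result
    e = np.exp(z)            # exp makes every entry positive
    return e / np.sum(e)     # denominator normalizes the sum to 1
```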
loss function: $J(W) = -\sum_i \sum_{k=1}^{K} 1\{y_i = k\}\, \log \hat{y}_{ik}$, where $\hat{y}_{ik}$ is the predicted probability of class $k$ for sample $i$
indicator function: 1{true statement} = 1, 1{false statement}=0
⚠️only one term in the inner sum will be non-zero
optimization
we cannot solve for the minimum of J(W) analytically, thus we resort to an iterative optimization algorithm
gradient for each k = 1, 2, …, K: $\nabla_{w_k} J(W) = -\sum_i x_i \left( 1\{y_i = k\} - \hat{y}_{ik} \right)$
$\nabla_{w_k} J(W)$ is a vector and has to be calculated for all K classes
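A minimal sketch of one gradient-descent step for softmax regression, with all K weight vectors stored as the columns of W. The shapes (W: (D, K), X: (N, D), y: (N,) labels in {0, …, K-1}) and names are my own assumptions.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def gradient_step(W, X, y, alpha=1e-2):
    P = softmax_rows(X @ W)        # (N, K) predicted probabilities
    Y = np.eye(W.shape[1])[y]      # (N, K) one-hot targets, i.e. 1{y_i = k}
    grad = X.T @ (P - Y)           # gradient w.r.t. each column w_k
    return W - alpha * grad        # one update of all K weight vectors
```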
There are some hard cases for a linear classifier
- two classes that interleave / cross each other
- class 1 surrounding class 2 surrounding class 1 (concentric regions)
- multiple regions of class 2 enclosed inside class 1
solutions
transform to polar coordinates
neural networks
These cases cannot be solved with softmax regression, because it is still a linear model, so we need neural networks.
K-Nearest Neighbors
image classification
basic nearest neighbor classifier
distance metrics to compare images: L1 norm, L2 norm
train: O(1), predict: O(N).
bad! we want classifiers that are fast at prediction; slow training is acceptable.
Instead of copying the label from the single nearest neighbor, take a majority vote over the K closest points.
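A minimal sketch of a K-nearest-neighbor classifier using the L1 distance and a majority vote; the array shapes and names are my own assumptions.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # L1 distance from the query x to every training image
    dists = np.sum(np.abs(X_train - x), axis=1)
    # Labels of the k closest training points
    nearest = y_train[np.argsort(dists)[:k]]
    # Majority vote among the k neighbors
    return Counter(nearest).most_common(1)[0][0]
```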
Hyper-parameters of KNN
- The best value of K?
- best distance metric?
problem-dependent!!!
Poor choice of features
- KNN is never used on raw pixel intensities
- very slow at test time
- distance metrics on pixels are not informative
Curse of dimensionality
In high-dimensional data, all objects appear sparse and dissimilar in many ways, so image classification based on KNN performs poorly when the features are raw pixels.
search becomes slow as dimensionality increases.
Preview of CNN
instead of using KNN: use a pre-trained CNN to obtain a better feature representation.
a non-linear transformation of pixel values into a representation that is optimal for classification
transfer learning
- pretrained encoder
- finetune weights of decoder: linear classifier performing softmax regression
Cluster analysis
vectors in the same group (a cluster) are more similar to each other than to vectors in other groups
e.g. K-Means (see the sketch below)
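A minimal sketch of K-Means clustering; the initialization scheme, iteration count and names are my own assumptions.

```python
import numpy as np

def kmeans(X, k, num_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # random init
    for _ in range(num_iters):
        # Assign each vector to its nearest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of the vectors assigned to it
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers
```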