https://www.bilibili.com/video/BV164411S78V
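Linear Regression and Gradient Descent
Notation:
\(m\) = number of training examples, \(n\) = number of features, \(x\) = input variable (feature), \(y\) = output (target) variable
\((x, y)\) = one training example; the \(i\)-th example is \((x^{(i)},y^{(i)})\)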
\(h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n\)
Let \(x_0 = 1\); then \(h_\theta(x) = \sum_{i=0}^{n}\theta_ix_i=\theta^T x\)
\(Minimize_{\theta}\ \ J(\theta) = \frac{1}{2m} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)
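(least-squares linear regression)
Initialize \(\theta = \boldsymbol{0}\). Note that \(\theta\) and each \(x^{(i)}\) are \((n+1)\)-dimensional vectors, and the targets \(y^{(i)}\) stack into an \(m\)-dimensional vector \(y\).
Batch Gradient Descent (each step uses all training examples; repeat until convergence; complexity \(O(knm)\) for \(k\) iterations):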
\(\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta) = \theta_i - \frac{\alpha}{m} \sum_{j = 1}^m (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\) (update every \(\theta_i\) simultaneously)
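The batch update can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `batch_gradient_descent`, the learning rate, the iteration count, and the toy data are all assumptions made for the example.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Minimize J(theta) = 1/(2m) * sum((X @ theta - y)^2).

    X is an (m, n+1) matrix whose first column is all ones (x_0 = 1).
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)            # theta starts at the zero vector
    for _ in range(num_iters):
        residual = X @ theta - y          # h_theta(x^(j)) - y^(j) for every j
        gradient = X.T @ residual / m     # (1/m) * sum_j residual_j * x_i^(j)
        theta = theta - alpha * gradient  # simultaneous update of every theta_i
    return theta

# Hypothetical toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = batch_gradient_descent(X, y, alpha=0.1, num_iters=5000)
```

Each iteration costs one pass over all \(m\) examples and \(n+1\) parameters, which is where the \(knm\) cost comes from.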
Stochastic Gradient Descent (each step uses a single pair \((x^{(j)},y^{(j)})\)):
For \(j := 1\) to \(m\): \(\theta_i := \theta_i - \alpha(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}\) (for all \(i\))
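A minimal sketch of the stochastic variant is shown below. Shuffling the examples each epoch, the name `stochastic_gradient_descent`, the step size, and the toy data are assumptions of this example rather than part of the notes.

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.05, num_epochs=500, seed=0):
    """Update theta from one training example at a time."""
    rng = np.random.default_rng(seed)
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_epochs):
        for j in rng.permutation(m):              # assumption: visit examples in random order
            error = X[j] @ theta - y[j]           # h_theta(x^(j)) - y^(j)
            theta = theta - alpha * error * X[j]  # update every theta_i from this one example
    return theta

# Hypothetical toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = stochastic_gradient_descent(X, y)
```

On noisy data the iterates typically hover around the minimum rather than settling exactly, so a decaying step size is a common refinement.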
Normal equation (complexity about \(O(n^3)\), dominated by inverting the \((n+1)\times(n+1)\) matrix \(X^TX\)): \(\theta = (X^TX)^{-1}X^Ty\)
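The closed-form solution can be sketched as below. The function name `normal_equation` and the toy data are assumptions; the sketch solves the linear system \((X^TX)\theta = X^Ty\) with `np.linalg.solve` instead of forming the inverse explicitly, which gives the same \(\theta\) when \(X^TX\) is invertible.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y for theta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x
theta = normal_equation(X, y)
```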
Feature scaling: \(x_i := \frac{x_i - \mu_i}{s_i}\) (\(\mu_i\) is the mean of feature \(x_i\); \(s_i\) is its range or its standard deviation)
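A short sketch of feature scaling, using the standard deviation for \(s_i\). The function name `scale_features` and the sample matrix are hypothetical, and the sketch assumes no feature is constant (otherwise \(s_i = 0\)). The bias column \(x_0 = 1\) should be added after scaling, not before.

```python
import numpy as np

def scale_features(X):
    """Return the scaled features together with mu and s, so new inputs can be scaled the same way."""
    mu = X.mean(axis=0)   # per-feature mean
    s = X.std(axis=0)     # per-feature standard deviation (the range would also work)
    return (X - mu) / s, mu, s

# Hypothetical features on very different scales.
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled, mu, s = scale_features(X)
```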
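Logistic Regression
Binary classification:
\(h_\theta(x) = \frac{1}{1 + e^{-\theta^Tx}}\). Keeping the squared-error cost with this hypothesis makes \(J(\theta)\) non-convex, so gradient descent can easily converge to a point that is not the global minimum.
Previous (squared-error) cost: \(Cost(h_\theta(x), y) = \frac{1}{2} (h_\theta(x)-y)^2\)
Redefined cost: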
\(Cost(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & if\ y = 1 \\ -\log(1 - h_\theta(x)) & if\ y = 0 \end{cases}\)
Substituting gives \(J(\theta) = \frac{1}{m} \sum_{i = 1}^m Cost(h_\theta(x^{(i)}),y^{(i)}) = -\frac{1}{m}\sum_{i = 1}^m [y^{(i)} \log (h_\theta(x^{(i)})) + (1-y^{(i)}) \log (1 - h_\theta(x^{(i)}))]\)
\(\theta_i := \theta_i - \frac{\alpha}{m} \sum_{j = 1}^m (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\)
(The update has the same form as in linear regression; only \(h_\theta\) differs.)
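The cost and update above can be sketched as follows. The helper names, the learning rate, and the toy data (deliberately not linearly separable, so the minimum is finite) are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h))."""
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=5000):
    m = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(num_iters):
        # Same form as the linear regression update, with h = sigmoid(X @ theta).
        theta = theta - alpha / m * (X.T @ (sigmoid(X @ theta) - y))
    return theta

# Hypothetical one-feature data with overlapping classes.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
theta = logistic_gradient_descent(X, y)
```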
Multi-class (one-vs-all) classifier: \(h_\theta^{(i)}(x) = P(y = i|x; \theta)\ \ \ (i = 1, 2, \cdots)\)
For each input \(x\), predict the class \(\arg\max_i h_\theta^{(i)}(x)\)
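A minimal sketch of the prediction step, assuming the per-class parameter vectors have already been trained and are stacked as the rows of a matrix `Theta` (a hypothetical name, as are `predict_one_vs_all` and the hand-picked numbers below).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_one_vs_all(Theta, X):
    """Theta has one row of parameters per class; X has one example per row (with x_0 = 1)."""
    probabilities = sigmoid(X @ Theta.T)         # entry (r, i) is h^(i) for example r
    return np.argmax(probabilities, axis=1) + 1  # classes are numbered 1, 2, ...

# Hypothetical parameters for three classes and three inputs.
Theta = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, -1.0, -1.0]])
X = np.array([[1.0, 3.0, 0.0],
              [1.0, 0.0, 3.0],
              [1.0, 0.0, 0.0]])
labels = predict_one_vs_all(Theta, X)
```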
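Regularization: to avoid overfitting, penalize large coefficients. For example, adding \(1000\theta_i^2\) to \(J(\theta)\) for a high-order coefficient \(\theta_i\) forces it toward zero, which effectively removes that term (\(\theta_0\) does not need to be penalized).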
\(J(\theta) = \frac{1}{2m} [\sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j = 1}^n \theta_j^2]\)
\(\theta_0 := \theta_0 - \frac{\alpha}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)}\)
\(\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \frac{\alpha}{m} \sum_{i = 1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}\)
\(\theta = \Big(X^TX + \lambda \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix}\Big)^{-1}X^Ty\), where the matrix multiplying \(\lambda\) is \((n+1)\times(n+1)\)
Even if \(m \leqslant n\), the matrix being inverted is invertible as long as \(\lambda > 0\).
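The regularized closed form can be sketched as below. The function name `regularized_normal_equation` and the sample data are assumptions; the example deliberately uses fewer examples than features (\(m = 2\), \(n = 3\)) to show that the system is still solvable when \(\lambda > 0\).

```python
import numpy as np

def regularized_normal_equation(X, y, lam):
    """Solve (X^T X + lam * P) theta = X^T y, where P is the identity with P[0, 0] = 0."""
    penalty = np.eye(X.shape[1])
    penalty[0, 0] = 0.0                  # theta_0 is not regularized
    return np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)

# Hypothetical data: 2 examples, 3 features plus the bias column x_0 = 1.
X = np.array([[1.0, 2.0, 0.5, 1.0],
              [1.0, 1.0, 3.0, 2.0]])
y = np.array([1.0, 2.0])
lam = 1.0
theta = regularized_normal_equation(X, y, lam)
```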
Neural Networks
\(L\): number of layers; \(s_l\): number of units in layer \(l\) (not counting the bias unit)
Binary classification: \(y = 0\ \text{or}\ 1\), \(s_L = K = 1\)
Multi-class classification: \(y \in \mathbb{R}^K\), \(s_L = K\)
\(J(\theta) = -\frac{1}{m}\Big[\sum_{i = 1}^m \sum_{k = 1}^K y_k^{(i)} \log(h_\theta(x^{(i)}))_k + (1 - y_k^{(i)}) \log(1 - (h_\theta(x^{(i)}))_k)\Big] + \frac{\lambda}{2m}\sum_{l = 1}^{L - 1}\sum_{i = 1}^{s_l}\sum_{j = 1}^{s_{l+1}}(\theta_{ji}^{(l)})^2\)
\(a_j^{(l)}\): activation of unit \(j\) in layer \(l\); \(z^{(l+1)} = \theta^{(l)} a^{(l)}\), \(a^{(l+1)} = g(z^{(l+1)})\), where here \(g(z) = \frac{1}{1 + e^{-z}}\)
\(cost(i) = y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log (1 - h_\theta(x^{(i)}))\)
\(\delta_j^{(l)}\) is the error attributed to \(a_j^{(l)}\): \(\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}}cost(i)\ (j \geqslant 0)\)
A longer derivation gives \(\delta^{(l)} = (\theta^{(l)})^T \delta^{(l+1)} \odot g'(z^{(l)})\) and \(g'(z^{(l)}) = a^{(l)} \odot (1 - a^{(l)})\), where \(\odot\) is the element-wise product, together with the accumulator update \(\Delta_{ij}^{(l)} := \Delta_{ij}^{(l)} + a_j^{(l)}\delta_i^{(l+1)}\)
For each example \((x^{(i)}, y^{(i)})\): run forward propagation to obtain every \(a^{(l)}\), compute the output-layer error \(\delta^{(L)}\), then run back propagation to obtain \(\delta^{(2 \sim L-1)}\) and accumulate \(\Delta^{(1 \sim L-1)}\). After all \(m\) examples, the partial derivatives of the cost function are:
\(D_{ij}^{(l)} := \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m} \theta_{ij}^{(l)}\ \ \ if\ j \neq 0\)
\(D_{ij}^{(l)} := \frac{1}{m} \Delta_{ij}^{(l)}\ \ \ if\ j = 0\)
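The forward and backward passes can be sketched in NumPy as below. This is an illustration under stated assumptions, not the course's code: it processes all \(m\) examples at once instead of looping, it takes the output-layer error to be \(\delta^{(L)} = a^{(L)} - y\) (the standard result for a sigmoid output with the cross-entropy cost), and it scales the regularization term by \(\lambda/m\) so that the gradient matches the \(\frac{\lambda}{2m}\) penalty in \(J(\theta)\). The names `nn_cost`, `nn_gradients`, and `add_bias` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def add_bias(a):
    """Prepend the bias unit a_0 = 1 to every row of a."""
    return np.hstack([np.ones((a.shape[0], 1)), a])

def nn_cost(thetas, X, Y, lam):
    """Regularized cross-entropy cost for a fully connected sigmoid network."""
    m = X.shape[0]
    a = X
    for theta in thetas:                      # forward propagation
        a = sigmoid(add_bias(a) @ theta.T)
    data_term = -np.sum(Y * np.log(a) + (1 - Y) * np.log(1 - a)) / m
    reg_term = lam / (2 * m) * sum(np.sum(t[:, 1:] ** 2) for t in thetas)
    return data_term + reg_term

def nn_gradients(thetas, X, Y, lam):
    """Return one gradient matrix D per weight matrix, computed by backpropagation."""
    m = X.shape[0]
    activations = []                          # input to each weight matrix, with bias column
    a = X
    for theta in thetas:                      # forward propagation
        a_bias = add_bias(a)
        activations.append(a_bias)
        a = sigmoid(a_bias @ theta.T)
    delta = a - Y                             # assumed output-layer error
    grads = [None] * len(thetas)
    for l in range(len(thetas) - 1, -1, -1):  # back propagation
        grads[l] = delta.T @ activations[l] / m          # (1/m) * Delta
        grads[l][:, 1:] += lam / m * thetas[l][:, 1:]    # no penalty on the bias column
        if l > 0:
            a_prev = activations[l][:, 1:]               # activations without the bias unit
            delta = (delta @ thetas[l][:, 1:]) * a_prev * (1 - a_prev)
    return grads

# Hypothetical network: 2 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
thetas = [rng.uniform(-0.5, 0.5, (3, 3)), rng.uniform(-0.5, 0.5, (2, 4))]
X = rng.normal(size=(5, 2))
Y = rng.integers(0, 2, size=(5, 2)).astype(float)
grads = nn_gradients(thetas, X, Y, lam=0.3)
```

Each matrix in `grads` has the same shape as the corresponding weight matrix, so it can be plugged directly into a gradient descent update.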
Gradient checking: \(\frac{\partial}{\partial \theta_i}J(\theta) \approx \frac{J(\theta_1, \cdots, \theta_i+\epsilon, \cdots, \theta_n) - J(\theta_1, \cdots, \theta_i - \epsilon, \cdots, \theta_n)}{2\epsilon}\)
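The two-sided difference can be sketched as follows. The function name `numerical_gradient`, the value of \(\epsilon\), and the quadratic test cost are assumptions chosen so the result can be compared against a known gradient.

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Approximate dJ/dtheta_i with a central difference for every i."""
    grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps                         # perturb only theta_i
        grad[i] = (J(theta + step) - J(theta - step)) / (2 * eps)
    return grad

# Hypothetical cost with a known gradient, used only to exercise the check.
def J(theta):
    return np.sum(theta ** 2)

theta = np.array([1.0, -2.0, 0.5])
approx = numerical_gradient(J, theta)
analytic = 2 * theta
```

Because it evaluates \(J\) twice per parameter, this check is only practical for verifying a backpropagation implementation once, not for training.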
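Random initialization: draw each \(\theta_{ij}^{(l)}\) independently at random from \([-\epsilon, \epsilon]\), so that units in the same layer do not all start out identical.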
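A small sketch of the initialization. The name `random_init` and the default \(\epsilon = 0.12\) are assumptions; note that NumPy's `uniform` samples from the half-open interval \([-\epsilon, \epsilon)\).

```python
import numpy as np

def random_init(s_in, s_out, eps=0.12, rng=None):
    """Weights for a layer with s_in inputs (plus bias) and s_out outputs."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(-eps, eps, size=(s_out, s_in + 1))

theta1 = random_init(3, 5, rng=np.random.default_rng(0))
```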
Machine learning diagnostics:
0/1 misclassification error
\(err(h_\theta(x), y) = \begin{cases} 1 & \text{if } h_\theta(x) \geqslant 0.5 & , y = 0 \\ & \text{or if } h_\theta(x)<0.5 &, y = 1 \\ 0 & \text{otherwise} \end{cases}\)
\(Test\ error = \frac{1}{m_{test}} \sum_{i = 1}^{m_{test}} err(h_\theta(x_{test}^{(i)}), y_{test}^{(i)})\)
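The test error above amounts to thresholding the hypothesis at 0.5 and counting mismatches. A minimal sketch, with a hypothetical function name and made-up outputs:

```python
import numpy as np

def misclassification_error(h, y):
    """Fraction of examples whose thresholded prediction differs from the label."""
    predictions = (h >= 0.5).astype(int)
    return np.mean(predictions != y)

# Hypothetical hypothesis outputs and labels for four test examples.
h_test = np.array([0.9, 0.2, 0.6, 0.4])
y_test = np.array([1, 0, 0, 1])
test_error = misclassification_error(h_test, y_test)
```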
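Training set 60%, cross-validation set 20%, test set 20%
Choose the polynomial degree \(d\) that gives low error and generalizes well as the final model.
For each candidate degree, first train by minimizing \(J_{train}(\theta)\); then pick the degree \(d\) whose fitted \(\theta^{(d)}\) has the smallest \(J_{cv}(\theta)\); finally measure generalization on the test set.
High bias: underfitting. \(J_{train}(\theta)\) and \(J_{cv}(\theta)\) are both high.
High variance: overfitting. \(J_{train}(\theta)\) is low but \(J_{cv}(\theta)\) is high.
The same procedure can be used to choose the regularization parameter \(\lambda\).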
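The selection procedure can be sketched as below, using `np.polyfit` as the least-squares trainer. The synthetic data (a noisy quadratic), the candidate degrees, and the helper name `half_mse` are assumptions made for the example.

```python
import numpy as np

def half_mse(coeffs, x, y):
    """J = 1/(2m) * sum((h(x) - y)^2) for a polynomial hypothesis."""
    residual = np.polyval(coeffs, x) - y
    return np.sum(residual ** 2) / (2 * len(x))

# Hypothetical data: a noisy quadratic.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 - 2.0 * x + 3.0 * x ** 2 + rng.normal(scale=0.05, size=50)

# 60% / 20% / 20% split after shuffling.
order = rng.permutation(50)
train, cv, test = order[:30], order[30:40], order[40:]

degrees = [1, 2, 3, 4, 5]
fits = {d: np.polyfit(x[train], y[train], d) for d in degrees}       # minimize J_train
cv_errors = {d: half_mse(fits[d], x[cv], y[cv]) for d in degrees}    # J_cv per degree
best_d = min(cv_errors, key=cv_errors.get)                           # smallest J_cv
test_error = half_mse(fits[best_d], x[test], y[test])                # generalization estimate
```

The test set is touched only once, after `best_d` is fixed, so `test_error` is not biased by the selection step.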
Evaluation for skewed classes:
Precision: \(\frac{\text{True positives}}{\#\text{ predicted positives}} = \frac{\text{True pos}}{\text{True pos} + \text{False pos}}\)
Recall: \(\frac{\text{True positives}}{\#\text{ actual positives}} = \frac{\text{True pos}}{\text{True pos} + \text{False neg}}\)
Common single-number metric: \(F_1\) score \(= \frac{2PR}{P + R}\), where \(P\) is precision and \(R\) is recall
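These three metrics can be computed directly from the confusion counts. In the sketch below the function name and the labels are hypothetical, and the code assumes there is at least one predicted positive and at least one actual positive (otherwise a denominator is zero).

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical skewed data: 3 positives among 8 examples.
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and \(F_1\) all equal \(2/3\).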
Support Vector Machine
Logistic regression:
\(J(\theta) = \frac{1}{m}[\sum_{i = 1}^m y^{(i)}(-\log h_\theta(x^{(i)})) + (1 - y^{(i)})(-\log (1 - h_\theta(x^{(i)})))] + \frac{\lambda}{2m}\sum_{j = 1}^n \theta_j^2\)
SVM:
\(\min_\theta C\sum_{i = 1}^m[y^{(i)}cost_1(\theta^Tx^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tx^{(i)})] + \frac{1}{2}\sum_{j = 1}^n \theta_j^2\)
To predict \(y = 1\) with a margin we want \(\theta^Tx^{(i)} \geqslant 1\); to predict \(y = 0\) we want \(\theta^Tx^{(i)} \leqslant -1\).
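The notes do not define \(cost_1\) and \(cost_0\). The sketch below assumes the common hinge-shaped choices \(cost_1(z) = \max(0, 1 - z)\) and \(cost_0(z) = \max(0, 1 + z)\), which are zero exactly when the margin conditions above hold; the function names and sample values are hypothetical.

```python
import numpy as np

def cost_1(z):
    """Assumed penalty for y = 1: zero once z >= 1."""
    return np.maximum(0.0, 1.0 - z)

def cost_0(z):
    """Assumed penalty for y = 0: zero once z <= -1."""
    return np.maximum(0.0, 1.0 + z)

def svm_objective(theta, X, y, C):
    z = X @ theta
    data_term = C * np.sum(y * cost_1(z) + (1 - y) * cost_0(z))
    reg_term = 0.5 * np.sum(theta[1:] ** 2)      # theta_0 is not penalized
    return data_term + reg_term

# Hypothetical data: one positive and one negative example.
X = np.array([[1.0, 1.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
wide_margin = svm_objective(np.array([0.0, 2.0]), X, y, C=1.0)    # both margins satisfied
narrow_margin = svm_objective(np.array([0.0, 0.5]), X, y, C=1.0)  # both margins violated
```

With \(\theta = (0, 2)\) only the regularization term remains; with \(\theta = (0, 0.5)\) each example contributes a penalty of 0.5.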
Kernel: \(f_i = \text{similarity}(x, l^{(i)}) = \exp\left(-\frac{\parallel x - l^{(i)}\parallel^2}{2\sigma^2}\right)\)
Training: \(\min_\theta C\sum_{i = 1}^m [y^{(i)}cost_1(\theta^T f^{(i)}) + (1 - y^{(i)})cost_0(\theta^Tf^{(i)})] + \frac{1}{2}\sum_{j = 1}^n \theta_j^2\) (\(n = m\), since each training example serves as a landmark)
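The kernel features can be sketched as follows, taking the training examples themselves as landmarks so that each example gets \(m\) features. The function name `gaussian_kernel_features`, the value of \(\sigma\), and the two sample points are assumptions.

```python
import numpy as np

def gaussian_kernel_features(X, landmarks, sigma):
    """Entry (r, i) is exp(-||x^(r) - l^(i)||^2 / (2 * sigma^2))."""
    diffs = X[:, None, :] - landmarks[None, :, :]   # pairwise differences
    sq_dists = np.sum(diffs ** 2, axis=2)           # squared Euclidean distances
    return np.exp(-sq_dists / (2 * sigma ** 2))

# Hypothetical training inputs, reused as the landmarks.
X = np.array([[0.0, 0.0], [3.0, 4.0]])
F = gaussian_kernel_features(X, X, sigma=5.0)
```

Each example has similarity 1 with itself, and the two points here are at distance 5, so their similarity is \(\exp(-25/50) = \exp(-0.5)\).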