https://open.163.com/newview/movie/free?pid=IEU2H8NIJ&mid=VEU2H8NKA
Supervised Learning & Gradient Descent
Notation:
- \(m\) = number of training examples
- \(n\) = number of features
- \(x\) = input variables / features
- \(y\) = output variable / target variable
- \((x, y)\) = a training example
- The i-th training example: \((x^{(i)},y^{(i)})\)
\(h(x)=h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n\)
Let \(x_0 = 1\); then
\(h(x) = \sum_{i=0}^{n}\theta_ix_i=\theta^T x\)
\(Minimize_{\theta}\ \ J(\theta) = \frac{1}{2} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)
(Least squares linear regression)
Start with \(\theta = \boldsymbol{0}\)
Notice: \(\theta, x, y\) are all vectors now.
Batch Gradient Descent:
\(\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta) = \theta_i - \alpha \sum_{j = 1}^{m} (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\)
(Repeat until convergence)
Stochastic Gradient Descent:
For j := 1 to m: \(\theta_i := \theta_i - \alpha(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}\) (for all \(i\))
(one \((x,y)\) for one step)
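A minimal NumPy sketch of both update rules (assumptions: the design matrix `X` already contains the \(x_0 = 1\) column, and the function names are illustrative only):

```python
import numpy as np

def batch_gd(X, y, alpha=0.01, iters=1000):
    """Batch gradient descent: every step uses all m training examples."""
    theta = np.zeros(X.shape[1])                 # start with theta = 0
    for _ in range(iters):
        grad = X.T @ (X @ theta - y)             # sum_j (h(x^(j)) - y^(j)) * x^(j)
        theta -= alpha * grad
    return theta

def sgd(X, y, alpha=0.01, epochs=10):
    """Stochastic gradient descent: one (x, y) pair per step."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in range(X.shape[0]):
            err = X[j] @ theta - y[j]            # h(x^(j)) - y^(j)
            theta -= alpha * err * X[j]
    return theta
```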
Def:\[\nabla_\theta J=\begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix}\]
So, gradient descent: \(\theta := \theta - \alpha \nabla_\theta J\ \ (\theta, \nabla_\theta J \in \mathbb{R}^{n+1})\)
\(f(A):\mathbb{R}^{m \times n} \mapsto \mathbb{R}\)
Def: \(\nabla_A f(A)=\begin{bmatrix} \frac{\partial f}{\partial A_{1,1}} & \cdots & \frac{\partial f }{\partial A_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m,1}} & \cdots & \frac{\partial f}{\partial A_{m,n}}\end{bmatrix}\)
\(tr\ A = \sum_{i=1}^{n} A_{i,i}\ \ (A \in \mathbb{R}^{n \times n})\)
Fact:
\(tr\ AB=tr\ BA\)
\(tr\ ABC = tr\ CAB = tr\ BCA\)
\(f(A) = tr\ AB,\nabla_A tr\ AB = B^T\)
\(\nabla_A tr\ ABA^TC = CAB + C^TAB^T\)
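These trace identities are easy to sanity-check numerically; a small sketch (finite-difference check of \(\nabla_A tr\ AB = B^T\) on random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 3))

def f(A):                       # f(A) = tr(AB), maps R^{3x4} -> R
    return np.trace(A @ B)

eps = 1e-6
num_grad = np.zeros_like(A)     # finite-difference gradient, entry by entry
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        num_grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(num_grad, B.T, atol=1e-4))    # True: grad_A tr(AB) = B^T
```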
\(X = \begin{bmatrix}(x^{(1)})^T \\ \vdots \\ (x^{(m)})^T\end{bmatrix}\)
\(\frac{1}{2}(X \theta - y)^T(X\theta - y) = \frac{1}{2}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2 = J(\theta)\)
\(\begin{aligned} &\nabla_\theta J(\theta)\\ =& \nabla_\theta \frac{1}{2}(X \theta - y)^T(X\theta - y) \\ =& \frac{1}{2} \nabla_\theta tr (\theta^TX^TX\theta - \theta^T X^Ty-y^TX\theta +y^Ty) \\ =& \frac{1}{2}(\nabla_\theta tr\ \theta\theta^TX^TX-\nabla_\theta tr\ y^TX\theta - \nabla_\theta tr\ y^TX\theta)\\ =& \frac{1}{2}(X^TX\theta + X^TX\theta - 2X^Ty) \\ =& X^TX\theta - X^Ty \end{aligned}\)
Setting this to 0 gives the normal equations:
\(\theta = (X^TX)^{-1}X^Ty\)
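A small sketch of the normal equations in code (using `np.linalg.solve` rather than forming \((X^TX)^{-1}\) explicitly, which is numerically safer; the data below is made up):

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

X = np.array([[1.0, 1.0],       # first column is x_0 = 1
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])
print(normal_equation(X, y))    # approximately [0, 1]
```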
Feature Scaling (mean normalization): \(x_i := \frac{x_i - \mu_i}{s_i}\)
where \(\mu_i\) is the average of all values of feature \(i\), and \(s_i\) is either the range of values (max − min) or the standard deviation.
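A one-function sketch of this scaling, using the standard deviation for \(s_i\) (apply it to the raw features, before appending the constant \(x_0 = 1\) column, whose deviation is zero):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each feature column; return the statistics for reuse on new data."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s
```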
Underfitting and Overfitting
Parametric learning algorithm: a fixed, finite number of parameters. Non-parametric learning algorithm: the number of parameters (the amount of data the algorithm keeps around) grows with \(m\).
An example of a non-parametric learning algorithm: Locally Weighted Regression (LOESS)
To evaluate \(h\) at a query point \(x\):
LR (Linear Regression): fit \(\theta\) to minimize \(\sum_{i}(y^{(i)}-\theta^Tx^{(i)})^2\), return \(\theta^Tx\)
LWR:Fit \(\theta\) to minimize \(\sum_i w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\)
where \(w^{(i)}=exp(-\frac{(x^{(i)}-x)^2}{2\tau^2})\) (\(\tau\):bandwidth parameter)
If \(|x^{(i)}-x|\) small, then \(w^{(i)}\approx 1\)
If \(|x^{(i)}-x|\) large, then \(w^{(i)}\approx 0\)
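A sketch of LWR at one query point, solving the weighted least-squares problem in closed form, \(\theta = (X^TWX)^{-1}X^TWy\) with \(W = \mathrm{diag}(w^{(i)})\) (this closed form follows from the weighted objective above; it is not stated in the notes):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Locally weighted regression prediction at a single query point x_query."""
    diffs = X - x_query                                   # x^(i) - x for every i
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))  # Gaussian weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)     # weighted normal equations
    return x_query @ theta
```

Note that \(\theta\) is refit for every query point, which is why LWR has to keep the whole training set around (non-parametric).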
One way to justify that least squares is a sensible objective (a probabilistic interpretation):
Assume \(y^{(i)} = \theta^Tx^{(i)}+\varepsilon^{(i)}\)
\(\varepsilon^{(i)}\) = error term, \(\varepsilon^{(i)} \sim \mathscr{N}(0,\sigma^2)\)
\(P(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}}\), so \(P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\), i.e. \(y^{(i)}|x^{(i)};\theta \sim \mathscr{N}(\theta^Tx^{(i)},\sigma^2)\)
P.S. The Central Limit Theorem motivates modeling the error as Gaussian: \(\varepsilon^{(i)}\) aggregates many small independent effects, so \(y^{(i)}\) given \(x^{(i)}\) and parameterized by \(\theta\) is distributed Gaussian.
The \(\varepsilon^{(i)}\)'s are IID (independently and identically distributed).
\(L(\theta) = P(\overrightarrow{y}|X;\theta) = \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\) (Likelihood)
The principle of maximum likelihood: choose \(\theta\) to maximize \(L(\theta)\).
\(\begin{aligned} l(\theta) =& \log L(\theta) \\ =& \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}} \\ =& m\log\frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\end{aligned}\)
So maximizing \(l(\theta)\) is the same as minimizing \(\frac{1}{2}\sum_{i=1}^m(y^{(i)} - \theta^Tx^{(i)})^2 = J(\theta)\).
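A quick numeric check of this equivalence on made-up data: \(l(\theta)\) and \(-J(\theta)/\sigma^2\) differ only by a constant, so their differences across two values of \(\theta\) agree.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, sigma = 50, 3, 2.0
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)

def J(theta):                                   # least-squares cost
    return 0.5 * np.sum((X @ theta - y) ** 2)

def l(theta):                                   # Gaussian log-likelihood
    resid = y - X @ theta
    return np.sum(-np.log(np.sqrt(2 * np.pi) * sigma) - resid**2 / (2 * sigma**2))

t1, t2 = rng.standard_normal(n), rng.standard_normal(n)
print(np.isclose(l(t1) - l(t2), -(J(t1) - J(t2)) / sigma**2))   # True
```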
Logistic Regression: binary classification (linear regression doesn't work well here)
\(y\in\{0,1\},h_\theta (x)\in[0,1]\)
choose: \(h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}\)
\(g(z) = \frac{1}{1+e^{-z}}\) (sigmoid function)
\(P(y = 1 | x; \theta) = h_\theta (x),P(y = 0 | x ; \theta) = 1 - h_\theta(x)\)
So \(P(y | x; \theta) = h_\theta(x)^y(1 - h_\theta(x))^{1-y}\)
\(L(\theta) = P(\overrightarrow{y} | X; \theta) = \prod_{i}P(y^{(i)} | x^{(i)}; \theta) = \prod_i h(x^{(i)})^{y^{(i)}}(1 - h(x^{(i)}))^{1 - y^{(i)}}\)
\(l(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))\)
\(\theta := \theta + \alpha \nabla_\theta l(\theta)\)
Batch Gradient Ascent
\(\frac{\partial}{\partial\theta_j}l(\theta) = \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\)
\(\theta_j := \theta_j + \alpha\sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\)
Stochastic Gradient Ascent
\(\theta_j := \theta_j + \alpha(y^{(i)} - h(x^{(i)}))x_j^{(i)}\)
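A minimal sketch of batch gradient ascent for logistic regression (assumes `X` already includes the \(x_0 = 1\) column; the stochastic version would update with one example at a time, as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, alpha=0.1, iters=1000):
    """Maximize l(theta) by batch gradient ascent."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        theta += alpha * X.T @ (y - h)   # grad l(theta) = sum_i (y^(i) - h(x^(i))) x^(i)
    return theta
```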
Perceptron: use the threshold function instead of the sigmoid:
\(g(z) = \begin{cases} 1 & \text{if } z\geqslant 0 \\ 0 & \text{otherwise}\end{cases}\)
\(h_\theta(x) = g(\theta^Tx)\)
\(\theta_j := \theta_j + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
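A sketch of the resulting perceptron learning rule (same update, threshold hypothesis; labels assumed to be in \(\{0,1\}\)):

```python
import numpy as np

def perceptron(X, y, alpha=1.0, epochs=10):
    """Perceptron: theta += alpha * (y^(i) - h_theta(x^(i))) * x^(i), with h = g(theta^T x)."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            h = 1.0 if X[i] @ theta >= 0 else 0.0   # threshold g
            theta += alpha * (y[i] - h) * X[i]
    return theta
```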
Newton's Method (Not Finished)
\(f(\theta)\),Find \(\theta\) s.t. \(f(\theta) = 0\)
\(\Delta = \theta^{(t)} - \theta^{(t+1)} = \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}\)
For \(l(\theta)\): want \(\theta\) s.t. \(l'(\theta) = 0\) (a maximum of \(l\)), so apply the update above with \(f = l'\):
\(\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}\)
More generally:
\(\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta l\)
Where H is the Hessian matrix
\(H_{ij} = \frac{\partial^2 l}{\partial \theta_i \partial \theta_j}\)
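A sketch of Newton's method applied to the logistic-regression log-likelihood; the Hessian here is \(H = -X^T\,\mathrm{diag}(h(1-h))\,X\), a standard result that these notes don't derive:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_logistic(X, y, iters=10):
    """Newton's method: theta := theta - H^{-1} grad l(theta)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                         # gradient of l(theta)
        H = -X.T @ (X * (h * (1 - h))[:, None])      # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)            # theta - H^{-1} grad
    return theta
```

Each iteration is more expensive than a gradient step (it solves an \((n+1)\times(n+1)\) linear system), but far fewer iterations are typically needed.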
Generalized Linear Models
Exponential Family Distributions
\(P(y|x; \theta)\)
\(y \in \mathbb{R}\): Gaussian -> Least Squares
\(y \in \{0,1\}\): Bernoulli -> Logistic Regression
Examples: \(\mathrm{Bernoulli}(\Phi)\) with \(P(y = 1; \Phi) = \Phi\), and the Gaussian \(\mathscr{N}(\mu,\sigma^2)\).
\(P(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))\)
\(\eta\) - natural parameter
\(T(y)\) - sufficient statistic (usually \(T(y) = y\))
\(a(\eta)\) - log partition function; \(b(y)\) - base measure
\(Ber(\Phi)\) :
\(P(y;\Phi) = \Phi^y(1-\Phi)^{1-y} = \exp(y \log \Phi + (1 - y) \log (1 - \Phi)) = \exp(\log \frac{\Phi}{1 - \Phi} y + \log (1 - \Phi))\)
\(\eta = \log \frac{\Phi}{1 - \Phi}\),\(y = T(y)\),\(-a(\eta) = \log(1 - \Phi)\)
\(\Rightarrow \Phi = \frac{1}{1 + e^{-\eta}} \Rightarrow a(\eta) = -\log (1 - \Phi) = \log(1 + e^\eta)\)
\(T(y) = y\), \(b(y) = 1\)
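A quick numeric check that the Bernoulli pmf and its exponential-family form agree (arbitrary choice of \(\Phi\)):

```python
import numpy as np

phi = 0.3
eta = np.log(phi / (1 - phi))            # natural parameter
a = np.log(1 + np.exp(eta))              # log partition function a(eta)

for y in (0, 1):
    pmf = phi**y * (1 - phi)**(1 - y)        # Bernoulli(phi)
    expfam = 1.0 * np.exp(eta * y - a)       # b(y) * exp(eta * T(y) - a(eta)), b(y) = 1, T(y) = y
    print(np.isclose(pmf, expfam))           # True for both y
```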
\(\mathscr{N}(\mu, \sigma^2)\) set \(\sigma^2 = 1\)
\(\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(y - \mu)^2) =\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)\exp(\mu y-\frac{1}{2}\mu^2)\)
\(b(y) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)\), \(\mu = \eta\), \(T(y) = y\), \(a(\eta) = \frac{1}{2}\mu^2 = \frac{1}{2}\eta^2\)
Generalized Linear Models(GLM)
Assume:
\(y|x;\theta \sim \text{Exp Family} (\eta)\)
Given \(x\),goal is to output \(E[T(y)|x]\), Want \(h(x) = E[T(y)|x]\)
\(\eta = \theta^T x\) (if \(\eta\) is vector-valued, \(\eta_i = \theta_i^T x\))
Bernoulli: for fixed \(x, \theta\), the algorithm outputs
\(h_\theta(x) = E[y|x;\theta] = P(y = 1 | x;\theta) = \Phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}\)
\(g(\eta) = E[y;\eta] = \frac{1}{1+e^{-\eta}}\) (Canonical Response Function)
\(g^{-1}\):Canonical Link Function
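Putting the GLM pieces together for the Bernoulli case, a sketch with arbitrary example values of \(\theta\) and \(x\): the design choice \(\eta = \theta^Tx\) plus the canonical response function recovers the logistic-regression hypothesis.

```python
import numpy as np

def glm_bernoulli_hypothesis(theta, x):
    """h_theta(x) = E[y | x; theta] = g(eta), with eta = theta^T x and g the sigmoid."""
    eta = theta @ x                       # GLM assumption: eta = theta^T x
    return 1.0 / (1.0 + np.exp(-eta))     # canonical response function g(eta)

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.3, 0.7])             # first entry is x_0 = 1
print(glm_bernoulli_hypothesis(theta, x))
```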