Andrew Ng - Machine Learning - Stanford

https://open.163.com/newview/movie/free?pid=IEU2H8NIJ&mid=VEU2H8NKA

Supervised Learning & Gradient Descent

Notation:

  • \(m\) = #training examples.
  • \(n\) = #features
  • \(x\) = input variables / features.
  • \(y\) = output variable / target variable
  • \((x, y)\) = training example
    • The i-th: \((x^{(i)},y^{(i)})\)

\(h(x)=h_\theta(x)=\theta_0+\theta_1x_1+\theta_2x_2+...+\theta_nx_n\)

Let \(x_0 = 1\); then

\(h(x) = \sum_{i=0}^{n}\theta_ix_i=\theta^T x\)

\(Minimize_{\theta}\ \ J(\theta) = \frac{1}{2} \sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})^2\)

(Least squares linear regression)
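A minimal NumPy sketch of the hypothesis and cost function above (names like `h`, `J`, `X`, `y` are illustrative; `X` is assumed to be the \(m \times (n+1)\) design matrix with the \(x_0 = 1\) column already prepended):

```python
import numpy as np

def h(theta, X):
    """Hypothesis h_theta(x) = theta^T x, evaluated for every row of X."""
    return X @ theta

def J(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residual = h(theta, X) - y
    return 0.5 * residual @ residual
```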

Start with \(\theta = \boldsymbol{0}\)

Notice: \(\theta, x, y\) are all vectors now.

Batch Gradient Descent:

For a single training example \((x, y)\), \(\frac{\partial}{\partial\theta_i}\frac{1}{2}(h_\theta(x)-y)^2 = (h_\theta(x)-y)x_i\). Summing over the whole training set gives the batch update

\(\theta_i := \theta_i - \alpha\frac{\partial}{\partial\theta_i}J(\theta) = \theta_i - \alpha \sum_{j = 1}^{m} (h_\theta(x^{(j)}) - y^{(j)})x_i^{(j)}\) (for all \(i\))

(Repeat until convergence)
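A sketch of batch gradient descent under the same assumptions (`X` includes the intercept column; `alpha` and `n_iters` are illustrative hyperparameters):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Repeat theta := theta - alpha * grad J(theta) over the whole batch."""
    m, n = X.shape
    theta = np.zeros(n)                  # start with theta = 0
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y)     # sum_j (h_theta(x^(j)) - y^(j)) x^(j)
        theta -= alpha * grad
    return theta
```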

Stochastic Gradient Descent:

For \(j := 1\) to \(m\): \(\theta_i := \theta_i - \alpha(h_\theta(x^{(j)})-y^{(j)})x_i^{(j)}\) (for all \(i\))

(one training example \((x^{(j)}, y^{(j)})\) per update step)
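A corresponding sketch of the stochastic update, one training example per step (same illustrative names):

```python
import numpy as np

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """One parameter update per training example (x^(j), y^(j))."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for j in range(m):                # for j := 1 to m
            err = X[j] @ theta - y[j]     # h_theta(x^(j)) - y^(j)
            theta -= alpha * err * X[j]   # update all theta_i at once
    return theta
```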

Def: \[\nabla_\theta J=\begin{bmatrix} \frac{\partial J}{\partial \theta_0} \\ \vdots \\ \frac{\partial J}{\partial \theta_n} \end{bmatrix}\]

So gradient descent is: \(\theta := \theta - \alpha \nabla_\theta J\ \ (\theta, \nabla_\theta J \in \mathbb{R}^{n+1})\)

\(f(A): \mathbb{R}^{m \times n} \mapsto \mathbb{R}\)

Def: \(\nabla_A f(A)=\begin{bmatrix} \frac{\partial f}{\partial A_{1,1}} & \cdots & \frac{\partial f }{\partial A_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial A_{m,1}} & \cdots & \frac{\partial f}{\partial A_{m,n}}\end{bmatrix}\)

\(tr\ A = \sum_{i=1}^{n} A_{i,i}\ \ (A \in \mathbb{R}^{n \times n})\)

Fact:

\(tr\ AB=tr\ BA\)

\(tr\ ABC = tr\ CAB = tr\ BCA\)

\(f(A) = tr\ AB \Rightarrow \nabla_A tr\ AB = B^T\)

\(\nabla_A tr\ ABA^TC = CAB + C^TAB^T\)
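A quick numerical check of the cyclic trace identities above (random matrices are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.normal(size=(3, 3)) for _ in range(3))

# tr AB = tr BA, and tr ABC = tr CAB = tr BCA (cyclic property).
assert np.isclose(np.trace(A @ B), np.trace(B @ A))
assert np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B))
assert np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A))
```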

\(X = \begin{bmatrix} (x^{(1)})^T \\ \vdots \\ (x^{(m)})^T\end{bmatrix}\) (the design matrix, one training example per row)

\(\frac{1}{2}(X \theta - y)^T(X\theta - y) = \frac{1}{2}\sum_{i=1}^{m}(h(x^{(i)})-y^{(i)})^2 = J(\theta)\)

\(\begin{aligned} &\nabla_\theta J(\theta)\\ =& \nabla_\theta \frac{1}{2}(X \theta - y)^T(X\theta - y) \\ =& \frac{1}{2} \nabla_\theta tr (\theta^TX^TX\theta - \theta^T X^Ty-y^TX\theta +y^Ty) \\ =& \frac{1}{2}(\nabla_\theta tr\ \theta\theta^TX^TX-\nabla_\theta tr\ y^TX\theta - \nabla_\theta tr\ y^TX\theta)\\ =& \frac{1}{2}(X^TX\theta + X^TX\theta - X^Ty - X^Ty)\\ =& X^TX\theta - X^Ty \end{aligned}\)

(using \(\nabla_A tr\ ABA^TC = CAB + C^TAB^T\) with \(A = \theta\), \(B = I\), \(C = X^TX\), and \(\nabla_\theta tr\ y^TX\theta = X^Ty\))

Setting the gradient to 0 gives the normal equations:

\(\theta = (X^TX)^{-1}X^Ty\)
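A sketch of the closed-form solution (assuming \(X^TX\) is invertible; `solve` is used rather than forming the inverse explicitly):

```python
import numpy as np

def normal_equation(X, y):
    """theta = (X^T X)^(-1) X^T y, computed by solving the linear system."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```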

Feature Scaling: \(x_i := \frac{x_i - \mu_i}{s_i}\)

where \(\mu_i\) is the average of all the values for feature \(i\), and \(s_i\) is either the range of values (max − min) or the standard deviation.
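A sketch of this scaling using the standard deviation as \(s_i\) (apply it to the raw features before adding the intercept column, since a constant column has zero spread):

```python
import numpy as np

def scale_features(X):
    """x_i := (x_i - mu_i) / s_i, column-wise, with s_i = standard deviation."""
    mu = X.mean(axis=0)
    s = X.std(axis=0)
    return (X - mu) / s, mu, s
```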

Underfitting and Overfitting

Parametric learning algorithm: a fixed number of parameters, independent of \(m\).

Non-parametric learning algorithm: the number of parameters (the amount of data you need to keep around) grows with \(m\). Example: Locally Weighted Regression (LOESS).

To evaluate \(h\) at a certain \(x\)

LR (Linear Regression): fit \(\theta\) to minimize \(\sum_{i}(y^{(i)}-\theta^Tx^{(i)})^2\), return \(\theta^Tx\)

LWR:Fit \(\theta\) to minimize \(\sum_i w^{(i)}(y^{(i)}-\theta^Tx^{(i)})^2\)

where \(w^{(i)}=exp(-\frac{(x^{(i)}-x)^2}{2\tau^2})\) (\(\tau\):bandwidth parameter)

If \(|x^{(i)}-x|\) small, then \(w^{(i)}\approx 1\)

If \(|x^{(i)}-x|\) large, then \(w^{(i)}\approx 0\)
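A sketch of a single LWR prediction; one standard way to fit the weighted objective in closed form is the weighted normal equations \(\theta = (X^TWX)^{-1}X^TWy\) with \(W = \mathrm{diag}(w^{(i)})\) (function and variable names are illustrative):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Fit theta around x_query with Gaussian weights, return theta^T x_query."""
    diffs = X - x_query
    # w^(i) = exp(-|x^(i) - x|^2 / (2 tau^2)), tau = bandwidth
    w = np.exp(-np.sum(diffs**2, axis=1) / (2 * tau**2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```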

A probabilistic justification for why least squares is a reasonable thing to do:

Assume \(y^{(i)} = \theta^Tx^{(i)}+\varepsilon^{(i)}\)

\(\varepsilon^{(i)}\) is the error term, \(\varepsilon^{(i)} \sim \mathscr{N}(0,\sigma^2)\)

\(P(\varepsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(\varepsilon^{(i)})^2}{2\sigma^2}}\), so \(y^{(i)}\,|\,x^{(i)};\theta \sim \mathscr{N}(\theta^Tx^{(i)},\sigma^2)\), i.e. \(P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\)

P.S. By the Central Limit Theorem, an error that is the sum of many small independent effects is approximately Gaussian; this is why \(y^{(i)}\), given \(x^{(i)}\) and parameterized by \(\theta\), is modeled as Gaussian.

The \(\varepsilon^{(i)}\)'s are IID (independently and identically distributed).

\(L(\theta) = P(\overrightarrow{y}|X;\theta) = \prod_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}}\) (Likelihood)

The principle of maximum likelihood: choose \(\theta\) to maximize \(L(\theta)\).

\(\begin{aligned} l(\theta) =& \log L(\theta) \\ =& \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}} \\ =& m\log\frac{1}{\sqrt{2\pi}\sigma} + \sum_{i=1}^m-\frac{(y^{(i)}-\theta^Tx^{(i)})^2}{2\sigma^2}\end{aligned}\)

So maximizing \(l(\theta)\) is the same as minimizing \(\frac{1}{2}\sum_{i=1}^m(y^{(i)} - \theta^Tx^{(i)})^2 = J(\theta)\)
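A quick numerical sanity check of this equivalence (with \(\sigma = 1\) and purely illustrative random data): the negative log-likelihood and \(J(\theta)\) differ only by a constant that does not depend on \(\theta\), so they have the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)

def J(theta):
    r = X @ theta - y
    return 0.5 * r @ r

def neg_log_likelihood(theta, sigma=1.0):
    r = y - X @ theta
    return len(y) * np.log(np.sqrt(2 * np.pi) * sigma) + (r @ r) / (2 * sigma**2)

# With sigma = 1 the two objectives differ only by a theta-independent constant,
# so their differences (and hence their minimizers) coincide.
t1, t2 = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(neg_log_likelihood(t1) - neg_log_likelihood(t2), J(t1) - J(t2))
```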

Logistic Regression: binary classification (linear regression doesn't work well here)

\(y\in\{0,1\},h_\theta (x)\in[0,1]\)

Choose: \(h_\theta(x) = g(\theta^Tx) = \frac{1}{1+e^{-\theta^Tx}}\)

\(g(z) = \frac{1}{1+e^{-z}}\) (sigmoid function)

\(P(y = 1 | x; \theta) = h_\theta (x),P(y = 0 | x ; \theta) = 1 - h_\theta(x)\)

So \(P(y | x; \theta) = h_\theta(x)^y(1 - h_\theta(x))^{1-y}\)

\(L(\theta) = P(\overrightarrow{y} | X; \theta) = \prod_{i}P(y^{(i)} | x^{(i)}; \theta) = \prod_i h_\theta(x^{(i)})^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}\)

\(l(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)}\log h_\theta(x^{(i)}) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))\)

\(\theta := \theta + \alpha \nabla_\theta l(\theta)\)

Batch Gradient Ascent

\(\frac{\partial}{\partial\theta_j}l(\theta) = \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\)

\(\theta_j := \theta_j + \alpha\sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)}))x_j^{(i)}\)
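A sketch of logistic regression trained with this batch gradient-ascent update (names and hyperparameters are illustrative; `X` again includes the intercept column):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, alpha=0.1, n_iters=1000):
    """Maximize l(theta): theta_j += alpha * sum_i (y^(i) - h_theta(x^(i))) x_j^(i)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))   # gradient of the log-likelihood
        theta += alpha * grad
    return theta
```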

Stochastic Gradient Ascent

\(\theta_j := \theta_j + \alpha(y^{(i)} - h(x^{(i)}))x_j^{(i)}\)

Perceptron: replace the sigmoid with the hard threshold

\(g(z) = \begin{cases} 1 & \text{if } z\geqslant 0 \\ 0 & \text{otherwise}\end{cases}\)

\(h_\theta(x) = g(\theta^Tx)\)

\(\theta_j := \theta_j + \alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}\)
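A minimal sketch of one perceptron update with the threshold \(g\) above (names are illustrative):

```python
import numpy as np

def perceptron_step(theta, x, y, alpha=0.1):
    """One stochastic update: theta_j += alpha * (y - g(theta^T x)) * x_j."""
    h = 1.0 if theta @ x >= 0 else 0.0
    return theta + alpha * (y - h) * x
```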

Newton's Method (Not Finished)

\(f(\theta)\),Find \(\theta\) s.t. \(f(\theta) = 0\)

\(\Delta = \theta^{(t)} - \theta^{(t+1)} = \frac{f(\theta^{(t)})}{f'(\theta^{(t)})}\)

\(l(\theta)\): want \(\theta\) s.t. \(l'(\theta) = 0\) (to maximize \(l\), apply Newton's method to \(f = l'\))

\(\theta^{(t+1)} = \theta^{(t)} - \frac{l'(\theta^{(t)})}{l''(\theta^{(t)})}\)

More generally:

\(\theta^{(t+1)} = \theta^{(t)} - H^{-1}\nabla_\theta l\)

Where H is the Hessian matrix

\(H_{ij} = \frac{\partial^2 l}{\partial \theta_i \partial \theta_j}\)
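A sketch of Newton's method applied to the logistic log-likelihood, where \(\nabla_\theta l = X^T(y - h)\) and \(H = -X^T\,\mathrm{diag}(h(1-h))\,X\) (these expressions follow from differentiating the gradient of \(l\) derived earlier; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iters=10):
    """theta := theta - H^(-1) grad l(theta), for logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        grad = X.T @ (y - h)                 # gradient of l(theta)
        H = -(X.T * (h * (1 - h))) @ X       # Hessian of l(theta)
        theta -= np.linalg.solve(H, grad)    # Newton step
    return theta
```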

Generalized Linear Models

Exponential Family Distributions

\(P(y|x; \theta)\)

\(y \in \mathbb{R}\): Gaussian -> Least Square

\(y \in \{0,1\}\): Bernoulli -> Logistic regression

Two running examples: \(Bernoulli(\Phi)\) with \(P(y = 1; \Phi) = \Phi\), and the Gaussian \(\mathscr{N}(\mu,\sigma^2)\). Both are members of the exponential family:

\(P(y;\eta) = b(y)\exp(\eta^T T(y) - a(\eta))\)

\(\eta\)-natural parameter

\(T(y)\) - sufficient statistic (usually \(T(y) = y\)); \(a(\eta)\) - log partition function; \(b(y)\) - base measure

\(Ber(\Phi)\) :

\(P(y;\Phi) = \Phi^y(1-\Phi)^{1-y} = \exp(y \log \Phi + (1 - y) \log (1 - \Phi)) = \exp(\log \frac{\Phi}{1 - \Phi} y + \log (1 - \Phi))\)

\(\eta = \log \frac{\Phi}{1 - \Phi}\),\(y = T(y)\),\(-a(\eta) = \log(1 - \Phi)\)

\(\Rightarrow \Phi = \frac{1}{1 + e^{-\eta}} \Rightarrow a(\eta) = -\log (1 - \Phi) = \log(1 + e^\eta)\)

\(T(y) = y\), \(b(y) = 1\)
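A quick numerical check that this rewriting of the Bernoulli pmf matches the exponential-family form (the value \(\Phi = 0.3\) is purely illustrative):

```python
import numpy as np

def bernoulli_pmf(y, phi):
    return phi**y * (1 - phi)**(1 - y)

def exp_family_form(y, phi):
    """b(y) exp(eta * T(y) - a(eta)) with eta = log(phi/(1-phi)),
    T(y) = y, a(eta) = log(1 + e^eta), b(y) = 1."""
    eta = np.log(phi / (1 - phi))
    a = np.log(1 + np.exp(eta))
    return 1.0 * np.exp(eta * y - a)

phi = 0.3
for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, phi), exp_family_form(y, phi))
```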

\(\mathscr{N}(\mu, \sigma^2)\) set \(\sigma^2 = 1\)

\(\frac{1}{\sqrt{2\pi}} \exp(-\frac{1}{2}(y - \mu)^2) =\frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)\exp(\mu y-\frac{1}{2}\mu^2)\)

\(b(y) = \frac{1}{\sqrt{2\pi}}\exp(-\frac{1}{2}y^2)\), \(\mu = \eta\), \(T(y) = y\), \(a(\eta) = \frac{1}{2}\mu^2 = \frac{1}{2}\eta^2\)

Generalized Linear Models(GLM)

Assume:

  1. \(y|x;\theta \sim \text{Exp Family} (\eta)\)

  2. Given \(x\),goal is to output \(E[T(y)|x]\), Want \(h(x) = E[T(y)|x]\)

  3. \(\eta = \theta^T x\) (\(\eta_i = \theta_i^T x\) if \(y \in \mathbb{R}^l\))

Bernoulli: for fixed \(x, \theta\), the algorithm outputs

\(h_\theta(x) = E[y|x;\theta] = P(y = 1 | x;\theta) = \Phi = \frac{1}{1+e^{-\eta}} = \frac{1}{1+e^{-\theta^T x}}\)

\(g(\eta) = E[y;\eta] = \frac{1}{1+e^{-\eta}}\) (Canonical Response Function)

\(g^{-1}\):Canonical Link Function