7. Gradient Descent

Convex Function


  • Function $f: \R^d \rightarrow \R$ is convex if $\forall x_1, x_2 \in \R^d,\ 0 \le t \le 1$:
$$f(tx_1 + (1 - t)x_2) \le tf(x_1) + (1 - t)f(x_2)$$
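The defining inequality can be checked numerically. A minimal sketch, assuming NumPy and using $f(x) = \lVert x \rVert^2$ as an example of a convex function (the points and function are made up for illustration):

```python
import numpy as np

# f(x) = ||x||^2 is convex, so the chord lies above the function.
def f(x):
    return np.dot(x, x)

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=3), rng.normal(size=3)

for t in np.linspace(0.0, 1.0, 11):
    lhs = f(t * x1 + (1 - t) * x2)            # function at the interpolated point
    rhs = t * f(x1) + (1 - t) * f(x2)         # interpolated function values
    assert lhs <= rhs + 1e-12                 # convexity inequality holds
```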

Minimize a convex, continuous and differentiable function


Taylor Expansion


$$\ell(w + s) \approx \ell(w) + g(w)^\intercal s$$

where

  • $g(w) = \nabla \ell(w)$ is the gradient
$$\ell(w + s) \approx \ell(w) + g(w)^\intercal s + \frac{1}{2} s^\intercal H(w) s$$

where

  • $H(w) = \nabla^2 \ell(w)$ is the Hessian of $\ell$
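The two expansions can be compared numerically. A sketch, assuming NumPy and the made-up quadratic $\ell(w) = \frac{1}{2} w^\intercal A w$, whose gradient and Hessian are known in closed form:

```python
import numpy as np

# l(w) = w^T A w / 2 for a symmetric positive definite A (assumed example).
A = np.array([[3.0, 1.0], [1.0, 2.0]])

def loss(w):    return 0.5 * w @ A @ w
def grad(w):    return A @ w              # g(w)
def hessian(w): return A                  # H(w), constant for a quadratic

w = np.array([1.0, -1.0])
s = np.array([0.1, 0.05])

first  = loss(w) + grad(w) @ s                               # 1st-order model
second = first + 0.5 * s @ hessian(w) @ s                    # 2nd-order model

print(abs(loss(w + s) - first))    # small but nonzero error
print(abs(loss(w + s) - second))   # for a quadratic, the 2nd-order model is exact
```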

Gradient Descent

  • Use the first order approximation
  • Assume that the function $\ell$ around $w$ is linear and behaves like $\ell(w) + g(w)^\intercal s$
  1. Find a vector $s$ that minimizes $\ell$:
$$s = - \alpha g(w)$$
  • for some small $\alpha$, the learning rate
  2. After one update:
$$\ell(w - \alpha g(w)) \approx \ell(w) - \underbrace{\alpha g(w)^\intercal g(w)}_{>0} < \ell(w)$$
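The single-step decrease above can be verified directly. A minimal sketch, assuming NumPy, on the made-up convex loss $\ell(w) = \lVert w \rVert^2$:

```python
import numpy as np

# l(w) = ||w||^2, so g(w) = 2w (assumed example).
def loss(w): return np.dot(w, w)
def grad(w): return 2 * w

w = np.array([2.0, -3.0])
alpha = 0.1                          # small learning rate
w_new = w - alpha * grad(w)          # one update with s = -alpha * g(w)

assert loss(w_new) < loss(w)         # the loss strictly decreases
```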

Choosing the step


  • A safe choice is to set $\alpha = \frac{c}{t}$
  • $c$: some constant
  • $t$: the number of updates taken
  • Guarantees that the step size eventually becomes small enough for gradient descent to converge
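A sketch of this schedule, assuming NumPy; the loss $\ell(w) = \lVert w \rVert^2$ and the constant $c = 0.3$ are arbitrary choices for illustration:

```python
import numpy as np

# Gradient of the assumed loss l(w) = ||w||^2.
def grad(w): return 2 * w

w, c = np.array([5.0, -5.0]), 0.3
for t in range(1, 501):
    w = w - (c / t) * grad(w)        # decaying learning rate alpha = c / t

print(np.linalg.norm(w))             # shrinks toward the minimizer w = 0
```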

Types

• Batch Gradient

  • use the error over the entire training set $\mathcal D$
  • Do until satisfied:
$$\begin{align*} &\text{Compute the gradient: } \nabla \ell_{\mathcal D}(w) \\ &\text{Update the vector of parameters: } w \leftarrow w - \alpha \nabla \ell_{\mathcal D}(w) \end{align*}$$
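A minimal sketch of batch gradient descent, assuming NumPy and a made-up least-squares problem (data, `true_w`, and `alpha` are all illustrative choices):

```python
import numpy as np

# Synthetic linear-regression data standing in for the training set D.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

def grad_D(w):
    # Gradient of the mean squared error over the whole set D.
    return 2 * X.T @ (X @ w - y) / len(y)

w, alpha = np.zeros(3), 0.1
for _ in range(500):
    w = w - alpha * grad_D(w)        # full-batch update

print(np.round(w, 2))                # close to true_w
```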

• Stochastic Gradient

  • use the error over a single training example from $\mathcal D$
  • Do until satisfied:
$$\begin{align*} &\text{Choose (with replacement) a random training example } (x^{(i)}, y^{(i)}) \in \mathcal D \\ &\text{Compute the gradient for that example: } \nabla \ell_{(x^{(i)}, y^{(i)})}(w) \\ &\text{Update the vector of parameters: } w \leftarrow w - \alpha \nabla \ell_{(x^{(i)}, y^{(i)})}(w) \end{align*}$$
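The same least-squares setup can be solved with one randomly drawn example per update. A sketch, assuming NumPy (data and hyperparameters are made up):

```python
import numpy as np

# Same synthetic data as the batch example.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w, alpha = np.zeros(3), 0.01
for _ in range(5000):
    i = rng.integers(len(y))               # sample one example with replacement
    g = 2 * (X[i] @ w - y[i]) * X[i]       # gradient for that single example
    w = w - alpha * g

print(np.round(w, 1))                      # approximately true_w
```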

Tip

  • Stochastic approximates Batch arbitrarily closely as $\alpha \rightarrow 0$
  • Stochastic can be much faster when $\mathcal D$ is very large
  • Intermediate approach: use the error over subsets (mini-batches) of $\mathcal D$
  • Gradient descent is slow when it is close to the minimum
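The intermediate approach averages the gradient over a small random subset of $\mathcal D$ each step. A sketch, assuming NumPy; the batch size of 16 and all data are illustrative:

```python
import numpy as np

# Same synthetic least-squares data as before.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w, alpha, batch = np.zeros(3), 0.05, 16
for _ in range(1000):
    idx = rng.integers(len(y), size=batch)            # random mini-batch indices
    g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch  # gradient over the subset
    w = w - alpha * g

print(np.round(w, 1))
```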

Where to converge?


Newton’s Method: Use $2^{nd}$ order approximation

$$\ell(w + s) \approx \ell(w) + g(w)^\intercal s + \frac{1}{2} s^\intercal H(w) s$$

  • Minimizing the right-hand side over $s$ gives the Newton step $s = -H(w)^{-1} g(w)$
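Since the second-order model is exact for a quadratic, a single Newton step $s = -H(w)^{-1} g(w)$ reaches its minimizer. A sketch, assuming NumPy and a made-up quadratic loss $\ell(w) = \frac{1}{2} w^\intercal A w - b^\intercal w$:

```python
import numpy as np

# Assumed quadratic: gradient A w - b, constant Hessian A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad(w):    return A @ w - b
def hessian(w): return A

w = np.zeros(2)
for _ in range(2):
    # Newton update: solve H(w) s = -g(w) instead of forming the inverse.
    w = w - np.linalg.solve(hessian(w), grad(w))

# For a quadratic, one step already lands on the exact minimizer A^{-1} b.
print(np.allclose(A @ w, b))
```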