Gradient Descent

#Math
Takes small steps using first derivative information to arrive at root, or minima of a function. Similar to Newton's Method except it converges slower if using constant step sizes

$\displaystyle \theta^{t+1} \leftarrow \theta^{t}-\eta \frac{\partial f }{\partial \theta}$

The new parameters for our ML model
$\displaystyle \eta$ is the step size
$\displaystyle f$ is the error function (we want to minimize error)

Training

Gradient Descent

GradientDescent()
- $\displaystyle \theta:=0$
- For epoch $\displaystyle =1\ldots T$
  - $\displaystyle \theta^{t+1} \leftarrow \theta^{t}-\eta \frac{\partial f }{\partial \theta}$
- return $\displaystyle \theta$

$\displaystyle O(nd)$

Time complexity depends on the number of samples and dimensionality

Stochastic Gradient Descent

StochasticGradientDescent()
- $\displaystyle \theta:=0$
- For epoch $\displaystyle =1\ldots T$
  - For $\displaystyle (x,y)\in \mathcal{D}$
    - $\displaystyle \theta^{t+1} \leftarrow \theta^{t}-\eta \nabla J_{(x,y)}(\theta)$
      - The gradient is calculated on just one datapoint
- return $\displaystyle \theta$