Gradient descent is one of the most important algorithms in machine learning. It can be used to find a local minimum of any differentiable function.
Intuitively, think about trying to get to the lowest point on a landscape by walking quickly downhill. Your next step should be in the direction of the steepest slope. After taking a step, you reassess. You can easily get down a mountain and into a valley this way, but as you can probably guess, you are not guaranteed to end up at the lowest point overall. Where you start from makes a difference, as does your decision to go left or right at a saddle point. It's also best to avoid cliffs, and if you find yourself on a flat plateau you may not know where to go next. All of these considerations also apply to gradient descent in machine learning.
While you can look out and see where to go next, the gradient descent algorithm must rely on the [[derivative]] of the optimization function. The derivative tells the algorithm whether the function is increasing or decreasing at a given point: if increasing, the algorithm adjusts in the opposite direction, and vice versa. If the function is increasing steeply, the algorithm can take a correspondingly bigger step.
A learning rate $\alpha$ is typically used to avoid taking too big a step and overshooting.
Consider updating a weight in a [[multilayer perceptron]].
$
W^L_t = W^L_{t-1} - \alpha \frac{\partial \ell}{\partial W^L_{t-1}}
$
The weight $W$ in layer $L$ at time step $t$ is given by the weight at the previous time step, less the partial derivative of the loss function $\ell$ with respect to $W^L$ (the change in the loss given a change in the weight), multiplied by the learning rate $\alpha$.
The update procedure continues until the loss function stabilizes or otherwise meets some stopping criterion.
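As a minimal sketch of this update rule on a toy one-dimensional objective $(w - 3)^2$ (the function, starting point, and learning rate here are illustrative assumptions, not from any particular library):

```python
# Gradient descent on the toy objective f(w) = (w - 3)^2,
# whose derivative is f'(w) = 2(w - 3).
def grad(w):
    return 2 * (w - 3)

w = 10.0       # illustrative starting point
alpha = 0.1    # learning rate
for _ in range(100):
    w = w - alpha * grad(w)  # step against the gradient

print(w)  # approaches the minimum at w = 3
```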
## stochastic gradient descent
In stochastic gradient descent (SGD), the [[gradient descent]] algorithm is used on a randomized subset of the data, called a **minibatch**.
Each step sees only a small subset of the data, which greatly decreases the computational requirements. The next step is not necessarily the best next step (not necessarily in the direction of steepest slope for the full training set), but with a well-chosen batch size, the marginal benefit of any additional training examples is smaller than the marginal gain in speed from leaving them out.
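A sketch of minibatch SGD for a one-variable linear regression; the synthetic data and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus noise (illustrative assumption).
X = rng.normal(size=1000)
y = 2 * X + 1 + 0.1 * rng.normal(size=1000)

w, b = 0.0, 0.0
alpha, batch_size = 0.1, 32

for epoch in range(20):
    idx = rng.permutation(len(X))              # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]  # random minibatch
        err = w * X[batch] + b - y[batch]
        # Gradient of mean squared error, estimated from the minibatch only.
        w -= alpha * 2 * np.mean(err * X[batch])
        b -= alpha * 2 * np.mean(err)

print(w, b)  # approaches (2, 1)
```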
### momentum
SGD can still be slow to converge, but adding **momentum** (a moving average of past gradients) improves speed. Intuitively, momentum makes it harder to change direction from step to step.
Using the multilayer perceptron example, momentum is given by $\beta$.
$
W^L_t = \beta W^L_{t-1} - \alpha \frac{\partial \ell}{\partial W^L_{t-1}}
$
Note, however, that typically a separate variable $v$ is used to hold the velocity and momentum is applied to it; here the weight itself holds the velocity for simplicity (see [[lit/textbooks/Deep Learning|Deep Learning]] section 8.3.2).
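A sketch of that textbook form, with a separate velocity variable $v$ accumulating past gradients (the toy objective and hyperparameters are illustrative assumptions):

```python
# Momentum with an explicit velocity variable, as in
# Deep Learning section 8.3.2: v accumulates a decaying
# average of past gradients and the weight steps along v.
def grad(w):
    return 2 * (w - 3)  # gradient of the toy objective (w - 3)^2

w, v = 10.0, 0.0
alpha, beta = 0.1, 0.9
for _ in range(100):
    v = beta * v - alpha * grad(w)  # update the velocity
    w = w + v                       # step along the velocity
```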
Momentum also helps overcome plateaus, where the gradient is zero, allowing the algorithm to simply continue in a direction until the gradient changes again.
Nesterov momentum is intended to achieve faster convergence, but empirically it often makes little difference.
### decay
Decay reduces the learning rate, and thus the step size, at each time step to prevent overshooting as training progresses. This can help with overfitting as well.
Instead of a constant decay, `keras` lets you set a learning rate schedule with a stepwise function, which can also improve prediction error.
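For example, a stepwise schedule can be attached as a callback; the halving factor and step interval below are illustrative assumptions, not recommended values:

```python
import tensorflow as tf

# Halve the learning rate every 10 epochs (illustrative schedule).
def schedule(epoch, lr):
    if epoch > 0 and epoch % 10 == 0:
        return lr * 0.5
    return lr

callback = tf.keras.callbacks.LearningRateScheduler(schedule)
# model.fit(X, y, epochs=50, callbacks=[callback])
```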
### optimization
There are a number of SGD variants available in packages like `keras` that manipulate momentum and learning rate, primarily for numerical stability and speed of convergence. Adam and RMSprop are good general-purpose options. Experimentation is often the best way to select an optimizer for a given problem.
See [this article](https://imgur.com/a/visualizing-optimization-algos-Hqolp) on how these variants compare.

### AdaGrad
In AdaGrad, the learning rate is normalized by the square root of the running sum of squared gradients, so parameters with a history of large gradients take smaller steps.
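A sketch of the AdaGrad update on the same toy objective (the constants are illustrative assumptions):

```python
# AdaGrad: divide each step by the square root of the running
# sum of squared gradients, so step sizes shrink over time.
def grad(w):
    return 2 * (w - 3)

w, accum = 10.0, 0.0
alpha, eps = 0.5, 1e-8    # eps guards against division by zero
for _ in range(100):
    g = grad(w)
    accum += g ** 2                        # sum of squared gradients
    w -= alpha * g / (accum ** 0.5 + eps)  # normalized step
```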
### AdaDelta
In AdaDelta, the gradient is normalized by the root mean square (RMS) of recent gradients for numerical stability, and the change in weights is proportional to the ratio of the RMS of recent updates to the RMS of recent gradients.
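A sketch of the AdaDelta update; $\rho$ and $\epsilon$ are illustrative assumptions. Note there is no explicit learning rate: the step is the ratio of the two RMS terms times the gradient.

```python
# AdaDelta: normalize by the RMS of recent gradients and rescale
# by the RMS of recent weight updates.
def grad(w):
    return 2 * (w - 3)

w = 10.0
eg2, ed2 = 0.0, 0.0    # decaying averages of g^2 and delta^2
rho, eps = 0.95, 1e-6
for _ in range(500):
    g = grad(w)
    eg2 = rho * eg2 + (1 - rho) * g ** 2
    delta = -((ed2 + eps) ** 0.5 / (eg2 + eps) ** 0.5) * g
    ed2 = rho * ed2 + (1 - rho) * delta ** 2
    w += delta
```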
### RMSprop
RMSprop likewise takes a decaying moving average when calculating the RMS of the gradient for numerical stability, but unlike AdaDelta it retains an explicit global learning rate.
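A sketch of the RMSprop update (hyperparameters are illustrative assumptions):

```python
# RMSprop: divide the gradient by the RMS of a decaying moving
# average of squared gradients, keeping an explicit learning rate.
def grad(w):
    return 2 * (w - 3)

w, eg2 = 10.0, 0.0
alpha, rho, eps = 0.1, 0.9, 1e-8
for _ in range(500):
    g = grad(w)
    eg2 = rho * eg2 + (1 - rho) * g ** 2  # moving average of g^2
    w -= alpha * g / (eg2 ** 0.5 + eps)
```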
### Adam
Adaptive Moment Estimation (Adam) mimics momentum for both the gradient and the squared gradient (the first and second moments). It is computationally efficient and doesn't require much memory, so it is well suited to large models or datasets. However, it is sometimes slower than AdaDelta or SGD and has shown worse performance than SGD with momentum on some datasets. Nevertheless, it is a very popular optimizer.
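A sketch of the Adam update with the usual bias correction; the hyperparameters follow the commonly cited defaults, except for the learning rate, which is an illustrative assumption for the toy objective:

```python
# Adam: decaying averages of the gradient (first moment, m) and
# squared gradient (second moment, v), each bias-corrected.
def grad(w):
    return 2 * (w - 3)

w, m, v = 10.0, 0.0, 0.0
alpha, b1, b2, eps = 0.1, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = grad(w)
    m = b1 * m + (1 - b1) * g             # first moment
    v = b2 * v + (1 - b2) * g ** 2        # second moment
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w -= alpha * m_hat / (v_hat ** 0.5 + eps)
```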
### AdamW
AdamW improves regularization in Adam by correcting the way in which weight decay is implemented: adding an L2 penalty to the loss is not effective under Adam's adaptive updates, but decoupled weight decay is.
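A sketch of the decoupled form: the only change from the Adam sketch above is that the decay term $\lambda w$ is applied directly in the weight update rather than added to the gradient ($\lambda$ here is an illustrative assumption):

```python
# AdamW: Adam with decoupled weight decay. The decay term lam * w
# bypasses the adaptive scaling instead of entering the gradient.
def grad(w):
    return 2 * (w - 3)

w, m, v = 10.0, 0.0, 0.0
alpha, b1, b2, eps, lam = 0.1, 0.9, 0.999, 1e-8, 0.01
for t in range(1, 501):
    g = grad(w)                           # no L2 penalty added here
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w -= alpha * (m_hat / (v_hat ** 0.5 + eps) + lam * w)
```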
[[computation graph]]
## loss function
The loss function quantifies the error in a model's predictions; it is the quantity that gradient descent minimizes.
- [[mean squared error]]
- [[cross-entropy]] loss
- [[Kullback-Leibler divergence]] (used in [[Akaike Information Criterion|AIC]])