Gradient descent is the backbone of neural network training. In this post, we will explore gradient descent and some of its variants.

In a recent episode of the Artificial Intelligence podcast, hosted by Lex Fridman of MIT, Michael I. Jordan, a professor at Berkeley and one of the most influential figures in the history of machine learning, talked extensively about gradient descent and his fascination with the *Nesterov Accelerated Gradient* algorithm. Jordan was even nicknamed the “Miles Davis” of machine learning by Yann LeCun, himself an extraordinarily prominent deep learning researcher and considered one of the three leaders of the field.

During the conversation, Lex asked, “In terms of optimization algorithms, do you think we’re stuck with gradient descent, and what variants of it do you think are interesting?”, to which professor Michael I. Jordan answered, *“Gradients are amazing mathematical objects and when people study them deeply mathematically, they’re kind of shocked about what they are and what they can do.”* He went on to explain, *“Suppose I tell you, if you move along the x-axis, you will go uphill towards some objective by 3 units, whereas if you move along the y-axis, you go uphill by 7 units. And I only allow you to move a certain unit distance. Then what are you going to do?”*

The professor illustrated how most people will choose to move along the y-axis because, *“they’re getting the biggest bang for their buck and my buck is only one unit and I am going to put all of it in the y-axis, and why would I put any of my step size in the x-axis because I’ll get less bang for my buck.”* He then explained the false logic here in the context of gradient descent: *“This seems like a clear argument, but it is wrong because the gradient direction is not to go only along the y-axis; it is to take a little bit of the x-axis.”* He added, *“We have come to a place where stochastic gradient algorithms are dominant and there are versions that are a little better than others or have more guarantees and are more robust, but there’s still ongoing research to figure out what the best algorithm for each situation is and I think that will start to co-evolve.”*

Later he commented, *“Even though it (gradient descent) is a good optimization algorithm, is it the best algorithm, given that it has a slow rate of 1/k steps? The answer is no… or not clear yet, because 1/k is the lower bound and that’s the best you could do.”* But he went on to say that *Nesterov Accelerated Gradient* is the best variation of gradient descent and his personal favorite algorithm, as it achieves a 1/k² rate by combining two gradients in an obscure way, building on classical momentum.

Now, in order to understand the professor’s appreciation of gradient descent and his fascination with Nesterov Accelerated Gradient, we need to understand the concept behind gradient descent and its place in the context of machine learning. In the process, we will examine some of its most prominent variants, especially Yurii Nesterov’s own contribution.

#### Recommended Listening: ‘Bitches Brew’ by Miles Davis

The main motive in machine learning is optimizing a model’s capacity to predict or classify. At the core of this pursuit, we find gradient descent as a pivot in solving many optimization problems and as one from which many algorithmic innovations are derived. Gradient descent, in a very general sense, aims to reduce the margin between the model’s predicted outputs and the given target values. This margin is known as the loss, cost, or simply error, and it degrades the model’s performance and viability.

The process that a gradient descent algorithm undertakes to lower this margin (we will call it the cost from now on) involves iterating over a set of steps that optimize the model by computing a *cost function* and by adjusting what are called the weight(s) (coefficients, in statistical terms; one or more, depending on the number of reference parameters) using a hyperparameter called the *learning rate*, which injects a step size (pace) into each cycle when computing adjustments to the model weights from their last known values. This process continues until no further adjustment can significantly reduce the cost; at that point the optimal weight(s) have been obtained and the model is deemed adequately optimized.

Now let’s *break down* the etymology of the term ‘gradient descent’ before we proceed to examine the dynamics underlying the algorithm. ‘Gradient’ refers to the slope of the convex surface along which the model weights are moved (“iterated”), expressed as a derivative of the cost function; this derivative is the main conduit for finding the minimum (the optimal point on the convex surface), where the error margin is at its lowest. This explains the use of ‘descent’: the process searches *down* the slopes of the cost function.
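The cycle just described can be sketched in a few lines of Python. This is a minimal, illustrative example on a toy cost J(θ) = (θ − 3)², whose derivative is 2(θ − 3); the function name and hyperparameter values are assumptions for illustration, not a reference implementation.

```python
# Minimal sketch of the gradient-descent cycle on a toy cost
# J(theta) = (theta - 3)**2, whose derivative is 2 * (theta - 3).
# The true minimum is at theta = 3.

def gradient_descent(lr=0.1, steps=100):
    theta = 0.0                      # initial weight
    for _ in range(steps):
        grad = 2 * (theta - 3)       # derivative of the cost at theta
        theta -= lr * grad           # step downhill, scaled by the learning rate
    return theta

print(round(gradient_descent(), 4))  # prints 3.0
```

Each iteration moves θ against the sign of the derivative, with the learning rate controlling the pace, exactly the cycle of compute-cost, compute-gradient, adjust-weights described above.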

#### Recommended Listening: ‘In a Silent Way’ by Miles Davis

As previously mentioned, gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. Every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (e.g. Lasagne, Caffe, and Keras). These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

Gradient descent is a way to minimize a cost function *J*(*θ*) parameterized by a model’s parameters *θ* by updating the parameters in the opposite direction of the gradient of the cost function ∇*θ J*(*θ*) with respect to the parameters’ last known values. The learning rate *η* determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the cost function downhill until we reach a valley.

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the cost function. Depending on the amount of data, we make a trade-off between the accuracy of the parameter update and the time it takes to perform an update.

## 1. Batch gradient descent

Vanilla gradient descent, aka batch gradient descent, computes the gradient of the cost function with respect to the parameters *θ *for the entire training dataset:

*θ* = *θ* − *η* · ∇*θ J*(*θ*)

As we need to calculate the gradients for the whole dataset to perform just *one *update, batch gradient descent can be very slow and is intractable for datasets that do not fit in memory. Batch gradient descent also does not allow us to update our model *online*, i.e. with new examples on-the-fly.
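As a hedged sketch, here is batch gradient descent on a tiny one-parameter linear regression (y = w·x, no intercept); the dataset, learning rate, and function name are invented for illustration. Note how every single parameter update sums the gradient over the entire dataset.

```python
# Illustrative batch gradient descent for 1-D linear regression y = w * x,
# with a squared-error cost averaged over the WHOLE dataset: each update
# touches every example before the parameter moves once.

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (x, y) pairs with true w = 2

def batch_gd(data, lr=0.05, epochs=200):
    w = 0.0
    n = len(data)
    for _ in range(epochs):
        # one parameter update requires the gradient over ALL examples
        grad = sum(2 * (w * x - y) * x for x, y in data) / n
        w -= lr * grad
    return w

print(round(batch_gd(data), 3))  # prints 2.0
```

With three examples this is trivial, but for millions of examples the inner `sum` is a full pass over the data per update, which is exactly why batch gradient descent is slow and intractable when the dataset does not fit in memory.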

## 2. Stochastic gradient descent

Stochastic gradient descent (SGD), in contrast, performs a parameter update for *each* training example *x*(*i*) and label *y*(*i*):

*θ* = *θ* − *η* · ∇*θ J*(*θ*; *x*(*i*); *y*(*i*))

Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online. SGD performs frequent updates with high variance, which cause the objective function to fluctuate heavily.
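The per-example update can be sketched on the same toy regression; the shuffling, seed, and hyperparameters are illustrative assumptions, not canonical choices. The key difference from the batch version is that the parameter now moves once per example rather than once per pass.

```python
# Illustrative SGD on the toy regression y = w * x: one update per
# training example, visited in random order each epoch.
import random

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # noiseless data, true w = 2

def sgd(data, lr=0.02, epochs=100, seed=0):
    rng = random.Random(seed)
    w = 0.0
    samples = list(data)
    for _ in range(epochs):
        rng.shuffle(samples)            # visit examples in random order
        for x, y in samples:
            grad = 2 * (w * x - y) * x  # gradient for this SINGLE example
            w -= lr * grad              # parameters move immediately
    return w

print(round(sgd(data), 3))  # prints 2.0 on this noiseless data
```

On noisy real data the individual updates would pull `w` in conflicting directions, which is the high-variance fluctuation described above; here the data are exactly consistent, so every update agrees.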

## 3. Mini-batch gradient descent

Mini-batch gradient descent finally takes the best of both worlds and performs an update for every mini-batch of *n* training examples:

*θ* = *θ* − *η* · ∇*θ J*(*θ*; *x*(*i*:*i*+*n*); *y*(*i*:*i*+*n*))

This way, it reduces the variance of the parameter updates, which can lead to more stable convergence, and it can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256 but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.
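A sketch of the mini-batch variant, again on a toy regression; the batch size, dataset, and learning rate are invented for illustration. The update averages the gradient over a slice of the data, sitting between the two extremes above.

```python
# Illustrative mini-batch gradient descent: one update per batch of
# batch_size examples, averaging the gradient within each batch.

data = [(float(x), 2.0 * x) for x in range(1, 9)]  # y = 2x, 8 examples

def minibatch_gd(data, lr=0.01, epochs=100, batch_size=4):
    w = 0.0
    for _ in range(epochs):
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # average gradient over the mini-batch only
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

print(round(minibatch_gd(data), 3))  # prints 2.0
```

In a real library the inner `sum` would be a single vectorized matrix operation over the batch, which is where the efficiency mentioned above comes from.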

Let’s now examine the professor’s favorite algorithm, but first we need to understand the Momentum algorithm. SGD has trouble navigating ravines, i.e. areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In these scenarios, SGD oscillates across the slopes of the ravine while only making hesitant progress along the bottom towards the local optimum, as in Figure 2a.

Figure 2: Source: Genevieve B. Orr

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in Figure 2b. It does this by adding a fraction *γ* of the update vector of the past time step to the current update vector:

*vt* = *γ vt−1* + *η* ∇*θ J*(*θ*)

*θ* = *θ* − *vt*

Essentially, when using momentum, we push a ball down a hill. The ball accumulates momentum as it rolls downhill, becoming faster and faster on the way (until it reaches its terminal velocity, if there is air resistance, i.e. *γ < *1). The same thing happens to our parameter updates: The momentum term increases for dimensions whose gradients point in the same directions and reduces updates for dimensions whose gradients change directions. As a result, we gain faster convergence and reduced oscillation.
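The two-line momentum update can be sketched on the earlier toy cost J(θ) = (θ − 3)²; the values of γ and η and the function name are illustrative assumptions. Note that the parameter moves by the accumulated velocity, not by the raw gradient.

```python
# Sketch of the classical momentum update on the toy cost
# J(theta) = (theta - 3)**2; gamma and lr values are illustrative.

def momentum_gd(lr=0.1, gamma=0.9, steps=300):
    theta, v = 0.0, 0.0
    for _ in range(steps):
        grad = 2 * (theta - 3)       # gradient of the cost at theta
        v = gamma * v + lr * grad    # accumulate a decaying velocity
        theta -= v                   # move by the velocity, not the raw gradient
    return theta

print(round(momentum_gd(), 4))  # overshoots, oscillates, then settles at 3.0
```

Running this step by step shows the “ball” behavior described above: θ overshoots the minimum while the velocity is large, then the decaying γ factor damps the oscillation out.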

However, a ball that rolls down a hill, blindly following the slope, is highly unsatisfactory. We would like to have a smarter ball, a ball that has a notion of where it is going so that it knows to slow down before the hill slopes up again.

Enter **Nesterov accelerated gradient (NAG)**; it is a way to give our momentum term this kind of prescience. We know that we will use our momentum term *γ vt−1* to move the parameters *θ*. Computing *θ* − *γ vt−1* thus gives us an approximation of the next position of the parameters (the gradient is missing for the full update), a rough idea of where our parameters are going to be. We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters *θ* but w.r.t. the approximate future position of our parameters:

*vt* = *γ vt−1* + *η* ∇*θ J*(*θ* − *γ vt−1*)

*θ* = *θ* − *vt*

Some implementations exchange the signs in the equations.

Again, we set the momentum term *γ* to a value of around 0.9. While Momentum first computes the current gradient (small blue vector in Figure 3) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector), NAG first makes a big jump in the direction of the previous accumulated gradient (brown vector), measures the gradient, and then makes a correction (green vector). This anticipatory update prevents us from going too fast and results in increased responsiveness, which has significantly increased the performance of RNNs on a number of tasks.
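The look-ahead update can be sketched by changing a single line of the momentum code above: the gradient is evaluated at the projected position θ − γv rather than at θ. As before, the hyperparameter values and names are illustrative assumptions on the same toy cost.

```python
# Sketch of the Nesterov accelerated gradient update on the toy cost
# J(theta) = (theta - 3)**2: the gradient is taken at the look-ahead
# point theta - gamma * v, not at theta itself.

def nag(lr=0.1, gamma=0.9, steps=300):
    theta, v = 0.0, 0.0
    for _ in range(steps):
        lookahead = theta - gamma * v   # approximate future position
        grad = 2 * (lookahead - 3)      # gradient at the look-ahead point
        v = gamma * v + lr * grad       # velocity built from the corrected gradient
        theta -= v
    return theta

print(round(nag(), 4))  # prints 3.0
```

Comparing the iterates of `nag` and `momentum_gd` on this cost shows the anticipatory correction at work: the look-ahead gradient starts braking before the minimum is overshot, so the oscillations die out faster.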

#### Final Recommended Listening: ‘Kind of Blue’ by Miles Davis

Author: Ammar Abdullah Omar Balfaqih

Note: The publisher does not guarantee the originality of the content.