In this post, we are going to talk about mathematical optimization. This term is not to be confused with the word ‘optimization’ that we use in our everyday lives, for instance, improving the efficiency of a workflow. Mathematical optimization means finding an optimal solution from a set of possible candidate solutions. An optimization problem is generally given in the following way: one, there is a set of variables we can play with, and two, there is an objective function that we wish to minimize or maximize.
Let’s build a better understanding of this concept through an example. For instance, let’s imagine that we have to cook a meal for our friends from a given set of ingredients. The question is, how much salt, vegetables, and meat goes into the pan. These are the variables that we can adjust, and the goal is to choose the optimal amount of these ingredients to maximize the tastiness of the meal. Tastiness will be our objective function, and for a moment, we shall pretend that tastiness is an objective measure of a meal.
What does this mean in practice? In our cooking example, after making several meals, we would ask our guests about the tastiness of these meals. From their responses, we would recognize that adding a bit more salt led to very favorable results, and since these people are notorious meat eaters, decreasing the amount of vegetables and increasing the meat content also led to favorable reviews. Therefore, on the back of this newfound knowledge, we will cook more with these variable changes in pursuit of the “best possible meal in the history of mankind.”
A numerical technique that works much like the stated example is Gradient Descent, one of the most popular algorithms for mathematical optimization. Gradient Descent has been around for centuries; these days, it happens to find its most extensive use in machine learning. A remarkably large fraction of modern machine learning research, including the much-hyped ‘deep learning,’ boils down to implementing variants of gradient descent on a very large scale. Like many other machine learning enthusiasts, I was first introduced to the algorithm when I took Andrew Ng’s ‘Machine Learning’ class on Coursera. For many, this algorithm has served as the gateway to Machine Learning, yet a lot miss out on appreciating it as an optimization algorithm in its own right.
The algorithm is extremely simple; however, for some, the mathematics or the explanations in this post might be tedious. Still, to apply it effectively in practice, it’s important to be familiar with the algorithm at its core, which helps in understanding the ways one can tweak it. This blog post aims to demonstrate those aspects of Gradient Descent that are often overlooked. We are first going to look at the mathematical outline of the algorithm. Subsequently, a formal understanding of the algorithm is developed along with an example of an optimization problem, to see gradient descent in action.
How to think about Gradient Descent?
Viewed the right way, gradient descent is straightforward to understand and remember – more so than most algorithms. I highly encourage you to be patient and read the post. Trust me, it is not the mathematics that is intimidating, but the notation.
First off, what problem is gradient descent trying to solve? Unconstrained optimization: for a real-valued function \( f:\mathbb{R}^n\to\mathbb{R} \) defined on an \( n \)-dimensional Euclidean space, the goal is to \( \min f(\mathbf{x}) \) subject to \( \mathbf{x} \in \mathbb{R}^n \). Note that maximizing a function also falls into this problem definition, since maximizing \( f \) is the same as minimizing \( -f \). In unconstrained optimization, there are no constraints; the only consideration is the objective function. Normally, a key ingredient of linear and convex programs is their constraints, which describe which solutions are permitted. There are various ways to transform constrained problems into unconstrained ones (like Lagrangian relaxation), and various ways to extend gradient descent to handle constraints (like projecting back onto the feasible region), though for simplicity those aspects are not discussed in this post. It’s also assumed that \( f \) is differentiable (and hence, continuous).
A Simple Case: \( n=1 \)
Suppose \( n=1 \), such that \( f:\mathbb{R}\to\mathbb{R} \) is a univariate real-valued function.
Intuitively, what would it mean to try to minimize \( f \) via a greedy local search? Suppose we start at a point \( x_0 \): looking to the right, \( f \) goes up, and looking to the left, \( f \) goes down, so we move further to the left, as we want to make \( f \) as small as possible. If we start at \( x_1 \), then \( f \) is decreasing to the right and increasing to the left, so we’d move further to the right. In the first case, the algorithm terminates at the bottom of the left basin; in the second, at the bottom of the right basin. This issue is discussed later; for the initial intuition, we consider a fully convex function.
A little more formally – the basic algorithm with \( n=1 \) is the following.
 while \( f'(x) \neq 0 \):
 if \( f'(x)>0 \), i.e. \( f \) is increasing, move a little to the left
 if \( f'(x)<0 \), i.e. \( f \) is decreasing, move a little to the right
At each step, the derivative of \( f \) is used to decide which direction to move in. Already with \( n=1 \), it is clear from the first graph that the outcome of gradient descent depends on the starting point. For a fully convex function, the algorithm can only terminate at a global minimum. However, this example also shows how, with a non-convex function, gradient descent can compute a local minimum (a point where there’s no way to improve \( f \) by moving a little bit in either direction) without having reached the global minimum.
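The \( n=1 \) procedure above translates almost directly into code. Below is a minimal sketch; the test function \( f(x)=(x-3)^2 \), the starting point, and the parameter values are illustrative choices, not anything prescribed by the algorithm.

```python
# A minimal sketch of the n=1 algorithm. The test function f(x) = (x - 3)**2
# (with derivative f'(x) = 2*(x - 3)) and all parameter values are
# illustrative choices.

def gradient_descent_1d(fprime, x, alpha=0.1, epsilon=1e-8, max_iter=10_000):
    """Move opposite the sign of f'(x) until the derivative is ~0."""
    for _ in range(max_iter):
        slope = fprime(x)
        if abs(slope) <= epsilon:   # stopping rule: f'(x) ~ 0
            break
        x -= alpha * slope          # slope > 0: step left; slope < 0: step right
    return x

x_min = gradient_descent_1d(lambda x: 2 * (x - 3), x=10.0)
print(round(x_min, 6))  # converges to the global minimum at x = 3
```

Since the test function is convex, any starting point would lead to the same minimum; only the number of iterations would change.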
Stepping Towards \( n>1 \)
In almost all real applications of gradient descent, the number of dimensions is much larger than \( 1 \). The immediate increase in complexity can be observed already with \( n=2 \): from a point \( \mathbf{x} \in \mathbb{R}^n \), there are infinitely many directions in which we could move, not just two as in the case of a univariate function (as can be noted in the image below). On a side note, if you recognize the image below, you are awesome!
To develop an intuition, we first consider the case of linear functions of the form \( f(\mathbf{x})=\mathbf{c}^T \mathbf{x} + b \), where \( \mathbf{c}\in\mathbb{R}^n \) is an \( n \)-vector and \( b\in\mathbb{R} \) is a scalar.
Suppose you are currently at a point \( \mathbf{x} \in \mathbb{R}^n \), and you are allowed to move at most one unit of Euclidean distance in whatever direction you want. Where should you go to decrease the function \( f(\mathbf{x}) = \mathbf{c}^T\mathbf{x}+b \) as much as possible? By how much will the function decrease?
To answer this, let \( \mathbf{u} \in \mathbb{R}^n \) be a unit vector, and suppose we move a unit distance from \( \mathbf{x} \) in the direction of \( \mathbf{u} \). The change observed in the objective function is described as follows.
\[ \mathbf{c}^T\mathbf{x} + b \ \longmapsto \ \mathbf{c}^T(\mathbf{x} + \mathbf{u}) + b \]
\[ = (\mathbf{c}^T\mathbf{x} + b) + \mathbf{c}^T\mathbf{u} \]
\[ = (\mathbf{c}^T\mathbf{x} + b) + \Vert\mathbf{c}\Vert_2 \Vert\mathbf{u}\Vert_2 \cos\theta \]
where \( \theta \) denotes the angle between the vectors \( \mathbf{c} \) and \( \mathbf{u} \). To decrease \( f \) as much as possible, we should make \( \cos\theta \) as small as possible (that is, \( \cos\theta = -1 \)), which we do by choosing \( \mathbf{u} \) to point in the opposite direction of \( \mathbf{c} \) (i.e., \( \mathbf{u} = -\mathbf{c}/\Vert\mathbf{c}\Vert_2 \)). Moving one unit in this direction causes \( f \) to decrease by \( \Vert\mathbf{c}\Vert_2 \). Hence, the direction of steepest descent is that of \( -\mathbf{c} \), for a rate of decrease of \( \Vert\mathbf{c}\Vert_2 \) per unit distance.
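This claim is easy to verify numerically. The sketch below picks an arbitrary \( \mathbf{c} \), \( b \), and \( \mathbf{x} \) (the specific values are immaterial) and checks that a unit step along \( -\mathbf{c}/\Vert\mathbf{c}\Vert_2 \) decreases \( f \) by exactly \( \Vert\mathbf{c}\Vert_2 \):

```python
import numpy as np

# Arbitrary c, b, and x; the claim holds for any choice of them.
rng = np.random.default_rng(0)
c = rng.normal(size=5)
b = 1.5
x = rng.normal(size=5)

f = lambda v: c @ v + b
u = -c / np.linalg.norm(c)        # unit vector opposite to c
decrease = f(x) - f(x + u)        # actual decrease after one unit step

print(np.isclose(decrease, np.linalg.norm(c)))  # True
```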
What’s the Gradient About?
What about general (differentiable) functions, the ones we really care about? The idea is to reduce general functions to linear functions. This might sound absurd, given how simple linear functions are and how weird general functions can be, but basic calculus already gives a method for doing this. What it means for a function to be differentiable at a point is that it can be locally approximated at that point by a linear function. For a univariate differentiable function, it’s clear that the linear approximation is the tangent line. That is, at the point \( x \), we approximate the function \( f \) for \( y \) near \( x \) by the linear function
\[ f(y)\approx f(x) + (y-x)\, f'(x) = f(x) - x\, f'(x) + y\, f'(x) \]
where \( x \) is fixed and \( y \) is the variable. It’s also clear that the tangent line is only a good approximation of \( f \) locally; far away from \( x \), the value of \( f \) and that of this linear function need have nothing to do with each other. Thus, being differentiable means that at each point there exists a good local approximation by a linear function, with the specific linear function depending on the choice of the point.
Another way to think about this, which has the benefit of extending to better approximations via higher-degree polynomials, is through Taylor expansions. Taylor’s theorem states (for \( n=1 \)): if all of the derivatives of a function \( f \) exist at a point \( x \), then for all sufficiently small \( \alpha > 0 \), we can write
\[ f(x+\alpha)=f(x) + \alpha\, f^{\prime}(x) + \frac{\alpha^2}{2!}\, f^{\prime\prime}(x) + \frac{\alpha^3}{3!}\, f^{\prime\prime\prime}(x) + \cdots \]
With the first two terms on the right-hand side of the expansion, we have a linear approximation of \( f \) around \( x \), similar to the tangent line approximation, with \( \alpha \) playing the role of \( (y-x) \).
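To see how quickly the linear approximation degrades away from \( x \), here is a quick numerical experiment; the choice of \( f=\sin \) and the expansion point \( x=1.0 \) are arbitrary:

```python
import math

# First-order (tangent line) approximation of f = sin around x = 1.0:
# f(y) ~ f(x) + (y - x) * f'(x). Both f and x here are arbitrary choices.
x = 1.0
fx, fpx = math.sin(x), math.cos(x)
approx = lambda y: fx + (y - x) * fpx

errors = {y: abs(math.sin(y) - approx(y)) for y in (1.01, 1.1, 2.0)}
for y, err in errors.items():
    print(f"y = {y}: approximation error {err:.2e}")
# The error grows roughly like (y - x)**2 / 2, matching the alpha**2 / 2!
# term of the Taylor expansion: good locally, poor far from x.
```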
The discussion has been for \( n=1 \) for simplicity, but all the stated mathematics extends to a higher number of dimensions. For example, the Taylor expansion remains valid in higher dimensions, just with the derivatives replaced by their higher-dimensional analogs. Since we’re using linear approximations, we only need to care about the higher-dimensional analog of the first derivative \( f'(x) \), which is the gradient. For a differentiable function \( f:\mathbb{R}^n\to\mathbb{R} \), the gradient \( \nabla f(\mathbf{x}) \) of \( f \) at \( \mathbf{x} \) is the real-valued \( n \)-vector
\[ \nabla f(\mathbf{x})=\left( \frac{\partial f}{\partial x_1}(\mathbf{x}), \frac{\partial f}{\partial x_2}(\mathbf{x}), \dots, \frac{\partial f}{\partial x_n}(\mathbf{x})\right) \]
in which the \( i \)th component specifies the rate of change of \( f \) as a function of \( x_i \), holding the other \( n-1 \) components of \( \mathbf{x} \) fixed.
To relate this definition to the earlier discussion, note that if \( n=1 \), then the gradient is just the scalar \( f'(x) \). If \( f(\mathbf{x})=\mathbf{c}^T\mathbf{x} + b \) is linear, then \( \partial f/\partial x_i = c_i \) for every \( i \) (irrespective of what \( \mathbf{x} \) is), so \( \nabla f \) is just the constant function everywhere equal to \( \mathbf{c} \).
For a simple but slightly less trivial example, we can consider the quadratic function \( f:\mathbb{R}^n\to\mathbb{R} \) given below, where \( \mathbf{A} \) is a symmetric \( n\times n \) matrix and \( \mathbf{b} \) is an \( n \)-vector.
\[ f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} \]
\[ f(\mathbf{x}) = \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, x_i\, x_j - \sum_{i=1}^{n} b_i\, x_i \]
\[ \frac{\partial f}{\partial x_i}(\mathbf{x})=\sum_{j=1}^n a_{ij}\, x_j - b_i \]
for each \( i=1,2,\dots,n \). We can therefore express the gradient succinctly at each point \( \mathbf{x}\in \mathbb{R}^n \) as \( \nabla f(\mathbf{x}) = \mathbf{A}\mathbf{x} - \mathbf{b} \).
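This closed form can be checked against a finite-difference estimate of the gradient. The sketch below uses an arbitrary \( \mathbf{A} \), \( \mathbf{b} \), and \( \mathbf{x} \), with \( \mathbf{A} \) symmetrized so that the formula applies:

```python
import numpy as np

# Arbitrary symmetric A, vector b, and point x.
rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
A = (M + M.T) / 2                  # symmetrize so grad f(x) = A x - b applies
b = rng.normal(size=4)
x = rng.normal(size=4)

f = lambda v: 0.5 * v @ A @ v - b @ v
grad_exact = A @ x - b

# Central finite differences, one coordinate at a time.
h = 1e-6
grad_fd = np.array([(f(x + h * e) - f(x - h * e)) / (2 * h) for e in np.eye(4)])

print(np.allclose(grad_exact, grad_fd, atol=1e-5))  # True
```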
Gradient Descent: The General Case
The following is the description of the general case of the gradient descent algorithm. It has three parameters – \( x_0 \), \( \epsilon \), and \( \alpha.\)
 Let \( \mathbf{x}:=\mathbf{x_0} \in \mathbb{R}^n \) be an initial point.
 while \( \Vert \nabla f(\mathbf{x}) \Vert_2 > \epsilon \):
 \( \mathbf{x} := \mathbf{x} - \alpha\, \nabla f(\mathbf{x}) \)
Conceptually, gradient descent enters into the following agreement with basic calculus:

 Calculus states that, for \( \mathbf{y} \) close to \( \mathbf{x} \), one can act as if the true function \( f \) were just the linear function \( f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) \). We know what to do with linear functions: move in the opposite direction of the coefficient vector, i.e. in the direction of \( -\nabla f(\mathbf{x}) \), to decrease the function at a rate of \( \Vert\nabla f(\mathbf{x})\Vert_2 \) per unit distance.
 In exchange, gradient descent promises to only take a small step (parametrized by the step size \( \alpha \)) away from \( \mathbf{x} \). Taking a large step would violate the agreement, in that far away from \( \mathbf{x} \), the function \( f \) need not behave anything like the local linear approximation \( f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x}) \) (the tangent line).
The starting point \( x_0 \) can be chosen arbitrarily, and for a non-convex \( f \), this choice can change the output of gradient descent. For a convex function \( f \), gradient descent converges to the same point (the global minimum) no matter how the starting state is chosen. The choice of the start state can still affect the number of iterations until convergence, however. In practice, one should select \( x_0 \) according to the best guess as to where the global minimum is likely to lie, for faster convergence.
The parameter \( \epsilon \) determines the stopping rule. Note that because \( \epsilon > 0 \), gradient descent generally does not halt at an actual local minimum, but rather at some kind of “approximate local minimum.” Since the rate of decrease of a given step is \( \Vert\nabla f(\mathbf{x})\Vert_2 \), at least locally, once \( \Vert\nabla f(\mathbf{x})\Vert_2 \) gets close to zero, each iteration of gradient descent makes very little progress, making it an obvious time to quit. Smaller values of \( \epsilon \) mean more iterations before stopping but a higher-quality solution at termination. In practice, one tries various values of \( \epsilon \) to achieve the right balance between solution quality and computational cost. Alternatively, one can run gradient descent for a fixed amount of time and use whatever point was computed in the final iteration.
The final parameter \( \alpha \), the step size, is perhaps the most important. While gradient descent is flexible enough that different \( \alpha \)’s can be used in different iterations (such as decreasing the value of \( \alpha \) over the course of the algorithm), in practice one typically uses a fixed value of \( \alpha \) over all iterations. All said and done, the best value of \( \alpha \) is typically chosen by experimentation, selecting the best-performing \( \alpha \) over multiple runs of the algorithm.
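Putting the three parameters together, the general algorithm can be sketched in a few lines of numpy. It is applied here to the quadratic \( f(\mathbf{x})=\frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}-\mathbf{b}^T\mathbf{x} \) from earlier, whose gradient is \( \mathbf{A}\mathbf{x}-\mathbf{b} \) and whose minimizer solves \( \mathbf{A}\mathbf{x}=\mathbf{b} \); the particular \( \mathbf{A} \), \( \mathbf{b} \), and parameter values are illustrative:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, epsilon=1e-8, max_iter=100_000):
    """General gradient descent with the three parameters x0, epsilon, alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= epsilon:   # stopping rule
            break
        x = x - alpha * g                  # step opposite the gradient
    return x

# Illustrative quadratic: A symmetric positive definite, so f is convex
# and the minimizer is exactly the solution of A x = b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x_star = gradient_descent(lambda x: A @ x - b, x0=[0.0, 0.0])

print(np.allclose(A @ x_star, b))  # True
```

Since this \( f \) is convex, any `x0` would lead to the same answer; a larger `alpha` could overshoot and diverge, which is exactly the agreement with calculus being violated.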
Gradient Descent in Action
One application where most people first see Gradient Descent in action is ‘Linear Regression.’ Since this post is intended to recognize Gradient Descent as an optimization algorithm, let us consider the following differential calculus problem for a change.
An open rectangular container is to have a volume of 62.5 cm^{3}. Determine the dimensions such that the material required to make the container is minimum.
It is clear that \( x \cdot y \cdot z = 62.5 \implies z = \frac{62.5}{xy}. \)
For a minimum of material to be used, the surface area of the cuboid is to be minimized. Hence, the objective function for minimization is given by
\( f =xy + 2yz + 2xz. \)
Substituting the value of \( z \), the objective simplifies to
\( f = xy + \frac{125}{x} + \frac{125}{y}. \)
The following code shows how Gradient Descent can be implemented to solve the problem.
from sympy import *

x, y = symbols('x y')

# Objective function
f = x*y + 2*(62.5/x) + 2*(62.5/y)

# Differentiating - computing the gradient
fpx = f.diff(x)
fpy = f.diff(y)
grad = [fpx, fpy]

# Starting point
theta0 = 20
theta1 = 20

# Algorithm parameters
alpha = 0.01
epsilon = 0.00000001
iterations = 0
maxIterations = 1000
printData = True

while True:
    # Simultaneously update unknown variables
    tempTheta0 = theta0 - alpha * N(fpx.subs(x, theta0).subs(y, theta1))
    tempTheta1 = theta1 - alpha * N(fpy.subs(y, theta1).subs(x, theta0))
    iterations += 1
    if iterations > maxIterations:
        print("Too many iterations. Adjust alpha.")
        printData = False
        break
    if abs(tempTheta0 - theta0) < epsilon and abs(tempTheta1 - theta1) < epsilon:
        break
    theta0 = tempTheta0
    theta1 = tempTheta1

z = 62.5/(theta0*theta1)

if printData:
    print("x = ", theta0, sep=" ")
    print("y = ", theta1, sep=" ")
    print("z = ", z, sep=" ")
\[ x = 5.00000033165439\\ y = 5.00000033165439\\ z = 2.49999966834564 \]
Why not solve it analytically? – Good question! When performing the minimization explicitly, multiple derivatives must be computed and the resulting equations solved, usually requiring matrix computations. There is always a trade-off when choosing the approach to an optimization problem. Although the analytical method might sometimes outperform an iterative algorithm such as Gradient Descent, as the number of variables and the complexity of the functions increase, analytical methods often become too slow to be feasible.
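For this particular problem, the analytical route is still tractable, and it makes for a nice sanity check of the gradient descent output. The sketch below (separate from the code above) asks sympy to solve the first-order conditions directly:

```python
from sympy import symbols, solve

# Solve the first-order conditions of f = xy + 125/x + 125/y exactly
# (recall 2*(62.5/x) = 125/x). positive=True keeps only physical solutions.
x, y = symbols('x y', positive=True)
f = x*y + 125/x + 125/y

sol = solve([f.diff(x), f.diff(y)], [x, y], dict=True)[0]
z = 62.5 / float(sol[x] * sol[y])
print(sol[x], sol[y], z)  # x = 5, y = 5, z = 2.5
```

This agrees with the values that gradient descent converged to above.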
Epilogue
Phew! That was a lot to digest… I hope that after reading through this article, you understand gradient descent better. My strong recommendation for a next step would be this blog post, which explains the variants of gradient descent in good detail. Moreover, it might well be the right time to move on to applications of mathematical optimization in machine learning algorithms such as linear regression, logistic regression, or even neural networks. Happy optimization!