Recently, I realised that I have used the majority of classical machine learning algorithms through scikit, but I haven’t actually implemented them myself. I know the basics of the algorithms, but were I to implement them, I’d most likely fail; there are always a lot of gotchas. So I am going to start now. The first stop is Gradient Descent.

Gradient Descent is a method for finding a minimum (global or local) of a given function; for convex functions, it is guaranteed to find the global minimum. The general algorithm: given a function to minimize, take a step towards the negative of the gradient (the slope) of the function. Keep doing this and eventually you’ll reach a minimum. It works because the negative gradient is the direction of steepest descent, so each step is the locally fastest route towards lower values of the function. Watch this Gradient Descent video by Andrew Ng to get a more intuitive understanding of the process.

1. Until convergence, repeat:
2. Take a gradient descent step for each variable
3. Update the variables with the new values

For example, if we have a function f(x, y), then Gradient Descent operates as follows (a code sketch follows the list):

1. Until f(x, y) is at a minimum, repeat:
2. Take a gradient descent step:
    1. Obtain x_new from a gradient descent step for x
    2. Obtain y_new from a gradient descent step for y
3. Update x and y with x_new and y_new
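Here is a minimal sketch of this loop in Python, assuming the example function \(f(x, y) = x^2 + y^2\) (an arbitrary choice of mine, with its minimum at the origin):

def grad_f(x, y):
    # partial derivatives of f(x, y) = x**2 + y**2:
    # df/dx = 2x, df/dy = 2y
    return 2 * x, 2 * y

x, y = 3.0, -4.0  # arbitrary starting point
alpha = 0.1       # learning rate

for _ in range(100):
    gx, gy = grad_f(x, y)
    # compute both updates first, then apply them simultaneously
    x_new = x - alpha * gx
    y_new = y - alpha * gy
    x, y = x_new, y_new

print(x, y)  # both end up very close to 0

Note that x_new and y_new are computed before either variable is updated; updating x first and then using the new x to compute y’s step would change the behaviour.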

The number of steps needed to reach the minimum depends on the learning rate (\(\alpha\)). The larger the value of \(\alpha\), the bigger the steps, and the quicker it finds the minimum. The opposite is also true: the smaller the \(\alpha\), the smaller the steps, and the longer it takes to reach the minimum. You have to find a balance, though: with too big an \(\alpha\), gradient descent might overshoot and fail to converge, while a smaller \(\alpha\) will find a more accurate minimum but will take more time to converge. The following gradient descent step pseudocode shows how \(\alpha\) is used.

1. Take variable and alpha as input
2. new_variable = variable - alpha * gradient
3. return new_variable
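As a quick arithmetic check: if the variable is currently \(5\), \(\alpha = 0.1\), and the gradient at that point is \(4\), the step gives \(5 - 0.1 \times 4 = 4.6\), nudging the variable towards the minimum.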

Usually, instead of checking for convergence, we use the number of iterations to end the gradient descent loop; this parameter tells gradient descent how many steps to take at most. There might be cases where the loop would otherwise run forever: either the process is diverging or the minimum is at infinity. To prevent that from happening, gradient descent stops after a set number of iterations.

Gradient Descent Parameters

  1. Learning Rate (\(\alpha\))
  2. Number of Iterations

Implementation

Generally, the scenario is this: we decide on a hypothesis function, which has some parameters for which we want the optimal values according to our training data. To find those values, we define a cost function that evaluates the hypothesis for a given set of parameters and gives us a score. Minimizing this score yields the optimal values of the parameters to be used in the hypothesis.
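To make this concrete, take linear regression (which I’ll cover in a future post) as an example: the hypothesis is \(h_\theta(x) = \theta_0 + \theta_1 x\), and a common cost function is the mean squared error,

\[
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2,
\]

where \(m\) is the number of training examples. Gradient descent then searches for the \(\theta_0\) and \(\theta_1\) that minimize \(J(\theta)\).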

Let’s first implement the gradient descent loop. The function takes the following parameters:

  • x: input data
  • y: target column
  • thetas: the parameters to be optimized
  • alpha: learning rate
  • num_iters: number of iterations to end the gradient descent loop after

We first add a column of ones to x as a bias unit. Then we create arrays to hold the historical weights and costs. Now the loop starts: we run it until we reach convergence or exhaust all num_iters iterations. Within the loop, we call the grad_desc_step function to update the thetas. Once we get the updates, we save them and move on to the next iteration if we haven’t already converged.

import numpy as np

def grad_desc(x, y, thetas, alpha=0.01, num_iters=100):
    """
    Gradient descent loop. Expects thetas as a 1-D array of parameters.
    """
    # adds bias (=1) column to the input data
    x = np.hstack((np.ones((x.shape[0], 1)), x))

    # arrays to store thetas and costs: one row for the
    # initial values plus one row per iteration
    theta_updates = np.zeros((num_iters + 1, thetas.shape[0]))
    costs = np.zeros(num_iters + 1)

    theta_updates[0] = thetas
    costs[0] = cost_func(x, y, thetas)

    # gradient descent loop
    for i in range(1, num_iters + 1):
        thetas = grad_desc_step(x, y, thetas, alpha)
        cur_cost = cost_func(x, y, thetas)

        theta_updates[i] = thetas
        costs[i] = cur_cost

        # convergence check: stop when the cost barely changes
        if abs(costs[i] - costs[i-1]) < 1e-3:
            break

    return theta_updates[:i+1], costs[:i+1]

Now we’ll discuss the gradient descent step. We assume we have a function that calculates the gradient of the cost function, cost_func_grad. This function takes x, y and thetas as parameters and returns the gradient at the given thetas. We use this gradient to update the thetas by moving towards the negative of the gradient, with the step size governed by alpha, the learning rate. We return the updated thetas.

def grad_desc_step(x, y, thetas, alpha):
    """
    Updates the parameters once.
    """
    # Calculate the gradient
    grad = cost_func_grad(x, y, thetas)

    # updating the parameters
    thetas = thetas - alpha * grad
    return thetas
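The loop above calls cost_func and cost_func_grad without defining them; they depend on the model being fit. As an illustrative sketch only (these helpers are my assumptions, not a fixed API), here is what the pair could look like for linear regression with the mean squared error cost \(J(\theta)\) from earlier, along with a small usage example:

import numpy as np

# illustrative cost function: mean squared error for a linear hypothesis
# (x already includes the bias column added inside grad_desc)
def cost_func(x, y, thetas):
    m = x.shape[0]
    residuals = x @ thetas - y
    return (residuals ** 2).sum() / (2 * m)

# illustrative gradient of the cost function above
def cost_func_grad(x, y, thetas):
    m = x.shape[0]
    return x.T @ (x @ thetas - y) / m

# usage: recover y = 1 + 2x from noise-free data
x = np.arange(10, dtype=float).reshape(-1, 1)
y = 1 + 2 * x[:, 0]
thetas = np.zeros(2)  # one theta for the bias, one for the feature

theta_updates, costs = grad_desc(x, y, thetas, alpha=0.01, num_iters=1000)
print(theta_updates[-1])  # moves towards [1., 2.]
print(costs[-1])          # moves towards 0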

So this is the generic implementation of gradient descent. In future posts, I’ll use this same function to implement linear and logistic regression.