Gradient Descent for Linear Regression

When applied specifically to the case of linear regression, we can derive a new form of the gradient descent update by substituting in our actual cost function and hypothesis function:

$$
\begin{aligned}
&\text{repeat until convergence: } \{ \\
&\quad \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) \\
&\quad \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( \left( h_\theta(x_i) - y_i \right) x_i \right) \\
&\}
\end{aligned}
$$

m: the size of the training set

θ0, θ1: the parameters, which are updated simultaneously

xi, yi: values of the given training set (i.e., the data)
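
As a concrete illustration (not part of the original post), here is a minimal Python/NumPy sketch of these two update rules; the function name batch_gradient_descent, the zero initialization, and the default learning rate and iteration count are all illustrative choices.

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for univariate linear regression.

    x, y  : 1-D arrays holding the training set (xi, yi).
    alpha : learning rate.
    Returns the fitted parameters (theta0, theta1).
    """
    m = len(x)                      # size of the training set
    theta0, theta1 = 0.0, 0.0       # initial guess for the parameters

    for _ in range(num_iters):
        predictions = theta0 + theta1 * x      # h_theta(xi) for every i
        errors = predictions - y               # h_theta(xi) - yi

        # Compute both gradients before changing either parameter,
        # so that theta0 and theta1 are updated simultaneously.
        grad0 = np.sum(errors) / m
        grad1 = np.sum(errors * x) / m

        theta0 -= alpha * grad0
        theta1 -= alpha * grad1

    return theta0, theta1
```

Note that both gradients are computed before either parameter is changed, which is what the simultaneous update requires.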

Note that the extra factor of xi in the update for θ1 comes from the derivative of the cost function J with respect to θ1:

$$
\frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)
= \frac{\partial}{\partial \theta_1} \, \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2
= \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_i
$$
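
If you want to sanity-check this derivative numerically, a quick finite-difference comparison like the sketch below will do; the training data and the point (θ0, θ1) at which the gradient is evaluated are arbitrary choices for the check.

```python
import numpy as np

# Arbitrary training set and parameter values for the check.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
theta0, theta1, m = 0.5, 0.8, len(x)

def J(t0, t1):
    """Squared-error cost J(theta0, theta1) = 1/(2m) * sum((h - y)^2)."""
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

# Analytic derivative of J with respect to theta1.
analytic = np.sum((theta0 + theta1 * x - y) * x) / m

# Central finite-difference estimate of the same derivative.
eps = 1e-6
numeric = (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

print(analytic, numeric)   # the two values agree to several decimal places
```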

Starting with an initial guess for our parameters and repeatedly applying these update equations makes our hypothesis more and more accurate.
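
To make "more and more accurate" concrete, the following illustrative sketch (synthetic data drawn around the line y = 1 + 2x; the learning rate and iteration counts are arbitrary) prints the parameters and cost every 500 iterations; the parameters drift toward the generating values (1, 2) as the cost decreases.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)   # data from y ≈ 1 + 2x
m = len(x)

theta0, theta1, alpha = 0.0, 0.0, 0.01
for it in range(1, 2001):
    errors = theta0 + theta1 * x - y                 # h_theta(xi) - yi
    # Simultaneous update of both parameters.
    theta0, theta1 = (theta0 - alpha * np.sum(errors) / m,
                      theta1 - alpha * np.sum(errors * x) / m)
    if it % 500 == 0:
        cost = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * m)
        print(f"iter {it:4d}: theta0={theta0:.3f}  theta1={theta1:.3f}  J={cost:.4f}")
```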

batch gradient descent - gradient descent on the original cost function J, which looks at every example in the entire training set on every step

Assuming the learning rate α is not too large, batch gradient descent always converges to the global minimum. This is because, for linear regression, the cost function J is convex (a bowl-shaped surface) with a single global optimum and no other local optima.
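
One standard way to see this convexity (a sketch of the argument, not spelled out in the original notes): the Hessian of J is a sum of outer products and is therefore positive semidefinite,

$$
\nabla^2 J(\theta_0, \theta_1)
= \frac{1}{m} \sum_{i=1}^{m}
\begin{pmatrix} 1 & x_i \\ x_i & x_i^2 \end{pmatrix}
= \frac{1}{m} \sum_{i=1}^{m}
\begin{pmatrix} 1 \\ x_i \end{pmatrix}
\begin{pmatrix} 1 & x_i \end{pmatrix}
\succeq 0,
$$

so J is convex and any minimum that gradient descent settles into is the global one.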

In the 3D plane:

[Figure: bowl-shaped batch gradient descent graph]

In the 2D plane:

[Figure: 2D batch gradient descent graph]
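
The original post displayed these as images; a rough matplotlib sketch along the following lines (with placeholder training data, since the original data is not available) produces a comparable bowl-shaped 3D surface and a 2D contour view of the cost function.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (needed only on older matplotlib)

# Placeholder training data; the post's original data is not available.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 2.0, 3.2, 3.9, 5.1])
m = len(x)

# Evaluate the cost J(theta0, theta1) on a grid of parameter values.
t0, t1 = np.meshgrid(np.linspace(-10, 10, 100), np.linspace(-2, 4, 100))
J = np.zeros_like(t0)
for i in range(m):
    J += (t0 + t1 * x[i] - y[i]) ** 2
J /= 2 * m

fig = plt.figure(figsize=(10, 4))

# In the 3D plane: bowl-shaped surface plot of J.
ax3d = fig.add_subplot(1, 2, 1, projection="3d")
ax3d.plot_surface(t0, t1, J, cmap="viridis")
ax3d.set_xlabel(r"$\theta_0$")
ax3d.set_ylabel(r"$\theta_1$")
ax3d.set_zlabel("J")

# In the 2D plane: contour plot of the same cost function.
ax2d = fig.add_subplot(1, 2, 2)
ax2d.contour(t0, t1, J, levels=30)
ax2d.set_xlabel(r"$\theta_0$")
ax2d.set_ylabel(r"$\theta_1$")

plt.tight_layout()
plt.show()
```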

Published November 28, 2016