Gradient descent

Let’s begin with our simple problem of estimating the parameters of a linear regression model with gradient descent. Gradient descent minimizes a cost function \(J(\theta)\), in our case the mean squared error

\[ J(\theta) = \frac{1}{N}\sum_{i=1}^{N}\left(y_{i} - \theta x_{i}^{T}\right)^{2}, \]

by repeatedly stepping in the direction of the negative gradient. Here \(\theta\) is a row vector of \(p + 1\) parameters (an intercept plus one coefficient per predictor) and \(x_{i}\) is the \(i\)-th row of the design matrix \(X\). The gradient \(\nabla J(\theta)\) is in general defined as:

\[ \nabla J(\theta) = \left[ \frac{\partial J}{\partial \theta_{0}},\frac{\partial J}{\partial \theta_{1}}, \cdots, \frac{\partial J}{\partial \theta_{p}} \right]\]
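For example (just to illustrate the definition): if \(J(\theta) = \theta_{0}^{2} + 2\theta_{1}^{2}\), then \(\nabla J(\theta) = \left[ 2\theta_{0}, 4\theta_{1} \right]\).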

In the case of our linear regression model this works out to:

\[ \nabla J(\theta) = -\frac{2}{N}(y^{T} - \theta X^{T})X \]
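To see where this comes from, write the residuals as the row vector \(e = y^{T} - \theta X^{T}\) and differentiate the cost directly:

\[ J(\theta) = \frac{1}{N}\,e\,e^{T}, \qquad \nabla J(\theta) = \frac{2}{N}\,e\,\frac{\partial e^{T}}{\partial \theta} = \frac{2}{N}\,(y^{T} - \theta X^{T})(-X) = -\frac{2}{N}\,(y^{T} - \theta X^{T})\,X \]

since \(e^{T} = y - X\theta^{T}\) and \(\frac{\partial}{\partial \theta}\left(y - X\theta^{T}\right) = -X\).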

The gradient descent algorithm finds the parameters in the following manner:

repeat while (\(||\nabla J(\theta)|| > \epsilon\)){ \[ \theta := \theta - \eta \nabla J(\theta) = \theta + \eta \frac{2}{N}(y^{T} - \theta X^{T})X \] }

where \(\eta\) is the learning rate and \(\epsilon\) is a small convergence tolerance.

As it turns out, this is quite easy to implement in R as a function, which we call gradientR below:

gradientR <- function(y, X, epsilon = 0.0001, eta, iters){
    X = as.matrix(data.frame(rep(1, length(y)), X)) # prepend a column of 1s for the intercept
    N = dim(X)[1]
    print("Initialize parameters...")
    theta = t(as.matrix(rnorm(n = dim(X)[2], mean = 0, sd = 1))) # random start; theta is a row vector
    l2loss = c()
    for(i in 1:iters){
        e = t(y) - theta %*% t(X)          # residuals (1 x N)
        l2loss = c(l2loss, sqrt(sum(e^2))) # L2-loss at the current theta
        grad = -(2/N) * e %*% X            # gradient of the mean squared error
        theta = theta - eta * (2/N) * grad # descent step; the step carries an extra factor of 2/N,
                                           # so the effective learning rate is eta * 2/N (hence the
                                           # large eta = 100 used below)
        if(sqrt(sum(grad^2)) <= epsilon){  # stop once the gradient norm drops below the tolerance
            print("Algorithm converged")
            break
        }
    }
    print(paste("Final gradient norm is", sqrt(sum(grad^2))))
    return(list("coef" = t(theta), "l2loss" = l2loss))
}

Let’s also make a function that estimates the parameters with the normal equations: \[ \theta = (X^{T}X)^{-1}X^{T}y \]

normalest <- function(y, X){
    X = as.matrix(data.frame(rep(1, length(y)), X)) # prepend a column of 1s for the intercept
    theta = solve(t(X)%*%X)%*%t(X)%*%y              # closed-form least squares solution
    return(theta)
}

Now let’s make up some fake data and see gradient descent in action with \(\eta = 100\) and 1000 epochs. Since y is generated independently of the predictors, the true coefficients are all zero:

y = rnorm(n = 10000, mean = 0, sd = 1)
x1 = rnorm(n = 10000, mean = 0, sd = 1)
x2 = rnorm(n = 10000, mean = 0, sd = 1)
x3 = rnorm(n = 10000, mean = 0, sd = 1)
x4 = rnorm(n = 10000, mean = 0, sd = 1)
x5 = rnorm(n = 10000, mean = 0, sd = 1)

ptm <- proc.time()
gdec.eta1 = gradientR(y = y, X = data.frame(x1,x2,x3, x4,x5), eta = 100, iters = 1000)
## [1] "Initialize parameters..."
## [1] "Algorithm converged"
## [1] "Final gradient norm is 9.84293888300476e-05"
proc.time() - ptm
##    user  system elapsed 
##   0.232   0.004   0.237

Let’s check whether we got the correct parameter values by comparing against the normal equations estimate:

normalest(y=y, X = data.frame(x1,x2,x3,x4,x5))
##                           [,1]
## rep.1..length.y..  0.006146655
## x1                 0.011377991
## x2                -0.005618812
## x3                 0.004558885
## x4                -0.003782078
## x5                -0.001489584
gdec.eta1$coef #Coefficients from gradient descent
##                           [,1]
## rep.1..length.y..  0.006162578
## x1                 0.011349010
## x2                -0.005620819
## x3                 0.004591632
## x4                -0.003783501
## x5                -0.001495606

Pretty close.
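As an additional check (output not shown), we can compare against base R’s lm(), which solves the same least squares problem via a QR decomposition rather than an explicit matrix inverse, and quantify the largest disagreement between our two estimates:

coef(lm(y ~ x1 + x2 + x3 + x4 + x5)) # built-in least squares fit

max(abs(normalest(y = y, X = data.frame(x1,x2,x3,x4,x5)) - gdec.eta1$coef)) # largest coefficient gap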

Let’s take a look at the L2-loss for each epoch:

plot(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss,xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta1$l2loss),gdec.eta1$l2loss)

What if we decrease the learning rate to 10?

gdec.eta2 = gradientR(y = y, X = data.frame(x1,x2,x3,x4,x5), eta = 10, iters = 1000)
## [1] "Initialize parameters..."
## [1] "Final gradient norm is 0.0493919754917037"

The final gradient norm is still well above our tolerance of 0.0001, so with the smaller steps the algorithm did not converge within 1000 epochs:

plot(1:length(gdec.eta2$l2loss),gdec.eta2$l2loss,xlab = "Epoch", ylab = "L2-loss")
lines(1:length(gdec.eta2$l2loss),gdec.eta2$l2loss)

Stochastic gradient descent

Gradient descent can be slow to converge because every iteration requires computing the gradient over all \(N\) training examples.

If we instead update the parameters as we iterate through the training examples one at a time, each update is far cheaper, and we can still get excellent estimates despite doing much less work per step.

For stochastic gradient descent, the full gradient

\[ \nabla J(\theta) = -\frac{2}{N}(y^{T} - \theta X^{T})X \]

is replaced by the gradient of the loss at a single observation:

\[ \nabla J_{i}(\theta) = -2(y_{i} - \theta x_{i}^{T})x_{i} \]

where \(x_{i}\) is the \(i\)-th row of \(X\).

Thus the stochastic gradient descent algorithm proceeds as follows for the case of linear regression:

Step 1

Randomly shuffle the data.

Step 2

repeat {
    for \(i := 1, \cdots, N\) {
        \(\theta := \theta - \eta \nabla J_{i}(\theta)\)
    }
}

Part of the homework assignment will be to write an R function that performs stochastic gradient descent.
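As a starting point, here is a minimal sketch of the update loop (the function name sgdR and its argument names are our own choices here, and it omits the convergence checks and loss tracking that a full solution should add):

sgdR <- function(y, X, eta, epochs){
    X = as.matrix(data.frame(rep(1, length(y)), X)) # prepend a column of 1s for the intercept
    N = dim(X)[1]
    theta = t(as.matrix(rnorm(n = dim(X)[2])))      # random start; theta is a row vector
    for(ep in 1:epochs){
        for(i in sample(1:N)){                      # visit the rows in a random order
            e.i = y[i] - sum(theta * X[i,])         # residual for observation i
            theta = theta + eta * 2 * e.i * X[i,]   # step against the per-observation gradient
        }
    }
    return(t(theta))
}

Because each step uses only a single observation, a much smaller learning rate than in the batch version is appropriate, e.g. sgdR(y, data.frame(x1,x2,x3,x4,x5), eta = 0.001, epochs = 1).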