Introduction to Neural Networks

Neural networks are capable of approximating arbitrary functions, and as such have gained popularity in applications where methods to extract principled features from data are lacking. Here, we will discuss the basic structure of a neural network, and consider a few guidelines for training neural networks on straightforward supervised learning problems. Consider a classification task in which we wish to predict a vector of $P$ labels $\vec{y} = \{y^{(p)}\}_{p=1}^{P}$ given a matrix of $P$ examples $\vec{X}_0 = \{\vec{x}^{(p)}_0\}_{p=1}^{P}$.

The subscript 0 in $\vec{X}_0$ and $\vec{x}_0$ identifies inputs to the neurons in the first layer of a multilayer neural network; $\vec{X}$ and $\vec{x}$, with no subscript, denote inputs to generic neurons. We write $\hat{y}$ for the predictions generated by the model. Our goal is to train a model for which the predicted class labels $\hat{y}$ are as close as possible to the true class labels $\vec{y}$.

Structure of a Neural Network


First, we consider a single neuron, which takes a fixed number of inputs. Each neuron is characterized by weights $\vec{w}$ that are used to compute an input sum $s$ as a function of its inputs $\vec{x}$. The response of the neuron to input $\vec{x}$ is then given by its activation $a$, whose dependence on $s$ is determined by the neuron's activation function $f(s)$. \begin{align} s &= \vec{w} \cdot \vec{x} \\ a &= f(s) \end{align}
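As a concrete sketch, a single neuron can be implemented in a few lines. The sigmoid activation $f(s) = 1/(1 + e^{-s})$ used here is an assumption for illustration, since the text leaves $f$ unspecified:

```python
import math

def neuron(w, x):
    """Response of a single neuron with weights w to input x.

    A sigmoid activation f(s) = 1 / (1 + exp(-s)) is assumed here
    for illustration; any activation function could be substituted.
    """
    s = sum(wi * xi for wi, xi in zip(w, x))  # input sum s = w . x
    return 1.0 / (1.0 + math.exp(-s))         # activation a = f(s)

# Example: a neuron with three inputs, as in Figure 1.
a = neuron([0.5, -0.25, 0.1], [1.0, 2.0, 3.0])
```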

A schematic of a single neuron with three inputs is shown in Figure 1.

Figure 1: Schematic of a single neuron.


Typically, the number of neurons in the input layer is equal to the length of the input $\vec{x}_0$ for a single example, while the number of neurons in the output layer is equal to the number of desired predictions. The optimal number of hidden layers and the number of neurons in each hidden layer will depend on the nature of the machine learning problem.

Once an architecture has been chosen, the model is trained by updating its parameters to minimize a chosen loss function $G$. The variable parameters of the model are the weights $w^l_{m,n}$ connecting pairs of neurons in adjacent layers of the neural network, where $m$ indexes a neuron in layer $l-1$ and $n$ indexes a neuron in layer $l$. The neural network is trained in an iterative manner by repeatedly computing the gradient of the loss function with respect to each weight, $\partial G/ \partial w^l_{m,n}$, and then updating each weight in the direction opposite to the gradient.
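As a concrete (if inefficient) illustration of what $\partial G / \partial w$ means, the gradient of a loss with respect to each weight can be approximated by central finite differences. The function names here are illustrative; in practice the gradient is computed analytically, which is far cheaper:

```python
def finite_diff_grad(G, w, eps=1e-6):
    """Approximate dG/dw_i for each weight by central differences.

    G is a loss as a function of a list of weights w; this is a
    generic numerical illustration, not an efficient training method.
    """
    grad = []
    for i in range(len(w)):
        w_plus = list(w)
        w_plus[i] += eps
        w_minus = list(w)
        w_minus[i] -= eps
        grad.append((G(w_plus) - G(w_minus)) / (2 * eps))
    return grad

# Example: the quadratic loss G(w) = sum_i w_i^2 has gradient 2 * w.
g = finite_diff_grad(lambda w: sum(wi ** 2 for wi in w), [1.0, -2.0])
```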

A schematic of a neural network with $i$ inputs and $k$ outputs is shown in Figure 2.

Minimizing a Loss Function


Figure 2: Schematic of a neural network.


To understand how we minimize our loss function, we first observe that the change $\Delta G$ in the loss that results when we increment the weights of neuron $n$ in layer $l$ of our network by a small step $\Delta \vec{w}^l_n$ can be approximated, assuming that $\Delta \vec{w}^l_n$ is sufficiently small, by \begin{equation} \Delta G \approx \nabla G \cdot \Delta \vec{w}^l_n \end{equation} where $\nabla G$ is the gradient of the loss with respect to $\vec{w}^l_n$. We can show that under this approximation, for a fixed step size, $\Delta G$ is minimized when the step is taken opposite the gradient, $\Delta \vec{w}^l_n = - \eta \, \nabla G$ with $\eta > 0$.

To train a neural network, then, we compute the gradient $\nabla G$ averaged over some number of training samples $b$. When $b = P$ we are performing true gradient descent. However, computing the gradient over all of the training examples can be expensive. In practice, the gradient can be well approximated by the gradient over a random subset of $b < P$ samples; when we use these random subsets to perform each update, we call this stochastic gradient descent with mini-batches. The mini-batch size $b$ is chosen to balance the use of sufficient examples to accurately approximate the gradient against the computational cost of computing the gradient for many examples. Typically, a $b$ of $\mathcal{O}(100)$ is chosen such that $P \bmod b = 0$.

Once the gradient has been computed, each weight is updated incrementally in the direction $-\nabla G$, with a step size that is determined by the learning rate, $\eta$. The update rule for the weights $\vec{w}^l_n$ of a single neuron in the network is then given by \begin{equation} \vec{w}^l_n \leftarrow \vec{w}^l_n - \eta \, \frac{1}{b} \sum_b \nabla G_b \end{equation}
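The update rule can be sketched as a short training loop. The names `sgd_epoch` and `grad_G` are illustrative (not from any particular library), and `grad_G` is assumed to return the per-sample gradient of the loss:

```python
import random

def sgd_epoch(w, grad_G, samples, b, eta):
    """One epoch of mini-batch stochastic gradient descent for the
    weight vector w (stored as a list) of a single neuron.

    grad_G(w, sample) is assumed to return the gradient of the loss
    for one training sample; names here are illustrative.
    """
    random.shuffle(samples)  # draw random mini-batches each epoch
    for start in range(0, len(samples), b):
        batch = samples[start:start + b]
        # Average the per-sample gradients over the mini-batch ...
        avg = [sum(grad_G(w, s)[i] for s in batch) / len(batch)
               for i in range(len(w))]
        # ... then step opposite the gradient: w <- w - eta * avg.
        w = [wi - eta * gi for wi, gi in zip(w, avg)]
    return w

# Example: minimize G = (w - s)^2 averaged over samples s, whose
# minimum is the sample mean; the per-sample gradient is 2 * (w - s).
samples = [0.0, 1.0, 2.0, 3.0]
w = [10.0]
for _ in range(200):
    w = sgd_epoch(w, lambda w, s: [2.0 * (w[0] - s)], samples, b=2, eta=0.1)
```

With a fixed learning rate the iterate settles into a small neighborhood of the sample mean (1.5) rather than converging exactly, which is one motivation for the learning rate decay discussed below.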

A single pass through all $P$ training examples results in $P/b$ updates to the weights; this is called an epoch of training.

The learning rate $\eta$, which determines the step size, must be chosen carefully. If $\eta$ is too large, it may lead to instability and difficulty converging: because we are using a first-order approximation of our loss function, as we increase our step size, higher-order terms may start to dominate. However, a value of $\eta$ that is too small may cause the rate of convergence to become prohibitively slow. To overcome this problem, a learning rate decay, $d_{\eta}$, can be used to decrease the learning rate by a factor of $d_{\eta}$ on a predetermined schedule, such that the network takes large steps towards a minimum early in training, but is then able to settle into the minimum by taking small steps at the end of training. Usually, we terminate training after a fixed maximum number of epochs, or when the loss function stops decreasing, whichever occurs first. We typically select an initial learning rate that is $\mathcal{O}(10^{-1})$, and then choose $d_{\eta}$ such that the learning rate will decay to $\mathcal{O}(10^{-6})$ by the time the maximum number of epochs is reached.
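The decay factor implied by this recipe can be computed directly. The multiplicative per-epoch schedule assumed here is one common choice; other schedules (e.g. step or linear decay) work analogously:

```python
def decay_factor(eta_init, eta_final, n_epochs):
    """Per-epoch multiplicative decay d_eta chosen so that
    eta_init * d_eta**n_epochs == eta_final (exponential schedule)."""
    return (eta_final / eta_init) ** (1.0 / n_epochs)

# Decay from eta = 1e-1 to 1e-6 over 100 epochs, as in the text.
d = decay_factor(1e-1, 1e-6, 100)
etas = [1e-1 * d ** epoch for epoch in range(101)]
```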