Neural Networks: The Backward Pass

Tashfeen, Ahmad

Figure 1: Learning from mistakes

"Good judgement comes from experience. Experience comes from bad judgement." [4] – Mullah Nasruddin

This is the second part of the article1 Neural Networks in which we demonstrate the mathematics and code of a simple Multi-layer Perceptron. In the first part we talked about the forward pass, now we shall talk about the backward pass2.

1 Error in the Network

We denote our sample examples passed to the Network as X, y Numpy arrays with sets \(\mathcal{X} \subset \mathbb{R}^n\) and \(\mathbf{y}\subset \{0, 1\}^m\) (one-hot encodings), \(n,m \in \mathbb{N}\). With random \((\mathcal{W}, \mathbf{b})\), we have no reason to expect the network to behave in any reasonable way. Right now, it is as good as a guessing machine. How good is a guessing machine though? We need a way to measure the error in the network after we forward-feed some \(\vec{x}\). Let's denote the output of the network at this initial stage for some \(\vec{x} \in \mathcal{X}\) to be \(f(\vec{x}) = \vec{a}^{(L)} = \hat{\vec{y}}\), remember that \(\hat{\vec{y}} \neq \vec{y}\). One way to evaluate the error in the network for some \(\vec{x}\) is as a function of weights and biases \((\mathcal{W}, \mathbf{b})\). We do this by calculating the squared difference between the \(\hat{\vec{y}}\) we get and the \(\vec{y}\) we expect. If we sum up this error and calculate the mean then we have an overall error estimate known as the Mean Squared Error (\(MSE\)).

\begin{align*} E = E(\mathcal{W}, \mathbf{b}, \vec{x}) & = (\vec{y} - f(\vec{x}))^2 \\ & = (\vec{y} - \vec{a}^{(L)})^2 \\ & = (\vec{y} - \hat{\vec{y}})^2 \\ MSE & = \sum_{\vec{x} \in \mathcal{X}}\frac{E(\mathcal{W}, \mathbf{b}, \vec{x})}{N} \\ & = \frac{1}{N}\sum_{\vec{x} \in \mathcal{X}} (\vec{y} - \hat{\vec{y}})^2 \end{align*}

Adding this to the Python class,

def error(self, y, a):
  return np.sum(np.square(y - a))

2 Minimising the Error

Before our problem was to find a pair \((\mathcal{W}, \mathbf{b})\) that enables our network to perform well. We can now state this goal precisely: we need \((\mathcal{W}, \mathbf{b})\) such that \(MSE \approx 0\)3. We have boiled the problem down to minimising the function \(E\) through its parameters. Fortunately for us, this is no new problem in the world of mathematics.

2.1 Negative Rate of Change

Imagine someone tells you to minimise some random function \(g(x)\) you know nothing about. The only thing you can ask is, for an \(x\) you pick, what is \(g(x)\)? Most people will start with some random \(x\) and nudge it in different directions to see which direction produces a smaller \(g(x)\) and hope that if they keep nudging \(x\) in that direction, \(g(x)\) will keep going down.

Russell and Norvig explain a similar algorithm in their book Artificial Intelligence–A Modern Approach with the statement, "trying to find the top of Mount Everest in a thick fog while suffering from amnesia" [1]. Except, in our case, we are trying to find the base-camp in a thick fog with amnesia, i. e., going down. We can't see too far, we don't remember where we came from, but we have a sense of downwards direction. We just know that if we keep stepping (making nudges) in the downwards direction, we shall eventually reach a local minimum. This idea is the very key to finding the correct weights and biases. Calculus students know this rate of change as the derivative and most algebra students know it as the slope. We never get into algorithms to minimise the simple derivatives or slopes because we can just set them equal to zero and solve. But, in order to illustrate the algorithm, let's consider a simple polynomial as an example.

2.2 Descending the Derivative of a Parabola

Take this function \(g(x) = x^2\), we know that the derivative is \(2x\) and \(2x = 0 \iff x = 0\) hence \(x = 0 \Rightarrow g(x) = 0\). We can do this because \(g\) has such a nice and simple definition. What \(g\) it did not have this nice and simple definition like our network \(f\). We could do what we just talked about, start at some random \(x_0\) and make small \(\eta\) sized nudges to it using the negative derivative: \(x_{n+1} = x_n - \eta g'(x_n)\) as shown in figure 2. The small nudge to the initial \(x_0\) gives us \(x_1\) and then we calculate \(-\eta g'(x_1)\) to figure out the nudge we want to make to \(x_1\) and the series continues till we find some \(x_i\) for which \(g(x_i)\) is small enough.

Sorry, your browser does not support SVG.
Figure 2: Derivative descent

3 Gradient Descent

If our function is a polynomial of more than two variables: \(\vec{p} = [p_1,p_2,...,p_n]\) then to figure out the nudges we use gradients instead of the derivatives. The gradient of some function \(h(\vec{p})\) is denoted as a vector of partial derivatives using the symbol nabla: \(\nabla\).

\[ \nabla h(\vec{p}) = \Bigg[\frac{\partial h(\vec{p})}{\partial p_1}, \frac{\partial h(\vec{p})}{\partial p_2}, \frac{\partial h(\vec{p})}{\partial p_3}, ... , \frac{\partial h(\vec{p})}{\partial p_n}\Bigg], \quad p_i \in \vec{p} \]

Depending upon how rusty your multi-variable calculus is, this video by Grant Sanderson may serve as a nice refresher. Negative gradients tell us the downwards direction vector and as per our algorithm, we make a small \(\eta\) sized nudge in this direction. Keep in mind that this "making a small \(\eta\) sized nudge", just means subtracting \(\eta \nabla h(\vec{p})\) from \(\vec{p}\). Similar to above, we then compute the gradient again at the new \(\vec{p}\) and so forth. We call a single nudge a step. Remember we passed the __init__ definition in our Network class an eta? This \(\eta\) is what we meant by it, known by other names such as the learning rate or the step size.

3.1 Descending the Gradient of a Paraboloid

For \(\vec{p} = [a, b]\), take \(g(\vec{p}) = g(a, b) = a^2 + b^2\), imagine this to be our error function where \(a\) is analogous to \(\mathcal{W}\) and \(b\) is analogous to \(\mathbf{b}\). Let's write the gradient descent algorithm,

  1. Pick a random \(\vec{p}_{0}\).
  2. Let \(\vec{p}_{i+1} = \vec{p}_{i} - \eta \nabla g(\vec{p}_i)\).
  3. Repeat step 2 for all \(i\).

Can you implement the gradient descent for the function \(g\) in your favourite programming language? Here, I'll do it in Python, but first let's write the gradient of \(g\).

\[ \nabla g(\vec{p}_i) = \Bigg[\frac{\partial g(\vec{p})}{\partial p_1}, \frac{\partial g(\vec{p})}{\partial p_2}\Bigg] = \Bigg[\frac{\partial g(a,b)}{\partial a}, \frac{\partial g(a,b)}{\partial b}\Bigg] = \Bigg[\frac{\partial (a^2 + b^2)}{\partial a}, \frac{\partial (a^2 + b^2)}{\partial b}\Bigg] = [2a, 2b] \]

def g(a, b, get_gr=False):
  val, nabla_g = a**2 + b**2, [2 * a, 2 * b]
  error = (0 - a)**2 + (0 - b)**2
  return nabla_g, error if get_gr else val

def GD(a=10.0, b=10.0, eta=0.3, print_steps=True):  # Gradient Descent
  step, step_error_sum = 0, 0
  for step in range(7):
    gradient, step_error = g(a, b, get_gr=True)
    step_error_sum += step_error
    a = a - (eta * gradient[0])
    b = b - (eta * gradient[1])
    if print_steps:
      to_print = 'Step: {:>2}, (a, b): ({:>1.2}, {:>1.2}) Error: {:>3.5}'
      print(to_print.format(step, a, b, step_error))
  return step_error_sum / step


We know that \(g(a, b) = 0\) for \((a,b)=(0,0)\). As we can see in the printed output (also in figure 3), at each gradient descent step, the parameters \((a, b)\) get closer to \((0,0)\) alongside the error approaching \(0\). This means that we converge towards \((a,b)\) that minimise \(g(a,b)\) while descending on the gradient.

Step:  0, (a, b): (4.0, 4.0)     Error: 200.0
Step:  1, (a, b): (1.6, 1.6)     Error: 32.0
Step:  2, (a, b): (0.64, 0.64)   Error: 5.12
Step:  3, (a, b): (0.26, 0.26)   Error: 0.8192
Step:  4, (a, b): (0.1, 0.1)     Error: 0.13107
Step:  5, (a, b): (0.041, 0.041) Error: 0.020972
Step:  6, (a, b): (0.016, 0.016) Error: 0.0033554
Sorry, your browser does not support SVG.
Figure 3: Gradient Descent

4 Stochastic Gradient Descent

Instead of the simple gradient descent, we use the Stochastic Gradient Descent to find out the desired weights and biases. Stochastic gradient descent is just a small optimisation on the vanilla gradient descent. Let's first write down the gradient of the error function. Since \(E(\mathcal{W}, \mathbf{b}, \vec{x})\) is defined in terms of all weights \(W^{(l)} \in \mathcal{W}\) and biases \(\vec{b}^{(l)} \in \mathbf{b}\), its gradient vector will consist of the partial derivatives with respect to weights \(W^{(l)}\) and biases \(\vec{b}^{(l)}\).

\[ \nabla E(\mathcal W, \mathbf b, \vec{x}) = \Bigg[\frac{\partial E}{\partial W^{(L)}}, \frac{\partial E}{\partial \vec{b}^{(L)}}, \frac{\partial E}{\partial W^{(L-1)}}, \frac{\partial E}{\partial \vec{b}^{(L-1)}}, ... , \frac{\partial E}{\partial W^{(2)}}, \frac{\partial E}{\partial \vec{b}^{(2)}}\Bigg] \]

The vanilla gradient descent to find the weights and biases is:

  1. Pick \(\vec{x}_{0} \in \mathcal{X} = \{\vec{x}_{1}, \vec{x}_{2}, \vec{x}_{3}, ..., \vec{x}_{n}\}\).
  2. Let \((\mathcal{W}, \mathbf{b})_{i+1} = (\mathcal{W}, \mathbf{b})_{i} \circleddash \eta \nabla E(\mathcal W_i, \mathbf b_i, \vec{x}_i)\)4.
  3. Repeat step 2 for all \(i\).

For the stochastic gradient descent, we first shuffle \(\mathcal{X}\) and then split it down into smaller subsets we call mini-batches. We'll denote a single mini-batch as \(\mathcal{B}_i \subset \mathcal{X}\). We passed the size of a single mini-batch to our Network class with the variable bt_size. Let's include a generator function in the Python class which will shuffle X and then yield \(\mathcal{B}_i\)'s until it runs out.

def batches(self):
  shuffle_ind = np.arange(len(self.X))
  shuffle_X, shuffle_y = self.X[shuffle_ind], self.y[shuffle_ind]
  i, num_batches = 0, int(len(shuffle_X) / self.bt_size)
  for i in range(num_batches - 1):
    l, u = i * self.bt_size, (i + 1) * self.bt_size
    mini_batch_X = shuffle_X[l:u]
    mini_batch_y = shuffle_y[l:u]
    yield zip(mini_batch_X, mini_batch_y)
  mini_batch_X = shuffle_X[(i + 1) * self.bt_size:]
  mini_batch_y = shuffle_y[(i + 1) * self.bt_size:]
  yield zip(mini_batch_X, mini_batch_y)

Each stochastic gradient descent step corresponds to a mini-batch. For all \(\vec{x} \in \mathcal{B_i}\), we compute \(\nabla E(\mathcal{W}, \mathbf{b}, \vec{x})\), sum them, and calculate their mean. Now we use this average gradient from \(\mathcal{B_i}\) to make our nudges. We write the stochastic gradient descent algorithm:

  1. Start with \(\mathcal{B}_0\).
  2. Compute \((\mathcal{W}, \mathbf{b})_{i+1}\), \[ (\mathcal{W}, \mathbf{b})_{i+1} = (\mathcal{W}, \mathbf{b})_{i} \circleddash \eta \Bigg(\frac{1}{|\mathcal{B}_i|} \sum_{\vec{x} \in \mathcal{B}_i} \nabla E(\mathcal W_i, \mathbf b_i, \vec{x}) \Bigg) \]
  3. Repeat step 2 for all \(i\).

The step two above looks a bit intimidating; but the code that implements it looks a lot better. Let's implement the stochastic gradient descent.

def SGD(self, print_steps):  # Stochastic Gradient Descent
  step, step_error_sum = 0, 0
  for mini_batch in self.batches():
    gradient, step_error = self.average_gradient(mini_batch)
    step_error_sum += step_error
    self.Wb = self.Wb - (self.eta * gradient)
    self.W, self.b = self.Wb
    if print_steps:
      to_print = 'SGD step: {:>7}, Error: {:>3.5}'
      print(to_print.format(step, step_error))
    step += 1
  return step_error_sum / step

def average_gradient(self, mini_batch):
  g_sum, error_sum = self.backpropagation(*next(mini_batch))
  for x, y in mini_batch:
    batch_gradient, error = self.backpropagation(x, y)
    error_sum += error
    g_sum += batch_gradient
  return g_sum / self.bt_size, error_sum / self.bt_size

Observe how the definition of the function SGD(...) is not much different than the vanilla implementation of GD(...) from before? In-fact, we only changed a few lines. What is this self.backpropagation(x, y)?

5 Propagate Backwards

We have arrived at the belly of the beast. Gradient descent depends on our ability to calculate the gradient of the error function \(E\)5. Consequently, we can only calculate the gradient if we can calculate the partial derivatives of \(E\) with respect to some \(W^{(l)} \in \mathcal{W}\) and \(\vec{b}^{(l)} \in \mathbf{b}\). Let's remind ourselves of the \(\nabla E\).

\[ \nabla E(\mathcal W, \mathbf b, \vec{x}) = \Bigg[\frac{\partial E}{\partial W^{(L)}}, \frac{\partial E}{\partial \vec{b}^{(L)}}, \frac{\partial E}{\partial W^{(L-1)}}, \frac{\partial E}{\partial \vec{b}^{(L-1)}}, ... , \frac{\partial E}{\partial W^{(2)}}, \frac{\partial E}{\partial \vec{b}^{(2)}}\Bigg] \]

Backpropagation is the algorithm we use to calculate the above gradient. We'll demonstrate it with a small example. Let's architect a network to classify a ten bit binary number as even (True) or odd (False). Note that some of the choices in this architecture will be made not for the sake of solving the problem optimally, but for the illustration of the appropriate concepts. For a total of \(L = 3\) layers, put ten neurons in the input layer to hold the activation caused by the ten bits. There are three neurons in the hidden layer. Finally, two neurons in the output layer to output the probability of an input \(\vec{x}\) (which is a binary number) being even or odd, i. e., \(\vec{a}^{(3)} = [0, 1]\) for even and \(\vec{a}^{(3)} = [1, 0]\) for odd.

Sorry, your browser does not support SVG.
Figure 4: Multi-layer Perceptron to classify ten bit binary numbers per parity

For the network in figure 4, we have the following weights, biases and activations:

\begin{align*} (\mathcal{W}, \mathbf{b}) & = (\{W^{(2)}_{3,10},W^{(3)}_{2,3}\}, \{\vec{b}^{(2)}, \vec{b}^{(3)}\}) \\ \vec{a}^{(1)} & = \vec{x} \\ \vec{z}^{(2)} & = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)} \\ & = W^{(2)}\vec{x} + \vec{b}^{(2)} && \text{and} \quad \vec{a}^{(2)} = \sigma(\vec{z}^{(2)}) \\ \vec{z}^{(3)} & = W^{(3)}\vec{a}^{(2)} + \vec{b}^{(3)} \\ & = W^{(3)}\sigma(W^{(2)}\vec{x} + \vec{b}^{(2)}) + \vec{b}^{(3)} && \text{and} \quad \vec{a}^{(3)} = \sigma(\vec{z}^{(3)}) \\ \end{align*}

We can also write out the gradient \(\nabla E\),

\begin{align*} \nabla E(\mathcal W, \mathbf b, \vec{x}) & = \Bigg[\frac{\partial E}{\partial W^{(L)}}, \frac{\partial E}{\partial \vec{b}^{(L)}}, \frac{\partial E}{\partial W^{(L-1)}}, \frac{\partial E}{\partial \vec{b}^{(L-1)}}\Bigg] \\ & = \Bigg[\frac{\partial E}{\partial W^{(3)}}, \frac{\partial E}{\partial \vec{b}^{(3)}}, \frac{\partial E}{\partial W^{(2)}}, \frac{\partial E}{\partial \vec{b}^{(2)}}\Bigg] \\ \end{align*}

5.1 Chain Rule and Composite Functions

Let's start with the \(\frac{\partial E}{\partial W^{(L)}}\).

\begin{align*} \frac{\partial E}{\partial W^{(L)}} & = \frac{\partial (\vec{y} - \hat{\vec{y}})^2}{\partial W^{(L)}} \\ & = \frac{\partial (\vec{y} - \vec{a}^{(L)})^2}{\partial W^{(L)}} \\ \end{align*}

We have a composite function: \((\vec{y} - \vec{a}^{(L)})^2\). We'll need the chain rule to move on. The application of chain rule here needs a little care. You might already be able to take polynomial derivatives using the chain rule without thinking much of it; but, in order to truly understand its application here, the reader should not only be able to apply the chain rule to the polynomials but also be aware of the steps they are taking in terms of its notation. Here is a nice example that walks us through the chain rule in Leibniz's notation6, which is what we'll be using. We state the chain rule.

\[ \frac{\operatorname{d} f \circ g (x)}{\operatorname{d} x} = \frac{\operatorname{d} f(g(x))}{\operatorname{d} x} = \frac{\operatorname{d} f(g(x))}{\operatorname{d} g(x)} \times \frac{\operatorname{d} g(x)}{\operatorname{d} x} = \frac{\operatorname{d} f}{\operatorname{d} g} \times \frac{\operatorname{d} g}{\operatorname{d} x} \]

Following is an intuitive way to think about it [3].

\[\begin{array}{c} \text{The change in }f\circ g\text{ caused by}\\ \text{a small unit change in }x \end{array}=\begin{array}{c} \text{The change in }f\text{ caused by}\\ \text{a small unit change in }g \end{array}\times\begin{array}{c} \text{The change in }g\text{ caused by}\\ \text{a small unit change in }x. \end{array}\]


\begin{align*} \frac{\partial E}{\partial W^{(L)}} & = \frac{\partial (\vec{y} - \vec{a}^{(L)})^2}{\partial W^{(L)}} \\ & = \frac{\partial (\vec{y} - \vec{a}^{(L)})^2}{\partial (\vec{y} - \vec{a}^{(L)})} \frac{\partial (\vec{y} - \vec{a}^{(L)})}{\partial W^{(L)}} && \text{Chain Rule} \\ & = 2(\vec{y} - \vec{a}^{(L)}) \Bigg( \frac{\partial \vec{y}}{\partial W^{(L)}} - \frac{\partial \vec{a}^{(L)}}{\partial W^{(L)}} \Bigg) && \text{Since } \frac{\partial (\vec{y} - \vec{a}^{(L)})^2}{\partial (\vec{y} - \vec{a}^{(L)})} = 2(\vec{y} - \vec{a}^{(L)})\\ & = 2(\vec{y} - \vec{a}^{(L)}) \Bigg(0 - \frac{\partial (\vec{a}^{(L)})}{\partial W^{(L)}} \Bigg) && \text{Since } \frac{\partial (\vec{y})}{\partial W^{(L)}} = 0 \\ & = 2(\vec{y} - \vec{a}^{(L)}) \Bigg(- \frac{\partial \vec{a}^{(L)}}{\partial W^{(L)}} \Bigg) \\ & = -2(\vec{y} - \vec{a}^{(L)}) \frac{\partial \vec{a}^{(L)}}{\partial W^{(L)}} \\ & = -2(\vec{y} - \vec{a}^{(L)}) \frac{\partial \sigma(\vec{z}^{(L)})}{\partial W^{(L)}} \\ & = -2(\vec{y} - \vec{a}^{(L)}) \frac{\partial \sigma(\vec{z}^{(L)})}{\partial \vec{z}^{(L)}} \frac{\partial \vec{z}^{(L)}}{\partial W^{(L)}} && \text{Chain Rule} \\ & = -2(\vec{y} - \vec{a}^{(L)}) \sigma'(\vec{z}^{(L)}) \frac{\partial (W^{L}\vec{a}^{(L-1)}+\vec{b}^{(L)})}{\partial W^{(L)}} && \text{Since } \frac{\partial \sigma(\vec{z}^{(L)})}{\partial \vec{z}^{(L)}} = \sigma'(\vec{z}^{(L)}) \\ & = -2(\vec{y} - \vec{a}^{(L)}) \sigma'(\vec{z}^{(L)}) \frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial W^{(L)}} + \frac{\partial (\vec{b}^{(L)})}{\partial W^{(L)}} \\ & = -2(\vec{y} - \vec{a}^{(L)}) \sigma'(\vec{z}^{(L)}) \frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial W^{(L)}} && \text{Since } \frac{\partial (\vec{b}^{(L)})}{\partial W^{(L)}} = 0 \\ & = -2(\vec{y} - \vec{a}^{(L)}) \sigma'(\vec{z}^{(L)}) \vec{a}^{(L-1)} && \text{Since } \frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial W^{(L)}} = \vec{a}^{(L-1)} \end{align*}

Similarly for \(\frac{\partial E}{\partial \vec{b}^{(L)}}\), the same derivation will apply except at the end we'll have,

\[ \frac{\partial (\vec{b}^{(L)})}{\partial \vec{b}^{(L)}} = 1 \Rightarrow \frac{\partial E}{\partial \vec{b}^{(L)}} = -\frac{\partial (\vec{y} - \vec{a}^{(L)})^2}{\partial (\vec{y}-\vec{a}^{(L)})} \frac{\partial (\sigma(\vec{z}^{(L)}))}{\partial \vec{z}^{(L)}} \frac{\partial \vec{z}^{(L)}}{\partial \vec{b}^{(L)}} = -2(\vec{y}-\vec{a}^{(L)})\sigma'(\vec{z}^{(L)})(1) \]

You'll often see the term \(-2(\vec{y} - \vec{a}^{(L)}) \sigma'(\vec{z}^{(L)})\) written as \(\delta^{(L)}\). This is known as the delta rule or error in the layer \(l\). We have the three grand equations (each arranged correctly per dimension) as a result7:

\begin{align} \frac{\partial E}{\partial W^{(L)}} & = \vec{a}^{(L-1)}\delta^{(L)} \\ \delta^{(L)} & = -2(\vec{y} - \vec{a}^{(L)}) \odot \sigma'(\vec{z}^{(L)}) \\ \frac{\partial E}{\partial \vec{b}^{(L)}} & = \delta^{(L)} \end{align}

But what about all the rest of the partial derivatives for layers \(l < L\)? Well, we just keep applying the chain rule in the above derivation instead of stopping at \(\partial (W^{L}\vec{a}^{(L-1)})\). Let's do it for \(l = L-1, \frac{\partial E}{\partial W^{(L-1)}}\).

\begin{align*} \frac{\partial E}{\partial W^{(L-1)}} & = \delta^{(L)}\frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial W^{(L-1)}} \\ & = \delta^{(L)}\frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial a^{(L-1)}} \frac{\partial \vec{a}^{(L-1)}}{\partial W^{(L-1)}} && \text{Chain Rule}\\ & = \delta^{(L)}\frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial a^{(L-1)}} \frac{\partial \sigma(\vec{z}^{(L-1)})}{\partial W^{(L-1)}} \\ & = \delta^{(L)}\frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial a^{(L-1)}} \frac{\partial \sigma(\vec{z}^{(L-1)})}{\partial \vec{z}^{(L-1)}} \frac{\partial \vec{z}^{(L-1)}}{\partial W^{(L-1)}} && \text{Chain Rule}\\ & = \delta^{(L)}\frac{\partial (W^{L}\vec{a}^{(L-1)})}{\partial a^{(L-1)}} \frac{\partial \sigma(\vec{z}^{(L-1)})}{\partial \vec{z}^{(L-1)}} \frac{\partial (W^{(L-1)}\vec{a}^{(L-2)} + \vec{b}^{(L-1)})}{\partial W^{(L-1)}} \\ & = \delta^{(L)}W^{L} \sigma'(\vec{z}^{(L-1)}) \vec{a}^{(L-2)} \\ & = \delta^{(L-1)} \vec{a}^{(L-2)} && \text{Letting } \delta^{(L-1)} = \delta^{(L)}W^{L} \sigma'(\vec{z}^{(L-1)})\\ \end{align*}

We now generalise (correcting dimensions) through induction, for all \(l = L - i\),

\begin{align} \frac{\partial E}{\partial W^{(l)}} & = \vec{a}^{(l-1)}\delta^{(l)} && \text{Same as }\frac{\partial E}{\partial W^{(L-i)}} = \vec{a}^{(L-i-1)}\delta^{(L-i)} \\ \delta^{(l)} & = \Big(\big(W^{(l+1)}\big)^T \delta^{(l+1)}\Big) \odot \sigma'(\vec{z}^{(l)}) && \text{Same as }\delta^{(L-i)} = \Big(\big(W^{(L-i+1)}\big)^T \delta^{(L-i+1)}\Big) \odot \sigma'(\vec{z}^{(L-i)}) \\ \frac{\partial E}{\partial \vec{b}^{(l)}} & = \delta^{(l)} && \text{Same as }\frac{\partial E}{\partial \vec{b}^{(L-i)}} = \delta^{(L-i)} \\ \end{align}

5.2 Backpropagation

The only thing keeping us from running the implementation of the stochastic gradient descent was the lack of a way to calculate the gradient of the error function \(E\). That is no longer true! We have a plan of attack. We do a forward pass using some \(\vec{x} \in \mathcal{B}_i\) and then we use equations (1), (2) and (3) to calculate the layer \(L\) partial derivative in the gradient. From there, we use the \(\delta^{(L)}\) to recursively keep calculating \(\delta^{(l)}\) to be used in equations (4), (5) and (6) in order to calculate the rest of the partial derivatives backwards. This backwards pass is why we call this algorithm Backpropagation. Let's implement backpropagation for \(\vec{x} \in \mathcal{B}_i\).

def backpropagation(self, x, y):
  outputs, activations = self.forward_pass(x)
  gradient = self.backward_pass(outputs, activations, y)
  return gradient, self.error(y, activations[-1])

def backward_pass(self, outputs, activations, y):
  gradient_W = np.empty(shape=self.L - 1, dtype=np.object)
  gradient_b = np.empty(shape=self.L - 1, dtype=np.object)
  z, a = outputs[-1], activations[-1]  # z^L, a^L
  delta = -2 * (y - a) * self.sigmoid(z, derivative=True)  # delta^L eq 2
  delta = delta.reshape((1, len(delta)))
  for l in range(self.L - 1, 0, -1):
    a_prev = activations[l - 1]
    a_prev = a_prev.reshape((len(a_prev), 1)).T
    pC_w =, a_prev)  # eq 1 or 4
    pC_b = delta.flatten()  # eq 3 or 6
    gradient_W[l - 1], gradient_b[l - 1] = pC_w, pC_b
    if l == 1:
    z, a = outputs[l - 2], activations[l - 1]
    delta =, self.W[l - 1]) * self.sigmoid(z, derivative=True)  # eq 5
  gradient = np.empty(shape=2, dtype=np.object)
  gradient[0], gradient[1] = gradient_W, gradient_b
  return gradient

6 Wrapping Up the Python Class

We are done with the mathematics of the Multi-layer Perceptron. However, we still need to add a few functions to our Python class. In order to train the network, we run multiple rounds of the stochastic gradient decent using differently permuted X, y each time to produce the batches. The Network class figures out the number of self.SGD(...) calls by the variable epochs passed to the __init__ definition. This function is usually named something along the lines of train(...), but we'll make our network tango instead.

def tango(self, print_steps=False):  # train
  for epoch in range(self.epochs):
    error = self.SGD(print_steps)
    print('* Epoch: {:>4}, Error: {:>3.5}'.format(epoch, error))

Let's also add a function that will forward feed some input \(\vec{x}\) and return the index of the greatest activation in the output layer \(\vec{a}^{(L)}\).

def predict(self, input_layer):
  output_layer = self.forward_pass(input_layer, False)
  return output_layer.argmax()

Finally, say we trained the network for a while and have a pretty good pair of \((\mathcal{W}, \mathbf{b})\). We should have a way to save and load such weights and biases.

def save_weights_biases(self, path='./weights_biases.npy'):
  return, self.Wb)

def load_weights_biases(self, path='./weights_biases.npy'):
  self.Wb = np.load(path, allow_pickle=True)
  self.W, self.b = self.Wb
  return True

7 Testing Code and Assumptions

You can find the complete Network class we wrote in the file: at Github or here.

7.1 Classifying Binary Numbers per Parity

Let's first test by building and training the network in figure 4. We produce the training and testing examples (skipping validation). The training examples are what we pass as X, y. The testing examples are the ones we use to test the accuracy of a trained network. We don't train the network on this set for the sake of an unbiased measure of performance.

\[ \text{Accuracy} = ACC = \frac{\text{Number of Correctly Classified Examples in Test Set}}{\text{Number of Incorrectly Classified Examples in Test Set}} \]

from network import Network
import numpy as np

N, n = 700, 10
X = np.array([list(map(int, '{0:010b}'.format(e))) for e in range(2**n)])
y = np.array([int(e % 2 == 0) for e in range(2**n)])
X, y, X_test, y_test = X[:N], y[:N], X[N:], y[N:]

Since we have the data, we build and train the figure 4 network.

net = Network(X, y, structure=[10, 3, 2], epochs=1000, bt_size=8, eta=0.3)

# train

# test
predictions = np.array([net.predict(x.flatten()) for x in X_test])
acc = np.sum(predictions == y_test) / len(y_test)
print('Network Accuracy: {}'.format(acc))

We get the output:

* Epoch:    0, Error: 0.48236
* Epoch:    1, Error: 0.3234
* Epoch:    2, Error: 0.16235
* Epoch:    3, Error: 0.075001
* Epoch:    4, Error: 0.040232
* Epoch:    5, Error: 0.025256
* Epoch:    6, Error: 0.017793
* Epoch:    7, Error: 0.013523
* Epoch:    8, Error: 0.010801
* Epoch:    9, Error: 0.0089463
* Epoch:  998, Error: 0.00004031
* Epoch:  999, Error: 0.00004026
Network Accuracy: 1.0

Usually if you see an accuracy of 1, that should make you suspicious of having over-fit. However, in this case, the parity classification of a binary number is pretty easy, i. e., we can just look at the least significant bit. In-fact, let's plot our two weight matrices to see if that is what the network learned.

Sorry, your browser does not support SVG.
Figure 5: Weights \(\mathcal{W}\) of the network in figure 4

Looking at figure 5 and sure enough, training assigned the heaviest weight values to the least significant bit. This pattern shows that sometimes by looking at the weights, even humans can learn about new patterns or simply perform dimensionality reduction of the input \(\vec{x}\) (feature selection).

7.2 Classifying Handwritten Digits

The benchmark problem that all neural network tutorials solve is the classification of the handwritten digits. At the time of writing this article, the data is available for free at If you are anything like me, you'll open that link and say to yourself, "Okay, wow, now what?". To deal with that feeling, I wrote a little downloader class in Python that will download, read and reshape the data from that site into the format we want.

from mnsit_handwritten_digits import MNSIT
from network import Network
import numpy as np

# MNSIT Data
mn = MNSIT(path='../mnsit_data/')
mn.plot_image(999, source='training')
train_X, test_X, train_y, test_y = mn.get_data()

You'll need Matplotlib for plotting. The line mn.plot_image(999, source='training') will plot the \(999^{th}\) image (figure 6) in the training set.

Sorry, your browser does not support SVG.
Figure 6: \(999^{th}\) image in the MNSIT training set.

MNSIT images are normalised to dimensions \(28 \times 28 = 784\). We'll create a network with 3 layers:

  1. \(28^2 = 784\) neurons in the input layer to hold the activations caused by the pixel values in the flattened image.
  2. 32 neurons in the hidden layer, for no particular reason.
  3. 10 neurons in the output layer to hold the probabilities of each class 0 through 9.
# Network
net = Network(train_X,
              structure=[784, 32, 10],
# net.load_weights_biases(path='weights_biases.npy')

# train
# net.save_weights_biases(path='weights_biases.npy')

# test
predictions = np.array([net.predict(x.flatten()) for x in test_X])
acc = np.sum(predictions == test_y) / len(test_y)
print('Network Accuracy: {}'.format(acc))

Output after running,

* Epoch:    0, Error: 1.1664
* Epoch:    1, Error: 0.837
* Epoch:    2, Error: 0.74615
* Epoch:    3, Error: 0.66387
* Epoch:    4, Error: 0.61246
* Epoch:    5, Error: 0.57026
* Epoch:    6, Error: 0.54528
* Epoch:    7, Error: 0.52077
* Epoch:    8, Error: 0.49434
* Epoch:    9, Error: 0.47249
* Epoch:   10, Error: 0.44921
* Epoch:   49, Error: 0.21924
Network Accuracy: 0.9259

The training time here will be longer depending on your machine. With the network above, I was able to achieve an accuracy of \(93\%\). You may load my converged set of \((\mathcal{W}, \mathbf{b})\) by uncommenting the line after the network instantiation. Make sure you give the correct path to the file weights_biases.npy (and you should probably comment net.tango()).

8 Other Resources

Outside of the university classes, here are the resources I used to revise some of the relevant material.

  1. 3Blue1Brown's YouTube series on the Multi-layer Perceptrons.
  2. Michael Nielsen's book, Neural Networks and Deep Learning.

I'll leave with somewhat of a dichotomous sentiment:

Learn from me, if not by my precepts, at least by my example, how dangerous is the acquirement of knowledge, and how much happier that man is who believes his native town to be his world, than he who aspires to become greater than his nature will allow [2]. – Mary Shelley, Frankenstein


[1] Stuart J Russell and Peter Norvig. Artificial Intelligence-A Modern Approach, Third Edition., page 122. Pearson Education New Jersey, Upper Saddle River, New Jersey 07458, third edition, 2010. [ bib ]
[2] Mary W. Shelley. Frankenstein or, The Modern Prometheus, chapter chapter 4. PROJECT GUTENBERG, 2013. [ bib ]
[3] user1180576 ( Informal derivation (and interpretation) of substitution rule from chain rule? Mathematics Stack Exchange. URL: (version: 2019-09-13). [ bib | arXiv | http ]
[4] M Zins. What have we learnt from our mistakes? Diagnostic and interventional imaging, 7(94):675, 2013. [ bib ]



To report any mistakes or contact me, send an email with the appropriate subject to simurgh9(at)


We assume the reader to be familiar with the prerequisite notation and code from a proper background or having read the first part.


In reality we shoot for \(MSE\) to be small enough if not approximately zero.


By \(\circleddash\) we mean element wise subtraction, i. e., for some \((\mathcal{W}, \mathbf{b})\) and \((\mathcal{W}, \mathbf{b})'\),

\[ (\mathcal{W}, \mathbf{b}) \circleddash (\mathcal{W}, \mathbf{b})' = \{\{W^{(l)} \in \mathcal{W}, {W^{(l)}}' \in \mathcal{W}' : W^{(l)} - {W^{(l)}}'\}, \{\vec{b}^{(l)} \in \mathbf{b}, {\vec{b}^{(l)}}' \in \mathbf{b}' : \vec{b}^{(l)} - {\vec{b}^{(l)}}'\}\} \]


As one of my math teachers once said, "before looking for something, make sure it exists". We need a few assumptions about our error function before we can expect to calculate its gradient and use gradient descent. I omitted those here for the sake of the focus on the actual gradient itself than a discussion of why we can calculate it.


I have spent an embarrassing amount of time trying to truly understand the Leibniz's notation. You may also struggle with this switch from a ratio of infinitesimals to limits. Here is a helpful Mathematics Stack Exchange question that finally cleared things up for me.


By \(\odot\) we mean element wise multiplication. This is also known as the Hadamard product. Here is an example,

\[ \left[\begin{array}{c} 1 \\ 3 \end{array}\right] \odot \left[\begin{array}{c} 2 \\ 4\end{array} \right] = \left[ \begin{array}{c} 1 \times 2 \\ 3 \times 4 \end{array} \right] = \left[ \begin{array}{c} 2 \\ 12 \end{array} \right] \]

Author: Tashfeen, Ahmad

Created: 2020-08-16 Sun 17:41