Neural Networks: The Forward Pass

Tashfeen, Ahmad

Figure 1: Matryoshka dolls

"Believe nothing you hear, and only one-half that you see." [4] – Edgar Allan Poe

Many online resources are found with appendages: "in ten minutes", "made easy" and "from scratch". In hopes of getting through to the reader, most of such resources seem to either oversimplify, hide the fair complexity or only talk about the mathematics/code of the Neural Networks. Or, instead of explaining a Neural Network, they discuss how to use library code to quickly put together one. This alludes the reader into the miss-conception that the mathematics and code of a Neural Network should be of two different interests.

In this two-part article1, we shall look at the mathematics and the code of a simple Neural Network, a. k. a., a Multi-layer Perceptron. In this first part we start with the math of a Neural Network's input-output dynamics, explain how the equations come to be and then write a Python class which implements them. In the second part we will follow-up with how exactly do we find the correct parameters to use in the input-output dynamics we learned in the first part.

1 The Function

Let's start with a fact that is somewhat calming. A neural network is just a mathematical function which we will denote as \(f\). Even though a function, in our code, we shall implement a Python class2,

import numpy as np

class Network:
  def __init__(self, X, y, structure, epochs=20, bt_size=32, eta=0.3):

For now, you can ignore the input variables of the __init__ definition. We will talk more about them as they become relevant. Before we start thinking about this function \(f\), we need to think about how to model our problems mathematically.

2 A Simple Problem

Say the problem is to figure out a way of calculating the area of a square, given its side-length. This problem can be modeled with two numbers: the side-length \(x \in \mathbb{R}\) and an area \(f(x) = y \in \mathbb{R}\). Now we can write the domain and codomain like this \(f:\mathbb{R}\rightarrow \mathbb{R}\).

We have modelled our problem and are aware of what the input and output of \(f\) mean. Say I show you a few example side-lengths and area pairs \((x, y)\), e. g.,

\[ \{(0,0)(1,1)(2,4)(3,9)(4,16)(5,25), ...\} \]

Can you guess the function? Indeed! it's the polynomial \(f(x) = y = x^2\) and you just did a little bit of machine learning in your head. You looked at correct example inputs and outputs and generalised the idea into a general enough function.

3 General Problem Modellings

In the above problem, we could model the input and output of our problem with just one number. What if instead of a singular value, our problem is best modelled when the inputs and outputs are lists of ordered values: \((\vec{x}, \vec{y})\) vectors? Then the domain and codomain of \(f\) can be written as \(f:\mathbb{R}^n\rightarrow \mathbb{R}^m\).

A subset with \(N\) elements of this domain and codomain or input and output are passed to the Network definition above as the variables X, y. This is our sampled data of known examples. X \(\subset \mathbb{R}^n\) of dimensions/shape \((N, n)\) will be a 2D Numpy array of examples (inputs modelled as row arrays/vectors), where y \(\subset \mathbb{R}^m\) will a Numpy array containing the correct outputs which are also called labels.

3.1 Classification with One-hot Encoding

Sometimes we model problems as classification of examples into discrete classes. This classification is based on certain features of any given example input \(\vec{x} \in\) X. Some \(x_i \in \vec{x}\) is considered a single feature. Therefore, the input \(\vec{x}\) can be thought of as a feature vector. Say our goal is to classify \(\vec{x}\) into three discrete classes. For some \(\vec{x} \in \mathbb{R}^n\), we could have a label \(y \in \{0, 1, 2\}\). But what if we wanted to model the label as the probabilities of \((\vec{x}, y)\) for all \(y \in \{0, 1, 2\}\)? We could rewrite a single label as a vector of probabilities \(\vec{y}\) where the probability that \(\vec{x}\) belongs to class \(i\) is \(y_i \in \vec{y}\). Notice how if we are absolutely certain that for some \(\vec{x}, (\vec{x}, 0)\) then \(\vec{y} = [1, 0, 0]\). Similarly for labels \(1, 2\) the output vectors \(\vec{y}\) will be \([0, 1, 0], [0, 0, 1]\). Such an encoding of labels is also known as one-hot encoding. Let's add this to our Python class3.

def __init__(self, X, y, structure, epochs=20, bt_size=32, eta=0.3):
                      # labels to one-hot arrays for decimal labels
  self.X, self.y = X, np.eye(len(set(y)))[y.reshape(-1)]

4 An Example Network

Now that we understand the input and output of \(f\), the question is that for some \(\vec{x}\), how should \(f\) map it to the desired \(\vec{y}\)? This process uses a layered architecture with a set of weights \(\mathcal{W}\) and biases \(\mathbf{b}\) and we call it forward-feeding or the forward-pass. We'll learn about forward feeding, weights and biases with a small running example before we generalise. Don't let figure 2 intimidate you. We'll break it down.

Sorry, your browser does not support SVG.
Figure 2: Multi-layer Perceptron with 3 layers

5 Layered Architecture

For now ignore all the edges and labels and just look at the green, blue and red vertices. If without any further explanation I ask you to tell me how many layers are in this network, you might say three. Then, if I ask you to give me the number of neurons in each layer, you might say \([5, 4, 2]\). You'd be right in both cases!

The first layer in this three-layered Multi-layer Perceptron is the first green column on the left hand side. This layer corresponds to the length of our example input \(\vec{x} \in \mathbb{R}^5\). After the input layer, we have hidden layers. of which in the figure 2's network, there is only one: the blue one with four neurons. Note that even though we have only one hidden layer, it is entirely possible for some other network to have more! After the hidden layers, we'll see the output layer. This layer corresponds to the output \(\vec{y}\). Here we read what the output of our network is after a success forward pass. Thus, the network in figure 2 can be written in the function notation like this, \(f:\mathbb{R}^5\rightarrow \mathbb{R}^2\).

5.1 Layer Indices Notation

We use the variable \(l \in \mathbb{N}\) for the index of any particular layer where the \(l\) corresponding to the output layer is capitalised as \(L\). In short, the variable we use for the index of all but last layer is \(l\) and the index of the last layer is \(L\) (e. g., for the network in figure 2, we know that layer \(l=1\) has five neurons, layer \(l=2\) has four neurons and layer \(l=L=3\) has two neurons). I'll denote the number of neurons in a layer \(l\) as \(|l|\); consequently the number of neurons in the output layer is \(|L|\)4.

We pass this structure of layers about how many neurons we want per layer as list to the Python class with the variable structure. For the network in figure 2, structure = [5, 4, 2].

def __init__(self, X, y, structure, epochs=20, bt_size=32, eta=0.3):
  # labels to one-hot array
  self.X, self.y = X, np.eye(len(set(y)))[y.reshape(-1)]
  self.structure = structure
  self.epochs, self.bt_size = epochs, bt_size
  self.eta = 0.3
  self.L = len(structure)

You can ignore the variables epochs, bt_size, eta. We assigned the structure array and self.L. Remember that due the zero-based-indexing of arrays, the index of the last layer here will be self.L-1.

6 Weights and Biases

Now that we understand the vertices/neurons in the layers of a network. We are ready to see how the network \(f\) takes \(\vec{x}\) and feeds it forward through all the layers \(l < L\), arriving at the output layer \(L\). The heart of it all is in matrix multiplication. If you don't recall the basics of it, this is a good time to brush-up. The set \(\mathcal{W}\) is a set of matrices; similarly, the set \(\mathbf{b}\) is a set of vectors. For a network with \(L\) layers, we have \(L-1\) many matrices in \(\mathcal{W}\) and vectors in \(\mathbf{b}\).

6.1 Activations in Layers

All layers hold an activation vector \(\vec{a} \in \mathbb{R}^n\). We denote the activation vector of layer \(l\) as \(\vec{a}^{(l)}\). Be cautious. The \((l)\) here is not a power or exponent but the index of the layer whose activation vector is \(\vec{a}^{(l)}\). An activation being held in a certain neuron of a certain layer is then denoted as \(a^{(l)}_i \in \vec{a}^{(l)}\). Notice the edges (arrows) going from layer to layer in the figure 2? An edge that connects \(i^{th}\) neuron in layer \(l-1\) to \(j^{th}\) neuron in layer \(l\) is representing the element \(w_{ji} \in W^{(l)} \in \mathcal{W}\). By now, you should be feeling more familiar with the anatomy of the network shown in figure 2.

6.2 Propagating Activations Forward

How do we get the activations in the first (input) layer? We simply let it equal to our input vector,

\[ \vec{a}^{(1)} = \vec{x} \]

Now that we have \(\vec{a}^{(1)}\), how do we get \(\vec{a}^{(2)}\)? We write \(\vec{a}^{(2)}\) as a function5 of \(\vec{a}^{(1)}\), the weight matrix \(W^{(2)} \in \mathcal{W}\) and the first bias vector \(\vec{b}^{(2)} \in \mathbf{b}\)6.

\[ \vec{a}^{(2)} = \sigma\Big(W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)}\Big) \]

There are some subtle observations that must be made here that will help us write the code. We introduced another function \(\sigma\). Since we refer to \(W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)}\) on its own quite often and there are other notational benefits, let \(\vec{z}^{(2)} = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)}\). Pause here and think about what will be the dimensions of \(\vec{z}^{(l)}\) for \(l=2\)? We are going from layer 1 to layer 2 so the dimensions of \(\vec{z}^{(l)}\) must be \((|l|, 1)\). This is just saying that \(\vec{a}^{(2)}\) is a vector with \(|l|\) neurons. This should remind you that for multiplication to be valid between two matrices, the first's number of columns should be equal to the second's number of rows! This means that any \(W^{(l)} \in \mathcal{W}\) that gets you from layer \(l-1\) to \(l\) has dimensions \((|l|, |l-1|)\) and \(\vec{b}^{(l)} \in \mathbf{b}\) has \(|l|\) many elements. Therefore, when we multiply \(W^{(l)}\) with dimensions \((|l|, |l-1|)\) to \(\vec{a}^{(l-1)}\) with dimensions \((|l-1|, 1)\) and add \(\vec{b}^{(l)}\) with dimensions \((|l|, 1)\), we get \(\vec{z}^{(l)}\) with dimensions \((|l|, 1)\).

\[ \overbrace{(|l|, \underbrace{|l-1|) \times (|l-1|}_{\text{Have to be equal.}}, 1)}^\text{Product Dimensions: $(|l|, 1)$} \]

At this point we let \(\vec{z}^{(l)} = W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)}\) then we have the following equations,

\begin{align} \vec{z}^{(l)} &= W^{(l)}\vec{a}^{(l-1)} + \vec{b}^{(l)} && \text{Outputs to layer $l$} \\ \vec{a}^{(l)} & = \sigma(\vec{z}^{(l)}) && \text{Activations of layer $l$} \\ \end{align}

6.3 Random Weights and Biases

A question that I have sleekly avoided so far is how do we find these so called weights and biases sets \((\mathcal{W}, \mathbf{b})\) that enable the network \(f\) to map \(\vec{x}\) to it's expected \(\vec{y}\). This is where machine learning and sample examples come in, which we passed to our network definition as X, y. For now, we just initialise \((\mathcal{W}, \mathbf{b})\) randomly from a normal distribution. We initialise \((\mathcal{W}, \mathbf{b})\) randomly though with correct dimensions, inferring them from self.structure. Remember how all \(W^{(l)} \in \mathcal{W}\) must have dimensions \((|l|, |l-1|)\) and \(\vec{b}^{(l)} \in \mathbf{b}\) must have \(|l|\) many elements? We just initialise them randomly. Let's finish the definition of __init__.

def __init__(self, X, y, structure, epochs=20, bt_size=32, eta=0.3):
  # labels to one-hot array
  self.X, self.y = X, np.eye(len(set(y)))[y.reshape(-1)]
  self.structure = structure
  self.epochs, self.bt_size = epochs, bt_size
  self.eta = 0.3
  self.L = len(structure)
  self.Wb = self.random_weights_biases()
  self.W, self.b = self.Wb

def random_weights_biases(self, sigma=1, mu=0):
  W = np.empty(self.L - 1, dtype=object)
  b = np.empty(self.L - 1, dtype=object)
  for i in range(self.L - 1):
    c, r = self.structure[i], self.structure[i + 1]
    W[i] = sigma * np.random.randn(r, c) + mu
    b[i] = sigma * np.random.randn(r) + mu

  Wb = np.empty(2, dtype=object)
  Wb[0], Wb[1] = W, b  # all weights and biases
  return Wb

6.4 Sigmoid Logistic Function

The new \(\sigma(x) = \frac{1}{1+e^{-x}}\) is the sigmoid function. Its job is to take any real value and map it to \((0,1)\). In other words, \(\sigma\) scales everything to a number between zero and one. This means that we want our activations to be between zero and one. When we pass a vector to \(\sigma\), we mean,

\[ \sigma(\vec{x}) = \big[\sigma(x_1), \sigma(x_2), \sigma(x_3), ... , \sigma(x_n) \big] \]

Sorry, your browser does not support SVG.
Figure 3: Sigmoid \(\sigma: \mathbb{R} \rightarrow (0,1)\)

Figure 3 shows a graph [3] of the sigmoid. Later we'll also be needing the first derivative of the sigmoid function \(\sigma'(x) = \sigma(x)(1-\sigma(x))\), so let's add the sigmoid function with a derivative flag to our code.

def sigmoid(self, x, derivative=False):
  s = lambda x: 1 / (1 + np.exp(-x))  # noqa: E731
  return s(x) * (1 - s(x)) if derivative else s(x)

7 Writing Activations Explicitly

We know that a simple neural network \(f:\mathbb{R}^n \rightarrow \mathbb{R}^m\) starts with \(\vec{a}^{(1)} = \vec{x} \in \mathbb{R}^n\) then performs \(L-1\) matrix multiplications as shown in equation (1) and (2) and arrives at \(\vec{a}^{(L)} = \vec{y} \in \mathbb{R}^m\). For a network as small as the one shown in figure 2, we can write out the equations for all of its activations. We will also write out the sets \((\mathcal{W}, \mathbf{b})\) with their correctly shaped elements. Since we have three \(L=3\) layers in the Multi-layer Perceptron of figure 2, we will have \(L-1 = 3-1 = 2\) weight matrices and bias vectors.

\begin{align*} (\mathcal{W}, \mathbf{b}) & = (\{W^{(2)}_{4,5},W^{(3)}_{2,4}\}, \{\vec{b}^{(2)}, \vec{b}^{(3)}\}) \\ \vec{a}^{(1)} & = \vec{x} \\ \vec{z}^{(2)} & = W^{(2)}\vec{a}^{(1)} + \vec{b}^{(2)} \\ & = W^{(2)}\vec{x} + \vec{b}^{(2)} && \text{and} \quad \vec{a}^{(2)} = \sigma(\vec{z}^{(2)}) \\ \vec{z}^{(3)} & = W^{(3)}\vec{a}^{(2)} + \vec{b}^{(3)} \\ & = W^{(3)}\sigma(W^{(2)}\vec{x} + \vec{b}^{(2)}) + \vec{b}^{(3)} && \text{and} \quad \vec{a}^{(3)} = \sigma(\vec{z}^{(3)}) \\ \end{align*}

Let's generalise and implement forward feeding of any given \(\vec{x}\). We'll write a subroutine with a flag. When the flag is true, the subroutine will return all the activations \(\vec{a}\) and outputs \(\vec{z}\) caused by forward feeding \(\vec{x}\), when false, it'll just return \(\vec{a}^{(L)}\).

def forward_pass(self, example, keep_track=True):
  input_layer = example.flatten()
  # if we only want the output of the network
  if keep_track is False:
    for W, b in zip(self.W, self.b):
      input_layer = self.sigmoid(, input_layer) + b)
    return input_layer
  outputs = np.empty(shape=self.L - 1, dtype=np.object)  # z^(l)
  activations = np.empty(shape=self.L, dtype=np.object)  # a^(l)
  activations[0] = input_layer
  for W, b, l in zip(self.W, self.b, range(self.L - 1)):
    outputs[l] =, input_layer) + b
    activations[l + 1] = self.sigmoid(outputs[l])
    input_layer = activations[l + 1]
  return outputs, activations

8 Testing Code and Assumptions

Beware of bugs in the above code; I have only proved it correct, not tried it. [5] – Donald E. Knuth

It's time to test the code we have so far and see if it performs as per our assumptions. Let's give our Network a string representation. We'll print out information about each layer in our network. This means: the number of neurons, shape of the associated weight matrix and number of elements in the associated bias vector.

def __repr__(self):
  ret = ''
  for l, W, b in zip(self.structure, self.W, self.b):
    ret += '({}: W{} + b{})\n'.format(l, W.shape, b.shape)
  return ret

def __str__(self):
  return self.__repr__()

Let's build the network in figure 2. We know the number of neurons in each of its layers is 5, 4 and 2. Therefore, we'll let structure = [5, 4, 2]. Even though we won't be training this network just yet, we'll still need to pass some mock examples so we can initialise it. For now we can just put this test code in the same file after the Network class definition so we don't have to figure out imports.

X, y = np.array([[2, 3, 4, 5, 7]]), np.array([1, 1, 0, 1, 1])
net = Network(X, y, structure=[5, 4, 2])

Running python3 path/to/ gets us,

(5: W(4, 5) + b(4,))
(4: W(2, 4) + b(2,))

This prints out the correct information about the weights and biases! (5: W(4, 5) + b(4,)) says that the first layer with five neurons connects with the second layer with 4 neurons using \(W_{4,5}\) and \(\vec{b}\) with 4 elements. Then, (4: W(2, 4) + b(2,)) means that the second layer with four neurons connects with the third layer with 2 neurons using \(W_{2,4}\) and \(\vec{b}\) with 2 elements. So far so good.

We can further investigate if the one-hot encoding was done properly and if the shapes of each activation vector \(\vec{a}^{(l)}\) is correct.

outputs, activations = net.forward_pass(np.array([1, 0, 1, 0, 1]))
for a in activations:

Running python3 path/to/ gets us,

[[0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [0. 1.]]

Can you tell why this output is correct?

9 Matryoshka Dolls

The banner [1] of this article is of Matryoshka Dolls. Why? They resemble a Multi-layer Perceptron in structure [2]. Think of the outermost doll as the input layer. She takes the input vector \(\vec{x}\), performs the forward feeding says her little magic spell to get the next activation and passes it onto the next doll within. When the propagation reaches the doll present at the innermost layer, we uncap all the dolls and ask the innermost doll for the output \(\vec{y}\). What if it's not correct \(\hat{\vec{y}} \neq \vec{y}\)?

Figure 4: Matryoshka Layers


[1] Russian cultural experiences. [Online; accessed June 28, 2020]. [ bib | http ]
[2] Russian nesting dolls. [Online; accessed June 28, 2020]. [ bib | http ]
[3] Martin Thoma. Wikimedia commons, 2014. [Online; accessed June 26, 2020]. [ bib | http ]
[4] Edgar Allan Poe. The System of Dr. Tarr and Prof. Fether. Virginia Tech, 1850. [ bib ]
[5] Walter F Tichy, Paul Lukowicz, Lutz Prechelt, and Ernst A Heinz. Experimental evaluation in computer science: A quantitative study. Journal of Systems and Software, 28(1):9--18, 1995. [ bib ]



To report any mistakes or contact me, send an email with the appropriate subject to simurgh9(at)


I will use only vanilla Python 3, with the famous library Numpy for fast vectorised array operations and linear algebra.


We're making this Network class keeping classification in mind. Though we can do regression with the final product by changing only a few line.


I am not sure if denoting the number of neurons in layer \(l\) as \(|l|\) is the standard notation. But, indexing layers with \(l\) is.


You maybe confused that at the start of we called \(\vec{a}\) a vector and now we are calling it a function? Think of it like this: activations are functions that evaluate to a vector \(\vec{a}^{(l)}\).


Note that \(W^{l} \in \mathcal{W}\) and \(b^{(l)} \in \mathbf{b}\) are written with the index of a layer in the superscript and not an exponent. Just like the activation \(\vec{a}^{(l)}\).

Author: Tashfeen, Ahmad

Created: 2020-07-07 Tue 02:37