Backpropagation is a technique used to teach a neural network that has at least one hidden layer.
- part 1 - simplest network
- part 2 - backpropagation (you are here)
- part 3 - backpropagation-continued
A perceptron is a processing unit that takes an input
Within a neural network, its input is the sum of the previous layer node outputs times their corresponding weight, plus the previous layer bias:
$$ x_j = \sum^{I}_{i = 1} x_i w_{ij} + b_i $$
If we treat the bias as an additional node in a layer with a constant value of
$$ x_j = \sum^{I + 1}_{i = 1} x_i w_{ij} $$
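As a rough illustration of the sum above, the sketch below computes a node's input both with an explicit bias term and with the bias folded in as one extra node. The function names and the constant `bias_value=1.0` are assumptions made for this example only.

```python
def node_input(prev_outputs, weights, bias):
    """Weighted sum of the previous layer's outputs plus an explicit bias."""
    return sum(y * w for y, w in zip(prev_outputs, weights)) + bias


def node_input_with_bias_node(prev_outputs, weights, bias_weight, bias_value=1.0):
    """Same sum, with the bias treated as one extra node of constant value.

    bias_value=1.0 is an assumption of this sketch.
    """
    extended_outputs = list(prev_outputs) + [bias_value]
    extended_weights = list(weights) + [bias_weight]
    return sum(y * w for y, w in zip(extended_outputs, extended_weights))
```

With a constant of 1, the weight attached to the extra node plays exactly the role of the bias, so both functions return the same value.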
Why do we need an activation function? Without one, the output of every node is a linear function of its inputs, so the whole network computes a linear function of its inputs. Since the composition of linear functions is also a linear function, adding layers does not help: the network can only solve problems that can be solved with linear regression. A non-linear activation function is what lets it compute more interesting functions.
If
- Sigmoid $\quad y = \frac{1}{1 + e^{-x}}$
- ReLU or rectified linear unit $\quad y = \max(0, x)$
- tanh $\quad y = \tanh(x)$
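The three functions listed above can be sketched in a few lines of plain Python. The function names are chosen for this example; only the standard `math` module is used.

```python
import math


def sigmoid(x):
    """y = 1 / (1 + e^-x), squashes any input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """y = max(0, x), passes positive inputs and zeroes out negative ones."""
    return max(0.0, x)


def tanh(x):
    """y = tanh(x), squashes any input into (-1, 1)."""
    return math.tanh(x)
```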
The backpropagation algorithm is used to train artificial neural networks, more specifically those with more than two layers.
It uses a forward pass to compute the outputs of the network, calculates the error, and then goes backwards towards the input layer to update each weight based on the error gradient.
- $x_i, x_j, x_k$ are the inputs to a node for layers $I, J, K$ respectively.
- $y_i, y_j, y_k$ are the outputs from a node for layers $I, J, K$ respectively.
- $y\prime_k$ is the expected output of a node of the $K$ output layer.
- $w_{ij}, w_{jk}$ are the weights of node connections from layer $I$ to $J$ and from layer $J$ to $K$ respectively.
- $t$ is the current association out of $T$ associations.
We will assign the following activation functions to the nodes of each layer for all following examples:
- input layer -> identity function
- hidden layer -> sigmoid function
- output layer -> identity function
During the forward pass, we feed the inputs to the input layer and get the results in the output layer.
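The input to each node in the hidden layer is the sum of the input layer outputs times their corresponding weights:
$$ x_{j} = \sum^{I}_{i = 1} w_{ij}y_{i} $$
Since the activation function of each hidden layer node is the sigmoid, their output will be:
$$ y_{j} = \frac{1}{1 + e^{-x_{j}}} $$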
In the same manner, the inputs to the output layer nodes are
$$ x_{k} = \sum^{J}_{j = 1} w_{jk}y_{j} $$
and their output is the same since we assigned them the identity activation function.
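A minimal sketch of this forward pass, assuming the weights are stored as nested lists where `w_ij[i][j]` connects input node `i` to hidden node `j` and `w_jk[j][k]` connects hidden node `j` to output node `k`. That layout and the function names are choices made for this example, and bias nodes are left out for brevity.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_ij, w_jk):
    """Propagate one association through an I x J x K network."""
    # Input layer: identity activation, so outputs equal inputs.
    y_i = list(inputs)
    # Hidden layer: weighted sum of the input layer outputs, then sigmoid.
    x_j = [sum(y_i[i] * w_ij[i][j] for i in range(len(y_i)))
           for j in range(len(w_ij[0]))]
    y_j = [sigmoid(x) for x in x_j]
    # Output layer: weighted sum of the hidden layer outputs, identity activation.
    x_k = [sum(y_j[j] * w_jk[j][k] for j in range(len(y_j)))
           for k in range(len(w_jk[0]))]
    y_k = list(x_k)
    return y_i, y_j, y_k
```

Returning the outputs of every layer, not just the last one, is convenient because the backward pass needs them.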
Once the inputs have been propagated through the network, we can calculate the error. If we have multiple associations, we simply sum the error of each association.
$$ E = \sum^{T}_{t = 1} E_t = \frac{1}{2T} \sum^{T}_{t = 1} (y_{kt} - y\prime_{kt})^2 $$
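The error formula can be sketched as a small helper. It assumes a single output node, so each association contributes one actual output and one expected output; the function name is hypothetical.

```python
def total_error(outputs, expected):
    """E = 1 / (2T) * sum over t of (y_kt - y'_kt)^2 for one output node."""
    T = len(outputs)
    return sum((y - y_exp) ** 2 for y, y_exp in zip(outputs, expected)) / (2 * T)
```

For example, `total_error([1.0, 0.0], [0.0, 0.0])` gives `1 / (2 * 2) = 0.25`.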
Now that we have the error, we can use it to update each weight of the network by going backwards layer by layer.
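We know from part 1 of this series that the change of a weight is the negative of that weight's component in the error gradient times the learning rate. For a weight between the last hidden layer and the output layer, we then have
$$ \Delta w_{jk} = -\epsilon \frac{\partial E}{\partial w_{jk}} $$
We can find the error gradient by using the chain rule
$$ \frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial w_{jk}} = \delta_k y_j \quad \text{where} \quad \delta_k = \frac{\partial E}{\partial x_k} = y_k - y\prime_k $$
Therefore the change in weight is
$$ \Delta w_{jk} = -\epsilon \delta_k y_j $$
For multiple associations, the change in weight is the sum over all associations (the constant factor $\frac{1}{T}$ of the error can be absorbed into the learning rate $\epsilon$)
$$ \Delta w_{jk} = -\epsilon \sum^{T}_{t = 1} \delta_{kt} y_{jt} $$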
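Similarly, for a weight between hidden layers, in our case between the input layer and our first hidden layer, we have
$$ \Delta w_{ij} = -\epsilon \frac{\partial E}{\partial w_{ij}} = -\epsilon \frac{\partial E}{\partial x_j} \frac{\partial x_j}{\partial w_{ij}} = -\epsilon \delta_j y_i $$
Here the calculations are slightly more complex. Let's analyze the delta term
$$ \delta_j = \frac{\partial E}{\partial x_j} = \frac{\partial y_j}{\partial x_j} \sum^{K}_{k = 1} \frac{\partial E}{\partial x_k} \frac{\partial x_k}{\partial y_j} = \frac{\partial y_j}{\partial x_j} \sum^{K}_{k = 1} \delta_k w_{jk} $$
Remember that our activation function for the hidden layer is the sigmoid, whose derivative is $\frac{\partial y_j}{\partial x_j} = y_j(1 - y_j)$, so
$$ \delta_j = y_j(1 - y_j) \sum^{K}_{k = 1} \delta_k w_{jk} $$
Again, the change in weight for all associations is the sum over all associations
$$ \Delta w_{ij} = -\epsilon \sum^{T}_{t = 1} \delta_{jt} y_{it} $$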
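A minimal sketch of the backward pass for a single association, assuming $\delta_k = y_k - y\prime_k$ for the identity output layer and $\delta_j = y_j(1 - y_j)\sum_k \delta_k w_{jk}$ for the sigmoid hidden layer. The function name and the nested-list weight layout (`w_jk[j][k]`) are choices made for this example.

```python
def gradients(y_i, y_j, y_k, expected, w_jk):
    """Return the error gradient for every weight, for one association."""
    n_i, n_j, n_k = len(y_i), len(y_j), len(y_k)
    # Output layer delta: derivative of the squared error through the identity.
    delta_k = [y - y_exp for y, y_exp in zip(y_k, expected)]
    # Gradient for each hidden -> output weight.
    grad_jk = [[delta_k[k] * y_j[j] for k in range(n_k)] for j in range(n_j)]
    # Hidden layer delta: sigmoid derivative times the back-propagated deltas.
    delta_j = [y_j[j] * (1.0 - y_j[j]) *
               sum(delta_k[k] * w_jk[j][k] for k in range(n_k))
               for j in range(n_j)]
    # Gradient for each input -> hidden weight.
    grad_ij = [[delta_j[j] * y_i[i] for j in range(n_j)] for i in range(n_i)]
    return grad_ij, grad_jk
```

Multiplying each gradient by the learning rate and subtracting it from the matching weight gives the update for that association.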
First, initialize network weights to a small random value.
Repeat the steps below until the error is about 0:
- for each association, propagate the network forward and get the outputs
  - calculate the $\delta$ term for each output layer node ($\delta_k = y_k - y\prime_k$)
  - accumulate the gradient for each output weight ($\nabla_{w_{jk}}E = \delta_k y_j$)
  - calculate the $\delta$ term for each hidden layer node ($\delta_j = y_j(1 - y_{j})\sum^K_{k = 1}\delta_k w_{jk}$)
  - accumulate the gradient for each hidden layer weight ($\nabla_{w_{ij}}E = \delta_j y_i$)
- update all weights and reset accumulated gradients ($w = w - \epsilon \nabla E$)
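The steps above can be sketched as one training function. This is an illustrative sketch, not a reference implementation: the function name, the weight range `[-0.5, 0.5]`, the bias nodes of constant value 1.0, the default learning rate, and the fixed number of epochs (instead of stopping once the error is about 0) are all assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(associations, n_hidden=2, epsilon=0.1, epochs=1000, seed=0):
    """Batch backpropagation (sigmoid hidden layer, identity output layer).

    associations: list of (inputs, expected_outputs) pairs.
    Returns (w_ij, w_jk, errors) with one error value per epoch.
    """
    rng = random.Random(seed)
    n_in = len(associations[0][0]) + 1   # +1 for the assumed bias node
    n_hid = n_hidden + 1                 # +1 for the assumed bias node
    n_out = len(associations[0][1])
    # Initialize all weights to small random values.
    w_ij = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    w_jk = [[rng.uniform(-0.5, 0.5) for _ in range(n_out)] for _ in range(n_hid)]
    errors = []
    for _ in range(epochs):
        grad_ij = [[0.0] * n_hidden for _ in range(n_in)]
        grad_jk = [[0.0] * n_out for _ in range(n_hid)]
        squared_error = 0.0
        for inputs, expected in associations:
            # Forward pass.
            y_i = list(inputs) + [1.0]
            y_j = [sigmoid(sum(y_i[i] * w_ij[i][j] for i in range(n_in)))
                   for j in range(n_hidden)] + [1.0]
            y_k = [sum(y_j[j] * w_jk[j][k] for j in range(n_hid))
                   for k in range(n_out)]
            squared_error += sum((y - e) ** 2 for y, e in zip(y_k, expected))
            # Delta terms (the hidden bias node has no incoming weights).
            delta_k = [y - e for y, e in zip(y_k, expected)]
            delta_j = [y_j[j] * (1.0 - y_j[j]) *
                       sum(delta_k[k] * w_jk[j][k] for k in range(n_out))
                       for j in range(n_hidden)]
            # Accumulate the gradients over all associations.
            for j in range(n_hid):
                for k in range(n_out):
                    grad_jk[j][k] += delta_k[k] * y_j[j]
            for i in range(n_in):
                for j in range(n_hidden):
                    grad_ij[i][j] += delta_j[j] * y_i[i]
        errors.append(squared_error / (2 * len(associations)))
        # Update all weights: w = w - epsilon * gradient.
        for j in range(n_hid):
            for k in range(n_out):
                w_jk[j][k] -= epsilon * grad_jk[j][k]
        for i in range(n_in):
            for j in range(n_hidden):
                w_ij[i][j] -= epsilon * grad_ij[i][j]
    return w_ij, w_jk, errors
```

The gradients are rebuilt from zero at the start of every epoch, which is the "reset accumulated gradients" step. The recorded error for an epoch is measured before that epoch's update, so `errors[0]` reflects the initial random weights.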
In this example, we'll use actual numbers to follow each step of the network. We'll feed our 2x2x1 network with inputs
We start by setting all of the nodes of the input layer with the input values;
Since the input layer nodes have the identity activation function, then
We then propagate the network forward by setting the
We then activate the
And we propagate those results to the final layer
Since the activation function of our output nodes is the identity, then
On the way back, we first calculate the
And using the
We then do the same thing for each hidden layer (just the one in our case):
And calculate the gradient for each weight between
The last step is to update all of our weights using the calculated gradients. Note that if we had more than one association, then we would first accumulate the gradients for each association and then update the weights.
As you can see, the weights changed by a very small amount, but if we were to run a forward pass again using the updated weights, we should normally get a smaller error than before. Let's check...
We had
We had
We successfully reduced the error! Although these numbers are very small, they are much more representative of a real scenario. Running the algorithm many times over would normally reduce the error down to almost 0 and we'd have completed training the network.
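The same check can be reproduced in code. In the sketch below, the inputs, expected output, starting weights and learning rate are arbitrary illustrative values chosen for this example, not the ones used in the walkthrough, and bias nodes are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(inputs, w_ij, w_jk):
    """Forward pass of a 2x2x1 network (sigmoid hidden, identity output)."""
    y_j = [sigmoid(sum(inputs[i] * w_ij[i][j] for i in range(2))) for j in range(2)]
    y_k = sum(y_j[j] * w_jk[j] for j in range(2))
    return y_j, y_k


def one_step(inputs, expected, w_ij, w_jk, epsilon):
    """Run one forward/backward/update cycle and return the error before and after."""
    y_j, y_k = forward(inputs, w_ij, w_jk)
    error_before = 0.5 * (y_k - expected) ** 2
    # Backward pass: output delta first, then hidden deltas.
    delta_k = y_k - expected
    delta_j = [y_j[j] * (1.0 - y_j[j]) * delta_k * w_jk[j] for j in range(2)]
    # Update every weight with w = w - epsilon * gradient.
    new_w_jk = [w_jk[j] - epsilon * delta_k * y_j[j] for j in range(2)]
    new_w_ij = [[w_ij[i][j] - epsilon * delta_j[j] * inputs[i] for j in range(2)]
                for i in range(2)]
    # Second forward pass with the updated weights.
    _, y_k_after = forward(inputs, new_w_ij, new_w_jk)
    error_after = 0.5 * (y_k_after - expected) ** 2
    return error_before, error_after


# Illustrative values only (assumptions for this sketch).
before, after = one_step([1.0, 0.0], 1.0, [[0.1, 0.2], [0.3, 0.4]], [0.5, 0.6], 0.1)
print(before, after)
```

With these values the second error comes out smaller than the first, which is the behaviour the walkthrough describes.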
The example teaches a 2x2x1 network the XOR operator.
Where
Note that the XOR operation could not be solved with the linear network used in part 1 because the dataset is not linearly separable: you cannot draw a straight line between the four XOR inputs that divides them into the correct two categories. If we replaced the hidden layer node activation functions from sigmoid to identity, this network wouldn't be able to solve the XOR problem either.
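To see the limitation concretely, the sketch below fits a purely linear model to the four XOR associations with batch gradient descent. The model form `y = w1*x1 + w2*x2 + b`, the zero starting weights, the learning rate and the epoch count are assumptions made for this example.

```python
def train_linear_xor(epsilon=0.1, epochs=2000):
    """Fit y = w1*x1 + w2*x2 + b to XOR and return the four predictions."""
    data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0),
            ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
    w1 = w2 = b = 0.0
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), expected in data:
            # Delta of a linear (identity) output node.
            delta = (w1 * x1 + w2 * x2 + b) - expected
            g1 += delta * x1
            g2 += delta * x2
            gb += delta
        # Update the weights with the accumulated gradients.
        w1 -= epsilon * g1
        w2 -= epsilon * g2
        b -= epsilon * gb
    return [w1 * x1 + w2 * x2 + b for (x1, x2), _ in data]
```

The best a linear model can do here is to predict roughly 0.5 for every input, which does not separate the two categories.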
Feel free to try it out yourself and experiment with different activation functions, learning rates and network topologies.
- Artificial intelligence engines by James V Stone (2019)
- Complete guide on deep learning: http://neuralnetworksanddeeplearning.com/chap2.html
- Flow of backpropagation visualized: https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
- Activation functions: https://medium.com/the-theory-of-everything/understanding-activation-functions-in-neural-networks-9491262884e0