This is a from-scratch implementation of neural networks depending only on numpy. It is for educational purposes only.
Linear equation
y = w1*x1 + ... + wn*xn + b
Input is a vector
X = [x1 ... xn]
and the output is a scalar
y
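For example, with numpy (the variable names and values here are just for illustration):

```python
import numpy as np

# one linear equation (a single neuron): y = w1*x1 + ... + wn*xn + b
x = np.array([1.0, 2.0, 3.0])   # input vector, n = 3
w = np.array([0.2, 0.8, -0.5])  # one weight per input
b = 2.0                         # bias

y = np.dot(w, x) + b            # scalar output, here 2.3
print(y)
```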
A list of m linear equations over the same n inputs
y1 = w11*x1 + w12*x2 + ... + w1n*xn + b1
...
ym = wm1*x1 + wm2*x2 + ... + wmn*xn + bm
Therefore the output of m linear equations for one sample, i.e. the same input [x1 ... xn], is
y = [y1 ... ym]
We want to calculate results for multiple samples at once. In other words, we want to feed a list of samples and receive a list of results from a layer.
Input for k samples
X = [[x11 ... x1n] ... [xk1 ... xkn]]
Output for k samples is a list containing, for each sample, the results of the m equations
y = [[y11 ... y1m] ... [yk1 ... ykm]]
A layer stores its weights (in a transposed layout: row i holds the i-th weight of each of the m equations) and its biases
w = [[w11 ... wm1] [w12 ... wm2] ... [w1n ... wmn]]
b = [b1 ... bm]
Shapes of input X, weights w, biases b, and output y are (k, n), (n, m), (m,), (k, m)
The calculation of the outputs is then done with a simple dot product of inputs and weights, plus the biases
y = X*w + b
We get a dimension of (k, n) * (n, m) + (m,) -> (k, m)
Functional representation of a layer
layer(X) = X*w + b
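A minimal numpy sketch of such a layer (the class and attribute names are illustrative, not prescribed by anything above):

```python
import numpy as np

class DenseLayer:
    """A layer of m neurons over n inputs: layer(X) = X*w + b."""
    def __init__(self, n_inputs, n_neurons):
        # weights of shape (n, m), biases of shape (m,)
        self.w = 0.01 * np.random.randn(n_inputs, n_neurons)
        self.b = np.zeros(n_neurons)

    def forward(self, X):
        # X has shape (k, n); the output has shape (k, m)
        return X @ self.w + self.b

X = np.random.randn(4, 3)        # k = 4 samples, n = 3 inputs
layer = DenseLayer(3, 2)         # m = 2 neurons
print(layer.forward(X).shape)    # (4, 2)
```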
An activation function a is applied to the output of each neuron in a layer
a(y)
We will use the same activation function for all neurons in a layer.
Input is the output of a layer of shape (k, m), where k is the number of samples and m the number of neurons.
Output is usually the same shape (k, m), which will be used as input for the next layer or for the loss function.
The ReLU activation function is very close to a linear activation function while remaining nonlinear, thanks to the bend at 0.
a(y) = max(0, y)
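In numpy this can be written, for example, as:

```python
import numpy as np

def relu(y):
    # elementwise max(0, y); keeps the (k, m) shape of the layer output
    return np.maximum(0.0, y)

y = np.array([[-1.5, 0.0, 2.0],
              [ 0.3, -0.1, 4.0]])
print(relu(y))   # [[0. 0. 2.], [0.3 0. 4.]]
```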
Putting neurons, layers, and activation functions together, we can get an output (ŷ) from our network
ŷ = a(layer(X))
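For instance, a forward pass through one layer followed by ReLU (a self-contained sketch with made-up sizes):

```python
import numpy as np

# ŷ = a(layer(X)) for k = 5 samples, n = 3 inputs, m = 2 neurons
X = np.random.randn(5, 3)
w = 0.01 * np.random.randn(3, 2)
b = np.zeros(2)

y_hat = np.maximum(0.0, X @ w + b)   # ReLU applied to the layer output
print(y_hat.shape)                   # (5, 2)
```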
The loss function (aka cost function) is the algorithm that quantifies how wrong a network is.
Loss is the value this metric produces.
The input for a loss function C is the output of the activation function of the last layer (ŷ).
C(ŷ, y) = C(a(layer(X)), y)
The output of a loss function is the loss for each sample.
L = [l1 ... lk]
The shape of the output is (k,)
C(ŷ, y) -> L
The overall loss of a network is the mean of the elements in L plus some regularization penalties L1 and L2 (if applied).
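A sketch of that aggregation (the regularization terms are left out):

```python
import numpy as np

def overall_loss(sample_losses):
    # sample_losses has shape (k,); the network's loss is their mean
    # (regularization penalties, if used, would be added here)
    return np.mean(sample_losses)

L = np.array([0.1, 0.4, 0.25])
print(overall_loss(L))   # 0.25
```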
For Mean Squared Error (MSE) you square the difference between the predicted (ŷ) and true values (y) (as the model can have multiple regression outputs) and average those squared values.
C(ŷ, y) = sum((y - ŷ)**2)/k
where k is the number of samples
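A numpy sketch of this loss, computed directly as the scalar in the formula above:

```python
import numpy as np

def mse_loss(y_hat, y_true):
    # C(ŷ, y) = sum((y - ŷ)**2) / k
    k = len(y_true)
    return np.sum((y_true - y_hat) ** 2) / k

y_true = np.array([[1.0], [2.0], [3.0]])
y_hat  = np.array([[1.5], [2.0], [2.0]])
print(mse_loss(y_hat, y_true))   # (0.25 + 0 + 1) / 3 ≈ 0.4167
```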
The only values we can change to get a better loss (optimally zero) are the weights and biases in the layers.
We need the gradient of the loss function C with respect to the weights to get the direction and magnitude in which to move the weights.
w = w - r*(dC/dw)
where r is a value in (0, 1) called the learning rate.
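To see the update rule in isolation, here is a toy run of gradient descent on a single weight, minimizing C(w) = w**2 (an illustration only, not part of the network code):

```python
# minimize C(w) = w**2, whose gradient is dC/dw = 2*w
w = 5.0
r = 0.1                   # learning rate in (0, 1)
for _ in range(50):
    dC_dw = 2 * w
    w = w - r * dC_dw     # step against the gradient
print(w)                  # very close to 0, the minimum of C
```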
Recall, to calculate loss (l stands for layer)
C(ŷ, y) = C(a(l(X)), y)
According to chain rule, the gradient of C with respect to w can be written as
dC/dw = (dl/dw)*(da/dl)*(dC/da)
The derivative of the Mean Squared Error function with respect to ŷ is (shape (k, m))
dC_da = -2 * (y - ŷ) / k
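As a numpy snippet, with the small example values from the loss sketch above:

```python
import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])
y_hat  = np.array([[1.5], [2.0], [2.0]])
k = len(y_true)

dC_da = -2.0 * (y_true - y_hat) / k   # same shape as ŷ, here (3, 1)
print(dC_da)   # [[ 0.333...] [ 0. ] [-0.666...]]
```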
The derivative of the ReLU activation function is 1 where its input is positive and 0 otherwise; multiplied with the incoming gradient (chain rule) this gives (shape (k, m))
da_dl = dC_da
da_dl[layer(X) <= 0] = 0
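For example, in numpy (z stands for the pre-activation output layer(X); the values are made up):

```python
import numpy as np

# z is the pre-activation output layer(X), dC_da the gradient coming from the loss
z     = np.array([[ 1.0, -2.0],
                  [-0.5,  3.0]])
dC_da = np.array([[ 0.1,  0.2],
                  [ 0.3,  0.4]])

da_dl = dC_da.copy()
da_dl[z <= 0] = 0   # the gradient passes only where the ReLU input was positive
print(da_dl)        # [[0.1 0. ] [0.  0.4]]
```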
The derivative of the layer with respect to the weights is
dl_dw = X.T * da_dl
The shape of dl_dw is (n, k) * (k, m) -> (n, m)
This gives one gradient value per weight, i.e. m values for each of the n inputs.
To update the biases we aggregate over all samples, i.e. we sum the incoming gradient of each neuron over the sample axis.
dl_db = sum(da_dl) over axis=0
The shape of dl_db is (m,).
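Both gradients in numpy (a sketch with k = 2 samples, n = 3 inputs, m = 2 neurons; random values only to show the shapes):

```python
import numpy as np

X     = np.random.randn(2, 3)   # inputs, shape (k, n)
da_dl = np.random.randn(2, 2)   # gradient arriving at the layer, shape (k, m)

dl_dw = X.T @ da_dl             # shape (n, m), one value per weight
dl_db = np.sum(da_dl, axis=0)   # shape (m,), summed over the sample axis
print(dl_dw.shape, dl_db.shape) # (3, 2) (2,)
```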
Finally, we apply these gradients with gradient descent to update the parameters.
w = w - r*dl_dw
b = b - r*dl_db
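Putting all of the pieces above into one training loop (a sketch with illustrative names: a single dense layer with ReLU fitted to random data):

```python
import numpy as np

np.random.seed(0)
k, n, m = 8, 3, 2                 # samples, inputs, neurons
X      = np.random.randn(k, n)
y_true = np.random.randn(k, m)

w = 0.01 * np.random.randn(n, m)  # weights, shape (n, m)
b = np.zeros(m)                   # biases, shape (m,)
r = 0.05                          # learning rate

for step in range(100):
    # forward pass
    z     = X @ w + b                     # layer(X), shape (k, m)
    y_hat = np.maximum(0.0, z)            # ReLU activation
    loss  = np.sum((y_true - y_hat) ** 2) / k

    # backward pass (chain rule)
    dC_da = -2.0 * (y_true - y_hat) / k   # MSE derivative, shape (k, m)
    da_dl = dC_da.copy()
    da_dl[z <= 0] = 0                     # ReLU derivative
    dl_dw = X.T @ da_dl                   # shape (n, m)
    dl_db = np.sum(da_dl, axis=0)         # shape (m,)

    # gradient descent update
    w = w - r * dl_dw
    b = b - r * dl_db

print(loss)   # loss after 100 update steps
```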