A fully connected neural network (multilayer perceptron) implemented from scratch in Python. The only library used is NumPy.
I trained this model on the MNIST dataset with:
- Testing Accuracy: 98.5%
- Training Accuracy: 99.4%
and on the Fashion MNIST dataset with:
- Testing Accuracy: 90.0%
- Training Accuracy: 97.4%
For the MNIST Dataset:
- Input layer: 784 neurons
- Hidden layer 1: 800 neurons, ReLU
- Hidden layer 2: 800 neurons, ReLU
- Output layer: 10 neurons, Softmax
For the Fashion MNIST Dataset:
- Input layer: 784 neurons
- Hidden layer 1: 1024 neurons, ReLU
- Hidden layer 2: 512 neurons, ReLU
- Hidden layer 3: 256 neurons, ReLU
- Output layer: 10 neurons, Softmax
The neural network uses ReLU activation for hidden layers and Softmax for the output layer to produce a probability distribution for classification.
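The activation functions themselves are simple NumPy one-liners. The sketch below is an assumed standalone version of the class's ReLU, sigmoid, and softmax methods (written with the d flag that back_propagation later uses to request the derivative); the actual implementation may differ slightly.

import numpy as np

def ReLU(z, d=False):
    # derivative: 1 where z > 0, else 0
    if d:
        return (z > 0).astype(z.dtype)
    return np.maximum(0, z)

def sigmoid(z, d=False):
    s = 1.0 / (1.0 + np.exp(-z))
    # derivative: s * (1 - s)
    if d:
        return s * (1.0 - s)
    return s

def softmax(z):
    # subtract the row-wise max for numerical stability before exponentiating
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)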
Note
The code is not tied to a predefined architecture; the number and size of the hidden layers are configurable through the constructor.
The weights are initialized with random values drawn from a standard normal distribution (np.random.randn), which breaks the symmetry between neurons, and are scaled by 0.01 so that they start small; this helps avoid exploding activations and gradients in the early stages of training. The biases are initialized to zero (np.zeros).
The network is constructed layer-by-layer, connecting each layer to the previous one through the weights and biases. For each hidden layer, we create a weight matrix and bias vector.
def __init__(self, input_nodes=784, hidden_layers=[800, 800], output_nodes=10, activation='relu', learning_rate=0.001):
    if activation == 'relu':
        self.activation = self.ReLU
    elif activation == 'sigmoid':
        self.activation = self.sigmoid
    self.input_nodes = input_nodes
    self.hidden_layers = hidden_layers
    self.output_nodes = output_nodes
    self.initial_lr = learning_rate
    self.weights = []
    self.biases = []
    # input layer to first hidden layer
    w = 0.01 * np.random.randn(self.input_nodes, self.hidden_layers[0])
    b = np.zeros((1, self.hidden_layers[0]))
    self.weights.append(w)
    self.biases.append(b)
    # between all hidden layers
    for i in range(1, len(hidden_layers)):
        w = 0.01 * np.random.randn(self.hidden_layers[i-1], self.hidden_layers[i])
        b = np.zeros((1, self.hidden_layers[i]))
        self.weights.append(w)
        self.biases.append(b)
    # last hidden layer to output layer
    w = 0.01 * np.random.randn(self.hidden_layers[-1], self.output_nodes)
    b = np.zeros((1, self.output_nodes))
    self.weights.append(w)
    self.biases.append(b)
    return
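As a usage sketch, the two architectures listed above map directly onto this constructor. The class name NeuralNetwork is an assumption for illustration; substitute whatever the class is actually called.

# MNIST architecture: 784 -> 800 -> 800 -> 10 (class name assumed)
mnist_net = NeuralNetwork(input_nodes=784, hidden_layers=[800, 800],
                          output_nodes=10, activation='relu', learning_rate=0.001)

# Fashion MNIST architecture: 784 -> 1024 -> 512 -> 256 -> 10
fashion_net = NeuralNetwork(input_nodes=784, hidden_layers=[1024, 512, 256],
                            output_nodes=10, activation='relu', learning_rate=0.001)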
Each layer computes a linear transformation followed by an activation function, except the output layer, where softmax is applied for classification. The input vector a is passed through the first hidden layer. The operation np.matmul(a, self.weights[i]) + self.biases[i] computes a weighted sum of the inputs, which is then passed through the activation function self.activation(z). For the output layer, the weighted sum is computed just as for the hidden layers, and the Softmax function transforms the output into a probability distribution over the classes, which is what we want for classification problems like MNIST.
def feed_forward(self, inputs):
    self.outputs_activation = []
    self.outputs_linear = []
    a = inputs
    self.outputs_activation.append(a)
    # all layers other than the output layer use the hidden activation function
    for i in range(len(self.hidden_layers)):
        z = np.matmul(a, self.weights[i]) + self.biases[i]
        self.outputs_linear.append(z)
        a = self.activation(z)
        self.outputs_activation.append(a)
    # the output layer uses softmax
    z = np.matmul(a, self.weights[-1]) + self.biases[-1]
    self.outputs_linear.append(z)
    a = self.softmax(z)
    self.outputs_activation.append(a)
    return self.outputs_activation[-1]
Backpropagation allows the model to adjust its parameters based on the error in its predictions. This function computes the gradients of the loss with respect to every weight and bias using the chain rule. Because the output layer combines softmax with a cross-entropy loss, the gradient of the loss with respect to the output layer's pre-activations simplifies to the difference between the predicted output y_hat and the true labels y, i.e. dZ = y_hat - y. The gradient with respect to the weights dW is the product of the previous layer's activations self.outputs_activation[-2].T and dZ, and the bias gradient db is the average of dZ across the batch ((1. / m) * np.sum(dZ, axis=0, keepdims=True)). For each hidden layer, the error is propagated backward: dA = np.matmul(dZ, self.weights[i+1].T) gives the gradient with respect to that layer's output, and multiplying by the derivative of the activation function yields dZ (which accounts for the non-linearity). Since backpropagation proceeds from the output layer toward the input layer, the lists of gradients are reversed at the end so they line up with the order of the weights and biases.
def back_propagation(self, y, y_hat):
    m = y.shape[0]  # batch size
    self.dW = []
    self.db = []
    # for the output layer
    dZ = self.outputs_activation[-1] - y
    dW = (1. / m) * np.matmul(self.outputs_activation[-2].T, dZ)
    db = (1. / m) * np.sum(dZ, axis=0, keepdims=True)
    self.dW.append(dW)
    self.db.append(db)
    # for layers before the output layer
    for i in range(len(self.hidden_layers) - 1, -1, -1):
        dA = np.matmul(dZ, self.weights[i+1].T)
        dZ = dA * self.activation(self.outputs_linear[i], d=True)
        dW = (1. / m) * np.matmul(self.outputs_activation[i].T, dZ)
        db = (1. / m) * np.sum(dZ, axis=0, keepdims=True)
        self.dW.append(dW)
        self.db.append(db)
    self.dW.reverse()
    self.db.reverse()
    return self.dW, self.db
Classic stochastic gradient descent (SGD): each parameter is updated by subtracting the learning rate times the gradient of the loss with respect to that parameter.
def optimize(self, learning_rate):
    for i in range(len(self.hidden_layers) + 1):
        self.weights[i] -= learning_rate * self.dW[i]
        self.biases[i] -= learning_rate * self.db[i]
    return self.weights, self.biases
The Adam optimizer adapts the learning rates for each parameter based on the moving averages of past gradients and squared gradients. The bias correction ensures that the moving averages are unbiased. The hyperparameters are initialized when the network parameters are initialized.
# Adam hyperparameters
self.beta1 = beta1
self.beta2 = beta2
self.epsilon = epsilon
self.m_w = [np.zeros_like(w) for w in self.weights]
self.m_b = [np.zeros_like(b) for b in self.biases]
self.v_w = [np.zeros_like(w) for w in self.weights]
self.v_b = [np.zeros_like(b) for b in self.biases]
self.t = 0 # time step for Adam
The algorithm maintains moving averages of both the gradients (m_w, m_b) and the squared gradients (v_w, v_b). These averages smooth out the updates and improve convergence. At the beginning of training the moving averages are biased toward zero, so bias correction is applied (m_w_hat, m_b_hat, v_w_hat, v_b_hat). The weights and biases are then updated with the corrected first moments, scaled by the learning rate and divided by the square root of the corrected second moments (np.sqrt(v_w_hat) + self.epsilon), where epsilon prevents division by zero.
def optimize(self, learning_rate):
    if learning_rate is None:
        learning_rate = self.initial_lr  # default learning rate
    self.t += 1  # increment time step
    for i in range(len(self.hidden_layers) + 1):
        # moving averages of the gradients
        self.m_w[i] = self.beta1 * self.m_w[i] + (1 - self.beta1) * self.dW[i]
        self.m_b[i] = self.beta1 * self.m_b[i] + (1 - self.beta1) * self.db[i]
        # moving averages of the squared gradients
        self.v_w[i] = self.beta2 * self.v_w[i] + (1 - self.beta2) * np.square(self.dW[i])
        self.v_b[i] = self.beta2 * self.v_b[i] + (1 - self.beta2) * np.square(self.db[i])
        # bias correction
        m_w_hat = self.m_w[i] / (1 - self.beta1**self.t)
        m_b_hat = self.m_b[i] / (1 - self.beta1**self.t)
        v_w_hat = self.v_w[i] / (1 - self.beta2**self.t)
        v_b_hat = self.v_b[i] / (1 - self.beta2**self.t)
        self.weights[i] -= learning_rate * m_w_hat / (np.sqrt(v_w_hat) + self.epsilon)
        self.biases[i] -= learning_rate * m_b_hat / (np.sqrt(v_b_hat) + self.epsilon)
    return self.weights, self.biases
The data is divided into mini-batches, which makes gradient updates more frequent and often speeds up training. The training data is shuffled at the start of each epoch to prevent the model from learning patterns based solely on the order of the data. Each input batch is passed through the network to obtain predictions, the gradients are computed via backpropagation, and the parameters are updated with the optimizer.
def train(self, x_train, y_train, x_test, y_test, epochs, batch_size):
    learning_rate = self.initial_lr
    for epoch in range(epochs):
        # shuffle data
        indices = np.random.permutation(x_train.shape[0])
        x_train = x_train[indices]
        y_train = y_train[indices]
        num_batches = x_train.shape[0] // batch_size
        for i in range(0, x_train.shape[0], batch_size):
            x = x_train[i:i+batch_size]
            y = y_train[i:i+batch_size]
            # forward pass
            y_hat = self.feed_forward(x)
            # back propagation and optimization
            self.back_propagation(y, y_hat)
            self.optimize(learning_rate)
    return self.weights, self.biases
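The per-epoch losses and accuracies reported in the runs below come from evaluating the network on the full training and test sets. The helpers below are an assumed sketch of such an evaluation (categorical cross-entropy and arg-max accuracy), not necessarily the exact code used.

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # mean negative log-likelihood for one-hot labels y
    return -np.mean(np.sum(y * np.log(y_hat + eps), axis=1))

def accuracy(y, y_hat):
    # fraction of samples where the arg-max prediction matches the label
    return np.mean(np.argmax(y_hat, axis=1) == np.argmax(y, axis=1))

# hypothetical usage after each epoch ("nn" stands for the trained network object):
# y_hat_test = nn.feed_forward(x_test)
# test_loss = cross_entropy(y_test, y_hat_test)
# test_acc = accuracy(y_test, y_hat_test)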
- Learning rate = 0.1
Epoch 1/10 - 30.60s - Train Loss: 0.6609, Test Loss: 0.2441, Train Accuracy: 80.21%, Test Accuracy: 92.83%
Epoch 2/10 - 65.93s - Train Loss: 0.2001, Test Loss: 0.1495, Train Accuracy: 94.11%, Test Accuracy: 95.66%
Epoch 3/10 - 96.51s - Train Loss: 0.1304, Test Loss: 0.1261, Train Accuracy: 96.20%, Test Accuracy: 96.21%
Epoch 4/10 - 127.26s - Train Loss: 0.0955, Test Loss: 0.1118, Train Accuracy: 97.15%, Test Accuracy: 96.49%
Epoch 5/10 - 157.62s - Train Loss: 0.0739, Test Loss: 0.0773, Train Accuracy: 97.86%, Test Accuracy: 97.61%
Epoch 6/10 - 188.26s - Train Loss: 0.0594, Test Loss: 0.0706, Train Accuracy: 98.24%, Test Accuracy: 97.82%
Epoch 7/10 - 219.10s - Train Loss: 0.0481, Test Loss: 0.1097, Train Accuracy: 98.57%, Test Accuracy: 96.50%
Epoch 8/10 - 250.14s - Train Loss: 0.0392, Test Loss: 0.0635, Train Accuracy: 98.82%, Test Accuracy: 97.97%
Epoch 9/10 - 280.92s - Train Loss: 0.0318, Test Loss: 0.0603, Train Accuracy: 99.09%, Test Accuracy: 98.18%
Epoch 10/10 - 311.33s - Train Loss: 0.0256, Test Loss: 0.0621, Train Accuracy: 99.30%, Test Accuracy: 97.95%
Training Accuracy: 0.9942333333333333
Testing Accuracy: 0.9795
def exponential_decay(self, learning_rate, epoch, decay_rate=0.01):
    return learning_rate * np.exp(-decay_rate * epoch)
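As a quick illustration of how the schedule behaves, the sketch below (a standalone version of the method, assumed usage) prints the decayed learning rate that would be passed to optimize() at the start of each epoch.

import numpy as np

def exponential_decay(initial_lr, epoch, decay_rate=0.01):
    return initial_lr * np.exp(-decay_rate * epoch)

# the learning rate shrinks smoothly across epochs, e.g. with initial_lr = 0.1:
for epoch in range(10):
    lr = exponential_decay(0.1, epoch)
    print(f"epoch {epoch + 1}: lr = {lr:.5f}")  # 0.10000, 0.09900, 0.09802, ...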
- Initial learning rate = 0.1
- Decay rate = 0.01
Epoch 1/10 - 30.45s - Train Loss: 0.6614, Test Loss: 0.2482, Train Accuracy: 80.44%, Test Accuracy: 92.47%
Epoch 2/10 - 61.30s - Train Loss: 0.2027, Test Loss: 0.1516, Train Accuracy: 94.04%, Test Accuracy: 95.51%
Epoch 3/10 - 91.79s - Train Loss: 0.1323, Test Loss: 0.1161, Train Accuracy: 96.09%, Test Accuracy: 96.37%
Epoch 4/10 - 122.31s - Train Loss: 0.0971, Test Loss: 0.0918, Train Accuracy: 97.16%, Test Accuracy: 97.13%
Epoch 5/10 - 152.95s - Train Loss: 0.0747, Test Loss: 0.0843, Train Accuracy: 97.78%, Test Accuracy: 97.41%
Epoch 6/10 - 183.43s - Train Loss: 0.0593, Test Loss: 0.0762, Train Accuracy: 98.26%, Test Accuracy: 97.61%
Epoch 7/10 - 214.11s - Train Loss: 0.0481, Test Loss: 0.0670, Train Accuracy: 98.62%, Test Accuracy: 97.91%
Epoch 8/10 - 245.07s - Train Loss: 0.0396, Test Loss: 0.0796, Train Accuracy: 98.88%, Test Accuracy: 97.50%
Epoch 9/10 - 275.68s - Train Loss: 0.0322, Test Loss: 0.0682, Train Accuracy: 99.11%, Test Accuracy: 97.85%
Epoch 10/10 - 306.51s - Train Loss: 0.0262, Test Loss: 0.0658, Train Accuracy: 99.33%, Test Accuracy: 97.95%
Training Accuracy: 0.9948333333333333
Testing Accuracy: 0.9795
- Initial learning rate = 0.001
Epoch 1/10 - 53.63s - Train Loss: 0.2336, Test Loss: 0.1095, Train Accuracy: 92.92%, Test Accuracy: 96.60%
Epoch 2/10 - 107.06s - Train Loss: 0.0891, Test Loss: 0.0866, Train Accuracy: 97.28%, Test Accuracy: 97.32%
Epoch 3/10 - 162.39s - Train Loss: 0.0588, Test Loss: 0.0711, Train Accuracy: 98.08%, Test Accuracy: 97.83%
Epoch 4/10 - 215.41s - Train Loss: 0.0424, Test Loss: 0.0702, Train Accuracy: 98.63%, Test Accuracy: 97.82%
Epoch 5/10 - 268.94s - Train Loss: 0.0332, Test Loss: 0.0951, Train Accuracy: 98.94%, Test Accuracy: 97.47%
Epoch 6/10 - 323.32s - Train Loss: 0.0279, Test Loss: 0.0863, Train Accuracy: 99.07%, Test Accuracy: 97.56%
Epoch 7/10 - 376.90s - Train Loss: 0.0225, Test Loss: 0.0884, Train Accuracy: 99.27%, Test Accuracy: 97.84%
Epoch 8/10 - 445.38s - Train Loss: 0.0209, Test Loss: 0.0922, Train Accuracy: 99.37%, Test Accuracy: 97.93%
Epoch 9/10 - 515.89s - Train Loss: 0.0180, Test Loss: 0.0719, Train Accuracy: 99.40%, Test Accuracy: 98.33%
Epoch 10/10 - 589.08s - Train Loss: 0.0162, Test Loss: 0.0837, Train Accuracy: 99.48%, Test Accuracy: 98.17%
Training Accuracy: 0.9968833333333333
Testing Accuracy: 0.9817
- Initial learning rate = 0.001
- Decay rate = 0.01
Epoch 1/10 - 66.67s - Train Loss: 0.2351, Test Loss: 0.1225, Train Accuracy: 92.92%, Test Accuracy: 96.06%
Epoch 2/10 - 131.17s - Train Loss: 0.0877, Test Loss: 0.0802, Train Accuracy: 97.31%, Test Accuracy: 97.36%
Epoch 3/10 - 190.21s - Train Loss: 0.0579, Test Loss: 0.0688, Train Accuracy: 98.17%, Test Accuracy: 97.92%
Epoch 4/10 - 251.77s - Train Loss: 0.0409, Test Loss: 0.0770, Train Accuracy: 98.69%, Test Accuracy: 97.83%
Epoch 5/10 - 311.24s - Train Loss: 0.0312, Test Loss: 0.0801, Train Accuracy: 98.99%, Test Accuracy: 97.94%
Epoch 6/10 - 374.08s - Train Loss: 0.0254, Test Loss: 0.0776, Train Accuracy: 99.14%, Test Accuracy: 98.05%
Epoch 7/10 - 434.38s - Train Loss: 0.0181, Test Loss: 0.0999, Train Accuracy: 99.42%, Test Accuracy: 97.77%
Epoch 8/10 - 514.73s - Train Loss: 0.0143, Test Loss: 0.0877, Train Accuracy: 99.53%, Test Accuracy: 98.10%
Epoch 9/10 - 592.68s - Train Loss: 0.0133, Test Loss: 0.0866, Train Accuracy: 99.60%, Test Accuracy: 98.14%
Epoch 10/10 - 674.66s - Train Loss: 0.0097, Test Loss: 0.0790, Train Accuracy: 99.69%, Test Accuracy: 98.21%
Training Accuracy: 0.99605
Testing Accuracy: 0.9821
Training is stopped when there is no improvement over a number of epochs to prevent overfitting or wasting compute time.
# inside the epoch loop of train(), after evaluating test_loss for the epoch
if test_loss < best_test_loss:
    best_test_loss = test_loss
    # store copies so later in-place updates don't overwrite the best parameters
    best_weights = [w.copy() for w in self.weights]
    best_biases = [b.copy() for b in self.biases]
    epochs_without_improvement = 0  # reset counter
else:
    epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        self.weights = best_weights
        self.biases = best_biases
        print(f"Early stopping at epoch {epoch + 1} due to no improvement in validation loss.")
        break
- Learning rate = 0.001
- Decay rate = 0.01
- Patience = 5 epochs
Epoch 1/10 - 57.06s - Train Loss: 0.2188, Test Loss: 0.0938, Train Accuracy: 93.38%, Test Accuracy: 97.23%
Epoch 2/10 - 113.69s - Train Loss: 0.0886, Test Loss: 0.0735, Train Accuracy: 97.25%, Test Accuracy: 97.68%
Epoch 3/10 - 167.89s - Train Loss: 0.0564, Test Loss: 0.0785, Train Accuracy: 98.25%, Test Accuracy: 97.59%
Epoch 4/10 - 224.47s - Train Loss: 0.0428, Test Loss: 0.0744, Train Accuracy: 98.63%, Test Accuracy: 97.83%
Epoch 5/10 - 281.49s - Train Loss: 0.0330, Test Loss: 0.0754, Train Accuracy: 98.95%, Test Accuracy: 97.96%
Epoch 6/10 - 338.37s - Train Loss: 0.0260, Test Loss: 0.0966, Train Accuracy: 99.14%, Test Accuracy: 97.56%
Early stopping at epoch 7 due to no improvement in validation loss.
Training Accuracy: 0.99545
Testing Accuracy: 0.9825
In the loss plot, the X-axis labels are epoch numbers offset by 1. The best test loss (0.0735) was observed at epoch 2, and training stopped at epoch 7 after five epochs without improvement. By stopping early, the network avoided spending further epochs on training that no longer improved the test loss.
Testing the neural network on the CIFAR10 dataset shows the need for further improvement. Implementing momentum with the SGD optimizer to improve convergence speed and adding regularization to control overfitting would be the next steps for this neural network.
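As a rough sketch of that first next step (not part of the current code), an SGD-with-momentum update keeps a velocity term per parameter; it could look something like this, mirroring the structure of the existing optimize() methods.

# hypothetical momentum-SGD update (a sketch of the proposed improvement);
# v_w and v_b would be initialized with zeros, like the Adam moment estimates
def momentum_update(weights, biases, dW, db, v_w, v_b, learning_rate, momentum=0.9):
    for i in range(len(weights)):
        # the velocity accumulates an exponentially decaying sum of past gradients
        v_w[i] = momentum * v_w[i] - learning_rate * dW[i]
        v_b[i] = momentum * v_b[i] - learning_rate * db[i]
        weights[i] += v_w[i]
        biases[i] += v_b[i]
    return weights, biases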
I have built and experimented with deep neural networks from scratch using Python and NumPy, testing various optimization techniques and training strategies. The results showed that the Adam optimizer, with its adaptive learning rate, outperformed basic Stochastic Gradient Descent (SGD) from the first epoch. It led to faster convergence and better accuracy overall.
Although the SGD optimizer improved with learning rate decay, the gains were relatively small compared to Adam’s performance. This suggests that Adam’s ability to adjust the learning rate for each parameter dynamically makes manual decay less critical.
I also found early stopping to be crucial in preventing overfitting and reducing unnecessary computation. By halting training once performance plateaued, I was able to reach near-optimal accuracy in fewer epochs, saving both time and resources.
These findings highlight the efficiency of the Adam optimizer, especially in terms of rapid convergence and minimizing the need for extensive hyperparameter tuning.
- Deep Learning - 3Blue1Brown
- Neural Networks and Deep Learning - Michael Nielsen
- Building a Neural Network from Scratch (Parts 1 and 2) - Jonathan Weisberg
- MNIST Dataset
- Fashion MNIST Dataset
- CIFAR10 Dataset