mynn
This is a from-scratch implementation of neural networks depending only on numpy. It is for educational purposes only.

Concepts

Neuron

Linear equation

y = w1*x1 + ... + wn*xn + b

Input is a vector

X = [x1 ... xn]

and the output is a scalar

y
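
A minimal numpy sketch of a single neuron (the names X, w, b follow the notation above; this is illustrative, not code from the repo):

import numpy as np

X = np.array([1.0, 2.0, 3.0])    # input vector, shape (n,)
w = np.array([0.2, -0.5, 1.0])   # weights, shape (n,)
b = 2.0                          # bias, scalar

y = np.dot(X, w) + b             # y = w1*x1 + w2*x2 + w3*x3 + b
print(y)                         # 4.2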

Layer

A layer is a list of m linear equations in the same dimension n.

y1 = w11*x1 + w12*x2 + ... + w1n*xn + b1
...
ym = wm1*x1 + wm2*x2 + ... + wmn*xn + bm

Therefore the output of m linear equations for one sample, i.e. the same input [x1 ... xn], is

y = [y1 ... ym]

We want to calculate results for multiple samples at once. In other words, we want to feed a list of samples and receive a list of results from a layer.

Input for k samples

X = [[x11 ... x1n] ... [xk1 ... xkn]]

Output for k samples is a list of results for each m equation

y = [[y11 ... y1m] ... [yk1 ... ykm]]

A layer stores its weights and biases

w = [[w11 .. wm1] [w12 .. wm2] ... [w1n .. wmn]]
b = [b1 ... bm]

Shapes of input X, weights w, biases b, and output y are (k, n), (n, m), (m,), and (k, m)

The calculation of the outputs is then done with a simple dot product of inputs and weights, plus the biases

y = X*w + b

We get a dimension of (k, n) * (n, m) + (m,) -> (k, m)

Functional representation of a layer

layer(X) = X*w + b
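
As a rough sketch, the batched layer computation in numpy (shapes follow the ones above; this is illustrative, not the repo's layer class):

import numpy as np

k, n, m = 4, 3, 2                 # samples, inputs, neurons

X = np.random.randn(k, n)         # inputs, shape (k, n)
w = 0.01 * np.random.randn(n, m)  # weights, shape (n, m)
b = np.zeros(m)                   # biases, shape (m,)

y = np.dot(X, w) + b              # broadcasting adds b to every sample row
print(y.shape)                    # (k, m)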

Activation Functions

An activation function a is applied to the output of each neuron in a layer

a(y)

We will use the same activation function for all neurons in a layer.

Input is the output of a layer of shape (k, m), where k is the number of samples and m the number of neurons.

Output is usually the same shape (k, m), which will be used as input for the next layer or for the loss function.

Rectified Linear Activation Function (ReLU)

The ReLU activation function is very close to a linear activation function while remaining nonlinear, due to the bend at 0.

a(y) = max(0, y)
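
In numpy this is a single call (a small sketch, not necessarily how the repo implements it):

import numpy as np

y = np.array([[-1.5, 0.0, 2.3],
              [ 0.7, -0.2, 1.1]])  # layer output, shape (k, m)

a = np.maximum(0, y)               # element-wise max(0, y)
print(a)                           # [[0.  0.  2.3]
                                   #  [0.7 0.  1.1]]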

Feed forward

Putting neurons, layers, and activation functions together, we can get an output (ŷ) from our network

ŷ = a(layer(X))
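
Putting the two sketches above together, a forward pass through one layer might look like this (function names are illustrative, not the repo's API):

import numpy as np

def layer(X, w, b):
    return np.dot(X, w) + b        # (k, n) * (n, m) + (m,) -> (k, m)

def relu(y):
    return np.maximum(0, y)

X = np.random.randn(5, 3)          # 5 samples, 3 features
w = 0.01 * np.random.randn(3, 4)   # layer with 4 neurons
b = np.zeros(4)

y_hat = relu(layer(X, w, b))       # ŷ = a(layer(X)), shape (5, 4)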

Loss Functions

The loss function (aka cost function) is the algorithm that quantifies how wrong a network is.

The loss is the value of this metric.

The input for a loss function C is the output of the activation function of the last layer (ŷ).

C(ŷ, y) = C(a(layer(X)), y)

The output of a loss function is the loss for each sample.

L = [l1 ... lk]

The shape of the output is (k,)

C(ŷ, y) -> L

The overall loss of a network is the mean of the elements in L plus the regularization penalties L1 and L2 (if applied).

Mean Squared Error Loss

You square the differences between the predicted values (ŷ) and the true values (y), as the model can have multiple regression outputs, and average those squared values.

C(ŷ, y) = sum((y - ŷ)**2)/k

where k is the number of samples
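
A small numpy sketch of this formula (dividing the summed squared error by the number of samples k, as written above; the function name is mine, not the repo's):

import numpy as np

def mse_loss(y_hat, y):
    k = len(y)                             # number of samples
    return np.sum((y - y_hat) ** 2) / k    # C(ŷ, y) = sum((y - ŷ)**2)/k

y     = np.array([[1.0], [2.0], [3.0]])    # true values
y_hat = np.array([[1.1], [1.8], [3.3]])    # predictions

print(mse_loss(y_hat, y))                  # (0.01 + 0.04 + 0.09) / 3 ≈ 0.0467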

Optimization

The only values we can change to get a better loss (ideally zero) are the weights and biases in the layers.

Gradient Descent

We need the gradient of the loss function C with respect to the weights to get the direction and magnitude in which to move the weights.

w = w - r*dC/dw

where r is a value in (0, 1), called the learning rate.

Chain Rule

Recall how the loss is calculated (l stands for layer)

C(ŷ, y) = C(a(l(X)), y)

According to chain rule, the gradient of C with respect to w can be written as

dC/dw = (dl/dw)*(da/dl)*(dC/da)

Partial Derivatives

The derivative of Mean Squared Error function is (shape (k, m))

dC_da = -2 * (y - ŷ) / k

The derivative of the ReLU activation function, already multiplied by dC_da as the chain rule requires, is (shape (k, m))

da_dl = dC_da.copy()
da_dl[layer(X) <= 0] = 0

The derivative of the layer is

dl_dw = X.T * da_dl

Shape of dl_dw is (n, k) * (k, m) -> (n, m)

This means m weight gradients for each of the n inputs, i.e. one gradient per weight.

To update the biases we aggregate over all samples and get the summed gradient for each neuron.

dl_db = sum(da_dl, axis=0)

Shape of dl_db is (m,).
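
Chaining these partial derivatives for one layer with ReLU and MSE gives a backward pass like the following sketch (variable names follow the notation above, not necessarily the repo's code):

import numpy as np

k, n, m = 4, 3, 2
X = np.random.randn(k, n)          # inputs
w = 0.01 * np.random.randn(n, m)   # weights
b = np.zeros(m)                    # biases
y = np.random.rand(k, m)           # true values for this sketch

z     = np.dot(X, w) + b           # layer(X), shape (k, m)
y_hat = np.maximum(0, z)           # ReLU output ŷ, shape (k, m)

dC_da = -2 * (y - y_hat) / k       # MSE derivative, shape (k, m)
da_dl = dC_da.copy()               # ReLU derivative chained with dC_da
da_dl[z <= 0] = 0
dl_dw = np.dot(X.T, da_dl)         # (n, k) * (k, m) -> (n, m)
dl_db = np.sum(da_dl, axis=0)      # sum over samples -> (m,)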

Vanilla SGD (Stochastic Gradient Descent) Optimizer

We apply the partial derivatives using gradient descent and the chain rule.

w = w - r*dl_dw
b = b - r*dl_db
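
Tying everything together, a rough end-to-end training loop for a single layer could look like this (a sketch of the formulas above, not the repo's model API):

import numpy as np

np.random.seed(0)
k, n, m = 8, 3, 1
X = np.random.randn(k, n)              # inputs
y = np.random.rand(k, m)               # targets
w = 0.1 * np.random.randn(n, m)        # weights
b = np.zeros(m)                        # biases
r = 0.05                               # learning rate

for epoch in range(200):
    # feed forward
    z     = np.dot(X, w) + b
    y_hat = np.maximum(0, z)

    # feed backward (partial derivatives from above)
    dC_da = -2 * (y - y_hat) / k
    da_dl = dC_da.copy()
    da_dl[z <= 0] = 0
    dl_dw = np.dot(X.T, da_dl)
    dl_db = np.sum(da_dl, axis=0)

    # vanilla SGD step
    w = w - r * dl_dw
    b = b - r * dl_db

print(np.sum((y - np.maximum(0, np.dot(X, w) + b)) ** 2) / k)   # final loss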

Implementation

Metrics

Feed Forward

Feed Backward

Model

Prediction

Evaluation
