Hello, my name is Andrej and I've been training deep neural networks for a bit more than a decade, and in this lecture I'd like to show you what neural network training looks like under the hood. In particular, we are going to start with a blank Jupyter notebook, and by the end of this lecture we will define and train a neural net, and you'll get to see everything that goes on under the hood and exactly how that works on an intuitive level.

Specifically, what I would like to do is take you through the building of micrograd. Micrograd is a library that I released on GitHub about two years ago, but at the time I only uploaded the source code, and you'd have to go in by yourself and really figure out how it works. So in this lecture I will take you through it step by step and comment on all the pieces of it. So what is micrograd, and why is it interesting? Micrograd is basically an autograd engine. Autograd is short for automatic gradient, and really what it does is implement backpropagation. Backpropagation is the algorithm that allows you to efficiently evaluate the gradient of some kind of a loss function with respect to the weights of a neural network. What that allows us to do is iteratively tune the weights of that neural network to minimize the loss function, and therefore improve the accuracy of the network. So backpropagation would be at the mathematical core of any modern deep neural network library, like, say, PyTorch or JAX.

The functionality of micrograd is, I think, best illustrated by an example. If we just scroll down here, you'll see that micrograd basically allows you to build out mathematical expressions. Here we have an expression with two inputs, a and b; you'll see that a and b are negative four and two, but we are wrapping those values into this Value object that we are going to build out as part of micrograd. So this Value object wraps the numbers themselves, and then we build out a mathematical expression where a and b are transformed into c, d, and eventually e, f, and g. I'm showing some of the functionality of micrograd and the operations that it supports: you can add two Value objects, you can multiply them, you can raise them to a constant power, you can offset by one, negate, squash at zero, square, divide by a constant, divide by a value, et cetera. So we're building out an expression graph with these two inputs, a and b, and we're creating an output value, g, and micrograd will, in the background, build out this entire mathematical expression. It will, for example, know that c is also a Value, that c was the result of an addition operation, and that the child nodes of c are a and b, because it maintains pointers to the a and b Value objects. So it basically knows exactly how all of this is laid out. And then not only can we do what we call the forward pass, where we look at the value of g (which we access using the .data attribute; the output of the forward pass, the value of g, turns out to be 24.7), but the big deal is that we can also take this g Value object and call .backward() on it.
And this will basically initialize backpropagation at the node g. What backpropagation is going to do is start at g, go backwards through the expression graph, and recursively apply the chain rule from calculus. That allows us to evaluate the derivative of g with respect to all the internal nodes, like e, d, and c, but also with respect to the inputs a and b. We can then query the derivative of g with respect to a, for example, as a.grad; in this case it happens to be 138. And the derivative of g with respect to b also happens to be here: 645. This derivative, we'll see soon, is very important information, because it's telling us how a and b are affecting g through this mathematical expression. In particular, a.grad is 138, so if we nudge a and make it slightly larger, 138 is telling us that g will grow, and the slope of that growth is going to be 138; and the slope of growth with respect to b is going to be 645. So that tells us how g will respond if a and b get tweaked a tiny amount in a positive direction.
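Putting that whole example together, it looks roughly like this. This is essentially the example from the micrograd README, reproduced here from memory, so treat the exact expression as illustrative rather than authoritative:

```python
from micrograd.engine import Value

a = Value(-4.0)
b = Value(2.0)
c = a + b
d = a * b + b**3
c += c + 1
c += 1 + c + (-a)
d += d * 2 + (b + a).relu()
d += 3 * d + (b - a).relu()
e = c - d
f = e**2
g = f / 2.0
g += 10.0 / f
print(f'{g.data:.4f}')  # 24.7041, the outcome of the forward pass
g.backward()            # run backpropagation starting at g
print(f'{a.grad:.4f}')  # 138.8338, i.e. dg/da
print(f'{b.grad:.4f}')  # 645.5773, i.e. dg/db
```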
Now, you might be confused about what this expression is that we built out here, and this expression, by the way, is completely meaningless. I just made it up; I'm just flexing about the kinds of operations that are supported by micrograd. What we actually really care about are neural networks. But it turns out that neural networks are just mathematical expressions, just like this one, and actually even slightly less crazy: a neural network is a mathematical expression that takes the input data as an input and the weights of the neural network as an input, and the output is the predictions of your neural net, or the loss function. We'll see this in a bit. But basically, neural networks just happen to be a certain class of mathematical expressions. Backpropagation is actually significantly more general: it doesn't care about neural networks at all, it only cares about arbitrary mathematical expressions, and we then happen to use that machinery for training neural networks.

One more note I would like to make at this stage is that, as you see here, micrograd is a scalar-valued autograd engine. It works on the level of individual scalars, like negative four and two, and we're taking neural nets and breaking them down all the way to these atoms of individual scalars and all the little plus and times operations. This is excessive, and obviously you would never do any of this in production. It's really just done for pedagogical reasons, because it allows us to not have to deal with the n-dimensional tensors that you would use in a modern deep neural network library. This is really done so that you understand backpropagation and the chain rule and the training of neural networks; and then, if you actually want to train bigger networks, you have to be using tensors, but none of the math changes. Tensors exist purely for efficiency: we take all the scalar values and package them up into tensors, which are just arrays of scalars, and then, because we have these large arrays, we run operations on those large arrays, which lets us take advantage of the parallelism in a computer. All those operations can be done in parallel, so the whole thing runs faster; but really, none of the math changes, and it's done purely for efficiency. So I don't think that it's pedagogically useful to be dealing with tensors from scratch, and that's fundamentally why I wrote micrograd: you can understand how things work at the fundamental level, and then you can speed it up later.

Okay, so here's the fun part. My claim is that micrograd is all you need to train neural networks, and everything else is just efficiency. So you'd think that micrograd would be a very complex piece of code, and that turns out to not be the case. If we go to the micrograd repository, you'll see that there are only two files: engine.py, which is the actual autograd engine and doesn't know anything about neural nets, and nn.py, which is the entire neural nets library on top of it. The actual backpropagation autograd engine that gives you the power of neural networks is literally about 100 lines of very simple Python, which we'll understand by the end of this lecture. And then nn.py, the neural network library built on top of the autograd engine, is like a joke: we define what a neuron is, then we define what a layer of neurons is, and then we define the multilayer perceptron, which is just a sequence of layers of neurons. So basically there's a lot of power that comes from only around 150 lines of code, and that's all you need to understand neural network training; everything else is just efficiency. And of course, there's a lot to efficiency, but fundamentally that's all that's happening.

Okay, so now let's dive right in and implement micrograd step by step. The first thing I'd like to do is make sure you have a very good intuitive understanding of what a derivative is and exactly what information it gives you. Let's start with some basic imports that I copy-paste into every Jupyter notebook, and let's define a scalar-valued function f(x) as follows. I just made this up randomly; I wanted a function that takes a single scalar x and returns a single scalar y. We can call this function, of course, so we can pass in, say, 3.0 and get 20 back. We can also plot this function to get a sense of its shape; you can tell from the mathematical expression that this is probably a parabola, a quadratic. We create a set of scalar values that we can feed in, using, for example, np.arange from negative 5 to 5 in steps of 0.25, so x goes from negative 5 up to, but not including, 5. We can call f on this numpy array as well, and we get a set of y's: f applied to every one of these elements independently. Then we can plot this using matplotlib, plt.plot of the x's and y's, and we get a nice parabola. So previously we fed in 3.0 somewhere here, and we received 20 back, which is the y coordinate here.
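In code, these first cells look roughly like this (a sketch, assuming the usual numpy and matplotlib imports):

```python
import numpy as np
import matplotlib.pyplot as plt

# a scalar-valued function, made up randomly
def f(x):
    return 3*x**2 - 4*x + 5

print(f(3.0))  # 20.0

xs = np.arange(-5, 5, 0.25)  # -5 to 5 (exclusive) in steps of 0.25
ys = f(xs)                   # numpy applies f to every element independently
plt.plot(xs, ys)             # a nice parabola
plt.show()
```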
So now I'd like to think through: what is the derivative of this function at any single input point x? What is the derivative at different points x of this function? Now, if you remember back to your calculus class, you've probably derived derivatives by taking a mathematical expression like 3x**2 - 4x + 5, writing it out on a piece of paper, applying the product rule and all the other rules, and deriving the mathematical expression for the derivative of the original function. Then you could plug in different x's and see what the derivative is. We're not going to actually do that, because no one in neural networks actually writes out the expression for the neural net. It would be a massive expression, thousands, tens of thousands of terms; no one actually derives the derivative symbolically. So we're not going to take this kind of symbolic approach. Instead, what I'd like to do is look at the definition of the derivative and just make sure that we really understand what the derivative is measuring, what it's telling you about the function. And if we just look up "derivative" we see, well, this is actually not a very good definition of a derivative; this is a definition of what it means to be differentiable. But if you remember from your calculus, the derivative is the limit, as h goes to 0, of f(x+h) minus f(x), all over h. Basically, what it's saying is: if you slightly bump up the input at some point x that you're interested in, if you slightly increase it by a small number h, how does the function respond? With what sensitivity? Does the function go up or does it go down, and by how much? That's the slope of the function, the slope of that response, at that point.
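Written out, that definition is:

```latex
f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}
```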
So we can evaluate the derivative here numerically. The definition would ask us to take h to 0; we're just going to pick a very small h, say 0.001. And let's say we're interested in the point 3.0. We can look at f(x), which of course is 20, and now f(x+h): if we slightly nudge x in the positive direction, how is the function going to respond? Just looking at the plot, do you expect f(x+h) to be slightly greater than 20, or slightly lower than 20? Since 3 is here and this is 20, if we go slightly positive, the function will respond positively, so you'd expect this to be slightly greater than 20; and by how much is telling you the strength, the size, of that slope. So f(x+h) minus f(x) is how much the function responded in the positive direction, and we have to normalize by the run, so we have rise over run to get the slope. This, of course, is just a numerical approximation of the slope, because we'd have to make h very, very small to converge to the exact amount. And if I add too many zeros to h, at some point I'll get an incorrect answer, because we're using floating point arithmetic, and the representations of all these numbers in computer memory are finite; at some point we get into trouble. But we can converge towards the right answer with this approach, and basically, at 3, the slope is 14. You can see that by taking 3x**2 - 4x + 5 and differentiating it in your head: the derivative is 6x - 4, and plugging in x equals 3 gives 18 minus 4, which is 14. So this is correct.

So that's at 3. Now, how about the slope at, say, negative 3? What would you expect? Telling the exact value is really hard, but what is the sign of that slope? At negative 3, if we go slightly positive in x, the function actually goes down, and so that tells you that the slope is negative: we'd get an f(x+h) slightly below f(x), so if we take the slope, we expect something negative, and indeed it's negative 22. And at some point here, of course, the slope would be 0. For this specific function, I looked it up previously, and that's at x = 2/3. So at roughly 2/3 the derivative is 0, which means that at that precise point, if we nudge in the positive direction, the function doesn't respond; it stays almost the same, and that's why the slope is 0.
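Here's a sketch of that numerical check, reusing f from above:

```python
h = 0.001

x = 3.0
print((f(x + h) - f(x)) / h)   # ~14.003; analytically 6x - 4 = 14

x = -3.0
print((f(x + h) - f(x)) / h)   # ~-22; the function goes down here

x = 2/3
print((f(x + h) - f(x)) / h)   # ~0; nudging x barely changes f at the minimum
```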
Okay, now let's look at a slightly more complex case; we're going to start complexifying a bit. Now we have a function with an output variable d that is a function of three scalar inputs: a, b, and c. So a, b, and c are some specific values, three inputs into our expression graph, and there's a single output, d; if we just print d, we get 4. Now, what I'd like to do is again look at the derivatives of d with respect to a, b, and c, and think through, again, the intuition of what these derivatives are telling us. In order to evaluate the derivatives, we're going to get a bit hacky here: we'll again have a very small value of h, and then we'll fix the inputs at the values we're interested in; this is the point (a, b, c) at which we're going to evaluate the derivative of d with respect to each of a, b, and c. So there are the inputs, and d1 is that expression. Then, to look at the derivative of d with respect to a, for example, we take a, bump it by h, and get d2 from the exact same expression. Then we print d1, print d2, and print the slope. The derivative, or slope, is of course d2 minus d1, divided by h: d2 minus d1 is how much the function increased when we bumped the specific input we're interested in by a tiny amount, and we normalize by h to get the slope.

So if I just run this, we're going to print d1, which we know is 4; d2 is what we get after a is bumped by h. Let's just think through a little bit what will be printed out here: d1 will be 4; will d2 be a number slightly greater than 4, or slightly lower than 4? That's going to tell us the sign of the derivative. We're bumping a by h; b is minus 3, c is 10. So you can just intuitively think through this derivative: a will be slightly more positive, but b is a negative number, so if a is slightly more positive, then because b is negative 3, we're actually going to be adding less to d. So you'd actually expect the value of the function to go down. Let's just see this: yeah, we went from 4 to slightly below 4 (3.9997 with an h of 0.0001), and that tells you that the slope is negative; the exact amount of the slope is negative 3. And you can convince yourself that negative 3 is the right answer analytically, because if you have a times b plus c and you know your calculus, then differentiating a*b + c with respect to a gives you just b, and indeed the value of b is negative 3, which is the derivative we got. So you can tell that that's correct.

Now, if we do this with b, bumping b by a little bit in the positive direction, we get a different slope. What is the influence of b on the output d? If we bump b by a tiny amount in the positive direction, then, because a is positive, we'll be adding more to d. And what is the sensitivity, the slope, of that addition? It might not surprise you that it should be 2, because dd/db, differentiating with respect to b, gives us a, and the value of a is 2. So that's also working well. And then, if c gets bumped by h, then of course a times b is unaffected, and c becomes slightly higher. What does that do to the function? It makes it slightly higher, by the exact same amount that we added to c, and so that tells you that the slope is 1. That is the rate at which d will increase as we nudge c.
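Here's a sketch of that numerical check (the exact h doesn't matter much, as long as it's small):

```python
h = 0.0001

# inputs
a = 2.0
b = -3.0
c = 10.0

d1 = a*b + c   # 4.0
a += h         # bump the input we're interested in
d2 = a*b + c

print('d1', d1)
print('d2', d2)                # slightly below 4, because b is negative
print('slope', (d2 - d1) / h)  # -3.0, which is exactly the value of b
```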
Okay, so we now have some intuitive sense of what this derivative is telling you about the function, and we'd like to move towards neural networks. As I mentioned, neural networks will be pretty massive mathematical expressions, so we need some data structures that maintain these expressions, and that's what we're going to start to build out now. We're going to build out this Value object that I showed you in the README page of micrograd. Let me copy-paste a skeleton of the first, very simple Value object. The Value class takes a single scalar value that it wraps and keeps track of, and that's it. So we can, for example, do Value(2.0), and then we can look at its content, and Python will internally use the __repr__ function to return this string. So this is a Value object with data equal to two that we're creating here. Now, what we'd like is to not just have two isolated values; we'd like to be able to do a + b, we'd like to add them. Currently you would get an error, because Python doesn't know how to add two Value objects, so we have to tell it. So here's addition: you have to use these special double-underscore methods in Python to define these operators for these objects. If we use this plus operator, Python will internally call a.__add__(b); that's what happens internally, so b will be other, and self will be a. And we see that what we return is a new Value object that wraps the plus of their data. But remember, data is the actual Python number, so this plus here is just the typical floating-point addition; it's not an addition of Value objects. And we return a new Value. So now a + b should work, and it should print Value(data=-1.0), because that's two plus minus three. There we go.

Okay, let's now implement multiply, just so we can recreate this expression here. Multiply, I think, won't surprise you: it will be fairly similar, except instead of __add__ we use __mul__, and here, of course, we do times. And so now we can create a c Value object, which will be 10.0. Let's just do a * b first; that's Value(data=-6.0) now. And by the way, I skipped over this a little bit: suppose that I didn't have the __repr__ function here; then printing the object would just give some kind of an ugly expression. What __repr__ is doing is providing us a way to print out a nicer-looking expression in Python, so we don't just see something cryptic; we actually see that it's a Value of negative six. So this gives us a times b, and then we should be able to add c to it, because we've told Python how to do both mul and add; this will basically be equivalent to a.__mul__(b), and then that new Value object's .__add__(c). So let's see if that worked. Yep, that gave us four, which is what we expect from before, and I believe we can call the methods manually as well. There we go.

Okay, so now what we are missing is the connective tissue of this expression. As I mentioned, we want to keep these expression graphs, so we need to keep pointers about what values produce what other values. So here, for example, we are going to introduce a new variable, which we'll call _children, and by default it will be an empty tuple. Then we're actually going to keep a slightly different variable in the class, which we'll call _prev, and which will be the set of children. This is how I did it in the original micrograd, looking at my code here; I can't remember exactly the reason, but I believe it was efficiency. So _children will be a tuple for convenience, but when we actually maintain it in the class, it will be a set, for efficiency. So now, when we are creating a Value like this with the constructor, children will be empty and _prev will be the empty set; but when we are creating a Value through addition or multiplication, we feed in the children of this value, which in this case is (self, other). So those are the children. Now we can do d._prev, and we'll see that the children of d, we now know, are this Value of negative six and the Value of ten: the value resulting from a times b, and the c value, which is ten.

Now there's one last piece of information we're missing: we know the children of every single value, but we don't know what operation created each value. So we need one more element here; let's call it _op. By default this is the empty string, for leaves, and then we just maintain it here: the operation will be a simple string, '+' in the case of addition and '*' in the case of multiplication. So now we don't just have d._prev, we also have d._op, and we know that d was produced by an addition of those two values. And so now we have the full mathematical expression, and we're building out this data structure so we know exactly how each value came to be, by what expression and from what other values.
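Collecting everything so far, this first version of the Value class looks roughly like this:

```python
class Value:

    def __init__(self, data, _children=(), _op=''):
        self.data = data
        self._prev = set(_children)  # children of this node, kept as a set
        self._op = _op               # the op that produced this node, '' for a leaf

    def __repr__(self):
        return f"Value(data={self.data})"

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        return out

a = Value(2.0)
b = Value(-3.0)
c = Value(10.0)
d = a*b + c
print(d)        # Value(data=4.0)
print(d._prev)  # {Value(data=-6.0), Value(data=10.0)}, the children
print(d._op)    # '+', the operation that produced d
```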
Now, because these expressions are about to get quite a bit larger, we'd like a way to nicely visualize them. For that, I'm going to copy-paste a bunch of slightly scary code that's going to visualize these expression graphs for us. Here's the code, and I'll explain it in a bit, but first let me just show you what it does. Basically, it creates a new function, draw_dot, that we can call on some root node, and then it visualizes it. So if we call draw_dot on d, which is this final value here, a times b plus c, it creates something like this: this is d, and you see that a times b creates an intermediate value, and plus c gives us this output node d. So that's draw_dot of d.

I'm not going to go through this in complete detail; you can take a look at graphviz and its API. Graphviz is an open-source graph visualization software, and what we're doing here is building out this graph in the graphviz API. You can basically see that trace is a helper function that enumerates all the nodes and edges in the graph; it just builds a set of all the nodes and edges. Then we iterate through all the nodes and create node objects for them using dot.node, and we also create edges using dot.edge. The only thing that's slightly tricky here is that I add these fake nodes, which are the operation nodes. For example, this node here is just a plus node; I create these special op nodes and connect them accordingly. These op nodes, of course, are not actual nodes in the original graph; they're not Value objects. The only Value objects here are the things in squares; those are actual Value objects, or representations thereof, and the op nodes are just created in this draw_dot routine so that it looks nice.

Let's also add labels to these graphs, just so we know which variables are where. So let's do label equals the empty string by default, and save it in each node; and then here we're going to set label 'a', label 'b', label 'c'. Then let's create a variable e equals a times b, and e.label will be 'e' (it's kind of naughty), and d will be e plus c, and d.label will be 'd'. So nothing really changes; I just added this new e variable, and then here, when we are printing the node, I'm going to print the label too, so this will be a %s, and this will be n.label. And so now we have the label on the left here: it says a, b creating e, and then e plus c creates d, just like we have it here.

Finally, let's make this expression just one layer deeper. d will not be the final output node; instead, after d we are going to create a new Value object called f (we're going to start running out of variables soon). f will be negative 2.0, and its label will of course just be 'f'. Then L, capital L, will be the output of our graph: L will be d times f. So L will be negative eight; that's the output. So now we don't just draw d, we draw L. And somehow the label of L is undefined, oops; the label has to be explicitly given to it. There we go: L is the output.
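For reference, here's roughly what that visualization code looks like. It assumes each node carries the label field we just added (the grad field, which we introduce next, gets added into the node label later in the same way):

```python
from graphviz import Digraph

def trace(root):
    # builds a set of all nodes and edges in the graph
    nodes, edges = set(), set()
    def build(v):
        if v not in nodes:
            nodes.add(v)
            for child in v._prev:
                edges.add((child, v))
                build(child)
    build(root)
    return nodes, edges

def draw_dot(root):
    dot = Digraph(format='svg', graph_attr={'rankdir': 'LR'})  # lay out left to right
    nodes, edges = trace(root)
    for n in nodes:
        uid = str(id(n))
        # for every Value in the graph, create a rectangular ('record') node
        dot.node(name=uid, label="{ %s | data %.4f }" % (n.label, n.data), shape='record')
        if n._op:
            # if this value is the result of an operation, create a fake op node for it
            dot.node(name=uid + n._op, label=n._op)
            dot.edge(uid + n._op, uid)
    for n1, n2 in edges:
        # connect n1 to the op node of n2
        dot.edge(str(id(n1)), str(id(n2)) + n2._op)
    return dot
```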
So let's quickly recap what we've done so far. We are able to build out mathematical expressions using, so far, only plus and times; they are scalar-valued, and we can do this forward pass and build out the mathematical expression. So we have multiple inputs here, a, b, c, and f, going into a mathematical expression that produces a single output, L, and this here is visualizing the forward pass. The output of the forward pass is negative eight; that's the value.

Now, what we'd like to do next is run backpropagation. In backpropagation, we start here at the end, and we go in reverse and calculate the gradient along all these intermediate values. Really, what we're computing for every single value here is the derivative of that node with respect to L. So the derivative of L with respect to L is just one, and then we're going to derive the derivative of L with respect to f, with respect to d, with respect to c, with respect to e, with respect to b, and with respect to a. In a neural network setting, you'd be very interested in the derivative of this loss function L with respect to the weights of the neural network. Here, of course, we just have these variables a, b, c, and f, but some of these will eventually represent the weights of a neural net, and we'll need to know how those weights are impacting the loss function. So we'll be interested in the derivative of the output with respect to some of its leaf nodes, and those leaf nodes will be the weights of the neural net; the other leaf nodes, of course, will be the data itself. Usually we will not want or use the derivative of the loss function with respect to the data, because the data is fixed, but the weights will be iterated on using the gradient information.

So next, we are going to create a variable inside the Value class that maintains the derivative of L with respect to that value, and we will call this variable grad. So there is a .data, and there is a self.grad, and initially it will be zero. Remember that zero basically means no effect: at initialization, we're assuming that every value does not affect the output, because if the gradient is zero, that means changing this variable is not changing the loss function. So by default, we assume the gradient is zero. And now that we have grad, we can visualize it here after data: here grad is %.4f, and this will be n.grad, and now we're showing both the data and the grad, initialized at zero, and we are just about ready to calculate backpropagation. And of course, this grad, again, as I mentioned, represents the derivative of the output, in this case L, with respect to this value; so this is the derivative of L with respect to f, with respect to d, and so on.

So let's now fill in those gradients and actually do backpropagation manually. Let's start filling in these gradients all the way at the end, as I mentioned. First, we are interested in filling in this gradient here: what is the derivative of L with respect to L? In other words, if I change L by a tiny amount h, how much does L change? It changes by h; it's proportional, and therefore the derivative is one. We can of course measure or estimate these gradients numerically, just like we've seen before. So if I take this expression and create a function, def lol(), and put the expression inside it: the reason I'm creating this gating function lol here is that I don't want to pollute or mess up the global scope. This is just kind of like a little staging area, and as you know, in Python all of these will be local variables to this function, so I'm not changing any of the global scope.
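Here's a sketch of that gating function, using the Value class we've built so far:

```python
def lol():
    # a little staging area: everything in here is local,
    # so the global scope stays untouched
    h = 0.001

    a = Value(2.0)
    b = Value(-3.0)
    c = Value(10.0)
    e = a*b
    d = e + c
    f = Value(-2.0)
    L = d*f
    L1 = L.data

    a = Value(2.0 + h)  # bump the input we're interested in
    b = Value(-3.0)
    c = Value(10.0)
    e = a*b
    d = e + c
    f = Value(-2.0)
    L = d*f
    L2 = L.data

    print((L2 - L1) / h)  # numerical estimate of dL/da; about 6.0

lol()
```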
So here L1 will be L, and then, copy-pasting this expression, we add a small amount h to, for example, a; this would be measuring the derivative of L with respect to a. So here this will be L2, and then we print the test of the derivative: L2 minus L1, which is how much L changed, normalized by h. So this is the rise over run, and we have to be careful, because L is a Value node, so we actually want its .data, so that these are floats being divided by h. This should print the derivative of L with respect to a, because a is the one we bumped a little bit by h. So what is the derivative of L with respect to a? It's six. And obviously, if we instead bump L itself by h, this looks really awkward, but you'd see that the derivative is one; that's kind of like the base case of what we are doing here. So basically, we can come up here and manually set L.grad to one: this is our manual backpropagation. L.grad is one; let's redraw, and we'll see that we've filled in grad = 1 for L.

We're now going to continue the backpropagation: let's look at the derivatives of L with respect to d and f. Let's do d first. What we are interested in, if I create a markdown cell here, is: we have that L is d times f, and we'd like to know what dL/dd is. If you know your calculus, L is d times f, so dL/dd is just f. And if you don't believe me, we can also derive it, because the proof is fairly straightforward: we go to the definition of the derivative, f(x+h) minus f(x), over h, in the limit as h goes to zero. When we have L = d times f, then increasing d by h gives us the output (d + h) times f; that's basically our f(x+h). We subtract d times f, and divide by h. Symbolically expanding, we have d times f, plus h times f, minus d times f, all over h; and you see how the df minus df cancels, so you're left with h times f over h, which is f. So in the limit as h goes to zero of the derivative definition, we just get f in the case of d times f. And symmetrically, dL/df will just be d.
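Written out, that little derivation is:

```latex
\frac{dL}{dd}
  = \lim_{h \to 0} \frac{(d+h)f - df}{h}
  = \lim_{h \to 0} \frac{df + hf - df}{h}
  = \lim_{h \to 0} \frac{hf}{h}
  = f
```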
So what we have is that f.grad, we see now, is just the value of d, which is 4, and d.grad is just the value of f, which is negative 2. So we'll set those manually. Let me erase this markdown cell, and then let's redraw what we have. And let's just make sure these are correct: we seem to think that dL/df is 4, so let's double-check. Let me erase this +h from before; now we want the derivative with respect to f, so let's just come to where I create f, and let's do a +h there, and this should print the derivative of L with respect to f. We expect to see 4; yeah, and it's 4, up to floating point funkiness. And dL/dd should be f, which is negative 2; grad is negative 2. So if we again come here and change d, d.data += h, right here, we've added a little h, and we see how L changed, and we expect to print negative 2. There we go. So we've numerically verified it; what we're doing here is kind of like an inline gradient check. Gradient checking is when we derive backpropagation, getting the derivative with respect to all the intermediate results, and then use the numerical gradient, estimated with a small step size, to make sure they agree.

Now we're getting to the crux of backpropagation. This will be the most important node to understand, because if you understand the gradient for this node, you understand all of backpropagation and all of training of neural nets, basically. We need to derive dL/dc, the derivative of L with respect to c, because we've computed all these other gradients already. We're coming here and continuing the backpropagation manually: we want dL/dc, and then we'll also derive dL/de. Now here's the problem: how do we derive dL/dc? We know the derivative of L with respect to d, so we know how L is sensitive to d. But how is L sensitive to c? If we wiggle c, how does that impact L, through d? We also know how c impacts d, so, just very intuitively: if you know the impact that c is having on d, and the impact that d is having on L, then you should be able to somehow put that information together to figure out how c impacts L. And indeed, this is what we can do.

In particular, concentrating on d first: what is the derivative of d with respect to c? In other words, what is dd/dc? Here we know that d is c plus e; that's what we know, and now we're interested in dd/dc. If you just know your calculus again, then differentiating c + e with respect to c gives you 1.0. But we can also go back to basics and derive this, because again, we can go to our f(x+h) minus f(x), over h, as h goes to zero; that's the definition of the derivative. Focusing on c and its effect on d, the f(x+h) will be c incremented by h, plus e; that's the first evaluation of our function, minus (c plus e), then divide by h. Expanding this out, this will be c plus h plus e, minus c, minus e, over h; and you see how c minus c cancels and e minus e cancels, and we're left with h over h, which is 1.0. And so, by symmetry, dd/de will be 1.0 as well. So basically, the derivative of a sum expression is very simple, and this is the local derivative. I call it the local derivative because we have the final output value all the way at the end of this graph, and we're now at a small node here, a little plus node. The little plus node doesn't know anything about the rest of the graph that it's embedded in; all it knows is that it did a plus: it took a c and an e, added them, and created d. This plus node knows the local influence of c on d, or rather the derivative of d with respect to c, and it also knows the derivative of d with respect to e. But that's not what we want; that's just the local derivative. What we actually want is dL/dc, and L is here just one step away, but in the general case this little plus node could be embedded in a massive graph.

So again, we know how L impacts d, and now we know how c and e impact d. How do we put that information together to get dL/dc? The answer, of course, is the chain rule in calculus. So I pulled up the chain rule here from Wikipedia, and I'm going to go through it very briefly. The chain rule on Wikipedia can sometimes be very confusing; calculus can be very confusing. Like, this is the way I learned the chain rule, and it was very confusing: what is happening? It's just complicated. So I like this expression much better: if a variable z depends on a variable y, which itself depends on a variable x, then z depends on x as well, obviously, through the intermediate variable y.
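In that notation, the chain rule reads:

```latex
\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}
```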
And in this case, the chain rule is expressed as: if you want dz/dx, you take dz/dy and multiply it by dy/dx. So the chain rule is fundamentally telling us how to chain these derivatives together correctly: to differentiate through a function composition, we multiply those derivatives. That's really what the chain rule is telling us. And there's a nice little intuitive explanation here, which I also think is kind of cute: the chain rule states that knowing the instantaneous rate of change of z relative to y, and of y relative to x, allows one to calculate the instantaneous rate of change of z relative to x as the product of those two rates of change. Simply the product of those two. And here's a good one: if a car travels twice as fast as a bicycle, and the bicycle is four times as fast as a walking man, then the car travels two times four, eight times as fast as the man. So this makes it very clear that the correct thing to do is to multiply: the car is twice as fast as the bicycle, and the bicycle is four times as fast as the man, so the car will be eight times as fast as the man. We can take these intermediate rates of change, if you will, and multiply them together, and that justifies the chain rule intuitively.

So have a look at the chain rule, but here, really, what it means for us is that there's a very simple recipe for deriving what we want, which is dL/dc. What we have so far is: we know dL/dd, the derivative of L with respect to d; we know that that's negative two. And because of the local reasoning that we've done here, we know dd/dc, how c impacts d; in particular, this is a plus node, so the local derivative is simply 1.0. It's very simple. So the chain rule tells us that dL/dc, going through this intermediate variable d, will simply be dL/dd times dd/dc. That's the chain rule. This is identical to what's happening up there, except z is our L, y is our d, and x is our c. So we literally just have to multiply these, and because the local derivatives, like dd/dc, are just one, we basically just copy over dL/dd, because this is just times one. So, because dL/dd is negative two, what is dL/dc? Well, it's the local gradient, 1.0, times dL/dd, which is negative two.

So literally, what a plus node does, you can look at it that way, is it just routes the gradient: because the plus node's local derivatives are just one, in the chain rule, one times dL/dd is just dL/dd, and so that derivative gets routed to both c and to e in this case. So basically, we have that c.grad, since that's the one we looked at first, is negative two times one, negative two; and in the same way, by symmetry, e.grad will be negative two. That's the claim. So we can set those, we can redraw, and you see how we just assigned negative two and negative two. This backpropagating signal, which is carrying the information of what the derivative of L is with respect to all the intermediate nodes, we can imagine almost like flowing backwards through the graph, and a plus node simply distributes the derivative to all of its children nodes. So this is the claim; now let's verify it. Let me remove the +h here from before; instead, what we want to do is increment c, so c.data will be incremented by h, and when I run this, we expect to see negative two. Negative two. And then, of course, for e: e.data += h, and we expect to see negative two. Simple.
So those are the derivatives of these internal nodes, and now we're going to recurse our way backwards again, applying the chain rule. So here we go: our second application of the chain rule, and we will apply it all the way through the graph; we just happen to only have one more node remaining. We have that the derivative of L with respect to e, as we have just calculated, is negative two; so we know that. And now we want dL/da, right? And the chain rule is telling us that that's just dL/de, negative two, times de/da. So we have to look at that: I'm a little times node inside a massive graph, and I only know that I did a times b and produced an e. So now, what is de/da, and de/db? That's the only thing I know about; that's my local gradient. Because we have that e is a times b, we're asking what de/da is, and of course we just derived that kind of thing here, with a times, so I'm not going to re-derive it; but if you differentiate this with respect to a, you'll just get b, the value of b, which in this case is negative 3.0. So basically, we have that a.grad, and we are applying the chain rule here, is dL/de, which we see here is negative two, times de/da, which is the value of b, negative three. That's it. And then b.grad is again dL/de, which is negative two, just the same way, times de/db, which is the value of a, 2.0. So these are our claimed derivatives. Let's redraw, and we see that a.grad turns out to be six, because that is negative two times negative three, and b.grad is negative two times two, which is negative four. So those are our claims; let's delete this and verify them. We have a.data += h here; the claim is that a.grad is six; let's verify: six. And we have b.data += h, nudging b by h and looking at what happens; we claim it's negative four, and indeed it's negative four, plus or minus, again, float oddness.

And that's it: that was the manual backpropagation all the way from here to all the leaf nodes, and we've done it piece by piece. Really, all we've done, as you saw, is iterate through all the nodes one by one and locally apply the chain rule. We always know the derivative of L with respect to this little output, and then we look at how this output was produced: this output was produced through some operation, and we have the pointers to the children nodes. So in this little operation, we know what the local derivatives are, and we just multiply them onto the derivative, always. So we just go through and recursively multiply on the local derivatives, and that's what backpropagation is: it's just a recursive application of the chain rule backwards through the computation graph.

Let's see this power in action, just very briefly. What we're going to do is nudge our inputs to try to make L go up. In particular, we're going to take the .data and change it, and if we want L to go up, that means we just have to go in the direction of the gradient: a should increase in the direction of the gradient by some small step amount; this is the step size. And we don't just want this for a, but also for b, also for c, also for f; those are the leaf nodes, which we usually have control over. And if we nudge in the direction of the gradient, we expect a positive influence on L, so we expect L to go up; it should become less negative, going up to, say, negative 6 or something like that, it's hard to tell exactly.
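In code, that single nudge step looks roughly like this (with a step size of, say, 0.01):

```python
# nudge every leaf node we control in the direction of its gradient
step = 0.01
a.data += step * a.grad
b.data += step * b.grad
c.data += step * c.grad
f.data += step * f.grad

# rerun the forward pass with the nudged leaves
e = a * b
d = e + c
L = d * f
print(L.data)  # less negative than -8.0 (about -7 with this step size)
```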
And we have to rerun the forward pass, so let me just do that here: f stays as it is, and this is effectively the forward pass. Now, if we print L.data, we expect, because we nudged all the inputs in the direction of the gradient, a less negative L; we expect it to go up, so maybe it's negative 6 or so. Let's see what happens. Okay, negative 7. And this is basically one step of an optimization that we'll end up running, and really, this gradient just gives us power, because we know how to influence the final outcome. This will be extremely useful for training neural nets, as we'll soon see.

So now I would like to do one more example of manual backpropagation, using a bit more complex and useful example: we are going to backpropagate through a neuron. We want to eventually build out neural networks, and in the simplest case these are multilayer perceptrons, as they're called. So this is a two-layer neural net, and it's got these hidden layers made up of neurons, and these neurons are fully connected to each other. Now, biologically, neurons are very complicated devices, but we have very simple mathematical models of them. In this very simple model of a neuron, you have some inputs, the x's, and then you have these synapses that have weights on them, the w's. The synapse interacts with the input to this neuron multiplicatively, so what flows to the cell body of this neuron is w times x; but there are multiple inputs, so there are many w times x's flowing into the cell body. The cell body also has some bias; this is kind of like the innate trigger-happiness of this neuron, so the bias can make it a bit more trigger-happy or a bit less trigger-happy, regardless of the input. Basically, we take all the w times x of all the inputs, add the bias, and then we take it through an activation function, and this activation function is usually some kind of a squashing function, like a sigmoid or tanh or something like that. As an example, we're going to use tanh. Numpy has an np.tanh, so we can call it on a range and plot it; this is the tanh function, and you see that the inputs, as they come in, get squashed on the y coordinate here. Right at 0, we get exactly 0, and then, as you go more positive in the input, the activation function will only go up to 1 and then plateau out; so if you pass in very positive inputs, we cap them smoothly at 1, and on the negative side we cap them smoothly at negative 1. So that's tanh, our squashing or activation function, and what comes out of this neuron is just the activation function applied to the dot product of the weights and the inputs.

So let's write one out. I'm going to copy-paste, because I don't want to type too much. Here we have the inputs x1 and x2; this is a two-dimensional neuron, so two inputs are going to come in. Then we have the weights of this neuron, w1 and w2; these weights, again, are the synaptic strengths for each input. And this is the bias of the neuron, b. Now, what we want to do, according to this model, is multiply x1 times w1 and x2 times w2, and then add the bias on top of it. It gets a little messy here, but all we are trying to do is x1*w1 + x2*w2 + b, except I'm doing it in small steps so that we actually have pointers to all these intermediate nodes.
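Here's that cell, roughly (the bias value shown is the one the lecture settles on a bit later, chosen so that the backprop numbers come out nice; it assumes the label field we added to Value earlier):

```python
# inputs x1, x2
x1 = Value(2.0); x1.label = 'x1'
x2 = Value(0.0); x2.label = 'x2'
# weights w1, w2: the synaptic strengths for each input
w1 = Value(-3.0); w1.label = 'w1'
w2 = Value(1.0); w2.label = 'w2'
# bias of the neuron
b = Value(6.8813735870195432); b.label = 'b'
# x1*w1 + x2*w2 + b, in small steps so we keep pointers to all the intermediates
x1w1 = x1*w1; x1w1.label = 'x1*w1'
x2w2 = x2*w2; x2w2.label = 'x2*w2'
x1w1x2w2 = x1w1 + x2w2; x1w1x2w2.label = 'x1*w1 + x2*w2'
n = x1w1x2w2 + b; n.label = 'n'
```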
So we have the x1w1 variable and the x2w2 variable, and I'm also labeling them. n is now the raw cell body activation, without the activation function, for now. And this should be enough to plot it: draw_dot of n gives us x1 times w1 and x2 times w2 being added, then the bias gets added on top of this, and this n is that sum. We are now going to take it through an activation function, and let's say we use tanh, so that we produce the output. So what we'd like to do here is the output, and I'll call it o, is n.tanh().

Okay, but we haven't yet written tanh. The reason we need to implement another function here is that tanh is a hyperbolic function, and so far we've only implemented a plus and a times, and you can't make a tanh out of just pluses and timeses; you also need exponentiation. tanh is this kind of formula here; you can use either one of these expressions, and you see that there's exponentiation involved, which we have not implemented yet for our little Value node. So we're not going to be able to produce tanh yet, and we have to go back up and implement something like it. Now, one option here is that we could actually implement exponentiation, and return the exp of a Value instead of the tanh of a Value, because if we had exp, then we'd have everything else that we need: we know how to add and we know how to multiply, so we'd be able to create tanh if we knew how to exp. But for the purposes of this example, I specifically wanted to show you that we don't necessarily need only the most atomic pieces in this Value object; we can actually create functions at arbitrary points of abstraction. They can be complicated functions, but they can also be very, very simple functions, like a plus, and it's totally up to us. The only thing that matters is that we know how to differentiate through any one function: we take some inputs and we make an output, and the function can be arbitrarily complex; as long as you know the local derivative, of how the inputs impact the output, that's all you need. So we're going to cluster up all of this expression, and we're not going to break it down to its atomic pieces; we're just going to directly implement tanh.

So let's do that: def tanh, and then out will be a Value of... we need this expression here, so let me actually copy-paste. We grab x, which is self.data, and then this, I believe, is the tanh: math.exp of 2x, minus 1, over math.exp of 2x, plus 1. Maybe I can call this x, just so that it matches exactly. So this will be t, and the children of this node: there's just one child, and I'm wrapping it in a tuple, so this is a tuple of one object, just self. And the name of this operation will be 'tanh', and we're going to return that.

Okay, so now Value should be implementing tanh, and we can scroll all the way down here and actually do n.tanh(), and that's going to return the tanh output of n. And now we should be able to do draw_dot of o, not of n. Let's see how that worked; there we go: n went through tanh to produce this output. So now tanh is one of our little micrograd-supported operations here, and as long as we know the derivative of tanh, we'll be able to backpropagate through it.
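Inside the Value class, that new method looks roughly like this (the rest of the class is unchanged, and it needs math imported at the top):

```python
import math

class Value:
    # ... __init__, __repr__, __add__, __mul__ as before ...

    def tanh(self):
        x = self.data
        t = (math.exp(2*x) - 1) / (math.exp(2*x) + 1)
        out = Value(t, (self, ), 'tanh')  # one child, wrapped in a tuple
        return out
```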
Now let's see this tanh in action. Currently, it's not squashing too much, because the input to it is pretty low. If the bias were increased to, say, 8, then we'd see that what flows into the tanh is now 2, and tanh squashes it to 0.96; so we're already hitting the tail of this tanh, and it will smoothly go up to 1 and then plateau out over there.

Okay, so I'm going to do something slightly strange: I'm going to change this bias from 8 to this number, 6.88 et cetera. And I'm going to do this for specific reasons: we're about to start backpropagation, and I want to make sure that our numbers come out nice; not crazy numbers, but nice numbers that we can understand in our head. Let me also add the label; o is short for output here. So 0.88 flows into tanh and comes out as 0.7.

So now we're going to do backpropagation, and we're going to fill in all the gradients: what is the derivative of o with respect to all the inputs here? Of course, in a typical neural network setting, what we really care about the most is the derivative of this output with respect to the weights, specifically w1 and w2, because those are the weights that we're going to be changing as part of the optimization. The other thing we have to remember is that here we only have a single neuron, but in a neural net you typically have many neurons, and they're connected; this is only one small neuron, a piece of a much bigger puzzle, and eventually there's a loss function that measures the accuracy of the neural net, and we're backpropagating with respect to that accuracy and trying to increase it.

So let's start backpropagation here at the end. What is the derivative of o with respect to o? The base case, as we always know, is that the gradient is just 1.0. So let me fill it in, then let me split out the drawing function into its own cell here and clear this output. Okay, so now, when we draw o, we see that o's grad is 1.

So now we're going to backpropagate through the tanh. To backpropagate through tanh, we need to know the local derivative of tanh: if we have that o is tanh of n, then what is do/dn? Now, what you could do is take this expression and do your calculus derivative-taking, and that would work; but we can also just scroll down on Wikipedia here, to a section that hopefully tells us that the derivative, d/dx of tanh(x), is, any of these, I like this one: 1 minus tanh(x) squared. So what this is saying is that do/dn is 1 minus tanh(n) squared, and we already have tanh(n): it's just o. So it's 1 minus o squared. o is the output here; the output is this number, o.data; and what this is saying is that do/dn is 1 minus o.data squared, which is 0.5, conveniently. So the local derivative of this tanh operation here is 0.5, and that is do/dn, so we can fill in that n.grad is 0.5. We'll just fill it in; this is exactly 0.5, one half.

So now we're going to continue the backprop. This is 0.5, and this is a plus node, so what is backprop going to do here? If you remember our previous example, a plus is just a distributor of gradient: this gradient will simply flow to both of its inputs equally, and that's because the local derivative of this operation is 1 for every one of its inputs. So 1 times 0.5 is 0.5, and therefore we know that this node here, which we called x1w1 + x2w2, has grad 0.5, and we know that b.grad is also 0.5.
So let's set those and let's draw. So those are 0.5. Continuing, we have another plus; 0.5, again, will just be distributed, so 0.5 will flow to both of these: we can set x1w1.grad and x2w2.grad to 0.5, and let's redraw. Pluses are my favorite operations to backpropagate through, because they're very simple. So now what's flowing into these times expressions is 0.5. And really, again, keep in mind what the derivative is telling us at every point along the way: it's saying that if we want the output of this neuron to increase, then the influence of these sums on the output is positive, for both of them.

So now, backpropagating to x2 and w2 first. This is a times node, so we know that the local derivative is the other term. If we want to calculate x2.grad, can you think through what it's going to be? x2.grad will be w2.data times x2w2.grad, and w2.grad will be x2.data times x2w2.grad. That's the little local piece of the chain rule. Let's set them and let's redraw. So here we see that the gradient on our weight w2 is 0, because x2's data was 0, right? But x2 will have gradient 0.5, because the data here, w2, was 1. And what's interesting here, right, is that because the input x2 was 0, then, because of the way the times works, this gradient on w2 is of course 0. Think about intuitively why that is: the derivative always tells us the influence of a value on the final output. If I wiggle w2, how is the output changing? It's not changing, because we're multiplying by zero; and because it's not changing, there is no derivative, and 0 is the correct answer, because we're squashing that contribution to zero.

And let's do it here too: 0.5 should come here and flow through this times, and so we'll have that x1.grad is, can you think through a little bit what this should be? The local derivative of times with respect to x1 is going to be w1, so w1's data times x1w1.grad; and w1.grad will be x1.data times x1w1.grad. Let's see what those came out to be: the incoming gradient is 0.5, so x1.grad would be negative 1.5, and w1.grad would be 1. And we've backpropagated through this expression; these are the actual final derivatives. So if we now want this neuron's output to increase, we know what's necessary: w2 has no gradient, w2 doesn't actually matter to this neuron right now, but this weight w1 should go up. If this weight goes up, then this neuron's output would have gone up, and proportionally, because the gradient is 1.
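Collected in one place, the whole manual backward pass through the neuron is just these assignments:

```python
o.grad = 1.0                    # base case: do/do = 1
n.grad = 1 - o.data**2          # do/dn = 1 - tanh(n)**2, which is 0.5 here
# the plus nodes just route the gradient through
x1w1x2w2.grad = n.grad          # 0.5
b.grad = n.grad                 # 0.5
x1w1.grad = x1w1x2w2.grad       # 0.5
x2w2.grad = x1w1x2w2.grad       # 0.5
# the times nodes: the local derivative is the other factor
x1.grad = w1.data * x1w1.grad   # -3.0 * 0.5 = -1.5
w1.grad = x1.data * x1w1.grad   #  2.0 * 0.5 =  1.0
x2.grad = w2.data * x2w2.grad   #  1.0 * 0.5 =  0.5
w2.grad = x2.data * x2w2.grad   #  0.0 * 0.5 =  0.0
```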
At each little node that took inputs and produced an output, we're going to store how we chain the output's gradient into the inputs' gradients. By default, this will be a function that doesn't do anything; you can also see that in the Value class in micrograd: we have this _backward function, and by default it doesn't do anything, it's an empty function. And that would be the case, for example, for a leaf node; for a leaf node, there's nothing to do. But now, when we're creating these out values, these out values are an addition of self and other. And so we'll want to set out's _backward to be the function that propagates the gradient. So let's define what should happen, and we're going to store it in a closure. For addition, our job is to take out's grad and propagate it into self's grad and other's grad. So basically we want to set self.grad to something, and we want to set other.grad to something. And the way we saw below how the chain rule works, we want to take the local derivative times the, sort of, global derivative, as I should call it, which is the derivative of the final output of the expression with respect to out. The local derivative of self in an addition is 1.0, so it's just 1.0 times out's grad; that's the chain rule. And other.grad will be 1.0 times out.grad. What you're basically seeing here is that out's grad will simply be copied onto self's grad and other's grad, as we saw happens for an addition operation. So we're going to later call this function to propagate the gradient, having done an addition. Let's now do multiplication. We're going to also define a _backward, set out's _backward to be it, and we want to chain out.grad into self.grad and other.grad; this will be the little piece of chain rule for multiplication. So what should this be? Can you think it through? For a times node, the local derivative with respect to self is other's data, and vice versa. So self.grad will be other.data times out.grad, and other.grad will be self.data times out.grad. And finally, let's do the same for tanh. We set its _backward to be this backward function, and here we need to backpropagate: we have out.grad, and we want to chain it into self.grad. self.grad will be the local derivative of this operation that we've done here, which is tanh. And we saw that the local gradient is 1 minus the tanh of x squared, which here is t; that's the local derivative, because t is the output of this tanh. So 1 minus t squared is the local derivative, and then the gradient has to be multiplied in, because of the chain rule: out.grad is chained through the local gradient into self.grad. And that should be basically it. So we're going to redefine our Value node, we're going to swing all the way down here, redefine our expression, and make sure that all the grads are zero. Okay, but now we won't have to fill those gradients in manually anymore.
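Pulling that together, here's a condensed sketch of the Value class at this point in the lecture, with a _backward closure stored on each output node. Note the plain = assignments in these closures; we'll have to revisit those shortly:

```python
import math

class Value:
    def __init__(self, data, _children=(), _op='', label=''):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None   # a leaf node has nothing to do
        self._prev = set(_children)
        self._op = _op
        self.label = label

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other), '+')
        def _backward():
            # local derivative of addition is 1 for both inputs
            self.grad = 1.0 * out.grad
            other.grad = 1.0 * out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other), '*')
        def _backward():
            # local derivative of a*b w.r.t. a is b, and vice versa
            self.grad = other.data * out.grad
            other.grad = self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), 'tanh')
        def _backward():
            self.grad = (1 - t**2) * out.grad   # d/dx tanh(x) = 1 - tanh(x)^2
        out._backward = _backward
        return out
```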
Instead, we're basically going to be calling _backward in the right order. So first we want to call o's _backward. o was the outcome of tanh, right? So calling o's _backward will run this function; this is what it will do. Now we have to be careful, because there's a times out.grad in there, and out.grad, remember, is initialized to 0; so here we see grad 0. So as a base case, we need to set o's grad to 1.0, to initialize it with 1. And then, once this is 1, we can call o._backward, and what that should do is propagate this grad through the tanh: the local derivative times the global derivative, which is initialized at 1. So this should... well, I thought about redoing it, but I figured I should just leave the error in here, because it's pretty funny. Why is a 'NoneType' object not callable? It's because I screwed up: we're trying to save these functions, and here we don't want to call the function, because these functions return None. We just want to store the function. So let me redefine the Value object, and then we come back in, redefine the expression, draw the dot, and everything is great: o's grad is 1, and now this should work, of course. Okay, so after o._backward, this grad should now be 0.5 if we redraw, and if everything went correctly... 0.5. Yay. Okay, so now we need to call n's _backward. And that seems to have worked: n's _backward routed the gradient to both of these. So this is looking great. Now we could, of course, call b's _backward. What's going to happen? Well, b's _backward, because b is a leaf node, is by initialization the empty function, so nothing would happen; but we can still call it. Next, let's call _backward on this plus node here; then we expect this 0.5 to get further routed, right? So there we go, 0.5, 0.5. And then finally, we want to call it here on x2w2, and on x1w1. Let's do both of those, and there we go: we get the same gradients as we did before, but now we've done it by calling _backward, sort of, manually. So we have one last piece to get rid of, which is us calling _backward manually. Let's think through what we are actually doing: we've laid out a mathematical expression, and now we're trying to go backwards through that expression. Going backwards through the expression just means that we never want to call ._backward on any node before we've done, sort of, everything after it: everything that a node depends on has to have propagated to it before we can continue backpropagation from it. This ordering of graphs can be achieved using something called topological sort. Topological sort is basically a laying out of a graph such that all the edges go only from left to right. So here we have a graph, a directed acyclic graph, a DAG, and these are two different topological orders of it, I believe, where you'll see that it's a laying out of the nodes such that all the edges go only one way, from left to right. For implementing topological sort you can look at Wikipedia and so on; I'm not going to go through it in detail, but basically this is what builds a topological graph.
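As a sketch, the recursive builder looks like this (o here is the output Value of our expression):

```python
topo = []
visited = set()

def build_topo(v):
    if v not in visited:
        visited.add(v)
        for child in v._prev:
            build_topo(child)
        topo.append(v)   # a node adds itself only after all of its children

build_topo(o)
```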
We maintain a set of visited nodes, and then we go through, starting at some root node, which for us is o; that's where I want to start the topological sort. Starting at o, we go through all of its children and lay them out from left to right. Basically, this starts at o: if it's not visited, it marks it as visited, then it iterates through all of its children and calls build_topo on them; and after it's gone through all the children, it adds itself. So the node we call it on, like, say, o, is only going to add itself to the topo list after all of its children have been processed. And that's how this function guarantees that you're only going to be in the list once all of your children are in the list; that's the invariant being maintained. So if we build_topo on o and then inspect this list, we're going to see that it ordered our value objects, and the last one is the value 0.707, which is the output. So this is o, and then this is n, and all the other nodes get laid out before it. So that builds the topological graph, and really what we're doing now is just calling ._backward on all of the nodes in a topological order. So, if we just reset the gradients so they're all 0: what did we do? We started by setting o.grad to be 1; that's the base case. Then we built a topological order. And then we went: for node in reversed(topo) — in the reverse order, because this list goes from left to right and we need to start at o — node._backward(). And this should be it. There we go; those are the correct derivatives. Finally, we are going to hide this functionality inside the Value class, because we don't want to have all that code lying around. So, alongside _backward, we're now going to define an actual backward, without the underscore, and it's going to do all the stuff that we just derived. Let me just clean this up a little bit. We first build a topological graph starting at self: build_topo of self populates the topological order into the topo list, which is a local variable. Then we set self.grad to be 1. And then, for each node in the reversed list, so starting at self and going back through all of its dependencies, we call _backward. And that should be it. So save, come down here, redefine. Okay, all the grads are zero. And now what we can do is o.backward, without the underscore, and there we go: that's backpropagation, at least for one neuron. Now, we shouldn't be too happy with ourselves, actually, because we have a bad bug, and we have not surfaced the bug because of some specific conditions that we have to think about right now. So here's the simplest case that shows the bug: say I create a single node a, and then I create a b that is a plus a, and then I call backward. What's going to happen is: a is three, and b is a plus a, so there are two arrows on top of each other here. Then we can see that the forward pass works, of course: b is just a plus a, which is six. But the gradient that we calculate automatically here is not actually correct. Just doing calculus in your head, the derivative of b with respect to a should be two, one plus one; it's not one. So what's happening here, intuitively?
So b is the result of a plus a, and then we call backward on it. Let's go up and see what that does. b is the result of an addition, so out is b. And when we call backward, what happened is: self.grad was set to one, and then other.grad was set to one. But because we're doing a plus a, self and other are actually the exact same object. So we are overwriting the gradient: we set it to one, and then we set it again to one, and that's why it stays at one. So that's a problem. There's another way to see this, in a little bit more complicated expression. Here we have a and b; d will be the multiplication of the two, and e will be the addition of the two; then we multiply e times d to get f, and then we call f.backward. And these gradients, if you check, will be incorrect. Fundamentally, what's happening is that we're going to see an issue any time we use a variable more than once. Until now, in these expressions above, every variable was used exactly once, so we didn't see the issue. But here, if a variable is used more than once, what happens during the backward pass? We're backpropagating from f to e to d; so far, so good. But now e calls its _backward and deposits its gradients into a and b; and then we come to d and call its _backward, and it overwrites those gradients at a and b. So that's obviously a problem. And the solution here, if you look at the multivariate case of the chain rule and its generalization, is basically that we have to accumulate these gradients: these gradients add. So instead of setting the gradients, we simply do plus-equals. We need to accumulate those gradients: plus equals, plus equals, plus equals. And this will be okay, remember, because we are initializing them at zero: they start at zero, and then any contribution that flows backwards will simply add. So now, if we redefine this one, because of the plus-equals this now works: a.grad started at zero, we called b.backward, we deposit one, and then we deposit one again, and now this is two, which is correct. And here, this will also work, and we'll get correct gradients: when we call e's _backward, we deposit the gradients from this branch, and then when we get to d's _backward, it deposits its own gradients, and those gradients simply add on top of each other. So we just accumulate those gradients, and that fixes the issue. Okay, now before we move on, let me actually do a bit of cleanup here and delete some of this intermediate work, now that we've derived all of it. We are going to keep this, because I want to come back to it. Delete the tanh walkthrough, delete our motivating example, delete the step, delete this, keep the code that draws, and then delete this example, and leave behind only the definition of Value.
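Here's a minimal runnable sketch of the two pieces we just derived: accumulating gradients with += in each _backward, and the backward method that hides the topological sort inside the Value class (only addition is shown, to keep it short):

```python
class Value:
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += 1.0 * out.grad   # accumulate, don't overwrite
            other.grad += 1.0 * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # build the topological order of all the children in the graph
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build_topo(child)
                topo.append(v)
        build_topo(self)

        # apply the chain rule one node at a time, output first
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

a = Value(3.0)
b = a + a
b.backward()
print(a.grad)   # 2.0 — each branch deposits 1.0, and they accumulate
```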
Now, let's come back to this non-linearity that we implemented, the tanh. I told you that we could have broken tanh down into its explicit atoms, in terms of other expressions, if we had the exp function. If you remember, tanh is defined like this, and we chose to treat tanh as a single function; we can do that because we know its derivative and can backpropagate through it. But we can also break tanh down into an expression built from exp, and I would like to do that now, because I want to prove to you that you get all the same results and all the same gradients, but also because it forces us to implement a few more expressions: exponentiation, addition, subtraction, division, and things like that. I think it's a good exercise to go through a few more of these. Okay, so let's scroll up to the definition of Value. One thing that we currently can't do is this: we can create a Value of, say, 2.0, but we can't, for example, add a constant 1 to it. And we can't do it because it says: int object has no attribute 'data'. That's because a plus 1 comes right here into __add__, and then other is the integer 1, and Python is trying to access 1.data, and that's not a thing. Basically, 1 is not a Value object, and we only have addition for Value objects. So, as a matter of convenience, so that we can create expressions like this and make them make sense, we can simply do something like this: we leave other alone if other is an instance of Value; but if it's not an instance of Value, we assume that it's a number, like an integer or a float, and simply wrap it in a Value. Then other becomes Value of other, other has a data attribute, and this should work. So if I just do this and redefine Value, then this works. There we go. Okay, now let's do the exact same thing for multiply, because we can't do something like this either, for the exact same reason. So we just go to __mul__, and if other is not a Value, we wrap it in a Value. Redefine Value, and now this works too. Now, here's a kind of unfortunate and not obvious part: a times two works, we saw that, but will two times a work? You'd expect it to, right? But actually, it will not, and the reason is that Python doesn't know how. When you do a times two, Python will basically call a.__mul__(2); that's what it does under the hood. But two times a is the same as (2).__mul__(a), and the int 2 can't multiply a Value, so Python is really confused about that. Instead, the way this works in Python is that you are free to define something called __rmul__, and __rmul__ is kind of like a fallback: if Python can't do two times a, it will check if, by any chance, a knows how to multiply by two, and that will be called into __rmul__. So, because Python can't do two times a, it checks: is there an __rmul__ in Value? And because there is, it will call that. And what we do there is swap the order of the operands: two times a redirects to __rmul__, and __rmul__ basically calls a times two. And that's how that works. So, redefining that with __rmul__, two times a becomes four.
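As a sketch, the wrapping and the reflected-multiply fallback look roughly like this inside the Value class:

```python
def __add__(self, other):
    other = other if isinstance(other, Value) else Value(other)  # wrap plain numbers
    out = Value(self.data + other.data, (self, other), '+')
    def _backward():
        self.grad += 1.0 * out.grad
        other.grad += 1.0 * out.grad
    out._backward = _backward
    return out

def __mul__(self, other):
    other = other if isinstance(other, Value) else Value(other)
    out = Value(self.data * other.data, (self, other), '*')
    def _backward():
        self.grad += other.data * out.grad
        other.grad += self.data * out.grad
    out._backward = _backward
    return out

def __rmul__(self, other):
    # fallback: for 2 * a, Python calls a.__rmul__(2), so we just swap the operands
    return self * other
```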
Okay, now looking at the other elements that we still need: we need to know how to exponentiate and how to divide. Let's first do the exponentiation part. We're going to introduce a single function, exp, and exp is going to mirror tanh, in the sense that it's a single function that transforms a single scalar value into a single scalar value. So we pop out the Python number, we use math.exp to exponentiate it, and create a new Value object; everything we've seen before. The tricky part, of course, is: how do you backpropagate through e to the x? And so here you can potentially pause the video and think about what should go there. Okay. Basically, we need to know the local derivative of e to the x, and d/dx of e to the x is famously just e to the x. We've already calculated e to the x, and it's sitting in out.data. So we can do out.data times out.grad; that's the chain rule. We're just chaining onto the current running grad, and this is what the expression looks like. It looks a little confusing, but this is what it is, and that's the exponentiation. So, redefining, we should now be able to call a.exp, and hopefully the backward pass works as well. Okay, and the last thing we'd like to do, of course, is divide. Now, I will actually implement something slightly more powerful than division, because division is just a special case of something a bit more general. In particular, just by rearranging: if we have some kind of a b equals Value of 4.0 here, we'd like to be able to do a divided by b, and we'd like this to give us 0.5. Division can be reshuffled as follows: a divided by b is the same as a multiplying 1 over b, and that's the same as a multiplying b to the power of negative 1. So what I'd like to do instead is implement the operation x to the k, for some constant k, an integer or a float, and we would like to be able to differentiate this; then, as a special case, negative 1 will give us division. I'm doing it this way just because it's more general, and you might as well do it that way. So basically, we can redefine division, which we will put here somewhere: self divided by other can be rewritten as self times other to the power of negative 1. And now, a Value raised to the power of negative 1 is something we have to define, so we need to implement the __pow__ function. Where am I going to put it? Maybe here somewhere; this is the skeleton for it. This function will be called when we try to raise a Value to some power, and other will be that power. Now, I'd like to make sure that other is only an int or a float. Usually, other would be some kind of other Value object, but here other will be forced to be an int or a float; otherwise, the math won't work for what we're trying to achieve in this specific case. It would be a different derivative expression if we wanted other to be a Value. So here we create the out value, which is just this data raised to the power of other; and other here could be, for example, negative 1, which is what we're hoping to achieve. And then this is the backward stub, and this is the fun part: what is the chain rule expression for backpropagating through the power function, where the power is some kind of a constant? So this is the exercise; maybe pause the video here and see if you can figure out yourself what we should put there. Okay, so you can actually go here and look at derivative rules as an example, and we see lots of derivative rules that you hopefully know from calculus. In particular, what we're looking for is the power rule, because that's telling us that if we're trying to take d/dx of x to the n, which is what we're doing here, then that is just n times x to the n minus 1, right?
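Here's a sketch of how exp, __pow__, and division come out as Value methods; the __pow__ backward is exactly this power rule, chained with out.grad:

```python
import math

def exp(self):
    out = Value(math.exp(self.data), (self,), 'exp')
    def _backward():
        # d/dx e^x = e^x, which we've already computed: it's out.data
        self.grad += out.data * out.grad
    out._backward = _backward
    return out

def __pow__(self, other):
    assert isinstance(other, (int, float)), "only supporting int/float powers for now"
    out = Value(self.data**other, (self,), f'**{other}')
    def _backward():
        # power rule: d/dx x^n = n * x^(n-1), chained with out.grad
        self.grad += other * self.data**(other - 1) * out.grad
    out._backward = _backward
    return out

def __truediv__(self, other):
    # a / b is just a * b**-1
    return self * other**-1
```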
Okay, so that's telling us about the local derivative of this power operation. All we want here is: n is now other, and self.data is x. So this becomes other, which is n, times self.data — which is a plain Python int or float, not a Value object; we're just accessing the data attribute — raised to the power of other minus 1, that is, n minus 1. I could put brackets around this, but it doesn't matter, because power takes precedence over multiply in Python, so this is okay as-is. And that's the local derivative only; now we have to chain it, and we chain it simply by multiplying by out.grad; that's the chain rule. This should technically work, and we're going to find out soon. So now, if we do this, this should work, and we get 0.5. So the forward pass works; but does the backward pass work? At this point I realized that we actually also have to know how to subtract: right now, a minus b will not work. To make it work, we need one more piece of code here. This is the subtraction: the way we're going to implement subtraction is by addition of a negation, and then to implement negation, we multiply by negative one. So again, we're just using the stuff we've already built and expressing it in terms of what we have; and a minus b now works. Okay, so now let's scroll again to this expression here, for this neuron, and let's just compute the backward pass once we've defined o, and let's draw it. So here are the gradients for all of the leaf nodes of this two-dimensional neuron with a tanh, which we've seen before. Now, what I'd like to do is break up this tanh into the expression here. So let me copy-paste this; we will preserve the label, but we will change how we define o. In particular, we're going to implement this formula: we need e to the 2x minus 1, over e to the 2x plus 1. So, e to the 2x: we take 2 times n and we exponentiate it; that's e to the 2x. And then, because we're using it twice, let's create an intermediate variable e, and define o as (e minus 1) over (e plus 1). And that should be it; then we should be able to draw the dot of o. Now, before I run this, what do we expect to see? Number one, we're expecting a much longer graph here, because we've broken tanh up into a bunch of other operations. But those operations are mathematically equivalent, so we're expecting to see, number one, the same result here, so the forward pass works; and number two, because of that mathematical equivalence, we expect the same backward pass and the same gradients on these leaf nodes. These gradients should be identical. So let's run this. Number one: instead of a single tanh node, we now have exp, and we have plus, and we have times negative one; this is the division. And we end up with the same forward pass. And then the gradients: we have to be careful, because they're potentially in a slightly different order. The gradients for w2 and x2 should be 0 and 0.5; w2 and x2 are 0 and 0.5. And w1 and x1 should be 1 and negative 1.5; they are 1 and negative 1.5. So that means both our forward pass and our backward pass were correct, because this turned out to be equivalent to the tanh we had before.
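For reference, here's a sketch of the subtraction plumbing, and of the decomposed forward pass; n here is the pre-activation Value from the neuron example above:

```python
def __neg__(self):
    return self * -1

def __sub__(self, other):
    # subtraction is just addition of a negation
    return self + (-other)
```

```python
# tanh(x) = (e^(2x) - 1) / (e^(2x) + 1), built out of exp, +, -, and /
e = (2 * n).exp()
o = (e - 1) / (e + 1)
o.backward()
# same forward value (~0.7071) and the same leaf gradients as the single tanh node
```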
And the reason I wanted to go through this exercise is, number one, we got to practice a few more operations and write more backward passes; and number two, I wanted to illustrate the point that the level at which you implement your operations is totally up to you. You can implement backward passes for tiny expressions, like a single individual plus or a single times; or you can implement them for, say, tanh, which is a kind of composite operation, because it's made up of all these more atomic operations. But really, all of this is kind of like a fake concept. All that matters is that we have some kind of inputs and some kind of an output, and this output is a function of the inputs in some way. As long as you can do the forward pass and the backward pass of that little operation, it doesn't matter what that operation is, or how composite it is. If you can write the local gradients, you can chain the gradient, and you can continue backpropagation. So the design of what those functions are is completely up to you. So now I would like to show you how you can do the exact same thing, but using a modern deep neural network library — like, for example, PyTorch, which I've roughly modeled micrograd on. PyTorch is something you would use in production, and I'll show you how to do the exact same thing, but in the PyTorch API. So I'm just going to copy-paste it in and walk you through it a little bit; this is what it looks like. We're going to import PyTorch, and then we need to define these value objects, like we have here. Now, micrograd is a scalar-valued engine, so we only have scalar values, like 2.0; but in PyTorch, everything is built around tensors, and like I mentioned, tensors are just n-dimensional arrays of scalars. That's why things get a little bit more complicated here: I just need a scalar-valued tensor, a tensor with just a single element. By default, when you work with PyTorch, you would use more complicated tensors. So if I import PyTorch, I can create tensors like this, and this tensor, for example, is a 2-by-3 array of scalars in a single compact representation; we can check its shape and see that it's a 2-by-3 array, and so on. This is usually what you would work with in the actual libraries. So here I'm creating a tensor that has only a single element, 2.0, and then I'm casting it to double, because Python by default uses double precision for its floating-point numbers, and I'd like everything to be identical. By default, the data type of these tensors will be float32, so it's only using single-precision floats; so I'm casting it to double, and we get float64, just like in Python. That gives us something similar to a Value of 2. The next thing I have to do is, because these are leaf nodes, by default PyTorch assumes they do not require gradients, so I need to explicitly say that all of these nodes require gradients. So this constructs scalar-valued, one-element tensors, and makes sure that PyTorch knows they require gradients. By default, these are set to false, by the way, for efficiency reasons: usually you would not want gradients for leaf nodes, like the inputs to the network, and this is just trying to be efficient in the most common cases. So, once we've defined all of our values in PyTorch land, we can perform arithmetic just like we can here in micrograd land, and this would just work. Then there's a torch.tanh also, and what we get back is a tensor again; and, just like in micrograd, it's got a data attribute and a grad attribute.
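Concretely, the PyTorch version of the neuron's forward and backward pass would look roughly like this; I'm assuming the same inputs as our micrograd neuron (x1=2, x2=0, w1=-3, w2=1), and the long bias constant here is my stand-in for the 6.88... number from earlier:

```python
import torch

# scalar-valued leaf tensors, cast to double so we match Python's float64
x1 = torch.Tensor([2.0]).double();                 x1.requires_grad = True
x2 = torch.Tensor([0.0]).double();                 x2.requires_grad = True
w1 = torch.Tensor([-3.0]).double();                w1.requires_grad = True
w2 = torch.Tensor([1.0]).double();                 w2.requires_grad = True
b  = torch.Tensor([6.8813735870195432]).double();  b.requires_grad = True

n = x1*w1 + x2*w2 + b
o = torch.tanh(n)

print(o.data.item())   # forward pass: ~0.7071
o.backward()
print(x1.grad.item(), w1.grad.item(), x2.grad.item(), w2.grad.item())
# ~-1.5, ~1.0, ~0.5, ~0.0 — matching micrograd
```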
So these tensor objects, just like in micrograd, have a .data and a .grad; the only difference here is that we need to call .item(). .item() basically takes a single-element tensor and returns that element, stripping out the tensor. So let me just run this: this is going to print the forward pass, which is 0.707, and then the gradients, which hopefully are 0.5, 0, negative 1.5, and 1. So if we just run this, there we go: 0.7, so the forward pass agrees; and then 0.5, 0, negative 1.5, and 1. So PyTorch agrees with us. And just to show you here: o is a tensor with a single element, and it's a double, and we can call .item() on it to just get the single number out; that's what item does. And o is a tensor object, like I mentioned, and it's got a backward function, just like we've implemented. And all of these also have a .grad: x2, for example, has a grad, it's a tensor, and we can pop out the individual number with .item(). So basically, Torch can do what we did in micrograd, as a special case, when your tensors are all single-element tensors. But the big deal with PyTorch is that everything is significantly more efficient, because we are working with these tensor objects, and we can do lots of operations in parallel on all of these tensors. But otherwise, what we've built very much agrees with the API of PyTorch. Okay, so now that we have some machinery to build out pretty complicated mathematical expressions, we can also start building up neural nets. And as I mentioned, neural nets are just a specific class of mathematical expressions. So we're going to start building a neural net up piece by piece, and eventually we'll build out a two-layer multi-layer perceptron, as it's called, and I'll show you exactly what that means. Let's start with a single individual neuron. We've implemented one here, but here I'm going to implement one that also subscribes to the PyTorch API and how it designs its neural network modules. So, just like we saw that we can match the API of PyTorch on the autograd side, we're going to try to do that on the neural network modules side. Here's class Neuron; and just for the sake of efficiency, I'm going to copy-paste some sections that are relatively straightforward. The constructor takes the number of inputs to this neuron, which is how many inputs come to a neuron; this one, for example, has three inputs. And then it's going to create a weight — some random number between negative one and one — for every one of those inputs, and a bias that controls the overall trigger-happiness of this neuron. Then we're going to implement a def __call__ of self and x, some input x. And really, what we want to do here is w times x plus b, where w times x is a dot product, specifically. Now, if you haven't seen __call__ before: let me just return 0.0 here for now. The way this works is that we can have an x which is, say, [2.0, 3.0]; then we can initialize a neuron that is two-dimensional, because these are two numbers; and then we can feed those two numbers into that neuron to get an output. When you use this notation, n of x, Python will use __call__. So currently, __call__ just returns 0.0. Now we'd like to actually do the forward pass of this neuron instead. The first thing we need is to multiply all of the elements of w with all of the elements of x, pairwise.
So we're going to zip up self.w and x. In Python, zip takes two iterators and creates a new iterator that iterates over tuples of their corresponding entries. So, for example, just to show you, we can print this list and still return 0.0 here. We see that the w's are paired up with the x's: w with x. And now what we want to do is, for wi, xi in this zip, multiply wi times xi, then sum all of that together to come up with an activation, and also add self.b on top. So that's the raw activation, and then, of course, we need to pass that through a non-linearity. So what we're going to be returning is act.tanh(), and here's out. Now we see that we are getting some outputs, and we get a different output from the neuron each time, because we are initializing different weights and biases. Then, to be a bit more efficient here: sum, by the way, takes a second optional parameter, which is the start, and by default the start is 0, so the elements of this sum will be added on top of 0 to begin with. But actually, we can just start with self.b, and then we have an expression like this. (The generator expression here must be parenthesized in Python.) There we go; so now we can forward a single neuron. Next up, we're going to define a layer of neurons. Here we have a schematic for an MLP. We see that in these MLPs, each layer — this is one layer — has a number of neurons, and they're not connected to each other, but all of them are fully connected to the input. So what is a layer of neurons? It's just a set of neurons evaluated independently. So, in the interest of time, I'm going to do something fairly straightforward here: a layer is literally just a list of neurons. And how many neurons do we have? We take that as an input argument: how many neurons do you want in your layer, the number of outputs of this layer? We just initialize completely independent neurons with this given dimensionality, and when we call on the layer, we just independently evaluate them. So now, instead of a neuron, we can make a layer of neurons: they are two-dimensional neurons, and let's have three of them. And now we see that we have three independent evaluations of three different neurons, right? Okay. And finally, let's complete this picture and define an entire multi-layer perceptron, or MLP. As we can see here, in an MLP these layers just feed into each other sequentially. So let's come here, and I'm just going to copy the code, in the interest of time. An MLP is very similar: we take the number of inputs, as before, but now, instead of taking a single nout, which is the number of neurons in a single layer, we take a list of nouts, and this list defines the sizes of all the layers that we want in our MLP. So we just put them all together, then iterate over consecutive pairs of these sizes and create layer objects for them; and in the __call__ function, we just call them sequentially. So that's an MLP, really. And let's actually re-implement this picture: we want three input neurons and then two layers of four, and an output unit. So we want a three-dimensional input, say this is an example input; three inputs into two layers of four, and one output. And there we go: that's a forward pass of an MLP.
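Put together, here's a condensed sketch of these three classes, including the single-output convenience for Layer that I'll describe next:

```python
import random

class Neuron:
    def __init__(self, nin):
        self.w = [Value(random.uniform(-1, 1)) for _ in range(nin)]
        self.b = Value(random.uniform(-1, 1))   # controls the trigger-happiness
    def __call__(self, x):
        # w · x + b, squashed with tanh
        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)
        return act.tanh()

class Layer:
    def __init__(self, nin, nout):
        self.neurons = [Neuron(nin) for _ in range(nout)]
    def __call__(self, x):
        outs = [n(x) for n in self.neurons]
        return outs[0] if len(outs) == 1 else outs

class MLP:
    def __init__(self, nin, nouts):
        sz = [nin] + nouts
        self.layers = [Layer(sz[i], sz[i+1]) for i in range(len(nouts))]
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

n = MLP(3, [4, 4, 1])          # three inputs, two layers of four, one output
print(n([2.0, 3.0, -1.0]))     # a single Value
```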
To make this a little bit nicer: you see how we have just a single element here, but it's wrapped in a list, because Layer always returns lists. So, for convenience, return outs at zero if outs has exactly one element, else return the full list. This allows us to just get a single value out at the last layer, which only has a single neuron. And finally, we should be able to draw the dot of n(x). As you might imagine, these expressions are now getting relatively involved: this is an entire MLP that we're defining now, all the way to a single output. Okay, and obviously you would never differentiate these expressions with pen and paper; but with micrograd, we will be able to backpropagate all the way through this, into the weights of all these neurons. So let's see how that works. Okay, so let's create ourselves a very simple example dataset here. This dataset has four examples, so we have four possible inputs into the neural net, and we have four desired targets. We'd like the neural net to output 1.0 when it's fed this example, negative one when it's fed these two examples, and one when it's fed this example. So it's a very simple binary classifier neural net, basically, that we would like here. Now, let's see what the neural net currently thinks about these four examples. We can just get the predictions: basically, we call n(x) for x in xs, and then we can print them. So these are the outputs of the neural net on those four examples. The first one is 0.91, but we'd like it to be one, so we should push this one higher. This one says 0.88, and we want it to be negative one; same for this one. And this one is 0.88, and we want it to be one. So how do we tune the weights to better predict the desired targets? The trick used in deep learning to achieve this is to calculate a single number that somehow measures the total performance of your neural net, and we call this single number the loss. So the loss, first of all, is a single number that we're going to define, that basically measures how well the neural net is performing. Right now, we have the intuitive sense that it's not performing very well, because the predictions are not very close to the targets; so the loss will be high, and we'll want to minimize the loss. In particular, in this case, we're going to implement the mean squared error loss. What this is doing is: we iterate, for ygt (y ground truth) and yout (y output) in zip(ys, ypred). So we pair up the ground truths with the predictions — the zip iterates over tuples of them — and for each pair, we subtract them and square the result. So let's first see what these losses are: these are the individual loss components. For each one of the four, we take the prediction and the ground truth, subtract them, and square. Because this first one is so close to its target — 0.91 is almost 1 — subtracting them gives a very small number: here we'd get like a negative 0.1, and then squaring it makes sure that, regardless of whether we are more negative or more positive, we always get a positive number. (Instead of squaring, we could also take, for example, the absolute value; we just need to discard the sign.) And so you see that the expression is arranged so that you only get 0 exactly when yout is equal to ygt.
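In code, the dataset and the loss look roughly like this; the four input vectors are illustrative stand-ins here, but the 1/−1 targets are the ones just described. The explicit Value(0.0) start for sum avoids needing an __radd__ on Value:

```python
xs = [
    [2.0, 3.0, -1.0],
    [3.0, -1.0, 0.5],
    [0.5, 1.0, 1.0],
    [1.0, 1.0, -1.0],
]
ys = [1.0, -1.0, -1.0, 1.0]   # desired targets

ypred = [n(x) for x in xs]

# squared-error loss: zero only when every prediction exactly matches its target
loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))
```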
When those two are equal, so your prediction is exactly the target, you are going to get 0; and if your prediction is not the target, you are going to get some other number. So here, for example, we are way off, and that's why the loss is quite high. The more off we are, the greater the loss will be. We don't want high loss, we want low loss. And so the final loss here will be just the sum of all of these numbers: this should be roughly 0, plus roughly 0, plus 7. So the loss should be about 7 here. And now we want to minimize the loss, because if the loss is low, then every one of the predictions is equal to its target. The lowest the loss can be is 0, and the greater it is, the worse off the predictions are. So now, of course, if we do loss.backward, something magical happened when I hit enter. And the magical thing, of course, is that we can now look at, say, n.layers at the first layer, then its neurons at 0 — because, remember, the MLP has layers, which is a list, and each layer has neurons, which is a list, and that gives us an individual neuron — and then its weights at 0. Oops, it's not called weights, it's called w. And that's a Value; but now this Value also has a grad, because of the backward pass. And we see that, because this gradient here, on this particular weight of this particular neuron of this particular layer, is negative, its influence on the loss is also negative: slightly increasing this particular weight of this neuron of this layer would make the loss go down. And we have this information for every single one of our neurons and all of their parameters. Actually, it's worth looking at the draw dot of loss too, by the way. Previously, we looked at the draw dot of a single neuron's forward pass, and that was already a large expression; but what is this expression? We actually forwarded every one of those four examples, and then we have the loss on top of them, with the mean squared error. So this is a really massive graph — oh my gosh — and it's kind of excessive, because it has four forward passes of the neural net, one for every one of the examples, and then it has the loss on top, ending with the value of the loss, which was 7.12. And this loss will now backpropagate through all four forward passes, through every single intermediate value of the neural net, all the way back, of course, to the parameters — the weights — which are inputs to this neural net; and these numbers here, the example scalars, are also inputs to the neural net. So if we look around here, we will probably find some of these examples: this 1.0, potentially maybe this 1.0, or some of the others, and you'll see that they all have gradients as well. The thing is, these gradients on the input data are not that useful to us, and that's because the input data is not changeable: it's a given to the problem, a fixed input. We're not going to be changing it or messing with it, even though we do have gradients for it. But some of these gradients will be for the neural network parameters, the w's and the b's, and those, of course, we do want to change.
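For example, the drill-down just described looks like this:

```python
loss.backward()

# first layer -> first neuron -> first weight
p = n.layers[0].neurons[0].w[0]
print(p.data, p.grad)   # the weight's current value, and its gradient on the loss
```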
Okay, so now we're going to want some convenience code to gather up all of the parameters of the neural net, so that we can operate on all of them simultaneously; every one of them we will nudge a tiny amount, based on the gradient information. So let's collect the parameters of the neural net all in one list. Let's create a parameters(self) on Neuron that just returns self.w, which is a list, concatenated with a list containing self.b. This will just return a list; a list plus a list just gives you a list. That's parameters of Neuron, and I'm calling it this way because PyTorch also has a parameters on every single nn.Module, and it does exactly what we're doing here: it returns the parameter tensors; for us, it's the parameter scalars. Now, Layer is also a module, so it will have a parameters(self) too, and basically what we want to do there is something like this: params starts as an empty list; then, for neuron in self.neurons, we get ps = neuron.parameters(), and we params.extend(ps), putting the parameters of that neuron on top of params; and then we return params. But this is way too much code, so there's a way to simplify it: return p for neuron in self.neurons for p in neuron.parameters(). It's a single list comprehension; in Python, you can sort of nest them like this, and you can create the desired list. These are identical, so we can take the long version out. And then let's do the same for MLP: def parameters(self), and return p for layer in self.layers for p in layer.parameters(). And that should be good. Now, let me pop this out so we don't reinitialize our network... okay, so unfortunately, we will have to reinitialize the network, because we just added functionality: I want to call n.parameters(), but that's not going to work, because n is an instance of the old class. So we do have to reinitialize the network, which will change some of the numbers; but let me do that, so that we pick up the new API. We can now call n.parameters(), and these are all the weights and biases inside the entire neural net: in total, this MLP has 41 parameters. And now we'll be able to change them. If we recalculate the loss here, we see that we unfortunately get slightly different predictions and a slightly different loss, but that's okay. Okay, so we see that this neuron's gradient is slightly negative, and we can also look at its data right now, which is 0.85. So this is the current value of this weight, and this is its gradient on the loss. What we want to do now is iterate: for every p in n.parameters(), so for all 41 parameters in this neural net, we want to change p.data slightly, according to the gradient information. So, dot dot dot, to-do here; but this will basically be a tiny update in this gradient descent scheme. In gradient descent, we are thinking of the gradient as a vector pointing in the direction of increased loss, and we are modifying p.data by a small step size in the direction of the gradient. So the step size, as an example, could be a very small number, like 0.01, times p.grad, right?
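Before we do the update, here's a sketch of those three parameters methods; the nested list comprehensions flatten everything into one flat list of Value objects:

```python
class Neuron:
    # ... __init__ and __call__ as before ...
    def parameters(self):
        return self.w + [self.b]

class Layer:
    # ... as before ...
    def parameters(self):
        return [p for neuron in self.neurons for p in neuron.parameters()]

class MLP:
    # ... as before ...
    def parameters(self):
        return [p for layer in self.layers for p in layer.parameters()]
```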
But we have to think through some of the signs here. In particular, working with this specific example, we see that if we just left it like this, then this neuron's value would be increased by a tiny amount of the gradient. The gradient is negative, so this value of this neuron would go slightly down; it would become like 0.84 or something like that. But if this neuron's value goes lower, that would actually increase the loss, because the derivative of this neuron is negative: increasing this weight makes the loss go down, so increasing it is what we want to do, not decreasing it. So basically, what we're missing here is a negative sign, and that's because we want to minimize the loss, not maximize it. The other interpretation, as I mentioned, is that you can think of the gradient vector — basically just the vector of all the gradients — as pointing in the direction of increasing loss; but we want to decrease it, so we actually want to go in the opposite direction. You can convince yourself that the negative does the right thing here, because we want to minimize the loss. So, if we nudge all the parameters by a tiny amount, we'll see that the data has changed a little bit: this neuron now has a slightly greater value, 0.854 went to 0.857. And that's a good thing, because slightly increasing this neuron's data makes the loss go down, according to the gradient; so the correction has happened, sign-wise. And now, of course, because we've changed all these parameters, we expect that the loss should have gone down a bit, so we want to re-evaluate the loss. The data definition hasn't changed, but the forward pass of the network we can recalculate; and actually, let me do it out here, so that we can compare the two loss values. So here, if I recalculate the loss, we'd expect the new loss to be slightly lower than this number; and hopefully, what we're getting now is a tiny bit lower than 4.84. 4.36. And remember, the way we've arranged this is that a low loss means that our predictions are matching the targets, so our predictions are now probably slightly closer to the targets. And now, all we have to do is iterate this process. So again, we do the forward pass, and this is the loss; now we can do loss.backward — let me take these out — and we can take a step. And now we should have a slightly lower loss: 4.36 goes to 3.9. Okay, so we've done the forward pass, here's the backward pass, nudge; and now the loss is 3.66... 3.47... and you get the idea: we just continue doing this. And this is gradient descent: we're iteratively doing forward pass, backward pass, update; forward pass, backward pass, update; and the neural net is improving its predictions. So here, if we look at ypred now, we see that this value should be getting closer to 1, so it should be getting more positive; these should be getting more negative; and this one should also be getting more positive. So if we just iterate this a few more times... actually, we may be able to afford to go a bit faster. Let's try a slightly higher learning rate. Oops. Okay, there we go; so now we're at 0.31. If you go too fast, by the way — if you try to make too big of a step — you may actually overstep. It's overconfidence. Because, remember, we don't actually know the loss function exactly: the loss function has all kinds of structure, and we only know about the very local dependence of all of these parameters on the loss.
But if we step too far, we may step into a part of the loss landscape that is completely different, and that can destabilize training and make your loss actually blow up, even. So the loss is now 0.04, so the predictions should actually be really quite close; let's take a look. You see how this is almost one, almost negative one, almost one. We can continue going. So, yep: backward, update... oops, there we go. So we went way too fast and we actually overstepped; we got too eager. Where are we now? Oops... okay, 7e-9. So this is a very, very low loss, and the predictions are basically perfect. So somehow, we were doing way too big updates and we briefly exploded, but then we ended up getting into a really good spot anyway. Usually, this learning rate and the tuning of it is a subtle art: you want to set your learning rate just right. If it's too low, you're going to take way too long to converge; but if it's too high, the whole thing gets unstable, and you might actually even explode the loss, depending on your loss function. So finding the step size to be just right is a pretty subtle art sometimes, when you're using, sort of, vanilla gradient descent. But we happened to get into a good spot. We can look at n.parameters(): this is the setting of weights and biases that makes our network predict the desired targets very, very closely. And basically, we've successfully trained a neural net. Okay, now let's make this a tiny bit more respectable and implement an actual training loop, and see what that looks like. So, this is the data definition, and that stays. Then, for k in range of some number of steps: first we do the forward pass and evaluate the loss. (Let's also reinitialize the neural net from scratch; and here's the data.) So, we first do the forward pass, then we do the backward pass, and then we do an update; that's gradient descent. Then we should be able to iterate this, and to print the current step and the current loss; let's just print the raw number of the loss. And that should be it. As for the learning rate: 0.01 is a little too small, and 0.1, we saw, is a little bit dangerous, too high; so let's go somewhere in between. And we'll optimize this not for 10 steps, but let's go for, say, 20 steps. Let me erase all of this junk, and let's run the optimization. And you see how we've converged more slowly, in a more controlled manner, and got to a loss that is very low; so I expect ypred to be quite good. There we go. And that's it... okay, so this is kind of embarrassing, but we actually have a really terrible bug in here. It's a subtle bug, and it's a very common bug, and I can't believe I've made it for the twentieth time in my life, especially on camera. I could have reshot the whole thing, but I think it's pretty funny, and you get to appreciate a bit what working with neural nets is maybe like sometimes. We are guilty of a common bug: I actually tweeted the most common neural net mistakes a long time ago now, and I'm not really going to explain any of these, but we are guilty of number three: "you forgot to zero grad before .backward()". What is that? Basically what's happening — and it's a subtle bug, and I'm not sure if you saw it — is that all of these weights here have a .data and a .grad, and .grad starts at zero. Then we do backward, and we fill in the gradients; and then we do an update on the data, but we don't flush the grad: it stays there.
So when we do the second forward pass and we do backward again, remember that all the backward operations do a plus-equals on the grad. So these gradients just add up, and they never get reset to zero. Basically, we didn't zero grad. Here's how we zero grad before backward: we iterate over all the parameters, and we make sure that p.grad is reset to zero, just like it is in the constructor. Remember, all the way back here, for all these Value nodes, grad starts at zero; and then all these backward passes do a plus-equals on that grad. But we need to make sure that we reset these grads to zero, so that when we do backward, all of them start at zero, and the actual backward pass accumulates the loss derivatives into them. This is what zero_grad is in PyTorch. And we will get a slightly different optimization: let's reset the neural net; the data is the same; this is now, I think, correct. We get a much slower descent, but we still end up with pretty good results, and we can continue this a bit more to get lower and lower and lower losses. So the only reason the previous thing worked, despite being extremely buggy, is that this is a very, very simple problem, and it's very easy for this neural net to fit this data. The grads ended up accumulating, and that effectively gave us a massive step size, which made us converge extremely fast. But basically, now we have to do more steps to get to very low values of loss and get ypred to be really good; we can also try a slightly greater step. Yeah — we're going to get closer and closer to one, minus one, and one. So working with neural nets is sometimes tricky, because you may have lots of bugs in the code, and your network might actually work, just like ours worked; but chances are that if we had a more complex problem, this bug would have made us not optimize the loss very well. We were only able to get away with it because the problem is very simple.
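Here's the corrected training loop as a sketch, with the zero-grad step in place; the learning rate of 0.05 is my stand-in for the "somewhere in between" value:

```python
for k in range(20):
    # forward pass
    ypred = [n(x) for x in xs]
    loss = sum(((yout - ygt)**2 for ygt, yout in zip(ys, ypred)), Value(0.0))

    # backward pass -- but flush the grads first!
    for p in n.parameters():
        p.grad = 0.0
    loss.backward()

    # update: gradient descent, stepping against the gradient
    for p in n.parameters():
        p.data += -0.05 * p.grad

    print(k, loss.data)
```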
So let's now bring everything together and summarize what we learned. What are neural nets? Neural nets are mathematical expressions — fairly simple ones, in the case of a multi-layer perceptron — that take the data and the weights and parameters of the neural net as inputs. There's a mathematical expression for the forward pass, followed by a loss function; the loss function tries to measure the accuracy of the predictions, and usually the loss will be low when your predictions are matching your targets, when the network is basically behaving well. So we arrange the loss function so that when the loss is low, the network is doing what you want it to do on your problem. We then backward the loss, using backpropagation, to get the gradient, and then we know how to tune all the parameters to decrease the loss locally. We have to iterate that process many times, in what's called gradient descent: we simply follow the gradient information, and that minimizes the loss; and the loss is arranged so that when the loss is minimized, the network is doing what you want it to do. And yeah, so we just have a blob of neural stuff, and we can make it do arbitrary things, and that's what gives neural nets their power. Now, this is a very tiny network with 41 parameters, but you can build significantly more complicated neural nets with billions — at this point, almost trillions — of parameters. It's a massive blob of simulated neural tissue, roughly speaking, and you can make it do extremely complex things. And these neural nets have all kinds of very fascinating emergent properties when you try to make them do significantly hard problems. As in the case of GPT, for example: we have massive amounts of text from the internet, and we're trying to get a neural net to take a few words and predict the next word in a sequence; that's the learning problem. And it turns out that when you train this on all of the internet, the neural net actually has really remarkable emergent properties; but that neural net would have hundreds of billions of parameters. It works on fundamentally the exact same principles, though. The neural net, of course, will be a bit more complex, but otherwise evaluating the gradient is there and would be identical, and the gradient descent would be there and basically identical. People usually use slightly different updates — ours is a very simple stochastic gradient descent update — and the loss function would not be mean squared error; they would be using something called the cross-entropy loss for predicting the next token. So there are a few more details, but fundamentally the neural network setup and neural network training are identical and pervasive, and now you understand intuitively how that works under the hood. In the beginning of this video, I told you that by the end of it you would understand everything in micrograd as we slowly built it up. Let me briefly prove that to you. I'm going to step through all the code that is in micrograd as of today. (Potentially, some of the code will change by the time you watch this video, because I intend to continue developing micrograd.) But let's look at what we have so far, at least. __init__.py is empty. When you go to engine.py, that has the Value class, and everything here you should mostly recognize: we have the .data and .grad attributes, we have the _backward function, we have the set of previous children and the operation that produced this value. We have addition, multiplication, and raising to a scalar power. We have the ReLU non-linearity, which is a slightly different type of non-linearity than the tanh we used in this video; both of them are non-linearities, and notably, tanh is not actually present in micrograd as of right now, but I intend to add it later. We have the backward, which is identical; and then all of these other operations, which are built on top of the operations here. So Value should be very recognizable, except for the non-linearity used in this video. There's no massive difference between ReLU and tanh and sigmoid and these other non-linearities; they're all roughly equivalent and can be used in MLPs. I used tanh because it's a bit smoother, and because it's a little bit more complicated than ReLU, and therefore it stressed the local gradients and working with those derivatives a bit more, which I thought would be useful. nn.py is the neural networks library, as I mentioned; you should recognize the identical implementations of Neuron, Layer, and MLP. Notably, we also have a class Module here, which is a parent class of all these modules. I did that because there's an nn.Module class in PyTorch, so this exactly matches that API; and nn.Module in PyTorch also has a zero_grad, which I refactored out into the Module parent here. So that's the end of micrograd, really.
nn.py is the neural networks library, as I mentioned, so you should recognize the identical implementation of Neuron, Layer, and MLP. Notably, there's a class Module here that is a parent class of all these modules. I did that because there's an nn.Module class in PyTorch, so this exactly matches that API, and nn.Module in PyTorch also has a zero_grad, which I refactored out here in the same way. So that's really all of micrograd's core.

Then there's a test, which you'll see basically creates two chunks of code, one in micrograd and one in PyTorch, and makes sure that the forward and the backward pass agree identically, for a slightly simpler expression and a slightly more complicated expression. Everything agrees, so we agree with PyTorch on all of these operations.

And finally, there's a demo.ipynb here, which is a bit more complicated binary classification demo than the one I covered in this lecture. We only had a tiny dataset of four examples; here we have a bit more complicated example, with lots of blue points and lots of red points, and we're again trying to build a binary classifier to distinguish two-dimensional points as red or blue. It's a bigger, more complicated MLP here. The loss is a bit more involved because it supports batches: because our dataset was so tiny, we always did a forward pass on the entire dataset of four examples, but when your dataset is, say, a million examples, what we usually do in practice is pick out some random subset, which we call a batch, and then we only process that batch forward, backward, and update, so we don't have to forward the entire training set. So this supports batching, because there are a lot more examples here. We do a forward pass, and the loss is slightly different: this is a max-margin loss that I implement here. The one we used was the mean squared error loss, because it's the simplest one; there's also the binary cross-entropy loss. All of them can be used for binary classification and don't make too much of a difference in the simple examples that we looked at so far. There's also something called L2 regularization used here; this has to do with the generalization of the neural net and controls overfitting in a machine learning setting, but I did not cover these concepts in this video, potentially later. And the training loop you should recognize: forward, backward, with zero grad, and update, and so on. You'll notice that in the update here, the learning rate is scaled as a function of the number of iterations, and it shrinks. This is something called learning rate decay: in the beginning you have a high learning rate, and as the network sort of stabilizes near the end, you bring down the learning rate to get to some of the fine details at the end. And in the end we see the decision surface of the neural net, and we see that it learned to separate out the red and the blue areas based on the data points. So that's the slightly more complicated example in demo.ipynb that you're free to go over. But yeah, as of today, that is micrograd.

I also wanted to show you a little bit of real stuff, so that you get to see how this is actually implemented in a production-grade library like PyTorch. In particular, I wanted to find and show you the backward pass for tanh in PyTorch. Here in micrograd, we see that the backward pass for tanh is 1 minus t squared, where t is the output of tanh of x, times out.grad, which is the chain rule. So we're looking for something that looks like this. Now, I went to PyTorch, which has an open-source GitHub codebase, and I looked through a lot of its code, and honestly, I spent about 15 minutes and I couldn't find tanh. And that's because these libraries unfortunately grow in size and entropy. If you just search for tanh, you get apparently 2,800 results in 406 files, so I don't know what all these files are doing, honestly, and why there are so many mentions of tanh. Unfortunately, these libraries are quite complex; they're meant to be used, not really inspected. Eventually, I did stumble on someone who was trying to change the tanh backward code for some reason, and someone there pointed to the CPU kernel and the CUDA kernel for tanh backward. So basically, which code runs depends on whether you're using PyTorch on a CPU device or on a GPU; these are different devices, and I haven't covered this. But this is the tanh backward kernel for the CPU, and the reason it's so large is that, number one, there's a branch for if you're using a complex type, which we haven't even talked about; then a branch for the specific data type bfloat16, which we also haven't talked about; and then, if you're using neither, this is the kernel, and deep inside it we see something that resembles our backward pass: they have a times 1 minus b squared. So this b here must be the output of the tanh, and the a is out.grad. So here we found it, deep inside PyTorch, at this location, for some reason inside the binary-ops kernels, even though tanh is not actually a binary op. And then this is the GPU kernel: we're not complex, so we're here, and there it is in one line of code. So we did find it, but basically, unfortunately, these codebases are very large, and micrograd is very, very simple. If you actually want to work with the real stuff, you'll find that actually locating the code for it is difficult.

I also wanted to show you a little example here, where PyTorch shows how you can register a new type of function that you want to add to PyTorch as a Lego building block. So here, if you want to add, for example, a Legendre polynomial (LegendrePolynomial3), here's how you could do it: you register it as a class that subclasses torch.autograd.Function, and then you have to tell PyTorch how to forward your new function and how to backward through it. As long as you can do the forward pass of this little function piece that you want to add, and as long as you know the local derivative, the local gradients, which are implemented in the backward, PyTorch will be able to backpropagate through your function, and then you can use it as a Lego block in a larger Lego castle of all the different Lego blocks that PyTorch already has. So that's the only thing you have to tell PyTorch, and everything will just work; you can register new types of functions in this way, following this example.
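In rough outline, the pattern of that example looks like this. This is my paraphrase of PyTorch's LegendrePolynomial3 tutorial example rather than a copy of it; the polynomial is P3(x) = 0.5 * (5x^3 - 3x), whose derivative is 1.5 * (5x^2 - 1).

```python
import torch

# Sketch of registering a new autograd function in PyTorch,
# roughly following the LegendrePolynomial3 tutorial example.
class LegendrePolynomial3(torch.autograd.Function):

    @staticmethod
    def forward(ctx, input):
        # the forward pass of the new Lego block; stash the input
        # so the backward pass can compute the local derivative
        ctx.save_for_backward(input)
        return 0.5 * (5 * input ** 3 - 3 * input)

    @staticmethod
    def backward(ctx, grad_output):
        # chain rule: local derivative times the incoming gradient
        input, = ctx.saved_tensors
        return grad_output * 1.5 * (5 * input ** 2 - 1)

# use it via .apply, and PyTorch can now backpropagate through it
x = torch.randn(3, requires_grad=True)
y = LegendrePolynomial3.apply(x).sum()
y.backward()
print(x.grad)  # equals 1.5 * (5 * x**2 - 1)
```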
And that is everything that I wanted to cover in this lecture. So I hope you enjoyed building out micrograd with me, and I hope you found it interesting and insightful. I will post a lot of the links that are related to this video in the video description below. I will also probably post a link to a discussion forum or discussion group where you can ask questions related to this video, and then I can answer, or someone else can answer, your questions. I may also do a follow-up video that answers some of the most common questions. But for now, that's it. I hope you enjoyed it. If you did, then please like and subscribe, so that YouTube knows to feature this video to more people. And that's it for now; I'll see you later. Bye.