WEBVTT 00:00.240 --> 00:06.400 hi everyone hope you're well and next up what i'd like to do is i'd like to build out make more like 00:06.400 --> 00:12.960 micrograd before it make more is a repository that i have on my github webpage you can look at it but 00:12.960 --> 00:17.680 just like with micrograd i'm going to build it out step by step and i'm going to spell everything out 00:17.680 --> 00:23.520 so we're going to build it out slowly and together now what is make more make more as the name 00:23.520 --> 00:31.040 suggests makes more of things that you give it so here's an example names.txt is an example data set 00:31.040 --> 00:38.400 to make more and when you look at names.txt you'll find that it's a very large data set of names so 00:40.160 --> 00:44.880 here's lots of different types of names in fact i believe there are 32 000 names that i've sort 00:44.880 --> 00:50.720 of found randomly on the government website and if you train make more on this data set 00:50.720 --> 00:53.360 it will learn to make more of things like 00:53.520 --> 01:00.640 this and in particular in this case that will mean more things that sound name-like but are 01:00.640 --> 01:05.200 actually unique names and maybe if you have a baby and you're trying to assign a name maybe 01:05.200 --> 01:10.080 you're looking for a cool new sounding unique name make more might help you so here are some 01:10.080 --> 01:17.200 example generations from the neural network once we train it on our data set so here's some example 01:17.760 --> 01:22.240 unique names that it will generate don't tell i wrote 01:23.520 --> 01:29.200 zendy and so on and so all these sort of sound name-like but they're not of course names 01:30.640 --> 01:34.720 so under the hood make more is a character level language model 01:34.720 --> 01:40.320 so what that means is that it is treating every single line here as an example and within each 01:40.320 --> 01:48.880 example it's treating them all as sequences of individual characters so r e e s e is this example 01:48.880 --> 01:53.200 and that's the sequence of characters and that's the level on which we are building out make more 01:53.840 --> 01:57.520 and what it means to be a character level language model then is that it's just 01:58.160 --> 02:01.920 sort of modeling those sequences of characters and it knows how to predict the next character 02:01.920 --> 02:07.120 in the sequence now we're actually going to implement a large number of character level 02:07.120 --> 02:11.200 language models in terms of the neural networks that are involved in predicting the next character 02:11.200 --> 02:17.120 in a sequence so very simple bigram and bag of words models multilayer perceptrons recurrent 02:17.120 --> 02:23.200 neural networks all the way to modern transformers in fact the transformer that we will build will be 02:24.480 --> 02:30.000 basically the equivalent transformer to gpt-2 if you have heard of gpt so that's kind of a big 02:30.000 --> 02:34.800 deal it's a modern network and by the end of this series you will actually understand how that works 02:35.440 --> 02:41.440 on the level of characters now to give you a sense of the extensions here after characters 02:41.440 --> 02:45.200 we will probably spend some time on the word level so that we can generate documents of 02:45.200 --> 02:50.880 words not just little you know segments of characters but we can generate entire much 02:50.880 --> 02:52.000 larger documents 02:52.000 --> 02:58.720 go into images and image text networks
such as DALI stable diffusion and so on but for now we 02:58.720 --> 03:04.560 have to start here character level language modeling let's go so like before we are starting 03:04.560 --> 03:09.280 with a completely blank Jupyter notebook page the first thing is i would like to basically load up 03:09.280 --> 03:16.880 the data set names.txt so we're going to open up names.txt for reading and we're going to read in 03:16.880 --> 03:22.640 everything into a massive string and then because it's a massive string we only like the individual 03:22.640 --> 03:29.280 words and put them in the list so let's call split lines on that string to get all of our words as a 03:29.280 --> 03:37.040 python list of strings so basically we can look at for example the first 10 words and we have that 03:37.040 --> 03:45.600 it's a list of emma olivia ava and so on and if we look at the top of the page here that is indeed 03:45.600 --> 03:46.160 what we see 03:47.040 --> 03:53.920 um so that's good this list actually makes me feel that this is probably sorted by frequency 03:55.600 --> 04:01.040 but okay so these are the words now we'd like to actually like learn a little bit more about this 04:01.040 --> 04:06.880 data set let's look at the total number of words we expect this to be roughly 32 000 and then what 04:06.880 --> 04:15.440 is the for example shortest word so min of length of each word for w in words so the shortest word 04:15.440 --> 04:16.400 will be length 04:17.040 --> 04:24.000 two and max of one w for w in words so the longest word will be 15 characters 04:24.560 --> 04:29.040 so let's now think through our very first language model as i mentioned a character level language 04:29.040 --> 04:34.640 model is predicting the next character in a sequence given already some concrete sequence 04:34.640 --> 04:39.440 of characters before it now what we have to realize here is that every single word here 04:39.440 --> 04:46.560 like isabella is actually quite a few examples packed in to that single word because what is an 04:46.880 --> 04:52.000 instance of a word like isabella in the data set telling us really it's saying that the character 04:52.000 --> 05:00.800 i is a very likely character to come first in the sequence of a name the character s is likely to 05:00.800 --> 05:09.600 come after i the character a is likely to come after is the character b is very likely to come 05:09.600 --> 05:16.160 after isa and so on all the way to a following as a bell and then there's one more example actually 05:16.160 --> 05:16.800 packed in here 05:17.280 --> 05:25.040 and that is that after there's isabella the word is very likely to end so that's one more sort of 05:25.040 --> 05:30.720 explicit piece of information that we have here that we have to be careful with and so there's 05:30.720 --> 05:35.040 a lot packed into a single individual word in terms of the statistical structure of what's 05:35.040 --> 05:39.600 likely to follow in these character sequences and then of course we don't have just an individual 05:39.600 --> 05:43.840 word we actually have 32 000 of these and so there's a lot of structure here to model 05:44.800 --> 05:46.560 now in the beginning what i'd like to start with 05:46.880 --> 05:49.920 is I'd like to start with building a bigram language model. 05:51.060 --> 05:52.660 Now, in a bigram language model, 05:52.860 --> 05:56.000 we're always working with just two characters at a time. 
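A minimal sketch of the loading step and the quick dataset checks described above, assuming names.txt sits in the current working directory:

```python
# Load the dataset into a list of name strings, then look at a few basic stats.
words = open('names.txt', 'r').read().splitlines()

print(words[:10])                   # first 10 names, e.g. ['emma', 'olivia', 'ava', ...]
print(len(words))                   # roughly 32,000 names
print(min(len(w) for w in words))   # shortest name length (2)
print(max(len(w) for w in words))   # longest name length (15)
```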
05:56.560 --> 06:00.020 So we're only looking at one character that we are given, 06:00.420 --> 06:02.960 and we're trying to predict the next character in the sequence. 06:03.840 --> 06:07.000 So what characters are likely to follow R, 06:07.360 --> 06:09.700 what characters are likely to follow A, and so on. 06:09.740 --> 06:12.300 And we're just modeling that kind of a little local structure. 06:12.860 --> 06:16.520 And we're forgetting the fact that we may have a lot more information 06:16.520 --> 06:19.840 if we're always just looking at the previous character to predict the next one. 06:20.120 --> 06:21.980 So it's a very simple and weak language model, 06:22.200 --> 06:23.480 but I think it's a great place to start. 06:24.040 --> 06:27.040 So now let's begin by looking at these bigrams in our data set 06:27.040 --> 06:27.880 and what they look like. 06:27.980 --> 06:30.340 And these bigrams, again, are just two characters in a row. 06:30.960 --> 06:35.500 So for W in words, each W here is an individual word, a string. 06:36.100 --> 06:43.060 We want to iterate over the consecutive characters of this word. 06:43.700 --> 06:46.300 So two characters at a time, sliding it through the word. 06:46.520 --> 06:50.880 Now, an interesting, nice way, cute way to do this in Python, by the way, 06:51.080 --> 06:52.520 is doing something like this. 06:52.900 --> 06:58.140 For character1, character2 in zip of W, and W at 1 06:59.860 --> 07:00.560 colon. 07:01.720 --> 07:03.960 Print, character1, character2. 07:04.620 --> 07:05.740 And let's not do all the words. 07:05.840 --> 07:07.180 Let's just do the first three words. 07:07.380 --> 07:09.380 And I'm going to show you in a second how this works. 07:09.980 --> 07:13.960 But for now, basically, as an example, let's just do the very first word alone, emma. 07:13.960 --> 07:20.220 You see how we have emma, and this will just print em, mm, ma. 07:20.740 --> 07:24.980 And the reason this works is because W is the string emma, 07:25.440 --> 07:27.720 W at 1 colon is the string mma, 07:28.500 --> 07:33.080 and zip takes two iterators, and it pairs them up 07:33.080 --> 07:36.760 and then creates an iterator over the tuples of their consecutive entries. 07:37.400 --> 07:40.120 And if any one of these lists is shorter than the other, 07:40.120 --> 07:42.860 then it will just halt and return. 07:42.860 --> 07:49.340 So basically, that's why we return em, mm, ma. 07:50.000 --> 07:53.680 But then, because this iterator's second one here runs out of elements, 07:54.160 --> 07:57.200 zip just ends, and that's why we only get these tuples. 07:57.780 --> 07:58.440 So pretty cute. 07:59.520 --> 08:02.600 So these are the consecutive elements in the first word. 08:03.080 --> 08:05.600 Now, we have to be careful because we actually have more information here 08:05.600 --> 08:07.760 than just these three examples. 08:07.760 --> 08:12.120 As I mentioned, we know that E is very likely to come first, 08:12.860 --> 08:15.080 but that A, in this case, is coming last. 08:16.000 --> 08:18.080 So one way to do this is, basically, 08:18.080 --> 08:22.640 we're going to create a special array here, all characters, 08:23.320 --> 08:27.240 and we're going to hallucinate a special start token here. 08:28.760 --> 08:31.980 I'm going to call it like, special start. 08:32.780 --> 08:37.440 This is a list of one element plus W, 08:38.060 --> 08:40.520 and then plus a special end character.
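Before the special tokens come in, a tiny sketch of the zip trick itself, on the first word, emma:

```python
# zip pairs up consecutive characters and stops when the shorter iterator runs out.
w = 'emma'
for ch1, ch2 in zip(w, w[1:]):
    print(ch1, ch2)
# prints:
# e m
# m m
# m a
```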
08:40.520 --> 08:45.300 And the reason I'm wrapping the list of w here is because w is a string, Emma. 08:45.780 --> 08:49.780 List of w will just have the individual characters in the list. 08:50.560 --> 08:56.840 And then doing this again now, but not iterating over w's, but over the characters, 08:57.540 --> 08:59.240 will give us something like this. 09:00.180 --> 09:04.440 So e is likely, so this is a bigram of the start character and e, 09:04.640 --> 09:08.340 and this is a bigram of the a and the special end character. 09:08.340 --> 09:13.160 And now we can look at, for example, what this looks like for Olivia or Ava. 09:14.420 --> 09:17.780 And indeed, we can actually potentially do this for the entire dataset, 09:18.140 --> 09:19.160 but we won't print that. 09:19.220 --> 09:20.020 That's going to be too much. 09:20.800 --> 09:24.120 But these are the individual character bigrams, and we can print them. 09:25.000 --> 09:29.440 Now, in order to learn the statistics about which characters are likely to follow other characters, 09:29.740 --> 09:33.800 the simplest way in the bigram language models is to simply do it by counting. 09:34.220 --> 09:38.320 So we're basically just going to count how often any one of these combinations 09:38.440 --> 09:41.240 occurs in the training set in these words. 09:41.700 --> 09:45.320 So we're going to need some kind of a dictionary that's going to maintain some counts 09:45.320 --> 09:46.940 for every one of these bigrams. 09:46.940 --> 09:51.940 So let's use a dictionary b, and this will map these bigrams. 09:52.860 --> 09:55.060 So bigram is a tuple of character1, character2. 09:55.820 --> 10:03.700 And then b at bigram will be b.get of bigram, which is basically the same as b at bigram. 10:04.520 --> 10:08.280 But in the case that bigram is not in the dictionary b, 10:08.320 --> 10:12.360 we would like to, by default, return a 0, plus 1. 10:12.920 --> 10:17.560 So this will basically add up all the bigrams and count how often they occur. 10:18.140 --> 10:19.220 Let's get rid of printing. 10:20.000 --> 10:25.960 Or rather, let's keep the printing, and let's just inspect what b is in this case. 10:26.900 --> 10:29.940 And we see that many bigrams occur just a single time. 10:30.220 --> 10:32.300 This one allegedly occurred three times. 10:33.160 --> 10:37.300 So a was an ending character three times, and that's true for all of these words. 10:37.300 --> 10:40.660 All of Emma, Olivia, and Ava end with a. 10:41.760 --> 10:44.060 So that's why this occurred three times. 10:46.340 --> 10:48.540 Now let's do it for all the words. 10:51.040 --> 10:53.200 Oops, I should not have printed. 10:54.820 --> 10:56.080 I meant to erase that. 10:56.740 --> 10:57.800 Let's kill this. 10:58.720 --> 10:59.960 Let's just run. 11:00.640 --> 11:03.120 And now b will have the statistics of the entire dataset. 11:03.860 --> 11:07.120 So these are the counts across all the words of the individual bigrams. 11:07.300 --> 11:11.940 And we could, for example, look at some of the most common ones and least common ones. 11:13.240 --> 11:16.960 This kind of grows in Python, but the way to do this, the simplest way I like, 11:17.220 --> 11:18.880 is we just use b.items. 11:19.540 --> 11:25.020 b.items returns the tuples of key value. 11:25.320 --> 11:30.020 And in this case, the keys are the character bigrams, and the values are the counts. 11:30.660 --> 11:36.820 And so then what we want to do is we want to do sorted of this. 
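The counting loop just described, as a sketch, using the bracketed <S>/<E> names for the hallucinated start and end tokens (the sorting comes next):

```python
# Count how often each bigram (pair of consecutive characters) occurs,
# wrapping every word with a start token and an end token.
b = {}
for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        bigram = (ch1, ch2)
        b[bigram] = b.get(bigram, 0) + 1   # default to 0 if unseen, then add 1
```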
11:38.240 --> 11:45.280 But by default, sort is on the first item of a tuple. 11:45.580 --> 11:49.840 But we want to sort by the values, which are the second element of a tuple, that is the key value. 11:50.460 --> 11:56.500 So we want to use the key equals lambda that takes the key value 11:56.500 --> 12:03.620 and returns the key value at 1, not at 0, but at 1, which is the count. 12:03.620 --> 12:05.960 So we want to sort by the count 12:07.300 --> 12:08.500 of these elements. 12:10.200 --> 12:11.960 And actually, we want it to go backwards. 12:12.800 --> 12:17.600 So here what we have is the bigram QNR occurs only a single time. 12:18.600 --> 12:20.180 DZ occurred only a single time. 12:20.620 --> 12:25.900 And when we sort this the other way around, we're going to see the most likely bigrams. 12:26.240 --> 12:31.420 So we see that N was very often an ending character, many, many times. 12:31.420 --> 12:36.380 And apparently, N almost always follows an A, and that's a very likely combination as well. 12:37.300 --> 12:42.680 So this is kind of the individual counts that we achieve over the entire dataset. 12:42.840 --> 12:49.040 Now it's actually going to be significantly more convenient for us to keep this information in a 12:49.060 --> 12:50.180 two-dimensional array 12:52.720 --> 12:59.340 So we're going to store this information in a 2D array and the rows are going to be the 12:59.340 --> 13:04.000 first character of the bigram and the columns are going to be the second character, 13:04.000 --> 13:06.600 and each entry in this two-dimensional array will tell us 13:07.260 --> 13:13.420 how often that second character follows the first character in the data set. So in particular 13:13.420 --> 13:19.540 the array representation that we're going to use or the library is that of PyTorch and PyTorch is 13:19.540 --> 13:25.900 a deep learning neural network framework but part of it is also this torch.tensor which allows us 13:25.900 --> 13:31.940 to create multi-dimensional arrays and manipulate them very efficiently. So let's import PyTorch 13:31.940 --> 13:39.720 which you can do by import torch and then we can create arrays. So let's create an array of zeros 13:39.720 --> 13:50.060 and we give it a size of this array. Let's create a 3x5 array as an example and this is a 3x5 array 13:50.060 --> 13:57.000 of zeros and by default you'll notice a.dtype which is short for data type is float32. So these 13:57.000 --> 14:01.440 are single precision floating point numbers. Because we are going to represent counts, 14:01.920 --> 14:08.660 let's actually use dtype as torch.int32. So these are 32-bit 14:08.660 --> 14:16.300 integers. So now you see that we have integer data inside this tensor. Now tensors allow us to really 14:16.300 --> 14:22.100 manipulate all the individual entries and do it very efficiently. So for example if we want to 14:22.100 --> 14:29.240 change this bit we have to index into the tensor and in particular here, 14:29.380 --> 14:31.900 because it's 14:31.920 --> 14:40.880 zero indexed, this is row index one and column index zero one two three. So a at one comma three 14:40.880 --> 14:48.780 we can set that to one and then a will have a one over there. We can of course also do things like 14:48.780 --> 14:56.480 this.
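A short sketch recapping the two things just described: sorting the bigram counts by frequency, and the small torch.zeros demo with an integer dtype (the 3x5 shape and the indices are just the example values used above):

```python
import torch

# Most common bigrams first: sort the (bigram, count) items by count, descending.
print(sorted(b.items(), key=lambda kv: -kv[1])[:5])

# A small 3x5 tensor of integer zeros, with element assignment via 2D indexing.
a = torch.zeros((3, 5), dtype=torch.int32)
a[1, 3] = 1      # row index 1, column index 3 (zero-indexed)
print(a)
```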
So now a will be two over there or three and also we can for example say a zero zero is five 14:56.960 --> 15:01.900 and then a will have a five over here. So that's how we can index into. 15:01.920 --> 15:06.840 the arrays. Now of course the array that we are interested in is much much bigger. So for our 15:06.840 --> 15:15.200 purposes we have 26 letters of the alphabet and then we have two special characters s and e. So we 15:15.200 --> 15:22.080 want 26 plus 2 or 28 by 28 array and let's call it the capital N because it's going to represent 15:22.080 --> 15:29.880 sort of the counts. Let me erase this stuff. So that's the array that starts at zeros 28 by 28 and 15:29.880 --> 15:31.880 now let's copy paste that into the array. So that's the array that starts at zeros 28 by 28 and now let's copy paste the 15:31.880 --> 15:41.280 this here. But instead of having a dictionary b which we're going to erase we now have an n. Now 15:41.280 --> 15:46.240 the problem here is that we have these characters which are strings but we have to now basically 15:46.240 --> 15:52.680 index into a array and we have to index using integers. So we need some kind of a lookup table 15:52.680 --> 15:58.780 from characters to integers. So let's construct such a character array and the way we're going 15:58.780 --> 16:01.860 to do this is we're going to take all the words which is a list of strings and we're going to 16:01.880 --> 16:07.680 concatenate all of it into a massive string. So this is just simply the entire data set as a single 16:07.680 --> 16:13.840 string. We're going to pass this to the set constructor which takes this massive string 16:14.400 --> 16:20.480 and throws out duplicates because sets do not allow duplicates. So set of this will just be 16:20.480 --> 16:26.160 the set of all the lowercase characters and there should be a total of 26 of them. 16:28.560 --> 16:30.640 And now we actually don't want a set we want a list. 16:31.880 --> 16:36.600 But we don't want a list sorted in some weird arbitrary way we want it to be sorted 16:37.560 --> 16:43.000 from a to z. So sorted list. So those are our characters. 16:45.560 --> 16:51.080 Now what we want is this lookup table as I mentioned. So let's create a special s to i 16:51.080 --> 17:01.560 I will call it. s is string or character and this will be an s to i mapping for is in enumerate 17:01.880 --> 17:09.960 of these characters. So enumerate basically gives us this iterator over the integer index and the 17:09.960 --> 17:17.000 actual element of the list and then we are mapping the character to the integer. So s to i is a 17:17.000 --> 17:25.640 mapping from a to 0 b to 1 etc all the way from z to 25. And that's going to be useful here but we 17:25.640 --> 17:31.240 actually also have to specifically set that s will be 26 and s to i at e. 17:32.040 --> 17:39.320 Will be 27 right because z was 25. So those are the lookups and now we can come here and we can map 17:39.880 --> 17:44.600 both character 1 and character 2 to their integers. So this will be s to i at character 1 17:45.240 --> 17:53.080 and i x 2 will be s to i of character 2. And now we should be able to do this line 17:53.080 --> 18:01.560 but using our array. So n at i x 1 i x 2 this is the two-dimensional array indexing I've shown you before and honestly just plus equals 1. 18:02.840 --> 18:12.120 Because everything starts at 0. So this should work and give us a large 28 by 28 array 18:12.920 --> 18:20.760 of all these counts. 
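A sketch of the character-to-integer lookup table and the 28x28 count matrix as described so far, written with the stoi/itos naming and the bracketed special tokens; this is the two-special-token version that gets revised to a single token shortly:

```python
import torch

# 26 letters plus the two special tokens <S> and <E> gives a 28x28 count matrix.
N = torch.zeros((28, 28), dtype=torch.int32)

chars = sorted(list(set(''.join(words))))    # the 26 lowercase letters, a..z
stoi = {s: i for i, s in enumerate(chars)}   # a->0, b->1, ..., z->25
stoi['<S>'] = 26
stoi['<E>'] = 27

for w in words:
    chs = ['<S>'] + list(w) + ['<E>']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1 = stoi[ch1]
        ix2 = stoi[ch2]
        N[ix1, ix2] += 1                     # count this bigram
```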
So if we print n this is the array but of course it looks ugly. So let's erase 18:20.760 --> 18:26.280 this ugly mess and let's try to visualize it a bit more nicer. So for that we're going to use 18:26.280 --> 18:31.160 a library called matplotlib. So matplotlib allows us to create figures. So we can do things like this. 18:31.880 --> 18:40.920 We can do things like plti and show of the count array. So this is the 28 by 28 array and this is the structure. 18:40.920 --> 18:46.040 But even this I would say is still pretty ugly. So we're going to try to create a much nicer 18:46.040 --> 18:51.160 visualization of it and I wrote a bunch of code for that. The first thing we're going to need is 18:51.880 --> 19:00.360 we're going to need to invert this array here, this dictionary. So s to i is a mapping from s to i and in i to s we're going to reverse the array. 19:01.880 --> 19:08.440 So iterating over all the items and just reverse that array. So i to s maps inversely from 0 to a, 19:08.440 --> 19:15.000 1 to b, etc. So we'll need that. And then here's the code that I came up with to try to make this a little bit nicer. 19:17.080 --> 19:23.640 We create a figure, we plot n and then we visualize a bunch of things later. 19:23.640 --> 19:26.200 Let me just run it so you get a sense of what this is. 19:29.880 --> 19:30.840 So we're going to do this. 19:31.880 --> 19:34.200 Okay, so you see here that we have 19:35.240 --> 19:41.640 the array spaced out and every one of these is basically like b follows g 0 times. 19:42.280 --> 19:49.880 b follows h 41 times. So a follows j 175 times. What you can see that I'm doing here is 19:49.880 --> 19:55.640 first I show that entire array and then I iterate over all the individual little cells here 19:56.680 --> 20:01.640 and I create a character string here which is the inverse mapping, i to s, 20:01.880 --> 20:04.740 of the integer i and the integer j. 20:04.740 --> 20:07.800 So those are the bigrams in a character representation. 20:08.660 --> 20:12.200 And then I plot just the bigram text. 20:12.200 --> 20:14.220 And then I plot the number of times 20:14.220 --> 20:16.160 that this bigram occurs. 20:16.160 --> 20:18.440 Now, the reason that there's a dot item here 20:18.440 --> 20:21.080 is because when you index into these arrays, 20:21.080 --> 20:23.100 these are torch tensors, 20:23.100 --> 20:26.080 you see that we still get a tensor back. 20:26.080 --> 20:27.740 So the type of this thing, 20:27.740 --> 20:29.780 you'd think it would be just an integer, 149, 20:29.780 --> 20:32.040 but it's actually a torch dot tensor. 20:32.040 --> 20:34.460 And so if you do dot item, 20:34.460 --> 20:37.320 then it will pop out that individual integer. 20:38.540 --> 20:40.740 So it'll just be 149. 20:40.740 --> 20:42.480 So that's what's happening there. 20:42.480 --> 20:45.380 And these are just some options to make it look nice. 20:45.380 --> 20:47.280 So what is the structure of this array? 20:49.340 --> 20:50.180 We have all these counts 20:50.180 --> 20:51.980 and we see that some of them occur often 20:51.980 --> 20:54.080 and some of them do not occur often. 20:54.080 --> 20:56.080 Now, if you scrutinize this carefully, 20:56.080 --> 20:58.740 you will notice that we're not actually being very clever. 20:58.740 --> 20:59.780 That's because when you come over here 20:59.780 --> 21:01.700 you'll notice that, for example, 21:01.700 --> 21:04.720 we have an entire row of completely zeros. 
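Roughly what the nicer matplotlib visualization described above might look like; the figure size and colors here are guesses rather than the exact code from the walkthrough, and the observations about the rows and columns of zeros continue right after this sketch:

```python
import matplotlib.pyplot as plt

itos = {i: s for s, i in stoi.items()}   # invert stoi: 0->'a', 1->'b', ...

plt.figure(figsize=(16, 16))
plt.imshow(N, cmap='Blues')
for i in range(28):
    for j in range(28):
        chstr = itos[i] + itos[j]                       # the bigram as text
        plt.text(j, i, chstr, ha='center', va='bottom', color='gray')
        plt.text(j, i, str(N[i, j].item()), ha='center', va='top', color='gray')
plt.axis('off')
plt.show()
```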
21:04.720 --> 21:07.100 And that's because the end character 21:07.100 --> 21:09.120 is never possibly going to be the first character 21:09.120 --> 21:09.960 of a bigram, 21:09.960 --> 21:11.980 because we're always placing these end tokens 21:11.980 --> 21:14.380 at the end of each word. 21:14.380 --> 21:17.480 Similarly, we have entire columns of zeros here 21:17.480 --> 21:20.200 because the S character 21:20.200 --> 21:23.420 will never possibly be the second element of a bigram 21:23.420 --> 21:25.800 because we always start with S and we end with E 21:25.800 --> 21:27.780 and we only have the words in between. 21:27.780 --> 21:29.440 So we have an entire column of zeros, 21:29.440 --> 21:31.800 an entire row of zeros, 21:31.800 --> 21:34.120 and in this little two by two matrix here as well, 21:34.120 --> 21:36.060 the only one that can possibly happen 21:36.060 --> 21:38.620 is if E directly follows S. 21:38.620 --> 21:43.140 That can be non-zero if we have a word that has no letters. 21:43.140 --> 21:44.720 So in that case, there's no letters in the word, 21:44.720 --> 21:47.640 it's an empty word, and we just have E following S. 21:47.640 --> 21:50.220 But the other ones are just not possible. 21:50.220 --> 21:51.760 And so we're basically wasting space. 21:51.760 --> 21:52.600 And not only that, 21:52.600 --> 21:55.680 but the S and the E are getting very crowded here. 21:55.680 --> 21:56.920 I was using these brackets 21:56.920 --> 21:59.320 because it's a convention in natural language processing 21:59.320 --> 22:03.340 to use these kinds of brackets to denote special tokens. 22:03.340 --> 22:05.280 But we're going to use something else. 22:05.280 --> 22:08.340 So let's fix all this and make it prettier. 22:08.340 --> 22:10.420 We're not actually going to have two special tokens. 22:10.420 --> 22:13.040 We're only going to have one special token. 22:13.040 --> 22:17.840 So we're going to have an n by n array of 27 by 27 instead. 22:18.880 --> 22:21.660 Instead of having two, we will just have one, 22:21.660 --> 22:23.180 and I will call it a dot. 22:24.880 --> 22:25.720 Okay. 22:27.420 --> 22:28.960 Let me swing this over here. 22:29.320 --> 22:31.980 Now, one more thing that I would like to do 22:31.980 --> 22:34.480 is I would actually like to make this special character 22:34.480 --> 22:36.340 have position zero. 22:36.340 --> 22:39.040 And I would like to offset all the other letters by one. 22:39.040 --> 22:41.280 I find that a little bit more pleasing. 22:42.620 --> 22:47.220 So we need a plus one here so that the first character, 22:47.220 --> 22:49.920 which is A, will start at one. 22:49.920 --> 22:54.920 So in S to I, A will now start at one and dot is zero. 22:55.920 --> 22:58.960 And I to S, of course, we're not changing this, 22:58.960 --> 23:01.020 because I to S just creates a reverse mapping 23:01.020 --> 23:02.280 and this will work fine. 23:02.280 --> 23:05.240 So one is A, two is B, zero is dot. 23:06.680 --> 23:09.160 So we've reversed that here. 23:09.160 --> 23:11.520 We have a dot here and a dot here. 23:13.040 --> 23:14.880 This should work fine. 23:14.880 --> 23:16.220 Make sure N starts at zeros 23:17.900 --> 23:18.860 for the counts. 23:18.860 --> 23:21.700 And then here, we don't go up to 28, we go up to 27. 23:22.660 --> 23:24.820 And this should just work. 23:28.960 --> 23:33.580 Okay, so we see that dot dot never happened. 23:33.580 --> 23:36.520 It's at zero because we don't have empty words.
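A sketch of the revised setup just described: one '.' special token at index 0, the letters a..z offset to indices 1..26, and a 27x27 count matrix:

```python
import torch

N = torch.zeros((27, 27), dtype=torch.int32)

stoi = {s: i + 1 for i, s in enumerate(chars)}   # a->1, b->2, ..., z->26
stoi['.'] = 0                                    # the single special token
itos = {i: s for s, i in stoi.items()}           # reverse mapping still works

for w in words:
    chs = ['.'] + list(w) + ['.']                # dot at the start, dot at the end
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1
```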
23:36.520 --> 23:39.480 Then this row here now is just very simply 23:39.480 --> 23:43.560 the counts for all the first letters. 23:43.560 --> 23:48.560 So J starts a word, H starts a word, I starts a word, etc. 23:49.620 --> 23:53.020 And then these are all the ending characters. 23:53.020 --> 23:54.580 And in between, we have the structure 23:54.580 --> 23:57.120 of what characters follow each other. 23:57.120 --> 23:58.820 So this is the counts array. 23:58.820 --> 24:01.740 This is the counts array of our entire data set. 24:01.740 --> 24:04.460 So this array actually has all the information necessary 24:04.460 --> 24:06.040 for us to actually sample 24:06.040 --> 24:09.720 from this bigram character-level language model. 24:09.720 --> 24:12.200 And roughly speaking, what we're going to do 24:12.200 --> 24:14.680 is we're just going to start following these probabilities 24:14.680 --> 24:16.860 and these counts, and we're going to start sampling 24:16.860 --> 24:18.900 from the model. 24:18.900 --> 24:21.860 So in the beginning, of course, we start with the dot, 24:21.860 --> 24:24.640 the start token dot. 24:24.640 --> 24:28.180 So to sample the first character of a name, 24:28.180 --> 24:28.380 we're looking at this right here. 24:28.380 --> 24:28.640 So we're looking at this right here. 24:28.640 --> 24:30.600 So we're looking at this right here. 24:30.600 --> 24:32.740 So we see that we have the counts, 24:32.740 --> 24:34.680 and those counts externally are telling us 24:34.680 --> 24:39.580 how often any one of these characters is to start a word. 24:39.580 --> 24:43.980 So if we take this N and we grab the first row, 24:44.880 --> 24:48.460 we can do that by using just indexing a zero, 24:48.460 --> 24:51.080 and then using this notation, colon, 24:51.080 --> 24:53.700 for the rest of that row. 24:53.700 --> 24:58.200 So N zero colon is indexing into the zero, 24:58.200 --> 25:01.960 and then it's grabbing all the columns. 25:01.960 --> 25:05.240 And so this will give us a one-dimensional array 25:05.240 --> 25:06.140 of the first row. 25:06.140 --> 25:08.440 So zero, four, four, 10. 25:08.440 --> 25:10.400 You know, it's zero, four, four, 10, 25:10.400 --> 25:12.940 one, three, oh, six, one, five, four, two, et cetera. 25:12.940 --> 25:14.400 It's just the first row. 25:14.400 --> 25:17.140 The shape of this is 27. 25:17.140 --> 25:19.840 It's just the row of 27. 25:19.840 --> 25:21.940 And the other way that you can do this also is you just, 25:21.940 --> 25:23.760 you don't actually give this, 25:23.760 --> 25:26.260 you just grab the zeroth row like this. 25:26.260 --> 25:27.260 This is equivalent. 25:28.200 --> 25:30.000 Now, these are the counts. 25:30.000 --> 25:31.640 And now what we'd like to do 25:31.640 --> 25:35.060 is we'd like to basically sample from this. 25:35.060 --> 25:36.140 Since these are the raw counts, 25:36.140 --> 25:39.160 we actually have to convert this to probabilities. 25:39.160 --> 25:41.860 So we create a probability vector. 25:42.960 --> 25:45.060 So we'll take N of zero, 25:45.060 --> 25:48.960 and we'll actually convert this to float first. 25:50.100 --> 25:52.900 Okay, so these integers are converted to float, 25:52.900 --> 25:54.140 floating point numbers. 25:54.140 --> 25:55.700 And the reason we're creating floats 25:55.700 --> 25:58.100 is because we're about to normalize these counts. 
25:58.200 --> 26:00.860 So to create a probability distribution here, 26:00.860 --> 26:02.060 we want to divide, 26:02.060 --> 26:06.060 we basically want to do p, p divide, p.sum. 26:08.960 --> 26:11.460 And now we get a vector of smaller numbers, 26:11.460 --> 26:13.040 and these are now probabilities. 26:13.040 --> 26:15.300 So of course, because we divided by the sum, 26:15.300 --> 26:18.200 the sum of p now is one. 26:18.200 --> 26:20.440 So this is a nice proper probability distribution. 26:20.440 --> 26:21.600 It sums to one. 26:21.600 --> 26:22.940 And this is giving us the probability 26:22.940 --> 26:27.140 for any single character to be the first character of a word. 26:27.140 --> 26:28.100 So we can do this. 26:28.100 --> 26:30.860 So now we can try to sample from this distribution. 26:30.860 --> 26:32.260 To sample from these distributions, 26:32.260 --> 26:34.260 we're going to use torch.multinomial, 26:34.260 --> 26:36.300 which I've pulled up here. 26:36.300 --> 26:41.040 So torch.multinomial returns samples 26:41.040 --> 26:43.400 from the multinomial probability distribution, 26:43.400 --> 26:45.240 which is a complicated way of saying, 26:45.240 --> 26:48.140 you give me probabilities and I will give you integers, 26:48.140 --> 26:51.760 which are sampled according to the probability distribution. 26:51.760 --> 26:53.340 So this is the signature of the method. 26:53.340 --> 26:54.860 And to make everything deterministic, 26:54.860 --> 26:57.960 we're going to use a generator object in PyTorch. 26:58.100 --> 27:00.960 So this makes everything deterministic. 27:00.960 --> 27:02.600 So when you run this on your computer, 27:02.600 --> 27:04.660 you're going to get the exact same results 27:04.660 --> 27:07.240 that I'm getting here on my computer. 27:07.240 --> 27:09.040 So let me show you how this works. 27:12.760 --> 27:14.400 Here's the deterministic way 27:14.400 --> 27:18.100 of creating a torch generator object, 27:18.100 --> 27:21.260 seeding it with some number that we can agree on. 27:21.260 --> 27:24.940 So that seeds a generator, gives us an object g. 27:24.940 --> 27:27.260 And then we can pass that g to a function, 27:27.260 --> 27:31.860 a function that creates here random numbers. 27:31.860 --> 27:35.320 torch.rand creates random numbers, three of them. 27:35.320 --> 27:37.660 And it's using this generator object 27:37.660 --> 27:40.400 as a source of randomness. 27:40.400 --> 27:46.600 So without normalizing it, I can just print. 27:46.600 --> 27:49.020 This is sort of like numbers between 0 and 1 27:49.020 --> 27:51.260 that are random according to this thing. 27:51.260 --> 27:53.520 And whenever I run it again, I'm always 27:53.520 --> 27:55.300 going to get the same result because I keep 27:55.300 --> 27:57.160 using the same generator object, which I'm 27:57.160 --> 27:58.860 seeding here. 27:58.860 --> 28:02.920 And then if I divide to normalize, 28:02.920 --> 28:05.220 I'm going to get a nice probability distribution 28:05.220 --> 28:07.600 of just three elements. 28:07.600 --> 28:09.400 And then we can use torch.multinomial 28:09.400 --> 28:11.220 to draw samples from it. 28:11.220 --> 28:13.760 So this is what that looks like. 28:13.760 --> 28:18.420 torch.multinomial will take the torch tensor 28:18.420 --> 28:21.100 of probability distributions. 28:21.100 --> 28:24.600 Then we can ask for a number of samples, let's say 20. 
28:24.600 --> 28:27.060 Replacement equals true means that when 28:27.060 --> 28:30.720 we draw an element, we can draw it, 28:30.720 --> 28:34.360 and then we can put it back into the list of eligible indices 28:34.360 --> 28:35.960 to draw again. 28:35.960 --> 28:37.820 And we have to specify replacement as true 28:37.820 --> 28:41.700 because by default, for some reason, it's false. 28:41.700 --> 28:45.800 And I think it's just something to be careful with. 28:45.800 --> 28:47.440 And the generator is passed in here. 28:47.440 --> 28:50.180 So we are going to always get deterministic results, 28:50.180 --> 28:51.460 the same results. 28:51.460 --> 28:54.180 So if I run these two, we're going 28:54.180 --> 28:56.860 to get a bunch of samples from this distribution. 28:56.860 --> 28:59.600 Now, you'll notice here that the probability 28:59.600 --> 29:04.600 for the first element in this tensor is 60%. 29:04.600 --> 29:10.800 So in these 20 samples, we'd expect 60% of them to be 0. 29:10.800 --> 29:14.420 We'd expect 30% of them to be 1. 29:14.420 --> 29:19.520 And because the element index 2 has only 10% probability, 29:19.520 --> 29:22.320 very few of these samples should be 2. 29:22.320 --> 29:25.560 And indeed, we only have a small number of 2s. 29:25.560 --> 29:26.520 And we can sample as many as we want. 29:26.520 --> 29:31.820 And the more we sample, the more these numbers 29:31.820 --> 29:35.920 should roughly have the distribution here. 29:35.920 --> 29:42.580 So we should have lots of 0s, half as many 1s. 29:42.580 --> 29:48.960 And we should have three times as few 1s and three times 29:48.960 --> 29:51.840 as few 2s. 29:51.840 --> 29:53.420 So you see that we have very few 2s. 29:53.420 --> 29:55.780 We have some 1s, and most of them are 0s. 29:55.780 --> 29:56.300 So that's what we're going to do. 29:56.300 --> 29:56.500 Thank you. 29:56.520 --> 29:58.900 So that's what Torchlight Multinomial is doing. 29:58.900 --> 30:02.460 For us here, we are interested in this row. 30:02.460 --> 30:06.940 We've created this p here. 30:06.940 --> 30:09.760 And now we can sample from it. 30:09.760 --> 30:13.800 So if we use the same seed, and then we 30:13.800 --> 30:18.200 sample from this distribution, and let's just get one sample, 30:18.200 --> 30:22.720 then we see that the sample is, say, 13. 30:22.720 --> 30:25.300 So this will be the index. 30:25.300 --> 30:26.300 And let's see. 30:26.300 --> 30:28.860 See how it's a tensor that wraps 13? 30:28.860 --> 30:33.060 We again have to use .item to pop out that integer. 30:33.060 --> 30:37.540 And now index would be just the number 13. 30:37.540 --> 30:42.960 And of course, we can map the i2s of ix 30:42.960 --> 30:46.120 to figure out exactly which character we're sampling here. 30:46.120 --> 30:48.120 We're sampling m. 30:48.120 --> 30:51.280 So we're saying that the first character is m 30:51.280 --> 30:53.200 in our generation. 30:53.200 --> 30:56.080 And just looking at the row here, m was drawn. 30:56.080 --> 31:00.180 And we can see that m actually starts a large number of words. 31:00.180 --> 31:04.780 m started 2,500 words out of 32,000 words. 31:04.780 --> 31:09.200 So almost a bit less than 10% of the words start with m. 31:09.200 --> 31:11.580 So this was actually a fairly likely character to draw. 31:15.380 --> 31:17.160 So that would be the first character of our word. 31:17.160 --> 31:19.800 And now we can continue to sample more characters, 31:19.800 --> 31:24.840 because now we know that m is already sampled. 
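A sketch of the generator and torch.multinomial demo described above, followed by sampling the first character from the normalized first row of N; the seed 2147483647 is the one used in the walkthrough:

```python
import torch

# Deterministic randomness: seed a generator and pass it to the sampling calls.
g = torch.Generator().manual_seed(2147483647)

p = torch.rand(3, generator=g)
p = p / p.sum()                                  # a small 3-element probability distribution
print(p)
print(torch.multinomial(p, num_samples=20, replacement=True, generator=g))

# Sample the first character of a name from the first row of the count matrix.
g = torch.Generator().manual_seed(2147483647)    # fresh generator, as in the walkthrough
p = N[0].float()
p = p / p.sum()                                  # counts -> probabilities
ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
print(itos[ix])                                  # 'm' in the walkthrough
```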
31:24.840 --> 31:25.880 So now to draw the next character, we're going to use m. 31:26.080 --> 31:32.760 And we'll come back here, and we will look for the row that starts with m. 31:32.760 --> 31:36.800 So you see m, and we have a row here. 31:36.800 --> 31:40.760 So we see that m dot is 516, 31:40.760 --> 31:43.820 m a is this many, m b is this many, etc. 31:43.820 --> 31:45.660 So these are the counts for the next row, 31:45.660 --> 31:48.720 and that's the next character that we are going to now generate. 31:48.720 --> 31:51.260 So I think we are ready to actually just write out the loop, 31:51.260 --> 31:54.560 because I think you're starting to get a sense of how this is going to go. 31:54.560 --> 31:55.960 The... 31:55.960 --> 32:00.780 We always begin at index zero because that's the start token and 32:02.200 --> 32:04.200 Then while true 32:04.640 --> 32:10.400 We're going to grab the row corresponding to the index that we're currently on so that's p 32:10.840 --> 32:13.440 So that's the n array at ix 32:14.400 --> 32:16.500 Converted to float is our p 32:18.820 --> 32:22.580 Then we normalize this p to sum to one 32:22.580 --> 32:24.580 I 32:25.540 --> 32:32.240 accidentally ran the infinite loop. We normalize p to sum to one, then we need this generator object 32:33.600 --> 32:37.640 which we're going to initialize up here, and we're going to draw a single sample from this distribution 32:39.120 --> 32:40.700 And 32:40.700 --> 32:44.660 Then this is going to tell us what index is going to be next 32:46.200 --> 32:51.420 If the index sampled is zero then that's now the end token 32:52.580 --> 32:54.580 So we will break 32:55.260 --> 32:59.560 Otherwise we are going to print s2i of ix... 33:02.300 --> 33:04.300 sorry, i2s of ix 33:05.700 --> 33:09.100 That's pretty much it. This should work 33:10.140 --> 33:11.840 Okay: mor. 33:11.840 --> 33:19.440 So that's the name that we've sampled. We started with M. The next step was O then R and then dot 33:21.340 --> 33:22.400 And this dot 33:22.400 --> 33:24.400 we printed here as well, so 33:26.220 --> 33:28.220 Let's now do this a few times 33:29.720 --> 33:34.640 So let's actually create an out list here 33:36.140 --> 33:41.740 And instead of printing we're going to append so out dot append this character 33:42.900 --> 33:44.180 and 33:44.180 --> 33:46.640 Then here let's just print it at the end 33:46.640 --> 33:52.240 So let's just join up all the outs, and we're just going to print it: mor. Okay, now we're 33:52.240 --> 33:56.800 always getting the same result because of the generator so if we want to do this a few times 33:56.800 --> 34:03.760 we can go for i in range of 10, we can sample 10 names and we can just do that 10 times 34:05.600 --> 34:09.200 and these are the names that we're getting out let's do 20.
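The full sampling loop written out above, as a sketch; here the sampled character is appended before checking for the end token, so the trailing dot gets printed too (as in the 'mor.' example):

```python
import torch

g = torch.Generator().manual_seed(2147483647)

for i in range(10):
    out = []
    ix = 0                       # always begin at the '.' start token
    while True:
        p = N[ix].float()
        p = p / p.sum()          # normalize the current row (made more efficient below)
        ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
        out.append(itos[ix])
        if ix == 0:              # sampled the end token
            break
    print(''.join(out))          # e.g. 'mor.' and other (mostly terrible) names
```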
34:14.160 --> 34:18.480 i'll be honest with you this doesn't look right so i started a few minutes to convince myself 34:18.480 --> 34:24.160 that it actually is right the reason these samples are so terrible is that bigram language model 34:24.800 --> 34:29.040 is actually just like really terrible we can generate a few more here 34:30.000 --> 34:33.840 and you can see that they're kind of like their name like a little bit like yanu 34:33.840 --> 34:40.880 riley etc but they're just like totally messed up and i mean the reason that this is so bad like 34:40.880 --> 34:46.400 we're generating h as a name but you have to think through it from the model's eyes 34:46.400 --> 34:48.400 it doesn't know that this h is different 34:48.480 --> 34:55.940 very first h all it knows is that h was previously and now how likely is h the last character well 34:55.940 --> 35:00.540 it's somewhat likely and so it just makes it last character it doesn't know that there were other 35:00.540 --> 35:05.500 things before it or there were not other things before it and so that's why it's generating all 35:05.500 --> 35:13.260 these like some nonsense names another way to do this is to convince yourself that it's actually 35:13.260 --> 35:20.220 doing something reasonable even though it's so terrible is these little piece here are 27 right 35:20.220 --> 35:28.200 like 27 so how about if we did something like this instead of p having any structure whatsoever 35:28.720 --> 35:32.440 how about if p was just torch dot ones 35:32.440 --> 35:40.940 of 27 by default this is a float 32 so this is fine divide 27 35:40.940 --> 35:43.260 so what i'm 35:43.260 --> 35:48.560 doing here is this is the uniform distribution which will make everything equally likely 35:48.560 --> 35:56.580 and we can sample from that so let's see if that does any better okay so it's this is what you 35:56.580 --> 36:01.100 have from a model that is completely untrained where everything is equally likely so it's 36:01.100 --> 36:07.500 obviously garbage and then if we have a trained model which is trained on just bigrams this is 36:07.500 --> 36:12.560 what we get so you can see that it is more name like it is actually working it's just 36:12.560 --> 36:18.620 bigram is so terrible and we have to do better now next i would like to fix an inefficiency that 36:18.620 --> 36:24.220 we have going on here because what we're doing here is we're always fetching a row of n from 36:24.220 --> 36:28.980 the counts matrix up ahead and then we're always doing the same things we're converting to float 36:28.980 --> 36:33.420 and we're dividing and we're doing this every single iteration of this loop and we just keep 36:33.420 --> 36:36.780 renormalizing these rows over and over again and it's extremely inefficient and wasteful 36:36.780 --> 36:37.480 so we're doing this every single iteration of this loop and we just keep renormalizing these rows over 36:37.480 --> 36:42.360 so what i'd like to do is i'd like to actually prepare a matrix capital p that will just have 36:42.360 --> 36:47.100 the probabilities in it so in other words it's going to be the same as the capital n matrix here 36:47.100 --> 36:52.700 of counts but every single row will have the row of probabilities that is normalized to one 36:52.700 --> 36:57.500 indicating the probability distribution for the next character given the character before it 36:57.500 --> 37:03.920 as defined by which row we're in so basically what we'd like to do is we'd like to just do 37:03.920 --> 37:07.220 it up 
front here and then we would like to just use that row here 37:07.480 --> 37:16.020 so here we would like to just do p equals p of i x instead okay the other reason i want to do this 37:16.020 --> 37:21.360 is not just for efficiency but also i would like us to practice these n-dimensional tensors and 37:21.360 --> 37:25.180 i'd like us to practice their manipulation and especially something that's called broadcasting 37:25.180 --> 37:29.220 that we'll go into in a second we're actually going to have to become very good at these 37:29.220 --> 37:33.520 tensor manipulations because if we're going to build out all the way to transformers we're going 37:33.520 --> 37:37.460 to be doing some pretty complicated array operations for efficiency and we're going to have to do some 37:37.480 --> 37:39.720 pretty complicated array operations for efficiency and we need to really understand that and be very 37:39.720 --> 37:45.460 good at it so intuitively what we want to do is we first want to grab the floating point 37:45.460 --> 37:52.800 copy of n and i'm mimicking the line here basically and then we want to divide all the rows 37:52.800 --> 37:58.820 so that they sum to one so we'd like to do something like this p divide p dot sum 37:58.820 --> 38:06.440 but now we have to be careful because p dot sum actually produces a sum 38:07.480 --> 38:17.040 sorry p equals n dot float copy p dot sum produces a um sums up all of the counts of this entire 38:17.040 --> 38:22.280 matrix n and gives us a single number of just the summation of everything so that's not the way we 38:22.280 --> 38:28.240 want to define divide we want to simultaneously and in parallel divide all the rows by their 38:28.240 --> 38:34.760 respective sums so what we have to do now is we have to go into documentation for torch.sum 38:34.760 --> 38:37.460 and we can scroll down here to a definition of the sum and we can see that the sum is 38:37.480 --> 38:42.240 a definition that is relevant to us which is where we don't only provide an input array 38:42.240 --> 38:47.540 that we want to sum but we also provide the dimension along which we want to sum and in 38:47.540 --> 38:53.940 particular we want to sum up over rows right now one more argument that i want you to pay 38:53.940 --> 39:00.980 attention to here is the keep them is false if keep them is true then the output tensor 39:00.980 --> 39:05.020 is of the same size as input except of course the dimension along which you summed which 39:05.020 --> 39:07.400 will become just one 39:07.480 --> 39:15.700 but if you pass in uh keep them as false then this dimension is squeezed out and so torch.sum 39:15.700 --> 39:20.140 not only does the sum and collapses dimension to be of size one but in addition it does 39:20.140 --> 39:26.360 what's called a squeeze where it squeeze out it squeezes out that dimension so basically 39:26.360 --> 39:32.140 what we want here is we instead want to do p dot sum of sum axis and in particular notice 39:32.140 --> 39:37.420 that p dot shape is 27 by 27 so when we sum up across axis 0 39:37.480 --> 39:39.780 then we would be taking the 0th dimension 39:39.780 --> 39:41.480 and we would be summing across it 39:41.480 --> 39:43.900 so when keep dim is true 39:43.900 --> 39:45.900 then this thing 39:45.900 --> 39:48.000 will not only give us the counts 39:48.000 --> 39:48.560 across 39:48.560 --> 39:50.940 along the columns 39:50.940 --> 39:53.980 but notice that basically the shape of this 39:53.980 --> 39:55.220 is 1 by 27 39:55.220 --> 39:56.460 we just 
get a row vector 39:56.460 --> 39:59.320 and the reason we get a row vector here again 39:59.320 --> 40:00.600 is because we passed in 0 dimension 40:00.600 --> 40:02.740 so this 0th dimension becomes 1 40:02.740 --> 40:04.000 and we've done a sum 40:04.000 --> 40:05.520 and we get a row 40:05.520 --> 40:07.360 and so basically we've done the sum 40:07.360 --> 40:09.740 this way, vertically 40:09.740 --> 40:12.180 and arrived at just a single 1 by 27 40:12.180 --> 40:13.760 vector of counts 40:13.760 --> 40:16.800 what happens when you take out keep dim 40:16.800 --> 40:19.060 is that we just get 27 40:19.060 --> 40:20.500 so it squeezes out 40:20.500 --> 40:21.300 that dimension 40:21.300 --> 40:24.680 and we just get a 1 dimensional vector 40:24.680 --> 40:25.760 of size 27 40:25.760 --> 40:29.960 now we don't actually want 40:29.960 --> 40:32.640 1 by 27 row vector 40:32.640 --> 40:34.180 because that gives us the 40:34.180 --> 40:35.660 counts or the sums 40:35.660 --> 40:36.340 across 40:36.340 --> 40:37.340 0th 40:37.360 --> 40:39.600 the columns 40:39.600 --> 40:41.340 we actually want to sum the other way 40:41.340 --> 40:42.860 along dimension 1 40:42.860 --> 40:45.800 and you'll see that the shape of this is 27 by 1 40:45.800 --> 40:47.500 so it's a column vector 40:47.500 --> 40:50.020 it's a 27 by 1 40:50.020 --> 40:53.980 vector of counts 40:53.980 --> 40:56.980 and that's because what's happened here is that we're going horizontally 40:56.980 --> 40:59.960 and this 27 by 27 matrix becomes a 40:59.960 --> 41:03.680 27 by 1 array 41:03.680 --> 41:06.360 now you'll notice by the way that 41:06.360 --> 41:07.340 the actual numbers 41:07.360 --> 41:09.600 of these counts are identical 41:09.600 --> 41:13.140 and that's because this special array of counts here 41:13.140 --> 41:14.420 comes from bigram statistics 41:14.420 --> 41:16.180 and actually it just so happens 41:16.180 --> 41:17.180 by chance 41:17.180 --> 41:19.720 or because of the way this array is constructed 41:19.720 --> 41:21.480 that the sums along the columns 41:21.480 --> 41:22.500 or along the rows 41:22.500 --> 41:23.900 horizontally or vertically 41:23.900 --> 41:24.940 is identical 41:24.940 --> 41:27.700 but actually what we want to do in this case 41:27.700 --> 41:29.480 is we want to sum across the 41:29.480 --> 41:30.500 rows 41:30.500 --> 41:31.720 horizontally 41:31.720 --> 41:33.540 so what we want here 41:33.540 --> 41:34.560 is p.sum of 1 41:34.560 --> 41:35.760 with keep dim true 41:37.360 --> 41:39.600 27 by 1 column vector 41:39.600 --> 41:42.000 and now what we want to do is we want to divide by that 41:42.000 --> 41:46.300 now we have to be careful here again 41:46.300 --> 41:48.840 is it possible to take 41:48.840 --> 41:51.420 what's a p.shape you see here 41:51.420 --> 41:52.800 is 27 by 27 41:52.800 --> 41:56.260 is it possible to take a 27 by 27 array 41:56.260 --> 42:01.400 and divide it by what is a 27 by 1 array 42:01.400 --> 42:03.920 is that an operation that you can do 42:03.920 --> 42:07.200 and whether or not you can perform this operation is determined by what's called broadcasting 42:07.200 --> 42:08.040 rules 42:08.040 --> 42:11.800 so if you just search broadcasting semantics in torch 42:11.800 --> 42:14.160 you'll notice that there's a special definition for 42:14.160 --> 42:15.660 what's called broadcasting 42:15.660 --> 42:18.000 that for whether or not 42:18.000 --> 42:23.660 these two arrays can be combined in a binary operation like division 42:23.660 --> 42:26.500 so the 
first condition is each tensor has at least one dimension 42:26.500 --> 42:28.300 which is the case for us 42:28.300 --> 42:30.240 and then when iterating over the dimension sizes 42:30.240 --> 42:32.200 starting at the trailing dimension 42:32.200 --> 42:34.400 the dimension sizes must either be equal 42:34.400 --> 42:35.400 one of them is 1 42:35.400 --> 42:37.200 or one of them does not exist 42:37.200 --> 42:38.760 okay 42:38.760 --> 42:40.340 so let's do that 42:40.340 --> 42:43.000 we need to align the two arrays 42:43.000 --> 42:44.100 and their shapes 42:44.100 --> 42:46.640 which is very easy because both of these shapes have two elements 42:46.640 --> 42:48.000 so they're aligned 42:48.000 --> 42:49.500 then we iterate over 42:49.500 --> 42:50.660 from the right 42:50.660 --> 42:52.100 and going to the left 42:52.100 --> 42:55.200 each dimension must be either equal 42:55.200 --> 42:56.340 one of them is a 1 42:56.340 --> 42:57.660 or one of them does not exist 42:57.660 --> 42:59.340 so in this case they're not equal 42:59.340 --> 43:00.500 but one of them is a 1 43:00.500 --> 43:01.700 so this is fine 43:01.700 --> 43:03.700 and then this dimension they're both equal 43:03.700 --> 43:05.560 so this is fine 43:05.560 --> 43:07.040 so all the dimensions 43:07.040 --> 43:13.200 are fine and therefore this operation is broadcastable. So that means that this operation 43:13.200 --> 43:20.380 is allowed. And what is it that these arrays do when you divide 27 by 27 by 27 by 1? What it does 43:20.380 --> 43:28.360 is that it takes this dimension 1 and it stretches it out. It copies it to match 27 here in this case. 43:28.760 --> 43:35.660 So in our case, it takes this column vector, which is 27 by 1, and it copies it 27 times 43:35.660 --> 43:43.000 to make these both be 27 by 27 internally. You can think of it that way. And so it copies those 43:43.000 --> 43:49.480 counts and then it does an element-wise division, which is what we want because these counts we 43:49.480 --> 43:55.520 want to divide by them on every single one of these columns in this matrix. So this actually 43:55.520 --> 44:02.240 we expect will normalize every single row. And we can check that this is true by taking the first 44:02.240 --> 44:04.820 row, for example, and taking its sum. 44:04.820 --> 44:13.000 We expect this to be 1 because it's now normalized. And then we expect this now because 44:13.000 --> 44:17.400 if we actually correctly normalize all the rows, we expect to get the exact same result here. 44:17.800 --> 44:24.060 So let's run this. It's the exact same result. So this is correct. So now I would like to scare 44:24.060 --> 44:28.660 you a little bit. You actually have to like, I basically encourage you very strongly to read 44:28.660 --> 44:33.220 through broadcasting semantics. And I encourage you to treat this with respect. And it's not 44:34.820 --> 44:38.200 something you should do with it. It's something to really respect, really understand and look up 44:38.200 --> 44:42.600 maybe some tutorials for broadcasting and practice it and be careful with it because you can very 44:42.600 --> 44:49.240 quickly run into bugs. Let me show you what I mean. You see how here we have p dot sum of 1, 44:49.240 --> 44:55.820 keep them as true. The shape of this is 27 by 1. Let me take out this line just so we have the n, 44:55.820 --> 45:03.800 and then we can see the counts. We can see that this is all the counts across all the rows. And 45:03.800 --> 45:04.760 it's 27 by 1. 
45:04.820 --> 45:11.640 vector right now suppose that I tried to do the following but I erase keep them 45:11.640 --> 45:17.360 just true here what does that do if keep them is not true it's false then 45:17.360 --> 45:21.440 remember according to documentation it gets rid of this dimension one it 45:21.440 --> 45:26.000 squeezes it out so basically we just get all the same counts the same result 45:26.000 --> 45:32.060 except the shape of it is not 27 by 1 it's just 27 the one disappears but all 45:32.060 --> 45:39.300 the counts are the same so you'd think that this divide that would would work 45:39.300 --> 45:44.300 first of all can we even write this and will it even is it even is it even 45:44.300 --> 45:47.720 expected to run is it broadcastable let's determine if this result is 45:47.720 --> 45:57.340 broadcastable p.summit1 is shape is 27 this is 27 by 27 so 27 by 27 45:57.340 --> 46:02.040 broadcasting into 27 so now rules of 46:02.040 --> 46:06.480 broadcasting number one align all the dimensions on the right done now 46:06.480 --> 46:09.180 iteration over all the dimensions starting from the right going to the 46:09.180 --> 46:14.920 left all the dimensions must either be equal one of them must be one or one then 46:14.920 --> 46:19.200 does not exist so here they are all equal here the dimension does not exist 46:19.200 --> 46:26.100 so internally what broadcasting will do is it will create a one here and then we 46:26.100 --> 46:30.480 see that one of them is a one and this will get copied and this will run this 46:30.480 --> 46:30.980 will broadcast 46:32.040 --> 46:42.100 okay so you'd expect this to work because we we are this broadcast and 46:42.100 --> 46:46.800 this we can divide this now if I run this you'd expect it to work but it 46:46.800 --> 46:51.220 doesn't you actually get garbage you get a wrong result because this is actually 46:51.220 --> 47:01.380 a bug this keep them equals true makes it work this is a bug 47:02.040 --> 47:06.480 but it's actually we are this in both cases we are doing the correct counts we 47:06.480 --> 47:11.760 are summing up across the rows but keep them is saving us and making it work so 47:11.760 --> 47:15.040 in this case I'd like you to encourage you to potentially like pause this video 47:15.040 --> 47:19.360 at this point and try to think about why this is buggy and why the keep dem was 47:19.360 --> 47:26.540 necessary here okay so the reason to do for this is I'm trying to hint at here 47:26.540 --> 47:31.980 when I was sort of giving you a bit of a hint on how this works this 27 factor is 47:32.040 --> 47:39.800 internally inside the broadcasting this becomes a 1 by 27 and 1 by 27 is a row vector right and 47:39.800 --> 47:46.980 now we are dividing 27 by 27 by 1 by 27 and torch will replicate this dimension so basically 47:46.980 --> 47:56.940 it will take it will take this row vector and it will copy it vertically now 27 times so the 27 by 47:56.940 --> 48:04.760 27 lines exactly and element wise divides and so basically what's happening here is we're actually 48:04.760 --> 48:11.440 normalizing the columns instead of normalizing the rows so you can check that what's happening 48:11.440 --> 48:19.920 here is that P at 0 which is the first row of P dot sum is not 1 it's 7 it is the first column 48:19.920 --> 48:26.920 as an example that sums to 1 so to summarize where does the issue come from the issue 48:26.920 --> 48:31.960 comes from the silent adding of a dimension here because in broadcasting rules you align on the 
48:31.960 --> 48:36.820 The issue comes from the silent adding of a dimension here: in broadcasting rules you align on the right and go from right to left, and if a dimension doesn't exist, you create it. So that's where the 48:36.820 --> 48:41.900 problem happens. We still did the counts correctly: we did the counts across the rows and we got the 48:41.900 --> 48:48.460 counts on the right here as a column vector. But because keepdim was not true, this 48:48.460 --> 48:53.200 dimension was discarded, and now we just have a vector of 27. And because of the way broadcasting 48:53.200 --> 48:56.380 works, this vector of 27 suddenly becomes a row vector, 48:56.920 --> 49:01.080 and then this row vector gets replicated vertically, and at every single point we 49:01.080 --> 49:11.400 are dividing by the count in the opposite direction. So this thing just doesn't work. 49:11.400 --> 49:18.360 This needs to be keepdim equals true in this case. So then we have that P at 0 is normalized, 49:19.800 --> 49:23.160 and conversely the first column you'd expect to potentially not be normalized, 49:24.520 --> 49:25.960 and this is what makes it work. 49:27.560 --> 49:33.560 So, pretty subtle, and hopefully this helps to scare you: you should have respect for 49:33.560 --> 49:38.840 broadcasting. Be careful, check your work, and understand how it works under the hood, and make 49:38.840 --> 49:42.360 sure that it's broadcasting in the direction that you like. Otherwise you're going to introduce very 49:42.360 --> 49:48.600 subtle, very hard to find bugs, so just be careful. One more note on efficiency: we don't want 49:48.600 --> 49:53.640 to be doing this here, because this creates a completely new tensor that we store into P. 49:54.280 --> 49:56.840 We prefer to use in-place operations if possible. 49:57.560 --> 50:02.520 So this would be an in-place operation. It has the potential to be faster; it doesn't create new 50:02.520 --> 50:12.680 memory under the hood. And then let's erase this, we don't need it, and let's also just print fewer, 50:12.680 --> 50:17.640 just so I'm not wasting space. Okay, so we're actually in a pretty good spot now. We trained 50:17.640 --> 50:23.720 a bigram language model, and we trained it really just by counting how frequently any pairing 50:23.720 --> 50:26.840 occurs and then normalizing, so that we get a nice probability distribution. 50:27.300 --> 50:31.600 So really these elements of this array P are really the 50:31.600 --> 50:36.160 parameters of our bigram language model, summarizing the statistics of these bigrams. 50:36.160 --> 50:40.080 So we trained the model, and then we know how to sample from the model: 50:40.080 --> 50:46.000 we just iteratively sample the next character, feed it in each time, and get the next character. 50:46.960 --> 50:51.040 Now what I'd like to do is I'd like to somehow evaluate the quality of this model. 50:51.040 --> 50:56.580 We'd like to somehow summarize the quality of this model into a single number. How good is it at predicting 50:56.580 --> 51:02.920 the training set? As an example, in the training set we can now evaluate the training 51:02.920 --> 51:08.500 loss.
And this training loss is telling us about sort of the quality of this model in a single 51:08.500 --> 51:14.080 number, just like we saw in micrograd. So let's try to think through the quality of the model 51:14.080 --> 51:19.440 and how we would evaluate it. Basically, what we're going to do is we're going to copy-paste 51:19.440 --> 51:26.220 this code that we previously used for counting. And let me just print these bigrams first. We're 51:26.220 --> 51:30.860 going to use f-strings, and I'm going to print character one followed by character two. These 51:30.860 --> 51:34.680 are the bigrams. And then I don't want to do it for all the words, just the first three words. 51:35.860 --> 51:42.260 So here we have the Emma, Olivia, and Ava bigrams. Now what we'd like to do is we'd like to basically 51:42.260 --> 51:48.800 look at the probability that the model assigns to every one of these bigrams. So in other words, 51:48.840 --> 51:58.860 we can look at the probability that the model assigns, which is summarized in the matrix P at ix1, ix2. And then we can print it here as a probability. 52:00.520 --> 52:07.860 And because these probabilities have too many digits, let me use a colon .4f format to truncate them a bit. 52:09.000 --> 52:12.840 So what do we have here, right? We're looking at the probabilities that the model assigns to every 52:12.840 --> 52:19.200 one of these bigrams in the dataset. And so we can see some of them are 4%, 3%, etc. Just to have a 52:19.200 --> 52:25.420 measuring stick in our mind, by the way: we have 27 possible characters or tokens, and if everything 52:25.420 --> 52:33.320 was equally likely, then you'd expect all these probabilities to be roughly 4%. So anything above 52:33.320 --> 52:38.460 4% means that we've learned something useful from these bigram statistics. And you see that roughly 52:38.460 --> 52:44.700 some of these are 4%, but some of them are as high as 40%, 35%, and so on. So you see that the model 52:44.700 --> 52:49.060 actually assigned a pretty high probability to whatever's in the training set. And so that's a 52:49.060 --> 52:53.580 good thing. Basically, if you have a very good model, you'd expect that these probabilities 52:53.580 --> 52:58.140 should be near one, because that means that your model is correctly predicting what's going to come 52:58.140 --> 53:04.580 next, especially on the training set where you trained your model. So now we'd like to think 53:04.580 --> 53:09.440 about how we can summarize these probabilities into a single number that measures the quality 53:09.440 --> 53:14.380 of this model. Now, when you look at the literature on maximum likelihood estimation 53:14.380 --> 53:19.040 and statistical modeling and so on, you'll see that what's typically used here 53:19.040 --> 53:23.980 is something called the likelihood. And the likelihood is the product of all of these 53:23.980 --> 53:29.760 probabilities. And so the product of all of these probabilities is the likelihood, and it's really 53:29.760 --> 53:37.140 telling us about the probability of the entire data set assigned by the model that we've trained. 53:37.600 --> 53:43.600 And that is a measure of quality.
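For reference, here is a sketch of the bigram probability printout described above, the quantities whose product we are about to take. The `words`, `stoi`, and row-normalized `P` here are small stand-ins (P is random) just so the snippet runs on its own; the real ones come from earlier in the video.

```python
import torch

# Stand-ins for the real dataset and the trained probability matrix.
words = ['emma', 'olivia', 'ava']
stoi = {s: i + 1 for i, s in enumerate('abcdefghijklmnopqrstuvwxyz')}
stoi['.'] = 0

P = torch.rand(27, 27)
P /= P.sum(1, keepdim=True)   # row-normalize so each row is a distribution

for w in words[:3]:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        ix1, ix2 = stoi[ch1], stoi[ch2]
        print(f'{ch1}{ch2}: {P[ix1, ix2]:.4f}')   # ~0.037 (1/27) is the uniform baseline
```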
So the product of these should be as high as possible when you 53:43.600 --> 53:47.680 are training the model and when you have a good model: your product of these probabilities should 53:47.680 --> 53:48.300 be very high. 53:49.040 --> 53:54.700 Now, the product of these probabilities is an unwieldy thing to work with. You can see 53:54.700 --> 53:58.760 that all of them are between zero and one, so your product of these probabilities will be a very tiny 53:58.760 --> 54:05.440 number. So for convenience, what people usually work with is not the likelihood; they work with 54:05.440 --> 54:11.580 what's called the log likelihood. So the product of these is the likelihood. To get the log 54:11.580 --> 54:16.420 likelihood, we just have to take the log of the probability. And so for the log of the probability 54:16.420 --> 54:18.620 here, I have the log of x from zero to one. 54:19.720 --> 54:27.320 The log is, you see here, a monotonic transformation of the probability, where if you pass in one, you 54:27.320 --> 54:33.320 get zero. So probability one gets you log probability of zero. And then as you go to lower and 54:33.320 --> 54:38.920 lower probability, the log will grow more and more negative, all the way to negative infinity at 54:38.920 --> 54:39.420 zero. 54:41.800 --> 54:48.660 So here we have a log prob, which is really just a torch.log of the probability. Let's print it out to get a sense of what that looks like. 54:50.000 --> 54:52.040 Log prob, also with a colon .4f format. 54:56.600 --> 55:02.880 So as you can see, when we plug in probabilities that are very close to one, like some of our higher numbers, we get closer and closer to zero. 55:03.520 --> 55:08.100 And then if we plug in very bad probabilities, we get a more and more negative number. That's bad. 55:09.540 --> 55:16.940 And the reason we work with this is, to a large extent, convenience, because we have, mathematically, that if 55:16.940 --> 55:18.380 you have some product a times b times c of
55:18.960 --> 55:24.560 all these probabilities right the likelihood is the product of all these probabilities 55:25.360 --> 55:31.280 then the log of these is just log of a plus log of b 55:33.760 --> 55:40.320 plus log of c if you remember your logs from your high school or undergrad and so on so we have that 55:40.320 --> 55:44.640 basically the likelihood of the product probabilities the log likelihood is just 55:44.640 --> 55:53.440 the sum of the logs of the individual probabilities so log likelihood starts at zero 55:54.560 --> 56:01.680 and then log likelihood here we can just accumulate simply and then the end we can print this 56:05.360 --> 56:06.560 print the log likelihood 56:09.520 --> 56:12.720 f strings maybe you're familiar with this 56:13.840 --> 56:14.640 so log likelihood 56:14.640 --> 56:16.240 is negative 38 56:19.840 --> 56:30.080 okay now we actually want um so how high can log likelihood get it can go to zero so when 56:30.080 --> 56:34.160 all the probabilities are one log likelihood will be zero and then when all the probabilities 56:34.160 --> 56:40.080 are lower this will grow more and more negative now we don't actually like this because what we'd 56:40.080 --> 56:43.840 like is a loss function and a loss function has the semantics that low 56:43.840 --> 56:49.040 is good because we're trying to minimize the loss so we actually need to invert this 56:49.040 --> 56:52.880 and that's what gives us something called the negative log likelihood 56:54.880 --> 56:58.800 negative log likelihood is just negative of the log likelihood 57:02.720 --> 57:07.040 these are f strings by the way if you'd like to look this up negative log likelihood equals 57:08.320 --> 57:13.040 so negative log likelihood now is just negative of it and so the negative log likelihood is a negative 57:13.040 --> 57:20.660 likelihood, is a very nice loss function because the lowest it can get is zero. And the higher it 57:20.660 --> 57:26.160 is, the worse off the predictions are that you're making. And then one more modification to this 57:26.160 --> 57:31.740 that sometimes people do is that for convenience, they actually like to normalize by, they like to 57:31.740 --> 57:40.400 make it an average instead of a sum. And so here, let's just keep some counts as well. So n plus 57:40.400 --> 57:46.800 equals one starts at zero. And then here, we can have sort of like a normalized log likelihood. 57:50.240 --> 57:56.120 If we just normalize it by the count, then we will sort of get the average log likelihood. So this 57:56.120 --> 58:03.660 would be usually our loss function here. This is what we would use. So our loss function for the 58:03.660 --> 58:09.560 training set assigned by the model is 2.4. That's the quality of this model. And the lower it is, 58:09.560 --> 58:10.380 the better off we are. 58:10.420 --> 58:17.460 And the higher it is, the worse off we are. And the job of our, you know, training is to find the 58:17.460 --> 58:24.300 parameters that minimize the negative log likelihood loss. And that would be like a high 58:24.300 --> 58:29.800 quality model. Okay, so to summarize, I actually wrote it out here. So our goal is to maximize 58:29.800 --> 58:36.080 likelihood, which is the product of all the probabilities assigned by the model. And we want 58:36.080 --> 58:40.240 to maximize this likelihood with respect to the model parameters. And in our case, we want to 58:40.240 --> 58:41.100 maximize the likelihood of all the probabilities assigned by the model. 
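Here is the bookkeeping for the log likelihood and the average negative log likelihood described above, as a standalone sketch. The data and P are stand-ins so it runs by itself; on the real counts the final average comes out around 2.4.

```python
import torch

# Stand-ins: the real `words` and trained `P` come from earlier in the video.
words = ['emma', 'olivia', 'ava']
stoi = {s: i + 1 for i, s in enumerate('abcdefghijklmnopqrstuvwxyz')}
stoi['.'] = 0

P = torch.rand(27, 27)
P /= P.sum(1, keepdim=True)

log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2]])  # sum of logs = log of the product
        n += 1

nll = -log_likelihood        # negative log likelihood: 0 is best, higher is worse
print(f'{log_likelihood=}')
print(f'{nll=}')
print(f'{nll/n=}')           # average NLL: the single-number loss we report
```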
And in our case, the model 58:41.100 --> 58:47.380 parameters here are defined in the table. These numbers, the probabilities, are the model parameters, 58:47.380 --> 58:52.340 sort of, in our bigram language model so far. But you have to keep in mind that here we are storing 58:52.340 --> 58:57.460 everything in a table format, the probabilities. But what's coming up, as a brief preview, is that 58:57.460 --> 59:02.100 these numbers will not be kept explicitly; these numbers will be calculated by a neural 59:02.100 --> 59:07.280 network. So that's coming up. And we want to change and tune the parameters of these neural 59:07.280 --> 59:10.220 networks. We want to change these parameters to maximize 59:10.240 --> 59:15.700 the likelihood, the product of the probabilities. Now, maximizing the likelihood is equivalent to 59:15.700 --> 59:22.260 maximizing the log likelihood, because log is a monotonic function. Here's the graph of log. And 59:22.260 --> 59:28.260 basically, all it is doing, you can look at it as just a scaling of the 59:28.260 --> 59:34.500 loss function. And so the optimization problem here and here are actually equivalent, because 59:34.500 --> 59:39.160 this is just a scaling; you can look at it that way. And so these are two identical optimization 59:39.160 --> 59:39.720 problems. 59:40.240 --> 59:46.420 Maximizing the log likelihood is equivalent to minimizing the negative log likelihood. 59:46.420 --> 59:50.540 And then in practice, people actually minimize the average negative log likelihood to get 59:50.540 --> 59:56.860 numbers like 2.4. And then this summarizes the quality of your model. And we'd like to 59:56.860 --> 01:00:02.680 minimize it and make it as small as possible. The lowest it can get is zero. And the 01:00:02.680 --> 01:00:07.440 lower it is, the better off your model is, because it's assigning high 01:00:07.440 --> 01:00:09.720 probabilities to your data. 01:00:09.720 --> 01:00:14.240 Now let's evaluate this over the entire training set, just to make sure that we get something around 2.4. 01:00:14.800 --> 01:00:18.720 Let's run this over the entire thing. Oops, let's take out the print statement as well. 01:00:20.640 --> 01:00:22.880 Okay, 2.45 for the entire training set. 01:00:24.400 --> 01:00:27.600 Now what I'd like to show you is that you can actually evaluate the probability for any word 01:00:27.600 --> 01:00:33.520 that you want. Like for example, if we just test a single word, andrej, and bring back the print 01:00:33.520 --> 01:00:39.520 statement, then you see that andrej is actually kind of like an unlikely word: on average, 01:00:40.240 --> 01:00:47.280 we take about 3 in negative log probability to represent it. And roughly, that's because "ej" apparently is very 01:00:47.280 --> 01:00:56.160 uncommon, as an example. Now, think through this. When I take andrej and I append a q, and I test the 01:00:56.160 --> 01:01:04.800 probability of andrejq, we actually get infinity. And that's because "jq" has a 0% 01:01:04.800 --> 01:01:09.360 probability according to our model, 01:01:09.360 --> 01:01:11.680 so the log of 0 will be negative infinity. 01:01:12.040 --> 01:01:13.780 We get infinite loss.
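To see that failure mode concretely, here is a tiny sketch (the probabilities are made up, standing in for a bigram like "jq" that has a count of zero):

```python
import torch

# A single zero-probability bigram makes the whole likelihood zero,
# so the log likelihood hits -inf and the average loss becomes infinite.
probs = torch.tensor([0.2, 0.1, 0.0])
print(torch.log(probs))                      # tensor([-1.6094, -2.3026,    -inf])
print(-torch.log(probs).sum() / len(probs))  # tensor(inf)
```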
01:01:14.340 --> 01:01:15.780 So this is kind of undesirable, right? 01:01:15.780 --> 01:01:18.840 Because we plugged in a string that could be like a somewhat reasonable name. 01:01:18.840 --> 01:01:25.760 But basically what this is saying is that this model is exactly 0% likely to predict this name. 01:01:26.620 --> 01:01:29.080 And our loss is infinity on this example. 01:01:29.840 --> 01:01:36.360 And really the reason for that is that j is followed by q 0 times. 01:01:37.000 --> 01:01:37.600 Where is q? 01:01:37.600 --> 01:01:38.780 jq is 0. 01:01:39.180 --> 01:01:41.440 And so jq is 0% likely. 01:01:42.100 --> 01:01:44.840 So it's actually kind of gross and people don't like this too much. 01:01:44.960 --> 01:01:50.320 To fix this, there's a very simple fix that people like to do to sort of like smooth out your model a little bit. 01:01:50.360 --> 01:01:51.300 And it's called model smoothing. 01:01:51.900 --> 01:01:55.500 And roughly what's happening is that we will add some fake counts. 01:01:56.140 --> 01:01:59.700 So imagine adding a count of 1 to everything. 01:02:00.780 --> 01:02:04.020 So we add a count of 1 like this. 01:02:04.360 --> 01:02:05.960 And then we recalculate the probabilities. 01:02:07.600 --> 01:02:08.820 And that's model smoothing. 01:02:08.960 --> 01:02:10.160 And you can add as much as you like. 01:02:10.220 --> 01:02:12.220 You can add 5 and that will give you a smoother model. 01:02:12.700 --> 01:02:17.260 And the more you add here, the more uniform model you're going to have. 01:02:17.840 --> 01:02:21.740 And the less you add, the more peaked model you are going to have, of course. 01:02:22.300 --> 01:02:25.240 So 1 is like a pretty decent count to add. 01:02:25.600 --> 01:02:29.700 And that will ensure that there will be no zeros in our probability matrix P. 01:02:30.780 --> 01:02:33.140 And so this will, of course, change the generations a little bit. 01:02:33.640 --> 01:02:34.500 In this case, it didn't. 01:02:34.600 --> 01:02:35.880 But in principle, it could. 01:02:36.540 --> 01:02:37.580 But what that's going to do... 01:02:37.600 --> 01:02:40.340 What it's going to do now is that nothing will be infinity unlikely. 01:02:41.260 --> 01:02:44.500 So now our model will predict some other probability. 01:02:44.880 --> 01:02:47.160 And we see that jq now has a very small probability. 01:02:47.580 --> 01:02:51.220 So the model still finds it very surprising that this was a word or a bigram. 01:02:51.440 --> 01:02:52.720 But we don't get negative infinity. 01:02:53.320 --> 01:02:55.760 So it's kind of like a nice fix that people like to apply sometimes. 01:02:55.800 --> 01:02:56.660 And it's called model smoothing. 01:02:57.100 --> 01:03:01.060 Okay, so we've now trained a respectable bigram character-level language model. 01:03:01.320 --> 01:03:07.380 And we saw that we both sort of trained the model by looking at the counts of all the bigrams. 01:03:07.600 --> 01:03:10.480 And normalizing the rows to get probability distributions. 01:03:11.200 --> 01:03:17.920 We saw that we can also then use those parameters of this model to perform sampling of new words. 01:03:19.260 --> 01:03:21.680 So we sample new names according to those distributions. 01:03:22.100 --> 01:03:24.860 And we also saw that we can evaluate the quality of this model. 01:03:25.320 --> 01:03:29.400 And the quality of this model is summarized in a single number, which is the negative log likelihood. 01:03:29.880 --> 01:03:32.700 And the lower this number is, the better the model is. 
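A sketch of the add-one smoothing just described, with a random stand-in for the real 27 by 27 count matrix:

```python
import torch

# Model smoothing: add a fake count to every cell before normalizing,
# so no bigram ends up with exactly zero probability. N is a stand-in here.
N = torch.randint(0, 50, (27, 27))

P = (N + 1).float()              # add-1 smoothing; a bigger constant gives a more uniform model
P /= P.sum(1, keepdim=True)

print((P == 0).any())            # tensor(False): no zeros left, so no -inf log probabilities
```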
01:03:33.140 --> 01:03:37.060 Because it is giving high probabilities to the actual next characters. 01:03:37.060 --> 01:03:38.900 And all the bigrams in our training set. 01:03:39.960 --> 01:03:41.600 So that's all well and good. 01:03:41.860 --> 01:03:45.980 But we've arrived at this model explicitly by doing something that felt sensible. 01:03:46.220 --> 01:03:47.620 We were just performing counts. 01:03:47.860 --> 01:03:50.080 And then we were normalizing those counts. 01:03:50.860 --> 01:03:53.760 Now what I would like to do is I would like to take an alternative approach. 01:03:54.000 --> 01:03:56.200 We will end up in a very, very similar position. 01:03:56.440 --> 01:03:57.840 But the approach will look very different. 01:03:58.180 --> 01:04:03.360 Because I would like to cast the problem of bigram character-level language modeling into the neural network framework. 01:04:04.020 --> 01:04:07.040 And in the neural network framework, we're going to approach things. 01:04:07.280 --> 01:04:10.160 Slightly differently, but again, end up in a very similar spot. 01:04:10.360 --> 01:04:11.260 I'll go into that later. 01:04:12.060 --> 01:04:16.960 Now, our neural network is going to be a still a bigram character-level language model. 01:04:17.360 --> 01:04:19.860 So it receives a single character as an input. 01:04:20.460 --> 01:04:23.460 Then there's neural network with some weights or some parameters w. 01:04:24.260 --> 01:04:29.060 And it's going to output the probability distribution over the next character in a sequence. 01:04:29.260 --> 01:04:34.660 It's going to make guesses as to what is likely to follow this character that was input to the model. 01:04:36.060 --> 01:04:36.960 And then in addition to that, 01:04:37.260 --> 01:04:41.060 we're going to be able to evaluate any setting of the parameters of the neural net. 01:04:41.260 --> 01:04:44.860 Because we have the loss function, the negative log likelihood. 01:04:45.060 --> 01:04:47.160 So we're going to take a look at its probability distributions. 01:04:47.360 --> 01:04:48.960 And we're going to use the labels, 01:04:49.160 --> 01:04:54.160 which are basically just the identity of the next character in that bigram, the second character. 01:04:54.360 --> 01:04:59.360 So knowing what the second character actually comes next in the bigram allows us to then look at 01:04:59.560 --> 01:05:03.260 how high of probability the model assigns to that character. 01:05:03.460 --> 01:05:06.160 And then we, of course, want the probability to be very high. 01:05:07.060 --> 01:05:09.860 And that is another way of saying that the loss is low. 01:05:10.860 --> 01:05:15.060 So we're going to use gradient-based optimization then to tune the parameters of this network. 01:05:15.460 --> 01:05:18.260 Because we have the loss function and we're going to minimize it. 01:05:18.460 --> 01:05:23.660 So we're going to tune the weights so that the neural net is correctly predicting the probabilities for the next character. 01:05:24.460 --> 01:05:25.460 So let's get started. 01:05:25.660 --> 01:05:29.460 The first thing I want to do is I want to compile the training set of this neural network, right? 01:05:29.660 --> 01:05:34.260 So create the training set of all the bigrams. 01:05:34.260 --> 01:05:45.860 Okay, and here I'm going to copy-paste this code because this code iterates over all the bigrams. 01:05:46.060 --> 01:05:50.260 So here we start with the words, we iterate over all the bigrams. 
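For reference, here is roughly where that copy-pasted loop ends up once the appends described next are in place; this is a sketch restricted to the first word, assuming the same stoi mapping as before.

```python
import torch

# Bigram training pairs for the first word only ('emma').
stoi = {s: i + 1 for i, s in enumerate('abcdefghijklmnopqrstuvwxyz')}
stoi['.'] = 0

xs, ys = [], []
for w in ['emma']:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])     # input: index of the first character of the bigram
        ys.append(stoi[ch2])     # label: index of the character that follows

xs = torch.tensor(xs)            # tensor([ 0,  5, 13, 13,  1])
ys = torch.tensor(ys)            # tensor([ 5, 13, 13,  1,  0])
```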
01:05:50.460 --> 01:05:52.860 And previously, as you recall, we did the counts. 01:05:53.060 --> 01:05:54.460 But now we're not going to do counts. 01:05:54.660 --> 01:05:56.060 We're just creating a training set. 01:05:56.260 --> 01:05:59.860 Now this training set will be made up of two lists. 01:06:00.060 --> 01:06:03.860 We have the... 01:06:04.260 --> 01:06:09.060 inputs and the targets, the labels. 01:06:09.260 --> 01:06:11.060 And these bigrams will denote x, y. 01:06:11.260 --> 01:06:13.060 Those are the characters, right? 01:06:13.260 --> 01:06:17.060 And so we're given the first character of the bigram and then we're trying to predict the next one. 01:06:17.260 --> 01:06:19.060 Both of these are going to be integers. 01:06:19.260 --> 01:06:24.060 So here we'll take xs.append is just x1. 01:06:24.260 --> 01:06:27.060 ys.append is x2. 01:06:27.260 --> 01:06:31.060 And then here we actually don't want lists of integers. 01:06:31.260 --> 01:06:34.060 We will create tensors out of these. 01:06:34.260 --> 01:06:37.060 xs is torch.tensor of xs. 01:06:37.260 --> 01:06:41.060 And ys is torch.tensor of ys. 01:06:41.260 --> 01:06:47.060 And then we don't actually want to take all the words just yet because I want everything to be manageable. 01:06:47.260 --> 01:06:51.060 So let's just do the first word, which is Emma. 01:06:51.260 --> 01:06:55.060 And then it's clear what these xs and ys would be. 01:06:55.260 --> 01:07:01.060 Here let me print character1, character2, just so you see what's going on here. 01:07:01.260 --> 01:07:04.060 So the bigrams of these characters is... 01:07:04.260 --> 01:07:14.060 So this single word, as I mentioned, has one, two, three, four, five examples for our neural network. 01:07:14.260 --> 01:07:17.060 There are five separate examples in Emma. 01:07:17.260 --> 01:07:19.060 And those examples I'll summarize here. 01:07:19.260 --> 01:07:27.060 When the input to the neural network is integer 0, the desired label is integer 5, which corresponds to e. 01:07:27.260 --> 01:07:32.060 When the input to the neural network is 5, we want its weights to be arranged, 01:07:32.060 --> 01:07:34.860 so that 13 gets a very high probability. 01:07:35.060 --> 01:07:38.860 When 13 is put in, we want 13 to have a high probability. 01:07:39.060 --> 01:07:42.860 When 13 is put in, we also want 1 to have a high probability. 01:07:43.060 --> 01:07:46.860 When 1 is input, we want 0 to have a very high probability. 01:07:47.060 --> 01:07:52.860 So there are five separate input examples to a neural net in this dataset. 01:07:55.060 --> 01:08:00.860 I wanted to add a tangent of a note of caution to be careful with a lot of the APIs of some of these frameworks. 01:08:00.860 --> 01:08:07.660 You saw me silently use torch.tensor with a lowercase t, and the output looked right. 01:08:07.860 --> 01:08:11.660 But you should be aware that there's actually two ways of constructing a tensor. 01:08:11.860 --> 01:08:16.660 There's a torch.lowercase tensor, and there's also a torch.capitalTensor class, 01:08:16.860 --> 01:08:19.660 which you can also construct, so you can actually call both. 01:08:19.860 --> 01:08:24.660 You can also do torch.capitalTensor, and you get an x as in y as well. 01:08:24.860 --> 01:08:27.660 So that's not confusing at all. 01:08:27.860 --> 01:08:30.660 There are threads on what is the difference between these two. 01:08:30.860 --> 01:08:35.660 And unfortunately, the docs are just not clear on the difference. 
01:08:35.860 --> 01:08:38.660 And when you look at the docs of lowercase tensor, 01:08:38.860 --> 01:08:42.660 it says: constructs a tensor with no autograd history by copying data. 01:08:42.860 --> 01:08:45.660 It's just like, it doesn't make sense. 01:08:45.860 --> 01:08:50.660 So the actual difference, as far as I can tell, is explained eventually in this random thread that you can Google. 01:08:50.860 --> 01:08:55.660 And really it comes down to, I believe, that... 01:08:55.860 --> 01:08:57.660 Where is this? 01:08:57.860 --> 01:09:00.660 torch.tensor infers the dtype, the data type, 01:09:00.860 --> 01:09:03.660 automatically, while torch.Tensor just returns a float tensor. 01:09:03.860 --> 01:09:06.660 I would recommend to stick to torch.lowercase tensor. 01:09:06.860 --> 01:09:12.660 So indeed, we see that when I construct this with a capital T, 01:09:12.860 --> 01:09:16.660 the data type here of x is float32. 01:09:16.860 --> 01:09:19.660 But with torch.lowercase tensor, 01:09:19.860 --> 01:09:25.660 you see how x.dtype is now integer. 01:09:25.860 --> 01:09:30.660 So it's advised that you use lowercase t, 01:09:30.860 --> 01:09:33.660 and you can read more about it if you like in some of these threads. 01:09:33.860 --> 01:09:37.660 But basically, I'm pointing out some of these things 01:09:37.860 --> 01:09:42.660 because I want to caution you, and I want you to get used to reading a lot of documentation 01:09:42.860 --> 01:09:46.660 and reading through a lot of Q&As and threads like this. 01:09:46.860 --> 01:09:50.660 And some of this stuff is unfortunately not easy and not very well documented, 01:09:50.860 --> 01:09:52.660 and you have to be careful out there. 01:09:52.860 --> 01:09:56.660 What we want here is integers, because that's what makes sense. 01:09:56.860 --> 01:10:00.660 And so lowercase tensor is what we are using. 01:10:00.860 --> 01:10:05.660 OK, now we want to think through how we're going to feed these examples into a neural network. 01:10:05.860 --> 01:10:09.660 Now, it's not quite as straightforward as plugging it in, 01:10:09.860 --> 01:10:11.660 because these examples right now are integers. 01:10:11.860 --> 01:10:14.660 So there's like a 0, 5 or 13. 01:10:14.860 --> 01:10:16.660 It gives us the index of the character. 01:10:16.860 --> 01:10:19.660 And you can't just plug an integer index into a neural net. 01:10:19.860 --> 01:10:23.660 These neural nets are sort of made up of these neurons, 01:10:23.860 --> 01:10:26.660 and these neurons have weights. 01:10:26.860 --> 01:10:30.660 And as you saw in micrograd, these weights act multiplicatively on the inputs: 01:10:30.860 --> 01:10:33.660 w times x plus b, there are tanh's and so on. 01:10:33.860 --> 01:10:37.660 And so it doesn't really make sense to make an input neuron take on integer values 01:10:37.860 --> 01:10:41.660 that you feed in and then multiply with weights. 01:10:41.860 --> 01:10:46.660 So instead, a common way of encoding integers is what's called one-hot encoding. 01:10:46.860 --> 01:10:50.660 In one-hot encoding, we take an integer like 13 01:10:50.860 --> 01:10:55.660 and we create a vector that is all zeros except for the 13th dimension, 01:10:55.860 --> 01:10:57.660 which we turn to a 1. 01:10:57.860 --> 01:11:00.660 And then that vector can feed into a neural net. 01:11:00.860 --> 01:11:07.660 Now, conveniently, PyTorch actually has something called the one-hot function 01:11:07.860 --> 01:11:09.660 inside torch and then functional. 01:11:09.860 --> 01:11:13.660 It takes a tensor made up of integers.
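To make that tangent concrete, a quick check of the dtype difference described above (this is a sketch of the behavior, not code from the video):

```python
import torch

xs = [0, 5, 13, 13, 1]

a = torch.tensor(xs)   # lowercase t: infers the dtype from the data
b = torch.Tensor(xs)   # capital T: always hands back a float tensor

print(a.dtype)         # torch.int64: integers, which is what we want for character indices
print(b.dtype)         # torch.float32
```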
01:11:13.860 --> 01:11:17.660 Long is an integer. 01:11:17.860 --> 01:11:21.660 And it also takes a number of classes, 01:11:21.860 --> 01:11:26.660 which is how large you want your tensor, your vector to be. 01:11:26.860 --> 01:11:30.660 So here, let's import torch.nn.func. 01:11:30.860 --> 01:11:33.660 This is a common way of importing it. 01:11:33.860 --> 01:11:36.660 And then let's do f.one-hot. 01:11:36.860 --> 01:11:39.660 And we feed in the integers that we want to encode. 01:11:39.860 --> 01:11:43.660 So we can actually feed in the entire array of Xs. 01:11:43.860 --> 01:11:47.660 And we can tell it that numclasses is 27. 01:11:47.860 --> 01:11:49.660 So it doesn't have to try to guess it. 01:11:49.860 --> 01:11:53.660 It may have guessed that it's only 13 and would give us an incorrect result. 01:11:53.860 --> 01:11:55.660 So this is the one-hot. 01:11:55.860 --> 01:11:59.660 Let's call this xinc for xencoded. 01:12:00.860 --> 01:12:05.660 And then we see that xencoded.shape is 5 by 27. 01:12:05.860 --> 01:12:11.660 And we can also visualize it, plt.imshow of xinc, 01:12:11.860 --> 01:12:14.660 to make it a little bit more clear because this is a little messy. 01:12:14.860 --> 01:12:19.660 So we see that we've encoded all the five examples into vectors. 01:12:19.860 --> 01:12:22.660 We have five examples, so we have five rows, 01:12:22.860 --> 01:12:25.660 and each row here is now an example into a neural net. 01:12:25.860 --> 01:12:29.660 And we see that the appropriate bit is turned on as a one, 01:12:29.660 --> 01:12:31.460 and everything else is zero. 01:12:31.660 --> 01:12:36.460 So here, for example, the zeroth bit is turned on. 01:12:36.660 --> 01:12:38.460 The fifth bit is turned on. 01:12:38.660 --> 01:12:41.460 Thirteenth bits are turned on for both of these examples. 01:12:41.660 --> 01:12:44.460 And then the first bit here is turned on. 01:12:44.660 --> 01:12:49.460 So that's how we can encode integers into vectors. 01:12:49.660 --> 01:12:52.460 And then these vectors can feed into neural nets. 01:12:52.660 --> 01:12:55.460 One more issue to be careful with here, by the way, is 01:12:55.660 --> 01:12:57.460 let's look at the data type of xincoding. 01:12:57.660 --> 01:12:59.460 We always want to be careful with data types. 01:12:59.460 --> 01:13:02.260 What would you expect xincoding's data type to be? 01:13:02.460 --> 01:13:04.260 When we're plugging numbers into neural nets, 01:13:04.460 --> 01:13:06.260 we don't want them to be integers. 01:13:06.460 --> 01:13:10.260 We want them to be floating-point numbers that can take on various values. 01:13:10.460 --> 01:13:13.260 But the dtype here is actually a 64-bit integer. 01:13:13.460 --> 01:13:15.260 And the reason for that, I suspect, 01:13:15.460 --> 01:13:19.260 is that one hot received a 64-bit integer here, 01:13:19.460 --> 01:13:21.260 and it returned the same data type. 01:13:21.460 --> 01:13:23.260 And when you look at the signature of one hot, 01:13:23.460 --> 01:13:26.260 it doesn't even take a dtype, a desired data type, 01:13:26.460 --> 01:13:28.260 of the output tensor. 01:13:28.260 --> 01:13:31.060 And so we can't, in a lot of functions in Torch, 01:13:31.260 --> 01:13:34.060 we'd be able to do something like dtype equals torch.float32, 01:13:34.260 --> 01:13:38.060 which is what we want, but one hot does not support that. 01:13:38.260 --> 01:13:43.060 So instead, we're going to want to cast this to float like this. 
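A minimal sketch of the one-hot encoding step just described, including the cast to float:

```python
import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])        # the five 'emma' inputs

xenc = F.one_hot(xs, num_classes=27)        # (5, 27): one bit turned on per row
print(xenc.shape)                           # torch.Size([5, 27])
print(xenc.dtype)                           # torch.int64: one_hot keeps the integer dtype

xenc = xenc.float()                         # cast, since the neural net wants floats
# plt.imshow(xenc) would show five rows, each with a single bright cell
```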
01:13:43.260 --> 01:13:46.060 So that these, everything is the same, 01:13:46.260 --> 01:13:48.060 everything looks the same, 01:13:48.260 --> 01:13:50.060 but the dtype is float32. 01:13:50.260 --> 01:13:53.060 And floats can feed into neural nets. 01:13:53.260 --> 01:13:56.060 So now let's construct our first neuron. 01:13:56.260 --> 01:13:58.060 This neuron will look at 01:13:58.060 --> 01:13:59.860 these input vectors. 01:14:00.060 --> 01:14:01.860 And as you remember from micrograd, 01:14:02.060 --> 01:14:03.860 these neurons basically perform a very simple function, 01:14:04.060 --> 01:14:05.860 wx plus b, 01:14:06.060 --> 01:14:08.860 where wx is a dot product, right? 01:14:09.060 --> 01:14:11.860 So we can achieve the same thing here. 01:14:12.060 --> 01:14:14.860 Let's first define the weights of this neuron, basically. 01:14:15.060 --> 01:14:17.860 What are the initial weights at initialization for this neuron? 01:14:18.060 --> 01:14:20.860 Let's initialize them with torch.random. 01:14:21.060 --> 01:14:26.860 torch.random fills a tensor with random numbers 01:14:26.860 --> 01:14:28.660 drawn from a normal distribution. 01:14:28.860 --> 01:14:33.660 And a normal distribution has a probability density function like this. 01:14:33.860 --> 01:14:36.660 And so most of the numbers drawn from this distribution 01:14:36.860 --> 01:14:38.660 will be around zero, 01:14:38.860 --> 01:14:41.660 but some of them will be as high as almost three and so on. 01:14:41.860 --> 01:14:45.660 And very few numbers will be above three in magnitude. 01:14:45.860 --> 01:14:49.660 So we need to take a size as an input here. 01:14:49.860 --> 01:14:53.660 And I'm going to use size to be 27 by one. 01:14:53.860 --> 01:14:56.660 So 27 by one 01:14:56.660 --> 01:14:58.460 and then let's visualize w. 01:14:58.660 --> 01:15:02.460 So w is a column vector of 27 numbers. 01:15:02.660 --> 01:15:08.460 And these weights are then multiplied by the inputs. 01:15:08.660 --> 01:15:10.460 So now to perform this multiplication, 01:15:10.660 --> 01:15:14.460 we can take x encoding and we can multiply it with w. 01:15:14.660 --> 01:15:19.460 This is a matrix multiplication operator in PyTorch. 01:15:19.660 --> 01:15:23.460 And the output of this operation is five by one. 01:15:23.660 --> 01:15:25.460 The reason it's five by one is the following. 01:15:25.660 --> 01:15:26.460 We took x encoding 01:15:26.660 --> 01:15:28.460 which is five by 27 01:15:28.660 --> 01:15:32.460 and we multiplied it by 27 by one. 01:15:32.660 --> 01:15:35.460 And in matrix multiplication, 01:15:35.660 --> 01:15:39.460 you see that the output will become five by one 01:15:39.660 --> 01:15:43.460 because these 27 will multiply and add. 01:15:43.660 --> 01:15:46.460 So basically what we're seeing here 01:15:46.660 --> 01:15:48.460 out of this operation 01:15:48.660 --> 01:15:53.460 is we are seeing the five activations 01:15:53.660 --> 01:15:55.460 of this neuron 01:15:55.460 --> 01:15:57.260 on these five inputs. 01:15:57.460 --> 01:16:00.260 And we've evaluated all of them in parallel. 01:16:00.460 --> 01:16:03.260 We didn't feed in just a single input to the single neuron. 01:16:03.460 --> 01:16:07.260 We fed in simultaneously all the five inputs into the same neuron. 01:16:07.460 --> 01:16:09.260 And in parallel, 01:16:09.460 --> 01:16:12.260 PyTorch has evaluated the wx plus b. 01:16:12.460 --> 01:16:14.260 But here is just wx. 01:16:14.460 --> 01:16:15.260 There's no bias. 
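A sketch of the single neuron just described; the PyTorch function that fills a tensor with draws from a normal distribution is torch.randn.

```python
import torch
import torch.nn.functional as F

xenc = F.one_hot(torch.tensor([0, 5, 13, 13, 1]), num_classes=27).float()  # (5, 27)

W = torch.randn((27, 1))     # one neuron: 27 weights drawn from a standard normal
out = xenc @ W               # matrix multiply: (5, 27) @ (27, 1) -> (5, 1)
print(out.shape)             # torch.Size([5, 1]): this neuron's activation on all five inputs
```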
01:16:15.460 --> 01:16:20.260 It has value w times x for all of them independently. 01:16:20.460 --> 01:16:22.260 Now instead of a single neuron though, 01:16:22.460 --> 01:16:24.260 I would like to have 27 neurons. 01:16:24.260 --> 01:16:27.060 And I'll show you in a second why I want 27 neurons. 01:16:27.260 --> 01:16:29.060 So instead of having just a one here, 01:16:29.260 --> 01:16:32.060 which is indicating this presence of one single neuron, 01:16:32.260 --> 01:16:34.060 we can use 27. 01:16:34.260 --> 01:16:37.060 And then when w is 27 by 27, 01:16:37.260 --> 01:16:43.060 this will in parallel evaluate all the 27 neurons 01:16:43.260 --> 01:16:45.060 on all the five inputs, 01:16:45.260 --> 01:16:49.060 giving us a much bigger result. 01:16:49.260 --> 01:16:53.060 So now what we've done is five by 27 multiplied 27 by 27. 01:16:53.060 --> 01:16:56.860 And the output of this is now five by 27. 01:16:57.060 --> 01:17:02.860 So we can see that the shape of this is five by 27. 01:17:03.060 --> 01:17:06.860 So what is every element here telling us, right? 01:17:07.060 --> 01:17:11.860 It's telling us for every one of 27 neurons that we created, 01:17:12.060 --> 01:17:18.860 what is the firing rate of those neurons on every one of those five examples? 01:17:19.060 --> 01:17:21.860 So the element, for example, 01:17:21.860 --> 01:17:24.660 three comma 13, 01:17:24.860 --> 01:17:28.660 is giving us the firing rate of the 13th neuron 01:17:28.860 --> 01:17:31.660 looking at the third input. 01:17:31.860 --> 01:17:35.660 And the way this was achieved is by a dot product 01:17:35.860 --> 01:17:40.660 between the third input and the 13th column 01:17:40.860 --> 01:17:44.660 of this w matrix here. 01:17:44.860 --> 01:17:47.660 So using matrix multiplication, 01:17:47.860 --> 01:17:51.660 we can very efficiently evaluate the dot product 01:17:51.660 --> 01:17:54.460 between lots of input examples in a batch 01:17:54.660 --> 01:17:58.460 and lots of neurons where all of those neurons have weights 01:17:58.660 --> 01:18:00.460 in the columns of those w's. 01:18:00.660 --> 01:18:02.460 And in matrix multiplication, 01:18:02.660 --> 01:18:05.460 we're just doing those dot products in parallel. 01:18:05.660 --> 01:18:07.460 Just to show you that this is the case, 01:18:07.660 --> 01:18:11.460 we can take xank and we can take the third row. 01:18:11.660 --> 01:18:16.460 And we can take the w and take its 13th column. 01:18:16.660 --> 01:18:21.460 And then we can do xank at three 01:18:21.660 --> 01:18:26.460 element-wise multiply with w at 13 01:18:26.660 --> 01:18:27.460 and sum that up. 01:18:27.660 --> 01:18:29.460 That's wx plus b. 01:18:29.660 --> 01:18:32.460 Well, there's no plus b, it's just wx dot product. 01:18:32.660 --> 01:18:34.460 And that's this number. 01:18:34.660 --> 01:18:37.460 So you see that this is just being done efficiently 01:18:37.660 --> 01:18:40.460 by the matrix multiplication operation 01:18:40.660 --> 01:18:42.460 for all the input examples 01:18:42.660 --> 01:18:45.460 and for all the output neurons of this first layer. 01:18:45.660 --> 01:18:48.460 Okay, so we fed our 27 dimensional inputs 01:18:48.660 --> 01:18:50.460 into a first layer of a neural net 01:18:50.460 --> 01:18:52.260 that has 27 neurons, right? 01:18:52.460 --> 01:18:56.260 So we have 27 inputs and now we have 27 neurons. 01:18:56.460 --> 01:18:59.260 These neurons perform w times x. 
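And the 27-neuron version, with the dot-product check described above (element (3, 13) is the 13th neuron looking at the 3rd input):

```python
import torch
import torch.nn.functional as F

xenc = F.one_hot(torch.tensor([0, 5, 13, 13, 1]), num_classes=27).float()  # (5, 27)
W = torch.randn((27, 27))                                                  # 27 neurons, 27 weights each

logits = xenc @ W            # (5, 27) @ (27, 27) -> (5, 27): every neuron on every example
print(logits.shape)          # torch.Size([5, 27])

print(logits[3, 13])                 # firing rate of the 13th neuron on the 3rd input
print((xenc[3] * W[:, 13]).sum())    # the same number, computed as an explicit dot product
```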
01:18:59.460 --> 01:19:00.260 They don't have a bias 01:19:00.460 --> 01:19:02.260 and they don't have a nonlinearity like tanh. 01:19:02.460 --> 01:19:05.260 We're going to leave them to be a linear layer. 01:19:05.460 --> 01:19:07.260 In addition to that, 01:19:07.460 --> 01:19:09.260 we're not going to have any other layers. 01:19:09.460 --> 01:19:10.260 This is going to be it. 01:19:10.460 --> 01:19:12.260 It's just going to be the dumbest, smallest, 01:19:12.460 --> 01:19:13.260 simplest neural net, 01:19:13.460 --> 01:19:15.260 which is just a single linear layer. 01:19:15.460 --> 01:19:17.260 And now I'd like to explain 01:19:17.460 --> 01:19:20.260 what I want those 27 outputs to be. 01:19:20.460 --> 01:19:22.260 Intuitively, what we're trying to produce here 01:19:22.460 --> 01:19:24.260 for every single input example 01:19:24.460 --> 01:19:25.260 is we're trying to produce 01:19:25.460 --> 01:19:27.260 some kind of a probability distribution 01:19:27.460 --> 01:19:29.260 for the next character in a sequence. 01:19:29.460 --> 01:19:31.260 And there's 27 of them. 01:19:31.460 --> 01:19:33.260 But we have to come up with precise semantics 01:19:33.460 --> 01:19:35.260 for exactly how we're going to interpret 01:19:35.460 --> 01:19:39.260 these 27 numbers that these neurons take on. 01:19:39.460 --> 01:19:41.260 Now intuitively, you see here 01:19:41.460 --> 01:19:43.260 that these numbers are negative 01:19:43.460 --> 01:19:45.260 and some of them are positive, etc. 01:19:45.460 --> 01:19:47.260 And that's because these are coming out 01:19:47.460 --> 01:19:48.260 of the neural net layer 01:19:48.460 --> 01:19:50.260 initialized with these 01:19:50.460 --> 01:19:53.260 normal distribution parameters. 01:19:53.460 --> 01:19:55.260 But what we want is 01:19:55.460 --> 01:19:57.260 we want something like we had here. 01:19:57.460 --> 01:20:00.260 Like each row here told us the counts 01:20:00.460 --> 01:20:02.260 and then we normalize the counts 01:20:02.460 --> 01:20:03.260 to get probabilities. 01:20:03.460 --> 01:20:05.260 And we want something similar 01:20:05.460 --> 01:20:06.260 to come out of the neural net. 01:20:06.460 --> 01:20:08.260 But what we just have right now 01:20:08.460 --> 01:20:10.260 is just some negative and positive numbers. 01:20:10.460 --> 01:20:12.260 Now we want those numbers 01:20:12.460 --> 01:20:14.260 to somehow represent the probabilities 01:20:14.460 --> 01:20:15.260 for the next character. 01:20:15.460 --> 01:20:17.260 But you see that probabilities, 01:20:17.460 --> 01:20:19.260 they have a special structure. 01:20:19.260 --> 01:20:21.060 They're positive numbers 01:20:21.260 --> 01:20:22.060 and they sum to one. 01:20:22.260 --> 01:20:24.060 And so that doesn't just come out 01:20:24.260 --> 01:20:25.060 of a neural net. 01:20:25.260 --> 01:20:27.060 And then they can't be counts 01:20:27.260 --> 01:20:30.060 because these counts are positive 01:20:30.260 --> 01:20:32.060 and counts are integers. 01:20:32.260 --> 01:20:34.060 So counts are also not really a good thing 01:20:34.260 --> 01:20:36.060 to output from a neural net. 01:20:36.260 --> 01:20:38.060 So instead, what the neural net 01:20:38.260 --> 01:20:39.060 is going to output 01:20:39.260 --> 01:20:41.060 and how we are going to interpret 01:20:41.260 --> 01:20:43.060 the 27 numbers 01:20:43.260 --> 01:20:45.060 is that these 27 numbers 01:20:45.260 --> 01:20:48.060 are giving us log counts, basically. 
01:20:48.060 --> 01:20:52.860 So instead of giving us counts directly, 01:20:53.060 --> 01:20:53.860 like in this table, 01:20:54.060 --> 01:20:55.860 they're giving us log counts. 01:20:56.060 --> 01:20:57.060 And to get the counts, 01:20:57.260 --> 01:20:58.860 we're going to take the log counts 01:20:59.060 --> 01:21:00.860 and we're going to exponentiate them. 01:21:01.060 --> 01:21:05.860 Now, exponentiation takes the following form. 01:21:06.060 --> 01:21:09.860 It takes numbers that are negative 01:21:10.060 --> 01:21:10.860 or they are positive. 01:21:11.060 --> 01:21:12.860 It takes the entire real line. 01:21:13.060 --> 01:21:14.860 And then if you plug in negative numbers, 01:21:15.060 --> 01:21:16.860 you're going to get e to the x, 01:21:16.860 --> 01:21:19.660 which is always below one. 01:21:19.860 --> 01:21:22.660 So you're getting numbers lower than one. 01:21:22.860 --> 01:21:25.660 And if you plug in numbers greater than zero, 01:21:25.860 --> 01:21:27.660 you're getting numbers greater than one 01:21:27.860 --> 01:21:30.660 all the way growing to the infinity. 01:21:30.860 --> 01:21:32.660 And this here grows to zero. 01:21:32.860 --> 01:21:34.660 So basically, we're going to 01:21:34.860 --> 01:21:39.660 take these numbers here 01:21:39.860 --> 01:21:43.660 and instead of them being positive 01:21:43.860 --> 01:21:45.660 and negative in all their place, 01:21:45.660 --> 01:21:48.460 we're going to interpret them as log counts. 01:21:48.660 --> 01:21:50.460 And then we're going to element-wise 01:21:50.660 --> 01:21:52.460 exponentiate these numbers. 01:21:52.660 --> 01:21:55.460 Exponentiating them now gives us something like this. 01:21:55.660 --> 01:21:57.460 And you see that these numbers now, 01:21:57.660 --> 01:21:59.460 because they went through an exponent, 01:21:59.660 --> 01:22:02.460 all the negative numbers turned into numbers below one, 01:22:02.660 --> 01:22:04.460 like 0.338. 01:22:04.660 --> 01:22:06.460 And all the positive numbers, originally, 01:22:06.660 --> 01:22:08.460 turned into even more positive numbers, 01:22:08.660 --> 01:22:10.460 sort of greater than one. 01:22:10.660 --> 01:22:12.460 So like, for example, 01:22:12.660 --> 01:22:14.460 seven 01:22:14.460 --> 01:22:18.260 is some positive number over here 01:22:18.460 --> 01:22:20.260 that is greater than zero. 01:22:20.460 --> 01:22:24.260 But exponentiated outputs here 01:22:24.460 --> 01:22:27.260 basically give us something that we can use and interpret 01:22:27.460 --> 01:22:30.260 as the equivalent of counts originally. 01:22:30.460 --> 01:22:32.260 So you see these counts here? 01:22:32.460 --> 01:22:35.260 1, 12, 7, 51, 1, etc. 01:22:35.460 --> 01:22:39.260 The neural net is kind of now predicting 01:22:39.460 --> 01:22:41.260 counts. 01:22:41.460 --> 01:22:44.260 And these counts are positive numbers. 01:22:44.460 --> 01:22:47.260 They're probably below zero, so that makes sense. 01:22:47.460 --> 01:22:50.260 And they can now take on various values 01:22:50.460 --> 01:22:54.260 depending on the settings of W. 01:22:54.460 --> 01:22:56.260 So let me break this down. 01:22:56.460 --> 01:23:01.260 We're going to interpret these to be the log counts. 01:23:01.460 --> 01:23:03.260 In other words for this, that is often used, 01:23:03.460 --> 01:23:05.260 is so-called logits. 01:23:05.460 --> 01:23:08.260 These are logits, log counts. 01:23:08.460 --> 01:23:11.260 And these will be sort of the counts. 01:23:11.460 --> 01:23:13.260 Logits exponentiated. 
01:23:13.260 --> 01:23:16.060 And this is equivalent to the n matrix, 01:23:16.260 --> 01:23:20.060 sort of the n array that we used previously. 01:23:20.260 --> 01:23:22.060 Remember this was the n? 01:23:22.260 --> 01:23:24.060 This is the array of counts. 01:23:24.260 --> 01:23:32.060 And each row here are the counts for the next character, sort of. 01:23:32.260 --> 01:23:34.060 So those are the counts. 01:23:34.260 --> 01:23:39.060 And now the probabilities are just the counts normalized. 01:23:39.260 --> 01:23:43.060 And so I'm not going to find the same, 01:23:43.060 --> 01:23:45.860 but basically I'm not going to scroll all over the place. 01:23:46.060 --> 01:23:47.860 We've already done this. 01:23:48.060 --> 01:23:51.860 We want to counts.sum along the first dimension. 01:23:52.060 --> 01:23:54.860 And we want to keep dims as true. 01:23:55.060 --> 01:23:56.860 We've went over this. 01:23:57.060 --> 01:23:59.860 And this is how we normalize the rows of our counts matrix 01:24:00.060 --> 01:24:02.860 to get our probabilities. 01:24:03.060 --> 01:24:04.860 Props. 01:24:05.060 --> 01:24:07.860 So now these are the probabilities. 01:24:08.060 --> 01:24:10.860 And these are the counts that we have currently. 01:24:10.860 --> 01:24:13.660 And now when I show the probabilities, 01:24:13.860 --> 01:24:18.660 you see that every row here, of course, 01:24:18.860 --> 01:24:22.660 will sum to one because they're normalized. 01:24:22.860 --> 01:24:26.660 And the shape of this is 5 by 27. 01:24:26.860 --> 01:24:29.660 And so really what we've achieved is 01:24:29.860 --> 01:24:31.660 for every one of our five examples, 01:24:31.860 --> 01:24:34.660 we now have a row that came out of a neural net. 01:24:34.860 --> 01:24:37.660 And because of the transformations here, 01:24:37.860 --> 01:24:40.660 we made sure that this output of this neural net now 01:24:40.660 --> 01:24:42.460 can be interpreted to be probabilities 01:24:42.660 --> 01:24:45.460 or we can interpret to be probabilities. 01:24:45.660 --> 01:24:48.460 So our WX here gave us logits. 01:24:48.660 --> 01:24:51.460 And then we interpret those to be log counts. 01:24:51.660 --> 01:24:54.460 We exponentiate to get something that looks like counts. 01:24:54.660 --> 01:24:56.460 And then we normalize those counts 01:24:56.660 --> 01:24:58.460 to get a probability distribution. 01:24:58.660 --> 01:25:00.460 And all of these are differentiable operations. 01:25:00.660 --> 01:25:03.460 So what we've done now is we are taking inputs. 01:25:03.660 --> 01:25:05.460 We have differentiable operations 01:25:05.660 --> 01:25:07.460 that we can back propagate through. 01:25:07.660 --> 01:25:09.460 And we're getting out probability distributions. 01:25:09.460 --> 01:25:14.260 So for example, for the zeroth example that fed in, 01:25:14.460 --> 01:25:18.260 which was the zeroth example here, 01:25:18.460 --> 01:25:20.260 was a one-hot vector of zero. 01:25:20.460 --> 01:25:27.260 And it basically corresponded to feeding in this example here. 01:25:27.460 --> 01:25:30.260 So we're feeding in a dot into a neural net. 01:25:30.460 --> 01:25:32.260 And the way we fed the dot into a neural net 01:25:32.460 --> 01:25:34.260 is that we first got its index. 01:25:34.460 --> 01:25:36.260 Then we one-hot encoded it. 01:25:36.460 --> 01:25:38.260 Then it went into the neural net. 01:25:38.260 --> 01:25:43.060 And out came this distribution of probabilities. 01:25:43.260 --> 01:25:47.060 And its shape is 27. 01:25:47.260 --> 01:25:49.060 There's 27 numbers. 
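Putting that interpretation into code: logits, exponentiate to get count-like numbers, then normalize the rows. The seed here is an assumption, only so the run is reproducible.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)   # assumed seed, for reproducibility only
xenc = F.one_hot(torch.tensor([0, 5, 13, 13, 1]), num_classes=27).float()
W = torch.randn((27, 27), generator=g)

logits = xenc @ W                               # interpret these as log counts
counts = logits.exp()                           # strictly positive, count-like numbers
probs = counts / counts.sum(1, keepdim=True)    # normalize each row

print(probs.shape)                              # torch.Size([5, 27])
print(probs[0].sum())                           # tensor(1.): each row is a probability distribution
```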
01:25:49.260 --> 01:25:52.060 And we're going to interpret this as the neural net's assignment 01:25:52.260 --> 01:25:56.060 for how likely every one of these characters, 01:25:56.260 --> 01:25:59.060 the 27 characters, are to come next. 01:25:59.260 --> 01:26:02.060 And as we tune the weights W, 01:26:02.260 --> 01:26:05.060 we're going to be, of course, getting different probabilities out 01:26:05.260 --> 01:26:07.060 for any character that you input. 01:26:07.060 --> 01:26:08.860 And so now the question is just, 01:26:09.060 --> 01:26:10.860 can we optimize and find a good W 01:26:11.060 --> 01:26:13.860 such that the probabilities coming out are pretty good? 01:26:14.060 --> 01:26:16.860 And the way we measure pretty good is by the loss function. 01:26:17.060 --> 01:26:18.860 Okay, so I organized everything into a single summary 01:26:19.060 --> 01:26:20.860 so that hopefully it's a bit more clear. 01:26:21.060 --> 01:26:21.860 So it starts here. 01:26:22.060 --> 01:26:23.860 We have an input data set. 01:26:24.060 --> 01:26:25.860 We have some inputs to the neural net. 01:26:26.060 --> 01:26:29.860 And we have some labels for the correct next character in a sequence. 01:26:30.060 --> 01:26:31.860 And these are integers. 01:26:32.060 --> 01:26:34.860 Here I'm using torch generators now 01:26:35.060 --> 01:26:36.860 so that you see the same numbers 01:26:37.060 --> 01:26:37.860 that I see. 01:26:38.060 --> 01:26:41.860 And I'm generating 27 neurons' weights. 01:26:42.060 --> 01:26:47.860 And each neuron here receives 27 inputs. 01:26:48.060 --> 01:26:50.860 Then here we're going to plug in all the input examples, 01:26:51.060 --> 01:26:52.860 x's, into a neural net. 01:26:53.060 --> 01:26:54.860 So here, this is a forward pass. 01:26:55.060 --> 01:26:57.860 First, we have to encode all of the inputs 01:26:58.060 --> 01:26:59.860 into one-hot representations. 01:27:00.060 --> 01:27:01.860 So we have 27 classes. 01:27:02.060 --> 01:27:03.860 We pass in these integers. 01:27:04.060 --> 01:27:06.860 And xinc becomes an array 01:27:07.060 --> 01:27:08.860 that is 5 by 27. 01:27:09.060 --> 01:27:11.860 Zeros except for a few ones. 01:27:12.060 --> 01:27:14.860 We then multiply this in the first layer of a neural net 01:27:15.060 --> 01:27:16.860 to get logits. 01:27:17.060 --> 01:27:19.860 Exponentiate the logits to get fake counts, sort of. 01:27:20.060 --> 01:27:23.860 And normalize these counts to get probabilities. 01:27:24.060 --> 01:27:26.860 So these last two lines, by the way, here 01:27:27.060 --> 01:27:29.860 are called the softmax, 01:27:30.060 --> 01:27:31.860 which I pulled up here. 01:27:32.060 --> 01:27:35.860 Softmax is a very often used layer in a neural net 01:27:35.860 --> 01:27:38.660 that takes these z's, which are logits, 01:27:38.860 --> 01:27:40.660 exponentiates them, 01:27:40.860 --> 01:27:42.660 and divides and normalizes. 01:27:42.860 --> 01:27:45.660 It's a way of taking outputs of a neural net layer. 01:27:45.860 --> 01:27:48.660 And these outputs can be positive or negative. 01:27:48.860 --> 01:27:51.660 And it outputs probability distributions. 01:27:51.860 --> 01:27:54.660 It outputs something that is always 01:27:54.860 --> 01:27:56.660 sums to one and are positive numbers, 01:27:56.860 --> 01:27:58.660 just like probabilities. 01:27:58.860 --> 01:28:00.660 So it's kind of like a normalization function 01:28:00.860 --> 01:28:02.660 if you want to think of it that way. 
01:28:02.860 --> 01:28:04.660 And you can put it on top of any other linear layer 01:28:04.660 --> 01:28:05.460 inside a neural net. 01:28:05.660 --> 01:28:08.460 And it basically makes a neural net output probabilities 01:28:08.660 --> 01:28:10.460 that's very often used. 01:28:10.660 --> 01:28:13.460 And we used it as well here. 01:28:13.660 --> 01:28:14.460 So this is the forward pass, 01:28:14.660 --> 01:28:17.460 and that's how we made a neural net output probability. 01:28:17.660 --> 01:28:22.460 Now, you'll notice that 01:28:22.660 --> 01:28:25.460 all of these, this entire forward pass 01:28:25.660 --> 01:28:27.460 is made up of differentiable layers. 01:28:27.660 --> 01:28:30.460 Everything here we can backpropagate through. 01:28:30.660 --> 01:28:33.460 And we saw some of the backpropagation in micrograd. 01:28:33.460 --> 01:28:36.260 This is just multiplication and addition. 01:28:36.460 --> 01:28:38.260 All that's happening here is just multiply and add. 01:28:38.460 --> 01:28:40.260 And we know how to backpropagate through them. 01:28:40.460 --> 01:28:43.260 Exponentiation, we know how to backpropagate through. 01:28:43.460 --> 01:28:46.260 And then here, we are summing. 01:28:46.460 --> 01:28:49.260 And sum is easily backpropagatable as well. 01:28:49.460 --> 01:28:51.260 And division as well. 01:28:51.460 --> 01:28:54.260 So everything here is a differentiable operation. 01:28:54.460 --> 01:28:57.260 And we can backpropagate through. 01:28:57.460 --> 01:28:59.260 Now, we achieve these probabilities, 01:28:59.460 --> 01:29:01.260 which are 5 by 27. 01:29:01.460 --> 01:29:03.260 For every single example, 01:29:03.260 --> 01:29:06.060 we have a vector of probabilities that sum to 1. 01:29:06.260 --> 01:29:08.060 And then here, I wrote a bunch of stuff 01:29:08.260 --> 01:29:11.060 to sort of like break down the examples. 01:29:11.260 --> 01:29:16.060 So we have 5 examples making up Emma, right? 01:29:16.260 --> 01:29:20.060 And there are 5 bigrams inside Emma. 01:29:20.260 --> 01:29:23.060 So bigram example 1 01:29:23.260 --> 01:29:26.060 is that E is the beginning character 01:29:26.260 --> 01:29:28.060 right after dot. 01:29:28.260 --> 01:29:31.060 And the indexes for these are 0 and 5. 01:29:31.260 --> 01:29:33.060 So then we feed in a 0 01:29:33.260 --> 01:29:36.060 that's the input to the neural net. 01:29:36.260 --> 01:29:38.060 We get probabilities from the neural net 01:29:38.260 --> 01:29:41.060 that are 27 numbers. 01:29:41.260 --> 01:29:43.060 And then the label is 5 01:29:43.260 --> 01:29:46.060 because E actually comes after dot. 01:29:46.260 --> 01:29:48.060 So that's the label. 01:29:48.260 --> 01:29:51.060 And then we use this label 5 01:29:51.260 --> 01:29:54.060 to index into the probability distribution here. 01:29:54.260 --> 01:29:57.060 So this index 5 here 01:29:57.260 --> 01:30:00.060 is 0, 1, 2, 3, 4, 5. 01:30:00.260 --> 01:30:02.060 It's this number here, 01:30:02.060 --> 01:30:03.860 and this number here. 01:30:04.060 --> 01:30:05.860 So that's basically the probability 01:30:06.060 --> 01:30:06.860 assigned by the neural net 01:30:07.060 --> 01:30:08.860 to the actual correct character. 01:30:09.060 --> 01:30:10.860 You see that the network currently thinks 01:30:11.060 --> 01:30:11.860 that this next character, 01:30:12.060 --> 01:30:13.860 that E following dot, 01:30:14.060 --> 01:30:15.860 is only 1% likely, 01:30:16.060 --> 01:30:17.860 which is of course not very good, right? 
01:30:18.060 --> 01:30:19.860 Because this actually is a training example, 01:30:20.060 --> 01:30:21.860 and the network thinks that this is currently 01:30:22.060 --> 01:30:22.860 very, very unlikely. 01:30:23.060 --> 01:30:24.860 But that's just because we didn't get very lucky 01:30:25.060 --> 01:30:26.860 in generating a good setting of W. 01:30:27.060 --> 01:30:29.860 So right now this network thinks this is unlikely, 01:30:30.060 --> 01:30:31.860 and 0.01 is not a good outcome. 01:30:32.060 --> 01:30:33.860 So the log likelihood then 01:30:34.060 --> 01:30:35.860 is very negative. 01:30:36.060 --> 01:30:38.860 And the negative log likelihood is very positive. 01:30:39.060 --> 01:30:42.860 And so 4 is a very high negative log likelihood, 01:30:43.060 --> 01:30:44.860 and that means we're going to have a high loss. 01:30:45.060 --> 01:30:46.860 Because what is the loss? 01:30:47.060 --> 01:30:49.860 The loss is just the average negative log likelihood. 01:30:51.060 --> 01:30:53.860 So the second character is E . 01:30:54.060 --> 01:30:55.860 And you see here that also the network thought 01:30:56.060 --> 01:30:58.860 that M following E is very unlikely, 1%. 01:30:58.860 --> 01:31:03.660 For M following M, it thought it was 2%. 01:31:03.860 --> 01:31:05.660 And for A following M, 01:31:05.860 --> 01:31:07.660 it actually thought it was 7% likely. 01:31:07.860 --> 01:31:09.660 So just by chance, 01:31:09.860 --> 01:31:11.660 this one actually has a pretty good probability, 01:31:11.860 --> 01:31:14.660 and therefore a pretty low negative log likelihood. 01:31:14.860 --> 01:31:17.660 And finally here, it thought this was 1% likely. 01:31:17.860 --> 01:31:20.660 So overall, our average negative log likelihood, 01:31:20.860 --> 01:31:21.660 which is the loss, 01:31:21.860 --> 01:31:24.660 the total loss that summarizes basically 01:31:24.860 --> 01:31:26.660 how well this network currently works, 01:31:26.860 --> 01:31:28.660 at least on this one word, 01:31:28.860 --> 01:31:30.660 not on the full data set, just the one word, 01:31:30.860 --> 01:31:31.660 is 3.76, 01:31:31.860 --> 01:31:33.660 which is actually a fairly high loss. 01:31:33.860 --> 01:31:36.660 This is not a very good setting of Ws. 01:31:36.860 --> 01:31:38.660 Now here's what we can do. 01:31:38.860 --> 01:31:40.660 We're currently getting 3.76. 01:31:40.860 --> 01:31:43.660 We can actually come here and we can change our W. 01:31:43.860 --> 01:31:45.660 We can resample it. 01:31:45.860 --> 01:31:48.660 So let me just add one to have a different seed. 01:31:48.860 --> 01:31:50.660 And then we get a different W. 01:31:50.860 --> 01:31:52.660 And then we can rerun this. 01:31:52.860 --> 01:31:54.660 And with this different seed, 01:31:54.860 --> 01:31:56.660 with this different setting of Ws, 01:31:56.860 --> 01:31:58.660 we now get 3.37. 01:31:58.860 --> 01:32:00.660 So this is a much better W, right? 01:32:00.860 --> 01:32:02.660 And it's better because the probabilities 01:32:02.860 --> 01:32:05.660 just happen to come out higher 01:32:05.860 --> 01:32:08.660 for the characters that actually are next. 01:32:08.860 --> 01:32:11.660 And so you can imagine actually just resampling this. 01:32:11.860 --> 01:32:14.660 We can try 2. 01:32:14.860 --> 01:32:16.660 Okay, this was not very good. 01:32:16.860 --> 01:32:18.660 Let's try one more. 01:32:18.860 --> 01:32:20.660 We can try 3. 01:32:20.860 --> 01:32:22.660 Okay, this was a terrible setting 01:32:22.860 --> 01:32:24.660 because we have a very high loss. 
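To make the per-bigram breakdown above concrete, here is a small sketch continuing from the earlier forward-pass code (probs and ys as defined there). The numbers it prints depend on the random W, so they will only roughly match the 1%, 2%, 7% figures mentioned.

    for i in range(5):
        p = probs[i, ys[i]]   # probability the net currently assigns to the correct next character
        print(f'bigram {i}: p(correct) = {p.item():.4f}, nll = {-p.log().item():.4f}')

    # the loss is the average negative log likelihood over all examples
    nlls = [-probs[i, ys[i]].log().item() for i in range(5)]
    print('average nll (the loss):', sum(nlls) / 5)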
01:32:24.860 --> 01:32:27.660 So anyway, I'm going to erase this. 01:32:28.860 --> 01:32:30.660 What I'm doing here, 01:32:30.860 --> 01:32:32.660 which is just guess and check 01:32:32.860 --> 01:32:34.660 of randomly assigning parameters 01:32:34.860 --> 01:32:36.660 and seeing if the network is good, 01:32:36.860 --> 01:32:38.660 that is amateur hour. 01:32:38.860 --> 01:32:40.660 That's not how you optimize a neural net. 01:32:40.860 --> 01:32:42.660 The way you optimize a neural net 01:32:42.860 --> 01:32:44.660 is you start with some random guess 01:32:44.860 --> 01:32:46.660 and we're going to commit to this one, 01:32:46.860 --> 01:32:48.660 even though it's not very good. 01:32:48.860 --> 01:32:50.660 But now the big deal is we have a loss function. 01:32:50.860 --> 01:32:53.660 So this loss is made up only of differentiable operations. 01:32:53.860 --> 01:32:56.660 And we can minimize the loss by tuning Ws 01:32:56.660 --> 01:33:00.460 by computing the gradients of the loss 01:33:00.660 --> 01:33:03.460 with respect to these W matrices. 01:33:03.660 --> 01:33:06.460 And so then we can tune W to minimize the loss 01:33:06.660 --> 01:33:08.460 and find a good setting of W 01:33:08.660 --> 01:33:10.460 using gradient based optimization. 01:33:10.660 --> 01:33:12.460 So let's see how that will work. 01:33:12.660 --> 01:33:14.460 Now things are actually going to look 01:33:14.660 --> 01:33:16.460 almost identical to what we had with micrograd. 01:33:16.660 --> 01:33:20.460 So here I pulled up the lecture from micrograd, 01:33:20.660 --> 01:33:22.460 the notebook that's from this repository. 01:33:22.660 --> 01:33:24.460 And when I scroll all the way to the end 01:33:24.660 --> 01:33:26.460 where we left off with micrograd, 01:33:26.460 --> 01:33:28.260 we had something very, very similar. 01:33:28.460 --> 01:33:30.260 We had a number of input examples. 01:33:30.460 --> 01:33:33.260 In this case, we had four input examples inside Xs. 01:33:33.460 --> 01:33:37.260 And we had their targets, desired targets. 01:33:37.460 --> 01:33:39.260 Just like here, we have our Xs now, 01:33:39.460 --> 01:33:40.260 but we have five of them. 01:33:40.460 --> 01:33:43.260 And they're now integers instead of vectors. 01:33:43.460 --> 01:33:46.260 But we're going to convert our integers to vectors, 01:33:46.460 --> 01:33:49.260 except our vectors will be 27 large 01:33:49.460 --> 01:33:51.260 instead of three large. 01:33:51.460 --> 01:33:54.260 And then here what we did is first we did a forward pass 01:33:54.460 --> 01:33:56.260 where we ran the neural net 01:33:56.260 --> 01:34:00.060 on all of the inputs to get predictions. 01:34:00.260 --> 01:34:02.060 Our neural net at the time, this n(x), 01:34:02.260 --> 01:34:05.060 was a multi-layer perceptron. 01:34:05.260 --> 01:34:07.060 Our neural net is going to look different 01:34:07.260 --> 01:34:10.060 because our neural net is just a single layer, 01:34:10.260 --> 01:34:13.060 single linear layer followed by a softmax. 01:34:13.260 --> 01:34:15.060 So that's our neural net. 01:34:15.260 --> 01:34:18.060 And the loss here was the mean squared error. 01:34:18.260 --> 01:34:20.060 So we simply subtracted the prediction 01:34:20.260 --> 01:34:22.060 from the ground truth and squared it 01:34:22.260 --> 01:34:23.060 and summed it all up. 01:34:23.260 --> 01:34:24.060 And that was the loss. 01:34:24.260 --> 01:34:26.060 And loss was the single number 01:34:26.060 --> 01:34:28.860 that summarized the quality of the neural net.
01:34:29.060 --> 01:34:31.860 And when loss is low, like almost zero, 01:34:32.060 --> 01:34:35.860 that means the neural net is predicting correctly. 01:34:36.060 --> 01:34:37.860 So we had a single number 01:34:38.060 --> 01:34:41.860 that summarized the performance of the neural net. 01:34:42.060 --> 01:34:43.860 And everything here was differentiable 01:34:44.060 --> 01:34:46.860 and was stored in a massive compute graph. 01:34:47.060 --> 01:34:49.860 And then we iterated over all the parameters. 01:34:50.060 --> 01:34:51.860 We made sure that the gradients are set to zero. 01:34:52.060 --> 01:34:53.860 And we called loss.backward. 01:34:54.060 --> 01:34:55.860 And loss.backward 01:34:55.860 --> 01:34:57.660 initiated backpropagation 01:34:57.860 --> 01:34:59.660 at the final output node of loss. 01:34:59.860 --> 01:35:01.660 So remember these expressions? 01:35:01.860 --> 01:35:03.660 We had loss all the way at the end. 01:35:03.860 --> 01:35:06.660 We start backpropagation and we went all the way back. 01:35:06.860 --> 01:35:08.660 And we made sure that we populated 01:35:08.860 --> 01:35:10.660 all the parameters' .grad. 01:35:10.860 --> 01:35:12.660 So .grad started at zero, 01:35:12.860 --> 01:35:14.660 but backpropagation filled it in. 01:35:14.860 --> 01:35:15.660 And then in the update, 01:35:15.860 --> 01:35:17.660 we iterated over all the parameters 01:35:17.860 --> 01:35:19.660 and we simply did a parameter update 01:35:19.860 --> 01:35:23.660 where every single element of our parameters 01:35:23.660 --> 01:35:27.460 was nudged in the opposite direction of the gradient. 01:35:27.660 --> 01:35:31.660 And so we're going to do the exact same thing here. 01:35:31.860 --> 01:35:38.460 So I'm going to pull this up on the side here 01:35:38.660 --> 01:35:39.860 so that we have it available. 01:35:40.060 --> 01:35:42.060 And we're actually going to do the exact same thing. 01:35:42.260 --> 01:35:44.060 So this was the forward pass. 01:35:44.260 --> 01:35:46.860 So we did this. 01:35:47.060 --> 01:35:48.860 And probs is our ypred. 01:35:49.060 --> 01:35:50.460 So now we have to evaluate the loss, 01:35:50.660 --> 01:35:52.460 but we're not using the mean squared error. 01:35:52.460 --> 01:35:54.060 We're using the negative log likelihood 01:35:54.260 --> 01:35:55.460 because we are doing classification. 01:35:55.660 --> 01:35:58.860 We're not doing regression, as it's called. 01:35:59.060 --> 01:36:02.260 So here we want to calculate loss. 01:36:02.460 --> 01:36:04.460 Now, the way we calculate it is just 01:36:04.660 --> 01:36:07.060 this average negative log likelihood. 01:36:07.260 --> 01:36:10.580 Now, this probs here 01:36:10.780 --> 01:36:13.140 has a shape of five by twenty-seven. 01:36:13.340 --> 01:36:14.860 And so to get at that, 01:36:15.060 --> 01:36:17.540 we basically want to pluck out the probabilities 01:36:17.740 --> 01:36:19.940 at the correct indices here. 01:36:20.140 --> 01:36:22.260 So in particular, because the labels are 01:36:22.460 --> 01:36:26.340 stored here in the array ys, basically what we're after is, for the first 01:36:26.540 --> 01:36:30.820 example, we're looking at the probability of five, right at index five. 01:36:31.020 --> 01:36:36.100 For the second example, at the second row, or row index one, 01:36:36.300 --> 01:36:40.140 we are interested in the probability assigned to index 13. 01:36:40.340 --> 01:36:43.300 At the third example, we also have 13. 01:36:43.500 --> 01:36:47.260 At the fourth row, we want one.
01:36:47.460 --> 01:36:51.140 And at the last row, which is four, we want zero. 01:36:51.340 --> 01:36:52.460 So these are the probabilities 01:36:52.660 --> 01:36:53.940 we're interested in. 01:36:54.140 --> 01:36:58.580 And you can see that they're not amazing, as we saw above. 01:36:58.780 --> 01:37:00.100 So these are the probabilities we want, 01:37:00.300 --> 01:37:04.380 but we want like a more efficient way to access these probabilities, 01:37:04.580 --> 01:37:06.940 not just listing them out in a tuple like this. 01:37:07.140 --> 01:37:09.180 So it turns out that the way to do this in PyTorch, 01:37:09.380 --> 01:37:15.140 one of the ways, at least, is we can basically pass in all of these, 01:37:16.820 --> 01:37:19.580 sorry about that, all of these 01:37:19.780 --> 01:37:22.140 integers in tensors. 01:37:22.660 --> 01:37:27.020 So these ones, you see how they're just zero, one, two, three, four. 01:37:27.220 --> 01:37:32.740 We can actually create that using np, not np, sorry, torch.arange of five. 01:37:32.940 --> 01:37:34.300 Zero, one, two, three, four. 01:37:34.500 --> 01:37:38.180 So we can index here with torch.arange of five. 01:37:38.380 --> 01:37:41.060 And here we index with ys. 01:37:41.260 --> 01:37:45.540 And you see that that gives us exactly these numbers. 01:37:49.100 --> 01:37:51.780 So that plucks out the probabilities 01:37:51.780 --> 01:37:56.140 that the neural network assigns to the correct next character. 01:37:56.340 --> 01:37:59.700 Now we take those probabilities, and we don't look at them directly, we actually look at the log 01:37:59.900 --> 01:38:03.340 probability, so we want to take .log 01:38:03.540 --> 01:38:06.620 and then we want to just average that up. 01:38:06.820 --> 01:38:09.100 So take the mean of all of that, and then 01:38:09.300 --> 01:38:14.100 it's the negative average log likelihood that is the loss. 01:38:14.300 --> 01:38:17.860 So the loss here is three point seven something. 01:38:18.060 --> 01:38:21.780 And you see that this loss, three point seven six, three point seven six, is 01:38:21.980 --> 01:38:26.300 exactly as we've obtained before, but this is a vectorized form of that expression. 01:38:26.500 --> 01:38:32.900 So we get the same loss, and the same loss we can consider sort of as part of this 01:38:33.100 --> 01:38:36.180 forward pass, and we've achieved here now loss. 01:38:36.380 --> 01:38:38.380 OK, so we made our way all the way to loss. 01:38:38.580 --> 01:38:39.900 We've defined the forward pass. 01:38:40.100 --> 01:38:42.100 We forwarded the network and the loss. 01:38:42.300 --> 01:38:44.180 Now we're ready to do the backward pass. 01:38:44.380 --> 01:38:46.420 So backward pass. 01:38:48.100 --> 01:38:50.780 We want to first make sure that all the gradients are reset. 01:38:50.980 --> 01:38:51.580 So they're at zero. 01:38:51.980 --> 01:38:55.980 Now, in PyTorch, you can set the gradients to be zero, 01:38:56.180 --> 01:38:59.940 but you can also just set them to None, and setting them to None is more efficient. 01:39:00.140 --> 01:39:05.300 And PyTorch will interpret None as a lack of a gradient, which is the same as zeros. 01:39:05.500 --> 01:39:09.500 So this is a way to set the gradient to zero. 01:39:09.700 --> 01:39:13.700 And now we do loss.backward. 01:39:13.900 --> 01:39:16.900 Before we do loss.backward, we need one more thing.
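A compact sketch of the vectorized loss and the gradient reset just described, continuing from the tensors defined in the earlier sketch; the manual version is included only to confirm the fancy indexing picks out the same numbers.

    # the probabilities at the correct indices, picked out two equivalent ways
    manual = torch.stack([probs[0, 5], probs[1, 13], probs[2, 13], probs[3, 1], probs[4, 0]])
    fancy = probs[torch.arange(5), ys]          # index rows 0..4 and, within each row, column ys
    print(torch.allclose(manual, fancy))        # True

    loss = -fancy.log().mean()                  # average negative log likelihood
    print(loss.item())                          # roughly 3.7 for this random W

    W.grad = None                               # reset the gradient; None acts like zeros but is more efficient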
01:39:17.100 --> 01:39:20.780 If you remember from micrograd, PyTorch actually requires 01:39:20.780 --> 01:39:25.020 that we pass in requires_grad=True 01:39:25.220 --> 01:39:29.740 so that we tell PyTorch that we are interested in calculating gradients 01:39:29.940 --> 01:39:33.340 for this leaf tensor. By default, this is False. 01:39:33.540 --> 01:39:40.340 So let me recalculate with that and then set to none and loss.backward. 01:39:40.740 --> 01:39:44.260 Now, something magical happened when loss.backward was run, 01:39:44.460 --> 01:39:49.900 because PyTorch, just like micrograd, when we did the forward pass here, it keeps 01:39:49.900 --> 01:39:52.140 track of all the operations under the hood. 01:39:52.340 --> 01:39:54.620 It builds a full computational graph, 01:39:54.820 --> 01:39:57.660 just like the graphs we produced in micrograd. 01:39:57.860 --> 01:40:00.580 Those graphs exist inside PyTorch. 01:40:00.780 --> 01:40:02.740 And so it knows all the dependencies 01:40:02.740 --> 01:40:04.860 and all the mathematical operations of everything. 01:40:05.060 --> 01:40:09.380 And when you then calculate the loss, we can call .backward() on it. 01:40:09.580 --> 01:40:15.460 And .backward() then fills in the gradients of all the intermediates all 01:40:15.660 --> 01:40:19.740 the way back to w's, which are the parameters of our neural net. 01:40:20.020 --> 01:40:23.780 So now we can do w.grad and we see that it has structure. 01:40:23.980 --> 01:40:25.980 There's stuff inside it. 01:40:29.100 --> 01:40:33.260 And these gradients, every single element here, 01:40:33.460 --> 01:40:40.460 so w.shape is 27 by 27, w.grad's shape is the same, 27 by 27. 01:40:40.660 --> 01:40:48.540 And every element of w.grad is telling us the influence of that weight on the loss function. 01:40:48.740 --> 01:40:49.540 So, for example, 01:40:49.540 --> 01:40:55.380 this number all the way here, this element, the (0, 0) element of w: 01:40:55.580 --> 01:41:00.100 because the gradient is positive, it's telling us that this has a positive 01:41:00.300 --> 01:41:06.780 influence on the loss. Slightly nudging w, slightly taking w[0, 0] 01:41:06.980 --> 01:41:12.300 and adding a small h to it, would increase the loss 01:41:12.500 --> 01:41:15.580 mildly, because this gradient is positive. 01:41:15.780 --> 01:41:18.460 Some of these gradients are also negative. 01:41:18.660 --> 01:41:19.500 So that's telling us 01:41:19.700 --> 01:41:21.140 about the gradient information. 01:41:21.340 --> 01:41:23.220 And we can use this gradient information 01:41:23.420 --> 01:41:26.580 to update the weights of this neural network. 01:41:26.780 --> 01:41:28.140 So let's now do the update. 01:41:28.340 --> 01:41:30.660 It's going to be very similar to what we had in micrograd. 01:41:30.860 --> 01:41:33.420 We don't need a loop over all the parameters, 01:41:33.620 --> 01:41:37.020 because we only have one parameter tensor and that is w. 01:41:37.220 --> 01:41:42.060 So we simply do w.data plus equals. 01:41:42.260 --> 01:41:48.300 We can actually copy this almost exactly: negative 0.1 times w.grad. 01:41:49.700 --> 01:41:54.420 And that would be the update to the tensor. 01:41:54.620 --> 01:41:58.500 So that updates the tensor. 01:41:58.700 --> 01:42:00.980 And because the tensor is updated, 01:42:01.180 --> 01:42:04.140 we would expect that now the loss should decrease. 01:42:04.340 --> 01:42:09.380 So here, if I print loss, 01:42:09.580 --> 01:42:11.100 that is, loss.item(), 01:42:11.300 --> 01:42:12.980 it was 3.76, right?
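Putting the backward pass and the update just described into one place, a minimal sketch continuing from the earlier code; the 0.1 learning rate is the value used above.

    # re-create W as a leaf tensor that tracks gradients
    W = torch.randn((27, 27), generator=g, requires_grad=True)

    # forward pass
    xenc = F.one_hot(xs, num_classes=27).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)
    loss = -probs[torch.arange(5), ys].log().mean()

    # backward pass
    W.grad = None        # reset the gradient
    loss.backward()      # PyTorch fills in W.grad using the graph it built during the forward pass

    # update: nudge the weights against the gradient
    W.data += -0.1 * W.grad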
01:42:13.180 --> 01:42:15.820 So we've updated the w here. 01:42:16.020 --> 01:42:18.900 So if I recalculate forward pass, 01:42:18.900 --> 01:42:21.260 the loss now should be slightly lower. 01:42:21.460 --> 01:42:25.540 So 3.76 goes to 3.74. 01:42:25.740 --> 01:42:32.380 And then we can again set grad to none and backward, update. 01:42:32.580 --> 01:42:34.740 And now the parameters changed again. 01:42:34.940 --> 01:42:41.900 So if we recalculate the forward pass, we expect a lower loss again, 3.72. 01:42:42.260 --> 01:42:47.660 OK, and this is, again, we're now doing gradient descent. 01:42:47.660 --> 01:42:50.220 And when we achieve a low loss, 01:42:50.420 --> 01:42:55.140 that will mean that the network is assigning high probabilities to the correct next characters. 01:42:55.340 --> 01:42:59.340 OK, so I rearranged everything and I put it all together from scratch. 01:42:59.540 --> 01:43:03.220 So here is where we construct our data set of bigrams. 01:43:03.420 --> 01:43:06.860 You see that we are still iterating only over the first word, Emma. 01:43:07.060 --> 01:43:08.980 I'm going to change that in a second. 01:43:09.180 --> 01:43:13.380 I added a number that counts the number of elements in Xs 01:43:13.580 --> 01:43:16.820 so that we explicitly see that the number of examples is five, 01:43:16.820 --> 01:43:20.420 because currently we're just working with Emma and there's five bigrams there. 01:43:20.620 --> 01:43:23.500 And here I added a loop of exactly what we had before. 01:43:23.700 --> 01:43:28.780 So we had ten iterations of gradient descent of forward pass, backward pass and update. 01:43:28.980 --> 01:43:32.620 And so running these two cells, initialization and gradient descent, 01:43:32.820 --> 01:43:37.980 gives us some improvement on the loss function. 01:43:38.180 --> 01:43:41.460 But now I want to use all the words, 01:43:41.660 --> 01:43:46.380 and there's not five, but 228,000 bigrams now. 01:43:46.820 --> 01:43:49.460 However, this should require no modification whatsoever. 01:43:49.660 --> 01:43:52.900 Everything should just run, because all the code we wrote doesn't care if there's 01:43:53.100 --> 01:43:57.260 five bigrams or 228,000 bigrams, and everything should just work. 01:43:57.460 --> 01:44:00.260 So you see that this will just run. 01:44:00.460 --> 01:44:04.500 But now we are optimizing over the entire training set of all the bigrams. 01:44:04.700 --> 01:44:07.380 And you see now that we are decreasing very slightly. 01:44:07.580 --> 01:44:11.580 So actually, we can probably afford a larger learning rate. 01:44:12.460 --> 01:44:16.260 And probably afford an even larger learning rate. 01:44:16.820 --> 01:44:23.700 Even 50 seems to work on this very, very simple example, right? 01:44:23.900 --> 01:44:27.660 So let me re-initialize and let's run 100 iterations. 01:44:27.860 --> 01:44:30.060 See what happens. 01:44:30.260 --> 01:44:33.260 Okay. 01:44:33.460 --> 01:44:40.780 We seem to be coming up to some pretty good losses here. 01:44:40.980 --> 01:44:42.100 2.47. 01:44:42.300 --> 01:44:43.940 Let me run 100 more. 01:44:44.140 --> 01:44:46.660 What is the number that we expect, by the way, in the loss? 01:44:46.860 --> 01:44:50.700 We expect to get something around what we had originally, actually. 01:44:50.900 --> 01:44:54.500 So all the way back, if you remember in the beginning of this video, 01:44:54.700 --> 01:45:02.700 when we optimized just by counting, our loss was roughly 2.47 after we added smoothing.
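For reference, the full training loop sketched above might look something like this, continuing from the earlier code but assuming xs and ys are now built from all the words (about 228,000 bigrams); the learning rate of 50 and the 100 iterations are the values mentioned above.

    num = xs.nelement()                              # number of bigram examples
    W = torch.randn((27, 27), generator=g, requires_grad=True)

    for k in range(100):
        # forward pass
        xenc = F.one_hot(xs, num_classes=27).float()
        logits = xenc @ W
        counts = logits.exp()
        probs = counts / counts.sum(1, keepdim=True)
        loss = -probs[torch.arange(num), ys].log().mean()

        # backward pass
        W.grad = None
        loss.backward()

        # update
        W.data += -50 * W.grad

    print(loss.item())   # should come down to roughly the 2.45 to 2.47 range discussed here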
01:45:02.900 --> 01:45:09.020 But before smoothing, we had roughly 2.45 loss. 01:45:09.220 --> 01:45:13.420 And so that's actually roughly the vicinity of what we expect to achieve. 01:45:13.620 --> 01:45:15.700 But before we achieved it by counting. 01:45:15.900 --> 01:45:16.700 And here we are. 01:45:16.860 --> 01:45:20.820 We're achieving roughly the same result, but with gradient based optimization. 01:45:21.020 --> 01:45:26.140 So we come to about 2.46, 2.45, etc. 01:45:26.340 --> 01:45:27.860 And that makes sense because fundamentally, 01:45:27.860 --> 01:45:29.780 we're not taking in any additional information. 01:45:29.980 --> 01:45:31.460 We're still just taking in the previous 01:45:31.460 --> 01:45:33.460 character and trying to predict the next one. 01:45:33.660 --> 01:45:38.060 But instead of doing it explicitly by counting and normalizing, 01:45:38.260 --> 01:45:39.940 we are doing it with gradient based learning. 01:45:40.140 --> 01:45:42.060 And it just so happens that the explicit 01:45:42.260 --> 01:45:46.660 approach happens to very well optimize the loss function without any need 01:45:46.860 --> 01:45:50.180 for gradient based optimization, because the setup for bigram language 01:45:50.380 --> 01:45:54.500 models is so straightforward and so simple, we can just afford to estimate 01:45:54.700 --> 01:45:58.740 those probabilities directly and maintain them in a table. 01:45:58.940 --> 01:46:02.820 But the gradient based approach is significantly more flexible. 01:46:03.020 --> 01:46:06.540 So we've actually gained a lot because 01:46:06.740 --> 01:46:09.020 what we can do now is 01:46:09.220 --> 01:46:12.740 we can expand this approach and complexify the neural net. 01:46:12.940 --> 01:46:15.940 So currently we're just taking a single character and feeding into a neural net. 01:46:15.940 --> 01:46:17.660 And the neural net is extremely simple, 01:46:17.860 --> 01:46:20.300 but we're about to iterate on this substantially. 01:46:20.500 --> 01:46:23.820 We're going to be taking multiple previous characters and we're going 01:46:24.020 --> 01:46:27.340 to be feeding them into increasingly more complex neural nets. 01:46:27.540 --> 01:46:32.460 But fundamentally, the output of the neural net will always just be logits. 01:46:32.660 --> 01:46:35.340 And those logits will go through the exact same transformation. 01:46:35.540 --> 01:46:37.780 We are going to take them through a softmax, 01:46:37.980 --> 01:46:40.900 calculate the loss function and the negative log likelihood, 01:46:41.100 --> 01:46:45.860 and do gradient based optimization. And so actually, as we complexify, 01:46:46.060 --> 01:46:49.580 the neural nets and work all the way up to transformers, 01:46:49.780 --> 01:46:51.900 none of this will really fundamentally change. 01:46:51.980 --> 01:46:53.500 None of this will fundamentally change. 01:46:53.700 --> 01:46:57.300 The only thing that will change is the way we do the forward pass, 01:46:57.500 --> 01:47:01.180 where we take in some previous characters and calculate logits for the next 01:47:01.380 --> 01:47:04.900 character in a sequence that will become more complex. 01:47:05.100 --> 01:47:08.620 And we'll use the same machinery to optimize it. 
01:47:08.820 --> 01:47:10.300 And 01:47:10.700 --> 01:47:15.580 it's not obvious how we would have extended this bigram approach into 01:47:16.060 --> 01:47:19.100 a space where there are many more characters at the input, 01:47:19.300 --> 01:47:23.060 because eventually these tables would get way too large, because there's way too 01:47:23.260 --> 01:47:27.740 many combinations of what previous characters could be. 01:47:27.940 --> 01:47:29.540 If you only have one previous character, 01:47:29.740 --> 01:47:31.980 we can just keep everything in a table that counts. 01:47:32.180 --> 01:47:34.220 But if you have the last 10 characters 01:47:34.220 --> 01:47:37.300 that are input, we can't actually keep everything in the table anymore. 01:47:37.500 --> 01:47:39.700 So this is fundamentally an unscalable approach. 01:47:39.900 --> 01:47:42.900 And the neural network approach is significantly more scalable. 01:47:43.100 --> 01:47:45.820 And it's something that actually we can improve on 01:47:46.060 --> 01:47:48.380 over time. So that's where we will be digging next. 01:47:48.580 --> 01:47:50.980 I wanted to point out two more things. 01:47:51.180 --> 01:47:56.620 Number one, I want you to notice that this xenc here, 01:47:56.820 --> 01:47:58.780 this is made up of one-hot vectors. 01:47:58.980 --> 01:48:03.020 And then those one-hot vectors are multiplied by this W matrix. 01:48:03.220 --> 01:48:05.860 And we think of this as multiple neurons 01:48:06.060 --> 01:48:08.580 being forwarded in a fully connected manner. 01:48:08.780 --> 01:48:11.820 But actually what's happening here is that, for example, 01:48:12.020 --> 01:48:15.700 if you have a one-hot vector here that has a one 01:48:15.700 --> 01:48:19.300 at, say, the fifth dimension, then because of the way the matrix 01:48:19.500 --> 01:48:23.300 multiplication works, multiplying that one-hot vector with W 01:48:23.500 --> 01:48:27.420 actually ends up plucking out the fifth row of W. 01:48:27.620 --> 01:48:31.180 Logits would become just the fifth row of W. 01:48:31.380 --> 01:48:35.580 And that's because of the way the matrix multiplication works. 01:48:36.940 --> 01:48:39.860 So that's actually what ends up happening. 01:48:40.060 --> 01:48:45.660 But that's actually exactly what happened before, because remember, all the way up here, 01:48:45.860 --> 01:48:50.380 we have a bigram, we took the first character, and then that first character 01:48:50.580 --> 01:48:56.620 indexed into a row of this array here, and that row gave us the probability 01:48:56.820 --> 01:49:01.140 distribution for the next character. So the first character was used as a lookup 01:49:01.340 --> 01:49:06.220 into a matrix here to get the probability distribution. 01:49:06.420 --> 01:49:09.300 Well, that's actually exactly what's happening here, because we're taking 01:49:09.500 --> 01:49:13.380 the index, we're encoding it as one-hot and multiplying it by W. 01:49:13.580 --> 01:49:15.300 So logits literally becomes 01:49:15.860 --> 01:49:20.660 the appropriate row of W. 01:49:20.860 --> 01:49:22.660 And that gets, just as before, 01:49:22.860 --> 01:49:27.340 exponentiated to create the counts and then normalized, and becomes probability. 01:49:27.540 --> 01:49:34.900 So this W here is literally the same as this array here. 01:49:35.100 --> 01:49:38.820 But W, remember, is the log counts, not the counts. 01:49:39.020 --> 01:49:45.660 So it's more precise to say that W exponentiated, W dot exp, is this array.
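To make the row-plucking observation above concrete, here is a tiny check, a sketch continuing from the earlier code; the index 5 is the example dimension mentioned above.

    ix = 5
    one_hot = F.one_hot(torch.tensor([ix]), num_classes=27).float()
    via_matmul = one_hot @ W             # (1, 27): the 'forward pass' view of the computation
    via_indexing = W[ix]                 # (27,): just reading out row ix of W
    print(torch.allclose(via_matmul[0], via_indexing))   # True: the matmul merely plucks out a row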
01:49:45.860 --> 01:49:51.860 But this array was filled in by counting and by basically 01:49:52.060 --> 01:49:55.740 populating the counts of bigrams, whereas in the gradient-based framework, 01:49:55.940 --> 01:50:03.060 we initialize it randomly and then we let the loss guide us to arrive at the exact same array. 01:50:03.260 --> 01:50:09.980 So this array exactly here is basically the array W at the end of optimization, 01:50:10.180 --> 01:50:14.860 except we arrived at it piece by piece by following the loss. 01:50:15.020 --> 01:50:17.740 And that's why we also obtain the same loss function at the end. 01:50:17.940 --> 01:50:20.340 And the second note is, if I come here, 01:50:20.540 --> 01:50:25.780 remember the smoothing, where we added fake counts to our counts in order to 01:50:25.980 --> 01:50:30.860 smooth out and make more uniform the distributions of these probabilities. 01:50:31.060 --> 01:50:34.820 And that prevented us from assigning zero probability 01:50:35.020 --> 01:50:36.980 to any one bigram. 01:50:37.180 --> 01:50:42.820 Now, if I increase the count here, what's happening to the probability? 01:50:43.020 --> 01:50:44.820 As I increase the count, 01:50:45.020 --> 01:50:48.180 the probability becomes more and more uniform, right? 01:50:48.380 --> 01:50:51.540 Because these counts go only up to like 900 or whatever. 01:50:51.740 --> 01:50:54.940 So if I'm adding plus a million to every single number here, 01:50:55.140 --> 01:50:59.700 you can see how the row and its probability then, when you divide, is just going to 01:50:59.900 --> 01:51:05.060 become more and more close to exactly even probability, uniform distribution. 01:51:05.260 --> 01:51:10.580 It turns out that the gradient-based framework has an equivalent to smoothing. 01:51:10.780 --> 01:51:12.580 In particular, 01:51:13.180 --> 01:51:14.820 think through these W's here, 01:51:15.020 --> 01:51:17.380 which we initialize randomly. 01:51:17.580 --> 01:51:21.260 We could also think about initializing W's to be zero. 01:51:21.460 --> 01:51:23.980 If all the entries of W are zero, 01:51:24.180 --> 01:51:28.060 then you'll see that logits will become all zero. 01:51:28.260 --> 01:51:31.100 And then exponentiating those logits becomes all one. 01:51:31.300 --> 01:51:34.860 And then the probabilities turn out to be exactly uniform. 01:51:35.060 --> 01:51:39.140 So basically, when W's are all equal to each other, or say, 01:51:39.340 --> 01:51:43.380 especially zero, then the probabilities come out completely uniform. 01:51:43.580 --> 01:51:44.780 So 01:51:44.980 --> 01:51:52.500 trying to incentivize W to be near zero is basically equivalent to label smoothing. 01:51:52.700 --> 01:51:55.180 And the more you incentivize that in a loss function, 01:51:55.380 --> 01:51:58.100 the more smooth distribution you're going to achieve. 01:51:58.300 --> 01:52:01.260 So this brings us to something that's called regularization, 01:52:01.460 --> 01:52:03.860 where we can actually augment the loss 01:52:04.060 --> 01:52:07.780 function to have a small component that we call a regularization loss. 01:52:07.980 --> 01:52:10.980 In particular, what we're going to do is we can take W 01:52:11.180 --> 01:52:13.780 and we can, for example, square all of its entries. 01:52:13.980 --> 01:52:14.780 And then, 01:52:15.060 --> 01:52:18.860 we can, whoops, sorry about that, 01:52:19.060 --> 01:52:22.380 we can take all the entries of W and we can sum them. 01:52:23.580 --> 01:52:28.100 And because we're squaring, there will be no signs anymore.
01:52:28.300 --> 01:52:31.300 Negatives and positives all get squashed to be positive numbers. 01:52:31.500 --> 01:52:37.020 And then the way this works is you achieve zero loss if W is exactly zero. 01:52:37.220 --> 01:52:40.980 But if W has non-zero numbers, you accumulate loss. 01:52:41.180 --> 01:52:44.780 And so we can actually take this and we can add it on here. 01:52:44.980 --> 01:52:51.900 So we can do something like loss plus (W**2).sum(). 01:52:52.100 --> 01:52:53.500 Or let's actually, instead of sum, 01:52:53.700 --> 01:52:57.420 let's take a mean, because otherwise the sum gets too large. 01:52:57.620 --> 01:53:01.220 So mean is like a little bit more manageable. 01:53:01.420 --> 01:53:03.460 And then we have a regularization loss here. 01:53:03.660 --> 01:53:06.420 Let's say 0.01 times, or something like that. 01:53:06.620 --> 01:53:09.220 You can choose the regularization strength 01:53:09.420 --> 01:53:11.980 and then we can just optimize this. 01:53:12.180 --> 01:53:14.860 And now this optimization actually has two components. 01:53:15.060 --> 01:53:17.860 Not only is it trying to make all the probabilities work out, 01:53:18.060 --> 01:53:20.380 but in addition to that, there's an additional component 01:53:20.580 --> 01:53:23.420 that simultaneously tries to make all Ws be zero. 01:53:23.620 --> 01:53:26.020 Because if Ws are non-zero, you feel a loss. 01:53:26.220 --> 01:53:29.980 And so minimizing this, the only way to achieve that is for W to be zero. 01:53:30.180 --> 01:53:34.740 And so you can think of this as adding like a spring force or like a gravity 01:53:34.940 --> 01:53:37.260 force that pushes W to be zero. 01:53:37.460 --> 01:53:40.940 So W wants to be zero and the probabilities want to be uniform, 01:53:41.140 --> 01:53:44.620 but they also simultaneously want to match up your 01:53:44.820 --> 01:53:47.220 probabilities as indicated by the data. 01:53:47.420 --> 01:53:50.460 And so the strength of this regularization 01:53:50.660 --> 01:53:57.020 is exactly controlling the amount of counts that you add here. 01:53:57.220 --> 01:54:02.580 Adding a lot more counts here corresponds to 01:54:02.780 --> 01:54:06.180 increasing this number, because the more you increase it, 01:54:06.380 --> 01:54:09.340 the more this part of the loss function dominates this part. 01:54:09.540 --> 01:54:14.340 And the more these weights will be unable to grow, because as they 01:54:14.620 --> 01:54:18.140 grow, they accumulate way too much loss. 01:54:18.340 --> 01:54:21.060 And so if this is strong enough, 01:54:21.260 --> 01:54:26.620 then we are not able to overcome the force of this loss, 01:54:26.820 --> 01:54:29.260 and basically everything will be uniform predictions. 01:54:29.460 --> 01:54:30.540 So I thought that's kind of cool. 01:54:30.740 --> 01:54:32.980 OK, and lastly, before we wrap up, 01:54:33.180 --> 01:54:36.580 I wanted to show you how you would sample from this neural net model. 01:54:36.780 --> 01:54:43.340 And I copy pasted the sampling code from before, where remember that we sampled five 01:54:43.540 --> 01:54:44.620 times. 01:54:44.820 --> 01:54:46.100 And all we did is we start at zero. 01:54:46.300 --> 01:54:52.220 We grabbed the current ix row of p, and that was our probability row 01:54:52.420 --> 01:54:58.700 from which we sampled the next index, and just accumulated that, and broke when we sampled zero. 01:54:58.900 --> 01:55:03.700 And running this gave us these results.
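As an aside, before the sampling demo continues below, the regularized loss just described might look like this in code, a sketch continuing from the training loop earlier; 0.01 is the example regularization strength mentioned above.

    # data loss (average negative log likelihood) plus a regularization loss that pulls W toward zero
    loss = -probs[torch.arange(num), ys].log().mean() + 0.01 * (W**2).mean()
    # increasing the 0.01 behaves like adding more fake counts in the smoothing discussion above:
    # the regularization term dominates and the predicted distributions become more uniform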
01:55:03.900 --> 01:55:07.380 I still have the p in memory, so this is fine. 01:55:07.580 --> 01:55:11.780 Now, this p doesn't come from a row of P. 01:55:11.980 --> 01:55:14.540 Instead, it comes from this neural net. 01:55:14.820 --> 01:55:22.300 First, we take ix and we encode it into a one-hot row, xenc. 01:55:22.500 --> 01:55:25.020 This xenc multiplies our w, 01:55:25.220 --> 01:55:28.980 which really just plucks out the row of w corresponding to ix. 01:55:29.180 --> 01:55:30.260 Really, that's what's happening. 01:55:30.460 --> 01:55:32.100 And that gets our logits. 01:55:32.300 --> 01:55:34.620 And then we take those logits, 01:55:34.820 --> 01:55:38.820 exponentiate to get counts and then normalize to get the distribution. 01:55:39.020 --> 01:55:41.180 And then we can sample from the distribution. 01:55:41.380 --> 01:55:43.100 So if I run this, 01:55:44.740 --> 01:55:48.420 it's kind of anticlimactic or climactic, depending how you look at it. 01:55:48.620 --> 01:55:51.500 But we get the exact same result. 01:55:51.700 --> 01:55:54.460 And that's because this is the identical model. 01:55:54.660 --> 01:55:59.300 Not only does it achieve the same loss, but as I mentioned, these are identical 01:55:59.500 --> 01:56:03.820 models, and this w is the log counts of what we've estimated before. 01:56:04.020 --> 01:56:06.460 But we came to this answer in a very 01:56:06.460 --> 01:56:09.060 different way, and it's got a very different interpretation. 01:56:09.260 --> 01:56:12.620 But fundamentally, this is basically the same model and gives the same samples here. 01:56:12.820 --> 01:56:14.540 And so 01:56:14.740 --> 01:56:15.500 that's kind of cool. 01:56:15.700 --> 01:56:17.820 OK, so we've actually covered a lot of ground. 01:56:18.020 --> 01:56:21.780 We introduced the bigram character level language model. 01:56:21.980 --> 01:56:26.020 We saw how we can train the model, how we can sample from the model, and how we can 01:56:26.220 --> 01:56:30.020 evaluate the quality of the model using the negative log likelihood loss. 01:56:30.220 --> 01:56:31.620 And then we actually trained the model 01:56:31.820 --> 01:56:35.260 in two completely different ways that actually get the same result and the same 01:56:35.460 --> 01:56:40.300 model. In the first way, we just counted up the frequency of all the bigrams and 01:56:40.500 --> 01:56:44.540 normalized. In the second way, we used the 01:56:44.740 --> 01:56:50.700 negative log likelihood loss as a guide to optimizing the counts matrix 01:56:50.900 --> 01:56:55.660 or the counts array, so that the loss is minimized in a gradient based framework. 01:56:55.860 --> 01:56:58.220 And we saw that both of them give the same result. 01:56:58.420 --> 01:57:00.060 And 01:57:00.460 --> 01:57:01.300 that's it. 01:57:01.500 --> 01:57:04.740 Now, the second one of these, the gradient based framework, is much more flexible. 01:57:04.940 --> 01:57:07.580 And right now, our neural network is super simple. 01:57:07.780 --> 01:57:09.980 We're taking a single previous character 01:57:10.180 --> 01:57:13.740 and we're taking it through a single linear layer to calculate the logits. 01:57:13.860 --> 01:57:15.660 This is about to complexify. 01:57:15.860 --> 01:57:19.260 So in the follow up videos, we're going to be taking more and more of these 01:57:19.460 --> 01:57:22.780 characters and we're going to be feeding them into a neural net. 01:57:22.980 --> 01:57:25.220 But this neural net will still output the exact same thing.
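Tying the sampling walkthrough above together, a minimal sketch; it assumes the trained W from the earlier code and an itos mapping from integer indices back to characters, which is built earlier in the video, and the seed is again only an assumption for reproducibility.

    g = torch.Generator().manual_seed(2147483647)    # assumed seed
    for _ in range(5):
        out = []
        ix = 0                                       # 0 is the '.' start token
        while True:
            xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
            logits = xenc @ W                        # effectively plucks out row ix of W
            counts = logits.exp()
            p = counts / counts.sum(1, keepdim=True)
            ix = torch.multinomial(p, num_samples=1, replacement=True, generator=g).item()
            out.append(itos[ix])                     # itos: index-to-character lookup (assumed from earlier)
            if ix == 0:                              # sampled the '.' end token
                break
        print(''.join(out))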
01:57:25.420 --> 01:57:27.740 The neural net will output logits. 01:57:27.940 --> 01:57:30.620 And these logits will still be normalized in the exact same way. 01:57:30.620 --> 01:57:32.180 And all the loss and everything else 01:57:32.180 --> 01:57:35.220 in the gradient based framework, everything stays identical. 01:57:35.420 --> 01:57:40.260 It's just that this neural net will now complexify all the way to transformers. 01:57:40.460 --> 01:57:43.260 So that's going to be pretty awesome and I'm looking forward to it. 01:57:43.260 --> 01:57:44.300 So for now, bye.