Model Documentation

Trie Model

The trie model was the very first modeling attempt and thus the basis for future iteration in our modeling efforts. Tries are general search tree data structures that are used for efficiently looking up elements of a set. Specifically, in our application of the trie model, these set elements are strings and links between nodes in the underlying tree are defined by individual characters.

To train the Trie model, we start by splitting a file of code into individual tokens. These tokens are then loaded into the Trie by inserting each parsed token character by character. At inference time, we can determine the most likely subsequent character sequences for a new token prefix by walking the trained Trie along the prefix's characters and then performing an iterative depth-first search below that node to collect completions.
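
A minimal sketch of this data structure, assuming a simple character-level design; the class and method names below are illustrative rather than the project's actual API:

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # maps a character to a child TrieNode
        self.is_token_end = False # marks the end of an inserted token

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token):
        """Insert one parsed token character by character."""
        node = self.root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.is_token_end = True

    def complete(self, prefix):
        """Return all trained tokens that extend the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Depth-first search below the prefix node collects completions.
        results, stack = [], [(node, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_token_end:
                results.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return results
```

For example, after `trie.insert("return")`, calling `trie.complete("ret")` would yield `["return"]`.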

This method is an ideal starting point because it allows us to test both general training of the model and the benefits of fine-tuning. To this point, we can perform a general training of the Trie model on a lexicon of reserved Python terms, popular variable names, and references in the current file -- then, we can explore the benefits of fine-tuning by continuing to train the Trie model on an entire user repository. We've found that even with this very naive model, this fine-tuning approach provides useful autocomplete recommendations based on token patterns defined in other files in the repository.

As expected from such a naive first attempt, there are still significant drawbacks. First, the model isn't space-efficient: while inference runs in time proportional to the length of the token prefix, the size of the Trie scales linearly with the amount of training data it has seen. Second, Trie training and inference don't generalize to multi-token predictions. One of the most useful applications of the code autocomplete system is the ability to provide long sequences of scaffolding code, which involves more complicated patterns than the Trie model can handle. These two points are areas of focused improvement in the next iterations of the models.
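
As a rough illustration of the two training phases described above, the sketch below runs a general pass over a small base lexicon and then fine-tunes on every Python file in a user repository. `BASE_LEXICON`, `tokenize_source`, and the reuse of the `Trie` class above are hypothetical stand-ins, not the project's actual code:

```python
import pathlib

BASE_LEXICON = ["def", "return", "import", "self", "range", "print"]

def tokenize_source(text):
    # Naive whitespace tokenizer, used only to keep the sketch short.
    return text.split()

def train_general(trie):
    # Phase 1: general training on reserved terms and popular names.
    for token in BASE_LEXICON:
        trie.insert(token)

def fine_tune_on_repo(trie, repo_path):
    # Phase 2: continue training on every file in the user's repository.
    for path in pathlib.Path(repo_path).rglob("*.py"):
        for token in tokenize_source(path.read_text(errors="ignore")):
            trie.insert(token)
```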

MLE Model

The next model was our MLE implementation. This model splits each file into individual words, which are used to populate a dictionary data structure. The dictionary maps each word bigram (stored as a tuple) to the number of times it appears in the training files. A word bigram is simply a pair of words that appear next to each other in the text. In addition, the model keeps track of each individual word's count and the total number of unique words, which are used later for the maximum likelihood estimation.
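
A minimal sketch of this counting step, assuming a simple whitespace tokenizer; the names below are illustrative and not the project's actual API:

```python
from collections import defaultdict

bigram_counts = defaultdict(int)  # (previous word, next word) -> count
word_counts = defaultdict(int)    # word -> count

def train(files):
    """Populate the bigram and word counts from a list of file contents."""
    for text in files:
        words = text.split()
        for word in words:
            word_counts[word] += 1
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[(prev, nxt)] += 1

def vocab_size():
    # Total number of unique words seen during training.
    return len(word_counts)
```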

After training, the model is ready to predict the next word given a previous word. It uses the location of the cursor in the document to extract the previous word. Once the model has the previous word, it predicts the most likely next word using maximum likelihood estimation (MLE). The probability is calculated with the following Laplace-smoothed equation: (bigram count + 1) / (previous word count + vocabulary size). Each of these counts was generated during the training process and stored in its respective data structure. Laplace smoothing ensures that a bigram count of 0 does not produce a zero probability, and that a previously unseen word does not cause a division error. Using this equation, the model loops over all possible next words and returns the top 3 with the highest probability.
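
The sketch below applies that Laplace-smoothed equation, reusing the hypothetical counts from the training sketch above:

```python
def smoothed_probability(prev_word, next_word):
    # (bigram count + 1) / (previous word count + vocabulary size)
    return (bigram_counts[(prev_word, next_word)] + 1) / (
        word_counts[prev_word] + vocab_size()
    )

def predict_next(prev_word, k=3):
    # Score every word in the vocabulary and return the k most likely.
    scored = [(smoothed_probability(prev_word, w), w) for w in word_counts]
    scored.sort(reverse=True)
    return [word for _, word in scored[:k]]
```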

Combination Model

This model combines the strengths of the Trie and MLE models. It first trains on the input files to populate the Trie and MLE data structures described above. While the user is typing a word, the model suggests autocompletions using the Trie. Then, whenever a space is entered, signaling that a word is finished, the model switches to the MLE implementation: it extracts the previous word the user typed and uses it to predict the next word. As soon as the user begins typing letters again, it switches back to the Trie to provide completions. This back-and-forth switching continues as the user types code.
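
A hypothetical sketch of that switching logic, reusing the `Trie` and `predict_next` sketches from earlier; the `suggest` function is illustrative only:

```python
def suggest(trie, text_before_cursor):
    words = text_before_cursor.split()
    if text_before_cursor.endswith(" "):
        # A word was just finished: use MLE to predict the next word.
        return predict_next(words[-1]) if words else []
    # Still typing a word: complete the current prefix with the Trie.
    prefix = words[-1] if words else ""
    return trie.complete(prefix)[:3]
```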

Transformer Model