AI Autocomplete

This is a simple implementation of an autocorrect engine in Python, built with TensorFlow and Keras. It is meant to autocorrect or "autocomplete" a word given only a fragment of it.
Here is an example of how it works:

input: hel
output: hello

How it works

When creating this, I made the dataset by hand. I didn't want this to be a fully functional product but rather a demo. The dataset is in CSV format: the X (input) column holds a mistyped or partially typed word, and the Y (output) column holds the completed or corrected word.

The data is preprocessed with a custom preprocessor I wrote, which splits each word into characters and tokenizes them. I did this because traditional NLP tokenization methods, such as tokenizing by word, would not have worked as well. Character-by-character tokenization helps the network learn more complex patterns, which should also improve overall accuracy. Also, misspelled words can be unique, so tokenizing by character helps with the problem of the user entering something that is not in the vocabulary list. For the Y (output) side, I took all the unique labels and assigned each one a number.

Next comes a Sequential model. It starts with the obvious Embedding layer, followed by two LSTM layers, one of which is bidirectional. The bidirectional layer reads the input both front to back and back to front, which helps the network recognize patterns at either end of the word. In simple terms, it gives the network more context for each input. The last layer is a fully connected Dense layer with softmax activation, which classifies the text input as one of the labeled words. The model is trained with the Adam optimizer and the sparse categorical cross-entropy loss, which is commonly used for multiclass classification.

When I run the model, it takes a misspelled word as input from the user and classifies it. The prediction is returned as an array in which each index corresponds to one label and holds the probability of the input belonging to that class. To find the label, we take the index of the highest probability in the array and convert that index back to a label. The run function then returns the label along with the confidence of the prediction.
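The preprocessing and decoding steps described above can be sketched in plain Python. This is only an illustration, not the code from this repository: the toy word pairs, the `PAD`/`UNK` indices, and every function name here are assumptions, and a hard-coded probability list stands in for the model's softmax output.

```python
# Hypothetical toy data: (partial or misspelled input, intended word).
PAIRS = [("hel", "hello"), ("helo", "hello"), ("wor", "world"), ("wrld", "world")]

PAD = 0  # index reserved for padding (assumption)
UNK = 1  # index reserved for characters never seen in training (assumption)


def build_char_vocab(words):
    """Map every character seen in the inputs to an integer id (starting at 2)."""
    chars = sorted({c for w in words for c in w})
    return {c: i + 2 for i, c in enumerate(chars)}


def encode_word(word, vocab, max_len):
    """Split a word into characters, map them to ids, and pad to max_len."""
    ids = [vocab.get(c, UNK) for c in word[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


def build_label_index(labels):
    """Assign each unique output word a class number."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


def decode_prediction(probs, index_to_label):
    """Pick the index with the highest probability and return (label, confidence)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return index_to_label[best], probs[best]


inputs = [x for x, _ in PAIRS]
targets = [y for _, y in PAIRS]

char_vocab = build_char_vocab(inputs)
label_index = build_label_index(targets)
index_to_label = {i: label for label, i in label_index.items()}
max_len = max(len(x) for x in inputs)

X = [encode_word(x, char_vocab, max_len) for x in inputs]  # model inputs
y = [label_index[t] for t in targets]                      # integer class labels

# Stand-in for the softmax output of the network (one probability per label).
fake_probs = [0.9, 0.1]
label, confidence = decode_prediction(fake_probs, index_to_label)
print(label, confidence)
```

Because unseen characters fall back to `UNK` instead of raising an error, an input the dataset never contained can still be encoded, which is the point of tokenizing by character rather than by word.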

V2

Version 2 is in the making.

Improvements

It will be able to take in a sentence whose last word is misspelled and correct that word. Hopefully it can use the rest of the sentence as context and be more accurate at fixing the misspelled word.