English Auto Correct is a simple spell-correction tool based on word frequencies from a large text corpus. The project cleans and prepares the Leipzig English dataset, then uses the resulting word counts to produce auto-correct suggestions.
The process involves:
- Data Cleaning: Removing unwanted characters such as leading numbers and non-word characters.
- Corpus Preparation: Building a frequency dictionary by counting word occurrences.
- Generating Edits: Creating possible variations of the input word (deletions, transpositions, replacements, insertions).
- Searching for the Correct Word: Checking the generated edits against the frequency dictionary and keeping those that are known words.
- Providing Suggestions: Suggesting the most likely correction based on word frequency.
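The first two steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names `clean_line` and `build_frequency` are hypothetical, and the sample lines assume each corpus line starts with a sentence number followed by a tab.

```python
import re
from collections import Counter


def clean_line(line: str) -> str:
    """Strip a leading number and non-letter characters from one corpus line."""
    line = re.sub(r"^\s*\d+\s+", "", line)    # drop the leading number and the whitespace after it
    line = re.sub(r"[^a-zA-Z\s]", " ", line)  # replace anything that is not a letter or whitespace
    return line.lower()


def build_frequency(lines) -> Counter:
    """Count how often each word occurs across the cleaned lines."""
    counts = Counter()
    for line in lines:
        counts.update(clean_line(line).split())
    return counts


# Made-up sample lines (assumed format: number, tab, sentence).
sample = [
    "1\tThe apple fell from the tree.",
    "2\tAn apple a day keeps the doctor away!",
]
freq = build_frequency(sample)
print(freq["the"], freq["apple"])  # 3 2
```

Note that this simple cleaning rule also splits contractions such as "don't" into two tokens; a real pipeline may want to treat apostrophes differently.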
Here’s a simple example of how the auto-correct functionality works:
Input: "appl"
Output: "apple"
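One way to reproduce this example is a single-edit candidate generator in the style of Peter Norvig's well-known spelling corrector. The sketch below is an assumption about how such a step could look, not the project's own implementation; `edits1`, `correct`, and the word counts in `freq` are made up for illustration.

```python
import string


def edits1(word: str) -> set:
    """Return every string that is one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, freq: dict) -> str:
    """Suggest the most frequent known word within one edit of `word`."""
    word = word.lower()
    if word in freq:
        return word  # already a known word
    candidates = [w for w in edits1(word) if w in freq]
    if not candidates:
        return word  # nothing known nearby; return the input unchanged
    return max(candidates, key=freq.get)


# Hypothetical frequency dictionary; real counts would come from the corpus.
freq = {"apple": 120, "apply": 45, "app": 30, "ape": 5}
print(correct("appl", freq))  # apple
```

Here "apple", "apply", and "app" are all one edit away from "appl", and "apple" wins because it has the highest count in this made-up dictionary.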
Here’s an illustration of how English Auto Correct works: