Spell-magic is an auto spell-correction tool for Bangla (under development).
This project uses Python 3.11.7 as its development environment.
The DVC bucket URL in Google Cloud Storage is `gs://spell-magic-dvc`.
The bucket is publicly accessible. You simply need to place your `GOOGLE_APPLICATION_CREDENTIALS` file in the root of the project, named `gcp-creds.json`.
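As a quick sanity check that the bucket is reachable, here is a minimal sketch using the `google-cloud-storage` client (an assumption on my part; it is not part of the project's requirements, and DVC itself only needs the credentials file to be present):

```python
# Sketch: verify the public DVC bucket is reachable.
# Assumes `pip install google-cloud-storage` (not in the project's requirements).
from google.cloud import storage

# The bucket is public, so an anonymous client is enough for read access.
client = storage.Client.create_anonymous_client()
for blob in client.list_blobs("spell-magic-dvc", max_results=5):
    print(blob.name)
```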
The following tools are used for development:

- Task (https://taskfile.dev/)
- Docker (https://docs.docker.com/get-docker/)
- Pyright (VS Code extension)
To install with Docker:

```bash
docker build -t spell-magic .
docker run --rm -p 8080:8080 spell-magic
```
You can install the dev requirements by running:

```bash
pip install -r requirements-dev.txt
```
For running the production environment (with Poetry):

```bash
poetry install --no-root
```
Finally, pull the weights:

```bash
dvc pull
```
Now, run:

```bash
uvicorn app.main:app --reload --port 8080
```
**Note 1:** It is recommended to use a separate venv for local installation.
Test with the following curl request:

```bash
curl --location 'localhost:8080/api/v1/correct' \
--header 'Content-Type: application/json' \
--data '{
    "text": "আমি বংলায় গন গাঁই"
}'
```
Expected response:

```json
{
    "corrected": "আমি বাংলায় গান গাই।",
    "delay_in_seconds": 0.1145162582397461
}
```
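Equivalently, the endpoint can be called from Python (a minimal sketch; assumes the `requests` package is installed):

```python
# Sketch: call the correction endpoint from Python instead of curl.
import requests

resp = requests.post(
    "http://localhost:8080/api/v1/correct",
    json={"text": "আমি বংলায় গন গাঁই"},  # misspelled input from the example above
)
print(resp.json()["corrected"])
```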
Here is the project structure:

```
.
├── app/              --> contains the API
├── datasets/         --> contains the datasets
├── model_artifacts/  --> serialized model weights and tokenizers
├── models/           --> empty placeholder
├── notebooks/        --> contains training and evaluation pipelines
├── tests/            --> contains some unit tests
├── utils/            --> contains reusable code snippets
├── Dockerfile
├── gcp-creds.json
├── lefthook.yml
├── Taskfile.yml
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── pyrightconfig.json
├── README.md
└── requirements-dev.txt
```
For this experiment, the Potrika dataset was used to generate the synthetic data. A subset of the entire dataset was used to train the model:
| Domain | Sample Count |
|---|---|
| Economy | 5k |
| Education | 5k |
| Entertainment | 5k |
| International | 5k |
| National | 5k |
| Politics | 5k |
| ScienceTechnology | 5k |
| Sports | 5k |
Although it is hard to simulate human error patterns, an attempt was made to generate some common errors:
| Function | Action |
|---|---|
| `swap_adjacent_characters` | sometimes swaps consecutive characters |
| `insert_random_characters` | randomly inserts characters in a sentence |
| `delete_random_characters` | randomly deletes characters from a sentence |
| `substitute_phonetically_similar_characters` | randomly swaps phonetically similar characters |
| `repeat_characters` | repeats random characters |
| `remove_random_punctuation` | deletes punctuation from random places |
| `delete_random_words` | deletes random words from a sentence |
| `insert_random_spaces` | inserts spaces at random positions |
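To illustrate the idea, here is a minimal sketch of what a function like `swap_adjacent_characters` might look like; the actual implementations live in the project code, and this version (including the `swap_prob` parameter) is only an assumption:

```python
import random

def swap_adjacent_characters(text: str, swap_prob: float = 0.05) -> str:
    """Sometimes swap consecutive characters to mimic typing slips."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if random.random() < swap_prob:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

# Example: corrupt a clean sentence to create a (noisy, clean) training pair.
clean = "আমি বাংলায় গান গাই।"
noisy = swap_adjacent_characters(clean)
```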
The problem of correcting spelling can be tackled in various ways. Some common approaches are statistical models, NER, hand-picked rules, Seq2Seq, etc. In this experiment, a Seq2Seq model (t5-small) was used to correct incorrect sentences. Using a transformer-based model can essentially reduce the need for a lot of granular hand-picked rules. With a sufficient amount of diverse data, it can achieve a notable score in correcting spelling.
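For reference, inference with a fine-tuned t5-small checkpoint typically looks like the sketch below (assuming Hugging Face `transformers`; the checkpoint path and generation settings are assumptions, not the project's actual configuration):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hypothetical local checkpoint path; the real weights are pulled via `dvc pull`.
checkpoint = "model_artifacts"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("আমি বংলায় গন গাঁই", return_tensors="pt", truncation=True, max_length=128)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```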
The following table contains the evaluation scores for the model:

| Metric | Score | Scale |
|---|---|---|
| BLEU | 74.20 | 0-100 |
| WER | 0.1296 | 0-1 |
| CER | 0.0999 | 0-1 |
Note that the model was trained for only 10 epochs on 40k samples from the Potrika dataset.
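These metrics can be reproduced along the following lines (a sketch assuming the `sacrebleu` and `jiwer` packages; the project's notebooks may compute them differently):

```python
import sacrebleu
import jiwer

references = ["আমি বাংলায় গান গাই।"]  # ground-truth sentences
hypotheses = ["আমি বাংলায় গান গাই।"]  # model outputs

# sacrebleu's corpus_bleu expects a list of reference streams; scores are 0-100.
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

# jiwer reports word/character error rates on a 0-1 scale for reasonable outputs.
wer = jiwer.wer(references, hypotheses)
cer = jiwer.cer(references, hypotheses)
print(bleu, wer, cer)
```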
For training- and evaluation-related details, please refer to the `notebooks/` directory.
Here is a Weights & Biases dashboard containing the training logs and evaluation reports.
This project was done as a proof of concept. There are a few ways to improve the model.
- ML
  - Training the model with real-world data, and adding more diverse synthetic data to the dataset, should improve the scores.
  - Training a larger variant of t5 should contribute to the improvement.
  - A pre-trained tokenizer was used instead of a custom-trained one. The goal was to take full advantage of the pre-trained model embeddings. Training the model with a larger dataset along with a customized tokenizer should improve the performance.
  - Right now, the model has a small context length (128 tokens). It can be increased to handle longer contexts.
  - No heuristics are being applied in this project. In the real world, there are a lot of edge cases, and heuristics must be applied in those scenarios.
  - (Unverified) In my opinion, a graph-based model could be used along with the Seq2Seq model to better generalize token-level dependencies.
  - (Unverified) A token classifier could be used to ensure the alignment of the source and target texts.
- Ops
  - Currently, the model is being trained from notebooks, without any DP/DDP pipeline. For larger models, it's better to include multi-node training capabilities.
  - The model is being loaded from `safetensors`. In production, an `onnx`-serialized model should yield more tokens per second (see the sketch after this list).
  - Early stopping and other regularization methods could be added to make the training pipeline robust.
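For the ONNX point above, the export could look roughly like this (a sketch assuming Hugging Face `optimum` with the onnxruntime extras; the paths are hypothetical):

```python
# Sketch: export the fine-tuned checkpoint to ONNX via optimum.
# Assumes `pip install optimum[onnxruntime]`; paths are hypothetical.
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

checkpoint = "model_artifacts"  # hypothetical local checkpoint
ort_model = ORTModelForSeq2SeqLM.from_pretrained(checkpoint, export=True)
ort_model.save_pretrained("model_artifacts_onnx")

# The exported model is a drop-in replacement for generation.
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("আমি বংলায় গন গাঁই", return_tensors="pt")
outputs = ort_model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```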
MIT License