Spell-Magic


Description

Spell-Magic is an automatic spell-correction tool for Bangla (under development).

Installation

This project uses Python 3.11.7 as its development environment.

The DVC remote is a Google Cloud Storage bucket: gs://spell-magic-dvc. The bucket is publicly accessible; you simply need to place your GOOGLE_APPLICATION_CREDENTIALS file in the root of the project, named gcp-creds.json.

Tools required

Docker

To install with Docker:

Build

docker build -t spell-magic .

Run

docker run --rm -p 8080:8080 spell-magic

Local

You can install the dev requirements by running:

pip install -r requirements-dev.txt

To set up the production environment (with Poetry), run:

poetry install --no-root

Finally, pull the weights

dvc pull

Now, run

uvicorn app.main:app --reload --port 8080

Note: It is recommended to use a separate venv for local installation.

Usage

Test with the following curl request

curl --location 'localhost:8080/api/v1/correct' \
--header 'Content-Type: application/json' \
--data '{
"text": "আমি বংলায় গন গাঁই"
}'

Expected response:

{
  "corrected": "আমি বাংলায় গান গাই।",
  "delay_in_seconds": 0.1145162582397461
}
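The same request can be made from Python. Below is a minimal client sketch using the requests library (assuming the service is running locally on port 8080, as started above):

```python
import requests

# Send a sentence containing spelling errors to the running service.
response = requests.post(
    "http://localhost:8080/api/v1/correct",
    json={"text": "আমি বংলায় গন গাঁই"},
    timeout=30,
)
response.raise_for_status()
print(response.json()["corrected"])  # expected: "আমি বাংলায় গান গাই।"
```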

Project Structure

Here is the project structure

.
├── app/ --> contains the api
├── datasets/ --> contains the datasets
├── model_artifacts/ --> serialized model weights and tokenizers
├── models/ --> empty placeholder
├── notebooks/ --> contains training and evaluation pipelines
├── tests/ --> contains some unit tests
├── utils/ --> contains reusable code snippets
├── Dockerfile
├── gcp-creds.json
├── lefthook.yml
├── Taskfile.yml
├── poetry.lock
├── poetry.toml
├── pyproject.toml
├── pyrightconfig.json
├── README.md
└── requirements-dev.txt
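For orientation, here is a hedged sketch of what a handler for the /api/v1/correct endpoint could look like, based on the request and response shapes shown above. The actual implementation lives in app/ and may differ; the correct_text helper below is a hypothetical placeholder for the model call.

```python
import time

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class CorrectionRequest(BaseModel):
    text: str


def correct_text(text: str) -> str:
    # Hypothetical placeholder; the real project delegates to the Seq2Seq model.
    return text


@app.post("/api/v1/correct")
def correct(request: CorrectionRequest) -> dict:
    start = time.time()
    corrected = correct_text(request.text)
    return {"corrected": corrected, "delay_in_seconds": time.time() - start}
```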

Dataset

For this experiment, the Potrika dataset was used to generate the synthetic data.

A subset of the entire dataset was used to train the model.

| Domain            | Sample Count |
|-------------------|--------------|
| Economy           | 5k           |
| Education         | 5k           |
| Entertainment     | 5k           |
| International     | 5k           |
| National          | 5k           |
| Politics          | 5k           |
| ScienceTechnology | 5k           |
| Sports            | 5k           |

Although it is hard to simulate human error patterns, an attempt was made to generate some common errors with the following functions (an illustrative sketch of one of them follows the table):

| Function | Action |
|----------|--------|
| swap_adjacent_characters | sometimes swaps consecutive characters |
| insert_random_characters | randomly inserts characters in a sentence |
| delete_random_characters | randomly deletes characters from a sentence |
| substitute_phonetically_similar_characters | randomly swaps phonetically similar characters |
| repeat_characters | repeats random characters |
| remove_random_punctuation | deletes punctuation from random places |
| delete_random_words | deletes random words from a sentence |
| insert_random_spaces | inserts spaces at random positions |
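As an illustration, here is a minimal sketch of what a corruption function like swap_adjacent_characters could look like. The actual implementations (presumably under utils/) may differ, and the swap_probability parameter is an assumption made for this example.

```python
import random


def swap_adjacent_characters(sentence: str, swap_probability: float = 0.05) -> str:
    """Occasionally swap a character with its right-hand neighbour."""
    chars = list(sentence)
    i = 0
    while i < len(chars) - 1:
        if random.random() < swap_probability:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so it is not immediately swapped back
        else:
            i += 1
    return "".join(chars)


print(swap_adjacent_characters("আমি বাংলায় গান গাই।"))
```

Note that this operates on Unicode code points, so a swap can occasionally split a Bangla conjunct or combining mark, which is acceptable (and arguably realistic) for synthetic noise.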

Model

The problem of correcting spelling can be tackled in various ways; common approaches include statistical models, NER, hand-picked rules, Seq2Seq models, etc. In this experiment, a Seq2Seq model (t5-small) was used to correct misspelled sentences. Using a transformer-based model largely removes the need for many granular hand-picked rules, and with a sufficient amount of diverse data it can achieve a notable score in correcting spelling.
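As a minimal inference sketch, assuming the pulled weights and tokenizer under model_artifacts/ are in Hugging Face format (the exact directory layout is an assumption):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed location; adjust to wherever `dvc pull` places the artifacts.
MODEL_DIR = "model_artifacts"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR)

# The model was trained with a 128-token context, so inputs are truncated accordingly.
inputs = tokenizer(
    "আমি বংলায় গন গাঁই", return_tensors="pt", truncation=True, max_length=128
)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```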

Evaluation

The following table contains the evaluation scores for the model:

| Metric | Score  | Scale |
|--------|--------|-------|
| BLEU   | 74.20  | 0-100 |
| WER    | 0.1296 | 0-1   |
| CER    | 0.0999 | 0-1   |

Note that the model was trained for only 10 epochs on 40k samples from the Potrika dataset. For training- and evaluation-related details, please refer to the notebooks directory. There is also a Weights & Biases dashboard containing the training logs and evaluation reports.
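For reference, here is a hedged sketch of how metrics on these scales can be computed with the sacrebleu and jiwer packages (the authoritative evaluation code is in the notebooks; the strings below are placeholders):

```python
import sacrebleu
from jiwer import cer, wer

references = ["আমি বাংলায় গান গাই।"]
predictions = ["আমি বাংলায় গান গাই।"]

bleu = sacrebleu.corpus_bleu(predictions, [references]).score  # 0-100, higher is better
word_error_rate = wer(references, predictions)                 # 0-1, lower is better
char_error_rate = cer(references, predictions)                 # 0-1, lower is better
print(bleu, word_error_rate, char_error_rate)
```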

Improvements

This project was done as a proof of concept. There are a few ways to improve the model.

  • ML
    • Training the model with real-world data, and adding more diverse synthetic data to the dataset, should improve the scores.
    • Training a larger variant of T5 should also contribute to the improvement.
    • A pre-trained tokenizer was used instead of a custom-trained one, in order to take full advantage of the pre-trained model embeddings. Training the model on a larger dataset along with a customized tokenizer should improve the performance.
    • Right now, the model has a small context length (128 tokens). It could be increased to handle longer contexts.
    • No heuristics are applied in this project. In the real world there are a lot of edge cases, and heuristics must be applied in those scenarios.
    • (Unverified) In my opinion, a graph-based model could be used alongside the Seq2Seq model to better generalize token-level dependencies.
    • (Unverified) A token classifier could be used to ensure the alignment of the source and target texts.
  • Ops
    • Currently, the model is trained from notebooks, without any DP/DDP pipeline. For larger models, it is better to include multi-node training capabilities.
    • The model is loaded from safetensors. In production, an ONNX-serialized model should yield more tokens per second (see the sketch after this list).
    • Early stopping and other regularization methods could be added to make the training pipeline more robust.
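As a hedged sketch of the ONNX idea above, using the Hugging Face Optimum library (the package choice, paths, and output directory are assumptions; this step is not part of the current project):

```python
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

MODEL_DIR = "model_artifacts"  # assumed location of the safetensors checkpoint

# Export the Seq2Seq checkpoint to ONNX and save it alongside the tokenizer for serving.
ort_model = ORTModelForSeq2SeqLM.from_pretrained(MODEL_DIR, export=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)

ort_model.save_pretrained("model_artifacts_onnx")
tokenizer.save_pretrained("model_artifacts_onnx")
```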

License

MIT License