Deep Text Corrector corrects simple grammatical errors in a sentence.
$ git clone https://github.com/floydhub/dl-text-corrector
$ cd dl-text-corrector
$ floyd init dl-text-corrector
This project uses the Cornell Movie-Dialogs Corpus as the dataset for training and testing. The dataset has already been preprocessed using preprocess_movie_dialogs.py. The data is split into three sets: 80% for training, and 10% each for validation and testing.
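As an illustration of the 80/10/10 split, here is a minimal sketch in shell. It is not the logic of preprocess_movie_dialogs.py: it assumes one sentence per line and splits sequentially without shuffling. The `split_corpus` function and the `_test.txt` file name are hypothetical.

```shell
#!/bin/sh
# Sketch only: sequential 80/10/10 split of a line-oriented corpus.
# Assumption: one example per line; the repo's own preprocessing may
# shuffle or filter the data differently.
split_corpus() {
    src=$1       # input file, one sentence per line
    prefix=$2    # output prefix, e.g. movie_dialog

    total=$(($(wc -l < "$src")))
    n_train=$((total * 80 / 100))
    n_val=$((total * 10 / 100))

    # First 80% of the lines go to the training set.
    head -n "$n_train" "$src" > "${prefix}_train.txt"
    # Next 10% go to the validation set.
    tail -n +"$((n_train + 1))" "$src" | head -n "$n_val" > "${prefix}_val.txt"
    # Everything left over goes to the test set.
    tail -n +"$((n_train + n_val + 1))" "$src" > "${prefix}_test.txt"
}

# Usage (hypothetical input file name):
#   split_corpus all_dialogs.txt movie_dialog
```

Because the test set takes the remainder, any lines lost to integer rounding end up there and are not dropped.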
You can train your model by running the correct_text.py script in this repo on FloydHub with the necessary parameters. By default, the number of steps is 3000. You can change this by setting the num_steps flag.
$ floyd run --gpu --env tensorflow:py2 --data feNGtpH9tZSj79NeqchSEB "python correct_text.py --train_path /input/data/movie_dialog_train.txt --val_path /input/data/movie_dialog_val.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --output_path /output"
This kicks off a new job on Floyd. The job takes about 30 minutes to run and generates the model. You can follow its progress with the logs command.
$ floyd logs <RUN_ID> -t
Now you need the ID of the output generated by your job. The floyd info command gives you that information.
$ floyd info <RUN_ID>
You can evaluate the generated model by running correct_text.py with the decode flag set. Use the output ID from the training step as the data source in this step.
$ floyd run --gpu --env tensorflow:py2 --data <REPLACE_WITH_OUTPUT_ID> "python correct_text.py --train_path /input/data/movie_dialog_train.txt --test_path test.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path /input --decode"
You can track the status of the run with the status or logs command.
$ floyd status <RUN_ID>
$ floyd logs <RUN_ID> -t
After the job finishes successfully, view the output directory to see the output for the sample input sentences. Use the floyd output command for this.
$ floyd output <RUN_ID>
You may notice that the output does not look great. In fact, the model may have added more mistakes to the sentences than it corrected. That is because we ran the training for only a small number of iterations. To train a fully working model, run the training step again with the num_steps flag set to a greater value. In general, about 20000 steps are needed to produce a good corrector model, which takes about 15 hours to run. Alternatively, you can try one of our pre-trained models in the next section.
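A longer training run could be assembled along these lines. This is a sketch: it assumes the flag is passed as `--num_steps <N>`, since the text names the flag but does not show its exact syntax. The `NUM_STEPS` and `TRAIN_CMD` variables are introduced here for illustration only. The snippet prints the command and does not submit it, so you can review it first.

```shell
#!/bin/sh
# Assumption: correct_text.py accepts the step count as --num_steps <N>.
NUM_STEPS=20000

# Same training command as before, with the step count made explicit.
TRAIN_CMD="python correct_text.py --num_steps $NUM_STEPS --train_path /input/data/movie_dialog_train.txt --val_path /input/data/movie_dialog_val.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --output_path /output"

# Print the full invocation for review; remove the leading echo to submit the job.
echo floyd run --gpu --env tensorflow:py2 --data feNGtpH9tZSj79NeqchSEB "\"$TRAIN_CMD\""
```

Keeping the step count in one variable makes it easy to try intermediate values before committing to a multi-hour run.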
If you want to try out a good pre-trained model, we have a publicly available data source for it: feNGtpH9tZSj79NeqchSEB. You can use it for further training, for testing, or with the --decode_sentence flag to play with it locally on your machine. For the latter, download the data above to the input directory and run the following:
$ python correct_text.py --train_path ./input/data/movie_dialog_train.txt --config DefaultMovieDialogConfig --data_reader_type MovieDialogReader --input_path ./input --decode_sentence
To test on a given sample input, run the testing commands under Evaluate your model. You should now find that most sentences are corrected. The few sentences that are not corrected indicate that more training is required.