Please see NATv2, an improved version of the original NAT framework.
This repository contains the code and the data that can be used to reproduce the experiments described in the "NAT: Noise-Aware Training for Robust Neural Sequence Labeling" paper, which was accepted to be presented at the ACL 2020 conference.
Standard sequence labeling systems are usually trained on clean text. Such systems exhibit substantially lower accuracy when applied to imperfect textual input. NAT aims to improve the robustness of sequence labeling performed on data from noisy sources, such as Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), or misspelled user-generated text. NAT uses a standard sequence labeling architecture but modifies the training objective of the neural model. To this end, it employs two auxiliary training objectives:
- Data augmentation objective, which directly induces noise in the input data and trains the model on a mixture of clean and noisy samples.
- Stability training objective, which encourages similar model outputs for the original and the perturbed input and thereby helps the model to produce a noise-invariant latent representation.
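The two objectives can be illustrated with a minimal, framework-free sketch. It assumes that the auxiliary term is added to the standard loss with a weight `alpha` (the `--alpha` parameter of `main.py`), and that the stability term is a KL divergence between the predicted label distributions for the clean and the noisy input. The function names are hypothetical and this is not the repository's implementation.

```python
import math

def cross_entropy(probs, gold):
    """Negative log-likelihood of the gold label under a predicted distribution."""
    return -math.log(probs[gold])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; assumes q has no zero entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def augmentation_loss(p_clean, p_noisy, gold, alpha=1.0):
    # Standard loss on the clean input plus a weighted standard loss on its noisy copy.
    return cross_entropy(p_clean, gold) + alpha * cross_entropy(p_noisy, gold)

def stability_loss(p_clean, p_noisy, gold, alpha=1.0):
    # Standard loss on the clean input plus a weighted penalty for predictions
    # that diverge between the clean and the noisy input (assumed to be KL here).
    return cross_entropy(p_clean, gold) + alpha * kl_divergence(p_clean, p_noisy)
```

In this sketch the stability term vanishes when the model predicts the same distribution for both inputs, whereas the augmentation term still requires the noisy copy to be labeled correctly.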
NAT was implemented as an extension to the FLAIR library, but it can be integrated into any other sequence labeling framework.
├── flair
├── flair_ext
│ ├── models
│ ├── trainers
│ └── visual
├── resources
│ ├── cmx
│ ├── taggers
│ ├── tasks
│ └── typos
└── robust_ner
The flair directory includes the basic FLAIR framework. See the Quick Start section for information on how to obtain it.
The flair_ext directory contains extensions to the basic FLAIR library:
- An extended sequence labeling model, which implements both NAT objectives.
- A modified trainer class, which performs training using the extended sequence labeling model.
The robust_ner directory comprises the modules that are used for noise induction and spelling correction, as well as the helper functions and classes.
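To give an idea of what character-level noise induction can look like, here is a minimal sketch. It assumes a uniform error model in which each character is perturbed with probability `misspelling_rate` by a random substitution, deletion, or insertion. The function `induce_noise` and its parameters are hypothetical and do not mirror the modules in robust_ner, which can also draw errors from confusion matrices and typos lists.

```python
import random
import string

def induce_noise(tokens, misspelling_rate, rng=None, alphabet=string.ascii_lowercase):
    """Return a noisy copy of `tokens`, perturbing characters independently."""
    rng = rng or random.Random()
    noisy = []
    for token in tokens:
        chars = []
        for ch in token:
            if rng.random() >= misspelling_rate:
                chars.append(ch)  # keep the character unchanged
                continue
            op = rng.choice(("substitute", "delete", "insert"))
            if op == "substitute":
                # May occasionally pick the same character again.
                chars.append(rng.choice(alphabet))
            elif op == "insert":
                chars.append(ch)
                chars.append(rng.choice(alphabet))
            # "delete": append nothing
        # Never emit an empty token, so tokens stay aligned with their labels.
        noisy.append("".join(chars) or token)
    return noisy

sentence = ["germany", "beat", "france", "in", "berlin"]
noisy_sentence = induce_noise(sentence, misspelling_rate=0.1, rng=random.Random(0))
```

Because the number of tokens is preserved, the label sequence of the clean sentence can be reused for its noisy copy.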
The resources directory includes the data files. Confusion matrices are included in the project. See the Quick Start notes for information on how to obtain the typos files and the data sets.
- Clone or download the NAT GitHub repository:
git clone https://github.com/mnamysl/nat-acl2020
- Extract "flair.tgz" to the NAT directory:
tar -xzvf flair.tgz
- To obtain the data sets, please follow the instructions on the websites of the corresponding shared tasks:
- CoNLL 2003: https://www.clips.uantwerpen.be/conll2003/ner/
- GermEval 2014: https://sites.google.com/site/germeval2014ner/data/
- Download typos lists from the following websites:
- Misspelling Oblivious Word Embeddings (MOE): https://github.com/facebookresearch/moe/tree/master/data (moe_misspellings_train.tsv)
- Typos released by Belinkov & Bisk: https://github.com/ybisk/charNMT-noise/tree/master/noise (de.natural and en.natural files).
- Move the downloaded files to the resources/tasks sub-directory.
- Please install all required packages as shown below:
pip install -r requirements.txt
- To use the ELMo embeddings, you need to install the AllenNLP library:
pip install allennlp
- If you plan to use the spell checking functionality, you need to install the packages required to run hunspell:
sudo apt-get install hunspell hunspell-de-de hunspell-en-us libhunspell-dev python-dev
pip install hunspell
You can use the NAT functionality by calling the main.py Python script. The following command-line parameters can be specified (in order of importance; parameters in bold are required):
Parameter | Description | Value |
---|---|---|
--mode | Execution mode | One of: train, tune, eval. |
--corpus | Data set to use | One of: conll03_en (default), conll03_de, germeval. |
--model | Model name | Arbitrary string. |
--train_mode | Training mode | One of: standard (default), augmentation, stability. |
--alpha | Auxiliary loss weight | Floating point number (default: 1.0). |
--misspelling_rate | Noise level | Floating point number (default: 0.0). |
--type | Type of embeddings | One of: flair (default), bert, elmo, word+char. |
--cmx_file | Confusion matrix file | e.g.: tesseract3-RS. |
--typos_file | Typos file | e.g.: en.natural or moe_misspellings_train.tsv. |
--spell_check | Use spell checking | No parameters, turned off by default. |
--lr | Initial learning rate | Floating point number (default: 0.1). |
--train_with_dev | Train with dev. set | No parameters, turned off by default. |
--col_idx | Index of a tag column | Integer value (default: 3). |
--text_idx | Index of the text column | Integer value (default: 0). |
--device | Device to use | Torch device type string (default: cuda). |
--downsample | Downsample rate | Floating point value (default: 1.0). |
--num_hidden | Tagger hidden state size | Integer value (default: 256). |
--max_epochs | Max. training epochs | Integer value (default: 100). |
--batch_size | Mini batch size | Integer value (default: 32). |
--checkpoint | Checkpoint file name | String (default: best-model.pt). |
--no_valid_misspell | No validation with misspellings | No parameters, turned off by default. |
--verbose | Print verbose messages | No parameters, turned off by default. |
-h | Print help | No parameters. |
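For orientation, a subset of the parameters in the table could be declared with `argparse` roughly as follows. This is only a sketch built from the names, choices, and defaults listed above, not the parser in main.py. Treating `--mode` and `--model` as required is an assumption made for illustration.

```python
import argparse

def build_parser():
    """Illustrative parser covering a subset of the documented parameters."""
    parser = argparse.ArgumentParser(description="Sketch of a NAT-style command line")
    # Assumption: --mode and --model are treated as required in this sketch.
    parser.add_argument("--mode", required=True, choices=["train", "tune", "eval"])
    parser.add_argument("--model", required=True)
    parser.add_argument("--corpus", default="conll03_en",
                        choices=["conll03_en", "conll03_de", "germeval"])
    parser.add_argument("--train_mode", default="standard",
                        choices=["standard", "augmentation", "stability"])
    parser.add_argument("--alpha", type=float, default=1.0)
    parser.add_argument("--misspelling_rate", type=float, default=0.0)
    parser.add_argument("--lr", type=float, default=0.1)
    # Flags without a value are off unless given on the command line.
    parser.add_argument("--spell_check", action="store_true")
    return parser

args = build_parser().parse_args(
    ["--mode", "train", "--model", "my_new_model", "--train_mode", "augmentation"]
)
```

With these declarations, omitted options fall back to the defaults from the table, for example `args.corpus` is `conll03_en` and `args.alpha` is `1.0`.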
The following call will start the training of a new model called my_new_model on the English CoNLL 2003 data set using the data augmentation objective with a weight factor of 1.0 and a noise level of 10%.
python3 main.py --mode train --corpus conll03_en --model my_new_model --train_mode augmentation --misspelling_rate 0.1 --alpha 1.0
All your models will be stored in the resources/taggers directory.
The following call will start the fine-tuning process of the previously trained model using the stability objective with different parameters and a lower learning rate:
python3 main.py --mode tune --corpus conll03_en --model my_trained_model --train_mode stability --misspelling_rate 0.05 --alpha 0.5 --lr 0.01
Finally, the prepared model can be evaluated on real OCR errors by using the following call:
python3 main.py --mode eval --corpus conll03_en --model my_trained_model --cmx_file tesseract3-RS
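When several such calls need to be run in sequence, the command lines can be assembled programmatically. The helper below is a hypothetical convenience sketch: `build_command` is not part of the repository, and it only uses parameter names documented in the table above.

```python
def build_command(mode, model, corpus="conll03_en", **options):
    """Assemble an argument list for main.py from keyword options."""
    cmd = ["python3", "main.py", "--mode", mode, "--corpus", corpus, "--model", model]
    for name, value in options.items():
        if value is True:
            cmd.append(f"--{name}")  # flag without a value, e.g. --spell_check
        elif value is not False and value is not None:
            cmd.extend([f"--{name}", str(value)])
    return cmd

# Train with the data augmentation objective, then evaluate on OCR errors.
pipeline = [
    build_command("train", "my_new_model",
                  train_mode="augmentation", misspelling_rate=0.1, alpha=1.0),
    build_command("eval", "my_new_model", cmx_file="tesseract3-RS"),
]
```

Each list in `pipeline` can then be passed to `subprocess.run(cmd, check=True)` from the repository root.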
Please cite our paper when using the code:
@inproceedings{namysl-etal-2020-nat,
title = "{NAT}: Noise-Aware Training for Robust Neural Sequence Labeling",
author = {Namysl, Marcin and Behnke, Sven and K{\"o}hler, Joachim},
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.138",
pages = "1501--1517",
}
This project is licensed under the MIT License. See the LICENSE.md file for details.