TextAugment: Improving Short Text Classification through Global Augmentation Methods

You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.

Acknowledgements

Cite this paper when using this library. Arxiv Version

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

Fasttext/Word2vec-based augmentation

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model') # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model') # or pass a loaded gensim model object
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1 # Number of times to augment a sentence. Default is 1.
>>> v = False # Verbose mode: replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # Probability of success of an individual trial (0.1<p<1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5) # or pass a loaded gensim model object
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5) # or pass a loaded gensim model object
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
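
The p parameter above controls how words are picked from a sentence. As a rough illustration of the idea only (not TextAugment's actual code), one way to use a geometric distribution is to draw one sample per word and keep the words whose draw equals 1, meaning the first trial succeeded. The helper names geometric_draw and select_indices are hypothetical.

```python
import random


def geometric_draw(p, rng):
    """Return the number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:  # failure with probability 1 - p
        trials += 1
    return trials


def select_indices(words, p=0.5, rng=None):
    """Keep the index of every word whose geometric draw is 1 (first-trial success)."""
    rng = rng if rng is not None else random.Random()
    return [i for i in range(len(words)) if geometric_draw(p, rng) == 1]


words = "The stories are good".split()
chosen = select_indices(words, p=0.5, rng=random.Random(0))
print([words[i] for i in chosen])  # the words that would be candidates for replacement
```

Under this assumption each word is selected with probability p, so a larger p marks more words for replacement.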

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon, John is walking to town

Advanced example

>>> v = True # Enable verb augmentation. Default is True.
>>> n = False # Enable noun augmentation. Default is False.
>>> runs = 1 # Number of times to augment a sentence. Default is 1.
>>> p = 0.5 # Probability of success of an individual trial (0.1<p<1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon, Joseph is going to town.

RTT-based augmentation (round-trip translation)

Example

>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/D19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
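
The procedure described above can be sketched in a few lines. This is a minimal illustration, assuming a tiny hand-written thesaurus in place of WordNet; STOP_WORDS, TOY_SYNONYMS and replace_synonyms are hypothetical names and are not part of TextAugment.

```python
import random

# Hypothetical toy data; a real setup would use a full stop-word list and WordNet.
STOP_WORDS = {"is", "to", "the", "a"}
TOY_SYNONYMS = {"going": ["heading", "travelling"], "town": ["city", "village"]}


def replace_synonyms(sentence, n=1, rng=None):
    """Replace up to n non-stop-words that have an entry in the toy thesaurus."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    candidates = [
        i for i, w in enumerate(words)
        if w.lower() not in STOP_WORDS and w.lower() in TOY_SYNONYMS
    ]
    # Pick n distinct positions (or fewer if the sentence has fewer candidates).
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(TOY_SYNONYMS[words[i].lower()])
    return " ".join(words)


print(replace_synonyms("John is going to town", n=1, rng=random.Random(0)))
```

Words without a thesaurus entry, such as "John" here, are left untouched.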

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
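
A minimal stand-alone sketch of random deletion follows. The function name delete_words is hypothetical, and the fallback that keeps one word when everything would be deleted is an assumption made here so that the output is never empty.

```python
import random


def delete_words(sentence, p=0.2, rng=None):
    """Drop each word independently with probability p."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence  # nothing sensible to delete
    kept = [w for w in words if rng.random() >= p]
    # Assumption: fall back to one random word so the result is never empty.
    return " ".join(kept) if kept else rng.choice(words)


print(delete_words("John is going to town", p=0.2, rng=random.Random(0)))
```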

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
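
Random swap can be illustrated with the short sketch below. The function name swap_words and its n parameter are hypothetical and only mirror the description above; they are not TextAugment's signature.

```python
import random


def swap_words(sentence, n=1, rng=None):
    """Swap the words at two randomly chosen distinct positions, n times."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence  # a swap needs at least two words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct indices
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(swap_words("John is going to town", n=1, rng=random.Random(0)))
```

The output always contains exactly the same words as the input; only their order changes.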

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
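
The steps above can be sketched as follows, again assuming a small hand-written thesaurus instead of WordNet. STOP_WORDS, TOY_SYNONYMS and insert_synonyms are hypothetical names used only for illustration.

```python
import random

# Hypothetical toy data; a real setup would use a full stop-word list and WordNet.
STOP_WORDS = {"is", "to", "the", "a"}
TOY_SYNONYMS = {"going": ["heading", "travelling"], "town": ["city", "village"]}


def insert_synonyms(sentence, n=1, rng=None):
    """Insert a synonym of a random non-stop-word at a random position, n times."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    for _ in range(n):
        candidates = [
            w for w in words
            if w.lower() not in STOP_WORDS and w.lower() in TOY_SYNONYMS
        ]
        if not candidates:
            break  # no word with a known synonym
        synonym = rng.choice(TOY_SYNONYMS[rng.choice(candidates).lower()])
        # randint is inclusive, so the synonym may also land at the very end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(insert_synonyms("John is going to town", n=1, rng=random.Random(0)))
```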

AEDA: An easier data augmentation technique for text classification

This is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.

https://aclanthology.org/2021.findings-emnlp.234.pdf

Implementation

See this notebook for an example

Random Insertion of Punctuation Marks

Basic example

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
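
As a rough sketch of the technique, the snippet below inserts between one and roughly a third of the sentence length punctuation marks at random positions, which is how the AEDA paper describes it. The function name insert_punctuation and the ratio parameter are hypothetical, and this is not TextAugment's implementation.

```python
import random

# The six marks used in the AEDA paper.
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def insert_punctuation(sentence, ratio=1 / 3, rng=None):
    """Insert 1..ratio*len(words) random punctuation marks at random positions."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if not words:
        return sentence
    max_marks = max(1, int(ratio * len(words)))
    for _ in range(rng.randint(1, max_marks)):
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCTUATION))
    return " ".join(words)


print(insert_punctuation("John is going to town", rng=random.Random(0)))
```

Unlike deletion or swapping, this keeps every original word and the original word order.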

Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, David Lopez-Paz adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in-between training examples.
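
The convex combination can be written out in a few lines. The sketch below mixes one pair of feature vectors and their one-hot labels with a weight drawn from a Beta(alpha, alpha) distribution, as in the original mixup formulation; the function name mixup_pair and the plain-list representation are assumptions made for illustration.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Return the convex combination of two examples and their labels."""
    rng = rng if rng is not None else random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Two toy sentence embeddings with one-hot labels.
x, y, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                       rng=random.Random(0))
print(lam, x, y)
```

Because the same weight is applied to inputs and labels, the mixed label remains a valid probability distribution.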

Implementation

See this notebook for an example

Built with ❤ on

Python

Authors

Licence

MIT licensed. See the bundled LICENCE file for more details.
