Releases: pytorch/text
Torchtext 0.8.1 release notes
Torchtext 0.8.0 release notes
This is a relatively light release while we are working on revamping the library. According to the PyTorch feature classification changes, the new building blocks and datasets in the experimental folder are defined as Prototype and available in the nightly release only. Once the prototype building blocks are mature enough, we will release them together with all the relevant commits in a beta release. In the meantime, users are encouraged to take a look at those building blocks and give us feedback. An easy way to send your feedback is to open an issue in the pytorch/text repo or comment in Issue #664. For details regarding the revamp execution, see Issue #985.
The nightly packages are accessible via Pip and Conda for Windows, Mac, and Linux. For example, Linux users can install the nightly wheels with the following command.
pip install --pre torch torchtext -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html
For more detailed instructions, please refer to Install PyTorch. Note that the new building blocks are still under development and the APIs have not been finalized.
The stable release branch here includes a few feature improvements and documentation updates. Compiled against the PyTorch 1.7.0 release, the stable release packages are available via Pip and Conda for Windows, Linux, and Mac.
Improvements
- Updated the BERT pipeline to improve question-answer task score #950
- Skipped requests.get in the download_from_url function if the path exists #922
- Used Ninja to build extensions and disabled the C++11 ABI when necessary for libtorch compatibility #931
- Removed SentencePiece from setup.py file. SentencePiece source code is now being used as the third-party library in torchtext #1055
- Improved CircleCI settings for better engineering
  - Switched PyTorch binary location for CI unittests #1044
  - Parameterized UPLOAD_CHANNEL #1037
  - Installed binaries for the CI tests directly from the CPU channel #1025, #981
  - Added dataclasses to dependencies for environment.yml #964
  - Bumped Xcode workers to 9.4.1 #951
  - Disabled glove tests due to URL breakage #920
  - Used the specific channel for the CI tests #907
Docs
- Added a test and updated the error message for the load_sp_model function in torchtext.data.functional #984
- Updated the README file in the BERT example #899
- Updated the legacy retirement message #1047
- Updated index page to include links to PyTorch libraries and describe feature classification #1048
- Cleaned up the doc strings #1049
- Fixed clang-format version to what PyTorch uses #1052
- Added OSX environment variables to the README file #1054
- Updated README file for the prototype in the nightly release #1050
Bug Fixes
- Fixed the order of the datasets used in the BERT example #1040
0.7.0: a new dataset abstraction for data processing
Highlights
With the continued progress of PyTorch, some code in torchtext grew out of date with the SOTA PyTorch modules (for example torch.utils.data.DataLoader, torchscript). In the 0.7.0 release, we're taking big steps toward modernizing torchtext and adding warning messages to these legacy components, which will be retired in the October 0.8.0 release. We're also introducing a host of new features, including:
- A generalized MultiheadAttentionContainer for flexible attention behavior (see the sketch after this list)
- Torchscript support for SentencePiece models
- An end-to-end BERT example pipeline, including pretrained weights and a question answering fine-tuning example
- The SQuAD1 and SQuAD2 question answering datasets
- Windows support
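Below is a minimal sketch of the new attention container, assuming the building blocks (InProjContainer, ScaledDotProduct) ship alongside MultiheadAttentionContainer under torchtext.nn; tensors follow the (sequence, batch, embedding) layout.

import torch
from torchtext.nn import MultiheadAttentionContainer, InProjContainer, ScaledDotProduct

embed_dim, num_heads, bsz = 10, 5, 64
# Separate in-projections for query, key, and value.
in_proj = InProjContainer(torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim),
                          torch.nn.Linear(embed_dim, embed_dim))
mha = MultiheadAttentionContainer(num_heads, in_proj,
                                  ScaledDotProduct(),
                                  torch.nn.Linear(embed_dim, embed_dim))
query = torch.rand((21, bsz, embed_dim))
key = value = torch.rand((16, bsz, embed_dim))
attn_output, attn_weights = mha(query, key, value)
print(attn_output.shape)  # torch.Size([21, 64, 10])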
Legacy code and issues
For a period of time (ending around June of 2019), torchtext lacked active maintenance and grew out of date with the present SOTA research and PyTorch features. We’ve committed to bringing the library fully up to date, and identified a few core issues:
- Several components and functionals were unclear and difficult to adopt. For example, the Field class coupled tokenization, vocabularies, splitting, batching and sampling, padding, and numericalization all together, and was opaque and confusing to users. We determined that these components should be divided into separate orthogonal building blocks. For example, it was difficult to use HuggingFace's tokenizers with the Field class (issue #609). Modular pipeline components would allow a third-party tokenizer to be swapped into the pipeline easily.
- torchtext's datasets were incompatible with DataLoader and Sampler in torch.utils.data, or even duplicated that code (e.g. torchtext.data.Iterator, torchtext.data.Batch). Basic inconsistencies confused users. For example, many struggled to fix the data order while using Iterator (issue #828), whereas with DataLoader, users can simply set shuffle=False to fix the data order.
We’ve addressed these issues in this release, and several legacy components are now ready to be retired:
- torchtext.data.Batch (link)
- torchtext.data.Field (link)
- torchtext.data.Iterator (link)
- torchtext.data.Example (link)

In the 0.7.0 release, we add deprecation warnings, and will finally retire them to the torchtext.legacy directory in the 0.8.0 release in October.
New dataset abstraction
Since the 0.4.0 release, we've been working on a new common interface for the torchtext datasets (inheriting from torch.utils.data.Dataset) to address the issues above, and completed it for this release. For standard usage, we've created a map-style dataset which materializes the text iterator. A default dataset processing pipeline, including tokenizer and vocabulary, is added to the map-style datasets to support one-command data loading.
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
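Each item of the map-style dataset is already fully processed. A quick illustration, assuming (as in the experimental text classification datasets) that an item is a (label, token-id tensor) pair and that the fitted vocabulary is exposed via get_vocab():

# label is an integer class id; token_ids is a tensor of vocab indices.
label, token_ids = train[0]
# The vocabulary built during loading can be reused elsewhere.
vocab = train.get_vocab()
print(len(vocab), label, token_ids[:10])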
For those who want more flexibility, the raw text is still available as a torch.utils.data.IterableDataset by simply inserting .raw into the module path as follows.
train, test = torchtext.experimental.datasets.raw.AG_NEWS()
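Iterating over the raw dataset yields unprocessed examples; a small sketch, assuming each example is a (label, raw text string) pair as in the experimental raw text classification datasets:

# The raw dataset streams (label, untokenized text) pairs.
label, line = next(iter(train))
print(label, line[:80])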
Instead of maintaining the Batch and Iterator functionality in torchtext, the new dataset abstraction is fully compatible with torch.utils.data.DataLoader, as shown below. collate_fn is used to process the data batch generated from the DataLoader.
from torch.utils.data import DataLoader

# Each batch is a list of (label, text) examples; gather them into
# separate lists.
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
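In practice, collate_fn is also the natural place to pad variable-length examples into a single tensor. A minimal sketch, assuming each example is already a (label, token-id tensor) pair as produced by the map-style datasets above; torch.nn.utils.rnn.pad_sequence does the padding:

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def pad_collate_fn(batch):
    # Stack the integer labels into a single tensor.
    labels = torch.tensor([label for label, txt in batch])
    # Pad every sequence in the batch to the length of the longest one.
    texts = pad_sequence([txt for label, txt in batch],
                         batch_first=True, padding_value=0)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=pad_collate_fn)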
With the new datasets, we worked together with the OSS community to re-write the legacy datasets in torchtext. Here is a brief summary of the progress:
- Word language modeling datasets (WikiText2, WikiText103, PennTreebank) #661, #774
- Text classification datasets (AG_NEWS, SogouNews, DBpedia, YelpReviewPolarity, YelpReviewFull, YahooAnswers, AmazonReviewPolarity, AmazonReviewFull) #701, #775, #776
- Sentiment analysis dataset (IMDb) #651
- Translation datasets (Multi30k, IWSLT, WMT14) #751, #821, #851
- Question-answer datasets (SQuAD1, SQuAD2) #773
- Sequence tagging datasets (UDPOS, CoNLL2000Chunking) #805
Those new datasets live in the torchtext.experimental.datasets directory. The old versions of the datasets are still available in torchtext.datasets, and the new datasets are opt-in. In the 0.8.0 release, the old datasets will be moved to the torchtext.legacy directory.
To learn how to apply the new dataset abstraction with DataLoader and SOTA PyTorch compatibilities (like Distributed Data Parallel), we created a full example that uses the new torchtext datasets (WikiText103, SQuAD1, etc.) to train a BERT model. A pretrained BERT model is generated from the masked language modeling task and the next sentence prediction task. Then, the model is fine-tuned for the question-answer task. The example is available in the torchtext repo (here).
Backwards Incompatible Changes
- Remove code specific to Python 2 #732
New Features
- Refactor nn.MultiheadAttention as MultiheadAttentionContainer in torchtext #720, #839, #883
- Pre-train BERT pipeline and fine-tune question-answer task #767
- Experimental datasets in torchtext.experimental.datasets (See New Dataset Abstraction section above for the full list) #701, #773, #774, #775, #776, #805, #821, #851
- Add Windows support for torchtext #772, #781, #789, #796, #807, #810, #829
- Add torchscript support to SentencePiece #755, #771, #786, #798, #799
Improvements
- Integrate pytorch-probot into the repo #877
- Switch to the PyTorch TestCase for built-in datasets #822
- Switch experimental ngrams_func to data.utils.ngrams_iterator #813
- Create the root directory automatically in download_from_url if it does not exist #797
- Add shebang line to suppress the lint warning #787
- Switch to CircleCI and improve torchtext CI tests #744, #766, #768, #777, #783, #784, #794, #800, #801, #803, #809, #832, #837, #881, #888
- Put sacremoses tokenizer test back #782
- Update installation directions #763, #764, #769, #795
- Add CCI cache for test data #748
- Disable travis tests except for RUN_FLAKE8 #747
- Disable Travis tests of which equivalent run on CCI #746
- Use 'cpu' instead of None for Iterator #745
- Remove the allow to fail statement in travis test #743
- Add explicit test results to text classification datasets #738
Docs
- Bump nightlies to 0.8.0 #847
- Update README.rst file #735, #817
- Update the labels of docs in text classification datasets #734
Bug Fixes
None
Deprecations
Add deprecation warning to legacy code #863. The following legacy components are ready to be retired:
- torchtext.data.Batch (link)
- torchtext.data.Field (link)
- torchtext.data.Iterator (link)
- torchtext.data.Example (link)
- torchtext.datasets (link)

In the 0.7.0 release, we add deprecation warnings, and will finally retire them to the torchtext.legacy directory in the October 0.8.0 release.
0.6.0: Drop Python2 support for torchtext
Highlights
This release drops the Python2 support from torchtext. Some minor bug fixes and doc updates are included.
We are continuously working on the new dataset abstraction. Users and developers are welcome to send feedback to issue #664. We also want to highlight pull request #701, where the latest dataset abstraction is applied to the text classification datasets.
Backward compatibility
- Unified tar and zip file handling within extract_archive function #692
Docs
- Updated the BLEU example in doc #729
- Updated README file with conda installation #728
- Allowed a maximum line length of 120 in flake8 #719
- Updated CODE_OF_CONDUCT.md file #702
- Removed duplicate docs on torchtext website #697
- Updated README file with a disclaimer for the new dataset abstraction #693
- Updated docs in experimental language modeling dataset #682
Bug Fixes
0.5.0: A new abstraction for torchtext dataset
Highlights
We simplify the current torchtext dataset library by leveraging existing utils (DataLoader, Sampler) in the PyTorch core library, and we separate the tokenizer, vocabulary, and data processing functionals. Users will feel empowered to build data processing pipelines.
[Experimental] New abstraction for torchtext dataset
torchtext v0.5.0 release officially introduces a new abstraction for the datasets. Based on the feedback from users, the new abstraction will solve several issues existing in torchtext, including
- Several components and functionals are unclear and difficult to adopt. For example, the Field class couples tokenizer, vocabulary, split, batching and sampling, padding, and numericalization together. The current Field class works like a "black box", and users are confused about what's going on within the class. Instead, those components should be divided into several basic building blocks. This is more consistent with the PyTorch core library, where users build models and pipelines with orthogonal components.
- Incompatibility with the PyTorch core library, like DataLoader and Sampler in torch.utils.data. Some custom modules/functions in torchtext (e.g. Iterator, Batch, splits) should be replaced by the corresponding modules in torch.utils.data.
We have re-written several datasets in torchtext.experimental.datasets, which use the new abstraction. The old versions of the datasets are still available in torchtext.datasets, and the new datasets are opt-in. We expect to replace the legacy datasets with the experimental ones in the future. Torchtext users are welcome to send feedback to issue [#664].
- Re-write Sentiment Analysis dataset [#651]
  - IMDB
- Re-write Language Modeling datasets [#624, #661], including
  - WikiText2
  - WikiText103
  - PennTreebank
SentencePiece binding
The SentencePiece binding provides an effective way to solve the open vocabulary problem in NLP tasks. The binding now supports two segmentation algorithms, byte-pair encoding (BPE) and the unigram language model. It trains subword models directly from raw text data, which can be used to tokenize a corpus and convert it into PyTorch tensors [#597]
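A minimal sketch of the binding, assuming a local training file corpus.txt (a hypothetical stand-in path); the helpers live in torchtext.data.functional:

from torchtext.data.functional import (
    generate_sp_model, load_sp_model, sentencepiece_numericalizer)

# Train a subword model on raw text (corpus.txt is a stand-in path).
generate_sp_model('corpus.txt', vocab_size=20000, model_prefix='spm_user')
sp_model = load_sp_model('spm_user.model')
# Numericalize raw sentences into lists of subword ids.
sp_id_generator = sentencepiece_numericalizer(sp_model)
print(list(sp_id_generator(["sentencepiece encode as pieces"])))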
Backward compatibility
- Last release with the support of Python 2
- Change the default ngrams value to 1 in text classification datasets [#663]
- Temporarily removed a unit test test_get_tokenizer_moses from the CI tests. It needs to be pushed back after the issue related to the Moses tokenizer is resolved. [#588]
We would like to thank the open source community, who continues to send pull requests for new features and bug-fixes.
New Features
- Add unsupervised learning dataset EnWik9, compressing the first 10^9 bytes of enwiki-20060303-pages-articles.xml [#610]
- Several generators are created to build the pipeline for text preprocessing [#624, #610, #597].
- Add the Bilingual Evaluation Understudy (BLEU) metric for the translation task in torchtext.data.metrics [#627] (see the sketch after this list)
- Add Cross-Lingual NLI Corpus (XNLI) dataset [#613]
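A small sketch of the BLEU metric (example adapted from the torchtext documentation); candidates and references are pre-tokenized:

from torchtext.data.metrics import bleu_score

candidate_corpus = [['My', 'full', 'pytorch', 'test'], ['Another', 'Sentence']]
references_corpus = [[['My', 'full', 'pytorch', 'test'], ['Completely', 'Different']],
                     [['No', 'Match']]]
# Defaults to 4-gram BLEU with uniform weights.
print(bleu_score(candidate_corpus, references_corpus))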
Improvements
- Improve the download_from_url and extract_archive functions (see the sketch after this list). extract_archive now supports .zip files. download_from_url now explicitly gets the filename from the url instead of from the url header. This allows downloading from a non-Google-Drive link [#602]
- Add a legal disclaimer for torchtext datasets [#590]
- Add installation command to Travis [#585]
- Some improvements in the example torchtext/examples/text_classification [#580] [#578] [#576]
- Fix and improve docs [#603] [#598] [#594] [#577] [#662]
- Add Code of Conduct document [#638]
- Add Contributing document [#637]
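A minimal sketch of the improved utilities mentioned above; the archive URL is a hypothetical stand-in:

from torchtext.utils import download_from_url, extract_archive

# The filename is now derived from the url itself, so non-Google-Drive
# links work too ('https://example.com/data.zip' is a stand-in).
archive_path = download_from_url('https://example.com/data.zip')
extracted_files = extract_archive(archive_path)  # .zip archives now supported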
Bug Fixes
- Fix a backward compatibility issue in the Vocab class. The old version of torchtext doesn't have an unk_index attribute in Vocab. To avoid a BC break, the __setstate__ function now checks if there is an unk_index attribute in the vocab object [#591]
- Resolve an overflow error by decreasing the maxInt value, which is used to check csv.field_size_limit in unicode_csv_reader [#584]
0.4.0: Supervised learning datasets and baselines
Highlights
Supervised learning baselines
torchtext 0.4.0 includes several example scripts that showcase how to create data, build vocabularies, train, test and run inference for common supervised learning baselines. We further provide a tutorial to explain these examples in more detail.
For an advanced application of these constructs see the iterable_train.py example.
Community
We would like to thank the open source community, who continues to send pull
requests for new features and bug-fixes.
Major New Features
- New datasets for supervised learning (#557 #565 #580)
  - AG_NEWS
  - SogouNews
  - DBpedia
  - YelpReviewPolarity
  - YelpReviewFull
  - YahooAnswers
  - AmazonReviewPolarity
  - AmazonReviewFull
- Tutorials and examples:
  - Reference examples (#569 #575 #571 #575 #576) to
    - Create/save text classification datasets
    - Train and test a text classification model using one-line dataloading and iterator based Datasets.
    - Set up online inference based on a trained model
  - A tutorial to showcase and illustrate these examples.
New Features
- ngrams_iterator, an iterator that yields ngrams based on a given list or iterator of strings (#567 #577) (see the sketch after this list)
- build_vocab_from_iterator (#567)
- extract_archive (#569)
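A short sketch of the first two helpers; ngrams_iterator lives in torchtext.data.utils and build_vocab_from_iterator in torchtext.vocab:

from torchtext.data.utils import ngrams_iterator
from torchtext.vocab import build_vocab_from_iterator

tokens = ['here', 'we', 'are']
print(list(ngrams_iterator(tokens, ngrams=2)))
# ['here', 'we', 'are', 'here we', 'we are']

# Build a Vocab from an iterator of token lists.
vocab = build_vocab_from_iterator([tokens])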
Improvements
- Added logging to download_from_url (#569)
- Added fast, basic english sentence normalization to get_tokenizer (#569 #568)
- Updated docs theme to pytorch_sphinx_theme (#573)
- Refined Example.fromJSON() to support parsing nested keys for nested JSON datasets (#563)
- Added __len__ & get_vecs_by_tokens in the Vectors class to generate vectors from a list of tokens (#561) (see the sketch after this list)
- Added templates for torchtext users to bring up issues (#553 #574)
- Added a new argument specials in Field.build_vocab to save the user-defined special tokens (#495)
- Added a new argument is_target in the RawField class to show whether the field is a target variable - False by default (#459). Adjusted the is_target argument in LabelField to True to take it into effect (#450)
- Added the option to serialize fields with torch.save or pickle.dump, allowing tokenizers in different languages (#453)
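A brief sketch of get_vecs_by_tokens (example adapted from the torchtext documentation), using the GloVe vectors as the Vectors instance:

from torchtext.vocab import GloVe

vec = GloVe(name='6B', dim=50)  # downloads the vectors on first use
tokens = ['chip', 'baby', 'Beautiful']
# lower_case_backup falls back to the lower-cased token when missing.
ret = vec.get_vecs_by_tokens(tokens, lower_case_backup=True)
print(len(vec), ret.shape)  # __len__ gives the number of vectors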
Bug Fixes
- Allow caching from unverified SSL in CharNGram (#554)
- Fix the wrong unk index by generating the unk_index according to the specials (#531)
- Update Moses tokenizer link in README.rst file (#529)
- Fix the url to load wiki.simple.vec (#525); fix the dead url to load fastText vectors (#521)
- Fix UnicodeDecodeError for loading sequence tagging dataset (#506)
- Fix collisions between oov words and in-vocab words caused by Issue #447 (#482)
- Fix a mistake in the progress bar of the Vectors class (#480)
- Add the dependency on six under 'install_requires' in the setup.py file (PR #475 for Issue #465)
- Fix a bug in the Field class which caused overwriting of the stop_words attribute (PR #458 for Issue #457)
- Transpose the text and target tensors if the text field in BPTTIterator has 'batch_first' set to True (#462)
- Add <unk> to default specials (#567)
Backward Compatibility
- Dropped support for python 2.7.9 (#552)
0.3.1: Quality-of-life improvements and bugfixes
Major changes:
- Added bAbI dataset (#286)
- Added MultiNLI dataset (#326)
- Pytorch 0.4 compatibility + bugfixes (#299, #302)
- Batch iteration now returns a tuple of (inputs), outputs by default without having to index attributes from Batch (#288)
- [BREAKING] Iterator no longer repeats infinitely by default (now stops after the epoch has completed) (#417)
Minor changes:
- Handle moses tokenizer being migrated from nltk (#361)
- Vector loading made more efficient and flexible (#353)
- Allow special tokens to be added to the end of the vocabulary (#400)
- Allow filtering unknown words from examples (#413)
Bugfixes:
- Documentation (#382, #383, #393 #395, #410)
- Create cache dir for pretrained embeddings if it doesn't exist (#301)
- Various typos (#293, #369, #373, #344, #401, #404, #405, #418)
- Dataset.split() not copying sort_key fixed (#279)
- Various python 2.* vs python 3.* issues (#280)
- Fix OOV token vector dimensionality (#308)
- Lowercased type of TabularDataset (#315)
- Fix splits method in various translation datasets (#377, #385, #392, #429)
- Fix ParseTextField postprocessing (#386)
- Fix SubwordVocab (#399)
- Make NestedField GPU compatible and fix frequency saving (#409, #403)
- Allow CSVreader params to be modified by user (#432)
- Use tqdm progressbar in downloads (#425)
v0.2.3
Release notes coming shortly.
0.2.1: Bugfixes and More Datasets
This is a minor release; we have not included any breaking API changes but there are some new features that don't break existing APIs.
We have always intended to support lazy datasets (specifically, those implemented as Python generators) but this version includes a bugfix that makes that support more useful. See a demo of it in action here.
Datasets:
- Added support for sequence tagging (e.g., NER/POS/chunking) datasets and wrapped the Universal Dependencies POS-tagged corpus (#157, thanks @sivareddyg!)
Features:
- Added pad_first keyword argument to Field constructors, allowing left-padding in addition to right-padding (#161, thanks @GregorySenay!)
- Support loading word vectors from local folder (#168, thanks @ahhegazy!)
- Support using list (character tokenization) in ReversibleField (#188)
- Added hooks for Sphinx/RTD documentation (#179, thanks @keon and @EntilZha, whose preliminary version is available at torch-text.readthedocs.io)
- Added support for torchtext.__version__ (#179, thanks @keon!)
Bugfixes:
- Fixed deprecated word vector usage in WT2 dataset (#166, thanks @keon!)
- Fixed bug in word vector loading (#168, thanks @ahhegazy!)
- Fixed bug in word vector aliases (#191, thanks @ryanleary!)
- Fixed side effects of building a vocabulary (#193 + #181, thanks @donglixp!)
- Fixed arithmetic mistake in language modeling dataset length calculation (#182, thanks @jihunchoi!)
- Avoid materializing an otherwise-lazy dataset when using filter_pred (#194)
- Fixed bug in raw float fields (#159)
- Avoid providing a misleading len when using batch_size_fn (#192)
Version 0.2: Reversible tokenization, new word vector API, and more datasets
Breaking changes:
- By default, examples are now sorted within a batch by decreasing sequence length (#95, #139). This is required for use of PyTorch PackedSequences, and it can be flexibly overridden with a Dataset constructor flag.
- The unknown token is now included as part of specials and can be overridden or removed in the Field constructor (part of #107).
New features:
- New word vector API with classes for GloVe and FastText; string descriptors are still accepted for backwards compatibility (#94, #102, #115, #120, thanks @nelson-liu and @bmccann!)
- Reversible tokenization (#107). Introduces a new Field subclass, ReversibleField, with a .reverse method that detokenizes. All implementations of ReversibleField should guarantee that the tokenization+detokenization round-trip is idempotent; torchtext provides wrappers for the revtok tokenizer and subword segmenter that satisfy this property.
- Skip header line in CSV/TSV loading (#146)
- RawFields that represent any data type without processing (#147, thanks @kylegao91!)
New datasets:
- TREC (#92, thanks @bmccann!)
- IMDb (#93, thanks @bmccann!)
- Multi30k (#116, thanks @bmccann!)
- IWSLT (#126, #128, thanks @bmccann!)
- WMT14 (#138)
Bugfixes:
- Fix pretrained word vector loading (#99, thanks @matt-peters!)
- Fix JSON loader silently ignoring requested columns not present in the file (#105, thanks @nelson-liu!)
- Many fixes for Python 2, especially surrounding Unicode (#105, #112, #135, #153 thanks @nelson-liu!)
- Fix Pipeline.call behavior (#113, thanks @nelson-liu!)
- Fix README example (#134, thanks @czhang99!)
- Fix WikiText2 loader (#138)
- Fix typo in MT loader (#142, thanks @sivareddyg!)
- Fix Example.fromlist behavior on non-strings (#145)
- Update test set URL for Multi30k (#149)
- Fix SNLI data loader (#150, thanks @sivareddyg!)
- Fix language modeling iterator (#151)
- Remove transpose as a side effect of Field.reverse (#155)