Version 0.2: Reversible tokenization, new word vector API, and more datasets

@jekbradbury jekbradbury released this 20 Oct 04:52

Breaking changes:

  • By default, examples are now sorted within a batch by decreasing sequence length (#95, #139). This is required for use of PyTorch PackedSequences, and it can be flexibly overridden with a Dataset constructor flag.
  • The unknown token is now included as part of specials and can be overridden or removed in the Field constructor (part of #107).
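As a rough illustration of the new default ordering, the following plain-Python sketch sorts a batch of tokenized examples by decreasing length and pads them to the longest one. It is not torchtext's implementation: the helper name `sort_and_pad` and the `<pad>` token are assumptions made for this example only.

```python
def sort_and_pad(batch, pad_token="<pad>"):
    """Sort tokenized examples by decreasing length, then pad to the longest.

    Hypothetical helper for illustration; not part of torchtext.
    """
    # Longest example first, which is the order packed sequences rely on.
    ordered = sorted(batch, key=len, reverse=True)
    lengths = [len(example) for example in ordered]
    max_len = lengths[0] if lengths else 0
    # Right-pad every example up to the length of the longest one.
    padded = [ex + [pad_token] * (max_len - len(ex)) for ex in ordered]
    return padded, lengths


batch = [["a", "b"], ["c", "d", "e", "f"], ["g"]]
padded, lengths = sort_and_pad(batch)
```

The returned `lengths` list is in decreasing order, which is the shape of input that sequence-packing utilities typically expect alongside the padded batch.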

New features:

  • New word vector API with classes for GloVe and FastText; string descriptors are still accepted for backwards compatibility (#94, #102, #115, #120, thanks @nelson-liu and @bmccann!)
  • Reversible tokenization (#107). Introduces a new Field subclass, ReversibleField, with a .reverse method that detokenizes. All implementations of ReversibleField should guarantee that the tokenization+detokenization round-trip is idempotent; torchtext provides wrappers for the revtok tokenizer and subword segmenter that satisfy this property.
  • Skip header line in CSV/TSV loading (#146)
  • RawFields that represent any data type without processing (#147, thanks @kylegao91!)
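The round-trip property described for reversible tokenization can be sketched with a toy tokenizer that keeps whitespace runs as tokens, so that joining the tokens recovers the input exactly. This is only an assumption-laden illustration of the property, not the algorithm used by revtok or by ReversibleField; the function names `tokenize` and `detokenize` are hypothetical.

```python
import re

# Every character is whitespace, a word character, or other punctuation,
# so these three alternatives together cover the whole input string.
_TOKEN_RE = re.compile(r"\s+|\w+|[^\w\s]")


def tokenize(text):
    """Split text into word, punctuation, and whitespace tokens (toy example)."""
    return _TOKEN_RE.findall(text)


def detokenize(tokens):
    """Invert tokenize() by concatenating the tokens unchanged."""
    return "".join(tokens)


text = "Hello, world!  Bye."
tokens = tokenize(text)
restored = detokenize(tokens)
```

Because whitespace is preserved as explicit tokens, `restored` equals `text`, and tokenizing `restored` again yields the same token list. Real reversible tokenizers encode spacing more compactly, but the guarantee they aim for is the same.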

New datasets:

Bugfixes:

  • Fix pretrained word vector loading (#99, thanks @matt-peters!)
  • Fix JSON loader silently ignoring requested columns not present in the file (#105, thanks @nelson-liu!)
  • Many fixes for Python 2, especially surrounding Unicode (#105, #112, #135, #153, thanks @nelson-liu!)
  • Fix Pipeline.call behavior (#113, thanks @nelson-liu!)
  • Fix README example (#134, thanks @czhang99!)
  • Fix WikiText2 loader (#138)
  • Fix typo in MT loader (#142, thanks @sivareddyg!)
  • Fix Example.fromlist behavior on non-strings (#145)
  • Update test set URL for Multi30k (#149)
  • Fix SNLI data loader (#150, thanks @sivareddyg!)
  • Fix language modeling iterator (#151)
  • Remove transpose as a side effect of Field.reverse (#155)