
torchtext 0.10.0 Release Notes

Released by @parmeet on 15 Jun 16:03 · commit 4da1de3

Highlights

In this release, we introduce a new Vocab module that replaces the current Vocab class. The new Vocab provides common functional APIs for NLP workflows. It is backed by an efficient C++ implementation that reduces look-up time by up to ~85% for batch look-up (refer to the summaries of #1248 and #1290 for benchmark details) and adds support for TorchScript (a scripting sketch follows the API examples below). We also provide accompanying factory functions that build the Vocab object either from a Python ordered dictionary or from an iterator that yields lists of tokens.

creating Vocab from a text file

import io
from torchtext.vocab import build_vocab_from_iterator

# generator that yields lists of tokens
def yield_tokens(file_path):
    with io.open(file_path, encoding='utf-8') as f:
        for line in f:
            yield line.strip().split()

# get Vocab object
vocab_obj = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])

creating Vocab through an ordered dict

from torchtext.vocab import vocab
from collections import Counter, OrderedDict

# count token frequencies and sort tokens by descending frequency
counter = Counter(["a", "a", "b", "b", "b"])
sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
ordered_dict = OrderedDict(sorted_by_freq_tuples)

# get Vocab object
vocab_obj = vocab(ordered_dict)

common API usage

# look up the index of a single token
vocab_obj["a"]

# batch look-up of indices
vocab_obj.lookup_indices(["a", "b"])

# supports the forward API of PyTorch nn Modules
vocab_obj(["a", "b"])

# batch look-up of tokens
vocab_obj.lookup_tokens([0, 1])

# set the default index to return when a token is not found
vocab_obj.set_default_index(0)
vocab_obj["out_of_vocabulary"]  # returns 0

Backward Incompatible Changes

  • We have retired the old Vocab class into the legacy folder (#1289). Users relying on this class can still access it from torchtext.legacy. The Vocab module that replaces it is not backward compatible; most notably, the Vectors object is no longer an attribute of the new Vocab object (see the sketch after this list for looking up pre-trained vectors separately). We recommend using the build_vocab_from_iterator factory function to construct the new Vocab module, which provides initialization capabilities similar to those of the retired Vocab class.
# retired Vocab class
from torchtext.legacy.vocab import Vocab as retired_vocab
from collections import Counter
tokens_list = ["a", "a", "b", "b", "b"]
counter = Counter(tokens_list)
vocab_obj = retired_vocab(counter, specials=["<unk>","<pad>"], specials_first=True)

# new Vocab module
from torchtext.vocab import build_vocab_from_iterator
vocab_obj = build_vocab_from_iterator([tokens_list], specials=["<unk>","<pad>"], specials_first=True)
  • Removed the legacy batch from the torchtext.data package (#1307), which had been kept around for backward compatibility. Users can still access batch from the torchtext.legacy.data package.
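
As noted above, the new Vocab no longer carries a Vectors attribute; pre-trained vectors are loaded and queried separately. A minimal sketch, assuming the pre-trained GloVe "6B" vectors (any torchtext.vocab.Vectors instance works the same way) and the vocab_obj built above:

import torch
from torchtext.vocab import GloVe

# load pre-trained vectors separately from the Vocab object
vectors = GloVe(name='6B', dim=50)

# build an embedding matrix aligned with the vocab's own indices
embedding_matrix = vectors.get_vecs_by_tokens(vocab_obj.get_itos())
embedding = torch.nn.Embedding.from_pretrained(embedding_matrix)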

New Features

  • Introduced a functional to convert iterable-style to map-style datasets (#1299); typical DataLoader usage is sketched after this list
from torchtext.datasets import IMDB
from torchtext.data import to_map_style_dataset
train_iter = IMDB(split='train')
# convert iterator to map-style dataset
train_dataset = to_map_style_dataset(train_iter)
  • Introduced a functional to filter raw Wikipedia XML dumps (#1292)
from torchtext.data.functional import filter_wikipedia_xml
from torchtext.datasets import EnWik9
data_iter = EnWik9(split='train')
# filter data according to https://github.com/facebookresearch/fastText/blob/master/wikifil.pl
filter_data_iter = filter_wikipedia_xml(data_iter)
  • Added datasets for the WMT16 multimodal task (http://www.statmt.org/wmt16/multimodal-task.html#task1)
from torchtext.datasets import Multi30k
train_data, valid_data, test_data = Multi30k()
next(train_data)
# prints the following
#('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.\n',
# 'Two young, White males are outside near many bushes.\n')
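
A map-style dataset supports len() and integer indexing, which is what torch.utils.data.DataLoader needs for shuffling and random sampling. A minimal usage sketch, building on the IMDB conversion above (the identity collate_fn is just an illustrative choice to keep raw (label, text) pairs):

from torch.utils.data import DataLoader

# random access now works
print(len(train_dataset))
print(train_dataset[0])

# shuffling requires a map-style dataset
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                          collate_fn=lambda batch: batch)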

Improvements

  • Separated experimental and legacy tests into separate subfolders (#1285)
  • Stored the md5 hash instead of raw text data for built-in dataset tests (#1261)
  • Cleaned up CircleCI cache handling and optimized the daily cache (#1236, #1238)
  • Fixed a CircleCI caching issue when a new dataset is added (#1314)
  • Organized datasets by name in the root folder and moved common file-reading functions into dataset_utils (#1233)
  • Added a unit test to verify the name property of raw datasets (#1234)
  • Fixed the jinja2 environment autoescape to enable select extensions (#1277)
  • Switched to yaml.safe_load instead of yaml.load (#1278)
  • Added defusedxml to parse untrusted XML data (#1279)
  • Added CodeQL and Bandit security checks as GitHub Actions (#1266)
  • Added benchmark code comparing the Vocab module with a Python dict for batch look-up time (#1290)

Documentation

  • Fixed docs for nn modules (#1267)
  • Stored artifacts of rendered docs so that they can be checked on each PR (#1288)
  • Added Google Analytics support (#1287)

Bug Fixes

  • Fixed an import issue in the text classification example (#1256)
  • Fixed and re-organized the data pipeline example (#1250)

Performance

  • Used c10::string_view and the fastText dictionary inside the C++ kernel of the Vocab module (#1248)