
v0.7.0

@eric-haibin-lin eric-haibin-lin released this 09 Jul 18:36
· 8 commits to v0.7.x since this release

News

Models and Scripts

BERT

  • BERT model pre-trained on the OpenWebText Corpus, BooksCorpus, and English Wikipedia. Test scores on GLUE benchmark tasks and SQuAD 1.1 are reported below. The BERT pre-training script is also easier to use: on-the-fly training data generation, sentencepiece vocabularies, Horovod support, etc. (#799, #687, #806, #669, #665). Thank you @davisliang
| Source | GluonNLP | google-research/bert | google-research/bert |
|---|---|---|---|
| Model | bert_12_768_12 | bert_12_768_12 | bert_24_1024_16 |
| Dataset | openwebtext_book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased | book_corpus_wiki_en_uncased |
| SST-2 | 95.3 | 93.5 | 94.9 |
| RTE | 73.6 | 66.4 | 70.1 |
| QQP | 72.3 | 71.2 | 72.1 |
| SQuAD 1.1 | 91.0/84.4 | 88.5/80.8 | 90.9/84.1 |
| STS-B | 87.5 | 85.8 | 86.5 |
| MNLI-m/mm | 85.3/84.9 | 84.6/83.4 | 86.7/85.9 |
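The on-the-fly training data generation mentioned above masks tokens as batches are sampled rather than in a preprocessing pass. A minimal pure-Python sketch of the standard BERT masking rule (15% of positions; of those, 80% become `[MASK]`, 10% a random token, 10% left unchanged) — this is an illustration of the idea, not the script's actual implementation, and all names here are hypothetical:

```python
import random

def create_masked_lm_inputs(tokens, vocab, rng, mask_prob=0.15):
    """Mask tokens for the BERT masked-LM objective.

    Returns (masked_tokens, positions, labels), where positions are the
    masked indices and labels the original tokens at those positions.
    """
    # Never mask the special tokens.
    candidates = [i for i, t in enumerate(tokens) if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    n_mask = max(1, int(round(len(candidates) * mask_prob)))
    positions = sorted(candidates[:n_mask])
    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])
        r = rng.random()
        if r < 0.8:
            masked[pos] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: random token
        # else 10%: keep the original token
    return masked, positions, labels

rng = random.Random(0)
tokens = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
vocab = ["dog", "ran", "under", "table"]
masked, positions, labels = create_masked_lm_inputs(tokens, vocab, rng)
print(masked, positions, labels)
```

Generating instances lazily this way avoids storing a fixed pre-masked dataset and yields a fresh masking pattern on every epoch.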

GPT-2

ESIM

Data

  • Natural language understanding with datasets from the GLUE benchmark: CoLA, SST-2, MRPC, STS-B, MNLI, QQP, QNLI, WNLI, RTE (#682)
  • Sentiment analysis datasets: CR, MPQA (#663)
  • Intent classification and slot labeling datasets: ATIS and SNIPS (#816)
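The ATIS and SNIPS datasets pair each tokenized utterance with an intent label plus per-token slot labels in BIO format. A small pure-Python sketch of decoding BIO slot tags into typed spans (independent of the GluonNLP API; the function name is illustrative):

```python
def bio_to_spans(tokens, tags):
    """Decode BIO slot tags into (slot_type, (start, end)) pairs.

    tokens: list of str; tags: list of str like "B-fromloc", "I-fromloc", "O".
    Stray I- tags without a matching open span are treated as O.
    """
    spans = []
    start, slot = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if slot is not None:
                spans.append((slot, (start, i)))
            start, slot = i, tag[2:]
        elif tag.startswith("I-") and slot == tag[2:]:
            continue  # extend the current span
        else:
            if slot is not None:
                spans.append((slot, (start, i)))
            start, slot = None, None
    if slot is not None:
        spans.append((slot, (start, len(tags))))
    return spans

tokens = ["flights", "from", "boston", "to", "new", "york"]
tags = ["O", "O", "B-fromloc", "O", "B-toloc", "I-toloc"]
print(bio_to_spans(tokens, tags))
# [('fromloc', (2, 3)), ('toloc', (4, 6))]
```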

New Features

  • [Feature] Support saving model / trainer states to S3 (#700)
  • [Feature] Support loading model / trainer states from S3 (#702)
  • [Feature] Add SentencePieceTokenizer for BERT (#669)
  • [FEATURE] Flexible vocabulary (#732)
  • [API] Moving MaskedSoftmaxCELoss and LabelSmoothing to model API (#754) thanks @ThomasDelteil
  • [Feature] add the List batchify function (#812) thanks @ThomasDelteil
  • [FEATURE] Add LAMB optimizer (#733)
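LAMB scales each layer's Adam-style step by a layer-wise trust ratio ||w|| / ||update||, which is what makes very large batch sizes workable for BERT pre-training. A minimal single-parameter sketch of one LAMB step in pure Python — a simplified illustration under the usual formulation, not GluonNLP's optimizer code:

```python
import math

def lamb_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, wd=0.01):
    """One LAMB update on a single parameter tensor (lists of floats)."""
    # Adam-style first and second moment estimates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Adam direction plus decoupled weight decay.
    update = [mh / (math.sqrt(vh) + eps) + wd * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    # Layer-wise trust ratio: scale the step by ||w|| / ||update||.
    trust = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    w = [wi - lr * trust * ui for wi, ui in zip(w, update)]
    return w, m, v

w, m, v = [0.5, -0.3], [0.0, 0.0], [0.0, 0.0]
w, m, v = lamb_step(w, [0.1, -0.2], m, v, t=1)
print(w)
```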

Bug Fixes

  • [BUGFIX] Fixes for BERT embedding, pretraining scripts (#640) thanks @Deseaus
  • [BUGFIX] Update hash of wiki_cn_cased and wiki_multilingual_cased vocab (#655)
  • [BUGFIX] Fix BERT forward call parameter mismatch (#695) thanks @paperplanet
  • [BUGFIX] Fix mlm_loss reporting for eval dataset (#696)
  • Fix _get_rnn_cell (#648) thanks @MarisaKirisame
  • [BUGFIX] fix mrpc dataset idx (#708)
  • [BUGFIX] Fix hybrid beam search sampler (#710)
  • [BUGFIX] [DOC] Update nlp.model.get_model documentation and get_model API (#734)
  • [BUGFIX] Fix handling of duplicate special tokens in Vocabulary (#749)
  • [BUGFIX] Fix TokenEmbedding serialization with emb[emb.unknown_token] != 0 (#763)
  • [BUGFIX] Fix glue test result serialization (#773)
  • [BUGFIX] Fix init bug for multilevel BiLMEncoder (#783) thanks @Ishitori

API Changes

  • [API] Dropping support for wiki_multilingual and wiki_cn (#764)
  • [API] Remove get_bert_model from the public API list (#767)

Enhancements

  • [FEATURE] Add load_w2v_binary method for loading word2vec binary files (#620)
  • [Script] Add inference function for BERT classification (#639) thanks @TaoLv
  • [SCRIPT] - Add static BERT base export script (for use with MXNet Module API) (#672)
  • [Enhancement] One script to export bert for classification/regression/QA (#705)
  • [Enhancement] Refactor BERT fine-tuning script (#692)
  • [Enhancement] Use only the best model for inference in BERT classification (#716)
  • [Dataset] redistribute conll2004 (#719)
  • [Enhancement] add periodic evaluation for BERT pre-training (#720)
  • [FEATURE] Add XNLI task (#717)
  • [refactor] Refactor BERT script folder (#744)
  • [Enhancement] BERT pre-training data generation from sentencepiece vocab (#743)
  • [REFACTOR] Refactor TokenEmbedding to reduce number of places that initialize internals (#750)
  • [Refactor] Refactor BERT SQuAD inference code (#758)
  • [Enhancement] Fix dtype conversion, add sentencepiece support for SQuAD (#766)
  • [Dataset] Move MRPC dataset to API (#780)
  • [BiDAF-QANet] Common data processing logic for BiDAF and QANet (#739) thanks @Ishitori
  • [DATASET] Add LCQMC and ChnSentiCorp datasets (#774) thanks @paperplanet
  • [Improvement] Implement parser evaluation in Python (#772)
  • [Enhancement] Add whole word masking for BERT (#770) thanks @basicv8vc
  • [Enhancement] Mixed precision support for BERT fine-tuning (#793)
  • Generate BERT training samples in compressed format (#651)
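Whole word masking (#770) changes the masked-LM sampling unit: instead of masking individual WordPieces, a word and all of its `##` continuation pieces are masked together, so the model cannot recover a piece from its neighbors within the same word. A pure-Python sketch of the grouping step — an illustration of the idea, not the GluonNLP implementation:

```python
import random

def whole_word_mask(pieces, rng, mask_prob=0.15):
    """Mask whole words: a word and all of its '##' continuation
    WordPieces are always masked together."""
    words = []  # list of index lists, one per whole word
    for i, p in enumerate(pieces):
        if p.startswith("##") and words:
            words[-1].append(i)  # continuation piece joins the previous word
        else:
            words.append([i])    # start of a new word
    n_mask = max(1, int(round(len(words) * mask_prob)))
    chosen = rng.sample(words, n_mask)
    masked = list(pieces)
    positions = sorted(i for word in chosen for i in word)
    for i in positions:
        masked[i] = "[MASK]"
    return masked, positions

rng = random.Random(0)
pieces = ["the", "phil", "##am", "##mon", "##ic", "played"]
masked, positions = whole_word_mask(pieces, rng)
print(masked, positions)
```

With this input there are three whole words (`the`, `phil ##am ##mon ##ic`, `played`), so a single draw masks either one piece or all four pieces of the middle word at once.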

Minor Fixes

Continuous Integration

  • [CI] Skip failing tests on MXNet master (#685)
  • [CI] update nodes for CI (#686)
  • [CI] CI refactoring to speed up tests (#566)
  • [CI] fix codecov (#693)
  • use fixture for squad dataset tests (#699)
  • [CI] create zipped notebooks for link check (#712)
  • Fix test infrastructure for pytest > 4 and bump CI pytest version (#728)
  • [CI] set root in BERT tests (#738)
  • Fix conftest.py function_scope_seed (#748)
  • [CI] Fix links in contribute.rst (#752)
  • [CI] Update CI dependencies (#756)
  • Revert "[CI] Update CI dependencies (#756)" (#769)
  • [CI] AWS Batch serverless CI Pipeline for parallel notebook execution during website build step (#791)
  • [CI] Don't exit pipeline before displaying AWS Batch logfiles (#801)
  • [CI] Fix for "Don't exit pipeline before displaying AWS Batch logfiles" (#803)
  • Add license checker (#804)
  • Enable timeout (#813)
  • Fix website build on master branch (#819)