This README describes the Question Answering demo application that uses a Squad-tuned BERT model for inference.
On start-up, the demo application reads the command-line parameters and loads a network to the Inference Engine. It also fetches data from the user-provided URL to populate the "context" text. The text is then used to search for answers to user-provided questions.
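The step of turning a downloaded page into "context" text can be sketched as follows. This is a minimal illustration using only the Python standard library, not the demo's actual code: the names ParagraphTextParser, html_to_context and fetch_context are hypothetical, and keeping only the text inside <p> elements is an assumption about what counts as context.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphTextParser(HTMLParser):
    """Collects the text found inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False  # True while between <p> and </p>
        self.chunks = []            # text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self._in_paragraph = False
            self.chunks.append(" ")  # keep paragraphs separated

    def handle_data(self, data):
        if self._in_paragraph:
            self.chunks.append(data)


def html_to_context(html):
    """Return the paragraph text of an HTML string with whitespace normalized."""
    parser = ParagraphTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def fetch_context(url):
    """Download a page and extract its paragraph text (needs network access)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_context(html)
```

Calling html_to_context on an HTML string works offline, while fetch_context needs a reachable URL.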
Running the application with the -h option yields the following usage message:
python3 bert_question_answering_demo.py -h
usage: bert_question_answering_demo.py [-h] -v VOCAB -m MODEL -i INPUT
[--questions QUESTION [QUESTION ...]]
[--input_names INPUT_NAMES]
[--output_names OUTPUT_NAMES]
[--model_squad_ver MODEL_SQUAD_VER]
[-q MAX_QUESTION_TOKEN_NUM]
[-a MAX_ANSWER_TOKEN_NUM] [-d DEVICE]
[-r] [-c]
Options:
-h, --help Show this help message and exit.
-v VOCAB, --vocab VOCAB
Required. Path to the vocabulary file with tokens
-m MODEL, --model MODEL
Required. Path to an .xml file with a trained model
-i INPUT, --input INPUT
Required. URL to a page with context
--questions QUESTION [QUESTION ...]
Optional. Prepared questions
--input_names INPUT_NAMES
Optional. Inputs names for the network.
Default values are "input_ids,attention_mask,token_type_ids"
--output_names OUTPUT_NAMES
Optional. Outputs names for the network.
Default values are "output_s,output_e"
--model_squad_ver MODEL_SQUAD_VER
Optional. SQUAD version used for model fine tuning
-q MAX_QUESTION_TOKEN_NUM, --max_question_token_num MAX_QUESTION_TOKEN_NUM
Optional. Maximum number of tokens in question (used with the reshape option)
-a MAX_ANSWER_TOKEN_NUM, --max_answer_token_num MAX_ANSWER_TOKEN_NUM
Optional. Maximum number of tokens in answer
-d DEVICE, --device DEVICE
Optional. Specify the target device to infer on; CPU
is acceptable. Sample will look for a suitable plugin
for device specified. Default value is CPU
-r, --reshape
Optional. Auto reshape sequence length
to the input context + max question len (to improve the speed)
-c, --colors
Optional. Nice coloring of the questions/answers.
Might not work on some terminals (like Windows* cmd console)
NOTE: Before running the demo with a trained model, make sure to convert the model to the Inference Engine's Intermediate Representation format (*.xml + *.bin) using the Model Optimizer tool. When using the pre-trained BERT from the model zoo (please see Model Downloader), the model is already converted to the IR.
The application reads text from the HTML page at the given URL and then answers questions typed in the console.
The model and its parameters (inputs and outputs) are also important demo arguments.
Note that since the order of the model inputs matters, the demo script checks that the inputs specified
on the command line match the actual network inputs.
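The input check described above could look roughly like the sketch below. It is an illustration only: the function name match_input_names is hypothetical, and comparing the two name sets for equality is an assumption about how the demo performs the check.

```python
def match_input_names(requested, network_inputs):
    """Parse a comma-separated name list and check it against the network inputs.

    Returns the requested names in the order given on the command line.
    Raises ValueError when the two sets of names differ.
    """
    names = [name.strip() for name in requested.split(",") if name.strip()]
    expected = set(network_inputs)
    if set(names) != expected:
        raise ValueError(
            "Requested inputs {} do not match network inputs {}".format(
                sorted(names), sorted(expected)))
    return names
```

Returning the names in command-line order keeps the order the user specified, which is the order that matters when the tensors are fed to the network.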
When the reshape option (-r) is specified, the script also attempts to reshape the network to the
length of the context plus the length of the question (both in tokens), if the resulting value is smaller than the original
sequence length that the network expects. This option saves both inference time and memory.
Since some networks are not reshapable (due to limitations of the internal layers), the reshaping might fail,
in which case you need to run the demo without it.
Please see general reshape intro and limitations.
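The length calculation behind the reshape option can be sketched as below. The function name choose_sequence_length is hypothetical, and the extra special_tokens term (for the [CLS] and two [SEP] tokens that BERT inputs usually carry) is an assumption, not something this README states.

```python
def choose_sequence_length(context_tokens, max_question_tokens,
                           original_length, special_tokens=3):
    """Return a shorter sequence length to reshape to, or None to keep the original.

    special_tokens is an assumed allowance for [CLS] and two [SEP] tokens.
    """
    candidate = context_tokens + max_question_tokens + special_tokens
    # Reshape only when it actually shrinks the network input.
    return candidate if candidate < original_length else None
```

For example, a 100-token context with at most 8 question tokens gives 111 under these assumptions, which is well below a 384-token input, so the network would be reshaped.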
The application prints the answers it finds to the same console.
The Open Model Zoo models feature an example BERT-large trained on the Squad*. One specific flavor is the so-called "distilled" model. It has "small" in its name, but do not be misled: it is still derived from BERT Large, while being substantially smaller and faster.
The demo also works with the official MLPerf* BERT ONNX models fine-tuned on the Squad dataset. Unlike the Open Model Zoo models, which come directly as the Intermediate Representation (IR), the MLPerf models must be explicitly converted with the OpenVINO Model Optimizer. An example command line (for the int8 model) is as follows:
python3 mo.py \
    -m <path_to_model>/bert_large_v1_1_fake_quant.onnx \
    --input "input_ids,attention_mask,token_type_ids" \
    --input_shape "[1,384],[1,384],[1,384]" \
    --keep_shape_ops
You can use the following command to try the demo (assuming the model from the Open Model Zoo, downloaded with the Model Downloader executed with "--name bert*"):
python3 bert_question_answering_demo.py \
    --vocab=<omz_dir>/models/intel/<model_name>/vocab.txt \
    --model=<path_to_model>/bert-small-uncased-whole-word-masking-squad-0001.xml \
    --input_names="input_ids,attention_mask,token_type_ids" \
    --output_names="output_s,output_e" \
    --input="https://en.wikipedia.org/wiki/Bert_(Sesame_Street)" \
    -c
The demo will use the Wikipedia page about the Bert character to answer questions such as "who is Bert", "how old is Bert", etc.
Note that when the original "context" (text from the URL) together with the question does not fit the model input (usually 384 tokens for BERT-Large, or 128 for BERT-Base), the demo splits the context into overlapping segments. Thus, for long texts, the network is called multiple times. The results are then sorted by probability.
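The splitting and ranking described above can be sketched as follows. This is an illustration under stated assumptions rather than the demo's implementation: the names split_into_windows and rank_answers, the fixed overlap parameter, and the (probability, answer_text) tuple layout are all hypothetical.

```python
def split_into_windows(num_context_tokens, window_size, overlap):
    """Return (start, end) token ranges covering the context with overlapping windows."""
    if window_size <= 0 or not 0 <= overlap < window_size:
        raise ValueError("need window_size > 0 and 0 <= overlap < window_size")
    stride = window_size - overlap  # how far each new window advances
    windows = []
    start = 0
    while True:
        end = min(start + window_size, num_context_tokens)
        windows.append((start, end))
        if end >= num_context_tokens:
            break
        start += stride
    return windows


def rank_answers(candidates):
    """Sort (probability, answer_text) pairs from all windows, most probable first."""
    return sorted(candidates, key=lambda item: item[0], reverse=True)
```

Each window would be passed to the network together with the question, and the candidate answers from all windows would then go through rank_answers. The overlap reduces the chance that an answer is cut in half at a window boundary.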
Even though the demo reports inference performance (by measuring wall-clock time for individual inference calls), it is only a baseline, as techniques such as batching or throughput mode can be applied. Please use the full-blown Benchmark C++ Sample for any actual performance measurements.
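Measuring wall-clock time for a single call, as mentioned above, can be done along these lines. The helper name timed_call is hypothetical and the sketch does not reflect how the demo itself collects its timings.

```python
import time


def timed_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds) measured with a wall clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

A single-call timing like this includes one-off overheads and says nothing about throughput, which is why it should be treated as a baseline only.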