In this example, we guide you through the end-to-end process of using pmtrendviz
to visualize trends in the biomedical literature.
The first step is to import the data into Elasticsearch, which can be done using the import command.
To download and import the latest 100 files (out of 1166 in total) from the PubMed Baseline, run the following command:
pmtrendviz import -x pubmed -n 100
and confirm the prompt. By default, all data will be imported.
You may increase the number of threads with -t to speed up the extraction process; however, beyond -t 4, the bottleneck will probably be the download speed.
After importing the data, you have two options for obtaining a model for the visualization: train your own model or use one of the pre-trained models.
To list the available trainable and pre-trained models, run the following command:
pmtrendviz list --trainable --pretrained
To train a simple TF-IDF-based model on a random sample of 500k abstracts, run the following command:
pmtrendviz train -m tfidf_truncatedsvd_kmeans -x pubmed -n 500000 -s my-tfidf-model -p uniform
The trained model will be saved to the models directory.
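If you prefer to script these steps, the import and training commands shown above can be assembled and launched from Python. The following is a minimal sketch, not part of pmtrendviz itself: the helper names build_commands and run_all are hypothetical, and only the flags that appear in the commands above are used. It defaults to a dry run that prints the commands without executing them.

```python
import shlex
import subprocess


def build_commands(n_files=100, n_samples=500_000, model_name="my-tfidf-model"):
    """Return argv lists for the import and train steps shown above.

    Hypothetical helper: it only re-assembles the flags from the example
    commands and does not know about any other pmtrendviz options.
    """
    import_cmd = ["pmtrendviz", "import", "-x", "pubmed", "-n", str(n_files)]
    train_cmd = [
        "pmtrendviz", "train",
        "-m", "tfidf_truncatedsvd_kmeans",
        "-x", "pubmed",
        "-n", str(n_samples),
        "-s", model_name,
        "-p", "uniform",
    ]
    return [import_cmd, train_cmd]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # check=True aborts the sequence on the first failing step.
            # stdin is inherited, so the import confirmation prompt still
            # has to be answered in the terminal.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_commands(), dry_run=True)
```

Calling run_all(build_commands(), dry_run=False) would execute both steps in order, assuming pmtrendviz is installed and on your PATH.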
Alternatively, you can install a pre-trained model from Hugging Face, for example by running the following command:
pmtrendviz install tfidf-3m-100
Since the visualization of the trends requires a large number of rather expensive predictions, we need to precompute the predictions with the precompute command:
pmtrendviz precompute -x pubmed -m my-tfidf-model -b 10000
By default, this will precompute the predictions for all documents in the index, but you can also specify a maximum number of new predictions to make with -P or a maximum number of seconds to run with -T. You can also specify the order in which the clusters of the abstracts are predicted with -p, which can be either uniform or forward. The forward option is faster, but will only produce a meaningful visualization if run on the whole dataset, while the uniform option is slower, but will produce a meaningful visualization even if run on a smaller subset of the data.
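To see why the prediction order matters when only part of the index is processed, consider the toy simulation below. It is an illustration under stated assumptions, not pmtrendviz's implementation: we assume that forward walks the index from the start, that uniform draws documents at random from the whole index, and that the index is sorted by publication year. The function names and the numbers are hypothetical.

```python
import random


def pick_documents(n_docs, budget, order, seed=0):
    """Return the indices of the documents that receive a prediction.

    Assumption: "forward" walks the index from the start, while
    "uniform" samples documents at random from the whole index.
    """
    if order == "forward":
        return list(range(min(budget, n_docs)))
    if order == "uniform":
        rng = random.Random(seed)
        return sorted(rng.sample(range(n_docs), min(budget, n_docs)))
    raise ValueError(f"unknown order: {order!r}")


def covered_years(indices, docs_per_year, first_year):
    """Map document indices to years, assuming the index is sorted by year."""
    return sorted({first_year + i // docs_per_year for i in indices})


if __name__ == "__main__":
    # Toy index: 20 hypothetical years with 1,000 documents each.
    n_docs, docs_per_year, first_year = 20_000, 1_000, 2000
    for order in ("forward", "uniform"):
        picked = pick_documents(n_docs, budget=2_000, order=order)
        years = covered_years(picked, docs_per_year, first_year)
        print(order, "covers", len(years), "of 20 years")
```

In this toy setup, a budget of 2,000 predictions in forward order reaches only the first two years, whereas the uniform sample touches every year. That is the sense in which a partial run can still yield a meaningful trend over the full time range.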