vsm-dictionary-pubmed

Summary

vsm-dictionary-pubmed is an implementation of the 'VsmDictionary' parent-class/interface (from the package vsm-dictionary), that uses NCBI's Programming Utilities (E-utilities) API to interact with Entrez's PubMed MEDLINE database and retrieve bibliographic information for articles from the biomedical literature.

Note that PubMed is actually a search engine that is used to access biomedical literature not only from MEDLINE, but also from other life science journals and online books. So, even though PubMed is not the actual database that holds the literature data, it's commonly referred to as such and that's why we named this vsm-dictionary after it.

Install

Run: npm install

Example use

Node.js

Create a directory test-dir and, inside it, run npm install vsm-dictionary-pubmed. Then create a test.js file containing, for example, the following code:

const DictionaryPubMed = require('vsm-dictionary-pubmed');
const dict = new DictionaryPubMed({ log: true, apiKey: ''});

dict.getEntryMatchesForString('logical modeling', { page: 1, perPage: 10 }, 
  (err, res) => {
    if (err) 
      console.log(JSON.stringify(err, null, 4));
    else
      console.log(JSON.stringify(res, null, 4));
  }
);

Then, run node test.js

Note that without an API key (as in the example above, where apiKey is an empty string; omitting the apiKey property has the same effect), NCBI's Entrez system accepts at most 3 requests/sec. A registered NCBI user can request an API key, which raises this limit to 10 requests/sec (see blog post). This limit matters because the vsm-autocomplete module, which takes a vsm-dictionary as input, calls the getEntryMatchesForString function of the underlying vsm-dictionary whenever someone types in its input field, so fast typing can trigger many requests per second. When the requests exceed the applicable limit, the Entrez servers return an error object (HTTP 429).

To account for this limit, we implemented a rate limiter that accumulates the requests to NCBI's servers in a queue (see the specification of getEntries and getEntryMatchesForString below for the exact request URLs) and sends only one request per 200 ms. That is 5 requests/sec, which stays below the 10 requests/sec limit, so this error should not occur when using a valid API key.
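The queueing idea can be illustrated with a minimal sketch. This is not the package's actual code: the class name RequestQueue and its methods are hypothetical, and only the 200 ms interval comes from the description above.

```javascript
// Hypothetical sketch of a queue-based rate limiter (not the package's code).
class RequestQueue {
  constructor(intervalMs = 200) {
    this.intervalMs = intervalMs; // minimum spacing between two requests
    this.queue = [];              // pending tasks, oldest first
    this.timer = null;            // active setInterval handle, or null when idle
  }

  // `task` is a function that actually performs the request.
  enqueue(task) {
    this.queue.push(task);
    if (this.timer === null) {
      // Idle: send the first request right away, then one per interval.
      this.dispatchNext();
      this.timer = setInterval(() => this.dispatchNext(), this.intervalMs);
    }
  }

  // Send the oldest pending request, or stop the timer if nothing is left.
  dispatchNext() {
    const task = this.queue.shift();
    if (task) {
      task();
    } else {
      clearInterval(this.timer);
      this.timer = null;
    }
  }
}
```

The timer is only stopped on a tick that finds the queue empty, so a request arriving after an idle period is still sent at least one interval after the previous one.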

Browsers

<script src="https://unpkg.com/vsm-dictionary-pubmed@^1.0.0/dist/vsm-dictionary-pubmed.min.js"></script>

after which it is accessible as the global variable VsmDictionaryPubMed.

Tests

Run npm test, which runs the source code tests with Mocha.
If you want to quickly live test the E-utilities API, go to the test directory and run:

node getEntries.test.js
node getEntryMatchesForString.test.js

'Build' configuration

To use a VsmDictionary in Node.js, one can simply run npm install and then use require(). But it is also convenient to have a version of the code that can just be loaded via a <script>-tag in the browser.

Therefore, we included webpack.config.js, which is a Webpack configuration file for generating such a browser-ready package.

By running npm run build, the built file will appear in a 'dist' subfolder. You can use it by including <script src="../dist/vsm-dictionary-pubmed.min.js"></script> in the head of an HTML file.

Specification

Like all VsmDictionary subclass implementations, this package follows the parent class specification. In the next sections we will explain the mapping between the data offered by two of Entrez's E-utilities (esearch and esummary) and the corresponding VSM objects. Find the documentation for the API here: https://dataguide.nlm.nih.gov/eutilities/utilities.html.

Note that in the functions described below, whenever we send a request to NCBI's servers and receive an error response that is not a valid JSON string that we can parse, we formulate the error as a JSON object ourselves in the following format:

{
  status: <number>,
  error: <response> 
}

where the response from the server is JSON stringified.
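As a rough illustration of this fallback, the following sketch wraps an unparseable response body in the format above. The function name toErrorObject and its parameters are hypothetical, not part of the package's API.

```javascript
// Hypothetical helper: normalize an error response from the server.
function toErrorObject(status, responseText) {
  try {
    // The server sent valid JSON: pass the parsed error through unchanged.
    return JSON.parse(responseText);
  } catch (e) {
    // Not parseable as JSON: wrap it ourselves, stringifying the raw response.
    return { status: status, error: JSON.stringify(responseText) };
  }
}
```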

Map PubMed Data to DictInfo VSM object

This specification relates to the function:
getDictInfos(options, cb)

If the options.filter.id is not properly defined or the https://www.ncbi.nlm.nih.gov/pubmed dictID is included in the list of ids used for filtering, getDictInfos returns a static object with the following properties:

Otherwise, an empty result is returned.

Map Esummary to Entry VSM object

This specification relates to the function:
getEntries(options, cb)

First, if options.filter.dictID is properly defined and its list of dictIDs does not include https://www.ncbi.nlm.nih.gov/pubmed, an empty array of entry objects is returned.

If the options.filter.id is properly defined (with IDs like https://www.ncbi.nlm.nih.gov/pubmed/12345) then we use a query like this:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=1,10,20,2&retmode=json&api_key=xyz

For the above URL, we provide a brief description for each sub-part:

  • The first part refers to the E-utilities base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
  • The second part refers to the E-utility used (esummary)
  • The third part refers to the database that we request data from (pubmed)
  • The fourth part is the entry IDs (PubMed-specific ids or PMIDs), comma separated (we extract the last part of the PubMed-specific URI for each ID). Note that for VSM the URI ID is something like: https://www.ncbi.nlm.nih.gov/pubmed/12345.
  • The fifth part defines the format of the returned data (JSON)
  • The last part defines the API key which is the (string) value of the property apiKey given to the DictionaryPubMed constructor.
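The parts listed above can be assembled as in the following sketch. The helper name buildEsummaryUrl is hypothetical; the URL layout follows the example query shown earlier, and omitting api_key when no key is given is an assumption.

```javascript
// Hypothetical helper: build an esummary URL from VSM entry IDs (full URIs).
function buildEsummaryUrl(vsmIds, apiKey) {
  const base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
  // Keep only the last path segment of each URI, i.e. the PMID.
  const pmids = vsmIds.map(id => id.split('/').pop());
  let url = base + 'esummary.fcgi?db=pubmed&id=' + pmids.join(',') + '&retmode=json';
  // Assumption: the api_key part is left out when no key is configured.
  if (apiKey) url += '&api_key=' + encodeURIComponent(apiKey);
  return url;
}
```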

Otherwise (when options.filter.id is not properly defined), an error object is returned, since the API does not support retrieving the information of all PubMed IDs in a paginated fashion:

{ 
  error: 'Not implemented' 
}

When using the E-utilities esummary API, we get back a JSON object with a result property whose value is the object of returned results. Its keys are the PMIDs and its values are objects holding the information for each PMID (the summaries, so to speak). We now provide a mapping from the properties of each PMID's information object to VSM-entry specific properties:

| PMID field | Type | Required | VSM entry/match object property | Notes |
| --- | --- | --- | --- | --- |
| Object.keys(result) | Array | YES | id | The VSM entry id is the full URI, not just the PMID |
| Object.keys(result) | Array | YES | str, terms[i].str | The main term is 'PMID:<PMID>' |
| result[PMID].authors[0].name, result[PMID].source, result[PMID].pubdate, result[PMID].title | Strings | NO | descr | The descr form is: {main author's name} ({Journal} {publication year}), {title} |
| result[PMID].articleids | Array | YES | z.articleIDs | We map the whole array |

Note that the whole point of the above mapping is to have a good enough descr string, so that a user (curator) will be able to distinguish an entry article from the others (the PMID is enough for the computer, but not for humans).
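The mapping in the table could look roughly like the sketch below. The function name mapSummaryToEntry is hypothetical, and two details are assumptions rather than documented behavior: that pubdate starts with the publication year (e.g. '2019 Sep 12'), and that missing optional fields are replaced by empty strings.

```javascript
// Hypothetical sketch of the esummary-to-VSM-entry mapping.
function mapSummaryToEntry(pmid, summary) {
  const dictID = 'https://www.ncbi.nlm.nih.gov/pubmed';
  // The descr fields are optional, so fall back to empty strings (assumption).
  const author = (summary.authors && summary.authors[0] && summary.authors[0].name) || '';
  const journal = summary.source || '';
  // Assumption: pubdate begins with the year, e.g. '2019 Sep 12'.
  const year = (summary.pubdate || '').split(' ')[0];
  const title = summary.title || '';
  return {
    id: dictID + '/' + pmid,           // full URI, not just the PMID
    dictID: dictID,
    descr: `${author} (${journal} ${year}), ${title}`,
    terms: [{ str: 'PMID:' + pmid }],  // the main term
    z: { articleIDs: summary.articleids }
  };
}
```

For instance, an invented summary with author 'Doe J', source 'J Example', pubdate '2019 Sep 12' and title 'An example title.' yields the descr 'Doe J (J Example 2019), An example title.'.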

After mapping the results to VSM objects, we sort them based on the PMID value and then prune them according to the values options.page (default: 1) and options.perPage (default: 50).
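A minimal sketch of this sort-and-prune step follows. The function name sortAndPaginate is hypothetical, and sorting numerically in ascending order is an assumption, since the text above only says the entries are sorted by PMID.

```javascript
// Hypothetical sketch: sort entries by PMID, then keep one page of them.
function sortAndPaginate(entries, page = 1, perPage = 50) {
  // The PMID is the last path segment of the entry's URI id.
  const pmidOf = entry => Number(entry.id.split('/').pop());
  // Assumption: numeric, ascending order. Copy first to avoid mutating the input.
  const sorted = entries.slice().sort((a, b) => pmidOf(a) - pmidOf(b));
  const start = (page - 1) * perPage;
  return sorted.slice(start, start + perPage);
}
```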

Map Esearch to Match VSM object

This specification relates to the function:
getEntryMatchesForString(str, options, cb)

First, if options.filter.dictID is properly defined and its list of dictIDs does not include https://www.ncbi.nlm.nih.gov/pubmed, an empty array of match objects is returned.

Otherwise, we use two URLs: one to get the relevant PMIDs that match the requested string term (using the esearch endpoint) and one like in the getEntries case, to get the article summaries matching the previously-found PMIDs (using the esummary endpoint). An example of these two queries, when searching for logical modeling as str, would be:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=logical%20modeling&retmax=3&retstart=0&sort=most+recent&retmode=json
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=31515732,31407132,31347261&retmode=json

For the second URL, concerning the esummary endpoint, a description of each sub-part was given in the section above. For the first URL, concerning the esearch endpoint, we now provide a brief description of each sub-part:

  • The first part refers to the E-utilities base URL: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
  • The second part refers to the E-utility used (esearch)
  • The third part refers to the database that we request data from (pubmed)
  • The fourth part is the term that we request to search PMIDs for.
  • The retmax and retstart parameters define which and how many results will be included in the returned result. They depend on the options.page and options.perPage options. Default values are 50 and 0 respectively.
  • The sort parameter defines the order of the returned PMIDs. The default value is most recent. Other acceptable values are:
    • journal
    • pub+date
    • relevance
    • title
    • author
    This option can be defined in the constructor:
    const dict = new DictionaryPubMed({ sort: 'relevance' });
  • The last part defines the format of the returned data (JSON)
  • There can also be a part that defines the API key as in the esummary case.
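The esearch URL described above can be assembled as in this sketch. The helper name buildEsearchUrl is hypothetical, and deriving retmax and retstart as perPage and (page - 1) * perPage is an assumption that is consistent with the stated defaults (50 and 0).

```javascript
// Hypothetical helper: build an esearch URL for a search string.
function buildEsearchUrl(str, { page = 1, perPage = 50, sort = 'most recent', apiKey = '' } = {}) {
  const base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
  let url = base + 'esearch.fcgi?db=pubmed'
    + '&term=' + encodeURIComponent(str)
    + '&retmax=' + perPage                  // how many results to return
    + '&retstart=' + (page - 1) * perPage   // index of the first result (assumption)
    + '&sort=' + sort.replace(/ /g, '+')    // 'most recent' becomes 'most+recent'
    + '&retmode=json';
  if (apiKey) url += '&api_key=' + encodeURIComponent(apiKey);
  return url;
}
```

Calling buildEsearchUrl('logical modeling', { page: 1, perPage: 3 }) reproduces the example esearch query shown earlier.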

The first URL returns an object (let's call it res) and we get the PMIDs associated with the searched term str as an array of strings (the value of the res.esearchresult.idlist). We then use the returned PMIDs to fill in the second URL and get back the respective article summaries which we map to VSM-match objects as shown in the table above for the getEntries(options, cb) case.

Note that the most efficient way to get back a specific article is to search using a string str that matches the PMID, the PMC ID or the DOI of that article. For example, any of the following values of str will return one result (the VSM-match object corresponding to the article):

  • 7717779
  • PMID:7717779
  • Pmid: 7717779
  • pmiD: 7717779 (note that the PMID keyword is case-insensitive)
  • PMC1234567
  • 10.1097/00000658-199503000-00007 (not DOI: <doi string>)

License

This project is licensed under the AGPL license - see LICENSE.md.