Pubmed_Scraper

Program Goal

Allow users to scrape data from PubMed articles (or other databases supported by E-utilities, although the script is designed for PubMed) in an organized and efficient way. Data will be output as several sheets in an Excel document.

How to Run

  • Run main_internal_args.py from the terminal to start the process. Variables that control the results can be edited within main_internal_args.py.

Output File

  • The file that is output will have a filename like {query_name}_{date}_{number_of_results}res.xlsx.

  • The 'Master Table' contains non-flattened data. Most of the columns hold lists, and one holds a dictionary.

    • The dataframe used to build this table is well suited to further work with the data pulled from a run.
  • After that come a number of flattened tables, each relevant to a specific feature.

  • These tables are easier to read, but harder to use for information about an individual article.

  • These features are:

    • Author
    • Keyword
    • Article ID
    • Abstract
    • Pubtype
    • MeSH Keywords
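
The relationship between the master table and the per-feature tables can be sketched with pandas (the data below is made up, and the real script's column names may differ):

```python
import pandas as pd

# Hypothetical master-table rows: one row per article, list-valued columns.
master = pd.DataFrame({
    "pmid": ["111", "222"],
    "authors": [["Smith J", "Doe A"], ["Lee K"]],
})

# Flattening the author list gives one row per (article, author) pair,
# like the per-feature tables described above.
author_table = master.explode("authors").rename(columns={"authors": "author"})
print(author_table)
```

The flattened form is easy to scan for a single feature, but reassembling everything about one article requires joining the feature tables back on the article ID, which is why the master table is kept.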

Program Flow

Scripts run in the following order:

  1. main_internal_args.py
  2. search_ids.py
  3. pubmed_scraper.py

main_internal_args.py

  • This script contains no function calls of its own.
  • User can set run parameters here.
    • query_term
      • Term/phrase that will be used to query PubMed
      • i.e.:
        • translational+AND+microbiome
        • cannabis+AND+(inflammation+OR+nausea)
    • sort_order
      • Sort order to return articles in.
      • Not important when pulling ALL articles that match a term.
      • Important when pulling fewer than all matching articles, because some must be left out.
      • Options:
        • Default value is 'most+recent'
        • Others are 'journal', 'pub+date', 'relevance', 'title', 'author'
    • results
      • Number of results to return from the search
      • No upper limit, but the script will take longer with more results.
      • If the search returns fewer articles than results, those are all the matching articles on PubMed.
    • id_list_filename
      • Name of file with list of IDs to run.
      • Leave blank if using a search query.
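
For intuition, the parameters above map roughly onto an E-utilities ESearch request like the following (this is an illustrative sketch; the script's actual URL construction may differ):

```python
# Run parameters as described above; query terms are already URL-formatted.
query_term = "cannabis+AND+(inflammation+OR+nausea)"
sort_order = "most+recent"
results = 100

# E-utilities ESearch endpoint for PubMed.
base = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
url = f"{base}?db=pubmed&term={query_term}&retmax={results}&sort={sort_order}"
print(url)
```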

search_ids.py

  • This script takes in a list of article IDs or a search query, creates an .XML file with data on the articles found, and returns information to be passed to the next function.

  • Function: get_article_ids()

    • Parameters:
      • query
        • Query to search for.
      • filename
        • Filename of an ID list. May be left blank when using a search query.
      • retmax
        • Number of results to return.
      • sort
        • Sort order to apply to the results.
      • have_ids
        • Boolean telling the script whether to use an ID list. Defaults to False.
      • api_key
        • An API key is not necessary to run, but it will help with large runs.
        • A key increases the access rate from 3 requests/second to 10 requests/second.
    • Returns:
      • A two-element list: [file_name_fetch, query_str]
        • file_name_fetch
          • Name of the "fetch" .XML file made during the run.
        • query_str
          • The query fed into the function at the beginning.
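
The return contract above can be shown with a stand-in (this stub only mirrors the documented interface and return shape; the real get_article_ids() queries PubMed and writes the fetch .XML file, and the naming scheme here is hypothetical):

```python
# Stand-in mirroring get_article_ids()'s documented parameters and its
# two-element return [file_name_fetch, query_str].
def get_article_ids_stub(query, filename="", retmax=100,
                         sort="most+recent", have_ids=False, api_key=""):
    file_name_fetch = f"{query}_fetch.xml"  # hypothetical naming scheme
    return [file_name_fetch, query]

# A caller unpacks the two elements for the next stage of the pipeline.
file_name_fetch, query_str = get_article_ids_stub("translational+AND+microbiome")
print(file_name_fetch)  # translational+AND+microbiome_fetch.xml
```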

pubmed_scraper.py

  • This file takes in the filename of the fetch .XML file generated by search_ids.py, outputs a .xlsx file, and returns a string with run information.

  • Function: pubmed_xml_parse()

    • Parameters:
      • filename
        • Filename of the .XML 'fetch' file created by get_article_ids()
    • Returns:
      • return_string
        • A string built from the filename and the number of results.
        • Will be printed in the console after a successful run.
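
Parsing a fetch-style .XML file can be sketched with the standard library (a minimal example, not the script's actual parsing; the XML below is trimmed to two fields, whereas real PubMed records carry many more):

```python
import xml.etree.ElementTree as ET

# Minimal fetch-style XML, trimmed to the fields this sketch reads.
xml_text = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345678</PMID>
      <Article><ArticleTitle>Example title</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(xml_text)
records = []
for article in root.findall("PubmedArticle"):
    pmid = article.findtext("MedlineCitation/PMID")
    title = article.findtext("MedlineCitation/Article/ArticleTitle")
    records.append((pmid, title))
print(records)  # [('12345678', 'Example title')]
```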

Data

The data folder contains a table of journals' impact factors and a few other statistics for each journal. The data are from 2018 and were downloaded from the InCites Journal Citation Reports. The table is not on GitHub because it was acquired through a school proxy, and the source itself is behind a paywall.
