The goal of this project is to produce a high-quality dataset of software used in the biomedical literature, to facilitate analysis of the adoption and impact of open-source scientific software. Our overall methodology is the following:
- Extract plain-text software mentions from the PMC-OA collection using an NER machine learning model (developed by Ivana Williams)
- Link the software mentions to repositories and generate metadata by querying a number of databases. We link mentions to: PyPI, Bioconductor, CRAN, SciCrunch and GitHub
- Disambiguate the software mentions
More detailed descriptions of the linking and disambiguation steps can be found below, together with instructions on how to run the code.
- We query the following databases, searching for exact matches for the plain-text software mentions in our dataset:
- PyPI Index: https://pypi.org/simple/
- Bioconductor Index: https://www.bioconductor.org/packages/release/bioc/
- CRAN Index: https://cran.r-project.org/web/packages/available_packages_by_name.html
- GitHub API: https://github.com
- SciCrunch API: https://scicrunch.org/resources
- We normalize the metadata files to a common schema.
Metadata files are normalized to the following fields:
| Field | Description |
|---|---|
| ID | unique ID of software mention (generated by us) |
| software_mention | plain-text software mention |
| mapped_to | value the software_mention is being mapped to |
| source | source of the mapping, e.g. Bioconductor Index, GitHub API |
| platform | platform of the software_mention, e.g. PyPI, CRAN |
| package_url | URL linking the software_mention to its source |
| description | description of the software_mention |
| homepage_url | homepage URL of the software_mention |
| other_urls | other related URLs |
| license | software license |
| github_repo | GitHub repository |
| github_repo_license | GitHub repository license |
| exact_match | whether or not this mapping was an exact match |
| RRID | RRID for the software_mention |
| reference | journal articles linked to the software_mention (identified through DOI, PMID or RRID) |
| scicrunch_synonyms | synonyms for the software_mention, retrieved from SciCrunch |
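As an illustration of how a raw record maps onto this common schema, here is a minimal sketch; the raw record contents and the `normalize_record` helper are hypothetical, not the project's actual code:

```python
# Target schema fields (from the table above).
SCHEMA_FIELDS = [
    "ID", "software_mention", "mapped_to", "source", "platform",
    "package_url", "description", "homepage_url", "other_urls",
    "license", "github_repo", "github_repo_license", "exact_match",
    "RRID", "reference", "scicrunch_synonyms",
]

def normalize_record(raw: dict, source: str, platform: str) -> dict:
    """Map a raw metadata record onto the common schema;
    fields missing from the raw record default to None."""
    row = dict.fromkeys(SCHEMA_FIELDS)
    row.update({k: v for k, v in raw.items() if k in SCHEMA_FIELDS})
    row["source"] = source
    row["platform"] = platform
    return row

row = normalize_record(
    {"software_mention": "numpy", "mapped_to": "numpy", "exact_match": True},
    source="PyPI Index", platform="PyPI",
)
```

Every normalized file then has the same columns, which is what makes the later aggregation step a simple concatenation.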
All the scripts for linking are under the `linker` folder. Here are the instructions for running the code from scratch.
- This step will create the `data` folder structure:

```
python initialize.py
```
- Download the input data from the Dryad link. Add the input software mentions file (e.g. `comm_IDs.tsv`) into the `data/input_files` folder. Do not unzip the file; the scripts assume a `.gz` extension.
This step will assign IDs to the software mentions in the input file. It will also generate a `mention2ID.pkl` file containing mappings from each mention to an ID. It can generate this file from scratch or update an already existing `mention2ID` file.

Generate `mention2ID` from scratch:

```
python assign_IDs.py --input-file <your_input_file> --mention2ID-file <your_output_file_for_mention2ID>
```

Example: `python assign_IDs.py --input-file comm.tsv.gz --output-file comm_IDs.tsv.gz`

Note: the script assumes that the input file is under `data/input_files`.

Update an already existing `mention2ID` file:

```
python assign_IDs.py --input-file <your_input_file> --mention2ID-file <your_existing_file_for_mention2ID> --mention2ID-updated_file <your_updated_file_for_mention2ID>
```
At the end of this step, you should have:
- a `mention2ID.pkl` file under `data/intermediate_files`
- a `comm_IDs.tsv` file under `data/input_files`
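The ID-assignment step can be sketched as follows; this is a simplified stand-in for `assign_IDs.py`, and the `SM…` ID format is illustrative, not necessarily the format the real script uses:

```python
def assign_ids(mentions, mention2ID=None):
    """Assign a stable ID to each distinct mention, optionally
    extending an existing mention -> ID mapping (the update mode)."""
    mention2ID = dict(mention2ID or {})
    next_id = len(mention2ID)
    for mention in mentions:
        if mention not in mention2ID:
            mention2ID[mention] = f"SM{next_id}"
            next_id += 1
    return mention2ID

mention2ID = assign_ids(["SPSS", "ImageJ", "SPSS"])
# An update run reuses existing IDs and only adds new mentions.
updated = assign_ids(["ImageJ", "BLAST"], mention2ID)
```

Updating rather than regenerating keeps IDs stable across dataset versions, which is why the script supports both modes.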
This step will filter `comm_IDs.tsv.gz` to exclude mentions that are marked as not-software by our expert bio-curation team. The curated list of terms to be excluded is under `data/curation/curation_top10k_mentions_binary_labels.csv`.
```
python filter_curated_terms.py --input-file <your_input_file> --output-curated-dataset <your_output_file>
```

Example: `python filter_curated_terms.py --input-file comm_IDs.tsv.gz --output-curated-dataset comm_curated.tsv.gz`
At the end of this step, you should have:
- a `comm_curated.tsv.gz` file under `data/input_files`: `comm_IDs` with mentions marked as non-software filtered out
- a `comm_with_labels.tsv.gz` file under `data/input_files`: `comm_IDs` with an additional field containing the software mention label ('software', 'not-software', 'unclear', or 'not curated' to mark that the mention has not been curated)
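Conceptually, this step is a label lookup followed by a filter; the sketch below assumes hypothetical column names (`mention`, `label`) for the curation file:

```python
import csv, io

# Hypothetical curation file contents: mention -> binary label.
curation_csv = "mention,label\nSPSS,software\nexcel,not-software\n"
labels = {r["mention"]: r["label"]
          for r in csv.DictReader(io.StringIO(curation_csv))}

mentions = ["SPSS", "excel", "ImageJ"]
# Attach a label; mentions absent from the curated list are 'not curated'.
with_labels = [(m, labels.get(m, "not curated")) for m in mentions]
# The curated dataset keeps everything except confirmed not-software.
curated = [m for m, label in with_labels if label != "not-software"]
```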
This step will link software mentions to the PyPI, CRAN, Bioconductor, SciCrunch and GitHub repositories.
Here are the instructions if you would like to compute the metadata files from scratch:
1. Generate keys for accessing the APIs

You will need a number of access keys. You can get them for free from https://libraries.io, https://github.com, and https://sparc.science. Then create a file named `keys` with the following content:

```
export GITHUB_USER=...
export GITHUB_TOKEN=...
export SCICRUNCH_TOKEN=...
```

Source this file with `. keys` or `source keys`.
2. Generate Metadata files
Generate metadata files from scratch.
Each of the commands below queries the specific database for linking and generating metadata for the software mentions.
There are a number of command-line parameters that can be tuned, more info in the scripts themselves.
Metadata files are normalized to a common schema and saved under `data/metadata_files/normalized`. Raw versions are also saved under `data/metadata_files/raw`.
Note that these scripts can take a long time to run, especially given the large number of mentions in the dataset. In particular, GitHub API requests are subject to a per-minute rate limit. We recommend parallelizing or using distributed computing; we used a Spark environment to speed up the process.
```
python bioconductor_linker.py --input-file comm_IDs.tsv.gz --generate-new
python cran_linker.py --input-file comm_IDs.tsv.gz --generate-new
python pypi_linker.py --input-file comm_IDs.tsv.gz --generate-new
python github_linker.py --input-file comm_IDs.tsv.gz --generate-new
python scicrunch_linker.py --input-file comm_IDs.tsv.gz --generate-new
```
Sanity-checking

To make sure that everything works the way it should, you can sanity-check the code by running something like:

```
python bioconductor_linker.py --input-file comm_IDs.tsv --top-k 40
```

This will only try to link the first 40 mentions and should take a fairly short time (minutes). You can do this for any of the metadata linking scripts.
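At its core, each linker performs an exact-match lookup against a package index. The sketch below illustrates the idea against a stand-in index set; the real scripts fetch the live indexes (e.g. https://pypi.org/simple/) and record much richer metadata:

```python
# Stand-in for the parsed package list from https://pypi.org/simple/ .
pypi_index = {"numpy", "scipy", "pandas"}

def link_mention(mention: str, index: set, platform: str):
    """Return a minimal linking record if the lowercased mention is an
    exact match against the package index, else None."""
    name = mention.lower()
    if name not in index:
        return None
    return {"software_mention": mention,
            "mapped_to": name,
            "platform": platform,
            "exact_match": True,
            "package_url": f"https://pypi.org/project/{name}/"}

hit = link_mention("NumPy", pypi_index, "PyPI")
miss = link_mention("SPSS", pypi_index, "PyPI")
```

The lowercasing here is an assumption for illustration; each real linker applies whatever normalization its target index requires.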
At the end of this step, you should have:
- raw metadata files saved under the `data/metadata_files/raw` directory:
  - `pypi_raw_df.csv`
  - `cran_raw_df.csv`
  - `bioconductor_raw_df.csv`
  - `scicrunch_raw_df.csv`
  - `github_raw_df.csv`
- normalized files (to a common schema) saved under the `data/metadata_files/normalized` directory:
  - `pypi_df.csv`
  - `cran_df.csv`
  - `bioconductor_df.csv`
  - `scicrunch_df.csv`
  - `github_df.csv`
Once the individual metadata files are computed, aggregate them together by running:

```
python generate_metadata_file.py
```

This step also does some post-processing of the individual metadata files.
At the end of this step, you should have:
- `metadata.csv` saved under the `data/output_files/` directory
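Because the per-platform files share the common schema, the aggregation is conceptually a row concatenation. A stdlib sketch with hypothetical file contents (the real script also post-processes the merged rows):

```python
import csv, io

# Two hypothetical normalized files sharing the common schema
# (columns truncated for brevity).
pypi_csv = "ID,software_mention,platform\nSM0,numpy,PyPI\n"
cran_csv = "ID,software_mention,platform\nSM1,ggplot2,CRAN\n"

def aggregate(csv_texts):
    """Concatenate rows from normalized metadata files into one table."""
    rows = []
    for text in csv_texts:
        rows.extend(csv.DictReader(io.StringIO(text)))
    return rows

metadata = aggregate([pypi_csv, cran_csv])
```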
We evaluate the linking algorithm using an expert team of biomedical curators. We ask them to label 50 generated software-link pairs as one of: correct, incorrect, or unclear.
The evaluation file is available as `evaluation_linking.csv` and the script to compute the metrics is `evaluation_linking.py`.

To get the evaluation metrics, run the evaluation script inside the `evaluation` folder:

```
python evaluation_linking.py --linking-evaluation-file ../data/curation/evaluation_linking.csv
```
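The headline metric reduces to the label distribution over the curated pairs; a sketch with made-up judgments (the evaluation file's actual column names may differ):

```python
from collections import Counter

# Hypothetical curator judgments for generated (mention, link) pairs.
judgments = ["correct", "correct", "incorrect", "unclear", "correct"]
counts = Counter(judgments)
frac_correct = counts["correct"] / sum(counts.values())
```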
For the disambiguation task, we use the following methodology:
1. Synonym Generation: we generate synonyms for mentions in our corpus through:
- Keywords-based synonym generation
- Scicrunch synonyms retrieval
- String similarity (Jaro-Winkler) algorithms
2. Similarity matrix generation: based on synonyms generated in the previous step, we build a similarity matrix
3. Cluster Generation
- We retrieve the connected components through the similarity matrix
- For each connected component, we compute its distance matrix based on its similarity matrix
- We cluster a number of connected components by feeding the corresponding distance matrices into the DBSCAN algorithm
- We assign each cluster's name to the mention with the highest frequency in our corpus.
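The component-and-naming part of the steps above can be sketched in pure Python. Here connected components are extracted from synonym pairs with a BFS and each cluster is named after its most frequent mention; the real pipeline additionally builds similarity/distance matrices and runs DBSCAN on them:

```python
from collections import defaultdict

def cluster_mentions(synonym_pairs, freq_dict):
    """Group mentions into connected components over synonym edges,
    then name each component after its most frequent member."""
    graph = defaultdict(set)
    for a, b in synonym_pairs:
        graph[a].add(b)
        graph[b].add(a)
    seen, clusters = set(), {}
    for start in graph:
        if start in seen:
            continue
        # BFS over one connected component.
        component, queue = [], [start]
        seen.add(start)
        while queue:
            node = queue.pop()
            component.append(node)
            for nxt in graph[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        name = max(component, key=lambda m: freq_dict.get(m, 0))
        clusters[name] = sorted(component)
    return clusters

pairs = [("ImageJ", "imagej"), ("imagej", "Image J"), ("SPSS", "spss")]
freqs = {"ImageJ": 900, "imagej": 40, "Image J": 10, "SPSS": 300, "spss": 5}
clusters = cluster_mentions(pairs, freqs)
```

DBSCAN refines this picture: a connected component can chain together dissimilar mentions through intermediate synonyms, and density-based clustering splits such chains apart.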
All the scripts for this step are under the `linker` folder. Here are the instructions for running the code from scratch.
- Follow the steps under Linking Setup if you haven't already. In particular, the steps in this section require that you generate or retrieve `mention2ID.pkl` if you haven't already done so in a previous step.
- Generate a frequency dictionary `freq_dict.pkl` containing mappings from {synonym : frequency} by running:

```
python generate_freq_dict.py --input-file ../data/input_files/comm_IDs.tsv --output-file ../data/intermediate_files/freq_dict.pkl
```
This file will later be used in clustering.

At the end of this step, you should have:
- `mention2ID.pkl`
- `freq_dict.pkl`
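The frequency dictionary is essentially a mention count over the input file, pickled for later use; a minimal sketch standing in for `generate_freq_dict.py`:

```python
import pickle
from collections import Counter

# Hypothetical mention column extracted from the input file.
mentions = ["SPSS", "ImageJ", "SPSS", "SPSS", "imagej"]
freq_dict = dict(Counter(mentions))

# Persist for the clustering step, mirroring freq_dict.pkl.
payload = pickle.dumps(freq_dict)
restored = pickle.loads(payload)
```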
Generate synonym files by running:

```
python generate_synonyms_keywords.py
```

This step assumes that the `cran_df.csv`, `pypi_df.csv`, and `bioconductor_df.csv` files exist under `data/metadata_files/normalized` and the `mention2ID.pkl` file exists under `data/intermediate_files`.

At the end of this step, you should have:
- `pypi_synonyms.pkl`
- `cran_synonyms.pkl`
- `bioconductor_synoynms.pkl`
```
python generate_synonyms_scicrunch.py
```

This step assumes that the `scicrunch_df.csv` file exists under `data/metadata_files/normalized`.

At the end of this step, you should have:
- `scicrunch_synoynms.pkl`
- `extra_scicrunch_synonyms.pkl`
```
python generate_synonyms_string_similarity.py
```

This step assumes that the `mention2ID.pkl` file exists under `data/intermediate_files`.

This step can be time-consuming, so we recommend running it in batches. You have the option of choosing an ID_start as well as an ID_end, and a Spark implementation is also available.

```
python generate_synonyms_string_similarity.py --ID_start 0 --ID_end 100
```

The start/end IDs refer to the software mention IDs in `mention2ID.pkl`.
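For reference, the Jaro-Winkler measure used in this step can be sketched in pure Python; this is a textbook implementation for illustration (libraries such as jellyfish provide the same measure), not the project's script:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: combines match counts and transpositions."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    match1, match2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions among matched characters.
    t, k = 0, 0
    for i, c in enumerate(s1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if c != s2[k]:
                t += 1
            k += 1
    t //= 2
    m = matches
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (max 4)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

Batching over ID ranges works because each mention's similarities can be computed independently of the others.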
After all the batched files are generated, combine them into one master file by running:

```
python generate_string_sim_dict.py
```

This will generate a `string_similarity_dict.pkl` file.

At the end of this step, you should have:
- `string_similarity_dict.pkl`
This step combines all synonym files into a master file and does some post-processing.

It assumes that the following files have already been generated:
- `pypi_synonyms.pkl`
- `cran_synonyms.pkl`
- `bioconductor_synoynms.pkl`
- `scicrunch_synoynms.pkl`
- `extra_scicrunch_synonyms.pkl`
- `string_similarity_dict.pkl`
- `mention2ID.pkl`

```
python combine_all_synonyms.py
```

At the end of this step, you should have:
- a `synonyms.csv` file under `data/disambiguation`
This step involves computing the similarity matrix and clustering the mentions.

Note that this step assumes that the `synonyms.csv` file has already been computed and contains similarity scores between pairs of strings.

A frequency dictionary `freq_dict.pkl` is also required to run the clustering algorithm and assign each cluster's name to the mention with the highest frequency in the corpus. If you don't have this file, create it using the steps under Setup.

```
python clustering.py --synonyms-file <synonyms_file>
```

Example: `python clustering.py --synonyms-file ../data/disambiguation_files/synonyms.csv`
We evaluate the disambiguation algorithm using an expert team of biomedical curators. We ask them to label 5885 generated software-synonym pairs as one of: Exact, Narrow, Unclear, or Not Synonym.
The evaluation file is available as `evaluation_disambiguation.csv` and the script to compute the metrics is `evaluation_disambiguation.py`.

To get the evaluation metrics, run the evaluation script inside the `evaluation` folder:

```
python evaluation_disambiguation.py --linking-evaluation-file ../data/curation/evaluation_disambiguation.csv
```
- Most scripts assume that software mentions files (e.g. comm.tsv, comm_IDs.tsv) are in the format comm.tsv.gz or comm_IDs.tsv.gz.
- Each script has a number of command-line arguments that can be passed in; more info is available inside each file.
- You might get errors when trying to read the `publishers_collection` files with pandas. If using pandas, we recommend adding the `error_bad_lines` flag (replaced by `on_bad_lines='skip'` in pandas 1.3+). This might incorrectly disregard a small number of lines.

```
publishers_collection_df = pd.read_csv('publishers_collections.tsv.gz', sep='\t', compression='gzip', engine='python', error_bad_lines=False)
```