4. Software Utilities

Bob Dolin edited this page Sep 21, 2023 · 12 revisions

This section describes the software utilities in the /utilities folder. They are used primarily to help load data into MongoDB (e.g. vcf2json formats VCF data into a JSON structure suitable for loading into MongoDB) and to support fast normalization (e.g. by replicating portions of NCBI variation services). Utilities include:

  • SPDI_Normalization: Converts a chromosome-level variant, as derived from a VCF, into a contextual SPDI of the same build, using the algorithm described here. Before running the code, download the GRCh38 and GRCh37 FASTA files from the NCBI Human Genome Resources page and change the Python code to point to the downloaded files. The FASTA files are indexed on the first run, so that run takes longer.

  • bed2json: Converts a BED file into a format suitable for loading into MongoDB. Chromosome names must include the 'chr' prefix: 'chr1', 'chrX', 'chrY', 'chrM'. The BED file must be sorted by chromosome, then by position (the bedtools sort default).

  • run_vcf2json.py: Batch process that calls vcf2json for a set of VCF files, yielding three output files ('variantsData.json', 'phaseData.json', 'molecularConsequences.json') for loading into the respective MongoDB collections. It does not update the Patients or Tests collections. The VCFs to be processed are listed in vcfData.csv, which must include the columns vcf_filename, ref_build (populated with 'GRCh37' or 'GRCh38'), patient_id, test_date (yyyy-mm-dd), test_id, specimen_id, genomic_source_class (populated with 'germline', 'somatic', or 'mixed'), ratio_ad_dp (used for mitochondrial DNA processing; generally set to 0.99), and sample_position (zero-based; useful for multi-sample VCFs). The vcf2json translation logic is based on vcf2fhir.

  • vcfPrepper: Implements the molecular consequence pipeline described on the Getting Started page.
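The bed2json input requirements listed above can be checked before conversion. The following is a minimal sketch, not part of the /utilities folder: the function name `check_bed_lines` is hypothetical, and it assumes that "sorted by chromosome, by position" means lexicographic chromosome order followed by ascending start position, which is how the bedtools sort default behaves.

```python
def check_bed_lines(lines):
    """Return a list of problems found in BED lines (empty if none).

    Hypothetical pre-flight check; not part of bed2json itself.
    """
    problems = []
    previous_key = None
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        # Skip blank lines and common BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(f"line {line_no}: fewer than 3 columns")
            continue
        chrom = fields[0]
        try:
            start = int(fields[1])
        except ValueError:
            problems.append(f"line {line_no}: start is not an integer")
            continue
        if not chrom.startswith("chr"):
            problems.append(f"line {line_no}: chromosome {chrom!r} lacks 'chr'")
        # Assumed sort order: chromosome name (lexicographic), then start.
        key = (chrom, start)
        if previous_key is not None and key < previous_key:
            problems.append(f"line {line_no}: out of sort order")
        previous_key = key
    return problems
```

Calling `check_bed_lines(open("regions.bed"))` (with a hypothetical file name) returns an empty list when the file satisfies both requirements.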
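Because run_vcf2json.py depends on the columns of vcfData.csv described above, it can help to validate that file before starting a batch. The sketch below is an illustration only, not the project's own code: `validate_vcf_data` is a hypothetical helper, and the checks cover just the constraints stated on this page (required columns, allowed ref_build and genomic_source_class values, the yyyy-mm-dd date, a numeric ratio_ad_dp, and a zero-based sample_position).

```python
import csv
import io
from datetime import datetime

REQUIRED_COLUMNS = [
    "vcf_filename", "ref_build", "patient_id", "test_date", "test_id",
    "specimen_id", "genomic_source_class", "ratio_ad_dp", "sample_position",
]
VALID_BUILDS = {"GRCh37", "GRCh38"}
VALID_SOURCE_CLASSES = {"germline", "somatic", "mixed"}


def validate_vcf_data(csv_text):
    """Return a list of problems in vcfData.csv content (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row["ref_build"] not in VALID_BUILDS:
            problems.append(f"line {line_no}: ref_build {row['ref_build']!r}")
        if row["genomic_source_class"] not in VALID_SOURCE_CLASSES:
            problems.append(
                f"line {line_no}: genomic_source_class "
                f"{row['genomic_source_class']!r}"
            )
        try:
            datetime.strptime(row["test_date"], "%Y-%m-%d")
        except (ValueError, TypeError):
            problems.append(f"line {line_no}: test_date is not yyyy-mm-dd")
        try:
            float(row["ratio_ad_dp"])
        except (ValueError, TypeError):
            problems.append(f"line {line_no}: ratio_ad_dp is not a number")
        try:
            if int(row["sample_position"]) < 0:
                raise ValueError("negative position")
        except (ValueError, TypeError):
            problems.append(
                f"line {line_no}: sample_position is not a zero-based integer"
            )
    return problems
```

A typical use would be to read vcfData.csv as text, pass it to `validate_vcf_data`, and only start the batch if the returned list is empty.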