The variant scoring repository provides a set of scripts for scoring genetic variants using a ChromBPNet model.
The variant_scoring.py script takes a list of variants in one of several input formats and generates scores for them using a ChromBPNet model. The output is a TSV file containing the scores for each variant.
python variant_scoring.py -l [VARIANTS_FILE] -g [GENOME_FASTA] -m [MODEL_PATH] -o [OUT_PREFIX] -s [CHROM_SIZES] [OTHER_ARGS]
-l or --list: (required) a TSV file containing a list of variants to score
-g or --genome: (required) a genome fasta file
-pg or --peak_genome: a genome fasta file for peaks
-m or --model: (required) the ChromBPNet model to use for variant scoring. For most use cases, this should be the bias-corrected model (chrombpnet_nobias.h5)
-o or --out_prefix: (required) the path prefix for storing SNP effect score predictions from the script. The directory should already exist
-s or --chrom_sizes: (required) the path to a TSV file with chromosome sizes
-ps or --peak_chrom_sizes: the path to a TSV file with chromosome sizes for the peak genome
-dm or --debug_mode: subsample 10,000 variants for debugging
-bs or --batch_size: the batch size to use for the model. Default is 512
-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'
-p or --peaks: a bed file containing peak regions
-n or --num_shuf: the number of shuffled scores per SNP. Default is 10
-t or --total_shuf: the total number of shuffled scores across all SNPs. Overrides --num_shuf
-c or --chrom: only score SNPs in the selected chromosome
-r or --random_seed: the random seed for reproducibility when sampling. Default is 1234
--no_hdf5: do not save detailed predictions in hdf5 file
-fo or --forward_only: run variant scoring only on forward sequence
-st or --shap_type: the type of SHAP values to compute. Default is "counts"
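As a rough illustration of how the required arguments fit together, the sketch below assembles the command line in Python using only the flags documented above. The file paths are hypothetical placeholders, and `build_scoring_command` is a helper invented for this example; it is not part of the repository.

```python
import shlex

def build_scoring_command(variants, genome, model, out_prefix, chrom_sizes,
                          schema="chrombpnet", extra_args=()):
    """Assemble an argv list for variant_scoring.py (hypothetical helper)."""
    cmd = [
        "python", "variant_scoring.py",
        "-l", variants,      # variants TSV (required)
        "-g", genome,        # genome fasta (required)
        "-m", model,         # ChromBPNet model (required)
        "-o", out_prefix,    # output prefix; its directory must already exist
        "-s", chrom_sizes,   # chromosome sizes TSV (required)
        "-sc", schema,       # input schema; 'chrombpnet' is the documented default
    ]
    cmd.extend(extra_args)   # any other documented flags, passed through as-is
    return cmd

# Placeholder paths, for illustration only.
cmd = build_scoring_command(
    "variants.tsv", "genome.fa", "chrombpnet_nobias.h5",
    "out/sample1", "genome.chrom.sizes",
    extra_args=["-bs", "256", "--no_hdf5"],
)
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`; building it as a list rather than a single string avoids shell-quoting problems with unusual paths.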
The expected column order for each schema is:
- chrombpnet : ['chr', 'pos', 'allele1', 'allele2', 'variant_id']
- bed : ['chr', 'pos', 'end', 'allele1', 'allele2', 'variant_id']
- plink : ['chr', 'variant_id', 'ignore1', 'pos', 'allele1', 'allele2']
- original : ['chr', 'pos', 'variant_id', 'allele1', 'allele2']
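To make the schemas concrete, here is a minimal sketch of reading a variants file under one of these column orders. It assumes the file is tab-separated with no header row, which the text above does not state explicitly; `read_variants` and the example variant are made up for illustration.

```python
import csv
import io

# Column orders copied from the schema list above.
SCHEMAS = {
    "chrombpnet": ["chr", "pos", "allele1", "allele2", "variant_id"],
    "bed": ["chr", "pos", "end", "allele1", "allele2", "variant_id"],
    "plink": ["chr", "variant_id", "ignore1", "pos", "allele1", "allele2"],
    "original": ["chr", "pos", "variant_id", "allele1", "allele2"],
}

def read_variants(handle, schema="chrombpnet"):
    """Yield one dict per variant from a headerless TSV (assumed layout)."""
    columns = SCHEMAS[schema]
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) < len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(row)}")
        record = dict(zip(columns, row))  # extra trailing columns are ignored
        record["pos"] = int(record["pos"])
        yield record

# A made-up variant in the 'chrombpnet' column order.
example = io.StringIO("chr1\t12345\tA\tG\tvar_0001\n")
variants = list(read_variants(example, schema="chrombpnet"))
print(variants[0])
```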
The variant_summary_across_folds.py script takes variant scores generated by the variant_scoring.py script and generates a TSV file with the mean of each score type across folds.
python variant_summary_across_folds.py -sd [VARIANT_SCORE_DIR] -sl [SCORE_LIST] -o [OUT_PREFIX] -sc [SCHEMA]
-sd or --score_dir: (required) path to the directory with variant scores that will be used to generate the summary
-sl or --score_list: (required) names of the variant score files that will be used to generate the summary
-o or --out_prefix: (required) path prefix for storing the summary file with average scores across folds; the directory should already exist
-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'
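The averaging step can be pictured with the sketch below, which takes per-fold scores keyed by variant and returns the mean of each score type. This is an illustration of the idea only, not the script's implementation: the score names (`score_a`, `score_b`), the `.mean` suffix, and the in-memory data layout are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def mean_across_folds(fold_scores):
    """Average each score type per variant across folds.

    fold_scores: list with one dict per fold, each mapping
    variant_id -> {score_type: value}.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for fold in fold_scores:
        for variant_id, scores in fold.items():
            for score_type, value in scores.items():
                collected[variant_id][score_type].append(value)
    # The '.mean' suffix is a naming choice for this sketch only.
    return {
        variant_id: {f"{name}.mean": mean(values) for name, values in by_type.items()}
        for variant_id, by_type in collected.items()
    }

# Two hypothetical folds scoring the same variant.
folds = [
    {"var_0001": {"score_a": 0.5, "score_b": 1.0}},
    {"var_0001": {"score_a": 1.5, "score_b": 3.0}},
]
summary = mean_across_folds(folds)
print(summary)
```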
The variant_annotation.py script takes a list of variants and annotates each with its closest genes and any overlapping peaks.
NOTE: This script assumes that the peaks and genes are in the same reference genome as the variants, and it does not perform any liftover operations.
python variant_annotation.py -l [VARIANTS_FILE] -o [OUT_PREFIX] -p [PEAKS] -g [GENES] -sc [SCHEMA]
-l or --list: (required) a TSV file containing a list of variants to annotate
-o or --out_prefix: (required) path prefix for storing the annotated file; the directory should already exist
-p or --peaks: (required) a bed file containing peak regions
-g or --genes: (required) a bed file with gene coordinates
-sc or --schema: the format for the input variants list. Choices are: 'bed', 'plink', 'chrombpnet', 'original'. Default is 'chrombpnet'
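The two annotations can be sketched as simple interval checks. The code below is an illustration under simplifying assumptions, not how variant_annotation.py is implemented: it handles a single chromosome, uses 0-based half-open intervals, scans linearly, and the gene names and coordinates are invented.

```python
def overlaps_peak(pos0, peaks):
    """True if the 0-based position falls in any half-open [start, end) peak."""
    return any(start <= pos0 < end for start, end in peaks)

def closest_gene(pos0, genes):
    """Return (name, distance) of the nearest gene; distance is 0 inside a gene."""
    def distance(gene):
        _, start, end = gene
        if start <= pos0 < end:
            return 0
        # The last base of the gene sits at end - 1 in half-open coordinates.
        return start - pos0 if pos0 < start else pos0 - (end - 1)
    best = min(genes, key=distance)
    return best[0], distance(best)

# Invented intervals on one chromosome.
peaks = [(250, 300), (1000, 1200)]
genes = [("GENE_A", 100, 200), ("GENE_B", 500, 900)]

in_peak = overlaps_peak(260, peaks)
nearest = closest_gene(260, genes)
print(in_peak, nearest)
```

For real data with many variants and regions, a sorted-interval or tool-based approach would be more appropriate than this linear scan.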
NOTE: The pos (position) column holds the 1-indexed SNP position, unless the schema is bed.
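A small sketch of the coordinate difference follows. It assumes the bed schema follows the standard BED convention (0-based start, exclusive end), which the note above implies but does not spell out; both helper functions are hypothetical.

```python
def pos_to_bed_interval(pos):
    """Convert a 1-indexed SNP position to a 0-based half-open (start, end)."""
    if pos < 1:
        raise ValueError("1-indexed positions start at 1")
    return pos - 1, pos

def bed_start_to_pos(start):
    """Convert a 0-based BED start back to a 1-indexed position."""
    if start < 0:
        raise ValueError("0-based starts cannot be negative")
    return start + 1

interval = pos_to_bed_interval(12345)
print(interval)
```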