Python Code using CyVCF2 to run custom GWAS models on imputed VCF files
Post Doc at Institute of Biological Psychiatry
This is code written in python, that uses the CyVCF2 library to parse imputed, bgzip'ed and tabix indexed VCF files and accepts custom models written in R as a command line option to run a genome wide association study, using rpy2, an interface, that facilitates r-functions to be seamlessly run in python code.
Usage can be obtained by running the following command:
python custom_gwas.py -h
which prints:
usage: custom_gwas.py [-h] --vcf VCF --pheno PHENO --covar COVAR --rscript RSCRIPT --model MODEL --out OUT
where
VCF => The bgzip, tab indexed VCF file containing dosage values in the format column 'DS'
PHENO => File of phenotypes with samples listed under the column name IID and a phenotype name of your choice
COVAR => File of covariates with samples listed under the column name IID and your choice of covariates, weights etc.
RSCRIPT => some file.R that contains one or more functions pertaining to the association models you wish to fit.
MODEL => Name of the particular function in the file.R that you wish to fit
OUT => Output file name
An example for the VCF, PHENO, COVAR and OUT files are provided in the test/ folder.
An example of a custom R-script and function is provided at /bin/custom_function.R
Ultimately the freedom is given to the user to specify columns of their choice in the phenotype and covariate files and thereby using them in the Rscript they provide in specifying the model. The only major requirements are that:
- VCF contains DS column in the FORMAT fields
- The pheno and covariate files can be interesected using an IID column
- The custom gwas Rscript returns a dataframe with the column names: estimate se p
The script can be tested on the test data using the following command:
python /bin/custom_gwas.py \
--vcf /test/test.vcf.gz \
--pheno /test/test_pheno.txt \
--covar /test/test_covar.txt \
--rscript /bin/custom_function.R \
--model custom_fit \
--out test_out.txt
The code has the following dependencies:
python
cyvcf2
numpy
pandas
rpy2
R
dplyr
broom
All dependencies can be installed using conda or the requisite environment can be created using the environment.yml file with the following command:
conda create --file environment.yml
For any questions or bug reports, please create an issue on github.