Annotations new features #54

Marcel-Mueck · 2024-02-20T13:47:08Z

What

This update to the annotation pipeline includes all annotation tools used for the published version of deeprvat, along with every processing step needed to produce ready-to-run annotation data for that version.

Updates since last push:

inclusion of AlphaMissense annotations as a plugin in VEP
changed the DAG to optimize parallelisation
allele frequency (AF), MAF and AF_MB are now calculated from the genotype file
the pipeline includes a filtering step for variants based on their distance to the nearest exon
the columns required by deeprvat are now selected and renamed to fit the required input
the abSplice pipeline is now integrated into the DAG
spliceAI scores are aggregated to spliceAI_delta_scores
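
As a rough illustration of what calculating allele frequencies from genotypes involves, here is a minimal sketch. It is not the pipeline's code: the function names are hypothetical, genotypes are assumed to be diploid alternate-allele counts (0, 1 or 2) with `None` for missing calls, and AF_MB is left out because its definition is not given here.

```python
from typing import Optional, Sequence


def allele_frequency(dosages: Sequence[Optional[int]]) -> Optional[float]:
    """Alternate-allele frequency for one variant (hypothetical helper).

    Assumes diploid calls: each non-missing sample contributes two alleles.
    Returns None when no sample has a called genotype.
    """
    called = [d for d in dosages if d is not None]
    if not called:
        return None
    return sum(called) / (2 * len(called))


def minor_allele_frequency(af: Optional[float]) -> Optional[float]:
    """Fold an allele frequency onto [0, 0.5]; None stays None."""
    if af is None:
        return None
    return min(af, 1.0 - af)


if __name__ == "__main__":
    variant = [0, 0, 1, 0]  # one heterozygous carrier among four samples
    af = allele_frequency(variant)
    print(af, minor_allele_frequency(af))
```

The diploid assumption does not hold for hemizygous calls on X/Y, so a real implementation would need to treat those chromosomes separately.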

Testing

since the annotation tools rely on pre-scored data, fully automated testing via GitHub Actions is not possible
the annotation pipeline was instead tested locally using example data

…into annotations-new-features

… functions

…into annotations-new-features

…ason

…into annotations-new-features

Corrected fill values for maf columns

meyerkm · 2024-02-23T10:01:26Z

@Marcel-Mueck There seem to be duplicate definitions in annotations.py of the following functions:

deepripe_score_variant_onlyseq_all
merge_abscore
process_annotations
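
One way to spot such duplicates mechanically is to parse the module and count top-level function names. The sketch below uses only the standard-library `ast` module; the helper name `duplicate_function_names` is made up for illustration and is not part of deeprvat.

```python
import ast
from collections import Counter


def duplicate_function_names(source: str) -> list[str]:
    """Return the names of top-level functions defined more than once."""
    tree = ast.parse(source)
    names = [
        node.name
        for node in tree.body  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    example = (
        "def merge_abscore():\n    pass\n\n"
        "def other():\n    pass\n\n"
        "def merge_abscore():\n    return 1\n"
    )
    print(duplicate_function_names(example))
```

To check the real module, pass it the file contents, e.g. `duplicate_function_names(open(path).read())`, with `path` pointing at annotations.py in your checkout.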

Also, it looks like the Read the Docs build is failing because the docstrings aren't in the correct Sphinx format. See here for how to adapt them: https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html
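
For reference, a docstring in the reStructuredText field-list style that Sphinx parses looks roughly like the sketch below. The function is a made-up example, not one from annotations.py, and taking the maximum is only an illustrative choice, not a statement about how the pipeline aggregates scores.

```python
from typing import Sequence


def max_delta_score(delta_scores: Sequence[float]) -> float:
    """Return the largest value among a variant's delta scores.

    :param delta_scores: Delta scores for a single variant.
    :type delta_scores: Sequence[float]
    :raises ValueError: If ``delta_scores`` is empty.
    :return: The maximum delta score.
    :rtype: float
    """
    if not delta_scores:
        raise ValueError("delta_scores must not be empty")
    return max(delta_scores)


if __name__ == "__main__":
    print(max_delta_score([0.1, 0.7, 0.0, 0.2]))
```

The `:param:`, `:type:`, `:raises:`, `:return:` and `:rtype:` fields are what Sphinx picks up when it renders the API documentation.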

…ations are dropped

…into annotations-new-features

…torch#123097)

…t files.

…into annotations-new-features

@endast

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date: Tue Apr 9 11:56:54 2024 +0200

Annotations new features (#54)

* added all changes from annotation-speedups branch
* added gtf and genotype mock file for github tests
* Delete example/annotations/preprocessing_workdir/preprocessed directory
* Update annotation_colnames_filling_values.yaml
* Corrected fill values for maf columns
* Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped
* included rulegraph instead dag
* based on suggestions from @endast
* added version info for rockdb.yaml file
* updated rulegraph Updated Documentation corrected nonfunctional links
* added support for X/Y chromosomes, removed dependency on pvcf file
* excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097)
* changed way file stems are assumed to include 'double ending' on input files.
* removed unused lines, removed pvcf from config file
* changed if statement for gene_id_file

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>

@endast

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date: Tue Apr 9 11:56:54 2024 +0200

Annotations new features (#54)

* added all changes from annotation-speedups branch
* added gtf and genotype mock file for github tests
* Delete example/annotations/preprocessing_workdir/preprocessed directory
* Update annotation_colnames_filling_values.yaml
* Corrected fill values for maf columns
* Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped
* included rulegraph instead dag
* based on suggestions from @endast
* added version info for rockdb.yaml file
* updated rulegraph Updated Documentation corrected nonfunctional links
* added support for X/Y chromosomes, removed dependency on pvcf file
* excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097)
* changed way file stems are assumed to include 'double ending' on input files.
* removed unused lines, removed pvcf from config file
* changed if statement for gene_id_file

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date: Thu Apr 4 14:09:22 2024 +0200

Update preprocessing.md (#60)

Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date: Fri Mar 1 14:55:55 2024 +0100

Update dense_gt.py (#56)

bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date: Fri Feb 23 12:21:49 2024 +0100

Feature cv training (#55)

* performance optimizations
* train multiple repeats on single node in parallel
* bug fix
* fix bug in indexing when subset_samples() removed something
* sleep between jobs; stop if any job fails
* format with black
* bug fixes
* add test for MultiphenoDataloader
* update environments
* uncomment rules
* bug fixes
* subset samples in training_dataset rule
* example config.yaml
* use gpu queue for compute_burdens
* bugfix since dask reading didn't work any more
* allow evaluation of all repeat combinations
* allow analysis of each n_repeats and for all repeat combinations
* option to provide burden file
* allow seed gene alpha to be defined in config
* change sorting order to get the best model
* adaptations to analyze multiple repeats and use script wo seed genes
* allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct
* automatize generation of figure 3 (associations & repliation)
* generate cv splits with related samples in the same split
* average burdens
* average burdens
* cross-validation like trainign
* add missing cv_utils
* write average burdens or each combination to single zarr file to avoid zarr issues
* add logging information
* make maf column a param
* add logging
* pipeline replictaion and plotting
* evaluate all repeat combis with and without seed genes
* update lsf.yaml
* small updates
* per-gene pval aggregation
* aggregate pval per gene
* bugfix- only load burdens if not skip burdens
* logging info
* updates and fixes
* load burdens only for genes analysed in current chunk to save memory
* small changes to pipeline
* standardizing/qt-transform of combined test set x/y arrays
* my_quantile_transform for numpy arrays
* bugfix
* remove unnecessary code
* remove unnecessary wildcards
* make averaging part of associate.py
* allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)
* updates
* gene-specific common variant covariates for conditional analysis
* bugfix
* post-hoc conditioning on common variants
* restructure pipelines
* removing redundant options
* add cv_utils cli
* simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default
* removal of redundant wildcards, updates and fixes
* bugfixes
* baseline discoveries only required for training phenotypes
* remove not needed code
* update configs
* formatting
* manually merge changes from feature-regenie to account for gene-specific annotations
* allow different sample orders in phenotype_df and genotypes.h5
* change sample ids to be bytes as it is in the real data
* update pipelines
* update gitignore
* pipeline updates
* manually update github actions to be like master
* bug fixes
* checkout tests from master
* make phenotype indices string as they are in real data
* 'add gene_id' column
* manually merge with master so tests can pass
* bugfixes
* use gene_id column instead of gene_ids
* pipeline updates and fixes
* update test config
* adding age2 and age_sex to example data
* update config
* set tests folder to main version
* checkout preprocssing files from main
* checkout from main
* manually merge sample_id changes from main
* pipeline bugfixes and renamings
* fixup! Format Python code with psf/black pull_request
* remove gene_ids column
* integrating suggested PR changes
* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date: Wed Feb 21 15:56:14 2024 +0100

Feature regenie (#52)

* convert burdens and phenotypes to SAIGE format
* add function to make regenie input
* modifications for regenie
* bug fixes
* update to use regenie
* add function for mapping samples
* implement burden export
* convert burdens and phenotypes to SAIGE format
* add function to make regenie input
* modifications for regenie
* bug fixes
* update to use regenie
* add function for mapping samples
* implement burden export
* add function to convert REGENIE output
* don't show all unmapped samples if the list is long
* don't parallelize REGENIE step 1
* separate pipelines with and without REGENIE
* support gene-specific annotation
* bug fix
* bug fix
* bug fix
* bug fix
* correct regenie_step1 --lowmem-prefix
* modify to work standalone
* add --association-only option
* allow gene-specific annotation
* go back to SEAK/statsmodels
* bug fixes
* remove SAIGE code, fix imports and conda envs
* make pipelines more self-contained
* don't require burdens.zarr when --skip-burdens is passed
* udpate utils

---------

Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

@endast

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date: Mon Apr 15 11:01:03 2024 +0200

fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

* fixed bugs in the annotation pipeline based on issues #61, #62 and #63.
* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date: Tue Apr 9 11:56:54 2024 +0200

Annotations new features (#54)

* added all changes from annotation-speedups branch
* added gtf and genotype mock file for github tests
* Delete example/annotations/preprocessing_workdir/preprocessed directory
* Update annotation_colnames_filling_values.yaml
* Corrected fill values for maf columns
* Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped
* included rulegraph instead dag
* based on suggestions from @endast
* added version info for rockdb.yaml file
* updated rulegraph Updated Documentation corrected nonfunctional links
* added support for X/Y chromosomes, removed dependency on pvcf file
* excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097)
* changed way file stems are assumed to include 'double ending' on input files.
* removed unused lines, removed pvcf from config file
* changed if statement for gene_id_file

---------

Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
Co-authored-by: PMBio <PMBio@users.noreply.github.com>

@endast

* Add new test files * Update test_preprocess.py * Use parquet * Add brians code * Update preprocess.py * sort samples * Remove threads * Update exclude calls logic * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. 
* removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * 
pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * 
fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert "Squashed commit of the following:" This reverts commit ebde7c1. * Remove unused import * don't use mkl 2024.1.0 * update micromamba@v1.8.1 * Isolate failing test * test genotype matrix * Revert "test genotype matrix" This reverts commit 6deee9b. * Revert "Isolate failing test" This reverts commit 6a11fe3. * fixup! Format Python code with psf/black pull_request * remove files * Delete variants.tsv.gz * Update test_preprocess.py * Update test_preprocess.py * fixup! 
Format Python code with psf/black pull_request * Update test_preprocess.py * Update test-runner.yml * one test * Revert "one test" This reverts commit 05e4578. * Revert "Update test-runner.yml" This reverts commit ff78d30. * update call filter test data * Update expected data * Update deeprvat_preprocessing_env.yml Remove joblib * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. 
* removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * 
pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * 
fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert change of micromamba * Ruff check * Squashed commit of the following: commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) * fixed bugs in the annotation pipeline based on issues #61, #62 and #63. * fixup! 
Format Python code with psf/black pull_request --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> --------- Co-authored-by: PMBio <PMBio@users.noreply.github.com>

@endast

commit 24b3af5 Author: Magnus Wahlberg <endast@gmail.com> Date: Tue Apr 16 10:40:45 2024 +0200 Optimize preprocessing (#65) * Add new test files * Update test_preprocess.py * Use parquet * Add brians code * Update preprocess.py * sort samples * Remove threads * Update exclude calls logic * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54)
commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * 
pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * 
fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert "Squashed commit of the following:" This reverts commit ebde7c1. * Remove unused import * don't use mkl 2024.1.0 * update micromamba@v1.8.1 * Isolate failing test * test genotype matrix * Revert "test genotype matrix" This reverts commit 6deee9b. * Revert "Isolate failing test" This reverts commit 6a11fe3. * fixup! Format Python code with psf/black pull_request * remove files * Delete variants.tsv.gz * Update test_preprocess.py * Update test_preprocess.py * fixup! 
Format Python code with psf/black pull_request * Update test_preprocess.py * Update test-runner.yml * one test * Revert "one test" This reverts commit 05e4578. * Revert "Update test-runner.yml" This reverts commit ff78d30. * update call filter test data * Update expected data * Update deeprvat_preprocessing_env.yml Remove joblib * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * Revert change of micromamba * Ruff check * Squashed commit of the following: commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) * fixed bugs in the annotation pipeline based on issues #61, #62 and #63. * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> --------- Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54)

@endast

* add qc_indmiss * Update preprocess_with_qc.snakefile * Fix csv * add process_individual_missingness cmd * add process_individual_missingness * Use separate variable for sample_path * Only write sample to indmiss file * add test_process_individual_missingness tests * Add sample missingness to workflow * Update dag images in doc * Update test_preprocess.py * add back create_excluded_samples_dir * Cleanup pipeline * fixup! Format Python code with psf/black pull_request * Update preprocess.py * fixup! Format Python code with psf/black pull_request * Fix ruff errors * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * Squashed commit of the following: commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * Squashed commit of the following: commit 24b3af5 Author: Magnus Wahlberg <endast@gmail.com> Date: Tue Apr 16 10:40:45 2024 +0200 Optimize preprocessing (#65) commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * Revert "Squashed commit of the following:" This reverts commit 4e9b47d. --------- Co-authored-by: PMBio <PMBio@users.noreply.github.com>

“Marcel-Mueck” and others added 5 commits February 20, 2024 14:04

added all changes from annotation-speedups branch

63f8737

added gtf and genotype mock file for github tests

cf556d1

Delete example/annotations/preprocessing_workdir/preprocessed directory

6129243

added mock genotype file

cb1b5c0

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

a4b98c5

Marcel-Mueck requested a review from endast February 20, 2024 13:47

PMBio and others added 14 commits February 20, 2024 13:47

fixup! Format Python code with psf/black pull_request

152d9cd

run black

6bd59be

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

69c8ad4

added docstrings to functions, removed unused exon distance filtering functions

c73d011

fixup! Format Python code with psf/black pull_request

477cad2

removed link because sphinx complained for no reason

85f4a5a

removed indents in docstrings

0a6ec34

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

975483c

fixup! Format Python code with psf/black pull_request

6f60ddb

ran black on repository

a112a8d

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

36e103f

fixup! Format Python code with psf/black pull_request

597decf

removed example code in docstring because sphinx complained for no reason

2f20b1f

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

8259b46

Marcel-Mueck closed this Feb 20, 2024

Marcel-Mueck reopened this Feb 20, 2024

Marcel-Mueck added 2 commits February 21, 2024 16:42

Update annotation_colnames_filling_values.yaml

eb7feee

Corrected fill values for maf columns

corrected missspelling

3f023b1

Marcel-Mueck and others added 5 commits February 26, 2024 11:27

removed duplicate functions and corrected corrupt docstring

3b5b953

fixup! Format Python code with psf/black pull_request

1aec9e5

corrected corrupt sphinx docstring

7d58426

Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

10870f8

fixup! Format Python code with psf/black pull_request

c94e930

PMBio and others added 9 commits April 5, 2024 11:54

fixup! Format Python code with psf/black pull_request

da07527

specified mlk version to test if tf error persists

f971bba

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

b549638

added mock file for test run

3467a33

deleted (misspelled) mock fiel directory

cbf349a

excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097)

58c4769

changed way file stems are assumed to include 'double ending' on input files.

620b5c8

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

29aed29

updated annotations.md to correct sef-reference links

e234bb5

Marcel-Mueck added the enhancement (New feature or request) label and removed the bug (Something isn't working) label Apr 9, 2024

“Marcel-Mueck” and others added 4 commits April 9, 2024 10:32

removed unused lines, removed pvcf from config file

cf7685f

fixup! Format Python code with psf/black pull_request

c05470f

changed if statement for gene_id_file

692c386

Merge branch 'annotations-new-features' of github.com:PMBio/deeprvat into annotations-new-features

08de261

endast approved these changes Apr 9, 2024

Marcel-Mueck merged commit 101feb2 into main Apr 9, 2024
12 checks passed

Marcel-Mueck deleted the annotations-new-features branch April 9, 2024 09:57

This was referenced Apr 11, 2024

Version pointer in environment_spliceai_rocksdb.yaml #61

Closed

directory() missing in absplice_download.snakefile in the annotation pipeline #62

Closed

Inconsistent columns during concat_annotations #63

Closed

Jonas-B-Frank mentioned this pull request Apr 22, 2024

Missing columns in final annotations.parquet lead to errors in seed gene pipeline. Failing save_merge. #74

Closed
