Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
commit 24b3af5 Author: Magnus Wahlberg <endast@gmail.com> Date: Tue Apr 16 10:40:45 2024 +0200 Optimize preprocessing (#65) * Add new test files * Update test_preprocess.py * Use parquet * Add brians code * Update preprocess.py * sort samples * Remove threads * Update exclude calls logic * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. * removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert "Squashed commit of the following:" This reverts commit ebde7c1. * Remove unused import * don't use mkl 2024.1.0 * update micromamba@v1.8.1 * Isolate failing test * test genotype matrix * Revert "test genotype matrix" This reverts commit 6deee9b. * Revert "Isolate failing test" This reverts commit 6a11fe3. * fixup! Format Python code with psf/black pull_request * remove files * Delete variants.tsv.gz * Update test_preprocess.py * Update test_preprocess.py * fixup! Format Python code with psf/black pull_request * Update test_preprocess.py * Update test-runner.yml * one test * Revert "one test" This reverts commit 05e4578. * Revert "Update test-runner.yml" This reverts commit ff78d30. * update call filter test data * Update expected data * Update deeprvat_preprocessing_env.yml Remove joblib * Squashed commit of the following: commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. * removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 628af87 Author: Marcel Mück <mueckm1@gmail.com> Date: Thu Apr 4 14:09:22 2024 +0200 Update preprocessing.md (#60) Corrected small spelling mistake commit 1356ed2 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Mar 1 14:55:55 2024 +0100 Update dense_gt.py (#56) bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training commit 4d9ef64 Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com> Date: Fri Feb 23 12:21:49 2024 +0100 Feature cv training (#55) * performance optimizations * train multiple repeats on single node in parallel * bug fix * fix bug in indexing when subset_samples() removed something * sleep between jobs; stop if any job fails * format with black * bug fixes * add test for MultiphenoDataloader * update environments * uncomment rules * bug fixes * subset samples in training_dataset rule * example config.yaml * use gpu queue for compute_burdens * bugfix since dask reading didn't work any more * allow evaluation of all repeat combinations * allow analysis of each n_repeats and for all repeat combinations * option to provide burden file * allow seed gene alpha to be defined in config * change sorting order to get the best model * adaptations to analyze multiple repeats and use script wo seed genes * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct * automatize generation of figure 3 (associations & repliation) * generate cv splits with related samples in the same split * average burdens * average burdens * cross-validation like trainign * add missing cv_utils * write average burdens or each combination to single zarr file to avoid zarr issues * add logging information * make maf column a param * add logging * pipeline replictaion and plotting * evaluate all repeat combis with and without seed genes * update lsf.yaml * small updates * per-gene pval aggregation * aggregate pval per gene * bugfix- only load burdens if not skip burdens * logging info * updates and fixes * load burdens only for genes analysed in current chunk to save memory * small changes to pipeline * standardizing/qt-transform of combined test set x/y arrays * my_quantile_transform for numpy arrays * bugfix * remove unnecessary code * remove unnecessary wildcards * make averaging part of associate.py * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes) * updates * gene-specific common variant covariates for conditional analysis * bugfix * post-hoc conditioning on common variants * restructure pipelines * removing redundant options * add cv_utils cli * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default * removal of redundant wildcards, updates and fixes * bugfixes * baseline discoveries only required for training phenotypes * remove not needed code * update configs * formatting * manually merge changes from feature-regenie to account for gene-specific annotations * allow different sample orders in phenotype_df and genotypes.h5 * change sample ids to be bytes as it is in the real data * update pipelines * update gitignore * pipeline updates * manually update github actions to be like master * bug fixes * checkout tests from master * make phenotype indices string as they are in real data * 'add gene_id' column * manually merge with master so tests can pass * bugfixes * use gene_id column instead of gene_ids * pipeline updates and fixes * update test config * adding age2 and age_sex to example data * update config * set tests folder to main version * checkout preprocssing files from main * checkout from main * manually merge sample_id changes from main * pipeline bugfixes and renamings * fixup! Format Python code with psf/black pull_request * remove gene_ids column * integrating suggested PR changes * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ada0aaa Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com> Date: Wed Feb 21 15:56:14 2024 +0100 Feature regenie (#52) * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * convert burdens and phenotypes to SAIGE format * add function to make regenie input * modifications for regenie * bug fixes * update to use regenie * add function for mapping samples * implement burden export * add function to convert REGENIE output * don't show all unmapped samples if the list is long * don't parallelize REGENIE step 1 * separate pipelines with and without REGENIE * support gene-specific annotation * bug fix * bug fix * bug fix * bug fix * correct regenie_step1 --lowmem-prefix * modify to work standalone * add --association-only option * allow gene-specific annotation * go back to SEAK/statsmodels * bug fixes * remove SAIGE code, fix imports and conda envs * make pipelines more self-contained * don't require burdens.zarr when --skip-burdens is passed * udpate utils --------- Co-authored-by: Brian Clarke <brian.clarke@dkfz.de> * Revert change of micromamba * Ruff check * Squashed commit of the following: commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) * fixed bugs in the annotation pipeline based on issues #61, #62 and #63. * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> --------- Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit ae5c83e Author: Marcel Mück <mueckm1@gmail.com> Date: Mon Apr 15 11:01:03 2024 +0200 fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64) * fixed bugs in the annotation pipeline based on issues #61, #62 and #63. * fixup! Format Python code with psf/black pull_request --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com> commit 101feb2 Author: Marcel Mück <mueckm1@gmail.com> Date: Tue Apr 9 11:56:54 2024 +0200 Annotations new features (#54) * added all changes from annotation-speedups branch * added gtf and genotype mock file for github tests * Delete example/annotations/preprocessing_workdir/preprocessed directory * Update annotation_colnames_filling_values.yaml * Corrected fill values for maf columns * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped * included rulegraph instead dag * based on suggestions from @endast * added version info for rockdb.yaml file * updated rulegraph Updated Documentation corrected nonfunctional links * added support for X/Y chromosomes, removed dependency on pvcf file * excluded mkl version 2024.1.0 since it is crashing pytorch(pytorch/pytorch#123097) * changed way file stems are assumed to include 'double ending' on input files. * removed unused lines, removed pvcf from config file * changed if statement for gene_id_file --------- Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”> Co-authored-by: PMBio <PMBio@users.noreply.github.com>
- Loading branch information