Feature cv training #55

Merged: 93 commits, Feb 23, 2024

@HolEv (Collaborator) commented Feb 20, 2024

What

  • Implement cross-validation-like training (pipelines/cv_training)

    • train the model on n-1 folds of the samples
    • compute the burdens on the held-out fold using this model
    • repeat this for all n folds
    • for this, allow a sample file to be passed to dense_gt.py
  • Allow different orders of samples in phenotype_df and genotypes.h5 in dense_gt.py

    • so far, samples in phenotype_df and genotypes.h5 had to be in the same order.
    • this is changed now by introducing an additional index map for the genotypes.h5, which retrieves samples in the order of self.samples
  • Average burdens from multiple repeats and run association testing afterwards (deeprvat/associate.py), as opposed to running the association testing on each cv individually

  • Restructure the snakefiles, as had already been started in the main branch

    • require baseline results only for training phenotypes
  • Update evaluate.py

    • repeats are no longer required
    • make Bonferroni correction the default multiple testing correction
    • don't combine baseline discoveries with DeepRVAT discoveries
    • also re-test the 'seed genes', since we no longer evaluate on the same sample-gene combinations that we trained on (thanks to the cv-based training procedure)
  • use additional covariates age2 and age*sex and correct for statin usage

    • updated the example data to have these fields
  • update example data to have byte sample ids in genotypes.h5 and string sample ids in phenotype_df

  • implement conditional analysis for common variants (pipelines/association_testing_control_for_common_variants.snakefile)
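
For reference, the cross-validation-like scheme from the first bullet can be sketched roughly as follows. This is a toy illustration, not the code in pipelines/cv_training: `make_folds`, `cv_burdens`, `train_model` and `compute_burdens` are hypothetical stand-ins, and the round-robin fold assignment is an assumption (per the commit log, the real splits also keep related samples in the same split, which is omitted here).

```python
from typing import Callable, Dict, List, Sequence


def make_folds(samples: Sequence[str], n_folds: int) -> List[List[str]]:
    """Assign samples to n_folds disjoint folds (round-robin; an assumption)."""
    if n_folds < 2:
        raise ValueError("need at least 2 folds")
    return [list(samples[i::n_folds]) for i in range(n_folds)]


def cv_burdens(
    samples: Sequence[str],
    n_folds: int,
    train_model: Callable,
    compute_burdens: Callable,
) -> Dict[str, float]:
    """Train on n-1 folds and compute burdens on the held-out fold, for every fold."""
    folds = make_folds(samples, n_folds)
    burdens: Dict[str, float] = {}
    for held_out_idx, held_out in enumerate(folds):
        train = [s for i, fold in enumerate(folds) if i != held_out_idx for s in fold]
        model = train_model(train)
        # Burdens for a sample always come from a model that never saw that sample.
        burdens.update(compute_burdens(model, held_out))
    return burdens


# Toy stand-ins (hypothetical): the "model" just remembers its training samples.
def train_model(train_samples):
    return frozenset(train_samples)


def compute_burdens(model, held_out):
    assert not model & set(held_out), "held-out samples leaked into training"
    return {s: float(len(model)) for s in held_out}


samples = [f"s{i}" for i in range(10)]
burdens = cv_burdens(samples, 5, train_model, compute_burdens)
```

With 10 samples and 5 folds, every sample receives exactly one burden, computed by a model trained on the other 8 samples.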

Testing

  • quite extensively tested (many reasonable experiments run), but we still need to check why the GitHub tests currently fail
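
One piece that is easy to check in isolation is the reordering logic behind the new index map. Below is a minimal sketch, assuming sample ids are stored as bytes in genotypes.h5 and as strings in phenotype_df (as in the updated example data). `build_index_map` is a hypothetical helper for illustration, not the function in dense_gt.py, and the toy lists stand in for the real arrays.

```python
from typing import List, Sequence


def build_index_map(genotype_ids: Sequence[bytes], samples: Sequence[str]) -> List[int]:
    """Return, for each requested sample, its row position in the genotype file."""
    # Decode the byte ids once so they can be compared with the string ids.
    row_of = {gid.decode("utf-8"): row for row, gid in enumerate(genotype_ids)}
    missing = [s for s in samples if s not in row_of]
    if missing:
        raise KeyError(
            f"{len(missing)} sample(s) not found in genotype file, e.g. {missing[0]!r}"
        )
    return [row_of[s] for s in samples]


# Genotype rows are stored in a different order than the requested samples.
genotype_ids = [b"1003", b"1001", b"1002"]
genotype_rows = ["row_1003", "row_1001", "row_1002"]
samples = ["1001", "1002", "1003"]

index_map = build_index_map(genotype_ids, samples)
reordered = [genotype_rows[i] for i in index_map]
```

Here `index_map` is `[1, 2, 0]`, so `reordered` follows the order of `samples` regardless of how the rows are stored.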

Contributor

@bfclarke bfclarke left a comment

This generally looks very good, thanks! I left a few comments that could optionally be addressed. The one that we should really fix before merging is about using a with statement when opening the genotype file - otherwise it could end up in a locked state if the script is interrupted while it's open.
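
To illustrate the point about the with statement, here is a small stand-alone sketch. `FakeGenotypeFile` is a hypothetical stand-in for the real file handle (it is not h5py and not code from dense_gt.py); it only tracks whether the handle was released. Assuming the genotype file is opened with h5py, the analogous pattern would be `with h5py.File(path, "r") as f:`.

```python
class FakeGenotypeFile:
    """Hypothetical stand-in for a file handle that holds a lock while open."""

    def __init__(self, path: str):
        self.path = path
        self.locked = False

    def __enter__(self):
        self.locked = True  # lock is taken when the file is opened
        return self

    def __exit__(self, exc_type, exc, tb):
        self.locked = False  # always runs, even if the body raised
        return False  # do not swallow the exception


def read_interrupted(handle: FakeGenotypeFile) -> None:
    """Open the file and simulate the script being interrupted mid-read."""
    with handle:
        raise KeyboardInterrupt  # e.g. Ctrl-C while the file is open


handle = FakeGenotypeFile("genotypes.h5")
interrupted = False
try:
    read_interrupted(handle)
except KeyboardInterrupt:
    interrupted = True
```

Even though the read is interrupted, `__exit__` runs and the handle is released, which is the behaviour an explicit open/close pair without `try`/`finally` does not guarantee.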

deeprvat/data/dense_gt.py (outdated, resolved)
deeprvat/data/dense_gt.py (outdated, resolved)
deeprvat/data/dense_gt.py (resolved)
deeprvat/data/dense_gt.py (resolved)
@bfclarke bfclarke merged commit 4d9ef64 into main Feb 23, 2024
1 check passed
endast added a commit that referenced this pull request Apr 10, 2024
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & replication)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation-like training

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replication and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocessing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * update utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
endast added a commit that referenced this pull request Apr 12, 2024
Marcel-Mueck pushed a commit that referenced this pull request Apr 16, 2024
* Add new test files

* Update test_preprocess.py

* Use parquet

* Add brians code

* Update preprocess.py

* sort samples

* Remove threads

* Update exclude calls logic

* Squashed commit of the following:

* Revert "Squashed commit of the following:"

This reverts commit ebde7c1.

* Remove unused import

* don't use mkl 2024.1.0

* update micromamba@v1.8.1

* Isolate failing test

* test genotype matrix

* Revert "test genotype matrix"

This reverts commit 6deee9b.

* Revert "Isolate failing test"

This reverts commit 6a11fe3.

* fixup! Format Python code with psf/black pull_request

* remove files

* Delete variants.tsv.gz

* Update test_preprocess.py

* Update test_preprocess.py

* fixup! Format Python code with psf/black pull_request

* Update test_preprocess.py

* Update test-runner.yml

* one test

* Revert "one test"

This reverts commit 05e4578.

* Revert "Update test-runner.yml"

This reverts commit ff78d30.

* update call filter test data

* Update expected data

* Update deeprvat_preprocessing_env.yml

Remove joblib

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & repliation)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation like trainign

    * add missing cv_utils

    * write average burdens or each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replictaion and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocssing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * update utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

* Revert change of micromamba

* Ruff check

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit that referenced this pull request Apr 16, 2024
commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add Brian's code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow providing a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix: only load burdens if burdens are not skipped

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow providing a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix: only load burdens if burdens are not skipped

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit that referenced this pull request Apr 16, 2024
* add qc_indmiss

* Update preprocess_with_qc.snakefile

* Fix csv

* add process_individual_missingness cmd

* add process_individual_missingness

* Use separate variable for sample_path

* Only write sample to indmiss file

* add test_process_individual_missingness tests

* Add sample missingness to workflow

* Update dag images in doc

* Update test_preprocess.py

* add back create_excluded_samples_dir

* Cleanup pipeline

* fixup! Format Python code with psf/black pull_request

* Update preprocess.py

* fixup! Format Python code with psf/black pull_request

* Fix ruff errors

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add Brian's code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow providing a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix: only load burdens if burdens are not skipped

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script without seed genes

        * allow providing a sample file and index pheno and geno separately to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix: only load burdens if not skipping burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove unneeded code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Revert "Squashed commit of the following:"

This reverts commit 4e9b47d.

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>