Annotations new features #54

Merged
merged 54 commits into from
Apr 9, 2024

Conversation

Marcel-Mueck
Collaborator

What

This update to the annotation pipeline adds all annotation tools used for the published version of DeepRVAT, together with all processing steps needed to produce ready-to-run annotation data for that version.
[Figure: annotation_rulegraph]

Updates since last push:

  • AlphaMissense annotations are included as a VEP plugin
  • the DAG was changed to improve parallelisation
  • allele frequency (AF), MAF and AF_MB are calculated from the genotype file
  • the pipeline includes a filtering step for variants based on their distance to the nearest exon
  • the columns required by DeepRVAT are now selected and renamed to match the expected input
  • the AbSplice pipeline is now integrated into the DAG
  • SpliceAI scores are aggregated into spliceAI_delta_scores
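
To illustrate the frequency calculation mentioned in the list above, here is a minimal sketch of how AF and MAF can be derived from genotype calls. It is not the pipeline's actual code: the function names are hypothetical, genotypes are assumed to be diploid alternate-allele counts (0, 1 or 2) with `None` for a missing call, and AF_MB is left out because its definition is not given here.

```python
from typing import Optional, Sequence


def allele_frequency(alt_counts: Sequence[Optional[int]]) -> Optional[float]:
    """Alternate-allele frequency for one variant (hypothetical helper).

    alt_counts holds one value per sample: 0, 1 or 2 alternate alleles,
    or None for a missing call. Diploid genotypes are assumed.
    """
    called = [g for g in alt_counts if g is not None]
    if not called:
        return None  # no sample was called, so the frequency is undefined
    return sum(called) / (2 * len(called))


def minor_allele_frequency(af: float) -> float:
    """Frequency of the rarer of the two alleles."""
    return min(af, 1.0 - af)


af = allele_frequency([0, 0, 1, None])
maf = minor_allele_frequency(af)
```

Missing calls are excluded from the denominator, so a variant called in only a few samples is not biased toward zero. Haploid calls would need a different denominator.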

Testing

  • because the annotation tools rely on pre-scored data, automated tests via GitHub Actions are not fully possible
  • the annotation pipeline was instead tested locally using example data

@Marcel-Mueck Marcel-Mueck requested a review from endast February 20, 2024 13:47
@meyerkm
Collaborator

meyerkm commented Feb 23, 2024

@Marcel-Mueck There seem to be duplicate definitions of the following functions in annotations.py:

  • deepripe_score_variant_onlyseq_all
  • merge_abscore
  • process_annotations
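
A quick way to check for duplicates like these is to parse the module and count top-level function names. The sketch below uses only the standard-library `ast` module. The helper name `duplicate_functions` is hypothetical, and it only looks at module-level definitions, not methods or nested functions.

```python
import ast
from collections import Counter
from typing import List


def duplicate_functions(source: str) -> List[str]:
    """Return names of top-level functions defined more than once."""
    tree = ast.parse(source)
    names = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


example = "def a():\n    pass\n\ndef b():\n    pass\n\ndef a():\n    return 1\n"
print(duplicate_functions(example))
```

To run it against the real file, pass the file's text, for example `duplicate_functions(open("annotations.py").read())` with the path adjusted to the repository layout.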

Also, it looks like the Read the Docs build is failing because the docstrings aren't in the correct Sphinx format. See here for how to adapt them: https://sphinx-rtd-tutorial.readthedocs.io/en/latest/docstrings.html
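
For reference, a docstring in Sphinx (reStructuredText field list) style looks roughly like the sketch below. The function `within_exon_distance` and its parameters are made up for illustration and are not taken from annotations.py. Only the `:param:`, `:type:`, `:returns:` and `:rtype:` fields are the point.

```python
from typing import List


def within_exon_distance(distances: List[int], max_dist: int) -> List[bool]:
    """Flag variants that lie close enough to an exon.

    :param distances: Distance of each variant to its nearest exon, in base pairs.
    :type distances: List[int]
    :param max_dist: Largest distance that is still kept (inclusive).
    :type max_dist: int
    :returns: One boolean per variant, ``True`` if the variant is kept.
    :rtype: List[bool]
    """
    return [abs(d) <= max_dist for d in distances]
```

Note the blank line between the summary and the field list. Without it, reStructuredText does not reliably parse the fields.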

@Marcel-Mueck Marcel-Mueck added the enhancement ("New feature or request") label and removed the bug ("Something isn't working") label Apr 9, 2024
@Marcel-Mueck Marcel-Mueck merged commit 101feb2 into main Apr 9, 2024
12 checks passed
@Marcel-Mueck Marcel-Mueck deleted the annotations-new-features branch April 9, 2024 09:57
endast added a commit that referenced this pull request Apr 10, 2024
commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit that referenced this pull request Apr 10, 2024
commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automatize generation of figure 3 (associations & replication)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation-like training

    * add missing cv_utils

    * write average burdens for each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replication and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * 'add gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocessing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * update utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
endast added a commit that referenced this pull request Apr 12, 2024
endast added a commit that referenced this pull request Apr 15, 2024
commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

Marcel-Mueck pushed a commit that referenced this pull request Apr 16, 2024
* Add new test files

* Update test_preprocess.py

* Use parquet

* Add brians code

* Update preprocess.py

* sort samples

* Remove threads

* Update exclude calls logic

* Squashed commit of the following:

* Revert "Squashed commit of the following:"

This reverts commit ebde7c1.

* Remove unused import

* don't use mkl 2024.1.0

* update micromamba@v1.8.1

* Isolate failing test

* test genotype matrix

* Revert "test genotype matrix"

This reverts commit 6deee9b.

* Revert "Isolate failing test"

This reverts commit 6a11fe3.

* fixup! Format Python code with psf/black pull_request

* remove files

* Delete variants.tsv.gz

* Update test_preprocess.py

* Update test_preprocess.py

* fixup! Format Python code with psf/black pull_request

* Update test_preprocess.py

* Update test-runner.yml

* one test

* Revert "one test"

This reverts commit 05e4578.

* Revert "Update test-runner.yml"

This reverts commit ff78d30.

* update call filter test data

* Update expected data

* Update deeprvat_preprocessing_env.yml

Remove joblib

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0  since it is crashing pytorch(pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 628af87
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Thu Apr 4 14:09:22 2024 +0200

    Update preprocessing.md (#60)

    Corrected small spelling mistake

commit 1356ed2
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Mar 1 14:55:55 2024 +0100

    Update dense_gt.py (#56)

    bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

commit 4d9ef64
Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
Date:   Fri Feb 23 12:21:49 2024 +0100

    Feature cv training (#55)

    * performance optimizations

    * train multiple repeats on single node in parallel

    * bug fix

    * fix bug in indexing when subset_samples() removed something

    * sleep between jobs; stop if any job fails

    * format with black

    * bug fixes

    * add test for MultiphenoDataloader

    * update environments

    * uncomment rules

    * bug fixes

    * subset samples in training_dataset rule

    * example config.yaml

    * use gpu queue for compute_burdens

    * bugfix since dask reading didn't work any more

    * allow evaluation of all repeat combinations

    * allow analysis of each n_repeats and for all repeat combinations

    * option to provide burden file

    * allow seed gene alpha to be defined in config

    * change sorting order to get the best model

    * adaptations to analyze multiple repeats and use script wo seed genes

    * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

    * automate generation of figure 3 (associations & replication)

    * generate cv splits with related samples in the same split

    * average burdens

    * average burdens

    * cross-validation-like training

    * add missing cv_utils

    * write average burdens for each combination to single zarr file to avoid zarr issues

    * add logging information

    * make maf column a param

    * add logging

    * pipeline replication and plotting

    * evaluate all repeat combis with and without seed genes

    * update lsf.yaml

    * small updates

    * per-gene pval aggregation

    * aggregate pval per gene

    * bugfix- only load burdens if not skip burdens

    * logging info

    * updates and fixes

    * load burdens only for genes analysed in current chunk to save memory

    * small changes to pipeline

    * standardizing/qt-transform of combined test set x/y arrays

    * my_quantile_transform for numpy arrays

    * bugfix

    * remove unnecessary code

    * remove unnecessary wildcards

    * make averaging part of associate.py

    * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

    * updates

    * gene-specific common variant covariates for conditional analysis

    * bugfix

    * post-hoc conditioning on common variants

    * restructure pipelines

    * removing redundant options

    * add cv_utils cli

    * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

    * removal of redundant wildcards, updates and fixes

    * bugfixes

    * baseline discoveries only required for training phenotypes

    * remove not needed code

    * update configs

    * formatting

    * manually merge changes from feature-regenie to account for gene-specific annotations

    * allow different sample orders in phenotype_df and genotypes.h5

    * change sample ids to be bytes as it is in the real data

    * update pipelines

    * update gitignore

    * pipeline updates

    * manually update github actions to be like master

    * bug fixes

    * checkout tests from master

    * make phenotype indices string as they are in real data

    * add 'gene_id' column

    * manually merge with master so tests can pass

    * bugfixes

    * use gene_id column instead of gene_ids

    * pipeline updates and fixes

    * update test config

    * adding age2 and age_sex to example data

    * update config

    * set tests folder to  main version

    * checkout preprocessing files from main

    * checkout from main

    * manually merge sample_id changes from main

    * pipeline bugfixes and renamings

    * fixup! Format Python code with psf/black pull_request

    * remove gene_ids column

    * integrating suggested PR changes

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ada0aaa
Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
Date:   Wed Feb 21 15:56:14 2024 +0100

    Feature regenie (#52)

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * convert burdens and phenotypes to SAIGE format

    * add function to make regenie input

    * modifications for regenie

    * bug fixes

    * update to use regenie

    * add function for mapping samples

    * implement burden export

    * add function to convert REGENIE output

    * don't show all unmapped samples if the list is long

    * don't parallelize REGENIE step 1

    * separate pipelines with and without REGENIE

    * support gene-specific annotation

    * bug fix

    * bug fix

    * bug fix

    * bug fix

    * correct regenie_step1 --lowmem-prefix

    * modify to work standalone

    * add --association-only option

    * allow gene-specific annotation

    * go back to SEAK/statsmodels

    * bug fixes

    * remove SAIGE code, fix imports and conda envs

    * make pipelines more self-contained

    * don't require burdens.zarr when --skip-burdens is passed

    * update utils

    ---------

    Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

* Revert change of micromamba

* Ruff check

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit that referenced this pull request Apr 16, 2024
commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add Brian's code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be  missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * add 'gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to  main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>
endast added a commit that referenced this pull request Apr 16, 2024
* add qc_indmiss

* Update preprocess_with_qc.snakefile

* Fix csv

* add process_individual_missingness cmd

* add process_individual_missingness

* Use separate variable for sample_path

* Only write sample to indmiss file

* add test_process_individual_missingness tests

* Add sample missingness to workflow

* Update dag images in doc

* Update test_preprocess.py

* add back create_excluded_samples_dir

* Cleanup pipeline

* fixup! Format Python code with psf/black pull_request

* Update preprocess.py

* fixup! Format Python code with psf/black pull_request

* Fix ruff errors

* Squashed commit of the following:

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on  suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Squashed commit of the following:

commit 24b3af5
Author: Magnus Wahlberg <endast@gmail.com>
Date:   Tue Apr 16 10:40:45 2024 +0200

    Optimize preprocessing (#65)

    * Add new test files

    * Update test_preprocess.py

    * Use parquet

    * Add Brian's code

    * Update preprocess.py

    * sort samples

    * Remove threads

    * Update exclude calls logic

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on  suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it crashes pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix: had forgotten to remove sample_file = none, but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to  provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automate generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert "Squashed commit of the following:"

    This reverts commit ebde7c1.

    * Remove unused import

    * don't use mkl 2024.1.0

    * update micromamba@v1.8.1

    * Isolate failing test

    * test genotype matrix

    * Revert "test genotype matrix"

    This reverts commit 6deee9b.

    * Revert "Isolate failing test"

    This reverts commit 6a11fe3.

    * fixup! Format Python code with psf/black pull_request

    * remove files

    * Delete variants.tsv.gz

    * Update test_preprocess.py

    * Update test_preprocess.py

    * fixup! Format Python code with psf/black pull_request

    * Update test_preprocess.py

    * Update test-runner.yml

    * one test

    * Revert "one test"

    This reverts commit 05e4578.

    * Revert "Update test-runner.yml"

    This reverts commit ff78d30.

    * update call filter test data

    * Update expected data

    * Update deeprvat_preprocessing_env.yml

    Remove joblib

    * Squashed commit of the following:

    commit 101feb2
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Tue Apr 9 11:56:54 2024 +0200

        Annotations new features (#54)

        * added all changes from annotation-speedups branch

        * added gtf and genotype mock file for github tests

        * Delete example/annotations/preprocessing_workdir/preprocessed directory

        * Update annotation_colnames_filling_values.yaml

        * Corrected fill values for maf columns

        * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

        * included rulegraph instead of dag

        * based on suggestions from @endast

        * added version info for rockdb.yaml file

        * updated rulegraph

        Updated Documentation

        corrected nonfunctional links

        * added support for X/Y chromosomes, removed dependency on pvcf file

        * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

        * changed way file stems are assumed to include 'double ending' on input files.

        * removed unused lines, removed pvcf from config file

        * changed if statement for gene_id_file

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit 628af87
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Thu Apr 4 14:09:22 2024 +0200

        Update preprocessing.md (#60)

        Corrected small spelling mistake

    commit 1356ed2
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Mar 1 14:55:55 2024 +0100

        Update dense_gt.py (#56)

        bugfix (had forgotten to remove sample_file = none) but the sample file is needed during cv training

    commit 4d9ef64
    Author: Eva Holtkamp <59055511+HolEv@users.noreply.github.com>
    Date:   Fri Feb 23 12:21:49 2024 +0100

        Feature cv training (#55)

        * performance optimizations

        * train multiple repeats on single node in parallel

        * bug fix

        * fix bug in indexing when subset_samples() removed something

        * sleep between jobs; stop if any job fails

        * format with black

        * bug fixes

        * add test for MultiphenoDataloader

        * update environments

        * uncomment rules

        * bug fixes

        * subset samples in training_dataset rule

        * example config.yaml

        * use gpu queue for compute_burdens

        * bugfix since dask reading didn't work any more

        * allow evaluation of all repeat combinations

        * allow analysis of each n_repeats and for all repeat combinations

        * option to provide burden file

        * allow seed gene alpha to be defined in config

        * change sorting order to get the best model

        * adaptations to analyze multiple repeats and use script wo seed genes

        * allow to provide a sample file and do separate indexing for pheno and geno to ensure indices are correct

        * automatize generation of figure 3 (associations & replication)

        * generate cv splits with related samples in the same split

        * average burdens

        * average burdens

        * cross-validation-like training

        * add missing cv_utils

        * write average burdens for each combination to single zarr file to avoid zarr issues

        * add logging information

        * make maf column a param

        * add logging

        * pipeline replication and plotting

        * evaluate all repeat combis with and without seed genes

        * update lsf.yaml

        * small updates

        * per-gene pval aggregation

        * aggregate pval per gene

        * bugfix- only load burdens if not skip burdens

        * logging info

        * updates and fixes

        * load burdens only for genes analysed in current chunk to save memory

        * small changes to pipeline

        * standardizing/qt-transform of combined test set x/y arrays

        * my_quantile_transform for numpy arrays

        * bugfix

        * remove unnecessary code

        * remove unnecessary wildcards

        * make averaging part of associate.py

        * allow seed genes/baselines to be missing (to allow assoc. testing for non-training phenotypes)

        * updates

        * gene-specific common variant covariates for conditional analysis

        * bugfix

        * post-hoc conditioning on common variants

        * restructure pipelines

        * removing redundant options

        * add cv_utils cli

        * simplify script (only evaluate one repeat combi/average burdens); aggregate baseline pvalues; make bonferroni correction default

        * removal of redundant wildcards, updates and fixes

        * bugfixes

        * baseline discoveries only required for training phenotypes

        * remove not needed code

        * update configs

        * formatting

        * manually merge changes from feature-regenie to account for gene-specific annotations

        * allow different sample orders in phenotype_df and genotypes.h5

        * change sample ids to be bytes as it is in the real data

        * update pipelines

        * update gitignore

        * pipeline updates

        * manually update github actions to be like master

        * bug fixes

        * checkout tests from master

        * make phenotype indices string as they are in real data

        * 'add gene_id' column

        * manually merge with master so tests can pass

        * bugfixes

        * use gene_id column instead of gene_ids

        * pipeline updates and fixes

        * update test config

        * adding age2 and age_sex to example data

        * update config

        * set tests folder to main version

        * checkout preprocessing files from main

        * checkout from main

        * manually merge sample_id changes from main

        * pipeline bugfixes and renamings

        * fixup! Format Python code with psf/black pull_request

        * remove gene_ids column

        * integrating suggested PR changes

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    commit ada0aaa
    Author: Brian Clarke <9725212+bfclarke@users.noreply.github.com>
    Date:   Wed Feb 21 15:56:14 2024 +0100

        Feature regenie (#52)

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * convert burdens and phenotypes to SAIGE format

        * add function to make regenie input

        * modifications for regenie

        * bug fixes

        * update to use regenie

        * add function for mapping samples

        * implement burden export

        * add function to convert REGENIE output

        * don't show all unmapped samples if the list is long

        * don't parallelize REGENIE step 1

        * separate pipelines with and without REGENIE

        * support gene-specific annotation

        * bug fix

        * bug fix

        * bug fix

        * bug fix

        * correct regenie_step1 --lowmem-prefix

        * modify to work standalone

        * add --association-only option

        * allow gene-specific annotation

        * go back to SEAK/statsmodels

        * bug fixes

        * remove SAIGE code, fix imports and conda envs

        * make pipelines more self-contained

        * don't require burdens.zarr when --skip-burdens is passed

        * update utils

        ---------

        Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>

    * Revert change of micromamba

    * Ruff check

    * Squashed commit of the following:

    commit ae5c83e
    Author: Marcel Mück <mueckm1@gmail.com>
    Date:   Mon Apr 15 11:01:03 2024 +0200

        fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

        * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

        * fixup! Format Python code with psf/black pull_request

        ---------

        Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
        Co-authored-by: PMBio <PMBio@users.noreply.github.com>

    ---------

    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit ae5c83e
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Mon Apr 15 11:01:03 2024 +0200

    fixed bugs in the annotation pipeline based on issues #61, #62 and #63. (#64)

    * fixed bugs in the annotation pipeline based on issues #61, #62 and #63.

    * fixup! Format Python code with psf/black pull_request

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

commit 101feb2
Author: Marcel Mück <mueckm1@gmail.com>
Date:   Tue Apr 9 11:56:54 2024 +0200

    Annotations new features (#54)

    * added all changes from annotation-speedups branch

    * added gtf and genotype mock file for github tests

    * Delete example/annotations/preprocessing_workdir/preprocessed directory

    * Update annotation_colnames_filling_values.yaml

    * Corrected fill values for maf columns

    * Changed protein_id merging and exon distance filtering, s.t. no annotations are dropped

    * included rulegraph instead of dag

    * based on suggestions from @endast

    * added version info for rockdb.yaml file

    * updated rulegraph

    Updated Documentation

    corrected nonfunctional links

    * added support for X/Y chromosomes, removed dependency on pvcf file

    * excluded mkl version 2024.1.0 since it is crashing pytorch (pytorch/pytorch#123097)

    * changed way file stems are assumed to include 'double ending' on input files.

    * removed unused lines, removed pvcf from config file

    * changed if statement for gene_id_file

    ---------

    Co-authored-by: “Marcel-Mueck” <“mueckm1@gmail.com”>
    Co-authored-by: PMBio <PMBio@users.noreply.github.com>

* Revert "Squashed commit of the following:"

This reverts commit 4e9b47d.

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
Labels
enhancement New feature or request

3 participants