Simplify configuration file for running deepRVAT (#99)
* reorganize config file locations

* change runner config to deeprvat_config.yaml

* pipeline and script to create deeprvat_config.yaml

* update paths to new config dir for smoke tests

* update config name in deeprvat smoke test

* update github actions config file path

* fixup! Format Python code with psf/black pull_request

* update github actions config file path

* Nesting pl_trainer and early_stopping into training config key

* Removing training phenotypes from pretrained model config setup.
Simplifying association testing phenotypes.

* Add in evaluation config section with correction_method and alpha parameters

* fixup! Format Python code with psf/black pull_request

* fix evaluation alpha key

* moving seed-gene-results correction-method to user-facing config

* fixup! Format Python code with psf/black pull_request

* Adding dir paths in config yaml to deeprvat repo and pretrained models

* bug fix - catching key errors

* integrate deeprvat_config.yaml file generation into existing snakefile pipelines
-fixed configfile path in annotation pipeline

* Making input config more descriptive

* breakout cv options in config

* incorporating regenie option into config

* typo fix

* fixup! Format Python code with psf/black pull_request

* seed gene discovery pipeline config refactor

* fixup! Format Python code with psf/black pull_request

* make association testing data name more clear

* fixup! Format Python code with psf/black pull_request

* remove no-longer needed config file.
See deeprvat/example/config/ for new input config files

* move items out of input config and into base config

* add in deeprvat training/association testing sample file option to config

* seed-gene-discovery subset sample file option added

* fixup! Format Python code with psf/black pull_request

* restructuring of docs

* update docs

* Add in default if phenotypes_for_training not specified

* add in config_generate.log file to view stdout

* setting y_transformation as optional config parameter

* Making association testing and training data thresholds as optional configurations

* fix-up docs

* fixup! Format Python code with psf/black pull_request

* add in pretrained-model-path config defaults. Add in MAF threshold requirement.

* fixup! Format Python code with psf/black pull_request

* bug fix cv config key options

* bug-fix cv config name

* fixup! Format Python code with psf/black pull_request

* update data_key default to association_testing_data in associate.py

* fixup! Format Python code with psf/black pull_request

* reduce excessive looping

* fixup! Format Python code with psf/black pull_request

* add in missing final rule evaluate as rule all in pretrained models run

* fix-up gh actions and pytests

* add extra check to allow user to override configfile with snakemake --configfile argument

* bug-fix gh actions

* set default to disable gpu usage

* point to example data for gh actions

* Update docs to pass tests

* fix example data path for gh actions

* Fix config path

* rename config.yaml files for better organization

* fix pretrained model path for gh actions

* unset gpu usage for gh actions

* bug-fix gh actions

* reduce training phenotypes and n-repeats for gh actions

* remove unnecessary todos

* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
Co-authored-by: Eva Holtkamp <eva.holtkamp@gmx.de>
Co-authored-by: Magnus Wahlberg <endast@gmail.com>
5 people authored Jun 25, 2024
1 parent 22715bc commit 46bf983
Showing 63 changed files with 3,306 additions and 1,073 deletions.
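
The commit message describes two structural changes to the config: `pl_trainer` and `early_stopping` move under a `training` key, and (as the diffs below show) the `data` key becomes `association_testing_data`. A minimal sketch of that mapping on a plain dictionary, assuming the old keys sat at the top level. `migrate_config` and the example values are hypothetical and not part of DeepRVAT:

```python
from copy import deepcopy
from typing import Any, Dict


def migrate_config(old: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: map an old-style config dict to the new key layout."""
    new = deepcopy(old)  # leave the caller's dictionary untouched

    # "data" is renamed to "association_testing_data" (see the associate.py diff).
    if "data" in new and "association_testing_data" not in new:
        new["association_testing_data"] = new.pop("data")

    # Assumption: pl_trainer and early_stopping used to be top-level keys.
    training = new.setdefault("training", {})
    for key in ("pl_trainer", "early_stopping"):
        if key in new:
            training.setdefault(key, new.pop(key))

    return new


# Illustrative values only, not taken from a real DeepRVAT config.
old_config = {
    "data": {"dataset_config": {"y_phenotypes": ["LDL"]}},
    "pl_trainer": {"gpus": 0},
    "early_stopping": {"patience": 3},
}
new_config = migrate_config(old_config)
```

The helper copies its input so that an existing config object can still be compared against the migrated one.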
31 changes: 19 additions & 12 deletions .github/workflows/pipeline-tests.yml
@@ -9,13 +9,15 @@ jobs:
with:
pipeline_file: ./pipelines/run_training.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-RunTraining:
needs: Smoke-RunTraining
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/run_training.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Pretrained Pipeline
@@ -24,15 +26,15 @@ jobs:
with:
pipeline_file: ./pipelines/association_testing_pretrained.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/

Pipeline-Tests-Training-Association-Testing:
needs: Smoke-Association-Testing-Pretrained
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/association_testing_pretrained.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Pretrained Regenie
@@ -41,15 +43,15 @@ jobs:
with:
pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/

Pipeline-Tests-Association-Testing-Pretrained-Regenie:
needs: Smoke-Association-Testing-Pretrained-Regenie
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Training
@@ -58,13 +60,15 @@ jobs:
with:
pipeline_file: ./pipelines/training_association_testing.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-Association-Testing-Training:
needs: Smoke-Association-Testing-Training
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/training_association_testing.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Training Regenie
@@ -73,13 +77,15 @@ jobs:
with:
pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-Training-Association-Testing-Regenie:
needs: Smoke-Association-Testing-Training-Regenie
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Seed Gene Discovery
@@ -88,16 +94,17 @@ jobs:
with:
pipeline_file: ./pipelines/seed_gene_discovery.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/

Pipeline-Tests-Seed-Gene-Discovery:
needs: Smoke-Seed-Gene-Discovery
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/seed_gene_discovery.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/
dry_run: false


# Preprocessing With QC
Smoke-Preprocessing-With-QC:
@@ -106,7 +113,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_with_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference

@@ -117,7 +124,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_with_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference
@@ -129,7 +136,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_no_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference

@@ -140,7 +147,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_no_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference
@@ -151,7 +158,7 @@ jobs:
with:
pipeline_file: ./pipelines/annotations.snakefile
environment_file: ./deeprvat_annotations.yml
pipeline_config: ./pipelines/config/deeprvat_annotation_config.yaml
pipeline_config: ./example/config/deeprvat_annotation_config.yaml
pipeline_directory: ./example/annotations
download_fasta_data: true
fasta_download_path: ./example/annotations/reference
@@ -168,7 +175,7 @@ jobs:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
-O ./example/annotations/reference/gencode.v44.annotation.gtf.gz
pipeline_directory: ./example/annotations
pipeline_config: ./pipelines/config/deeprvat_annotation_config_minimal.yaml
pipeline_config: ./example/config/deeprvat_annotation_config_minimal.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/annotations/reference
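
Most jobs above now use `prerun_cmd` to copy a test-specific `deeprvat_config.yaml` into `./example/` before the pipeline runs. The same staging step can be sketched in Python, under the assumption that the pipeline reads `deeprvat_config.yaml` from its working directory. `stage_config` is a hypothetical name, and the temporary paths stand in for `./tests/...` and `./example/`:

```python
import shutil
import tempfile
from pathlib import Path


def stage_config(source: Path, workdir: Path, name: str = "deeprvat_config.yaml") -> Path:
    """Copy a config file into workdir under the name the pipeline is assumed to expect."""
    workdir.mkdir(parents=True, exist_ok=True)
    target = workdir / name
    shutil.copy(source, target)  # overwrites an existing target, like `cp`
    return target


# Demonstration with temporary paths instead of the repository layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    source = root / "tests" / "deeprvat_config.yaml"
    source.parent.mkdir(parents=True)
    source.write_text("placeholder: true\n")  # placeholder, not a real DeepRVAT config
    staged = stage_config(source, root / "example")
    staged_name = staged.name
    staged_text = staged.read_text()
```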
7 changes: 2 additions & 5 deletions .github/workflows/run-pipeline.yml
@@ -44,6 +44,8 @@ on:
jobs:
Run-Pipeline:
runs-on: ubuntu-latest
env:
CUDA_VISIBLE_DEVICES: -1
steps:
- name: Check out repository code
uses: actions/checkout@v4
@@ -72,11 +74,6 @@ jobs:
if: inputs.prerun_cmd
run: ${{inputs.prerun_cmd}}
shell: bash -el {0}
- name: Set to 0 GPUs in config
if: inputs.no_gpu
# There are no GPUs on the gh worker, so we can disable it in the config
run: "sed -i 's/gpus: 1/gpus: 0/' ./example/config.yaml"
shell: bash -el {0}
- name: "Running pipeline ${{ github.jobs[github.job].name }}"
run: |
python -m snakemake ${{ (inputs.dry_run && '-n') || '' }} \
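
The job now sets `CUDA_VISIBLE_DEVICES: -1` for the whole run instead of patching `gpus: 1` in a config file with `sed`. CUDA stops reading the device list at the first invalid index, so `-1` leaves no GPU visible. A simplified sketch of that parsing rule, assuming integer indices only (real CUDA also accepts device UUIDs). `visible_devices` is a hypothetical helper:

```python
import os
from typing import List, Mapping, Optional


def visible_devices(env: Mapping[str, str] = os.environ) -> Optional[List[int]]:
    """Return the device indices listed in CUDA_VISIBLE_DEVICES.

    None means the variable is unset, so no restriction applies.
    Simplification: only non-negative integer indices count as valid.
    """
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    devices: List[int] = []
    for token in raw.split(","):
        token = token.strip()
        if not token.isdigit():  # "-1", "", or any non-integer token is invalid
            break  # entries after the first invalid one are ignored
        devices.append(int(token))
    return devices


no_gpu = visible_devices({"CUDA_VISIBLE_DEVICES": "-1"})
two_gpus = visible_devices({"CUDA_VISIBLE_DEVICES": "0,1"})
unset = visible_devices({})
```

Setting the variable at the job level applies to every step, which is why the separate "Set to 0 GPUs in config" step could be removed.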
Empty file removed deeprvat/config.py
Empty file.
12 changes: 8 additions & 4 deletions deeprvat/cv_utils.py
@@ -25,7 +25,7 @@
)
logger = logging.getLogger(__name__)
DATA_SLOT_DICT = {
"deeprvat": ["data", "training_data"],
"deeprvat": ["association_testing_data", "training_data"],
"seed_genes": ["data"],
}

@@ -75,7 +75,9 @@ def spread_config(
]
logger.info(config["baseline_results"])
logger.info(f"Writing config for module {module}")
with open(f"{out_path}/{module_folder_dict[module]}/config.yaml", "w") as f:
with open(
f"{out_path}/{module_folder_dict[module]}/deeprvat_config.yaml", "w"
) as f:
yaml.dump(config, f)


@@ -172,8 +174,10 @@ def combine_test_set_burdens(
x[start_idx:end_idx] = this_x
start_idx = end_idx

y_transformation = config["data"]["dataset_config"].get("y_transformation", None)
standardize_xpheno = config["data"]["dataset_config"].get(
y_transformation = config["association_testing_data"]["dataset_config"].get(
"y_transformation", None
)
standardize_xpheno = config["association_testing_data"]["dataset_config"].get(
"standardize_xpheno", True
)
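
`combine_test_set_burdens` now reads both options from `association_testing_data` and falls back to a default when a key is absent, which is what lets `y_transformation` be optional in the config. A minimal sketch of that lookup on a plain dictionary. `dataset_options` and the example values are hypothetical:

```python
from typing import Any, Dict, Optional, Tuple


def dataset_options(
    config: Dict[str, Any], data_key: str = "association_testing_data"
) -> Tuple[Optional[str], bool]:
    """Read optional dataset settings, using the same defaults as the diff above."""
    dataset_config = config[data_key]["dataset_config"]
    y_transformation = dataset_config.get("y_transformation", None)
    standardize_xpheno = dataset_config.get("standardize_xpheno", True)
    return y_transformation, standardize_xpheno


minimal = {"association_testing_data": {"dataset_config": {}}}
explicit = {
    "association_testing_data": {
        "dataset_config": {
            "y_transformation": "quantile_transform",  # example value only
            "standardize_xpheno": False,
        }
    }
}
minimal_options = dataset_options(minimal)
explicit_options = dataset_options(explicit)
```

Note that the outer lookup still raises `KeyError` if a config uses the old `data` key, so only the two inner options are optional.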

28 changes: 15 additions & 13 deletions deeprvat/deeprvat/associate.py
@@ -116,7 +116,7 @@ def cli():
def make_dataset_(
config: Dict,
debug: bool = False,
data_key="data",
data_key="association_testing_data",
samples: Optional[List[int]] = None,
) -> Dataset:
"""
@@ -126,7 +126,7 @@ def make_dataset_(
:type config: Dict
:param debug: Flag for debugging, defaults to False.
:type debug: bool
:param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
:param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
:type data_key: str
:param samples: List of sample indices to include in the dataset, defaults to None.
:type samples: List[int]
@@ -163,7 +163,7 @@ def make_dataset_(

@cli.command()
@click.option("--debug", is_flag=True)
@click.option("--data-key", type=str, default="data")
@click.option("--data-key", type=str, default="association_testing_data")
@click.argument("config-file", type=click.Path(exists=True))
@click.argument("out-file", type=click.Path())
def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
@@ -172,7 +172,7 @@ def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
:param debug: Flag for debugging.
:type debug: bool
:param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
:param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
:type data_key: str
:param config_file: Path to the configuration file.
:type config_file: str
@@ -245,7 +245,7 @@ def compute_burdens_(
}
)

data_config = config["data"]
data_config = config["association_testing_data"]

ds_full = ds.dataset if isinstance(ds, Subset) else ds
collate_fn = getattr(ds_full, "collate_fn", None)
@@ -700,7 +700,9 @@ def reverse_models(
with open(data_config_file) as f:
data_config = yaml.safe_load(f)

annotation_file = data_config["data"]["dataset_config"]["annotation_file"]
annotation_file = data_config["association_testing_data"]["dataset_config"][
"annotation_file"
]

if torch.cuda.is_available():
logger.info("Using GPU")
@@ -712,7 +714,7 @@ def reverse_models(
# plof_df = (
# dd.read_parquet(
# annotation_file,
# columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
# columns=data_config["association_testing_data"]["dataset_config"]["rare_embedding"]["config"][
# "annotations"
# ],
# )
@@ -722,9 +724,9 @@ def reverse_models(

plof_df = pd.read_parquet(
annotation_file,
columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
"annotations"
],
columns=data_config["association_testing_data"]["dataset_config"][
"rare_embedding"
]["config"]["annotations"],
)
plof_df = plof_df[plof_df[PLOF_COLS].eq(1).any(axis=1)]

@@ -956,7 +958,7 @@ def regress_on_gene_scoretest(
:rtype: Tuple[List[str], List[float], List[float]]
"""
burdens = burdens.reshape(burdens.shape[0], -1)
assert np.all(burdens != 0) # TODO check this!
assert np.all(burdens != 0) # because DeepRVAT burdens are currently all non-zero
logger.info(f"Burdens shape: {burdens.shape}")

if np.all(np.abs(burdens) < 1e-6):
@@ -1120,7 +1122,7 @@ def regress_(

genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
result = pd.DataFrame(
{
@@ -1579,7 +1581,7 @@ def regress_common_(
genes_betas_pvals.append(gene_stats)
genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
result = pd.DataFrame(
{
