Simplify configuration file for running deepRVAT (#99)

* reorganize config file locations
* change runner config to deeprvat_config.yaml
* pipeline and script to create deeprvat_config.yaml
* update paths to new config dir for smoke tests
* update config name in deeprvat smoke test
* update github actions config file path
* fixup! Format Python code with psf/black pull_request
* update github actions config file path
* Nesting pl_trainer and early_stopping into training config key
* Removing training phenotypes from pretrained model config setup. Simplifying association testing phenotypes.
* Add in evaluation config section with correction_method and alpha parameters
* fixup! Format Python code with psf/black pull_request
* fix evaluation alpha key
* moving seed-gene-results correction-method to user-facing config
* fixup! Format Python code with psf/black pull_request
* Adding dir paths in config yaml to deeprvat repo and pretrained models
* bug fix - catching key errors
* integrate deeprvat_config.yaml file generation into existing snakefile pipelines - fixed configfile path in annotation pipeline
* Making input config more descriptive
* breakout cv options in config
* incorporating regenie option into config
* typo fix
* fixup! Format Python code with psf/black pull_request
* seed gene discovery pipeline config refactor
* fixup! Format Python code with psf/black pull_request
* make association testing data name more clear
* fixup! Format Python code with psf/black pull_request
* remove no-longer needed config file. See deeprvat/example/config/ for new input config files
* move items out of input config and into base config
* add in deeprvat training/association testing sample file option to config
* seed-gene-discovery subset sample file option added
* fixup! Format Python code with psf/black pull_request
* restructuring of docs
* update docs
* Add in default if phenotypes_for_training not specified
* add in config_generate.log file to view stdout
* setting y_transformation as optional config parameter
* Making association testing and training data thresholds as optional configurations
* fix-up docs
* fixup! Format Python code with psf/black pull_request
* add in pretrained-model-path config defaults. Add in MAF threshold requirement.
* fixup! Format Python code with psf/black pull_request
* bug fix cv config key options
* bug-fix cv config name
* fixup! Format Python code with psf/black pull_request
* update data_key default to association_testing_data in associate.py
* fixup! Format Python code with psf/black pull_request
* reduce excessive looping
* fixup! Format Python code with psf/black pull_request
* add in missing final rule evaluate as rule all in pretrained models run
* fix-up gh actions and pytests
* add extra check to allow user to override configfile with snakemake --configfile argument
* bug-fix gh actions
* set default to disable gpu usage
* point to example data for gh actions
* Update docs to pass tests
* fix example data path for gh actions
* Fix config path
* rename config.yaml files for better organization
* fix pretrained model path for gh actions
* unset gpu usage for gh actions
* bug-fix gh actions
* reduce training phenotypes and n-repeats for gh actions
* remove unnecessary todos
* fixup! Format Python code with psf/black pull_request

---------

Co-authored-by: PMBio <PMBio@users.noreply.github.com>
Co-authored-by: Brian Clarke <brian.clarke@dkfz.de>
Co-authored-by: Eva Holtkamp <eva.holtkamp@gmx.de>
Co-authored-by: Magnus Wahlberg <endast@gmail.com>
PMBio · Jun 25, 2024 · 46bf983
1 parent 22715bc
Showing 63 changed files with 3,306 additions and 1,073 deletions.
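
The commit message mentions nesting `pl_trainer` and `early_stopping` under a `training` key, adding an `evaluation` section with `correction_method` and `alpha`, and catching key errors. The following is a minimal Python sketch of what reading such a nested layout might look like. Only the key names come from the commit message; all values, the extra `optimizer` key, and the `get_nested` helper are hypothetical and are not taken from the repository.

```python
# Hypothetical config fragment: key names follow the commit message,
# every value below is a placeholder chosen for illustration only.
example_config = {
    "training": {
        "pl_trainer": {"max_epochs": 10},
        "early_stopping": {"patience": 3},
    },
    "evaluation": {
        "correction_method": "PLACEHOLDER",
        "alpha": 0.05,
    },
}


def get_nested(config, *keys, default=None):
    """Walk nested dict keys; return `default` if any level is missing."""
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


alpha = get_nested(example_config, "evaluation", "alpha")
patience = get_nested(example_config, "training", "early_stopping", "patience")
# A key that is absent falls back to the default instead of raising KeyError.
missing = get_nested(example_config, "training", "optimizer", default="not set")
```

A helper like this trades strictness for convenience: a misspelled key silently yields the default, so required keys are usually better accessed directly so that a `KeyError` surfaces early.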
diff --git a/.github/workflows/pipeline-tests.yml b/.github/workflows/pipeline-tests.yml
@@ -9,13 +9,15 @@ jobs:
     with:
       pipeline_file: ./pipelines/run_training.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
 
   Pipeline-Tests-RunTraining:
     needs: Smoke-RunTraining
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/run_training.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
       dry_run: false
 
   # Association Testing Pretrained Pipeline
@@ -24,15 +26,15 @@ jobs:
     with:
       pipeline_file: ./pipelines/association_testing_pretrained.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && ln -s ../pretrained_models
+      prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/
 
   Pipeline-Tests-Training-Association-Testing:
     needs: Smoke-Association-Testing-Pretrained
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/association_testing_pretrained.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && ln -s ../pretrained_models
+      prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/
       dry_run: false
 
   # Association Testing Pretrained Regenie
@@ -41,15 +43,15 @@ jobs:
     with:
       pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && ln -s ../pretrained_models
+      prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/
 
   Pipeline-Tests-Association-Testing-Pretrained-Regenie:
     needs: Smoke-Association-Testing-Pretrained-Regenie
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && ln -s ../pretrained_models
+      prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/
       dry_run: false
 
   # Association Testing Training
@@ -58,13 +60,15 @@ jobs:
     with:
       pipeline_file: ./pipelines/training_association_testing.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
 
   Pipeline-Tests-Association-Testing-Training:
     needs: Smoke-Association-Testing-Training
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/training_association_testing.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
       dry_run: false
 
   # Association Testing Training Regenie
@@ -73,13 +77,15 @@ jobs:
     with:
       pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/
 
   Pipeline-Tests-Training-Association-Testing-Regenie:
     needs: Smoke-Association-Testing-Training-Regenie
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
+      prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/
       dry_run: false
 
   # Seed Gene Discovery
@@ -88,16 +94,17 @@ jobs:
     with:
       pipeline_file: ./pipelines/seed_gene_discovery.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
+      prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/
 
   Pipeline-Tests-Seed-Gene-Discovery:
     needs: Smoke-Seed-Gene-Discovery
     uses: ./.github/workflows/run-pipeline.yml
     with:
       pipeline_file: ./pipelines/seed_gene_discovery.snakefile
       environment_file: ./deeprvat_env_no_gpu.yml
-      prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
+      prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/
       dry_run: false
+
 
   # Preprocessing With QC
   Smoke-Preprocessing-With-QC:
@@ -106,7 +113,7 @@ jobs:
       pipeline_file: ./pipelines/preprocess_with_qc.snakefile
       environment_file: ./deeprvat_preprocessing_env.yml
       pipeline_directory: ./example/preprocess
-      pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
+      pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
       download_fasta_data: true
       fasta_download_path: ./example/preprocess/workdir/reference
 
@@ -117,7 +124,7 @@ jobs:
       pipeline_file: ./pipelines/preprocess_with_qc.snakefile
       environment_file: ./deeprvat_preprocessing_env.yml
       pipeline_directory: ./example/preprocess
-      pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
+      pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
       dry_run: false
       download_fasta_data: true
       fasta_download_path: ./example/preprocess/workdir/reference
@@ -129,7 +136,7 @@ jobs:
       pipeline_file: ./pipelines/preprocess_no_qc.snakefile
       environment_file: ./deeprvat_preprocessing_env.yml
       pipeline_directory: ./example/preprocess
-      pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
+      pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
       download_fasta_data: true
       fasta_download_path: ./example/preprocess/workdir/reference
 
@@ -140,7 +147,7 @@ jobs:
       pipeline_file: ./pipelines/preprocess_no_qc.snakefile
       environment_file: ./deeprvat_preprocessing_env.yml
       pipeline_directory: ./example/preprocess
-      pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
+      pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
       dry_run: false
       download_fasta_data: true
       fasta_download_path: ./example/preprocess/workdir/reference
@@ -151,7 +158,7 @@ jobs:
     with:
       pipeline_file: ./pipelines/annotations.snakefile
       environment_file: ./deeprvat_annotations.yml
-      pipeline_config: ./pipelines/config/deeprvat_annotation_config.yaml
+      pipeline_config: ./example/config/deeprvat_annotation_config.yaml
       pipeline_directory: ./example/annotations
       download_fasta_data: true
       fasta_download_path: ./example/annotations/reference
@@ -168,7 +175,7 @@ jobs:
         wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
         -O ./example/annotations/reference/gencode.v44.annotation.gtf.gz
       pipeline_directory: ./example/annotations
-      pipeline_config: ./pipelines/config/deeprvat_annotation_config_minimal.yaml
+      pipeline_config: ./example/config/deeprvat_annotation_config_minimal.yaml
       dry_run: false
       download_fasta_data: true
       fasta_download_path: ./example/annotations/reference
diff --git a/.github/workflows/run-pipeline.yml b/.github/workflows/run-pipeline.yml
@@ -44,6 +44,8 @@ on:
 jobs:
   Run-Pipeline:
       runs-on: ubuntu-latest
+      env:
+        CUDA_VISIBLE_DEVICES: -1
       steps:
         - name: Check out repository code
           uses: actions/checkout@v4
@@ -72,11 +74,6 @@ jobs:
           if: inputs.prerun_cmd
           run: ${{inputs.prerun_cmd}}
           shell: bash -el {0}
-        - name: Set to 0 GPUs in config
-          if: inputs.no_gpu
-          # There are no GPUs on the gh worker, so we can disable it in the config
-          run: "sed -i 's/gpus: 1/gpus: 0/' ./example/config.yaml"
-          shell: bash -el {0}
         - name: "Running pipeline ${{ github.jobs[github.job].name }}"
           run: |
             python -m snakemake ${{ (inputs.dry_run && '-n') || '' }} \
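
The hunks above replace the `sed` rewrite of `gpus: 1` with a job-level `CUDA_VISIBLE_DEVICES: -1`. As a rough illustration of why `-1` hides every GPU, here is a simplified Python model of how CUDA interprets that variable: it stops at the first invalid index, so a leading `-1` leaves no devices visible. The function name `visible_devices` is hypothetical, and the sketch handles integer indices only (device UUIDs are ignored).

```python
def visible_devices(value, device_count):
    """Approximate which device indices CUDA exposes for a given
    CUDA_VISIBLE_DEVICES value (integer indices only, no UUIDs)."""
    if value is None:
        # Variable unset: all devices are visible.
        return list(range(device_count))
    visible = []
    for token in value.split(","):
        try:
            index = int(token.strip())
        except ValueError:
            break  # non-integer token: stop parsing
        if index < 0 or index >= device_count:
            break  # invalid index: this and all later entries are ignored
        visible.append(index)
    return visible


hidden = visible_devices("-1", device_count=2)
both = visible_devices("0,1", device_count=2)
```

Setting the variable at the job level means the config file no longer has to be patched for GPU-less runners.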

diff --git a/deeprvat/config.py b/deeprvat/config.py
diff --git a/deeprvat/cv_utils.py b/deeprvat/cv_utils.py
@@ -25,7 +25,7 @@
 )
 logger = logging.getLogger(__name__)
 DATA_SLOT_DICT = {
-    "deeprvat": ["data", "training_data"],
+    "deeprvat": ["association_testing_data", "training_data"],
     "seed_genes": ["data"],
 }
 
@@ -75,7 +75,9 @@ def spread_config(
                 ]
             logger.info(config["baseline_results"])
         logger.info(f"Writing config for module {module}")
-        with open(f"{out_path}/{module_folder_dict[module]}/config.yaml", "w") as f:
+        with open(
+            f"{out_path}/{module_folder_dict[module]}/deeprvat_config.yaml", "w"
+        ) as f:
             yaml.dump(config, f)
 
 
@@ -172,8 +174,10 @@ def combine_test_set_burdens(
         x[start_idx:end_idx] = this_x
         start_idx = end_idx
 
-    y_transformation = config["data"]["dataset_config"].get("y_transformation", None)
-    standardize_xpheno = config["data"]["dataset_config"].get(
+    y_transformation = config["association_testing_data"]["dataset_config"].get(
+        "y_transformation", None
+    )
+    standardize_xpheno = config["association_testing_data"]["dataset_config"].get(
         "standardize_xpheno", True
     )
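
The hunk above reads two optional settings from `config["association_testing_data"]["dataset_config"]`, defaulting `y_transformation` to `None` and `standardize_xpheno` to `True`. A minimal standalone sketch of that lookup pattern follows; the function name `read_dataset_options` and the sample config values are hypothetical, while the key names and defaults are the ones in the diff.

```python
def read_dataset_options(config, data_key="association_testing_data"):
    """Return the optional dataset settings with the defaults used in the diff."""
    dataset_config = config[data_key]["dataset_config"]
    return {
        # Absent key means no transformation is configured.
        "y_transformation": dataset_config.get("y_transformation", None),
        # Absent key means standardization stays enabled.
        "standardize_xpheno": dataset_config.get("standardize_xpheno", True),
    }


# Hypothetical config that sets neither option.
sample_config = {"association_testing_data": {"dataset_config": {}}}
options = read_dataset_options(sample_config)
```

Using `.get` with an explicit default keeps both settings optional in the user-facing config while the top-level `association_testing_data` key remains required.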
 

diff --git a/deeprvat/deeprvat/associate.py b/deeprvat/deeprvat/associate.py
@@ -116,7 +116,7 @@ def cli():
 def make_dataset_(
     config: Dict,
     debug: bool = False,
-    data_key="data",
+    data_key="association_testing_data",
     samples: Optional[List[int]] = None,
 ) -> Dataset:
     """
@@ -126,7 +126,7 @@ def make_dataset_(
     :type config: Dict
     :param debug: Flag for debugging, defaults to False.
     :type debug: bool
-    :param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
+    :param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
     :type data_key: str
     :param samples: List of sample indices to include in the dataset, defaults to None.
     :type samples: List[int]
@@ -163,7 +163,7 @@ def make_dataset_(
 
 @cli.command()
 @click.option("--debug", is_flag=True)
-@click.option("--data-key", type=str, default="data")
+@click.option("--data-key", type=str, default="association_testing_data")
 @click.argument("config-file", type=click.Path(exists=True))
 @click.argument("out-file", type=click.Path())
 def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
@@ -172,7 +172,7 @@ def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
 
     :param debug: Flag for debugging.
     :type debug: bool
-    :param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
+    :param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
     :type data_key: str
     :param config_file: Path to the configuration file.
     :type config_file: str
@@ -245,7 +245,7 @@ def compute_burdens_(
             }
         )
 
-    data_config = config["data"]
+    data_config = config["association_testing_data"]
 
     ds_full = ds.dataset if isinstance(ds, Subset) else ds
     collate_fn = getattr(ds_full, "collate_fn", None)
@@ -700,7 +700,9 @@ def reverse_models(
     with open(data_config_file) as f:
         data_config = yaml.safe_load(f)
 
-    annotation_file = data_config["data"]["dataset_config"]["annotation_file"]
+    annotation_file = data_config["association_testing_data"]["dataset_config"][
+        "annotation_file"
+    ]
 
     if torch.cuda.is_available():
         logger.info("Using GPU")
@@ -712,7 +714,7 @@ def reverse_models(
     # plof_df = (
     #     dd.read_parquet(
     #         annotation_file,
-    #         columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
+    #         columns=data_config["association_testing_data"]["dataset_config"]["rare_embedding"]["config"][
     #             "annotations"
     #         ],
     #     )
@@ -722,9 +724,9 @@ def reverse_models(
 
     plof_df = pd.read_parquet(
         annotation_file,
-        columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
-            "annotations"
-        ],
+        columns=data_config["association_testing_data"]["dataset_config"][
+            "rare_embedding"
+        ]["config"]["annotations"],
     )
     plof_df = plof_df[plof_df[PLOF_COLS].eq(1).any(axis=1)]
 
@@ -956,7 +958,7 @@ def regress_on_gene_scoretest(
     :rtype: Tuple[List[str], List[float], List[float]]
     """
     burdens = burdens.reshape(burdens.shape[0], -1)
-    assert np.all(burdens != 0)  # TODO check this!
+    assert np.all(burdens != 0)  # because DeepRVAT burdens are currently all non-zero
     logger.info(f"Burdens shape: {burdens.shape}")
 
     if np.all(np.abs(burdens) < 1e-6):
@@ -1120,7 +1122,7 @@ def regress_(
 
     genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
     regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
-    y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
+    y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
     regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
     result = pd.DataFrame(
         {
@@ -1579,7 +1581,7 @@ def regress_common_(
         genes_betas_pvals.append(gene_stats)
     genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
     regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
-    y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
+    y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
     regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
     result = pd.DataFrame(
         {
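
Because `associate.py` now looks up `config["association_testing_data"]` instead of `config["data"]`, a config written for the previous layout would raise a `KeyError`. The sketch below shows one way such a config could be adapted. The helper `migrate_data_key` is hypothetical and not part of this commit. Note that `DATA_SLOT_DICT` in `cv_utils.py` still lists `"data"` for `seed_genes`, so this rename would apply only to DeepRVAT configs.

```python
import copy


def migrate_data_key(config, old_key="data", new_key="association_testing_data"):
    """Return a copy of `config` with the legacy top-level key renamed.

    The input is left untouched; nothing changes if the new key already exists.
    """
    migrated = copy.deepcopy(config)
    if old_key in migrated and new_key not in migrated:
        migrated[new_key] = migrated.pop(old_key)
    return migrated


# Hypothetical legacy config with placeholder content.
legacy = {"data": {"dataset_config": {"y_phenotypes": ["placeholder_phenotype"]}}}
updated = migrate_data_key(legacy)
```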