Simplify configuration file for running deepRVAT #99

Merged
merged 76 commits into from
Jun 25, 2024
0d84871
reorganize config file locations
meyerkm May 22, 2024
4560460
change runner config to deeprvat_config.yaml
meyerkm May 22, 2024
c9d027e
pipeline and script to create deeprvat_config.yaml
meyerkm May 22, 2024
e00ad74
update paths to new config dir for smoke tests
meyerkm May 23, 2024
6195812
update config name in deeprvat smoke test
meyerkm May 23, 2024
b399c76
update github actions config file path
meyerkm May 23, 2024
c306c90
fixup! Format Python code with psf/black pull_request
May 23, 2024
5165d76
update github actions config file path
meyerkm May 23, 2024
3c1aa74
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm May 23, 2024
5c774e6
Merge remote-tracking branch 'origin/main' into simplify-config-files
meyerkm May 28, 2024
eb87908
Nesting pl_trainer and early_stopping into training config key
meyerkm May 28, 2024
43fc84f
Removing training phenotypes from pretrained model config setup.
meyerkm May 28, 2024
f56b90a
Add in evaluation config section with correction_method and alpha par…
meyerkm May 28, 2024
64e79bb
fixup! Format Python code with psf/black pull_request
May 28, 2024
0dd831d
fix evaluation alpha key
meyerkm May 29, 2024
cae5521
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm May 29, 2024
47d6992
moving seed-gene-results correction-method to user-facing config
meyerkm May 29, 2024
d114b9a
fixup! Format Python code with psf/black pull_request
May 29, 2024
4b18e56
Adding dir paths in config yaml to deeprvat repo and pretrained models
meyerkm Jun 6, 2024
d2d87e4
bug fix - catching key errors
meyerkm Jun 6, 2024
0a117d1
integrate deeprvat_config.yaml file generation into existing snakefil…
meyerkm Jun 6, 2024
514fbf5
Making input config more descriptive
meyerkm Jun 6, 2024
d2a0e86
breakout cv options in config
meyerkm Jun 6, 2024
59de80c
incorporating regenie option into config
meyerkm Jun 7, 2024
345e4be
Merge remote-tracking branch 'origin/main' into simplify-config-files
meyerkm Jun 7, 2024
0ae1a01
typo fix
meyerkm Jun 7, 2024
d780bdf
fixup! Format Python code with psf/black pull_request
Jun 7, 2024
866501f
seed gene discovery pipeline config refactor
meyerkm Jun 7, 2024
0e54d2e
sync-up branch
meyerkm Jun 7, 2024
92e0819
fixup! Format Python code with psf/black pull_request
Jun 7, 2024
3654631
make association testing data name more clear
meyerkm Jun 10, 2024
2fd84ea
fixup! Format Python code with psf/black pull_request
Jun 10, 2024
ad91b85
remove no-longer needed config file.
meyerkm Jun 11, 2024
d5843ef
move items out of input config and into base config
meyerkm Jun 11, 2024
fbc91a1
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm Jun 11, 2024
dc00e8e
add in deeprvat training/association testing sample file option to co…
meyerkm Jun 11, 2024
f1e32bb
Merge remote-tracking branch 'origin/main' into simplify-config-files
meyerkm Jun 11, 2024
364d672
seed-gene-discovery subset sample file option added
meyerkm Jun 11, 2024
1296e25
fixup! Format Python code with psf/black pull_request
Jun 11, 2024
d0f3eed
restructuring of docs
bfclarke Jun 11, 2024
4ad3ed3
update docs
HolEv Jun 12, 2024
12db255
Add in default if phenotypes_for_training not specified
meyerkm Jun 13, 2024
a033c17
add in config_generate.log file to view stdout
meyerkm Jun 13, 2024
325de0c
setting y_transformation as optional config parameter
meyerkm Jun 13, 2024
81136b6
Making association testing and training data thresholds as optional c…
meyerkm Jun 13, 2024
8d3506d
fix-up docs
meyerkm Jun 13, 2024
4e410ac
Merge remote-tracking branch 'origin/main' into simplify-config-files
meyerkm Jun 13, 2024
9ddc837
fixup! Format Python code with psf/black pull_request
Jun 13, 2024
1d69661
add in pretrained-model-path config defaults. Add in MAF threshold re…
meyerkm Jun 14, 2024
fe65e0a
fixup! Format Python code with psf/black pull_request
Jun 14, 2024
fae7c7e
bug fix cv config key options
meyerkm Jun 14, 2024
df75b64
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm Jun 14, 2024
3080012
bug-fix cv config name
meyerkm Jun 14, 2024
6536fdd
fixup! Format Python code with psf/black pull_request
Jun 14, 2024
7d8978e
update data_key default to association_testing_data in associate.py
meyerkm Jun 14, 2024
7754a2a
fixup! Format Python code with psf/black pull_request
Jun 14, 2024
d8de606
reduce excessive looping
meyerkm Jun 17, 2024
6852610
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm Jun 17, 2024
1e39b2b
fixup! Format Python code with psf/black pull_request
Jun 17, 2024
d6752f0
add in missing final rule evaluate as rule all in pretrained models run
meyerkm Jun 18, 2024
59d3e4c
fix-up gh actions and pytests
meyerkm Jun 18, 2024
e49b3ec
Merge branch 'simplify-config-files' of https://github.com/PMBio/deep…
meyerkm Jun 18, 2024
647696d
add extra check to allow user to override configfile with snakemake -…
meyerkm Jun 18, 2024
8827279
bug-fix gh actions
meyerkm Jun 18, 2024
3900b2b
set default to disable gpu usage
meyerkm Jun 18, 2024
d0fe225
point to example data for gh actions
meyerkm Jun 18, 2024
557d490
Update docs to pass tests
endast Jun 18, 2024
a332331
fix example data path for gh actions
meyerkm Jun 18, 2024
31a1fee
Fix config path
endast Jun 18, 2024
1a8de28
rename config.yaml files for better organization
meyerkm Jun 19, 2024
418904a
fix pretrained model path for gh actions
meyerkm Jun 19, 2024
21846fe
unset gpu usage for gh actions
meyerkm Jun 19, 2024
cb1c96f
bug-fix gh actions
meyerkm Jun 19, 2024
f87f818
reduce training phenotypes and n-repeats for gh actions
meyerkm Jun 19, 2024
fa4f5e7
remove unnecessary todos
HolEv Jun 24, 2024
20c8b01
fixup! Format Python code with psf/black pull_request
Jun 24, 2024
31 changes: 19 additions & 12 deletions .github/workflows/pipeline-tests.yml
@@ -9,13 +9,15 @@ jobs:
with:
pipeline_file: ./pipelines/run_training.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-RunTraining:
needs: Smoke-RunTraining
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/run_training.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Pretrained Pipeline
@@ -24,15 +26,15 @@ jobs:
with:
pipeline_file: ./pipelines/association_testing_pretrained.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/

Pipeline-Tests-Training-Association-Testing:
needs: Smoke-Association-Testing-Pretrained
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/association_testing_pretrained.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/pretrained/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Pretrained Regenie
@@ -41,15 +43,15 @@ jobs:
with:
pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/

Pipeline-Tests-Association-Testing-Pretrained-Regenie:
needs: Smoke-Association-Testing-Pretrained-Regenie
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/association_testing_pretrained_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && ln -s ../pretrained_models
prerun_cmd: cp ./tests/deeprvat/regenie/pretrained/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Training
@@ -58,13 +60,15 @@ jobs:
with:
pipeline_file: ./pipelines/training_association_testing.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-Association-Testing-Training:
needs: Smoke-Association-Testing-Training
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/training_association_testing.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Association Testing Training Regenie
@@ -73,13 +77,15 @@ jobs:
with:
pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/

Pipeline-Tests-Training-Association-Testing-Regenie:
needs: Smoke-Association-Testing-Training-Regenie
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/training_association_testing_regenie.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cp ./tests/deeprvat/regenie/training_association_testing/deeprvat_config.yaml ./example/
dry_run: false

# Seed Gene Discovery
@@ -88,16 +94,17 @@ jobs:
with:
pipeline_file: ./pipelines/seed_gene_discovery.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/

Pipeline-Tests-Seed-Gene-Discovery:
needs: Smoke-Seed-Gene-Discovery
uses: ./.github/workflows/run-pipeline.yml
with:
pipeline_file: ./pipelines/seed_gene_discovery.snakefile
environment_file: ./deeprvat_env_no_gpu.yml
prerun_cmd: cd ./example && cp ../deeprvat/seed_gene_discovery/config.yaml .
prerun_cmd: cp ./tests/seed_gene_discovery/sg_discovery_config.yaml ./example/
dry_run: false


# Preprocessing With QC
Smoke-Preprocessing-With-QC:
@@ -106,7 +113,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_with_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference

@@ -117,7 +124,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_with_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference
@@ -129,7 +136,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_no_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference

@@ -140,7 +147,7 @@ jobs:
pipeline_file: ./pipelines/preprocess_no_qc.snakefile
environment_file: ./deeprvat_preprocessing_env.yml
pipeline_directory: ./example/preprocess
pipeline_config: ./pipelines/config/deeprvat_preprocess_config.yaml
pipeline_config: ./example/config/deeprvat_preprocess_config.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/preprocess/workdir/reference
@@ -151,7 +158,7 @@ jobs:
with:
pipeline_file: ./pipelines/annotations.snakefile
environment_file: ./deeprvat_annotations.yml
pipeline_config: ./pipelines/config/deeprvat_annotation_config.yaml
pipeline_config: ./example/config/deeprvat_annotation_config.yaml
pipeline_directory: ./example/annotations
download_fasta_data: true
fasta_download_path: ./example/annotations/reference
@@ -168,7 +175,7 @@ jobs:
wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz \
-O ./example/annotations/reference/gencode.v44.annotation.gtf.gz
pipeline_directory: ./example/annotations
pipeline_config: ./pipelines/config/deeprvat_annotation_config_minimal.yaml
pipeline_config: ./example/config/deeprvat_annotation_config_minimal.yaml
dry_run: false
download_fasta_data: true
fasta_download_path: ./example/annotations/reference
7 changes: 2 additions & 5 deletions .github/workflows/run-pipeline.yml
@@ -44,6 +44,8 @@ on:
jobs:
Run-Pipeline:
runs-on: ubuntu-latest
env:
CUDA_VISIBLE_DEVICES: -1
steps:
- name: Check out repository code
uses: actions/checkout@v4
@@ -72,11 +74,6 @@ jobs:
if: inputs.prerun_cmd
run: ${{inputs.prerun_cmd}}
shell: bash -el {0}
- name: Set to 0 GPUs in config
if: inputs.no_gpu
# There are no GPUs on the gh worker, so we can disable it in the config
run: "sed -i 's/gpus: 1/gpus: 0/' ./example/config.yaml"
shell: bash -el {0}
- name: "Running pipeline ${{ github.jobs[github.job].name }}"
run: |
python -m snakemake ${{ (inputs.dry_run && '-n') || '' }} \
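
Note on the `CUDA_VISIBLE_DEVICES: -1` change in this file: the job-level environment variable takes over from the removed `sed` step that rewrote `gpus: 1` to `gpus: 0` in `./example/config.yaml`. The sketch below is a simplified model of how CUDA reads integer device indices from that variable: entries are taken in order, and the first invalid index hides itself and everything after it. `visible_gpu_ids` is a hypothetical helper written for illustration, not part of DeepRVAT, and it deliberately ignores UUID-style identifiers.

```python
import os
from typing import List, Mapping, Optional


def visible_gpu_ids(env: Optional[Mapping[str, str]] = None) -> Optional[List[int]]:
    """Model which integer GPU indices CUDA_VISIBLE_DEVICES exposes.

    Returns None when the variable is unset (CUDA would enumerate all devices).
    Hypothetical helper for illustration only; UUID-style ids are not handled.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None

    ids: List[int] = []
    for token in raw.split(","):
        token = token.strip()
        # A negative, empty, or non-numeric entry counts as invalid here:
        # it and every entry after it are hidden.
        if not token.isdigit():
            break
        ids.append(int(token))
    return ids


# The value set in the workflow hides every device.
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "-1"}))
```

With the workflow's value of `-1` the helper returns an empty list, which matches the intent of the change: no GPU is visible on the runner, so no per-config edit is needed.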
Empty file removed: deeprvat/config.py
12 changes: 8 additions & 4 deletions deeprvat/cv_utils.py
@@ -25,7 +25,7 @@
)
logger = logging.getLogger(__name__)
DATA_SLOT_DICT = {
"deeprvat": ["data", "training_data"],
"deeprvat": ["association_testing_data", "training_data"],
"seed_genes": ["data"],
}

@@ -75,7 +75,9 @@ def spread_config(
]
logger.info(config["baseline_results"])
logger.info(f"Writing config for module {module}")
with open(f"{out_path}/{module_folder_dict[module]}/config.yaml", "w") as f:
with open(
f"{out_path}/{module_folder_dict[module]}/deeprvat_config.yaml", "w"
) as f:
yaml.dump(config, f)


@@ -172,8 +174,10 @@ def combine_test_set_burdens(
x[start_idx:end_idx] = this_x
start_idx = end_idx

y_transformation = config["data"]["dataset_config"].get("y_transformation", None)
standardize_xpheno = config["data"]["dataset_config"].get(
y_transformation = config["association_testing_data"]["dataset_config"].get(
"y_transformation", None
)
standardize_xpheno = config["association_testing_data"]["dataset_config"].get(
"standardize_xpheno", True
)

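
The lookups changed in this hunk can be sketched on their own as follows. This is a minimal illustration, assuming a config dictionary shaped like the one the diff indexes into; `dataset_options` and `example_config` are hypothetical names, and the `y_transformation` value is only a placeholder. The defaults (`None` and `True`) are the ones visible in the diff.

```python
from typing import Any, Dict


def dataset_options(
    config: Dict[str, Any], data_key: str = "association_testing_data"
) -> Dict[str, Any]:
    """Read the optional dataset settings with the defaults shown in the diff.

    Hypothetical helper; it only mirrors the two `.get(...)` calls.
    """
    dataset_config = config[data_key]["dataset_config"]
    return {
        # Optional key: absent means no transformation is configured.
        "y_transformation": dataset_config.get("y_transformation", None),
        # Optional key: defaults to True when not set.
        "standardize_xpheno": dataset_config.get("standardize_xpheno", True),
    }


# Placeholder config using the renamed top-level key.
example_config = {
    "association_testing_data": {
        "dataset_config": {"y_transformation": "quantile_transform"}
    }
}
options = dataset_options(example_config)
print(options)
```

A config that still uses the old top-level `data` key would raise a `KeyError` here, which is the behaviour to expect from the renamed lookups as well.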
28 changes: 15 additions & 13 deletions deeprvat/deeprvat/associate.py
@@ -116,7 +116,7 @@ def cli():
def make_dataset_(
config: Dict,
debug: bool = False,
data_key="data",
data_key="association_testing_data",
samples: Optional[List[int]] = None,
) -> Dataset:
"""
@@ -126,7 +126,7 @@ def make_dataset_(
:type config: Dict
:param debug: Flag for debugging, defaults to False.
:type debug: bool
:param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
:param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
:type data_key: str
:param samples: List of sample indices to include in the dataset, defaults to None.
:type samples: List[int]
@@ -163,7 +163,7 @@ def make_dataset_(

@cli.command()
@click.option("--debug", is_flag=True)
@click.option("--data-key", type=str, default="data")
@click.option("--data-key", type=str, default="association_testing_data")
@click.argument("config-file", type=click.Path(exists=True))
@click.argument("out-file", type=click.Path())
def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
@@ -172,7 +172,7 @@ def make_dataset(debug: bool, data_key: str, config_file: str, out_file: str):
:param debug: Flag for debugging.
:type debug: bool
:param data_key: Key for dataset configuration in the config dictionary, defaults to "data".
:param data_key: Key for dataset configuration in the config dictionary, defaults to "association_testing_data".
:type data_key: str
:param config_file: Path to the configuration file.
:type config_file: str
@@ -245,7 +245,7 @@ def compute_burdens_(
}
)

data_config = config["data"]
data_config = config["association_testing_data"]

ds_full = ds.dataset if isinstance(ds, Subset) else ds
collate_fn = getattr(ds_full, "collate_fn", None)
@@ -700,7 +700,9 @@ def reverse_models(
with open(data_config_file) as f:
data_config = yaml.safe_load(f)

annotation_file = data_config["data"]["dataset_config"]["annotation_file"]
annotation_file = data_config["association_testing_data"]["dataset_config"][
"annotation_file"
]

if torch.cuda.is_available():
logger.info("Using GPU")
@@ -712,7 +714,7 @@ def reverse_models(
# plof_df = (
# dd.read_parquet(
# annotation_file,
# columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
# columns=data_config["association_testing_data"]["dataset_config"]["rare_embedding"]["config"][
# "annotations"
# ],
# )
@@ -722,9 +724,9 @@ def reverse_models(

plof_df = pd.read_parquet(
annotation_file,
columns=data_config["data"]["dataset_config"]["rare_embedding"]["config"][
"annotations"
],
columns=data_config["association_testing_data"]["dataset_config"][
"rare_embedding"
]["config"]["annotations"],
)
plof_df = plof_df[plof_df[PLOF_COLS].eq(1).any(axis=1)]

@@ -956,7 +958,7 @@ def regress_on_gene_scoretest(
:rtype: Tuple[List[str], List[float], List[float]]
"""
burdens = burdens.reshape(burdens.shape[0], -1)
assert np.all(burdens != 0) # TODO check this!
assert np.all(burdens != 0) # because DeepRVAT burdens are currently all non-zero
logger.info(f"Burdens shape: {burdens.shape}")

if np.all(np.abs(burdens) < 1e-6):
@@ -1120,7 +1122,7 @@ def regress_(

genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
result = pd.DataFrame(
{
@@ -1579,7 +1581,7 @@ def regress_common_(
genes_betas_pvals.append(gene_stats)
genes_betas_pvals = [x for x in genes_betas_pvals if x is not None]
regressed_genes, betas, pvals = separate_parallel_results(genes_betas_pvals)
y_phenotypes = config["data"]["dataset_config"]["y_phenotypes"]
y_phenotypes = config["association_testing_data"]["dataset_config"]["y_phenotypes"]
regressed_phenotypes = [y_phenotypes] * len(regressed_genes)
result = pd.DataFrame(
{
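
Because every lookup in this file now reads `association_testing_data` instead of `data`, a DeepRVAT config written before this change would need its top-level key renamed. The sketch below shows one way to do that on an already-loaded dictionary. `migrate_legacy_config` is a hypothetical helper, not something this PR adds. It is meant only for DeepRVAT module configs, since `DATA_SLOT_DICT` in `cv_utils.py` keeps `data` for `seed_genes`.

```python
import copy
from typing import Any, Dict

OLD_KEY = "data"
NEW_KEY = "association_testing_data"


def migrate_legacy_config(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `config` with the old top-level key renamed.

    Hypothetical helper for illustration; the input is left untouched.
    """
    migrated = copy.deepcopy(config)
    if OLD_KEY in migrated:
        if NEW_KEY in migrated:
            # Refuse to guess which section should win.
            raise ValueError(
                f"config contains both '{OLD_KEY}' and '{NEW_KEY}'"
            )
        migrated[NEW_KEY] = migrated.pop(OLD_KEY)
    return migrated


# Placeholder legacy config; the phenotype name is illustrative only.
legacy = {
    "data": {"dataset_config": {"y_phenotypes": ["example_phenotype"]}},
    "training_data": {"dataset_config": {}},
}
migrated = migrate_legacy_config(legacy)
print(sorted(migrated))
```

The helper copies rather than mutates so that the original dictionary can still be written back out unchanged if the migration is rejected.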