Feature cv training #55

Merged
merged 93 commits on Feb 23, 2024
Changes from 91 commits
6761a90
performance optimizations
bfclarke Nov 20, 2023
684eb23
train multiple repeats on single node in parallel
bfclarke Nov 22, 2023
4ad34d1
bug fix
bfclarke Nov 23, 2023
fffe0b4
fix bug in indexing when subset_samples() removed something
bfclarke Nov 24, 2023
8fd60b7
sleep between jobs; stop if any job fails
bfclarke Nov 24, 2023
d121eff
format with black
bfclarke Nov 27, 2023
93c8227
bug fixes
bfclarke Nov 28, 2023
a5d9a1f
add test for MultiphenoDataloader
bfclarke Nov 28, 2023
5bdeea5
update environments
bfclarke Nov 28, 2023
fe1680d
uncomment rules
bfclarke Nov 28, 2023
58db4c4
bug fixes
bfclarke Nov 28, 2023
72c8fc7
subset samples in training_dataset rule
bfclarke Nov 30, 2023
5dd6f5e
example config.yaml
bfclarke Nov 30, 2023
b985e77
use gpu queue for compute_burdens
bfclarke Nov 30, 2023
47e306d
bugfix since dask reading didn't work any more
HolEv Dec 14, 2023
3f27d09
allow evaluation of all repeat combinations
HolEv Dec 16, 2023
d0a0d7b
allow analysis of each n_repeats and for all repeat combinations
HolEv Dec 17, 2023
1e9587a
option to provide burden file
HolEv Dec 20, 2023
37efd46
allow seed gene alpha to be defined in config
HolEv Dec 20, 2023
e488f4a
change sorting order to get the best model
HolEv Dec 20, 2023
0d4833f
adaptations to analyze multiple repeats and use script without seed genes
HolEv Dec 20, 2023
1ba2619
allow to provide a sample file and do separate indexing for pheno an…
HolEv Dec 20, 2023
b04d308
automate generation of figure 3 (associations & replication)
HolEv Dec 20, 2023
07dbde4
generate cv splits with related samples in the same split
HolEv Dec 20, 2023
c4f90fe
average burdens
HolEv Jan 3, 2024
6bfc4dd
average burdens
HolEv Jan 3, 2024
98a0b8c
cross-validation-like training
HolEv Jan 11, 2024
e10ec47
add missing cv_utils
HolEv Jan 11, 2024
b43b7e5
write average burdens for each combination to single zarr file to avoi…
HolEv Jan 11, 2024
7acc840
add logging information
HolEv Jan 11, 2024
ed0ebfd
make maf column a param
HolEv Jan 16, 2024
18a9e3c
add logging
HolEv Jan 16, 2024
08603b7
pipeline replication and plotting
HolEv Jan 16, 2024
4fdb112
evaluate all repeat combinations with and without seed genes
HolEv Jan 16, 2024
e7eb810
update lsf.yaml
HolEv Jan 16, 2024
ebbd7d0
small updates
HolEv Jan 18, 2024
c94466e
per-gene pval aggregation
HolEv Jan 18, 2024
b489f5b
aggregate pval per gene
HolEv Jan 18, 2024
90be746
bugfix: only load burdens if not skip_burdens
HolEv Jan 29, 2024
291904f
logging info
HolEv Jan 29, 2024
425036c
updates and fixes
HolEv Jan 29, 2024
f7c64a1
load burdens only for genes analysed in current chunk to save memory
HolEv Jan 29, 2024
3772651
small changes to pipeline
HolEv Feb 12, 2024
74c905b
standardizing/qt-transform of combined test set x/y arrays
HolEv Feb 12, 2024
24c3475
my_quantile_transform for numpy arrays
HolEv Feb 12, 2024
c150ba7
bugfix
HolEv Feb 12, 2024
59b5ed3
remove unnecessary code
HolEv Feb 12, 2024
296c6fe
remove unnecessary wildcards
HolEv Feb 12, 2024
7fe5dc7
make averaging part of associate.py
HolEv Feb 12, 2024
618f7fc
allow seed genes/baselines to be missing (to allow assoc. testing fo…
HolEv Feb 12, 2024
17a104b
updates
HolEv Feb 12, 2024
bfecab6
gene-specific common variant covariates for conditional analysis
HolEv Feb 12, 2024
6d8a935
bugfix
HolEv Feb 15, 2024
bc13830
post-hoc conditioning on common variants
HolEv Feb 16, 2024
d105fc3
restructure pipelines
HolEv Feb 16, 2024
d303fe7
removing redundant options
HolEv Feb 19, 2024
15efe9d
add cv_utils cli
HolEv Feb 19, 2024
64974bf
simplify script (only evaluate one repeat combi/average burdens); agg…
HolEv Feb 19, 2024
dd0852c
removal of redundant wildcards, updates and fixes
HolEv Feb 19, 2024
24e8ce2
bugfixes
HolEv Feb 20, 2024
8ca566c
baseline discoveries only required for training phenotypes
HolEv Feb 20, 2024
85147f9
remove not needed code
HolEv Feb 20, 2024
07a6c83
update configs
HolEv Feb 20, 2024
c22dcc5
formatting
HolEv Feb 20, 2024
011717a
manually merge changes from feature-regenie to account for gene-speci…
HolEv Feb 21, 2024
43bdba8
allow different sample orders in phenotype_df and genotypes.h5
HolEv Feb 21, 2024
6d76d7c
change sample ids to be bytes as it is in the real data
HolEv Feb 21, 2024
6d06cf5
update pipelines
HolEv Feb 21, 2024
17dfb1f
update gitignore
HolEv Feb 21, 2024
51caa5f
pipeline updates
HolEv Feb 21, 2024
33f35a1
manually update github actions to be like master
HolEv Feb 21, 2024
06a9274
bug fixes
HolEv Feb 21, 2024
32ca35f
checkout tests from master
HolEv Feb 21, 2024
c47c30f
make phenotype indices string as they are in real data
HolEv Feb 21, 2024
d0aaa63
add 'gene_id' column
HolEv Feb 21, 2024
4c43dc6
manually merge with master so tests can pass
HolEv Feb 21, 2024
0b18840
bugfixes
HolEv Feb 21, 2024
3a9ea91
use gene_id column instead of gene_ids
HolEv Feb 21, 2024
7938ad5
pipeline updates and fixes
HolEv Feb 21, 2024
cb6a1fc
update test config
HolEv Feb 22, 2024
fd3db26
adding age2 and age_sex to example data
HolEv Feb 22, 2024
cb5ddd1
update config
HolEv Feb 22, 2024
3c62850
set tests folder to main version
HolEv Feb 22, 2024
5ac374e
checkout preprocessing files from main
HolEv Feb 22, 2024
d1c3d5f
checkout from main
HolEv Feb 22, 2024
4f1c61c
manually merge sample_id changes from main
HolEv Feb 22, 2024
6acdcf4
fixing merge conflicts with main
HolEv Feb 22, 2024
e42b478
pipeline bugfixes and renamings
HolEv Feb 22, 2024
071479b
fixup! Format Python code with psf/black pull_request
Feb 22, 2024
89bac25
remove gene_ids column
HolEv Feb 22, 2024
97f3a0b
Merge branch 'feature-cv-training' of https://github.com/PMBio/deeprv…
HolEv Feb 22, 2024
a0c1583
integrating suggested PR changes
HolEv Feb 23, 2024
f86de06
fixup! Format Python code with psf/black pull_request
Feb 23, 2024
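One commit above ("generate cv splits with related samples in the same split") describes group-aware fold generation: related individuals must never be split across train and test folds, or relatedness leaks information. The splitting code itself is not part of this diff; a minimal sketch of the idea, using hypothetical names (`grouped_cv_folds`, family IDs as the grouping key), might look like:

```python
from collections import defaultdict


def grouped_cv_folds(sample_ids, family_ids, n_folds=5):
    """Assign samples to CV folds so that all members of a family
    (i.e. related samples) land in the same fold."""
    families = defaultdict(list)
    for sample, family in zip(sample_ids, family_ids):
        families[family].append(sample)
    folds = [[] for _ in range(n_folds)]
    # Greedily place the largest families first to keep fold sizes balanced.
    for members in sorted(families.values(), key=len, reverse=True):
        smallest = min(range(n_folds), key=lambda i: len(folds[i]))
        folds[smallest].extend(members)
    return folds
```

In practice the same effect can be had with `sklearn.model_selection.GroupKFold`, passing family IDs as `groups`; the pipeline here consumes the resulting per-fold `samples_{split}{fold}.pkl` files.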
4 changes: 3 additions & 1 deletion .gitignore
@@ -164,4 +164,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
/docs/apidocs/

pipelines/deprecated_pipelines

215 changes: 215 additions & 0 deletions deeprvat/cv_utils.py
@@ -0,0 +1,215 @@
import pandas as pd
import yaml
import os
import sys
from typing import Optional
import re

# import pickle
import logging
import click
import copy
import zarr
import numpy as np
from numcodecs import Blosc
from pathlib import Path
from deeprvat.utils import (
    standardize_series,
    my_quantile_transform,
)

logging.basicConfig(
    format="[%(asctime)s] %(levelname)s:%(name)s: %(message)s",
    level="INFO",
    stream=sys.stdout,
)
logger = logging.getLogger(__name__)
DATA_SLOT_DICT = {
    "deeprvat": ["data", "training_data"],
    "seed_genes": ["data"],
}

module_folder_dict = {
    "seed_genes": "baseline",
    "deeprvat": "deeprvat",
    "alternative_burdens": "alternative_burdens",
}


@click.group()
def cli():
    pass


@cli.command()
@click.option("--module", "-m", multiple=True)
@click.option("--fold", type=int)
@click.option("--fold-specific-baseline", is_flag=True)
@click.option("--n-folds", type=int, default=5)
@click.argument("input_config", type=click.Path(exists=True))
@click.argument("out_path", type=click.Path(), default="./")
def spread_config(
    input_config, out_path, module, fold_specific_baseline, fold, n_folds
):
    data_modules = module

    with open(input_config) as f:
        config_template = yaml.safe_load(f)
    split = "train"
    cv_path = f"{config_template['cv_path']}/{n_folds}_fold"
    for module in data_modules:
        config = copy.deepcopy(config_template)
        data_slots = DATA_SLOT_DICT[module]
        for data_slot in data_slots:
            sample_file = f"{cv_path}/samples_{split}{fold}.pkl"
            logger.info(f"setting sample file {sample_file}")
            config[data_slot]["dataset_config"]["sample_file"] = sample_file

        if module in ("deeprvat", "deeprvat_pretrained"):
            logger.info("Writing baseline directories")
            old_baseline = copy.deepcopy(config["baseline_results"])
            if fold_specific_baseline:
                config["baseline_results"] = [
                    {"base": f'{r["base"]}/cv_split{fold}/baseline', "type": r["type"]}
                    for r in old_baseline
                ]
            logger.info(config["baseline_results"])
        logger.info(f"Writing config for module {module}")
        with open(f"{out_path}/{module_folder_dict[module]}/config.yaml", "w") as f:
            yaml.dump(config, f)


@cli.command()
@click.option("--fold", type=int)
@click.option("--n-folds", type=int, default=5)
@click.argument("input_config", type=click.Path(exists=True))
@click.argument("out_file", type=click.Path())
def generate_test_config(input_config, out_file, fold, n_folds):
    with open(input_config) as f:
        config = yaml.safe_load(f)
    cv_path = f"{config['cv_path']}/{n_folds}_fold"
    split = "test"
    sample_file = f"{cv_path}/samples_{split}{fold}.pkl"
    logger.info(f"setting sample file {sample_file}")
    for data_slot in DATA_SLOT_DICT["deeprvat"]:
        config[data_slot]["dataset_config"]["sample_file"] = sample_file
    with open(out_file, "w") as f:
        yaml.dump(config, f)


@cli.command()
@click.option("--link-burdens", type=click.Path())
@click.option("--burden-dirs", "-b", multiple=True)
@click.argument("out_dir", type=click.Path(), default="./")
@click.argument("config_file", type=click.Path(exists=True))
def combine_test_set_burdens(
    out_dir,
    link_burdens,
    burden_dirs,
    config_file,
):
    with open(config_file) as f:
        config = yaml.safe_load(f)
    compression_level = 1
    skip_burdens = link_burdens is not None
    n_total_samples = []
    for burden_dir in burden_dirs:
        print(burden_dir)
        this_y = zarr.open(f"{burden_dir}/y.zarr")
        this_x = zarr.open(f"{burden_dir}/x.zarr")
        # this_burdens = zarr.open(f'{burden_dir}/burdens.zarr')

        assert this_y.shape[0] == this_x.shape[0]  # == this_burdens.shape[0]
        n_total_samples.append(this_y.shape[0])

    n_total_samples = np.sum(n_total_samples)
    print(f"Total number of samples {n_total_samples}")
    if not skip_burdens:
        this_burdens = zarr.open(
            f"{burden_dir}/burdens.zarr"
        )  # any burden tensor (here from the last file to get dims 1 - n)
        burdens = zarr.open(
            Path(out_dir) / "burdens.zarr",
            mode="a",
            shape=(n_total_samples,) + this_burdens.shape[1:],
            chunks=(1000, 1000),
            dtype=np.float32,
            compressor=Blosc(clevel=compression_level),
        )
        print(f"burdens shape: {burdens.shape}")
    else:
        burdens = None

    y = zarr.open(
        Path(out_dir) / "y.zarr",
        mode="a",
        shape=(n_total_samples,) + this_y.shape[1:],
        chunks=(None, None),
        dtype=np.float32,
        compressor=Blosc(clevel=compression_level),
    )
    x = zarr.open(
        Path(out_dir) / "x.zarr",
        mode="a",
        shape=(n_total_samples,) + this_x.shape[1:],
        chunks=(None, None),
        dtype=np.float32,
        compressor=Blosc(clevel=compression_level),
    )

    start_idx = 0

    for burden_dir in burden_dirs:
        this_y = zarr.open(f"{burden_dir}/y.zarr")[:]
        end_idx = start_idx + this_y.shape[0]
        this_x = zarr.open(f"{burden_dir}/x.zarr")[:]
        if not skip_burdens:
            logger.info("writing burdens")
            this_burdens = zarr.open(f"{burden_dir}/burdens.zarr")[:]
            burdens[start_idx:end_idx] = this_burdens
        print((start_idx, end_idx))
        y[start_idx:end_idx] = this_y
        x[start_idx:end_idx] = this_x
        start_idx = end_idx

    y_transformation = config["data"]["dataset_config"].get("y_transformation", None)
    standardize_xpheno = config["data"]["dataset_config"].get(
        "standardize_xpheno", True
    )

    ## Analogously to what is done in densegt
    if standardize_xpheno:
        this_x = x[:]
        logger.info(" Standardizing combined covariates")
        for col in range(this_x.shape[1]):
            this_x[:, col] = standardize_series(this_x[:, col])
        x[:] = this_x
    if y_transformation is not None:
        this_y = y[:]
        n_unique_y_val = np.count_nonzero(~np.isnan(np.unique(this_y)))
        if n_unique_y_val == 2:
            logger.warning(
                "Not applying y transformation because y only has two values and seems to be binary"
            )
            y_transformation = None
        if y_transformation is not None:
            if y_transformation == "standardize":
                logger.info(" Standardizing combined target phenotype (y)")
                for col in range(this_y.shape[1]):
                    this_y[:, col] = standardize_series(this_y[:, col])
            elif y_transformation == "quantile_transform":
                logger.info(" Quantile transforming combined target phenotype (y)")
                for col in range(this_y.shape[1]):
                    this_y[:, col] = my_quantile_transform(this_y[:, col])
        y[:] = this_y
    print("done")
    if link_burdens is not None:
        source_path = Path(out_dir) / "burdens.zarr"
        source_path.unlink(missing_ok=True)
        source_path.symlink_to(link_burdens)
    genes = np.load(f"{burden_dirs[0]}/genes.npy")
    np.save(Path(out_dir) / "genes.npy", genes)


if __name__ == "__main__":
    cli()
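`cv_utils.py` imports `standardize_series` and `my_quantile_transform` from `deeprvat.utils`, whose implementations are outside this diff. As a rough sketch of what such helpers conventionally do in phenotype preprocessing (an assumption about intent, not the actual DeepRVAT code; `standardize` and `rank_inverse_normal` below are hypothetical names): standardization is a z-score, and a "quantile transform" of a quantitative trait is often a rank-based inverse-normal transform.

```python
import numpy as np
from statistics import NormalDist


def standardize(x: np.ndarray) -> np.ndarray:
    """Z-score a 1-D array, ignoring NaNs (hypothetical stand-in for standardize_series)."""
    return (x - np.nanmean(x)) / np.nanstd(x)


def rank_inverse_normal(x: np.ndarray, c: float = 0.5) -> np.ndarray:
    """Rank-based inverse-normal transform (hypothetical stand-in for my_quantile_transform).

    NaNs are excluded from ranking and preserved in the output; ties are
    broken arbitrarily in this sketch.
    """
    out = np.full(x.shape, np.nan)
    mask = ~np.isnan(x)
    vals = x[mask]
    ranks = vals.argsort().argsort() + 1  # 1-based ranks
    quantiles = (ranks - c) / (vals.size - 2 * c + 1)  # offset keeps quantiles in (0, 1)
    out[mask] = [NormalDist().inv_cdf(q) for q in quantiles]
    return out
```

The NaN handling mirrors what `combine_test_set_burdens` needs: it checks `np.isnan` on the combined `y` array before deciding whether to transform, so missing phenotype values must survive the transform unchanged.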