
Advanced regression models for association analysis with individual-level data#


Description#


This notebook performs various advanced statistical analyses of multiple xQTLs in a given region. Currently implemented procedures include:

  1. Univariate analysis

    • SuSiE

    • Univariate TWAS weights: LASSO, Elastic net, mr.ash, SuSiE, Bayes alphabet soup

    • Cross-validation of TWAS methods (optional but highly recommended if TWAS weights are computed)

  2. Functional data (epigenomic xQTL) analysis

    • fSuSiE

  3. Multivariate analysis

    • mvSuSiE

    • Multivariate TWAS weights: mvSuSiE and mr.mash

Input#

  1. A list of regions to be analyzed (optional); the last column of this file should be the region name.

  2. Either a list of per-chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK bed format.

  3. A vector of lists of phenotype files per region to be analyzed, in UCSC bed.gz format, with index in bed.gz.tbi format.

  4. A vector of covariate files corresponding to the lists above.

  5. A customized association-window file for variants (cis or trans). If it is not provided, a fixed-size window around the region (a cis window) will be used.

  6. Optionally, a vector of names of the phenotypic conditions, in the form cond1 cond2 cond3, separated by whitespace.

Inputs 2 and 3 should be outputs from the genotype_per_region and annotate_coord modules in previous preprocessing steps. Input 4 should be the output of the covariate_preprocessing pipeline, which contains genotype PCs, phenotypic hidden confounders and fixed covariates.


Example genotype data#

#chr        path
chr21 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr21.bed
chr22 /mnt/mfs/statgen/xqtl_workflow_testing/protocol_example.genotype.chr22.bed

Alternatively, simply use protocol_example.genotype.chr21_22.bed if all chromosomes are in the same file.


Example phenotype list#

#chr    start   end ID  path
chr12   752578  752579  ENSG00000060237  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   990508  990509  ENSG00000082805  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   2794969 2794970 ENSG00000004478  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   4649113 4649114 ENSG00000139180  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6124769 6124770 ENSG00000110799  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz
chr12   6534516 6534517 ENSG00000111640  /home/gw/GIT/github/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/MWE/output/phenotype/protocol_example.protein.bed.gz

Example association-window file#


It should have exactly 4 columns, with the header as a commented-out line:

#chr    start    end    gene_id
chr1    0    6480000    ENSG00000008128
chr1    0    6480000    ENSG00000008130
chr1    0    6480000    ENSG00000067606
chr1    0    7101193    ENSG00000069424
chr1    0    7960000    ENSG00000069812
chr1    0    6480000    ENSG00000078369
chr1    0    6480000    ENSG00000078808

The key is that the ID in the 4th column must match the ID in the 4th column of the phenotype list; otherwise the association window to analyze will not be found.
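
To verify this up front, the sketch below (using the example file paths from this page; substitute your own) prints any phenotype ID that lacks a matching association window:

awk 'NR==FNR { if ($1 !~ /^#/) ids[$4]; next } $1 !~ /^#/ && !($4 in ids) { print $4 }' \
    input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    output/phenotype/protocol_example.protein.region_list.txt

Any ID printed here will trigger an error when the pipeline looks up its association window.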


About indels#


The option --no-indel will remove indels from the analysis. FIXME: Gao needs to provide more guidelines on how to deal with indels in practice.
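
For example, following the same pattern as the other toggles shown below (all other parameters stay the same as in the MWE):

sos run pipeline/mnm_regression.ipynb susie_twas --no-indel ... # rest of parameters the same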


Output#


For each region analyzed, the output is the fitted SuSiE model, saved in RDS format.
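
For a quick look at the contents of one output file (a sketch; the file name assumes the susie_twas MWE naming used on this page, so adjust it to your own run):

Rscript -e 'res <- readRDS("output/susie_twas/protocol_example_protein.ENSG00000241973_P42356.univariate_susie_twas_weights.rds"); str(res, max.level = 2)'

Each file is a list named by region ID, and each region holds one entry per phenotypic condition (e.g. trait_A, trait_B) containing the fine-mapping results, optional TWAS weights, and region metadata. For regions skipped due to missing data, the entry is a warning message instead.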


Minimal Working Example Steps#


i. SuSiE with TWAS weights from multiple methods#


Timing [FIXME]


Below we duplicate the phenotype and covariate examples to demonstrate that, when there are multiple phenotypes for the same genotype, this pipeline can analyze all of them (more than two are accepted as well).


Here, using --region-name, we focus the analysis on 3 genes. In practice, if this parameter is dropped, the union of all regions across all phenotype region lists will be analyzed. Some regions may have no genotype data, in which case the pipeline will output RDS files with a warning message indicating the lack of genotype data to analyze.


Note: The suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.

sos run pipeline/mnm_regression.ipynb susie_twas \
    --name protocol_example_protein \
    --genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-name ENSG00000241973_P42356 ENSG00000160209_O00764 ENSG00000100412_Q99798 \
    --phenotype-names trait_A trait_B \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

It is also possible to analyze a selected list of regions using the option --region-list. The last column of this file will be used as the list of regions to analyze. Here, for example, we use the same list of regions as the customized association windows:

sos run xqtl-pipeline/pipeline/mnm_regression.ipynb susie_twas \
    --name protocol_example_protein \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.bed \
    --phenoFile output/phenotype/protocol_example.protein.region_list.txt \
                output/phenotype/protocol_example.protein.region_list.txt \
    --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --customized-association-windows xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \
    --phenotype-names trait_A trait_B \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

Note: When both --region-name and --region-list are used, the union of regions from these parameters will be analyzed.


FIXME: We should probably just explain these parameters; that will work better for the conversion script.


To perform fine-mapping only without TWAS weights,

sos run pipeline/mnm_regression.ipynb susie_twas --no-twas-weights ... # rest of parameters the same

To perform fine-mapping and compute TWAS weights without cross-validation,

sos run pipeline/mnm_regression.ipynb susie_twas --twas-cv-folds 0 ... # rest of parameters the same

It is also possible to analyze only a subset of samples, using the --keep-samples parameter. For example, we create a file keeping the IDs of 50 samples,

zcat output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz | head -1 | awk '{for (i=2; i<=51; i++) printf $i " "; print ""}'> output/keep_samples.txt

then use them in our analysis,

sos run xqtl-pipeline/pipeline/mnm_regression.ipynb susie_twas --keep-samples output/keep_samples.txt ... # rest of parameters the same

ii. fSuSiE#


Timing [FIXME]


Note: The suggested output naming convention is cohort_modality, e.g. ROSMAP_snRNA_pseudobulk.

sos run pipeline/mnm_regression.ipynb fsusie \
    --name protocol_example_methylation \
    --genoFile xqtl_association/protocol_example.genotype.chr21_22.plink_per_chrom.txt \
    --phenoFile output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt \
                output/phenotype_by_region/protocol_example.methylation.bed.phenotype_by_region_files.txt \
    --covFile output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
              output/covariate/protocol_example.methylation.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \
    --container oras://ghcr.io/cumc/pecotmr_apptainer:latest

iii. mvSuSiE with TWAS weights from mr.mash#

sos run pipeline/mnm_regression.ipynb mnm \
    --name multi_trait \
    --genoFile /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/geno_by_chrom/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
    --phenoFile /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Mic/output/data_preprocessing/phenotype_data/Mic.log2cpm.region_list.txt \
      /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Ast/output/data_preprocessing/phenotype_data/Ast.log2cpm.region_list.txt \
      /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Oli/output/data_preprocessing/phenotype_data/Oli.log2cpm.region_list.txt \
    --covFile /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Mic/output/data_preprocessing/covariate_data/Mic.log2cpm.Mic.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
      /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Ast/output/data_preprocessing/covariate_data/Ast.log2cpm.Ast.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
      /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl/Oli/output/data_preprocessing/covariate_data/Oli.log2cpm.Oli.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.snuc_pseudo_bulk.related.plink_qc.extracted.pca.projected.Marchenko_PC.gz \
    --customized-association-windows /mnt/vast/hpc/csg/cl4215/mrmash/workflow/susie_twas/TADB_enhanced_cis.coding.bed \
    --cwd output/ \
    --region-name ENSG00000073921 \
    --phenotype-names Mic_DeJager_eQTL Ast_DeJager_eQTL Oli_DeJager_eQTL \
    --mixture_prior /home/aw3600/MR_KMT_analysis/PTWAS/twas_mr_test/mash_test/mixture_prior_example.EZ.prior.rds \
    --max_cv_variants 1000 \
    --ld_reference_meta_file /mnt/vast/hpc/csg/data_public/20240120_ADSP_LD_matrix/ld_meta_file.tsv

Troubleshooting#

| Step | Substep | Problem | Possible Reason | Solution |
| ---- | ------- | ------- | --------------- | -------- |

Command interface#

sos run mnm_regression.ipynb -h

Workflow implementation#

[global]
# It is required to input the name of the analysis
parameter: name = str
parameter: cwd = path("output")
# A list of file paths for genotype data, or the genotype data itself.
parameter: genoFile = path
# One or multiple lists of file paths for phenotype data.
parameter: phenoFile = paths
# One or multiple lists of file paths for phenotype ID mapping files. The first column should be the original ID, the 2nd column should be the ID to be mapped to.
parameter: phenoIDFile = paths()
# Covariate file path
parameter: covFile = paths
# Optional: if a region list is provided, the analysis will be focused on the provided regions.
# The LAST column of this list will contain the ID of regions to focus on.
# Otherwise, all regions with both genotype and phenotype files will be analyzed.
parameter: region_list = path()
# Optional: if a region name is provided,
# the analysis will be focused on the union of the provided region list and region names
parameter: region_name = []
# Only focus on a subset of samples
parameter: keep_samples = path()
# An optional list documenting the custom association window for each region to analyze, with four columns: chr, start, end, region ID (e.g. gene ID).
# If this list is not provided, the default `cis_window` parameter (see below) will be used.
parameter: customized_association_windows = path()
# Specify the cis window, i.e. the up- and downstream radius in bp to analyze around the region of interest
# When this is set to a negative value, we will rely on using customized_association_windows
parameter: cis_window = -1
# Save data object or not
parameter: save_data = False
# Names of phenotypes
parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]
# An indicator of whether this is a trans analysis, i.e., not using phenotypic coordinate information
parameter: trans_analysis = False
# Random seed (used for cross validation)
parameter: seed = 999

# Association analysis parameters
# Initial number of single effects for SuSiE
parameter: init_L = 8
# Maximum number of single effects to use for SuSiE
parameter: max_L = 30
# Remove a variant if its missing rate in the individual level data exceeds imiss
parameter: imiss = 1.0
# MAF cutoff
parameter: maf = 0.0025
# MAC cutoff, on top of MAF cutoff
parameter: mac = 5
# Remove indels if indel = False
parameter: indel = True
# PIP cutoff used to call significant signals in post-processing
parameter: pip_cutoff = 0.025
# Credible set coverage levels: the first is the primary coverage, the rest are secondary coverage levels
parameter: coverage = [0.95, 0.7, 0.5]
# Perform fine-mapping
parameter: fine_mapping = True
# Compute TWAS weights as well
parameter: twas_weights = True
# Perform K-fold cross validation (CV) for TWAS
# Set it to zero if this is to be skipped
parameter: twas_cv_folds = 5
# Number of threads for CV
parameter: twas_cv_threads = twas_cv_folds
# Maximum number of variants to consider for CV
# We will randomly pick a subset of them for CV purposes
parameter: max_cv_variants = 5000
# Further limit CV to only using common variants
parameter: min_cv_maf = 0.05
# Optional LD reference meta file used when computing TWAS weights
parameter: ld_reference_meta_file = path()
# Analysis environment settings
parameter: container = ""
import re
parameter: entrypoint= ('micromamba run -a "" -n' + ' ' + re.sub(r'(_apptainer:latest|_docker:latest|\.sif)$', '', container.split('/')[-1])) if container else ""
# For cluster jobs, number of commands to run per job
parameter: job_size = 200
# Wall clock time expected
parameter: walltime = "1h"
# Memory expected
parameter: mem = "20G"
# Number of threads
parameter: numThreads = 1

def group_by_region(lst, partition):
    # Currently a pass-through: `partition` already holds one tuple of files per region,
    # so each element defines one input group. The commented-out code below is an older
    # cumulative-sum based grouping, kept for reference.
    # from itertools import accumulate
    # partition = [len(x) for x in partition]
    # Compute the cumulative sums once
    # cumsum_vector = list(accumulate(partition))
    # Use slicing based on the cumulative sums
    # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]
    return partition

import os
import pandas as pd

def adapt_file_path(file_path, reference_file):
    """
    Adapt a single file path based on its existence and a reference file's path.

    Args:
    - file_path (str): The file path to adapt.
    - reference_file (str): File path to use as a reference for adaptation.

    Returns:
    - str: Adapted file path.

    Raises:
    - FileNotFoundError: If no valid file path is found.
    """
    reference_path = os.path.dirname(reference_file)

    # Check if the file exists
    if os.path.isfile(file_path):
        return file_path

    # Check file name without path
    file_name = os.path.basename(file_path)
    if os.path.isfile(file_name):
        return file_name

    # Check file name in reference file's directory
    file_in_ref_dir = os.path.join(reference_path, file_name)
    if os.path.isfile(file_in_ref_dir):
        return file_in_ref_dir

    # Check original file path prefixed with reference file's directory
    file_prefixed = os.path.join(reference_path, file_path)
    if os.path.isfile(file_prefixed):
        return file_prefixed

    # If all checks fail, raise an error
    raise FileNotFoundError(f"No valid path found for file: {file_path}")

def adapt_file_path_all(df, column_name, reference_file):
    return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))
+
[get_analysis_regions: shared = "regional_data"]
# Input is genoFile, phenoFile, covFile, and optionally region_list. If region_list is present then we only analyze what's contained in the list.
# regional_data should be a dictionary like:
#{'data': [("genotype_1.bed", "phenotype_1.bed.gz", "covariate_1.gz"), ("genotype_2.bed", "phenotype_1.bed.gz", "phenotype_2.bed.gz", "covariate_1.gz", "covariate_2.gz") ... ],
# 'meta_info': [("chr12:752578-752579","chr12:752577-752580", "gene_1", "trait_1"), ("chr13:852580-852581","chr13:852579-852580", "gene_2", "trait_1", "trait_2") ... ]}
import numpy as np

def preload_id_map(id_map_files):
    id_maps = {}
    for id_map_file in id_map_files:
        if id_map_file is not None and os.path.isfile(id_map_file):
            df = pd.read_csv(id_map_file, sep='\s+', header=None, comment='#', names=['old_ID', 'new_ID'])
            id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()
    return id_maps

def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):
    pheno_df = pd.read_csv(pheno_path, sep="\s+", header=0)
    pheno_df['Original_ID'] = pheno_df['ID']
    if id_map_path in preloaded_id_maps:
        id_map = preloaded_id_maps[id_map_path]
        pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])
    return pheno_df

def filter_by_region_ids(data, region_ids):
    if region_ids:
        data = data[data['ID'].isin(region_ids)]
    return data

def custom_join(series):
    # Initialize an empty list to hold the processed items
    result = []
    for item in series:
        if ',' in item:
            # If the item contains commas, split by comma and convert to tuple
            result.append(tuple(item.split(',')))
        else:
            # If the item does not contain commas, add it directly
            result.append(item)
    # Convert the list of items to a tuple and return
    return tuple(result)

def aggregate_phenotype_data(accumulated_pheno_df):
    if not accumulated_pheno_df.empty:
        accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({
            '#chr': lambda x: np.unique(x).astype(str)[0],
            'ID': lambda x: np.unique(x).astype(str)[0],
            'Original_ID': ','.join,
            'start': 'min',
            'end': 'max'
        }).groupby(['#chr','ID'], as_index=False).agg({
            'cond': ','.join,
            'path': ','.join,
            'Original_ID': custom_join,
            'cov_path': ','.join,
            'start': 'min',
            'end': 'max'
        })
    return accumulated_pheno_df

def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):
    '''
    Example output:
    #chr    start      end    ID  Original_ID   path     cov_path             cond
    chr12   752578   752579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B
    '''
    accumulated_pheno_df = pd.DataFrame()
    pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files

    for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
        if not os.path.isfile(cov_path):
            raise FileNotFoundError(f"No valid path found for file: {cov_path}")
        pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)
        pheno_df = filter_by_region_ids(pheno_df, region_ids)
        
        if not pheno_df.empty:
            pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], f"{pheno_path:a}")
            pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)           
            accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)

    accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
    return accumulated_pheno_df

def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):
    '''
    Example output:
    #chr    start      end    ID  Original_ID   path     cov_path             cond
    chr21   0   0  chr21_18133254_19330300  carnitine,benzoate,hippurate  metabolon_1.bed.gz,metabolon_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B
    '''
    
    if not os.path.isfile(customized_association_windows):
        raise ValueError("Customized association analysis window must be specified for trans analysis.")
    accumulated_pheno_df = pd.DataFrame()
    pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files
    genotype_windows = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
    genotype_windows = filter_by_region_ids(genotype_windows, region_ids)
    if genotype_windows.empty:
        return accumulated_pheno_df
    
    for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):
        if not os.path.isfile(cov_path):
            raise FileNotFoundError(f"No valid path found for file: {cov_path}")
        pheno_df = pd.read_csv(pheno_path, sep="\s+", header=0, names=['Original_ID', 'path'])
        if not pheno_df.empty:
            pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f"{pheno_path:a}")
            pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)
            # Here we combine genotype_windows, which contains "#chr" and "ID", with pheno_df by creating a cartesian product
            pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)
            # then set start and end columns to zero
            pheno_df['start'] = 0
            pheno_df['end'] = 0
            if id_map_path is not None:
                # Filter pheno_df by specific association-window and phenotype pairs
                association_analysis_pair = pd.read_csv(id_map_path, sep='\s+', header=None, comment='#', names=['ID', 'Original_ID'])
                pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])
            accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)

    accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)
    return accumulated_pheno_df

# Load phenotype meta data
if len(phenoFile) != len(covFile):
    raise ValueError("Number of input phenotype files must match that of covariate files")
if len(phenoFile) != len(phenotype_names):
    raise ValueError("Number of input phenotype files must match the number of phenotype names")
if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):
    raise ValueError("Number of input phenotype files must match the number of phenotype ID mapping files")

# Load genotype meta data
if f"{genoFile:x}" == ".bed":
    geno_meta_data = pd.DataFrame([("chr"+str(x), f"{genoFile:a}") for x in range(1,23)] + [("chrX", f"{genoFile:a}")], columns=['#chr', 'geno_path'])
else:
    geno_meta_data = pd.read_csv(f"{genoFile:a}", sep = "\s+", header=0)
    geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f"{genoFile:a}")
    geno_meta_data.columns = ['#chr', 'geno_path']
    geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')

# Checking the DataFrame
valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']
if not all(value in valid_chr_values for value in geno_meta_data['#chr']):
    raise ValueError("Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.")

region_ids = []
# If region_list is provided, read the file and extract IDs
if region_list.is_file():
    region_list_df = pd.read_csv(region_list, delim_whitespace=True, header=None, comment = "#")
    region_ids = region_list_df.iloc[:, -1].unique()  # Extracting the last column for IDs

# If region_name is provided, include those IDs as well
# --region-name A B C will result in a list of ["A", "B", "C"] here
if len(region_name) > 0:
    region_ids = list(set(region_ids).union(set(region_name)))

if trans_analysis:
    meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)
else:
    meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))
    
if not meta_data.empty:
    meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')
    # Adjust association window
    if os.path.isfile(customized_association_windows):
        print(f"Loading customized association analysis window from {customized_association_windows}")
        association_windows_list = pd.read_csv(customized_association_windows, comment="#", header=None, names=["#chr","start","end","ID"], sep="\t")
        meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))
        mismatches = meta_data[meta_data['start_association'].isna()]
        if not mismatches.empty:
            print("First 5 mismatches:")
            print(mismatches[['ID']].head())
            raise ValueError(f"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. Please check your ``{customized_association_windows}`` database to make sure it contains all association-window definitions.")
    else:
        if cis_window < 0 :
            raise ValueError("Please either input a valid path to an association-window file via ``--customized-association-windows``, or set ``--cis-window`` to a non-negative integer.")
        if cis_window == 0:
            print("Warning: only variants between the start and end of the molecular phenotype will be considered, since cis_window is set to zero and no customized association window file was found. Please make sure this is by design.")
        meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))
        meta_data['end_association'] = meta_data['end'] + cis_window

    # Example meta_data:
    # #chr    start      end    start_association       end_association           ID  Original_ID   path     cov_path             cond             coordinate     geno_path
    # 0  chr12   752578   752579  652578   852579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B    chr12:752578-752579  protocol_example.genotype.chr21_22.bed       
    # Create the final dictionary
    regional_data = {
        'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],
        'meta_info': [(f"{row['#chr']}:{row['start']}-{row['end']}", # this is the phenotypic region to extract data from
                       f"{row['#chr']}:{row['start_association']}-{row['end_association']}", # this is the association window region
                       row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]
    }
else:
    regional_data = {'data':[], 'meta_info':[]}

Univariate analysis: SuSiE and TWAS#

[susie_twas_1]
# If this value is greater than 0, an initial single effect analysis will be performed
# to determine whether follow-up analysis should be continued or NULL simply returned
parameter: skip_analysis_pip_cutoff = []
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')

if len(skip_analysis_pip_cutoff) == 0:
    skip_analysis_pip_cutoff = [-1.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
    skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
    raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or the same length as the number of phenotype files ({len(phenoFile)} in this case)")

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[2]}.univariate{"_susie" if fine_mapping else ""}{"_twas_weights" if twas_weights else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    options(warn=1)
    library(pecotmr)
    # extract subset of samples
    keep_samples = NULL
    if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
      keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
      message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
    }
    # Load regional association data
    tryCatch({
    fdat = load_regional_univariate_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])}),
                                          region = ${("'%s'" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'}, # if the end position is zero return NULL
                                          association_window = "${_meta_info[1]}",
                                          conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])}),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"},
                                          keep_samples = keep_samples,
                                          extract_region_name = list(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}),
                                          phenotype_header = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
                                          region_name_col = ${"4" if int(_meta_info[0].split('-')[-1])>0 else "1"},
                                          scale_residuals = FALSE)
    }, NoSNPsError = function(e) {
        message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
        #saveRDS(NULL, ${_output:ar})
        saveRDS(list(${_meta_info[2]} = e$message), ${_output:ar}, compress='xz')
        quit(save="no")
    })
  
    if (${"TRUE" if not (fine_mapping or twas_weights) else "FALSE"}) {
      # only export data
      saveRDS(list(${_meta_info[2]} = fdat), ${_output:ar}, compress='xz')
      quit(save="no")
    } else {
      if (${"TRUE" if save_data else "FALSE"}) {
          # save data object for debug purpose
          saveRDS(list(${_meta_info[2]} = fdat), "${_output:ann}.univariate.rds", compress='xz')
      }
    }
    # Univariate analysis suite
    library(susieR)
    run_univariate_pipeline <- function(X, Y, X_scalar, Y_scalar, maf, dropped_samples, pip_cutoff_to_skip = 0) {
      if (pip_cutoff_to_skip>0) {
          # return a NULL set if the top loci model does not show any potentially significant variants
          top_model_pip = susie(X,Y,L=1)$pip
          if (!any(top_model_pip>pip_cutoff_to_skip)) {
              message(paste("Skipping follow-up analysis: No signals above PIP threshold", pip_cutoff_to_skip, "in initial model screening."))
              return(list())
          } else {
              message(paste("Follow-up on region because signals above PIP threshold", pip_cutoff_to_skip, "were detected in initial model screening."))
          }
      }
      st = proc.time()
      if (${"TRUE" if fine_mapping else "FALSE"}) {
          res = susie_wrapper(X, Y, init_L=${init_L}, max_L=${max_L}, refine=TRUE, coverage = ${coverage[0]})
          res = susie_post_processor(res, X, Y, X_scalar, Y_scalar, maf,
                                 secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                 other_quantities = list(dropped_samples = dropped_samples))
      }
      else {
          res = list()
      }
      if ( ${"TRUE" if twas_weights else "FALSE"} ) {
        twas_weights_output <- twas_weights_pipeline(X, Y, maf, susie_fit=res$susie_result_trimmed, 
                                     ld_reference_meta_file = ${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
                                     X_scalar = X_scalar, y_scalar = Y_scalar,
                                     cv_folds = ${twas_cv_folds}, coverage=${coverage[0]}, secondary_coverage=c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                     min_cv_maf=${min_cv_maf}, max_cv_variants=${max_cv_variants}, cv_seed=${seed}, cv_threads=${twas_cv_threads})
        # clean up the output database
        res = c(res, twas_weights_output)
        res$twas_weights = lapply(res$twas_weights, function(x) { rownames(x) <- NULL; return(x) })
      }
      res$total_time_elapsed = proc.time() - st
      if ("${_meta_info[2]}" != "${_meta_info[3]}") {
          region_name = c("${_meta_info[2]}", c(${",".join([("c('"+x+"')") if isinstance(x, str) else ("c"+ str(x)) for x in _meta_info[3]])}))
      } else {
          region_name = "${_meta_info[2]}"
      }
      res$region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name=region_name)
      return (res)
    }
  
    fitted = list()
    condition_names = vector()
    empty_elements_cnt = 0
    r = 1
    while (r<=length(fdat$residual_Y)) {
      dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], 
                             y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                             covar=fdat$dropped_sample$dropped_samples_covar[[r]])
      results <- lapply(1:ncol(fdat$residual_Y[[r]]), function(i) run_univariate_pipeline(fdat$residual_X[[r]], 
                                                                                    fdat$residual_Y[[r]][,i,drop=FALSE], 
                                                                                    fdat$residual_X_scalar[[r]], 
                                                                                    if (fdat$residual_Y_scalar[[r]] == 1) 1 else fdat$residual_Y_scalar[[r]][,i,drop=FALSE], 
                                                                                    fdat$maf[[r]], dropped_samples, 
                                                                                    pip_cutoff_to_skip = c(${",".join(['%s' % x for x in skip_analysis_pip_cutoff])})[r]))
      # Update condition names
      new_names = names(fdat$residual_Y)[r]
      if (ncol(fdat$residual_Y[[r]]) > 1) {
          new_names = colnames(fdat$residual_Y[[r]])
          if (is.null(new_names)) {
              # column names do not exist, create generic names instead
              new_names = 1:ncol(fdat$residual_Y[[r]])
          }
          new_names = paste(names(fdat$residual_Y)[r], new_names, sep="_") # DLPFC_iso1 DLPFC_iso2
      }
  
      empty_elements_idx <- which(sapply(results, function(x) is.list(x) && length(x) == 0))
      if (length(empty_elements_idx)>0) {
          empty_elements_cnt <- empty_elements_cnt + length(empty_elements_idx)
          results <- results[-empty_elements_idx]
          new_names <- new_names[-empty_elements_idx]
      }
      fitted = c(fitted, results)
      condition_names = c(condition_names, new_names)
      if (length(new_names)>0) {
          message("Analysis completed for: ", paste(new_names, collapse=","))
      }
      # original data no longer relevant, set to NA to release memory
      fdat$residual_X[[r]] <- NA
      fdat$residual_Y[[r]] <- NA
      r = r + 1
    }
    names(fitted) <- condition_names
    saveRDS(list("${_meta_info[2]}" = fitted), ${_output:ar}, compress='xz')
    if (empty_elements_cnt>0) {
        message(empty_elements_cnt, " out of ", length(fitted), " analyses were skipped for failing to pass the initial screening for potential association signals")
    }

Multivariate analysis: mvSuSiE and mr.mash#

[mnm_1]
# Prior model file generated from mashr. 
# Default will be used if it does not exist.
parameter: mixture_prior = path()
parameter: mixture_prior_cv = path()
parameter: prior_weights_min = 5e-4
parameter: prior_canonical_matrices = True
parameter: sample_partition = path() 
parameter: mvsusie_max_iter = 200
parameter: mrmash_max_iter = 5000
parameter: ld_reference_meta_file = path()
# If this value is greater than 0, an initial single effect analysis will be performed
# to determine whether follow-up analysis should be continued or NULL simply returned
parameter: skip_analysis_pip_cutoff = []
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')

if len(skip_analysis_pip_cutoff) == 0:
    skip_analysis_pip_cutoff = [-1.0] * len(phenoFile)
if len(skip_analysis_pip_cutoff) == 1:
    skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)
if len(skip_analysis_pip_cutoff) != len(phenoFile):
    raise ValueError(f"``skip_analysis_pip_cutoff`` should have either length 1 or the same length as the number of phenotype files ({len(phenoFile)} in this case)")

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0].split(":")[0]}_{_meta_info[2]}.mnm.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint

    library(pecotmr) 
    fdat = load_regional_multivariate_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])}),
                                          region = "${_meta_info[0]}",
                                          conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])}),
                                          association_window = "${_meta_info[1]}",
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"})
    
    # load the data-driven prior mixtures if the corresponding files are provided
    dd_prior <- if (${mixture_prior:r} == '.' || ${mixture_prior:r} == '') NULL else readRDS(${mixture_prior:r})
    dd_prior_cv <- if (${mixture_prior_cv:r} == '.' || ${mixture_prior_cv:r} == '') NULL else readRDS(${mixture_prior_cv:r})
    
    result <- multivariate_analysis_pipeline(
                    X=fdat$X,
                    Y=fdat$residual_Y,
                    maf=fdat$maf,
                    max_L=${max_L},
                    ld_reference_meta_file = ${ld_reference_meta_file:r}, 
                    mvsusie_max_iter= ${mvsusie_max_iter}, 
                    mrmash_max_iter = ${mrmash_max_iter},
                    max_cv_variants = ${max_cv_variants}, 
                    pip_cutoff_to_skip = c(${",".join(['%s' % x for x in skip_analysis_pip_cutoff])}),
                    signal_cutoff = ${pip_cutoff},
                    data_driven_prior_matrices=dd_prior, 
                    data_driven_prior_matrices_cv=dd_prior_cv, 
                    canonical_prior_matrices = ${"TRUE" if prior_canonical_matrices else "FALSE"} || is.null(dd_prior),
                    sample_partition = ${("'%s'" % sample_partition) if sample_partition.is_file() else "NULL"},  
                    cv_folds = ${twas_cv_folds}, 
                    cv_seed = ${seed}, 
                    cv_threads = ${twas_cv_threads},
                    prior_weights_min = ${prior_weights_min}, 
                    twas_weights = ${"TRUE" if twas_weights else "FALSE"}
    )
  
    saveRDS(result, ${_output:ar})

Functional regression fSuSiE for epigenomic QTL fine-mapping#

[fsusie_1]
# Prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior = "mixture_normal"
parameter: max_SNP_EM = 100
# max_scale is such that 2^max_scale is the number of phenotypes in the transformed space; the default is 2^10 = 1024. Don't change it unless you know what you are doing. max_scale should be larger than 5.
parameter: max_scale = 10
# Purity and coverage used to call credible sets
parameter: min_purity = 0.5
# Epigenetic mark filter: minimum number of marks required to analyze a region
parameter: epigenetics_mark_treshold = 16
# Run SuSiE on the top PCs of the fSuSiE input
parameter: susie_top_pc = 10
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')
meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0].replace(":", "_").replace("-", "_")}.fsusie_{prior}{"_top_pc_weights" if twas_weights else ""}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    options(warn=1)
    # extract subset of samples
    keep_samples = NULL
    if (${"TRUE" if keep_samples.is_file() else "FALSE"}) {
      keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), "\\s+"))
      message(paste(length(keep_samples), "samples are selected to be loaded for analysis"))
    }

    # Load regional functional data
    library(pecotmr)
    tryCatch({
    fdat = load_regional_functional_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1:len(_input)//2+1]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[len(_input)//2+1:]])}),
                                          region = "${_meta_info[0]}",
                                          association_window = "${_meta_info[1]}",
                                          conditions = c(${",".join(['"%s"' % x for x in _meta_info[4:]])}),
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss},
                                          keep_indel = ${"TRUE" if indel else "FALSE"},
                                          keep_samples = keep_samples,
                                          tabix_header = TRUE,
                                          phenotype_header = 4,
                                          region_name_col = 4,
                                          scale_residuals = FALSE)
    }, NoSNPsError = function(e) {
        message("Error: ", paste(e$message, "${_meta_info[2] + '@' + _meta_info[1]}"))
        saveRDS(list("${_meta_info[0]}" = e$message), ${_output:ar}, compress='xz')
        quit(save="no")
    })
    # Filter out elements of fdat with fewer epigenomic marks than the threshold.
    library(tidyverse)
    filter_fdat_except_specific_names <- function(fdat, n) {
        # Identify which phenotype blocks meet the minimum row count
        indices_to_keep <- sapply(fdat$Y_coordinates, function(x) nrow(x) >= n)
        fdat_filtered <- map(fdat[!names(fdat) %in% c("dropped_sample", "X", "chrom")],~.x[indices_to_keep]) 
        return(c(fdat_filtered,fdat[names(fdat) %in% c("dropped_sample", "X", "chrom")]))
    }

    fdat = filter_fdat_except_specific_names(fdat, n = ${epigenetics_mark_treshold})
    # Check if Y_coordinates is empty after filtering
    if (length(fdat$Y_coordinates) == 0) {
        e_msg = paste0("None of the studies have at least ", ${epigenetics_mark_treshold}, " epigenetic marks; region skipped")
        message(e_msg)
        saveRDS(list("${_meta_info[0]}" = e_msg ),  ${_output:ar}, compress='xz')
        quit(save="no")
    }

    if (${"TRUE" if save_data else "FALSE"}) {
      # save data object for debug purpose
      saveRDS(list("${_meta_info[0]}" = fdat), "${_output:ann}.${epigenetics_mark_treshold}_marks.dataset.rds", compress='xz')
    }
  
    fitted = setNames(replicate(length(fdat$residual_Y), list(), simplify = FALSE), names(fdat$residual_Y))
    for (r in 1:length(fitted)) {
        st = proc.time()
        fitted[[r]] = list()
        message(paste("Dimension of Y matrix is ", nrow(fdat$residual_Y[[r]]), "rows by", ncol(fdat$residual_Y[[r]]), "columns."))
        
        # Run SuSiE on top K PCs
        if(${"TRUE" if (susie_top_pc >0 or twas_weights) else "FALSE"}) {
            top_pc_data <- prcomp(fdat$residual_Y[[r]], center = TRUE, scale. = TRUE)$x
            if (${susie_top_pc} < ncol(top_pc_data)) {
                top_pc_data <- top_pc_data[,1:${susie_top_pc},drop=F]
            }
            fitted[[r]]$susie_on_top_pc <- list()
            for (i in 1:ncol(top_pc_data)) {
                susie_on_top_pc <- susie_wrapper(fdat$residual_X[[r]], top_pc_data[,i], init_L=${init_L}, max_L=${max_L}, refine=TRUE, coverage = ${coverage[0]})
                fitted[[r]]$susie_on_top_pc[[i]] <- susie_post_processor(susie_on_top_pc, fdat$residual_X[[r]], top_pc_data[,i], fdat$residual_X_scalar[[r]], 1, fdat$maf[[r]],
                                           secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                           other_quantities = list(dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], 
                                                                   y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                                                                   covar=fdat$dropped_sample$dropped_samples_covar[[r]])))
            }
            # Run TWAS weights on the first PC
            # Exactly the same code as in susie_twas
            if ( ${"TRUE" if twas_weights else "FALSE"} ) {
                twas_weights_output <- twas_weights_pipeline(fdat$residual_X[[r]], top_pc_data[,1], fdat$maf[[r]], 
                                         susie_fit=fitted[[r]]$susie_on_top_pc[[1]]$susie_result_trimmed,
                                         ld_reference_meta_file = ${('"%s"' % ld_reference_meta_file) if not ld_reference_meta_file.is_dir() else "NULL"},
                                         X_scalar = fdat$residual_X_scalar[[r]], y_scalar = fdat$residual_Y_scalar[[r]],
                                         cv_folds = ${twas_cv_folds}, coverage=${coverage[0]}, secondary_coverage=c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                         min_cv_maf=${min_cv_maf}, max_cv_variants=${max_cv_variants}, cv_seed=${seed}, cv_threads=${twas_cv_threads})
                # clean up the output database
                fitted[[r]] = c(fitted[[r]], twas_weights_output)
                fitted[[r]]$twas_weights = lapply(fitted[[r]]$twas_weights, function(x) { rownames(x) <- NULL; return(x) })
            }
        }
        # Run fSuSiE -- this can take a while
        fitted[[r]]$fsusie_result <- fsusie_wrapper(X = fdat$residual_X[[r]],
                                      Y = fdat$residual_Y[[r]],
                                      pos=fdat$Y_coordinates[[r]]$start,
                                      L=${max_L},
                                      prior="${prior}",
                                      max_SNP_EM=${max_SNP_EM}, 
                                      max_scale = ${max_scale},
                                      min.purity = ${min_purity},
                                      cov_lev = ${coverage[0]})
        fitted[[r]]$Y_coordinates = fdat$Y_coordinates[[r]] # Add Y coordinates
        names(fitted[[r]]$fsusie_result$pip) = colnames(fdat$residual_X[[r]]) 
        fitted[[r]]$fsusie_summary <- susie_post_processor(fitted[[r]]$fsusie_result, fdat$residual_X[[r]], NULL, fdat$residual_X_scalar[[r]], 1, fdat$maf[[r]], 
                                                          secondary_coverage = c(${",".join([str(x) for x in coverage[1:]])}), signal_cutoff = ${pip_cutoff},
                                                          other_quantities = list(dropped_samples = list(X=fdat$dropped_sample$dropped_samples_X[[r]], y=fdat$dropped_sample$dropped_samples_Y[[r]], 
                                                                                  covar=fdat$dropped_sample$dropped_samples_covar[[r]])))
        fitted[[r]]$total_time_elapsed = proc.time() - st
        fitted[[r]]$region_info = list(region_coord=parse_region("${_meta_info[0]}"), grange=parse_region("${_meta_info[1]}"), region_name="${_meta_info[2]}")
        # original data no longer relevant, set to NA to release memory
        fdat$residual_X[[r]] <- NA
        fdat$residual_Y[[r]] <- NA
    }
    saveRDS(list("${_meta_info[0]}" = fitted), ${_output:ar}, compress='xz')

Functional regression fSuSiE with other modality#


This is still WIP; the mvfSuSiE package is still being developed.

[mvfsusie_1]
# Prior can be either of ["mixture_normal", "mixture_normal_per_scale"]
parameter: prior = "mixture_normal_per_scale"
parameter: max_SNP_EM = 1000
depends: sos_variable("regional_data")
# Check if both 'data' and 'meta_info' are empty lists
stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {", ".join(region_name)}.')

meta_info = regional_data['meta_info']
input: regional_data["data"], group_by = lambda x: group_by_region(x, regional_data["data"]), group_with = "meta_info"
output: f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[0].replace(":", "_").replace("-", "_")}.mvfsusie_{prior}.rds'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output:bn}'
R: expand = '${ }', stdout = f"{_output:n}.stdout", stderr = f"{_output:n}.stderr", container = container, entrypoint = entrypoint
    # Load regional association data
    library(pecotmr)
    library(tidyverse)
    fdat = load_regional_association_data(genotype = ${_input[0]:anr},
                                          phenotype = c(${",".join(['"%s"' % x.absolute() for x in _input[1::2]])}),
                                          covariate = c(${",".join(['"%s"' % x.absolute() for x in _input[2::2]])}),
                                          region = "${_meta_info[0]}",
                                          maf_cutoff = ${maf},
                                          mac_cutoff = ${mac},
                                          imiss_cutoff = ${imiss})
    # Fine-mapping with mvfSuSiE
    library("mvf.susie.alpha")
    # Align each phenotype matrix to the genotype sample order before fitting
    Y = map(fdat$residual_Y, ~left_join(fdat$X[,1]%>%as.data.frame%>%rownames_to_column("rowname"), .x%>%t%>%as.data.frame%>%rownames_to_column("rowname") , by = "rowname")%>%select(-2)%>%column_to_rownames("rowname")%>%as.matrix )
    fitted <- multfsusie(Y_f = list(Y[[1]],Y[[3]]), 
                         Y_u = Reduce(cbind, Y[[2]]),
                         pos = list(pos1 = fdat$phenotype_coordinates[[1]], pos2 = fdat$phenotype_coordinates[[3]]),
                         X = fdat$X,
                         L = ${max_L},
                         data.format = "list_df")
    saveRDS(fitted, ${_output:ar})