Skip to content

Latest commit

 

History

History
259 lines (219 loc) · 14.6 KB

README.md

File metadata and controls

259 lines (219 loc) · 14.6 KB

FunUnifrac

A repository to implement UniFrac, but on functional profiles of metagenomic data.

Installation

To install the package, run the following command:

conda env update -f environment.yml
conda activate fununifrac

Preprocessed data support

All required resources are shared by Onedrive currently. Please check here. Detailed processes of generating the files are described at the "Preprocessing" section.

Script1: compute_edges.py

The original UniFrac uses a phylogenetic tree, allowing one to measure the difference in composition between two samples. For FunUniFrac, we use a KEGG Orthology (KO) tree and functional profiles instead of an OTU table as input samples. The only obstacle is that the KO tree does not naturally come with branch lengths. To overcome this, we use pairwise AAI (average amino acid identity) as a proxy of the pairwise distances between leaf nodes and assign the rest of the branch lengths by solving the linear system as demonstrated below.
image

Input

  • edge list: The KEGG hierarchy where all the ancestors of the given brite_id.
  • pairwise distances: (KO1, KO2), edge j is on the shortest path from ' 'KO1 to KO2'
wget -O "edge_ko00001.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/ETUrru_StwJFray9WJ8q4lAB9OwzSwF2Zhze_OvXv6HP4A?download=1"
# 1.59GB
wget -O "KOs_sketched_scaled_10_compare_5" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESd-RVfvWbZKpoky2lE6K8EBNktiaw8elnSbxZGZ9mbzHQ?download=1"
wget -O "KOs_sketched_scaled_10_compare_5.labels.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EW39Mh7xqsdKsVAIxM5FWVEBnRMkR4bWs5oLvJlbR1y8fw?download=1"

Run

python ./fununifrac/compute_edges.py -e edge_ko00001.txt -d KOs_sketched_scaled_10_compare_5 -o ../fununifrac_out -b ko00001 -n 50 -f 10 -r 100 --distance

Output

  • fununifrac_edge_lengths_{datetime}.csv: The edge lengths added to the original edge list

Script2(Optional): create_edge_matrix.py

This is a sub-script called inside of compute_edges.py. In order to infer the edge lengths, we need to know which edges are traversed between all pairs of KOs. For example, if we have a binary tree with 2 leaves:

    1
L1 / \ L2
  2   3

and edge lengths L1 and L2 respectively, then the distance between the two leaves is L1 + L2, resulting in a matrix equation of the form:

    [1, 1] [L1;L2] = [d(a,b)] 

where d(a,b) is the distance between the two leaves a and b derived in the previous section. The coefficient matrix of all edges traversed between all pairs of KOs is obtained by running:

The same input is used to generate the intermediate matrix.

Run

python ./fununifrac/create_edge_matrix.py -e edge_ko00001.txt -d KOs_sketched_scaled_10_compare_5 -o ../fununifrac_out/large -b ko00001

The resulting matrix will have a name such as: fununifrac_A.npz and will be a massive (233M rows, ~20K columns) sparse matrix called A below.

Output

  • fununifrac_A.npz: The matrix describing the least sequare problem for computing edge lengths.
  • fununifrac_A_basis.txt: The column information.

We will employ a regularized, randomized nonnegative least squares approach to infer the edge lengths. In practice, we've observed that the matrix A is rank deficient, yet over-determined (more rows than columns). Thus there is no unique solution to Ax=y where y is the vector of pairwise distances between all KOs. We can, however, find a unique "shortest length" solution by adding a regularization term to the objective function:

min ||Ax - y||_2^2 + lambda * ||x||_1^2
s.t. x>=0

where lambda is a regularization parameter. This can be cast as a NNLS problem via:

min ||A'x - y'||_2^2
s.t. x>=0
where
A' = [A; sqrt(lambda)*ones(1, n)]
y' = [y; 0]

We can then solve this problem via a randomized approach, selecting enough rows of A' to ensure that the rank is large enough. After a bunch of iterates, the average is then taken.

The output will be an edge list with an additional column specifying the length of that edge. Eg. kegg_ko_edge_df_br:ko00001.txt

Script3: compute_fununifrac.py

This is the core function of the functional unifrac package. It takes a directory containing functional profiles of the sample, in the format of sourmash gather files, as well as an edge list file (in this case, for instance, the output of Script1 in the preprocessing steps above) describing the underlying the KEGG tree. compute_fununifrac.py computes the pairwise functional UniFrac values among the sourmash gather files provided by the user.

Input

  • A directory containing functional profiles with the option to specify the file pattern to match to. The default file pattern is *_gather.csv.
  • An edge list file describing the underlying KO tree
  • A path specifying where the output directory

Run

python ./fununifrac/compute_fununifrac.py -e {edge} -fd . -o {output} --diffab --force -b ko00001 -a median_abund --L2

Output

  • Pairwise functional UniFrac distances matrix saved in .npy format
  • Basis of the above matrix in .npy format

Example

We provide a minimum example to demonstrate the usage of this function below.

Download KO tree and sample data

wget -O "edge_ko00001_lengths.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EVbB-ieWK7xDqux7Y7u_4lEBpMi-jCyA7oDkq3RhtrueXQ?download=1"

wget -O "f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EeSp1SOl-aJPsJOtcDylQc8B_cSWK_pxKiuPumW5CBJevQ?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EVlFAOKUEDhCslA5KBk8s2QBOcUdhrlkhDJXsFRap72cKA?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EWnhFoHwindPp1q5JxmYbS4BQ9p2OArBGGxQwa0XCP9cSg?download=1"

Run

python ./fununifrac/compute_fununifrac.py -e ./edge_ko00001_lengths.txt -d . -o fununifrac_results --diffab

Output

  • fununifrac_results/fununifrac_out_20230420-150954.main.basis.npy
  • fununifrac_results/fununifrac_out_20230420-150954.main.npy
  • fununifrac_results/ununifrac_out_20230420-150954.diffab.nodes.npy
  • fununifrac_results/fununifrac_out_20230420-150954.diffab.npz

Preprocessing

0. Obtaining the underlying KO tree/DAG structure

The underlying graph structure on which the earth mover's distance is based on is obtained from the KEGG database.

The KEGG Orthology (KO) hierarchy tree was donwloaded using KEGG API. The KEGG BRITE Database is a collection of hierarchical classification system in which ko00001 is the KO hierarchy. Therefore, we utilized the get operation of KEGG API to access the KO hierarchy. For each KO leaf node in this hierarchy tree, we get its corresponding lineage. For instance, the lineage of KO K03382 is "09100 Metabolism; 09111 Xenobiotics biodegradation and metabolism; 00791 Atrazine degradation; K03382 atzB; hydroxydechloroatrazine ethylaminohydrolase". With this lineage information, we save the KO hierarchical tree into a file kegg_ko00001_edges.txt with an edge list format.

Currently, to keep things simple, we are just using subtrees of the KEGG database. As such, you need to pick a parent BRITE node to start with. Later, we will extract the tree of all the children of this parent node. For example, you can choose the BRITE: ko00001 which will correspond to "KEGG Orthology (KO)". In the following, call the resulting tree the "KO tree". In the end, you should have a file like kegg_ko_edge_df_br:ko00001.txt.

For a more detailed description of the process, see extraction repo. In particular: get_ko_hierarchy.py

wget -O "kegg_ko00001_edges.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/EWWFfRZc4o5OltW7YZFH8yYBCQB8HW8IW_zLveO7eAeMPQ?download=1"

Script

python ./fununifrac/reproducibility/generate_ko_hierarchy.py --outdir ${OUTDIR}

1. KO sketches

From the KEGG database, we downloaded amino acid sequences for all the genes in each KO and used Sourmash to build FracMinHash sketch for each KO. These sketches can be used as a reference to efficiently generate the functional profiles for metagenomic samples, which will be explained in the Section 3. Sample sketches below. To use the pre-generated KO sketches directly, simply download using the command below. Alternatively, the user can download the raw data and perform sourmash sketch using the parameters of their own choice. For usage of sourmash, refer to the Appendix section.

Generate KO sketches from raw KO

Below is command by which we generated the KO sketches above, where $INPUT is the raw KO data downloaded from KEGG.

sourmash sketch fromfile -p protein,k=5,k=7,k=11,abund,scaled=${scaleFactor} -o ${OUTDIR}/KOs_sketched_scaled_${scaleFactor}.sig.zip $INPUT

Note: this process will take forever. It is advised to just download the precomputed ones.

Download sketched KO

wget -O "KOs_sketched_scaled_10.sig.zip" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESayLPk49QhIhmPExvgNyrIBHhSyk4hzkdg5dPNHMGL5Mg?download=1"

2. Creating pairwise KO distances

With the KOs sketched, we can also compute the pairwise distances among the KOs, which correspond to the leaf nodes of the underlying tree obtained from KEGG. The pairwise distances among the leaf nodes of the KO tree is computed based on AAI (average amino acid identity) using Sourmash. Here we use k=5.

sourmash compare --protein --no-dna -o ${DATADIR}/compare_k_${kSize} --ani -k ${kSize} -p 50 ${DATADIR}/KOs_sketched_scaled_10.sig.zip

The output of the above is a pairwise distance matrix for all the KOs, along with a text file telling you which KO corresponds to which row/column in the matrix. Eg. of such files are: KOs_sketched_scaled_10_compare_5 and KOs_sketched_scaled_10_compare_5.labels.txt, which can be downloaded below.

Download

wget -O "KOs_sketched_scaled_10_compare_5" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ESsA10lqAXtOrv0Z_mVVedABTBdgHLAXzeE3N8bZ6q9f7Q?download=1"
wget -O "KOs_sketched_scaled_10_compare_5.labels.txt" "https://pennstateoffice365-my.sharepoint.com/:t:/g/personal/akp6031_psu_edu/Ea4iI9gCtl5Nue0bVABSqGQBIR1cv79l3fUk2bHSv6pxMA?download=1"

3. Sample sketches

This procedure uses Sourmash sketch to convert the raw metagenomic samples in DNA sequences into protein sequences and creates a sketch for each sample. We provide some sample files to illustrate this process.

Download sample input data

wget -O "f1077_ihmp_IBD_MSM5LLGF_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EeFqDGl6lZZIgGFPme4GkBMBK20P_y0qrjqDWDuWnZ6ImA?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EaRRHDpkIK5Nid9pFf9BRAsB8Y-KT8lA4Zp1cdYv1eGJ1Q?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P.fastq" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/ER7xvoZtBSRKlw7YoYJrfZ8BiLkxdqa-43suUolSUS86Ng?download=1"

Sketch samples to produce functional profiles

sourmash sketch translate -f -p scaled=1000,k=5,abund f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz -o f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig
sourmash sketch translate -f -p scaled=1000,k=5,abund f2103_ihmp_IBD_HSM7J4Q3_P.fastq.gz -o f2103_ihmp_IBD_HSM7J4Q3_P.fastq.gz.sig
sourmash sketch translate -f -p scaled=1000,k=5,abund f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz -o f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz.sig

You can also directly download the output using the following command.

# outputs
wget -O "f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/Ebzry6ZzMrpAocLia1gb8zgBzSThqyvc7YZN_lcJ8Fkofg?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P.fastq.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EdSy4EMYhJVOmorMSHyOBCMBu5HshZMur5BKG49D7DgSAA?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P.fastq.sig" "https://pennstateoffice365-my.sharepoint.com/:u:/g/personal/akp6031_psu_edu/EQsLhGlmu09KuY6QDm6ymLMBi9PgIRADjL0ZLQCHSFEmwg?download=1"

4. sequence and KOs comparisons

Using the output of the previous process, we then produce the functional profile for each sample which will then be used as the input for functional UniFrac computation. The input data for this process include a sample sketch, which is the output of step 3, as well as the sketched KO, the output of step 1.

sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f3158_ihmp_IBD_PSM6XBW1_P.fastq.gz.sig KOs_sketched_scaled_10.sig.zip  -o f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv
sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f2103_ihmp_IBD_HSM7J4Q3_P.fastq.sig KOs_sketched_scaled_10.sig.zip  -o f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv
sourmash gather --protein -k 5 --estimate-ani-ci --threshold-bp 500 f1077_ihmp_IBD_MSM5LLGF_P.fastq.gz.sig KOs_sketched_scaled_10.sig.zip  -o f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv

The output of the above will be three sourmash gather files which you can also download directly as follows.

Download

wget -O "f1077_ihmp_IBD_MSM5LLGF_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EYe7vV9JDBhIgY0v0otSz8IBzJ9sp-LXFZQkM2qV7n5d0Q?download=1"
wget -O "f2103_ihmp_IBD_HSM7J4Q3_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/Ee-QM-tEoQhPsM-2sKOv3y4BdpHZWIJW_PcItrx7JPTQyQ?download=1"
wget -O "f3158_ihmp_IBD_PSM6XBW1_P_k_5_gather.csv" "https://pennstateoffice365-my.sharepoint.com/:x:/g/personal/akp6031_psu_edu/EVN4JF7jkVhGmdHoCKoSTY4Byz3ybRFAsVtBDSv2VM2wEQ?download=1"

Appendix

sourmash

Installation directions and usage of sourmash can be found at https://sourmash.readthedocs.io/en/latest/.