Merge pull request #5 from wanyuac/dev

Version 0.1.5
wanyuac · Jan 1, 2021 · d7b25a4 · d7b25a4
2 parents 2b9e3d5 + 799c1da
commit d7b25a4
Show file tree

Hide file tree

Showing 4 changed files with 263 additions and 118 deletions.
diff --git a/README.md b/README.md
@@ -14,7 +14,7 @@
     - [Gene detection from short reads](#run_ariba)
     - [Pooling allele sequences from all samples](#pool_seqs)
     - [Clustering pooled allele sequences](#seq_clustering)
-    - [Creating an allelic PAM from tabulated CD-HIT-EST output](#makePAM)
+    - [Creating an allelic PAM from the tabulated CD-HIT-EST output](#makePAM)
 
 <br/>
 
@@ -262,62 +262,76 @@ Users may also refer to `ariba_summary.csv` for compiled genetic profiles of all
 
 ### 3.3. Pooling allele sequences from all samples<a name = "pool_seqs" />
 
-This step uses script `pool_seqs.py`.
+This step uses script `compile_results.py` to compile ARIBA's outputs for individual samples. Specifically, the script performs two tasks:
 
-```bash
-python PAMmaker/ariba/pool_seqs.py -i ./gene/*_genes.fna -o alleles.fna -e '_genes.fna'
-
-# Parameters of the script
-python pool_seqs.py -h
-usage: pool_seqs.py [-h] -i I [I ...] [-o O] [-e E]
+- Create a table of identified alleles for all samples with columns: sample, cluster, allele, exact match or variant of the reference allele, percent of nucleotide identity, coverage of the assembled allele to its reference.
+- Pool FASTA files of ARIBA's output allele sequences into one file and append sample names to sequence IDs (in the same format as that for SRST2's output consensus sequences).
 
-Pool ARIBA's output allele sequences into one FASTA file and append sample names to sequence IDs
+**Example command**
 
-optional arguments:
-  -h, --help    show this help message and exit
-  -i I [I ...]  Input FASTA files
-  -o O          Output FASTA file of pooled allele sequences
-  -e E          Filename extension to be removed for sample names
+```bash
+python PAMmaker/ariba/compile_results.py --in_fastas gene/*_genes.fna --in_reports report/*_report.tsv --out_fasta alleles.fna --out_table report.tsv --ext_fastas '_genes.fna' --ext_reports '_report.tsv' 2> compile_results.err
 ```
 
 Output file: `alleles.fna`, a multi-FASTA file of pooled allele sequences. In this file, a sample name is appended to each sequence ID.
 
+Note that it is normal to see warnings (in file `compile_results.err`) about absence of alleles from the report file in the FASTA file, because alleles in the report file may not have adequate nucleotide identities or reference coverages following cut-offs set for ARIBA runs (By default, 90% for both cut-offs).
+
+**Parameters**
+
+```bash
+python compile_results.py --help
+
+  -h, --help            show this help message and exit
+  --in_fastas IN_FASTAS [IN_FASTAS ...]     Input FASTA files
+  --in_reports IN_REPORTS [IN_REPORTS ...]  Input report files
+  --out_fasta OUT_FASTA      Output FASTA file of pooled allele sequences
+  --out_table OUT_TABLE      Output table about identified alleles
+  --ext_fastas EXT_FASTAS    Filename extension of input FASTA files to be removed for sample names
+  --ext_reports EXT_REPORTS  Filename extension of input report files to be removed for sample names
+```
+
 
 
 ### 3.4. Clustering pooled allele sequences<a name = "seq_clustering" />
 
-Despite the demonstration below, it is not necessary to cluster alleles based on complete sequence identity. Users may want to tolerate a few mismatches for their study. This is a feature differing from the SRST2-based pipeline described in Section [2](#guide_srst2).
+Despite the demonstration that follows, it is not necessary to cluster alleles based on complete sequence identity. Users may want to tolerate a few mismatches for their study. This is a feature differing from the SRST2-based pipeline described in Section [2](#guide_srst2).
 
 ```bash
 # Clustering based on complete sequence identity
-cd-hit-est -i alleles.fna -o alleles_rep.fna -c 1.0 -d 0 -s 0.0 -aL 1.0 -aS 1.0 -g 1
+cd-hit-est -i alleles.fna -o alleles_rep.fna -d 0 -c 1.0 -s 1.0
 
 # Tabulate clustering results from cd-hit-est
 python PAMmaker/utility/tabulate_cdhit.py alleles_rep.fna.clstr > alleles_rep.fna.clstr.tsv
 ```
 
 
 
-### 3.5. Creating an allelic PAM from tabulated CD-HIT-EST output<a name = "makePAM" />
+### 3.5. Creating an allelic PAM from the tabulated CD-HIT-EST output<a name = "makePAM" />
 
-```bash
-python PAMmaker/ariba/clusters2pam.py -i alleles_rep.fna.clstr.tsv -om allelic_PAM.tsv -ot alleles_rep.fna.clstr_updated.tsv
+This final step is carried out by script `clusters2pam.py`. Each allele in the output table is named after its best match in the reference database. An extended allele identifier `.var*` is added to the allele label when there is no perfect match to any of reference alleles.
 
-# Script parameters
-python clusters2pam.py -h
-usage: clusters2pam.py [-h] -i I [-om OM] [-ot OT]
 
-Creating an allelic presence-absence matrix from a table of sequence clusters
 
-optional arguments:
-  -h, --help  show this help message and exit
-  -i I        Input FASTA files
-  -om OM      Output presence-absence matrix in TSV format
-  -ot OT      Output table about sequence clusters
+**Example command**
+
+```bash
+python PAMmaker/ariba/clusters2pam.py -i alleles_rep.fna.clstr.tsv -om allelic_PAM.tsv -ot alleles_rep.fna.clstr_updated.tsv
 ```
 
 Outputs of this example:
 
 - `allelic_PAM.tsv`: The objective allelic PAM;
 - `alleles_rep.fna.clstr_updated.tsv`: A table of clustering information, including cluster IDs (`gene`) and sample names.
 
+**Parameters**
+
+```bash
+python clusters2pam.py --help
+
+  -h, --help  show this help message and exit
+  -i I        Input tab-delimited table of clusters from CD-HIT-EST
+  -om OM      Output presence-absence matrix in TSV format
+  -ot OT      Output table about sequence clusters and allele labels
+```
+
diff --git a/ariba/clusters2pam.py b/ariba/clusters2pam.py
@@ -2,7 +2,9 @@
 
 """
 Converts output TSV of utility/tabulate_cdhit.py to an allelic presence-absence matrix (PAM) and saves
-it in TSV format.
+it in TSV format. This script assumes consensus sequences of ARIBA hits are clustered under a minimum
+nucleotide identity that is sufficently high (e.g., 100%) so reference sequences are the same within
+the same cluster.
 
 Command demonstration:
     python clusters2pam.py -i alleles.clstr.tsv -om alleles_pam.tsv -ot alleles_clstr_updated.tsv
@@ -11,7 +13,7 @@
 
 Copyright (C) 2020 Yu Wan <wanyuac@126.com>
 Licensed under the GNU General Public Licence version 3 (GPLv3) <https://www.gnu.org/licenses/>.
-Publication: 11 Nov 2020; the latest modification: 12 Nov 2020
+Publication: 11 Nov 2020; the latest modification: 30 Dec 2020
 """
 
 import os
@@ -22,15 +24,17 @@
 def parse_arguments():
     parser = ArgumentParser(description = "Creating an allelic presence-absence matrix from a table of sequence clusters",\
              epilog = "This is a helper script of R package GeneMates.")
-    parser.add_argument("-i", dest = "i", type = str, required = True, help = "Input FASTA files")
+    parser.add_argument("-i", dest = "i", type = str, required = True, help = "Input tab-delimited table of clusters from CD-HIT-EST")
     parser.add_argument("-om", dest = "om", type = str, required = False, default = "allelic_PAM.tsv", help = "Output presence-absence matrix in TSV format")
-    parser.add_argument("-ot", dest = "ot", type = str, required = False, default = "clusters.tsv", help = "Output table about sequence clusters")
+    parser.add_argument("-ot", dest = "ot", type = str, required = False, default = "clusters.tsv", help = "Output table about sequence clusters and allele labels")
+
     return parser.parse_args()
 
+
 def main():
     args = parse_arguments()
 
-    # Import the input TSV file as a data frame
+    # Import the input TSV file and make a nine-column data frame
     input_tsv = args.i
     if os.path.exists(input_tsv):
         clusters = parse_seqid(pandas.read_csv(input_tsv, sep = "\t", index_col = None))  # Import the table as a data frame of six columns and parse column 'seqid'
@@ -39,66 +43,114 @@ def main():
         sys.exit(1)
 
     # Assign allele IDs and create an allelic PAM from data frame "clusters"
-    pam = create_allelic_pam(clusters)
+    pam, tab = create_allelic_pam(clusters)
     pam.to_csv(args.om, header = True, index = False, sep = "\t", float_format = None)
-    clusters.to_csv(args.ot, header = True, index = False, sep = "\t", float_format = None)
+    tab.to_csv(args.ot, header = True, index = False, sep = "\t", float_format = None)
+
     return
 
+
 def parse_seqid(df):
     """ 
     Parses column 'seqid' into two new columns and returns a data frame of eight columns.
     """
     seqids = df["seqid"].tolist()  # Convert a column into a list
-    genes = list()
-    samples = list()
-
-    # Create two lists from one column
+    ref_alleles = list()  # Names of reference alleles
+    var_code = list()  # A binary list
+    samples = list()  # Sample names
+
+    # Create three lists from one column
     for item in seqids:
-        g, s = item.split("|")
-        genes.append(g)
+        a, v, s = item.split("|")  # Reference allele, variant code, sample name
+        ref_alleles.append(a)
+        var_code.append(int(v))  # Otherwise, each var_code is a character.
         samples.append(s)
-
-    # Append lists to the data frame as columns    
-    df["gene"] = genes
+
+    # Append these new lists to the data frame as additional columns
     df["sample"] = samples
+    df["ref_allele"] = ref_alleles
+    df["variant"] = var_code
+
     return df
 
+
 def create_allelic_pam(df):
     """
     Assign allele IDs and create an allelic PAM based on clustering results.
+    Algorithm for assigning an allele ID for each cluster:
+        1. Use the name of the reference allele having exactly matched hits; (So the algorithm deals with scenarios
+           where multiple reference alleles are present in the same cluster)
+        2. Else, add an extended index (1, 2, 3, ...) to the allele name.
     """
-    samples = get_unique_ids(df, "sample")
-    cluster_ids = get_unique_ids(df, "cluster")  # 0, 1, 2, ...
-    genes_visited = dict()  # A dictionary of genes in processed clusters. {gene : number of clusters}
+    cluster_indices = get_unique_elements(df, "cluster")  # 0, 1, 2, ...
+    samples = get_unique_elements(df, "sample")  # All samples from ARIBA results
+    variants = dict()  # A dictionary of variants encountered. {referance allele : count of unique variants}
     pam = pandas.DataFrame(samples, columns = ["sample"])  # Initalise the output PAM
-
-    # Create a list about presence-absence of each allele and combine it to the output PAM as a column
-    for c in cluster_ids:  # Type of elements: numpy.int64
-        df_c = df[df["cluster"] == c]  # Select rows of the current cluster
-        genes_c = df_c["gene"].tolist()  # All gene names should be the same when alleles are clustered under 100% nucleotide identity
-        gene = genes_c[0]
-        if gene in genes_visited.keys():
-            genes_visited[gene] += 1
-            allele = gene + "." + str(genes_visited[gene])  # Adding a suffix for making an allele name. Example result: sul1.1, sul1.2.
-        else:
-            genes_visited[gene] = 0  # Record a new gene encountered
-            allele = gene  # The first allele of the gene will not have an extended index appended.
-        pa_vec = list()  # A binary vector about presence-absence of the current allele across samples
-        samples_c = df_c["sample"].tolist()  # Samples in which the current allele is detected
+    tab = pandas.DataFrame([], columns = list(df.columns))  # Create an empty data frame of specific columns
+
+    # Populate data frames 'tab' and 'pam'
+    for c in cluster_indices:  # Type of elements: numpy.int64; Iterate through every cluster index.
+        df_c = df[df["cluster"] == c]  # Extract rows of the current cluster
+        df_c.reset_index(drop = True)
+
+        # Determine the reference allele for the current cluster
+        ref_allele, var_code = determine_ref_allele(df_c[["ref_allele", "variant"]], c)  # Reference-allele name and variant code for the current cluster
+        if var_code == 0:  # Some or all hits are exact matches to the chosen reference allele
+            a = ref_allele
+        else:  # Approximate match: append an extended allele identifier to the reference-allele name
+            if ref_allele in variants.keys():
+                variants[ref_allele] += 1
+                a = ref_allele + ".var" + str(variants[ref_allele])  # Adding a suffix for making an allele name. Example result: sul1.1, sul1.2.
+            else:
+                variants[ref_allele] = 1  # First variant of the reference allele is encountered
+                a = ref_allele + ".var1"
+        df_c = df_c.assign(allele = [a for i in range(0, df_c.shape[0])])  # To append a column to the data frame. The shape method returns row and column counts. Do not use pandas.Series, which does not work correctly here.
+        tab = pandas.concat([tab, df_c], ignore_index = True, sort = False)  # Ignoring indexes on the concatenation axis: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
+
+        # Add a column to the PAM
+        pa_vec = list()  # Create a binary list about presence-absence of the current allele
+        samples_c = get_unique_elements(df_c, "sample")  # Samples in the current sequence cluster
         for s in samples:
             pa = 1 if s in samples_c else 0
             pa_vec.append(pa)
-        pam[allele] = pa_vec
-    return pam
+        pam[a] = pa_vec
+
+    return pam, tab
+
 
-def get_unique_ids(df, col_name):
+def get_unique_elements(df, col_name):
     """
     A subordinate function of create_allelic_pam for getting a list of unique and sorted values from a
     given column of input data frame df.
     """
-    ids = list(df[col_name].unique())
-    ids.sort(reverse = False)
-    return ids
+    vals = list(df[col_name].unique())
+    vals.sort(reverse = False)
+
+    return vals
+
+
+def determine_ref_allele(df, cluster_index):
+    """
+    This is a subordinate function of create_allelic_pam and it returns the first element in a list of
+    allele names extracted from a column of data frame df_c["ref_allele"]. All names of reference alleles
+    should be the same when hits are clustered under 100% nucleotide identity and a warning is raised when
+    this is not the case.
+    """
+    # Check whether the current cluster comprises hits to more than one reference alleles
+    if len(set(df["ref_allele"].tolist())) > 1:  # Normally, this is a rare scenario.
+        print("Warning: reference alleles in cluster %d differ." % cluster_index)
+
+    # Choose the first reference allele having an exact hit
+    exact_matches = df.loc[df["variant"] == 0]  # 0 (exact match) or 1 (variant)
+    if len(exact_matches) > 0:  # There is at least one exact hit
+        allele = exact_matches["ref_allele"].values[0]  # Get the value of the first cell in this column
+        var_code = 0
+    else:  # No exact match is present
+        allele = df["ref_allele"].values[0]
+        var_code = 1
+
+    return allele, var_code
+
 
 if __name__ == "__main__":
     main()