Skip to content
This repository has been archived by the owner on Apr 13, 2024. It is now read-only.

Commit

Permalink
Merge pull request #5 from wanyuac/dev
Browse files Browse the repository at this point in the history
Version 0.1.5
  • Loading branch information
wanyuac authored Jan 1, 2021
2 parents 2b9e3d5 + 799c1da commit d7b25a4
Show file tree
Hide file tree
Showing 4 changed files with 263 additions and 118 deletions.
70 changes: 42 additions & 28 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,7 +14,7 @@
- [Gene detection from short reads](#run_ariba)
- [Pooling allele sequences from all samples](#pool_seqs)
- [Clustering pooled allele sequences](#seq_clustering)
- [Creating an allelic PAM from tabulated CD-HIT-EST output](#makePAM)
- [Creating an allelic PAM from the tabulated CD-HIT-EST output](#makePAM)

<br/>

Expand Down Expand Up @@ -262,62 +262,76 @@ Users may also refer to `ariba_summary.csv` for compiled genetic profiles of all

### 3.3. Pooling allele sequences from all samples<a name = "pool_seqs" />

This step uses script `pool_seqs.py`.
This step uses script `compile_results.py` to compile ARIBA's outputs for individual samples. Specifically, the script performs two tasks:

```bash
python PAMmaker/ariba/pool_seqs.py -i ./gene/*_genes.fna -o alleles.fna -e '_genes.fna'

# Parameters of the script
python pool_seqs.py -h
usage: pool_seqs.py [-h] -i I [I ...] [-o O] [-e E]
- Create a table of identified alleles for all samples with columns: sample, cluster, allele, exact match or variant of the reference allele, percent of nucleotide identity, coverage of the assembled allele to its reference.
- Pool FASTA files of ARIBA's output allele sequences into one file and append sample names to sequence IDs (in the same format as that for SRST2's output consensus sequences).

Pool ARIBA's output allele sequences into one FASTA file and append sample names to sequence IDs
**Example command**

optional arguments:
-h, --help show this help message and exit
-i I [I ...] Input FASTA files
-o O Output FASTA file of pooled allele sequences
-e E Filename extension to be removed for sample names
```bash
python PAMmaker/ariba/compile_results.py --in_fastas gene/*_genes.fna --in_reports report/*_report.tsv --out_fasta alleles.fna --out_table report.tsv --ext_fastas '_genes.fna' --ext_reports '_report.tsv' 2> compile_results.err
```

Output file: `alleles.fna`, a multi-FASTA file of pooled allele sequences. In this file, a sample name is appended to each sequence ID.

Note that it is normal to see warnings (in file `compile_results.err`) about absence of alleles from the report file in the FASTA file, because alleles in the report file may not have adequate nucleotide identities or reference coverages following cut-offs set for ARIBA runs (By default, 90% for both cut-offs).

**Parameters**

```bash
python compile_results.py --help

-h, --help show this help message and exit
--in_fastas IN_FASTAS [IN_FASTAS ...] Input FASTA files
--in_reports IN_REPORTS [IN_REPORTS ...] Input report files
--out_fasta OUT_FASTA Output FASTA file of pooled allele sequences
--out_table OUT_TABLE Output table about identified alleles
--ext_fastas EXT_FASTAS Filename extension of input FASTA files to be removed for sample names
--ext_reports EXT_REPORTS Filename extension of input report files to be removed for sample names
```



### 3.4. Clustering pooled allele sequences<a name = "seq_clustering" />

Despite the demonstration below, it is not necessary to cluster alleles based on complete sequence identity. Users may want to tolerate a few mismatches for their study. This is a feature differing from the SRST2-based pipeline described in Section [2](#guide_srst2).
Despite the demonstration that follows, it is not necessary to cluster alleles based on complete sequence identity. Users may want to tolerate a few mismatches for their study. This is a feature differing from the SRST2-based pipeline described in Section [2](#guide_srst2).

```bash
# Clustering based on complete sequence identity
cd-hit-est -i alleles.fna -o alleles_rep.fna -c 1.0 -d 0 -s 0.0 -aL 1.0 -aS 1.0 -g 1
cd-hit-est -i alleles.fna -o alleles_rep.fna -d 0 -c 1.0 -s 1.0

# Tabulate clustering results from cd-hit-est
python PAMmaker/utility/tabulate_cdhit.py alleles_rep.fna.clstr > alleles_rep.fna.clstr.tsv
```



### 3.5. Creating an allelic PAM from tabulated CD-HIT-EST output<a name = "makePAM" />
### 3.5. Creating an allelic PAM from the tabulated CD-HIT-EST output<a name = "makePAM" />

```bash
python PAMmaker/ariba/clusters2pam.py -i alleles_rep.fna.clstr.tsv -om allelic_PAM.tsv -ot alleles_rep.fna.clstr_updated.tsv
This final step is carried out by script `clusters2pam.py`. Each allele in the output table is named after its best match in the reference database. An extended allele identifier `.var*` is added to the allele label when there is no perfect match to any of reference alleles.

# Script parameters
python clusters2pam.py -h
usage: clusters2pam.py [-h] -i I [-om OM] [-ot OT]

Creating an allelic presence-absence matrix from a table of sequence clusters

optional arguments:
-h, --help show this help message and exit
-i I Input FASTA files
-om OM Output presence-absence matrix in TSV format
-ot OT Output table about sequence clusters
**Example command**

```bash
python PAMmaker/ariba/clusters2pam.py -i alleles_rep.fna.clstr.tsv -om allelic_PAM.tsv -ot alleles_rep.fna.clstr_updated.tsv
```

Outputs of this example:

- `allelic_PAM.tsv`: The objective allelic PAM;
- `alleles_rep.fna.clstr_updated.tsv`: A table of clustering information, including cluster IDs (`gene`) and sample names.

**Parameters**

```bash
python clusters2pam.py --help

-h, --help show this help message and exit
-i I Input tab-delimited table of clusters from CD-HIT-EST
-om OM Output presence-absence matrix in TSV format
-ot OT Output table about sequence clusters and allele labels
```

130 changes: 91 additions & 39 deletions ariba/clusters2pam.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@

"""
Converts output TSV of utility/tabulate_cdhit.py to an allelic presence-absence matrix (PAM) and saves
it in TSV format.
it in TSV format. This script assumes consensus sequences of ARIBA hits are clustered under a minimum
nucleotide identity that is sufficently high (e.g., 100%) so reference sequences are the same within
the same cluster.
Command demonstration:
python clusters2pam.py -i alleles.clstr.tsv -om alleles_pam.tsv -ot alleles_clstr_updated.tsv
Expand All @@ -11,7 +13,7 @@
Copyright (C) 2020 Yu Wan <wanyuac@126.com>
Licensed under the GNU General Public Licence version 3 (GPLv3) <https://www.gnu.org/licenses/>.
Publication: 11 Nov 2020; the latest modification: 12 Nov 2020
Publication: 11 Nov 2020; the latest modification: 30 Dec 2020
"""

import os
Expand All @@ -22,15 +24,17 @@
def parse_arguments():
parser = ArgumentParser(description = "Creating an allelic presence-absence matrix from a table of sequence clusters",\
epilog = "This is a helper script of R package GeneMates.")
parser.add_argument("-i", dest = "i", type = str, required = True, help = "Input FASTA files")
parser.add_argument("-i", dest = "i", type = str, required = True, help = "Input tab-delimited table of clusters from CD-HIT-EST")
parser.add_argument("-om", dest = "om", type = str, required = False, default = "allelic_PAM.tsv", help = "Output presence-absence matrix in TSV format")
parser.add_argument("-ot", dest = "ot", type = str, required = False, default = "clusters.tsv", help = "Output table about sequence clusters")
parser.add_argument("-ot", dest = "ot", type = str, required = False, default = "clusters.tsv", help = "Output table about sequence clusters and allele labels")

return parser.parse_args()


def main():
args = parse_arguments()

# Import the input TSV file as a data frame
# Import the input TSV file and make a nine-column data frame
input_tsv = args.i
if os.path.exists(input_tsv):
clusters = parse_seqid(pandas.read_csv(input_tsv, sep = "\t", index_col = None)) # Import the table as a data frame of six columns and parse column 'seqid'
Expand All @@ -39,66 +43,114 @@ def main():
sys.exit(1)

# Assign allele IDs and create an allelic PAM from data frame "clusters"
pam = create_allelic_pam(clusters)
pam, tab = create_allelic_pam(clusters)
pam.to_csv(args.om, header = True, index = False, sep = "\t", float_format = None)
clusters.to_csv(args.ot, header = True, index = False, sep = "\t", float_format = None)
tab.to_csv(args.ot, header = True, index = False, sep = "\t", float_format = None)

return


def parse_seqid(df):
"""
Parses column 'seqid' into two new columns and returns a data frame of eight columns.
"""
seqids = df["seqid"].tolist() # Convert a column into a list
genes = list()
samples = list()

# Create two lists from one column
ref_alleles = list() # Names of reference alleles
var_code = list() # A binary list
samples = list() # Sample names

# Create three lists from one column
for item in seqids:
g, s = item.split("|")
genes.append(g)
a, v, s = item.split("|") # Reference allele, variant code, sample name
ref_alleles.append(a)
var_code.append(int(v)) # Otherwise, each var_code is a character.
samples.append(s)

# Append lists to the data frame as columns
df["gene"] = genes

# Append these new lists to the data frame as additional columns
df["sample"] = samples
df["ref_allele"] = ref_alleles
df["variant"] = var_code

return df


def create_allelic_pam(df):
"""
Assign allele IDs and create an allelic PAM based on clustering results.
Algorithm for assigning an allele ID for each cluster:
1. Use the name of the reference allele having exactly matched hits; (So the algorithm deals with scenarios
where multiple reference alleles are present in the same cluster)
2. Else, add an extended index (1, 2, 3, ...) to the allele name.
"""
samples = get_unique_ids(df, "sample")
cluster_ids = get_unique_ids(df, "cluster") # 0, 1, 2, ...
genes_visited = dict() # A dictionary of genes in processed clusters. {gene : number of clusters}
cluster_indices = get_unique_elements(df, "cluster") # 0, 1, 2, ...
samples = get_unique_elements(df, "sample") # All samples from ARIBA results
variants = dict() # A dictionary of variants encountered. {referance allele : count of unique variants}
pam = pandas.DataFrame(samples, columns = ["sample"]) # Initalise the output PAM

# Create a list about presence-absence of each allele and combine it to the output PAM as a column
for c in cluster_ids: # Type of elements: numpy.int64
df_c = df[df["cluster"] == c] # Select rows of the current cluster
genes_c = df_c["gene"].tolist() # All gene names should be the same when alleles are clustered under 100% nucleotide identity
gene = genes_c[0]
if gene in genes_visited.keys():
genes_visited[gene] += 1
allele = gene + "." + str(genes_visited[gene]) # Adding a suffix for making an allele name. Example result: sul1.1, sul1.2.
else:
genes_visited[gene] = 0 # Record a new gene encountered
allele = gene # The first allele of the gene will not have an extended index appended.
pa_vec = list() # A binary vector about presence-absence of the current allele across samples
samples_c = df_c["sample"].tolist() # Samples in which the current allele is detected
tab = pandas.DataFrame([], columns = list(df.columns)) # Create an empty data frame of specific columns

# Populate data frames 'tab' and 'pam'
for c in cluster_indices: # Type of elements: numpy.int64; Iterate through every cluster index.
df_c = df[df["cluster"] == c] # Extract rows of the current cluster
df_c.reset_index(drop = True)

# Determine the reference allele for the current cluster
ref_allele, var_code = determine_ref_allele(df_c[["ref_allele", "variant"]], c) # Reference-allele name and variant code for the current cluster
if var_code == 0: # Some or all hits are exact matches to the chosen reference allele
a = ref_allele
else: # Approximate match: append an extended allele identifier to the reference-allele name
if ref_allele in variants.keys():
variants[ref_allele] += 1
a = ref_allele + ".var" + str(variants[ref_allele]) # Adding a suffix for making an allele name. Example result: sul1.1, sul1.2.
else:
variants[ref_allele] = 1 # First variant of the reference allele is encountered
a = ref_allele + ".var1"
df_c = df_c.assign(allele = [a for i in range(0, df_c.shape[0])]) # To append a column to the data frame. The shape method returns row and column counts. Do not use pandas.Series, which does not work correctly here.
tab = pandas.concat([tab, df_c], ignore_index = True, sort = False) # Ignoring indexes on the concatenation axis: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

# Add a column to the PAM
pa_vec = list() # Create a binary list about presence-absence of the current allele
samples_c = get_unique_elements(df_c, "sample") # Samples in the current sequence cluster
for s in samples:
pa = 1 if s in samples_c else 0
pa_vec.append(pa)
pam[allele] = pa_vec
return pam
pam[a] = pa_vec

return pam, tab


def get_unique_ids(df, col_name):
def get_unique_elements(df, col_name):
"""
A subordinate function of create_allelic_pam for getting a list of unique and sorted values from a
given column of input data frame df.
"""
ids = list(df[col_name].unique())
ids.sort(reverse = False)
return ids
vals = list(df[col_name].unique())
vals.sort(reverse = False)

return vals


def determine_ref_allele(df, cluster_index):
"""
This is a subordinate function of create_allelic_pam and it returns the first element in a list of
allele names extracted from a column of data frame df_c["ref_allele"]. All names of reference alleles
should be the same when hits are clustered under 100% nucleotide identity and a warning is raised when
this is not the case.
"""
# Check whether the current cluster comprises hits to more than one reference alleles
if len(set(df["ref_allele"].tolist())) > 1: # Normally, this is a rare scenario.
print("Warning: reference alleles in cluster %d differ." % cluster_index)

# Choose the first reference allele having an exact hit
exact_matches = df.loc[df["variant"] == 0] # 0 (exact match) or 1 (variant)
if len(exact_matches) > 0: # There is at least one exact hit
allele = exact_matches["ref_allele"].values[0] # Get the value of the first cell in this column
var_code = 0
else: # No exact match is present
allele = df["ref_allele"].values[0]
var_code = 1

return allele, var_code


if __name__ == "__main__":
main()
Loading

0 comments on commit d7b25a4

Please sign in to comment.