Feat: add rank lineage #130

Merged
merged 8 commits into from
Aug 19, 2023
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
### Added

- Added classifier support for KMCP profiles (#129).
- Added a command-line option `--add-rank-lineage` to the `standardise` and
  `merge` commands, which inserts a new column `rank_lineage` into the results
  that contains the ranks as semi-colon-separated strings (#130).

## [0.4.1] - (2023-07-13)

13 changes: 13 additions & 0 deletions Makefile
@@ -42,6 +42,19 @@ docs/tutorials/2612_se-ERR5766180-db_mOTU.out:
docs/tutorials/2612_pe-ERR5766176-db1.kraken2.report.txt:
	cd docs/tutorials && curl -O https://raw.githubusercontent.com/taxprofiler/taxpasta/dev/tests/data/kraken2/2612_pe-ERR5766176-db1.kraken2.report.txt

taxonomy_directory := tests/data/taxonomy
## Generate test files
test-setup: $(taxonomy_directory)

# Running this command assumes that taxonkit
# (https://bioinf.shenwei.me/taxonkit/) has been installed beforehand and that
# the NCBI taxonomy has been downloaded.
$(taxonomy_directory):
	taxonkit list --ids '160488,511145,889517' --indent "" \
	| taxonkit reformat --taxid-field 1 --output-ambiguous-result --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
	| cut --fields=2-8 \
	| taxonkit create-taxdump --out-dir "$(taxonomy_directory)" --force --rank-names "superkingdom,phylum,class,order,family,genus,species"

################################################################################
# Self Documenting Commands #
################################################################################
8 changes: 8 additions & 0 deletions src/taxpasta/domain/service/taxonomy_service.py
@@ -55,6 +55,10 @@ def get_taxon_name_lineage(self, taxonomy_id: int) -> Optional[List[str]]:
    def get_taxon_identifier_lineage(self, taxonomy_id: int) -> Optional[List[int]]:
        """Return the lineage of a given taxonomy identifier as identifiers."""

    @abstractmethod
    def get_taxon_rank_lineage(self, taxonomy_id: int) -> Optional[List[str]]:
        """Return the lineage of a given taxonomy identifier as ranks."""

    @abstractmethod
    def add_name(self, table: DataFrame[ResultTable]) -> DataFrame[ResultTable]:
        """Add a column for the taxon name to the given table."""
@@ -73,6 +77,10 @@ def add_identifier_lineage(
    ) -> DataFrame[ResultTable]:
        """Add a column for the taxon lineage as identifiers to the given table."""

    @abstractmethod
    def add_rank_lineage(self, table: DataFrame[ResultTable]) -> DataFrame[ResultTable]:
        """Add a column for the taxon lineage as ranks to the given table."""

    @abstractmethod
    def summarise_at(
        self, profile: DataFrame[StandardProfile], rank: str
17 changes: 17 additions & 0 deletions src/taxpasta/infrastructure/cli/merge.py
@@ -302,6 +302,12 @@ def merge(
        help="Add the taxon's entire lineage to the output. These are taxon "
        "identifiers separated by semi-colons.",
    ),
    rank_lineage: bool = typer.Option(  # noqa: B008
        False,
        "--add-rank-lineage",
        help="Add the taxon's entire rank lineage to the output. These are taxon "
        "ranks separated by semi-colons.",
    ),
) -> None:
    """Standardise and merge two or more taxonomic profiles."""
    # Perform input validation.
@@ -381,6 +387,14 @@ def merge(
            )
            raise typer.Exit(code=2)

    if rank_lineage:
        if taxonomy is None:
            logger.critical(
                "The '--add-rank-lineage' option requires a taxonomy. Please "
                "provide one using the option '--taxonomy'."
            )
            raise typer.Exit(code=2)

    # Ensure that we can write to the output directory.
    try:
        output.parent.mkdir(parents=True, exist_ok=True)
@@ -433,6 +447,9 @@ def merge(

    # The order of the following conditions is chosen specifically to yield a pleasant
    # output format.
    if rank_lineage and valid_output_format is not WideObservationTableFileFormat.BIOM:
        assert taxonomy_service is not None  # nosec assert_used
        result = taxonomy_service.add_rank_lineage(result)

    if id_lineage and valid_output_format is not WideObservationTableFileFormat.BIOM:
        assert taxonomy_service is not None  # nosec assert_used
20 changes: 20 additions & 0 deletions src/taxpasta/infrastructure/cli/standardise.py
@@ -158,6 +158,12 @@ def standardise(
        help="Add the taxon's entire lineage to the output. These are taxon "
        "identifiers separated by semi-colons.",
    ),
    add_rank_lineage: bool = typer.Option(  # noqa: B008
        False,
        "--add-rank-lineage",
        help="Add the taxon's entire rank lineage to the output. These are taxon "
        "ranks separated by semi-colons.",
    ),
) -> None:
    """Standardise a taxonomic profile."""
    # Perform input validation.
@@ -213,6 +219,14 @@ def standardise(
            )
            raise typer.Exit(code=2)

    if add_rank_lineage:
        if taxonomy is None:
            logger.critical(
                "The '--add-rank-lineage' option requires a taxonomy. Please "
                "provide one using the option '--taxonomy'."
            )
            raise typer.Exit(code=2)

    # Ensure that we can write to the output directory.
    try:
        output.parent.mkdir(parents=True, exist_ok=True)
@@ -240,6 +254,12 @@ def standardise(

    # The order of the following conditions is chosen specifically to yield a pleasant
    # output format.
    if add_rank_lineage:
        assert taxonomy_service is not None  # nosec assert_used
        result = Sample(
            name=result.name,
            profile=taxonomy_service.add_rank_lineage(result.profile),
        )

    if add_id_lineage:
        assert taxonomy_service is not None  # nosec assert_used
Original file line number Diff line number Diff line change
@@ -83,6 +83,14 @@ def get_taxon_identifier_lineage(self, taxonomy_id: int) -> Optional[List[int]]:
            return None
        return taxon.taxid_lineage

    def get_taxon_rank_lineage(self, taxonomy_id: int) -> Optional[List[str]]:
        """Return the lineage of a given taxonomy identifier as ranks."""
        try:
            taxon = taxopy.Taxon(taxid=taxonomy_id, taxdb=self._tax_db)
        except TaxidError:
            return None
        return list(taxon.rank_name_dictionary.keys())

    def add_name(self, table: DataFrame[ResultTable]) -> DataFrame[ResultTable]:
        """Add a column for the taxon name to the given table."""
        result = table.copy()
@@ -141,6 +149,24 @@ def _taxid_lineage_as_str(self, taxonomy_id: int) -> Optional[str]:
            return None
        return ";".join([str(tax_id) for tax_id in taxon.taxid_lineage])

    def add_rank_lineage(self, table: DataFrame[ResultTable]) -> DataFrame[ResultTable]:
        """Add a column for the taxon lineage as ranks to the given table."""
        result = table.copy()
        result.insert(
            1,
            "rank_lineage",
            table.taxonomy_id.map(self._rank_lineage_as_str),
        )
        return result

    def _rank_lineage_as_str(self, taxonomy_id: int) -> Optional[str]:
        """Return the rank lineage of a taxon as concatenated ranks."""
        try:
            taxon = taxopy.Taxon(taxid=taxonomy_id, taxdb=self._tax_db)
        except TaxidError:
            return None
        return ";".join(taxon.rank_name_dictionary.keys())

    def summarise_at(
        self, profile: DataFrame[StandardProfile], rank: str
    ) -> DataFrame[StandardProfile]:
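
Reviewer note: the behaviour of `add_rank_lineage` and `_rank_lineage_as_str` can be illustrated without taxopy or pandas. The sketch below is only an approximation under stated assumptions: `RANKS_BY_TAXON` is a hypothetical in-memory stand-in for the taxonomy database, rows are plain dictionaries rather than a `DataFrame`, and the order of the ranks is illustrative only. It shows the three points visible in the diff: ranks are joined with semi-colons, an unknown identifier yields `None`, and the new column lands at position 1 of a copy of the table.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for the taxonomy database (an assumption of this sketch):
# taxonomy identifier -> ranks along its lineage. The order is illustrative only.
RANKS_BY_TAXON: Dict[int, List[str]] = {
    1945799576: ["species", "genus", "family", "order", "class", "phylum", "superkingdom"],
    1187493883: ["genus", "family", "order", "class", "phylum", "superkingdom"],
}


def rank_lineage_as_str(taxonomy_id: int) -> Optional[str]:
    """Return the ranks joined by semi-colons, or None for an unknown identifier."""
    ranks = RANKS_BY_TAXON.get(taxonomy_id)
    if ranks is None:
        return None
    return ";".join(ranks)


def add_rank_lineage(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return copies of the rows with a 'rank_lineage' entry at position 1."""
    result = []
    for row in rows:
        items = list(row.items())
        items.insert(1, ("rank_lineage", rank_lineage_as_str(row["taxonomy_id"])))
        result.append(dict(items))
    return result


rows = [
    {"taxonomy_id": 1945799576, "count": 42},
    {"taxonomy_id": 999, "count": 7},  # not present in the stand-in database
]
table = add_rank_lineage(rows)
for row in table:
    print(row)
```

The input rows are left unmodified, mirroring the `table.copy()` call in the diff.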
6 changes: 6 additions & 0 deletions tests/conftest.py
@@ -101,3 +101,9 @@ def ganon_data_dir(data_dir: Path) -> Path:
def kmcp_data_dir(data_dir: Path) -> Path:
    """Provide the path to the KMCP data directory."""
    return data_dir / "kmcp"


@pytest.fixture(scope="session")
def taxonomy_data_dir(data_dir: Path) -> Path:
    """Provide the path to the taxonomy data directory."""
    return data_dir / "taxonomy"
Empty file.
Empty file added tests/data/taxonomy/merged.dmp
Empty file.
19 changes: 19 additions & 0 deletions tests/data/taxonomy/names.dmp
@@ -0,0 +1,19 @@
1 | root | | scientific name |
86398254 | Pseudomonadales | | scientific name |
87250111 | Enterobacteriaceae | | scientific name |
329474883 | Gammaproteobacteria | | scientific name |
432158898 | Ascomycota | | scientific name |
476817098 | Eukaryota | | scientific name |
492356122 | Saccharomyces cerevisiae | | scientific name |
536329594 | Saccharomycetales | | scientific name |
609216830 | Bacteria | | scientific name |
615773024 | Saccharomycetaceae | | scientific name |
933264868 | Saccharomyces | | scientific name |
1012954932 | Enterobacterales | | scientific name |
1187493883 | Escherichia | | scientific name |
1199096325 | Saccharomycetes | | scientific name |
1478401337 | Pseudomonadaceae | | scientific name |
1616653803 | Pseudomonas | | scientific name |
1641076285 | Proteobacteria | | scientific name |
1887621118 | Pseudomonas putida | | scientific name |
1945799576 | Escherichia coli | | scientific name |
19 changes: 19 additions & 0 deletions tests/data/taxonomy/nodes.dmp
@@ -0,0 +1,19 @@
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | |
86398254 | 329474883 | order | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
87250111 | 1012954932 | family | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
329474883 | 1641076285 | class | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
432158898 | 476817098 | phylum | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
476817098 | 1 | superkingdom | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
492356122 | 933264868 | species | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
536329594 | 1199096325 | order | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
609216830 | 1 | superkingdom | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
615773024 | 536329594 | family | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
933264868 | 615773024 | genus | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1012954932 | 329474883 | order | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1187493883 | 87250111 | genus | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1199096325 | 432158898 | class | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1478401337 | 86398254 | family | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1616653803 | 1478401337 | genus | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1641076285 | 609216830 | phylum | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1887621118 | 1616653803 | species | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
1945799576 | 1187493883 | species | XX | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |
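
Reviewer note: to see which rank lineage this test taxonomy encodes, the parent pointers in `nodes.dmp` can be followed by hand. The sketch below is a standard-library approximation, not how taxopy is implemented. It embeds only the first three columns (identifier, parent identifier, rank) of the *Escherichia coli* lineage from the rows above. The helper names `parse_nodes` and `rank_lineage` are hypothetical, and both the leaf-to-root order and the inclusion of the root's `no rank` entry are choices of this sketch that may differ from what `rank_name_dictionary` returns.

```python
from typing import Dict, List, Optional, Tuple

# First three columns of the Escherichia coli lineage from nodes.dmp above;
# the remaining columns are omitted in this sketch.
NODES_DMP = """\
1 | 1 | no rank |
609216830 | 1 | superkingdom |
1641076285 | 609216830 | phylum |
329474883 | 1641076285 | class |
1012954932 | 329474883 | order |
87250111 | 1012954932 | family |
1187493883 | 87250111 | genus |
1945799576 | 1187493883 | species |
"""


def parse_nodes(text: str) -> Dict[int, Tuple[int, str]]:
    """Map each taxonomy identifier to its (parent identifier, rank)."""
    nodes: Dict[int, Tuple[int, str]] = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue  # skip blank or malformed lines
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def rank_lineage(
    nodes: Dict[int, Tuple[int, str]], taxonomy_id: int
) -> Optional[List[str]]:
    """Collect ranks from the given taxon up to the root, or None if unknown."""
    if taxonomy_id not in nodes:
        return None
    ranks: List[str] = []
    current = taxonomy_id
    while True:
        parent, rank = nodes[current]
        ranks.append(rank)
        # The root is its own parent; also stop if the parent is missing.
        if parent == current or parent not in nodes:
            break
        current = parent
    return ranks


nodes = parse_nodes(NODES_DMP)
lineage = rank_lineage(nodes, 1945799576)
print(";".join(lineage) if lineage is not None else None)
```

Walking from the species node reaches the root after seven named ranks, which matches the `--rank-names` list passed to `taxonkit create-taxdump` in the Makefile target.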