ncbi taxonomy data cleaning #1014

Open
wants to merge 25 commits into
base: master
7c7bc1f
ncbi taxonomy data cleaning scripts and unittest
krishnaswamypradeep Apr 22, 2024
d9145a9
fixed format issues
krishnaswamypradeep Apr 22, 2024
fab56a8
Merge branch 'master' into ncbi_taxonomy
spiekos Jun 24, 2024
3dcb084
Update README.md
spiekos Jun 25, 2024
1259329
Update README.md
spiekos Jun 25, 2024
d910e77
Update format_ncbi_taxonomy_test.py
spiekos Jul 4, 2024
044c936
Update tests.sh
spiekos Jul 4, 2024
22df802
Update README.md
spiekos Jul 4, 2024
9c4232c
Rename ncbi_taxonomy_schema.mcf to ncbi_taxonomy_schema.mcf
spiekos Jul 4, 2024
26f8251
Update README.md
spiekos Jul 4, 2024
9a9fb70
Update format_ncbi_taxonomy.py
spiekos Jul 4, 2024
a30ff55
Update format_ncbi_taxonomy_test.py
spiekos Jul 4, 2024
bf94a3c
Update ncbi_taxonomy.tmcf
spiekos Jul 6, 2024
4282325
Update format_ncbi_taxonomy.py
spiekos Jul 22, 2024
b7aacf2
Update ncbi_taxonomy.tmcf
spiekos Jul 22, 2024
cf54a87
Update README.md
spiekos Jul 22, 2024
ea8fadb
Update README.md
spiekos Jul 22, 2024
888c115
NCBI Taxonomy git comments fix
krishnaswamypradeep Sep 3, 2024
d90da57
type fix in README.md file
krishnaswamypradeep Sep 3, 2024
36b274f
Merge branch 'master' into ncbi_taxonomy
hareesh-ms Sep 4, 2024
547038d
Merge branch 'datacommonsorg:master' into ncbi_taxonomy
krishnaswamypradeep Oct 3, 2024
2c9cce9
lint issue fix
krishnaswamypradeep Oct 3, 2024
e189868
Merge branch 'master' into ncbi_taxonomy
spiekos Oct 22, 2024
a94616e
Update README.md
spiekos Oct 29, 2024
96259ac
Update README.md
spiekos Oct 29, 2024
129 changes: 129 additions & 0 deletions scripts/biomedical/NCBI_Taxonomy/README.md
@@ -0,0 +1,129 @@
# Importing NCBI Taxonomy Data

## Table of Contents

1. [About the Dataset](#about-the-dataset)
    1. [Download URL](#download-url)
    2. [Database Overview](#database-overview)
    3. [Schema Overview](#schema-overview)
        1. [New Schema](#new-schema)
        2. [dcid Generation](#dcid-generation)
        3. [enum Generation](#enum-generation)
        4. [Edges](#edges)
    4. [Notes and Caveats](#notes-and-caveats)
    5. [License](#license)
    6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
2. [About the Import](#about-the-import)
    1. [Artifacts](#artifacts)
    2. [Import Procedure](#import-procedure)
    3. [Test](#test)


## About the Dataset

NCBI Taxonomy "consists of a curated set of names and classifications for all of the source organisms represented in the International Nucleotide Sequence Database Collaboration (INSDC). The NCBI Taxonomy database contains a list of names that are determined to be nomenclaturally correct or valid (as defined according to the different codes of nomenclature), classified in an approximately phylogenetic hierarchy (depending on the level of knowledge regarding phylogenetic relationships of a given group) as well as a number of names that exist outside the jurisdiction of the codes. That is, it focuses on nomenclature and systematics, rather than documenting the description of taxa." Furthermore, NCBI Taxonomy "includes organism names and classifications for every sequence in the nucleotide and protein sequence databases of the INSDC. It provides a framework for clustering elements within other domains of NCBI web pages, for internal linking between domains of the Entrez system and for linking out to taxon-specific external resources on the web and relevant publications. It is also the standard nomenclature and classification repository for the INSDC that comprises of GenBank, the European Molecular Biology Laboratory (EMBL) and DNA Data Bank of Japan (DDBJ)."

### Download URL

NCBI Taxonomy data can be downloaded from the National Center for Biotechnology Information (NCBI) FTP site:
1. [ncbi_taxdump](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.Z).
2. [ncbi_taxcat](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.tar.gz).

### Database Overview

"NCBI Taxonomy distinguishes between formal and informal names. Formal names are declared based on rules laid down in four relevant codes of nomenclature (although other codes do exist). These are the International Code of Nomenclature for algae, fungi, and plants (ICNafp), the International Code of Nomenclature of Prokaryotes (ICNP) and the International Code of Zoological Nomenclature (ICZN). The viruses are governed by the International Code of Virus Classification and Nomenclature (ICVCN, also referred to as the ICTV Code). Informal names follow internal rules that are dictated by practical considerations outside of the Codes. For example, names lacking species epithets are commonly applied to GenBank records."

In this import we include information from the following files downloaded from the NCBI FTP site:
* division.dmp
* names.dmp
* host.dmp
* nodes.dmp
* categories.dmp
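
As a rough illustration of how the taxdump `.dmp` files can be read, the sketch below splits each row on the `\t|\t` field separator and drops the trailing `\t|` row terminator used by the taxdump format. The function name `read_dmp` and the abbreviated sample rows are assumptions made for this example; they are not taken from `format_ncbi_taxonomy.py`, and real `nodes.dmp` rows carry more columns than shown here.

```python
import io
from typing import Iterator, List, TextIO


def read_dmp(handle: TextIO) -> Iterator[List[str]]:
    """Yield one list of stripped fields per row of a taxdump .dmp file."""
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        # Each row ends with a tab and a pipe; remove that terminator first.
        if line.endswith("\t|"):
            line = line[:-2]
        # Fields are separated by tab, pipe, tab.
        yield [field.strip() for field in line.split("\t|\t")]


# Hypothetical, shortened rows in the style of nodes.dmp:
# tax id, parent tax id, rank.
sample = io.StringIO(
    "2\t|\t131567\t|\tsuperkingdom\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)
rows = list(read_dmp(sample))
for row in rows:
    print(row)
```

To read a downloaded file, the same function could be given an open file handle, for example `open("input/nodes.dmp", encoding="utf-8")`.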

### Schema Overview

#### New Schema

* Classes
    * Taxon (Thing > BioChemEntity > BiomedicalEntity > BiologicalEntity > GenomeAnnotation > Taxon)
* Properties
    * BiologicalEntity: ncbiBlastName
    * Taxon: biologicalHost, commonName, genBankName, hasInheritedDivision, ncbiTaxId, taxonDivision, taxonRank, taxonTopLevelCategory
* Enumerations
    * TaxonTopLevelCategoryEnum
* Enumerations Generated By Script
    * BiologicalTaxonomicDivisionEnum
    * BiologicalTaxonomicRankEnum

#### dcid Generation

Each entry in NCBI Taxonomy is stored as a Taxon entity. The dcid for each entity is generated by converting the taxon's scientific name to Pascal case and adding the prefix "bio/"; any text enclosed in `<>` is joined to the rest of the name with an "_" (e.g. the scientific name `Bacteria <Bacteria>` for NCBI tax ID "2" is represented as "bio/Bacteria_Bacteria").
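
The dcid rule can be sketched as follows. This is a minimal reading of the rule as described here, not the code in `format_ncbi_taxonomy.py`; the helper names `to_pascal_case` and `taxon_dcid` are hypothetical, and the handling of punctuation (dropped when splitting words) is an assumption.

```python
import re


def to_pascal_case(text: str) -> str:
    """Join alphanumeric words, upper-casing the first letter of each."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    return "".join(word[:1].upper() + word[1:] for word in words)


def taxon_dcid(scientific_name: str) -> str:
    """Build a 'bio/' dcid; text inside <> is appended after an underscore."""
    match = re.fullmatch(r"\s*(.*?)\s*<(.*)>\s*", scientific_name)
    if match:
        parts = [match.group(1), match.group(2)]
    else:
        parts = [scientific_name]
    return "bio/" + "_".join(to_pascal_case(part) for part in parts)


print(taxon_dcid("Bacteria <Bacteria>"))  # bio/Bacteria_Bacteria
print(taxon_dcid("Homo sapiens"))         # bio/HomoSapiens
```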

#### enum Generation

The schemas for BiologicalTaxonomicDivisionEnum, BiologicalHostEnum, and BiologicalTaxonomicRankEnum are autogenerated by `format_ncbi_taxonomy.py`.
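
One way such an enum schema could be generated is sketched below: collect the distinct values found in the data and emit one MCF node for the enum class plus one per member. The function names and the member dcid convention (enum name without the `Enum` suffix, followed by the Pascal-cased value) are assumptions for illustration only; the script may name members differently.

```python
import re
from typing import Iterable


def enum_member_dcid(prefix: str, value: str) -> str:
    """Assumed convention: prefix followed by the Pascal-cased value."""
    words = re.findall(r"[A-Za-z0-9]+", value)
    return prefix + "".join(word[:1].upper() + word[1:] for word in words)


def build_enum_mcf(enum_name: str, values: Iterable[str]) -> str:
    """Return MCF text for an enum class and its de-duplicated members."""
    prefix = enum_name[:-len("Enum")] if enum_name.endswith("Enum") else enum_name
    blocks = [
        f"Node: dcid:{enum_name}\n"
        f'name: "{enum_name}"\n'
        "typeOf: schema:Class\n"
        "subClassOf: schema:Enumeration\n"
    ]
    for value in sorted(set(values)):
        blocks.append(
            f"Node: dcid:{enum_member_dcid(prefix, value)}\n"
            f'name: "{value}"\n'
            f"typeOf: dcs:{enum_name}\n"
        )
    # Blank line between nodes, as in the schema MCF files.
    return "\n".join(blocks)


mcf_text = build_enum_mcf("BiologicalTaxonomicRankEnum", ["species", "genus", "species"])
print(mcf_text)
```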

#### Edges

The edges, or links, in this import are between Taxon nodes that are related by parent-child relationships within the taxonomic hierarchy. The dcid of a parent Taxon node is stored in the property "parentTaxon". For example, a node with taxon rank Species is linked to the relevant Taxon node of rank Genus.
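
A minimal sketch of how the parent-child links could be resolved, assuming the tax id and parent tax id pairs have already been read from `nodes.dmp` and that a lookup from tax id to dcid exists. The function name `link_parents` and the sample data are hypothetical. The root node of NCBI Taxonomy (tax id 1) lists itself as its own parent, so the sketch skips self-references.

```python
from typing import Dict, Iterable, Tuple


def link_parents(
    nodes: Iterable[Tuple[str, str]],
    dcid_by_tax_id: Dict[str, str],
) -> Dict[str, str]:
    """Map each taxon dcid to the dcid of its parent taxon."""
    parent_taxon = {}
    for tax_id, parent_id in nodes:
        if tax_id == parent_id:
            continue  # the root points at itself; it has no parent edge
        child_dcid = dcid_by_tax_id.get(tax_id)
        parent_dcid = dcid_by_tax_id.get(parent_id)
        if child_dcid is not None and parent_dcid is not None:
            parent_taxon[child_dcid] = parent_dcid
    return parent_taxon


# Hypothetical sample: Homo sapiens (9606) is a child of the genus Homo (9605).
sample_nodes = [("1", "1"), ("9605", "1"), ("9606", "9605")]
sample_dcids = {"1": "bio/Root", "9605": "bio/Homo", "9606": "bio/HomoSapiens"}
edges = link_parents(sample_nodes, sample_dcids)
print(edges)
```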

### Notes and Caveats

Need to add notes

### License

This data is from an NIH human genome unrestricted-access data repository and made accessible under the [NIH Genomic Data Sharing (GDS) Policy](https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/).

### Dataset Documentation and Relevant Links

More information about the NCBI Taxonomy database can be found [here](https://www.ncbi.nlm.nih.gov/books/NBK53758/). Additional information is contained in [Schoch et al. 2020](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7408187/).

## About the Import

### Artifacts

#### Scripts

##### Bash Scripts

- [download.sh](scripts/download.sh) downloads the most recent release of the NCBI Taxonomy data.
- [run.sh](scripts/run.sh) creates the new taxonomy enum MCF and converts the data into a formatted CSV for import, using the categories.dmp, division.dmp, host.dmp, names.dmp, and nodes.dmp files from the download location.
- [tests.sh](scripts/tests.sh) runs standard tests to check that the taxonomy enum MCF file is properly formatted.

##### Python Scripts

- [format_ncbi_taxonomy.py](scripts/format_ncbi_taxonomy.py) creates the taxonomy enum mcf and formatted CSV files.
- [format_ncbi_taxonomy_test.py](scripts/format_ncbi_taxonomy_test.py) is a unittest script that runs standard test cases against the taxonomy enum MCF.

#### tMCFs

- [ncbi_taxonomy.tmcf](tMCFs/ncbi_taxonomy.tmcf) contains the tMCF mapping to the taxonomy CSV.

#### Schema MCF

- [ncbi_taxonomy_schema.mcf](schema_mcf/ncbi_taxonomy_schema.mcf) contains the schema mcf.


### Import Procedure

Download the most recent versions of NCBI Taxonomy data:

```bash
sh download.sh
```

Generate the enum schema MCF & formatted CSV:

```bash
sh run.sh
```


### Test

To run tests:

```bash
sh tests.sh
```
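
As an illustration of the kind of formatting check such tests might perform, the sketch below parses MCF text into nodes and reports any node missing a required key. It is not the contents of `format_ncbi_taxonomy_test.py`; the helper names and the choice of required keys (`Node`, `name`, `typeOf`) are assumptions.

```python
import re
from typing import Dict, List, Sequence


def parse_mcf(text: str) -> List[Dict[str, str]]:
    """Split MCF text on blank lines and read each 'key: value' line."""
    nodes = []
    for block in re.split(r"\n\s*\n", text.strip()):
        node = {}
        for line in block.splitlines():
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Split on the first colon only, so 'dcid:Taxon' stays intact.
            key, _, value = line.partition(":")
            node[key.strip()] = value.strip()
        if node:
            nodes.append(node)
    return nodes


def missing_keys(
    node: Dict[str, str],
    required: Sequence[str] = ("Node", "name", "typeOf"),
) -> List[str]:
    """Return the required keys that the node does not define."""
    return [key for key in required if key not in node]


sample_mcf = (
    "Node: dcid:Taxon\n"
    'name: "Taxon"\n'
    "typeOf: schema:Class\n"
    "\n"
    "Node: dcid:acronym\n"
    'name: "acronym"\n'
)
parsed = parse_mcf(sample_mcf)
problems = {node["Node"]: missing_keys(node) for node in parsed}
print(problems)
```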
@@ -0,0 +1,81 @@
Node: dcid:Taxon
name: "Taxon"
typeOf: schema:Class
subClassOf: dcs:BiologicalEntity
description: "A set of organisms asserted to represent a natural cohesive biological unit."

Node: dcid:biologicalHost
name: "biologicalHost"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: dcs:BiologicalHostEnum
description: "The type of larger biological organism that harbors this group of organisms."

Node: dcid:genBankName
name: "genBankName"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: schema:Text
description: "The name by which GenBank refers to this taxon."

Node: dcid:hasInheritedDivision
name: "hasInheritedDivision"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: schema:Boolean
description: "Division is inherited from the parent Taxon."

Node: dcid:ncbiBlastName
name: "ncbiBlastName"
typeOf: schema:Property
domainIncludes: dcs:BiologicalEntity
rangeIncludes: schema:Text
description: "The name by which the NCBI Basic Local Alignment Search Tool (BLAST) refers to this taxon."

Node: dcid:ncbiTaxId
name: "ncbiTaxId"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: schema:Number
description: "Node ID in the NCBI GenBank taxonomy database."

Node: dcid:parentTaxon
name: "parentTaxon"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: dcs:Taxon
description: "Closest parent taxon of the taxon in question."

Node: dcid:taxonDivision
name: "taxonDivision"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: dcs:BiologicalTaxonomicDivisionEnum
description: "The broad biological division to which the group of organisms belongs."

Node: dcid:taxonRank
name: "taxonRank"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: dcs:BiologicalTaxonomicRankEnum
description: "The relative level of a group of organisms (a taxon) in an ancestral or hereditary hierarchy."

Node: dcid:taxonTopLevelCategory
name: "taxonTopLevelCategory"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: dcs:TaxonTopLevelCategoryEnum
description: "The top level category to which a group of organisms (a taxon) belongs. This can be Archaea, Bacteria, Eukaryota, Viruses and Viroids, Other, and Unclassified."

Node: dcid:abbreviation
name: "abbreviation"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: schema:Text

Node: dcid:acronym
name: "acronym"
typeOf: schema:Property
domainIncludes: dcs:Taxon
rangeIncludes: schema:Text

14 changes: 14 additions & 0 deletions scripts/biomedical/NCBI_Taxonomy/scripts/download.sh
@@ -0,0 +1,14 @@
#!/bin/bash

# create the input directory and stop if it cannot be entered
mkdir -p input && cd input || exit 1

# download the newest ncbi taxdump file and uncompress it
curl -o ncbi_taxdump.tar.Z https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/new_taxdump.tar.Z
uncompress ncbi_taxdump.tar.Z
tar -xvf ncbi_taxdump.tar
rm ncbi_taxdump.tar

# download the newest ncbi taxcat file and uncompress it
curl -o ncbi_taxcat.tar.gz https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxcat.tar.gz
tar -xvzf ncbi_taxcat.tar.gz
rm ncbi_taxcat.tar.gz