Update README.md

Add the Notes and Caveat Subsection to README.md
datacommonsorg · Oct 29, 2024 · 659fc7e
1 parent d41b21f
commit 659fc7e
Showing 1 changed file with 20 additions and 11 deletions.
diff --git a/scripts/biomedical/NCBI_Gene/README.md b/scripts/biomedical/NCBI_Gene/README.md
@@ -179,30 +179,31 @@ NCBI Gene data can be downloaded from the National Center for Biotechnology Info
 
 #### dcid Generation
 
-* **Gene:** The data for each entry in NCBI Gene was stored as a Gene entity. The dcid for these entities was generated by adding the prefix "bio/" to the gene symbol for *Homo sapiens* (i.e. when NCBI TaxID is 9606) genes only (e.g. bio/TP53). For all other species, the dcid for genes is generated by adding the prefix "bio/ncbi_" to the NCBI Gene ID. For example, the dcid for Trp53 in *Mus musculus* would be bio/ncbi_22059.
+* **Gene:** The data for each entry in NCBI Gene was stored as a Gene entity. The dcid for these entities was generated by adding the prefix 'bio/' to the gene symbol for *Homo sapiens* (i.e. when NCBI TaxID is 9606) genes only (e.g. bio/TP53). For all other species, the dcid for genes is generated by adding the prefix 'bio/ncbi_' to the NCBI Gene ID. For example, the dcid for Trp53 in *Mus musculus* would be bio/ncbi_22059.
 * **GeneMendelianInheritanceInManIdentifierAssociation:** The dcids for GeneMendelianInheritanceInManIdentifierAssociation are generated as the referenced gene dcid followed by "_omim_" and then the OMIM ID (e.g. bio/ncbi_216_omim_100640).
-* **GeneReferenceIntoFunction:** The dcids for GeneReferenceIntoFunction nodes are generated by joining with an "_PubMed_" the dcid of the referenced gene with the PubMed ID of the paper with the referenced function as in <Gene_dcid>_PubMed_<PubMed_ID> (e.g. bio/ncbi_715746_PubMed_19537945).
-* **GenomicCoordinates:** The dcids for GenomicCoordinates of Genes are generated in the following format: "bio/<chromosome_genomic_accession.version>_<start>_<stop>" where start and stop indicate the starting and stoping position for the Gene in the chromosome (or other genomic scaffold unit) of the specified genome assembly version. The dcids for GenomicCoordinates of RnaTranscripts are generated in the following format: "bio/<genomic_nucleotide_accession.version>_<start>_<end> where genomic_nucleotide_accession.version is the identifier for the matching RefSeq genomic nucleotide region and the start and end indicated the starting and stoping position for the Gene in the chromosome (e.g. bio/NC_001321.1_8388_9068).
-* **GeneOntologyTerm:** The dcids for GeneOntologyTerm nodes are generated by adding the prefix "bio/" to the Gene Ontology (GO) Term Accession replacing the ":" with "_" (e.g. bio/GO_0005105).
-* **MaturePeptide:** The dcids for Proteins are generated by adding the prefix "bio/" to the peptide_accession.version (e.g. bio/XM_064049720.1)
-* **MendelianInheritanceInManEntity:** The dcids for MendelianInheritanceInManEntity nodes is generated by adding the prefix "bio/omim_" to omim IDs (e.g. bio/omim_136350).
-* **Protein:** The dcids for Proteins are generated by adding the prefix "bio/" to the protein_accession.version (e.g. bio/AVV84537.1).
-* **RnaTranscript**: The dcids for RnaTranscripts are generated by adding the prefix "bio/" to the RNA_nucleotide_accession.version (e.g. bio/XM_062216974.1).
-* **Taxon:** The dcid for Taxon nodes were generated by adding the prefix "bio/" to the scientific name for the Taxon in which the scientific name was reperesented in pascal case and text (e.g. bio/HomoSapiens). In <> was connected by an "_" (e.g. scientifc name Bacteria <Bacteria> for ncbi tax id "2" was represented as bio/Bacteria_Bacteria).
-* **UmlsConceptUniqueIdentifier:** The dcids for UmlsConceptUniqueIdentifiers are generated by adding the prefix "bio/" to the UMLS CUIs (e.g. bio/C1563720). 
+* **GeneReferenceIntoFunction:** The dcids for GeneReferenceIntoFunction nodes are generated by joining the dcid of the referenced gene and the PubMed ID of the paper describing the referenced function with '_PubMed_', as in <Gene_dcid>_PubMed_<PubMed_ID> (e.g. bio/ncbi_715746_PubMed_19537945).
+* **GenomicCoordinates:** The dcids for GenomicCoordinates of Genes are generated in the following format: 'bio/<chromosome_genomic_accession.version>_<start>_<stop>', where start and stop indicate the starting and stopping positions of the Gene in the chromosome (or other genomic scaffold unit) of the specified genome assembly version. The dcids for GenomicCoordinates of RnaTranscripts are generated in the following format: 'bio/<genomic_nucleotide_accession.version>_<start>_<end>', where genomic_nucleotide_accession.version is the identifier for the matching RefSeq genomic nucleotide region and start and end indicate the starting and stopping positions for the Gene in the chromosome (e.g. bio/NC_001321.1_8388_9068).
+* **GeneOntologyTerm:** The dcids for GeneOntologyTerm nodes are generated by adding the prefix 'bio/' to the Gene Ontology (GO) Term Accession, replacing the ':' with '_' (e.g. bio/GO_0005105).
+* **MaturePeptide:** The dcids for MaturePeptides are generated by adding the prefix 'bio/' to the peptide_accession.version (e.g. bio/XM_064049720.1).
+* **MendelianInheritanceInManEntity:** The dcids for MendelianInheritanceInManEntity nodes are generated by adding the prefix 'bio/omim_' to OMIM IDs (e.g. bio/omim_136350).
+* **Protein:** The dcids for Proteins are generated by adding the prefix 'bio/' to the protein_accession.version (e.g. bio/AVV84537.1).
+* **RnaTranscript**: The dcids for RnaTranscripts are generated by adding the prefix 'bio/' to the RNA_nucleotide_accession.version (e.g. bio/XM_062216974.1).
+* **Taxon:** The dcids for Taxon nodes were generated by adding the prefix 'bio/' to the scientific name of the Taxon, represented in Pascal case (e.g. bio/HomoSapiens). Text in <> was connected by an '_' (e.g. the scientific name Bacteria <Bacteria> for NCBI Tax ID '2' was represented as bio/Bacteria_Bacteria).
+* **UmlsConceptUniqueIdentifier:** The dcids for UmlsConceptUniqueIdentifiers are generated by adding the prefix 'bio/' to the UMLS CUIs (e.g. bio/C1563720). 
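The dcid rules listed above can be sketched in a few small helpers. This is a minimal illustration rather than the import script's actual code: the function names and signatures are hypothetical, and only the prefixes, separators, and example values come from the list above.

```python
# Hypothetical helpers illustrating the dcid rules described in this README.
HUMAN_TAX_ID = "9606"  # NCBI TaxID for Homo sapiens, as stated above


def gene_dcid(tax_id: str, gene_id: str, symbol: str) -> str:
    """Human genes use the gene symbol; all other species use the NCBI Gene ID."""
    if str(tax_id) == HUMAN_TAX_ID:
        return f"bio/{symbol}"
    return f"bio/ncbi_{gene_id}"


def omim_association_dcid(referenced_gene_dcid: str, omim_id: str) -> str:
    """<Gene_dcid>_omim_<OMIM_ID>."""
    return f"{referenced_gene_dcid}_omim_{omim_id}"


def gene_rif_dcid(referenced_gene_dcid: str, pubmed_id: str) -> str:
    """<Gene_dcid>_PubMed_<PubMed_ID>."""
    return f"{referenced_gene_dcid}_PubMed_{pubmed_id}"


def genomic_coordinates_dcid(accession_version: str, start: int, stop: int) -> str:
    """bio/<accession.version>_<start>_<stop>."""
    return f"bio/{accession_version}_{start}_{stop}"


def go_term_dcid(go_accession: str) -> str:
    """Prefix 'bio/' and replace ':' with '_' in the GO Term Accession."""
    return "bio/" + go_accession.replace(":", "_")


if __name__ == "__main__":
    print(gene_dcid("10090", "22059", "Trp53"))  # bio/ncbi_22059
    print(go_term_dcid("GO:0005105"))            # bio/GO_0005105
```

Keeping each rule in its own function makes it straightforward to check the examples quoted in the list against the generated values.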
 
 #### enum Generation
 
 The schemas for the GeneFeatureTypeRegulatoryEnum, GeneFeatureTypeMiscellaneousEnum and GeneFeatureTypeMiscellaneousRecombinationEnum are autogenerated by `format_ncbi_gene.py`.
 
-A sample auto generated schema file saves at [ncbi_gene_schema_enum.mcf](/scripts/biomedical/NCBI_Gene/test_data/sample_enum/ncbi_gene_schema_enum.mcf) for reference
+A sample autogenerated schema file is saved at [ncbi_gene_schema_enum.mcf](/scripts/biomedical/NCBI_Gene/test_data/sample_enum/ncbi_gene_schema_enum.mcf) for reference.
 
 #### Edges
 
 Links were established between the entity classes included in this import. In the table below we document this info, alphabetizing on the entity type of the outgoing link. We include the entity type of the corresponding linked node along with the property whose value is the link.
 
 | Entity Type of Outgoing Link | Entity Type of Ingoing Link | Property |
 | -------- | ------- | ------- |
+| Gene | Gene | geneOrtholog |
 | Gene | Gene | genePotentialReadthroughSibling |
 | Gene | Gene | geneReadthroughChild |
 | Gene | Gene | geneReadthroughParent |
@@ -229,6 +230,14 @@ Links were established between the entity classes included in this import. In th
 
 This import relies on the ncbi_tax_id_dcid_mapping file, which is generated as output from the Taxonomy import. This maps the NCBI Tax ID to the dcid of the corresponding preexisting node in the graph. According to NCBI Gene, "most of the files in this path are re-calculated daily. Gene does not, however, compare previous and current data, so the date on the file may change without any change in content." This necessitates regular updates.
 
+For the `gene_info.csv` file, the property 'dbXrefs' is a list of <database>:<ID> pairs split by a '|' delimiter. This property was represented in a separate CSV file in which each database is a column and each row is an individual Gene. This was brought in as its own CSV+tMCF file pair in addition to the CSV+tMCF file pair representing the remaining information in the `gene_info.csv` input file. Each database is represented as its own property within the graph with the value being the ID, rather than listing all alternative IDs in a single property as in the input file. This allows for searching for a particular database ID in Biomedical Data Commons.
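The 'dbXrefs' split described above can be sketched as follows. This is an illustrative assumption, not the import script's code: the function name `parse_dbxrefs` is hypothetical, the sketch assumes a missing value is written as '-', and it splits each pair on the first ':' only, on the assumption that some IDs contain a colon themselves. The sample string is illustrative.

```python
def parse_dbxrefs(dbxrefs: str) -> dict[str, list[str]]:
    """Split a '|'-delimited list of <database>:<ID> pairs into per-database IDs."""
    parsed: dict[str, list[str]] = {}
    if not dbxrefs or dbxrefs == "-":  # assumption: '-' marks a missing value
        return parsed
    for pair in dbxrefs.split("|"):
        # Split on the first ':' only, so an ID such as 'HGNC:5' stays intact.
        database, separator, identifier = pair.partition(":")
        if not separator:
            continue  # skip malformed entries that have no database prefix
        parsed.setdefault(database, []).append(identifier)
    return parsed


if __name__ == "__main__":
    # Illustrative input; each key would become its own column and property.
    print(parse_dbxrefs("MIM:138670|HGNC:HGNC:5|Ensembl:ENSG00000121410"))
```

Each key of the returned dictionary corresponds to one database column in the separate CSV file, so a single lookup by database name replaces scanning the packed list.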
+
+For the `gene_neighbors.csv` file we only ingested the genomic coordinate information for the Genes. We did not include information regarding neighboring genes. We dropped the following columns prior to ingestion: 'GeneIDs on left', 'distance to left', 'GeneIDs on right', 'distance to right', and 'overlapping GeneIDs'.
+
+For the `gene_ortholog.csv` and the `gene_group.csv` files we represented the different relationships as their own distinct properties (e.g. 'geneOrtholog', 'genePotentialReadthroughSibling', 'geneReadthroughChild', etc.), whose value is a link to the corresponding Gene. Since these values could be comma-separated lists in which multiple Genes could be assigned the same relationship to a single Gene, we directly referenced these nodes by converting these lists of Genes to the appropriately referenced dcid for the Gene in Biomedical Data Commons (e.g. dcid:bio/ncbi_112291008). All Gene nodes, and thus their dcids, used in the NCBI Gene import are initiated by ingesting the `gene_info.csv` file, so we are guaranteed their existence in the graph. An additional check for the existence of these referenced genes is made in the import script when converting the input NCBI Gene ID to the corresponding Gene dcid.
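The conversion and existence check described above might look roughly like this. It is a sketch under stated assumptions: the function name, the `id_to_dcid` lookup built from `gene_info.csv`, the choice to skip unknown IDs, and the comma-joined output are all hypothetical, not taken from the import script.

```python
def gene_ids_to_dcid_refs(gene_ids: str, id_to_dcid: dict[str, str]) -> str:
    """Convert a comma-separated list of NCBI Gene IDs to 'dcid:'-prefixed references.

    id_to_dcid is a hypothetical lookup from NCBI Gene ID to Gene dcid,
    assumed to be built while ingesting gene_info.csv.
    """
    refs = []
    for gene_id in gene_ids.split(","):
        gene_id = gene_id.strip()
        dcid = id_to_dcid.get(gene_id)
        if dcid is None:
            continue  # existence check: skip IDs with no known Gene node
        refs.append(f"dcid:{dcid}")
    return ",".join(refs)


if __name__ == "__main__":
    lookup = {"112291008": "bio/ncbi_112291008"}
    print(gene_ids_to_dcid_refs("112291008, 999", lookup))  # dcid:bio/ncbi_112291008
```

Resolving IDs through a lookup rather than formatting them directly keeps human genes, whose dcids use the gene symbol, consistent with the rest of the graph.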
+
+The `gene2accession.csv` file contained paired pieces of information (the accession.version and GI) on the following related entity types: Genomic Nucleotide (i.e. GenomicRegion), RNA Nucleotide (i.e. RnaTranscript), Protein (i.e. Protein), and Mature Peptide (i.e. MaturePeptide). The information for each of these distinct but related entity types may or may not be available for each Gene in the `gene2accession.csv` file. To maintain the specificity of the information and links between each of the related entity types regardless of the presence or absence of information on those entities for any given Gene, links were made between all potential pairs of related entity types: Gene, GenomicRegion, RnaTranscript, Protein, and MaturePeptide. The nature of these links is described in the [Edges](#edges) subsection.
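One way to picture linking "all potential pairs" is to enumerate every pair of entity types that are present in a given row. The sketch below is only an illustration: the function name, the row layout keyed by entity type, the '-' marker for absent values, and the placeholder dcids are assumptions, and the actual link properties are the ones listed in the Edges table.

```python
from itertools import combinations


def linkable_pairs(row: dict[str, str]) -> list[tuple[tuple[str, str], tuple[str, str]]]:
    """Return every pair of entities present in a row, skipping absent ones.

    row maps an entity type to its dcid; '-' or '' is assumed to mean absent.
    """
    present = [(entity_type, dcid) for entity_type, dcid in row.items()
               if dcid and dcid != "-"]
    return list(combinations(present, 2))


if __name__ == "__main__":
    # Placeholder values for illustration only, not a real gene2accession record.
    example_row = {
        "Gene": "bio/ncbi_22059",
        "GenomicRegion": "-",
        "RnaTranscript": "bio/XM_062216974.1",
        "Protein": "bio/AVV84537.1",
        "MaturePeptide": "-",
    }
    for left, right in linkable_pairs(example_row):
        print(left[0], "<->", right[0])
```

Because pairs are drawn only from the entities that are present, a missing RnaTranscript does not prevent a Gene from being linked directly to its Protein.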
+
 ### License
 
 This data is from an NIH National Library of Medicine (NLM) genome unrestricted-access data repository and made accessible under the [NIH Genomic Data Sharing (GDS) Policy](https://osp.od.nih.gov/scientific-sharing/genomic-data-sharing/) and the [NLM Accessibility policy](https://www.nlm.nih.gov/accessibility.html). Additional information on "NCBI Website and Data Usage Policies" can be found [here](https://www.ncbi.nlm.nih.gov/home/about/policies/).