datacommonsorg · spiekos · Jul 3, 2024 · Jul 3, 2024
diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/README.md b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/README.md
@@ -0,0 +1,143 @@
+# Importing The Sequence Ontology (SO)
+
+## Table of Contents
+
+1. [About the Dataset](#about-the-dataset)
+    1. [Download URLs](#download-urls)
+    2. [Overview](#overview)
+    3. [Notes and Caveats](#notes-and-caveats)
+    4. [Schema Overview](#schema-overview)
+       1. [New Schema](#new-schema)
+       2. [dcid Generation](#dcid-generation)
+         1. [SequenceOntologyTerm](#sequenceontologyterm)
+         2. [SequenceOntologyTermSynonym](#sequenceontologytermsynonym)
+         3. [Illegal Characters](#illegal-characters)
+    5. [License](#license)
+    6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
+2. [About the Import](#about-the-import)
+    1. [Artifacts](#artifacts)
+       1. [Scripts](#scripts)
+       2. [tMCFs](#tmcfs)
+       3. [Log Files](#log-files)
+    2. [Import Procedure](#import-procedure)
+    3. [Tests](#tests)
+
+
+## About the Dataset
+"[The Sequence Ontology](http://www.sequenceontology.org/) is a set of terms and relationships used to describe the features and attributes of biological sequence. SO includes different kinds of features which can be located on the sequence. Biological features are those which are defined by their disposition to be involved in a biological process. Examples are binding_site and exon. Biomaterial features are those which are intended for use in an experiment such as aptamer and PCR_product. There are also experimental features which are the result of an experiment. SO also provides a rich set of attributes to describe these features such as 'polycistronic' and 'maternally imprinted'."
+
+### Download URLs
+
+[The Sequence Ontology](http://www.sequenceontology.org/) releases monthly updates to their [GitHub](https://github.com/The-Sequence-Ontology/). We process the "so.json" file from the provided [data files](https://github.com/The-Sequence-Ontology/SO-Ontologies/tree/master/Ontology_Files).
+
+
+### Overview
+
+This directory stores all scripts used to import data from the Sequence Ontology (SO). "SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation. SO was initially developed by the Gene Ontology Consortium."
+
+
+### Notes and Caveats
+
+When no name is provided for a SequenceOntologyTerm, the Sequence Ontology identifier is used as the name. Synonyms whose name is missing and replaced by the placeholder "<new synonym>" are excluded. The data is distributed as a JSON file, which must be expanded so that each line contains at most one unique SequenceOntologyTerm and SequenceOntologyTermSynonym pair. Other properties require additional expansion and/or filtering. For example, URLs that appear in the list of values of the identifier property in the original file are separated out using a regex on website subdomains and stored as a url property. Furthermore, each identifier source is recorded as its own property, unlike the original data, which records them as a list of identifier and source pairs. Finally, a SequenceOntologyTerm does not necessarily have a parent term, and a SequenceOntologyTerm may have multiple parents.
+
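The expansion and URL separation described above could look roughly like the following Python sketch. The record layout (`id`, `name`, `synonyms`, `xrefs`), the sample values, and the URL pattern are assumptions made for illustration; they are not taken from `format_the_sequence_ontology.py` or from the actual layout of `so.json`.

```python
import re

# Hypothetical, simplified term records; the real so.json layout may differ.
TERMS = [
    {"id": "SO:0000001", "name": "region",
     "synonyms": ["sequence", "<new synonym>"],
     "xrefs": ["PMID:12345", "http://example.org/region"]},
    {"id": "SO:0000002", "name": None, "synonyms": [], "xrefs": []},
]

# Assumed pattern: anything starting with a scheme or "www." is a URL.
URL_PATTERN = re.compile(r"^(?:https?://|www\.)", re.IGNORECASE)


def split_xrefs(xrefs):
    """Separate URLs from 'source:identifier' cross-references."""
    urls, ids_by_source = [], {}
    for xref in xrefs:
        if URL_PATTERN.match(xref):
            urls.append(xref)
        elif ":" in xref:
            source, identifier = xref.split(":", 1)
            ids_by_source.setdefault(source, []).append(identifier)
    return urls, ids_by_source


def expand(terms):
    """Yield one row per (term, synonym) pair; a term without synonyms yields one row."""
    for term in terms:
        name = term["name"] or term["id"]  # fall back to the identifier
        synonyms = [s for s in term["synonyms"] if s and s != "<new synonym>"]
        urls, ids_by_source = split_xrefs(term["xrefs"])
        for synonym in synonyms or [None]:
            # Each identifier source becomes its own key (property).
            yield {"id": term["id"], "name": name, "synonym": synonym,
                   "urls": urls, **ids_by_source}


rows = list(expand(TERMS))
```

In this sketch the placeholder synonym is dropped, the unnamed term falls back to its identifier, and the `PMID` source ends up as its own key rather than inside a list of pairs.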
+
+### Schema Overview
+
+#### New Schema
+
+Classes, properties, and enumerations that were added in this import to represent the data.
+
+* Classes
+    * SequenceOntologyTerm (Thing > BioChemEntity > BiomedicalEntity > BiologicalEntity > GenomeAnnotation > SequenceOntologyTerm)
+    * SequenceOntologyTermSynonym (Thing > BioChemEntity > BiomedicalEntity > BiologicalEntity > GenomeAnnotation > SequenceOntologyTerm > SequenceOntologyTermSynonym)
+* Properties
+    * BioChemEntity: proteinModificationOntology
+    * BiomedicalEntity: loincCode, pubMedCentralId
+    * GenomeAnnotation: pomBaseId, rnaModId
+    * SequenceOntologyTerm: sequenceOntologySubset
+    * SequenceOntologyTermSynonym: synonymType
+    * Thing: digitalObjectIdentifier, internationalStandardBookNumber, internationalStandardSerialNumber, isDeprecated
+* Enumerations
+    * SequenceOntologySubsetEnum
+    * SynonymTypeEnum
+
+#### dcid Generation
+A ‘bio/’ prefix was attached to all dcids in this import. Each line in each input file is treated as its own unique SequenceOntologyTerm and, if relevant, SequenceOntologyTermSynonym pair.
+
+##### SequenceOntologyTerm
+Dcids were generated by taking the Sequence Ontology identifier and substituting ':' with '_' (e.g. the dcid for SO:00000000002382 is bio/SO_00000000002382).
+
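The rule above amounts to a single string substitution plus the prefix. A minimal Python sketch, where the function name `term_dcid` is hypothetical and not taken from the import scripts:

```python
def term_dcid(so_id: str) -> str:
    """Build a dcid from a Sequence Ontology identifier such as 'SO:0000001'."""
    return "bio/" + so_id.replace(":", "_")
```

For example, `term_dcid("SO:0000001")` returns `"bio/SO_0000001"`.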
+##### SequenceOntologyTermSynonym
+Dcids were generated by replacing any illegal characters in the name of the synonym, converting it to PascalCase, and then appending it to the dcid of the related SequenceOntologyTerm using '_' (e.g. the synonym "Sequence" for SO:0000001 is represented as "bio/SO_0000001_Sequence").
+
+##### Illegal Characters
+Only ASCII characters are allowed in dcids. Additionally, the characters below are illegal in a dcid and were replaced in place as follows:
+
+| Illegal Character | Replacement Character |
+| ----------------- | --------------------- |
+| : | _ |
+| ; | _ |
+| <space>   |   |
+| [ | ( |
+| ] | ) |
+| - | _ |
+| – | _ |
+| ‘ | _ |
+| # |   |
+
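Putting the replacement table together with the synonym rule, a dcid for a SequenceOntologyTermSynonym could be built roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, blank replacement cells are read as "remove the character", and PascalCase is applied by capitalizing the first letter of each whitespace-separated word before the remaining replacements. The actual script may order or implement these steps differently.

```python
# Replacements copied from the table above. Spaces are not listed because
# they disappear when the words are joined into PascalCase.
REPLACEMENTS = {
    ":": "_", ";": "_", "[": "(", "]": ")",
    "-": "_", "\u2013": "_", "\u2018": "_", "#": "",
}


def to_pascal_case(text: str) -> str:
    """Capitalize the first letter of each whitespace-separated word and join them."""
    return "".join(word[:1].upper() + word[1:] for word in text.split())


def clean(text: str) -> str:
    """Apply PascalCase, then the character replacements."""
    return "".join(REPLACEMENTS.get(ch, ch) for ch in to_pascal_case(text))


def synonym_dcid(so_id: str, synonym: str) -> str:
    """Append the cleaned synonym to the dcid of its term using '_'."""
    return "bio/" + so_id.replace(":", "_") + "_" + clean(synonym)
```

With these definitions, `synonym_dcid("SO:0000001", "Sequence")` returns `"bio/SO_0000001_Sequence"`, matching the example given for SequenceOntologyTermSynonym.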
+
+### License
+
+The data is published under the Creative Commons Attribution ShareAlike 4.0 International [(CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
+
+
+### Dataset Documentation and Relevant Links
+
+[The Sequence Ontology](http://www.sequenceontology.org/) updates monthly by pushing their data to their [GitHub](https://github.com/The-Sequence-Ontology/). The Sequence Ontology is a product of the [Gene Ontology Consortium](https://geneontology.org/).
+
+
+## About the Import
+
+### Artifacts
+
+#### Scripts 
+
+##### Bash Scripts
+
+- [download.sh](scripts/download.sh) downloads the most recent release of The Sequence Ontology `so.json` file.
+- [run.sh](scripts/run.sh) converts data into formatted CSV for import of The Sequence Ontology data into the knowledge graph.
+- [tests.sh](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.
+
+##### Python Scripts
+
+- [format_the_sequence_ontology.py](scripts/format_the_sequence_ontology.py) creates the cleaned and CSV formatted file from The Sequence Ontology data.
+
+#### tMCFs
+
+- [the_sequence_ontology.tmcf](tMCFs/the_sequence_ontology.tmcf) contains the tMCF mapping for the cleaned, CSV-formatted Sequence Ontology data.
+
+### Import Procedure
+
+Download the most recent versions of The Sequence Ontology by running:
+
+```bash
+sh download.sh
+```
+
+Generate the cleaned and formatted CSV file by running:
+
+```bash
+sh run.sh
+```
+
+### Tests
+
+The first step of `tests.sh` is to download Data Commons's java -jar import tool, storing it in a `tmp` directory. This assumes that the user has the Java Runtime Environment (JRE) installed. The tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/). The releases of the tool can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our csv + tmcf import. It evaluates whether all schema used in the import is present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings upon failure. Please note that empty tokens for some columns are expected, as this reflects the original data. The import creates the SequenceOntologyTerm and SequenceOntologyTermSynonym nodes that are then referenced within this import by the parent and synonym properties, respectively. This resolves any concern about missing reference warnings for these node types raised by the test. Finally, not all SequenceOntologyTerms have parent nodes. This results in the CSV_EmptyDcidReferences warning, which can be ignored as it reflects the underlying state of the data.
+
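To see which rows are responsible for the CSV_EmptyDcidReferences warning, the formatted CSV can be scanned for terms with an empty parent. The sketch below uses an inline sample rather than the real output, and the column names `dcid`, `name`, and `parent` are assumptions; the columns produced by `format_the_sequence_ontology.py` may be named differently.

```python
import csv
import io

# Hypothetical sample of the formatted CSV; real column names may differ.
SAMPLE_CSV = """dcid,name,parent
bio/SO_0000001,region,bio/SO_0000110
bio/SO_0000110,sequence_feature,
"""


def terms_without_parent(csv_text, parent_column="parent"):
    """Return the dcids of rows whose parent column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["dcid"] for row in reader if not row[parent_column].strip()]


orphans = terms_without_parent(SAMPLE_CSV)
```

Rows found this way are expected, since a term at the root of the ontology has no parent to reference.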
+To run tests:
+
+```bash
+sh tests.sh
+```
+
+This will generate an output file for the results of the tests on each csv + tmcf pair.
diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/download.sh b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/download.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Author: Samantha Piekos
+# Date: 05/10/2024
+# Name: download
+# Description: This file downloads the most recent raw file from
+# The Sequence Ontology's GitHub.
+
+# create the input directory if needed and work inside it
+mkdir -p input && cd input
+
+# download the Sequence Ontology JSON file
+curl -o so.json https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json