diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/README.md b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/README.md
new file mode 100644
index 0000000000..01331f7727
--- /dev/null
+++ b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/README.md
@@ -0,0 +1,143 @@
+# Importing The Sequence Ontology from the Gene Ontology Consortium
+
+## Table of Contents
+
+1. [About the Dataset](#about-the-dataset)
+   1. [Download URLs](#download-urls)
+   2. [Overview](#overview)
+   3. [Notes and Caveats](#notes-and-caveats)
+   4. [Schema Overview](#schema-overview)
+      1. [New Schema](#new-schema)
+      2. [dcid Generation](#dcid-generation)
+         1. [SequenceOntologyTerm](#sequenceontologyterm)
+         2. [SequenceOntologyTermSynonym](#sequenceontologytermsynonym)
+      3. [Illegal Characters](#illegal-characters)
+   5. [License](#license)
+   6. [Dataset Documentation and Relevant Links](#dataset-documentation-and-relevant-links)
+2. [About the Import](#about-the-import)
+   1. [Artifacts](#artifacts)
+      1. [Scripts](#scripts)
+      2. [tMCFs](#tmcfs)
+   2. [Import Procedure](#import-procedure)
+   3. [Tests](#tests)
+
+
+## About the Dataset
+
+"[The Sequence Ontology](http://www.sequenceontology.org/) is a set of terms and relationships used to describe the features and attributes of biological sequence. SO includes different kinds of features which can be located on the sequence. Biological features are those which are defined by their disposition to be involved in a biological process. Examples are binding_site and exon. Biomaterial features are those which are intended for use in an experiment such as aptamer and PCR_product. There are also experimental features which are the result of an experiment. SO also provides a rich set of attributes to describe these features such as 'polycistronic' and 'maternally imprinted'."
+
+### Download URLs
+
+[The Sequence Ontology](http://www.sequenceontology.org/) releases monthly updates to their [GitHub](https://github.com/The-Sequence-Ontology/). We process the "so.json" file from the provided [data files](https://github.com/The-Sequence-Ontology/SO-Ontologies/tree/master/Ontology_Files).
+
+
+### Overview
+
+This directory stores all scripts used to import data from the Sequence Ontology (SO). "SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation. SO was initially developed by the Gene Ontology Consortium."
+
+
+### Notes and Caveats
+
+In cases in which no name for a SequenceOntologyTerm is provided, the Sequence Ontology identifier is used instead. In addition, we do not include synonym entries whose name is missing or is the empty placeholder string (""). The data is represented in a json file, which requires expansion so that each row contains at most one unique SequenceOntologyTerm and SequenceOntologyTermSynonym pair. Other properties require additional expansion and/or filtering. This includes separating out URLs that are contained in a list as a value of the identifier property in the original file; these are detected by matching website subdomains with regular expressions and are then represented as a url property. Furthermore, all individual sources for identifiers are recorded as unique properties, unlike the original data, which records them as a list of identifier and source pairs. Finally, a SequenceOntologyTerm does not necessarily have a parent term, and it is possible for a SequenceOntologyTerm to have multiple parents.
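+
+For illustration, the sketch below shows the kind of expansion this involves. It is a simplified, hypothetical example, assuming only the so.json layout described here (`graphs[0].nodes` with term metadata under `meta`); the authoritative logic lives in [format_the_sequence_ontology.py](scripts/format_the_sequence_ontology.py):
+
+```python
+import json
+
+# Flatten so.json so each row holds at most one term/synonym pair.
+with open('input/so.json') as f:
+    data = json.load(f)
+
+rows = []
+for node in data['graphs'][0]['nodes']:
+    if node.get('type') != 'CLASS':  # terms are CLASS nodes
+        continue
+    meta = node.get('meta', {})
+    # A term with no synonyms still gets one row of its own.
+    for syn in meta.get('synonyms') or [{}]:
+        rows.append({
+            'id': node['id'],
+            'name': node.get('lbl', ''),
+            'synonym': syn.get('val', ''),
+        })
+```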
+
+
+### Schema Overview
+
+#### New Schema
+
+Classes, properties, and enumerations that were added in this import to represent the data.
+
+* Classes
+    * SequenceOntologyTerm (Thing > BioChemEntity > BiomedicalEntity > BiologicalEntity > GenomeAnnotation > SequenceOntologyTerm)
+    * SequenceOntologyTermSynonym (Thing > BioChemEntity > BiomedicalEntity > BiologicalEntity > GenomeAnnotation > SequenceOntologyTerm > SequenceOntologyTermSynonym)
+* Properties
+    * BioChemEntity: proteinModificationOntology
+    * BiomedicalEntity: loincCode, pubMedCentralId
+    * GenomeAnnotation: pomBaseId, rnaModId
+    * SequenceOntologyTerm: sequenceOntologySubset
+    * SequenceOntologyTermSynonym: synonymType
+    * Thing: digitalObjectIdentifier, internationalStandardBookNumber, internationalStandardSerialNumber, isDeprecated
+* Enumerations
+    * SequenceOntologySubsetEnum
+    * SynonymTypeEnum
+
+#### dcid Generation
+
+A 'bio/' prefix was attached to all dcids in this import. Each row in the formatted file represents its own unique SequenceOntologyTerm and, if relevant, SequenceOntologyTermSynonym pair.
+
+##### SequenceOntologyTerm
+
+Dcids were generated by taking the Sequence Ontology identifier and substituting ':' with '_' (e.g. the dcid for SO:0000001 is bio/SO_0000001).
+
+##### SequenceOntologyTermSynonym
+
+Dcids were generated by replacing any illegal characters in the name of the synonym, converting it to pascal case, and then appending it to the dcid of the related SequenceOntologyTerm with '_' (e.g. the synonym "Sequence" for SO:0000001 is represented as "bio/SO_0000001_Sequence").
+
+#### Illegal Characters
+
+Only ASCII characters are allowed in dcids. In addition, a number of characters that are illegal in dcids were replaced with the characters specified below (a blank replacement means the character is removed):
+
+| Illegal Character | Replacement Character |
+| ----------------- | --------------------- |
+| : | _ |
+| ; | _ |
+| [ | ( |
+| ] | ) |
+| - | _ |
+| – | _ |
+| ‘ | _ |
+| # | |
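+
+A minimal sketch of these rules, using a hypothetical helper simplified from [format_the_sequence_ontology.py](scripts/format_the_sequence_ontology.py):
+
+```python
+# Illustrative synonym dcid generation following the table above.
+def pascalcase(s):
+    return ''.join(w[0].upper() + w[1:].lower() for w in s.split())
+
+def synonym_dcid(term_dcid, synonym):
+    # Replace characters that are illegal in dcids.
+    for old, new in ((':', '_'), (';', '_'), ('[', '('), (']', ')'),
+                     ('-', '_'), ('–', '_'), ('‘', '_'), ('#', '')):
+        synonym = synonym.replace(old, new)
+    return term_dcid + '_' + pascalcase(synonym)
+
+print(synonym_dcid('bio/SO_0000001', 'Sequence'))  # bio/SO_0000001_Sequence
+```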
+
+
+### License
+
+The data is published under the Creative Commons Attribution ShareAlike 4.0 International [(CC BY-SA 4.0)](https://creativecommons.org/licenses/by-sa/4.0/).
+
+
+### Dataset Documentation and Relevant Links
+
+[The Sequence Ontology](http://www.sequenceontology.org/) updates monthly by pushing their data to their [GitHub](https://github.com/The-Sequence-Ontology/). The Sequence Ontology is a product of the [Gene Ontology Consortium](https://geneontology.org/).
+
+
+## About the Import
+
+### Artifacts
+
+#### Scripts
+
+##### Bash Scripts
+
+- [download.sh](scripts/download.sh) downloads the most recent release of The Sequence Ontology ("so.json") from its GitHub repository.
+- [run.sh](scripts/run.sh) converts The Sequence Ontology data into a formatted CSV for import into the knowledge graph.
+- [tests.sh](scripts/tests.sh) runs standard tests on CSV + tMCF pairs to check for proper formatting.
+
+##### Python Scripts
+
+- [format_the_sequence_ontology.py](scripts/format_the_sequence_ontology.py) creates the cleaned, CSV-formatted file from The Sequence Ontology data.
+
+#### tMCFs
+
+- [the_sequence_ontology.tmcf](tMCF/the_sequence_ontology.tmcf) contains the tMCF mapping to the CSV of the cleaned The Sequence Ontology data.
+
+### Import Procedure
+
+Download the most recent version of The Sequence Ontology by running:
+
+```bash
+sh scripts/download.sh
+```
+
+Generate the cleaned and formatted CSV file by running:
+
+```bash
+sh scripts/run.sh
+```
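+
+As an optional sanity check (not part of the pipeline), the generated CSV can be inspected before running the tests; a minimal sketch, assuming run.sh completed successfully:
+
+```python
+import pandas as pd
+
+# Load the CSV produced by run.sh and spot-check a few key columns.
+df = pd.read_csv('CSV/the_sequence_ontology.csv')
+print(df.shape)
+print(df[['dcid', 'name', 'parent', 'synonyms_dcid']].head())
+```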
+
+### Tests
+
+The first step of `tests.sh` is to download the Data Commons Java import tool, storing it in a `tmp` directory. This assumes that the user has the Java Runtime Environment (JRE) installed. The tool is described in the Data Commons documentation of the [import pipeline](https://github.com/datacommonsorg/import/), and its releases can be viewed [here](https://github.com/datacommonsorg/import/releases/). Here we download version `0.1-alpha.1k` and apply it to check our CSV + tMCF import. It evaluates whether all schema used in the import are present in the graph and whether all referenced nodes are present in the graph, along with other checks that issue fatal errors, errors, or warnings on failure. Please note that empty tokens for some columns are expected, as this reflects the original data. The import creates the SequenceOntologyTerm and SequenceOntologyTermSynonym nodes that are then referenced within this import by the parent and synonym properties, respectively; this resolves any missing-reference warnings the test would otherwise raise for these node types. Finally, not all SequenceOntologyTerms have parent nodes. This results in the CSV_EmptyDcidReferences warning, which can be ignored as it reflects the underlying state of the data.
+
+To run tests:
+
+```bash
+sh scripts/tests.sh
+```
+
+This will generate an output file with the results of the tests on each CSV + tMCF pair.
diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/download.sh b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/download.sh
new file mode 100644
index 0000000000..81eb795c05
--- /dev/null
+++ b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/download.sh
@@ -0,0 +1,27 @@
+#!/bin/bash
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Author: Samantha Piekos
+# Date: 05/10/2024
+# Name: download
+# Description: This file downloads the most recent raw file from
+# The-Sequence-Ontology GitHub.
+
+mkdir -p input; cd input
+
+# download the so.json ontology file
+curl -o so.json https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/Ontology_Files/so.json
\ No newline at end of file
diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/format_the_sequence_ontology.py b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/format_the_sequence_ontology.py
new file mode 100644
index 0000000000..48c0a23e05
--- /dev/null
+++ b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/format_the_sequence_ontology.py
@@ -0,0 +1,510 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+'''
+Author: Samantha Piekos
+Date: 05/13/2024
+Last Edited: 07/02/2024
+Name: format_the_sequence_ontology.py
+Description: converts nested .json to .csv and separates out cross
+references for the sequence ontology and corresponding definitions into
+distinct properties.
+@file_input: input json file downloaded from The-Sequence-Ontology GitHub
+@file_output: formatted csv file ready for import into the Data Commons
+    knowledge graph with the corresponding tmcf file
+'''
+
+# import environment
+import json
+import numpy as np
+import pandas as pd
+import sys
+
+
+# declare universal variables
+DICT_SUBSETS = {
+    'http://purl.obolibrary.org/obo/so#DBVAR': 'dcs:SequenceOntologySubsetDbVar',
+    'http://purl.obolibrary.org/obo/so#biosapiens': 'dcs:SequenceOntologySubsetBiosapiens',
+    'http://purl.obolibrary.org/obo/so#SOFA': 'dcs:SequenceOntologySubsetSofa',
+    'http://purl.obolibrary.org/obo/so#Alliance_of_Genome_Resources': 'dcs:SequenceOntologySubsetAllianceOfGenomeResources'
+}
+
+DICT_SYNONYM_TYPE = {
+    'hasRelatedSynonym': 'dcs:SynonymTypeRelated',
+    'hasNarrowSynonym': 'dcs:SynonymTypeNarrow',
+    'hasBroadSynonym': 'dcs:SynonymTypeBroad',
+    'hasExactSynonym': 'dcs:SynonymTypeExact'
+}
+
+LIST_META_KEYS = [
+    'basicPropertyValues',  # [{'pred': '', 'val': ''},
+                            #  {'pred': '', 'val': ''}]
+    'comments',  # list
+    'definition',  # dictionary {'val': '', 'xrefs': []}
+    'deprecated',  # boolean
+    'subsets',  # list
+    'synonyms',  # [{'pred': '', 'val': '', 'xrefs': []},
+                 #  {'pred': '', 'val': '', 'xrefs': []}]
+    'xrefs'  # [{'val': ''}]
+]
+
+LIST_DROP_MIN_USED_SOURCES = [
+    'BACTERIAL_REGULATION_WORKING_GROUP',
+    'BBOP',
+    'BCS',
+    'CLINGEN',
+    'EBIBS',
+    'GREEKC',
+    'GENCC',
+    'GMOD',
+    'GO',
+    'HGNC',
+    'INDIANA',
+    'INVITROGEN',
+    'JAX',
+    'LAMHDI',
+    'MGD',
+    'MGI',
+    'MODENCODE',
+    'OBO',
+    'PHIGO',
+    'POMBE',
+    'RFAM',
+    'RSC',
+    'SANGER',
+    'SBOL',
+    'SGD',
+    'UNIPROT',
+    'WB',
+    'XENBASE',
+    'ZFIN'
+]
+
+LIST_DROP_UNKNOWN_SOURCES = [
+    'EBI',
+    'FB',
+    'GOC',
+    'NCBI',
+    'SO'
+]
+
+
+# define functions
+def pascalcase(s):
+    # convert string to pascal case
+    list_words = s.split()
+    converted = "".join(
+        word[0].upper() + word[1:].lower() for word in list_words)
+    return converted
+
+
+def check_for_illegal_charc(s):
+    # check that dcid does not contain any illegal characters
+    # print error message if it does
+    list_illegal = ["'", "–", "*", ",",
+                    ">", "<", "@", "]", "[", "|", ":", ";", " "]
+    if any([x in s for x in list_illegal]):
+        print('Error! dcid contains illegal characters!', s)
+
+
+def check_if_new_meta_variables(df):
+    # evaluate if any new properties were added to the meta data
+    # print an error message if one is detected
+    l = []
+    for index, row in df.iterrows():
+        # identify keys in the meta column dictionary
+        for item in row['meta'].keys():
+            l.append(item)
+    l = list(set(l))  # remove duplicates
+    for item in l:
+        # check if any new keys were added to the meta dictionary
+        # print any new keys
+        if item not in LIST_META_KEYS:
+            print('Error! New key in meta dictionary! Update code to incorporate:', item)
+    return
+
+
+def initiate_empty_meta_key_columns(df):
+    # initiate a new column of empty strings for every meta dictionary key
+    for key in LIST_META_KEYS:
+        df[key] = ''
+    return df
+
+
+def expand_meta(df):
+    # expand the meta data into one column per meta key in the dataframe
+    check_if_new_meta_variables(df)
+    df = initiate_empty_meta_key_columns(df)
+    for index, row in df.iterrows():
+        for key in row['meta'].keys():
+            df.at[index, key] = row['meta'][key]
+    df = df.drop(['meta'], axis=1)
+    return df
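+
+
+# Illustrative example of the meta expansion above (hypothetical values):
+# an input row like
+#     {'id': '.../SO_0000110', 'lbl': 'sequence_feature',
+#      'meta': {'definition': {...}, 'synonyms': [...], 'deprecated': False}}
+# gains one column per LIST_META_KEYS entry ('definition', 'synonyms',
+# 'deprecated', ...), each holding the raw value for later expansion.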
+
+
+def format_reference_list(l):
+    # return the first item in the list if one exists
+    if len(l) > 0:
+        return l[0]
+    return ''
+
+
+def expand_properties(df, col):
+    # expand the dataframe so that each property with multiple values is
+    # represented on its own line
+    new_rows = []
+    for index, row in df.iterrows():
+        c = row[col]
+        if isinstance(c, list):  # Check if property is a list
+            for item in c:  # Iterate over each property list
+                new_row = row.copy()  # Deep copy to avoid modifying original
+                pred, value = item.get('pred'), item.get('val')
+                if '#' in pred:
+                    pred = pred.split('#')[1]
+                    new_row[pred] = value
+                    new_rows.append(new_row)
+                else:
+                    new_rows.append(row)  # Unchanged row if pred lacks a '#' namespace
+        else:
+            new_rows.append(row)  # Unchanged row if property is not a list
+
+    return pd.DataFrame(new_rows).drop([col], axis=1).fillna('')
+
+
+def format_synonym_value(value):
+    # take the text value after ':'
+    if ':' in value:
+        value = value.split(':')[1]
+    return value
+
+
+def expand_synonyms(df, col):
+    # expand the dataframe so that each synonym is represented on its own line
+    new_rows = []
+    for index, row in df.iterrows():
+        c = row[col]
+        if isinstance(c, list):  # Check if synonyms is a list
+            for item in c:  # Iterate over each synonym dictionary
+                new_row = row.copy()  # Deep copy to avoid modifying original
+                new_row[col + '_pred'] = DICT_SYNONYM_TYPE[item.get('pred')]
+                new_row[col + '_value'] = format_synonym_value(item.get('val'))
+                new_row[col + '_source'] = format_reference_list(item.get('xrefs'))
+                new_rows.append(new_row)
+        else:
+            new_rows.append(row)  # Unchanged row if synonyms is not a list
+
+    # replace '_' with ' ' in synonym names
+    df_final = pd.DataFrame(new_rows).drop([col], axis=1).fillna('')
+    df_final[col + '_value'] = df_final[col + '_value'].str.replace('_', ' ')
+
+    return df_final
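+
+
+# Illustrative example of the synonym expansion above (hypothetical row):
+#     {'pred': 'hasExactSynonym', 'val': 'INDEL', 'xrefs': []}
+# becomes synonyms_pred='dcs:SynonymTypeExact', synonyms_value='INDEL',
+# and synonyms_source='' on its own copy of the term's row.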
+
+
+def replace_illegal_dcid_characters(s):
+    # replace illegal characters with legal characters for the dcid
+    s = s.replace('-', ' ')\
+        .replace("'", ' ')\
+        .replace(",", ' ')\
+        .replace('(', '')\
+        .replace(')', '')\
+        .replace('*', '')\
+        .replace('  ', ' ')
+    return s
+
+
+def handle_placeholder_synonym_cases(df, index, row):
+    # delete info for a synonym entry if a placeholder name was not updated
+    # with an actual entry
+    if row['synonyms_value'] == '':
+        # if the synonym value is missing, make sure all values related to
+        # the synonym are blank
+        df.loc[index, 'synonyms_dcid'] = ''
+        df.loc[index, 'synonyms_value'] = ''
+        df.loc[index, 'synonyms_pred'] = ''
+        df.loc[index, 'synonyms_source'] = ''
+    return df
+
+
+def format_synonym_dcid(df):
+    # format the dcid for synonyms that have multiple pieces of info attached,
+    # which will be stored in SequenceOntologyTermSynonym nodes
+    df['synonyms_dcid'] = df['synonyms_value'].copy()
+    # Deep copy to avoid modifying original
+    for index, row in df.iterrows():
+        if len(row['synonyms_dcid']) > 0:
+            synonym = row['synonyms_dcid']
+            synonym = replace_illegal_dcid_characters(synonym)
+            synonym = pascalcase(synonym)
+            synonym_dcid = row['dcid'] + '_' + synonym
+            df.loc[index, 'synonyms_dcid'] = synonym_dcid
+        df = handle_placeholder_synonym_cases(df, index, row)
+        check_for_illegal_charc(df.loc[index, 'synonyms_dcid'])
+    return df
+
+
+def format_dcid_and_id(df):
+    # generate the dcid, indicate the id value, and format node names
+    df['url'] = df['id']
+    df['dcid'] = 'bio/SO_' + df['id'].str.split('/SO_').str[1]
+    df['identifier'] = 'SO:' + df['id'].str.split('/SO_').str[1]
+    df['name'] = df['lbl'].str.replace('_', ' ')
+    df = format_synonym_dcid(df)
+    return df.drop(['id'], axis=1)
+
+
+def is_not_none(x):
+    # check if a value exists
+    if pd.isna(x):
+        return False
+    return True
+
+
+def convert_list_to_string(df, col):
+    # convert a property that holds a list to an appropriately formatted
+    # string value
+    for index, row in df.iterrows():
+        v = row[col]
+        if isinstance(v, list):
+            v = ','.join(v)
+        df.loc[index, col] = v
+    return df
+
+
+def format_list_as_text_strings(df, col_names):
+    """
+    Converts missing values to numpy nan value and adds outside quotes
+    to strings (excluding np.nan). Applies change to columns specified in col_names.
+    """
+    for col in col_names:
+        df = convert_list_to_string(df, col)
+        df[col] = df[col].str.rstrip()  # Remove trailing whitespace
+        df[col] = df[col].replace([''], np.nan)  # replace missing values with np.nan
+
+        # Quote only string values
+        mask = df[col].apply(is_not_none)
+        df.loc[mask, col] = '"' + df.loc[mask, col].astype(str) + '"'
+
+    return df
+
+
+def format_text_strings(df, col_names):
+    """
+    Converts missing values to numpy nan value and adds outside quotes
+    to strings (excluding np.nan). Applies change to columns specified in col_names.
+    """
+    for col in col_names:
+        df[col] = df[col].str.rstrip()  # Remove trailing whitespace
+        df[col] = df[col].replace([''], np.nan)  # replace missing values with np.nan
+
+        # Quote only string values
+        mask = df[col].apply(is_not_none)
+        df.loc[mask, col] = df.loc[mask, col].apply(lambda x: f'"{x}"')
+
+    return df
+
+
+def check_if_url(list_id):
+    # if the id contains the start of a website url, return the identifier
+    # with 'HTTP' as the source; otherwise return the identifier as is and
+    # the source as indicated before the first ':'
+    if list_id[0].startswith('http'):
+        identifier = (':').join(list_id)
+        source = 'HTTP'
+    elif len(list_id) > 1 and (list_id[1].startswith('www.')
+                               or list_id[1].startswith('http')):
+        identifier = (':').join(list_id[1:])
+        source = 'HTTP'
+    else:
+        identifier = (':').join(list_id[0:])
+        source = list_id[0].upper()
+    return identifier, source
+
+
+def extract_source_id(item):
+    # identify whether the value is a link; if it is, format it appropriately
+    # (the identifier is returned as a single-item list for the caller's join)
+    list_id = item.split(':')
+    identifier, source = check_if_url(list_id)
+    return [identifier], source
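+
+
+# Illustrative examples of the source/id extraction above (hypothetical):
+#     'PMID:12345678'          -> (['PMID:12345678'], 'PMID')
+#     'http://example.org/x:1' -> (['http://example.org/x:1'], 'HTTP')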
+
+
+def check_if_col(df, col):
+    # check if a column exists in the df
+    # if not, initiate a new column full of empty strings
+    if col in df.columns:
+        return df
+    df[col] = ''
+    return df
+
+
+def separate_xrefs_cols(value, index, col, df):
+    # write a cross reference to its own column for the given source
+    df = check_if_col(df, col)
+    df.loc[index, col] = value
+    return df
+
+
+def determine_source(value):
+    # identify the source as the string prior to the ':'
+    if value.startswith('http'):
+        return 'HTTP'
+    source = value.split(':')[0]
+    source = source.upper()
+    return source
+
+
+def expand_xrefs(df, col):
+    # for each database that provides a cross reference,
+    # write the identifier values to their own columns
+    for index, row in df.iterrows():
+        l = row[col]
+        for d in l:
+            value = d['val']
+            source = 'xrefs_' + determine_source(value)
+            df = separate_xrefs_cols(value, index, source, df)
+    df = df.drop([col], axis=1)
+    return df
+
+
+def format_edges(data):
+    # get the edges from the json file and format them as links between
+    # parent and child nodes
+    edges = data['graphs'][0]['edges']
+    df = pd.DataFrame(edges)
+    df['dcid'] = 'bio/SO_' + df['sub'].str.split('/SO_').str[1]
+    df['parent'] = 'bio/SO_' + df['obj'].str.split('/SO_').str[1]
+    df = df.drop(['sub', 'pred', 'obj'], axis=1)
+    return df
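+
+
+# Illustrative example of the edge formatting above (hypothetical edge):
+#     {'sub': '.../SO_0000101', 'pred': 'is_a', 'obj': '.../SO_0000110'}
+# becomes the row dcid='bio/SO_0000101', parent='bio/SO_0000110'.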
+
+
+def expand_definition(df, col):
+    # expand the definition dictionary {'val': '', 'xrefs': []} into the
+    # definition text plus one column per cross reference source
+    for index, row in df.iterrows():
+        if len(row[col]) == 0:
+            continue
+        l = row[col]['xrefs']
+        for item in l:
+            list_items, source = extract_source_id(item)
+            df = check_if_col(df, source)
+            df.loc[index, source] = (',').join(list_items)
+        df.loc[index, col] = row[col]['val'].strip('"')
+    df[col] = df[col].str.replace('\n', ' ')  # remove new line characters
+    return df
+
+
+def expand_columns(df):
+    # Expand the nested 'meta' dictionary into individual columns
+    df = expand_meta(df)
+    # expand other columns
+    df = expand_definition(df, 'definition')
+    df = expand_properties(df, 'basicPropertyValues')
+    df = expand_synonyms(df, 'synonyms')
+    df = expand_xrefs(df, 'xrefs')
+    # drop uninformative xrefs
+    df = df.drop(LIST_DROP_MIN_USED_SOURCES, axis=1)
+    df = df.drop(LIST_DROP_UNKNOWN_SOURCES, axis=1)
+    # reset indices
+    df = df.reset_index()
+    return df
+
+
+def format_subsets(df, col):
+    # convert subset properties to SequenceOntologySubset enums
+    for index, row in df.iterrows():
+        l = []
+        for i in row[col]:
+            subset = DICT_SUBSETS[i]
+            l.append(subset)
+        subset_value = (',').join(l)
+        df.loc[index, col] = subset_value
+    return df
+
+
+def format_columns(df):
+    # format dcid, id, and name
+    df = format_dcid_and_id(df)
+    # format date column
+    df['creation_date'] = df['creation_date'].str[:10]
+    # format subsets column
+    df = format_subsets(df, 'subsets')
+    # set the Sequence Ontology ID as the name if no name is provided
+    df.loc[df['name'] == '', 'name'] = df['identifier']
+    # drop unnecessary columns
+    df.drop(['type', 'lbl', 'hasOBONamespace'], axis=1, inplace=True)
+    # convert list columns to string values
+    col_names = ['comments']
+    df = format_list_as_text_strings(df, col_names)
+    df['comments'] = df['comments'].str.replace('\n', ' ')  # remove new line characters
+    # put quotes around string values
+    col_names = ['name', 'definition', 'PMID', 'HTTP', 'DOI',
+                 'ISBN', 'POMBASE', 'ISSN', 'CHEBI', 'PMC', 'created_by',
+                 'hasAlternativeId', 'consider', 'synonyms_value', 'synonyms_source',
+                 'xrefs_HTTP', 'xrefs_LOINC', 'xrefs_MOD', 'xrefs_RNAMOD',
+                 'xrefs_DOI', 'xrefs_WIKIPEDIA', 'xrefs_PMID', 'url', 'identifier']
+    df = format_text_strings(df, col_names)
+    return df
+
+
+def format_nodes(data):
+    # format information about each SequenceOntologyTerm node
+    # by parsing the json file
+    nodes = data['graphs'][0]['nodes']
+    df = pd.DataFrame(nodes)  # create pandas df
+    df = df[df['type'] == 'CLASS']  # limit to class values
+    # expand each feature into individual columns
+    df = expand_columns(df)
+    # data clean and format columns
+    df = format_columns(df)
+    return df
+
+
+def read_json_to_df(file_input):
+    # convert the json to a df with each property represented
+    # in its own separate column
+    with open(file_input, 'r') as file:  # load json
+        data = json.load(file)
+    # extract edges
+    df_edges = format_edges(data)
+    # extract nodes
+    df_nodes = format_nodes(data)
+    # merge the nodes and edges dfs together
+    df = pd.merge(df_nodes, df_edges, on='dcid', how='outer')
+    # drop duplicates and replace missing values
+    df.drop_duplicates(inplace=True)
+    df = df.fillna('')
+    # drop index
+    df = df.drop(['index'], axis=1)
+    return df
+
+
+def main():
+    file_input = sys.argv[1]  # read in file
+    file_output = sys.argv[2]
+    df = read_json_to_df(file_input)  # convert json to pandas df
+    df.to_csv(file_output, doublequote=False, escapechar='\\', index=False)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/run.sh b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/run.sh
new file mode 100644
index 0000000000..8ef36a143d
--- /dev/null
+++ b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/scripts/run.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#      http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+# Author: Samantha Piekos
+# Date: 05/10/2024
+# Name: run
+# Description: This file formats The Sequence Ontology json file into a
+# CSV for import into the Data Commons knowledge graph. It preserves the
+# parent-child relationships between nodes, node information, and
+# alternative identifiers.
+
+mkdir -p CSV
+
+# convert The Sequence Ontology json into a formatted CSV
+python3 scripts/format_the_sequence_ontology.py input/so.json CSV/the_sequence_ontology.csv
This assumes +that the user has Java Remote Environment (JRE) installed, which is needed +to locally install Data Commons test tool (v. 0.1-alpha.1k) prior to +calling the tool to evaluate tmcf + CSV pairs. +""" + +#!/bin/bash + +# download data commons java test tool version 0.1-alpha.1k +mkdir -p tmp; cd tmp +wget https://github.com/datacommonsorg/import/releases/download/0.1-alpha.1k/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar +cd .. + +# run tests on desc file csv + tmcf pairs +java -jar tmp/datacommons-import-tool-0.1-alpha.1-jar-with-dependencies.jar lint tMCF/the_sequence_ontology.tmcf CSV/the_sequence_ontology.csv *.mcf +mv dc_generated the_sequence_ontology diff --git a/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/tMCF/the_sequence_ontology.tmcf b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/tMCF/the_sequence_ontology.tmcf new file mode 100644 index 0000000000..485d776af5 --- /dev/null +++ b/scripts/biomedical/Gene_Ontology_Consortium/The_Sequence_Ontology/tMCF/the_sequence_ontology.tmcf @@ -0,0 +1,40 @@ +Node: E:the_sequence_ontology->E1 +typeOf: dcs:SequenceOntologyTerm +dcid: C:the_sequence_ontology->parent + +Node: E:the_sequence_ontology->E2 +typeOf: dcs:SequenceOntologyTermSynonym +dcid: C:the_sequence_ontology->synonyms_dcid +name: C:the_sequence_ontology->synonyms_value +synonymType: C:the_sequence_ontology->synonyms_pred +source: C:the_sequence_ontology->synonyms_source + +Node: E:the_sequence_ontology->E3 +typeOf: dcs:SequenceOntologyTerm +dcid: C:the_sequence_ontology->dcid +name: C:the_sequence_ontology->name +chebiID: C:the_sequence_ontology->CHEBI +comment: C:the_sequence_ontology->comments +dateCreated: C:the_sequence_ontology->creation_date +description: C:the_sequence_ontology->definition +descriptionUrl: C:the_sequence_ontology->HTTP +digitalObjectIdentifier: C:the_sequence_ontology->DOI +digitalObjectIdentifier: C:the_sequence_ontology->xrefs_DOI +identifier: C:the_sequence_ontology->identifier +internationalStandardBookNumber: C:the_sequence_ontology->ISBN +internationalStandardSerialNumber: C:the_sequence_ontology->ISSN +isDeprecated: C:the_sequence_ontology->deprecated +loincCode: C:the_sequence_ontology->xrefs_LOINC +pomBaseId: C:the_sequence_ontology->POMBASE +proteinModificationOntology: C:the_sequence_ontology->xrefs_MOD +pubMedCentralId: C:the_sequence_ontology->PMC +pubMedID: C:the_sequence_ontology->xrefs_PMID +rnaModId: C:the_sequence_ontology->xrefs_RNAMOD +sequenceOntologySubset: C:the_sequence_ontology->subsets +specializationOf: E:the_sequence_ontology->E1 +submitter: C:the_sequence_ontology->created_by +synonym: E:the_sequence_ontology->E2 +synonym: C:the_sequence_ontology->consider +synonym: C:the_sequence_ontology->hasAlternativeId +url: C:the_sequence_ontology->url +url: C:the_sequence_ontology->xrefs_HTTP