Diseases at jensen lab #1107

spiekos · 2024-11-02T01:33:23Z

This adds all the documentation regarding the DISEASES by JensenLab import. This supersedes PR #998.

update `associationSource` to `associationType` and the names for the associated enum appropriately; update output csv file names; update checks for icd10 code dcids and update references to these links

change csv and tmcf file names to `experiment.*` and update `associationSource` to `associationType`

Update property names

fix links to associationType values

Update property names and name of referencing csv + tmcf file pair

….tmcf Update property names and the naming of the csv + tmcf pair files

fix links to NonCodingRNATypeEnum

…ng.tmcf Update property names and the names for the tmcf and csv file pair

…xtMining.tmcf Update property names and the file names for the csv + tmcf pair

update tmcf filepaths

add link to run.sh file

update output csv file names

Add commands to combine codingGenes-textMining csv files into a single csv

fix malformed tmcf line

fix link bug

update the script so that it downloads, cleans and formats the data in CSV, and removes the original files

…l.tmcf

…xtMining.tmcf

…nual.tmcf

…ining.tmcf

Add additional notes and caveats

google-cla · 2024-11-02T01:33:28Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

chejennifer · 2024-11-04T18:48:10Z

scripts/biomedical/diseasesAtJensenLab/README.md

+
+As you can see there is a cascading representation of the associated ICD10 codes of 'ICD10:N', 'ICD10:N0', 'ICD10:N04' and a second tree of 'ICD10:N3', 'ICD10:N39', 'ICD10:399'. 'ICD10:N', 'ICD10:N0', 'ICD10:N3', and 'ICD10:root' all do not correspond to any ICD10 codes and thus these lines were removed along with any other line in which an ICD10 code had one or two digits or was root following 'ICD10:'. Then for this particular association, the lowest unique tree leaves were taken in as associations with the Gene 'HSP86'. In this case that is 'ICD10:N04' and 'ICD10:N399'. The remaining line with 'ICD10:N39' was discarded as being a less specific referal than 'ICD10:N399'. Finally, the ICD10 codes were reformatted as necessary so that they follow the proper convention. There is a '.' following the regex string of [A-Z][0-9][0-9]. So, codes like 'ICD10:N399' were converted into the appropriate format of 'ICD10:N39.9' through insertion of the missing '.'.
+
+The DISEASES datasource is composed of three datasets that were generated using three distinct methods: experiment (experimental data), knowledge (manual curation), and textmining. The specific dataset from which the data is from (i.e. 'experiments', 'knowledge', or 'textmining') is indicated by the providence associated with each property value in the corresponding Disease, DiseaseGeneAssociation, Gene, or ICD10Code node. Also note that for each DiseaseGeneAssociation the link is between a Gene and a Disease represented by either a DOID or ICD10Code. Regardless of which the link is stored in the "diseaseID" property, which points to the corresponding Disease (in cases of DOID) or ICD10Code (in cases of ICD10 code) nodes.


Is "providence" supposed to be "provenance"?
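
For reference, the ICD10 filtering described in the first quoted paragraph can be sketched roughly as below. This is a minimal illustration, not the code from the PR: the function name `clean_icd10_codes` is hypothetical, and it assumes that "one or two digits" means one or two characters after the `ICD10:` prefix.

```python
def clean_icd10_codes(codes):
    """Filter and format the ICD10 codes associated with a single gene.

    Hypothetical helper; the import's real script may be organized differently.
    """
    prefix = "ICD10:"
    # Strip the prefix and de-duplicate while preserving input order.
    suffixes = list(dict.fromkeys(
        c[len(prefix):] for c in codes if c.startswith(prefix)
    ))
    # Drop 'root' and the one- or two-character stubs (e.g. 'N', 'N0').
    suffixes = [s for s in suffixes if s != "root" and len(s) >= 3]
    # Keep only the tree leaves: codes that are not a prefix of a longer code.
    leaves = [
        s for s in suffixes
        if not any(other != s and other.startswith(s) for other in suffixes)
    ]
    # Insert the '.' after the [A-Z][0-9][0-9] stem when more characters follow.
    formatted = [s if len(s) == 3 else s[:3] + "." + s[3:] for s in leaves]
    return [prefix + s for s in formatted]


example = [
    "ICD10:root", "ICD10:N", "ICD10:N0", "ICD10:N04",
    "ICD10:N3", "ICD10:N39", "ICD10:N399",
]
print(clean_icd10_codes(example))
```

With the README's HSP86 example as input, only `ICD10:N04` and the reformatted `ICD10:N39.9` remain, which matches the outcome the paragraph describes.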

chejennifer · 2024-11-04T18:50:25Z

scripts/biomedical/diseasesAtJensenLab/README.md

+
+### dcid Generation
+
+Dcids for DiseaseGeneAssociation nodes were generated as follow either:


Remove "follow"?

chejennifer · 2024-11-04T18:51:26Z

scripts/biomedical/diseasesAtJensenLab/README.md

+### dcid Generation
+
+Dcids for DiseaseGeneAssociation nodes were generated as follow either:
+'bio/DOID_<trailing_DOID>_<geneSymbol>


Please fix the formatting; everything is on one line when viewing the rendered markdown.

chejennifer · 2024-11-04T18:52:11Z

scripts/biomedical/diseasesAtJensenLab/README.md

+Dcids for DiseaseGeneAssociation nodes were generated as follow either:
+'bio/DOID_<trailing_DOID>_<geneSymbol>
+'bio/ICD10_<trailing_ICD10Code>_<geneSymbol>
+where the <DOID> and <trailing_ICD10Code> represent the id following the ':', <geneSymbol> represents the Gene's gene symbol. For example: `bio/DOID_0050177_SEMA3F` and `bio/DOID_0050736_SEMA3F`.


Should this be `<trailing_DOID>`, to be consistent with line 54?
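
To make the intended pattern concrete, here is a small sketch of how such a dcid could be assembled. The helper name `association_dcid` is hypothetical and not taken from the PR, and it assumes the trailing id is used exactly as it appears after the ':' (whether ICD10 codes keep their '.' in the dcid is not stated in the quoted text).

```python
def association_dcid(disease_id, gene_symbol):
    """Build a DiseaseGeneAssociation dcid from a disease id and a gene symbol.

    Hypothetical helper illustrating the README's pattern:
    bio/<source>_<trailing id>_<geneSymbol>
    """
    # Split 'DOID:0050177' into the source ('DOID') and the trailing id.
    source, sep, trailing = disease_id.partition(":")
    if sep != ":" or source not in ("DOID", "ICD10") or not trailing:
        raise ValueError(f"unexpected disease id: {disease_id!r}")
    return f"bio/{source}_{trailing}_{gene_symbol}"


print(association_dcid("DOID:0050177", "SEMA3F"))
```

For the README's own example, `DOID:0050177` with gene symbol `SEMA3F` yields `bio/DOID_0050177_SEMA3F`.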

chejennifer · 2024-11-04T18:58:58Z

scripts/biomedical/diseasesAtJensenLab/README.md

+Generate the cleaned CSVs including splitting into seperate non-coding and coding genes into seperate csv files for each input file:
+
+```bash
+sh run.sh


nit: naming this script "run" is slightly confusing. When a script is called "run", I would expect it to do everything and to be the only script I need to run. Maybe call it "process", "clean", "generate_csvs", or something similarly specific?

chejennifer · 2024-11-04T23:46:48Z