PharmGKB Data Import #1056
Open
spiekos wants to merge 25 commits into datacommonsorg:master from spiekos:PharmGKB
Add README.md file that explains the import, including an introduction to the datasets, a description of the import, the file artifacts, and the import procedure
Add initial draft of README.md from Suhana
Add tMCF and script files that support import of primary and relationships data from PharmGKB
Add manual mapping file between PharmGKB IDs and MeSH Descriptor IDs for diseases that are not retrieved via name matching to existing MeSH Descriptor nodes using the Data Commons API
Add the mapping files between the PharmGKB IDs of chemicals and drugs and the dcids of the corresponding existing nodes in Data Commons. These files are generated as output by the `format_chemicals.py` and `format_drugs.py` scripts, respectively.
Update table of contents including adding new sections, update the import procedure, script files and links, and the Notes and Caveat subsection
Fill in Dataset Documentation and Relevant Links subsection
Fix superscripting
Add About the Dataset, Download Data, and Dataset Overview subsections
Fill in the Artifacts subsection
Fill in schema overview subsection
update paths to tmcf files and the new schema subsection
update dcid Generation subsection for phenotypes
update New Schema subsection
update information regarding tests
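The commits above describe a two-step lookup for diseases: name matching against existing MeSH Descriptor nodes first, then a manual mapping file for PharmGKB IDs that name matching does not retrieve. A minimal sketch of that fallback logic follows. It is not the code from this PR: the function name, the dictionary arguments, and the identifiers in the sample data are all hypothetical, and a plain dictionary stands in for the Data Commons API lookup.

```python
from typing import Dict, Optional


def resolve_mesh_dcid(
    pharmgkb_id: str,
    disease_name: str,
    name_to_dcid: Dict[str, str],
    manual_map: Dict[str, str],
) -> Optional[str]:
    """Return a MeSH Descriptor dcid for a PharmGKB disease, or None."""
    # Step 1: case-insensitive name match against known MeSH Descriptor names.
    # (Assumption: a dict stands in for the real name lookup.)
    key = disease_name.strip().lower()
    if key in name_to_dcid:
        return name_to_dcid[key]
    # Step 2: fall back to the hand-curated PharmGKB ID -> dcid mapping.
    return manual_map.get(pharmgkb_id)


# Made-up placeholder data, not real PharmGKB or MeSH identifiers.
NAME_TO_DCID = {"example disease a": "bio/EXAMPLE_A"}
MANUAL_MAP = {"PA_EXAMPLE_2": "bio/EXAMPLE_B"}

if __name__ == "__main__":
    for pid, name in [
        ("PA_EXAMPLE_1", "Example Disease A"),  # resolved by name
        ("PA_EXAMPLE_2", "Unmatched Name"),     # resolved by manual mapping
        ("PA_EXAMPLE_3", "Unknown"),            # unresolved
    ]:
        print(pid, resolve_mesh_dcid(pid, name, NAME_TO_DCID, MANUAL_MAP))
```

Returning `None` for unresolved IDs keeps the gaps visible, so they can be reviewed and added to the manual mapping file.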
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 17, 2024

…ng and formatting of the PharmGKB CSV+tMCF pairs. It also breaks out the phenotypes to distinguish that one is a MeSHQualifier and seven are MeSHSupplementaryConceptRecords, unlike the rest of the phenotypes which are MeSHDescriptors. Therefore these were separated into 3 CSV+tMCF pairs. In particular, the links to the enums and between entity types were fixed. This was done by initializing all nodes referenced and then pointing to them within the tMCF. Because of this any existence missing errors in the json reports can be ignored.

The changes to the scripts, tMCF files, and documentation (README.md) for this import are part of GitHub PR 1056 datacommonsorg/data#1056

Schema Changes:
- Add CPICLevelEnum, DosageGuidelineSourceCpicNoRecommendation, DrugTypeEnum, PGxLevelEnum, PharmacogeneticAssociationEnum.
- Add properties for clinicalAnnotationCount, clinicalAnnotationCountLevel1_2, clinicalGuidelineAnnotationCount, dosageGuideline, drugHasPrescribingInfo, drugLabelAnnotationCount, drugType, fdaTopPharmacogeneticLevel, geneticVariantAnnotationCount, hasCpicDosingGuideline, hasGenomicCoordinates, hasGeneticVariantAnnotation, hasPrescribingInfo, medicalDictionaryForRegulatoryActivitiesId, metabolicPathwayCount, pharmageneticAssociation, topClinicalAnnotationLevel, topCpicLevel, topPharmacogeneticLevel, veryImportantPharmacogeneCount.
- Remove properties for fdaTopPGxLevel, mintID, nationalClinicalTrialNumber, nationalDrugCode, nationalDrugFileReferenceTerminologyCode, neuroMabID, patentID, pharmGkbClinicalAnnotationCount, pharmGkbPathwayCount, pkgbTags.

PiperOrigin-RevId: 653311391
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 18, 2024
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 23, 2024
copybara-service bot pushed a commit to datacommonsorg/schema that referenced this pull request Jul 23, 2024

…ng and formatting of the PharmGKB CSV+tMCF pairs. It also breaks out the phenotypes to distinguish that one is a MeSHQualifier and seven are MeSHSupplementaryConceptRecords, unlike the rest of the phenotypes which are MeSHDescriptors. Therefore these were separated into 3 CSV+tMCF pairs. In particular, the links to the enums and between entity types were fixed. This was done by initializing all nodes referenced and then pointing to them within the tMCF. Because of this any existence missing errors in the json reports can be ignored.

The changes to the scripts, tMCF files, and documentation (README.md) for this import are part of GitHub PR 1056 datacommonsorg/data#1056

Schema Changes:
- Add CPICLevelEnum, DosageGuidelineSourceCpicNoRecommendation, DrugTypeEnum, PGxLevelEnum, PharmacogeneticAssociationEnum.
- Add properties for clinicalAnnotationCount, clinicalAnnotationCountLevel1_2, clinicalGuidelineAnnotationCount, dosageGuideline, drugHasPrescribingInfo, drugLabelAnnotationCount, drugType, fdaTopPharmacogeneticLevel, geneticVariantAnnotationCount, hasCpicDosingGuideline, hasGenomicCoordinates, hasGeneticVariantAnnotation, hasPrescribingInfo, medicalDictionaryForRegulatoryActivitiesId, metabolicPathwayCount, pharmageneticAssociation, topClinicalAnnotationLevel, topCpicLevel, topPharmacogeneticLevel, veryImportantPharmacogeneCount.
- Remove properties for fdaTopPGxLevel, mintID, nationalClinicalTrialNumber, nationalDrugCode, nationalDrugFileReferenceTerminologyCode, neuroMabID, patentID, pharmGkbClinicalAnnotationCount, pharmGkbPathwayCount, pkgbTags.

PiperOrigin-RevId: 655002332
Add information about the illegal dcid character @ being replaced with _Cluster when generating Gene dcids.
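The replacement described in this commit can be sketched as below. This is an illustration only and not the script from this PR: the function names and the `bio/` prefix are assumptions, and `HOXA@` is used purely as an example of a symbol containing `@`. Only the `@` to `_Cluster` substitution comes from the commit message.

```python
def sanitize_gene_symbol(symbol: str) -> str:
    """Replace the dcid-illegal '@' character with '_Cluster'."""
    return symbol.replace("@", "_Cluster")


def gene_dcid(symbol: str, prefix: str = "bio/") -> str:
    """Build a Gene dcid from a symbol.

    Assumption: the 'bio/' prefix is hypothetical and may not match
    the prefix used by the actual import scripts.
    """
    return prefix + sanitize_gene_symbol(symbol)


if __name__ == "__main__":
    print(gene_dcid("HOXA@"))   # symbol containing '@'
    print(gene_dcid("CYP2D6"))  # symbol without '@' is left unchanged
```

Symbols without `@` pass through unchanged, so the same function can be applied to every gene row.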
spiekos requested review from dwnoble and clincoln8 and removed request for pradh October 16, 2024 05:28
Update gene_var.tmcf to assign Entity2 as Variant and Entity1 as Gene in the output CSV file.
Fix references for entity1 vs entity2 in output file so that they correctly map to the GeneticVariant and Gene entities
Fix hierarchical ontology for classes used in import
This PR documents the PharmGKB data import: it introduces the data and describes the import process, the new schema, the scripts, and the tMCF files.