Skip to content

Releases: floratos-lab/hipc-dashboard-pipeline

HIPC Dashboard pipeline v1.5.0

24 Apr 20:47
Compare
Choose a tag to compare

Changes in version 1.5.0 (Data)

  • Process curations generated by the Bob Hancock group, for the COVID-19 Literature
    Annotation Project, representing SARS-CoV-2 infection signatures. The following
    32 PMIDs were processed:
    32217835, 32307550, 32320677, 32514047, 32632085, 32650275, 32717743, 32788292,
    32794454, 32816392, 32826343, 32829467, 32941618, 32958614, 33013872, 33028979,
    33033248, 33060840, 33119547, 33187979, 33195414, 33208459, 33239683, 33391280,
    33425248, 33478949, 33546489, 33625796, 33785765, 33846275, 33937096, 34099652
  • Update the curation templates under docs\ with the addition of new values for
    some of the pick lists.

Changes in version 1.5.0 (Pipeline)

  • A large scope of code is cleaned up and re-organized.
    -- Especially, the main script is separated into four. This make the processing and output
    fully reproducible. Previously, various control parameters are changed manually
    to create all output.
    -- Many small errors and corner cases are fixed. Some of the issues showed up only with new data.
    -- Text encoding handling is corrected.
  • Move PMID lookup handling to separate routine to support different scenarios.
  • No longer accept space as a valid separator character for gene symbols.
  • Allow to complete cell-type runs even if some cell-types not mapped.
  • Modify the observation summary template for cell-type frequency signatures, to improve
    readability.

HIPC Dashboard pipeline v1.4.0

08 Mar 22:46
Compare
Choose a tag to compare

Changes in version 1.4.0 (Data)

  • Add additional gene expression signatures from Arunachalam et al., 2020

Changes in version 1.4.0 (Pipeline)

  • Add splitting on tissue_type_term_id (split 4).
    -- Split tissue_type_term_ids get individual columns in Dashboard submission files
  • Add splitting on comparison field entries (split 5)
  • Adapt code to new directory structure in github repository
  • Change "recreated_templates" to "standardized_curation_templates" in output file names
  • Add writing of standardized, fully denormalized versions of data for easy reuse
  • Rename submission id variables in code and standardized files for clarity
    -- subm_obs_id -> sig_subm_id, uniq_obs_id -> sig_row_id
  • Revise messages in code for clarity
  • Separate writing tab-delimited and csv versions of standardized submission files to different, specifiable directories
  • Fix problem with logging using log_no_valid_symbol_vs_pmid()
  • "pmid:" tags now removed immediately

HIPC Dashboard pipeline v1.3.0

09 Feb 17:11
418d1c5
Compare
Choose a tag to compare

HIPC Dashboard Pipeline

Changes in version 1.3.0

  • Support new exposure type: infection (COVID-19 studies)
  • For subject type "cell_subset", use cell ontology ids instead of display names.
    This concerns tissue types (column tissue_type_term_id) and cell-type response components (column response_component_id)
  • For subject type "pathogen", use NCBI taxonomy ids instead of display names
    (columns target_pathogen_taxonid for vaccine, exposure_material_id for infection).
  • For tissue types, support parsing out cell ontology IDs from curator-oriented data validation pulldown values added to templates:
    -- The pulldown value is shown first. Just the term is also permitted:
    CL:0000624 (CD4-positive, alpha-beta T cell)
    CL:0000624
  • For target pathogens in the vaccine templates, support parsing out NCBI taxonomy IDs
    from curator-oriented data validation pulldown values added to templates:
    -- The pulldown value is shown first. The other values are also accepted:
    “ncbi_taxid:11090 (Yellow fever virus 17D)”
    “ncbi_taxid:11090”
    “11090” (defaults to NCBI taxonomy id)
  • In the target_pathogen_taxonid column of vaccine templates, for influenza vaccine entries,
    allow use of lookup tags for taxonomy ids. The tags will be replaced by the actual viral
    components of the vaccine, looked up in the file "vaccine_years.txt. Multiple entries are allowed.
    -- For example
    “influ:2008” will be substituted with the several actual viral components of the vaccine for year 2008.
    “influ:2007; influ:2008; influ:2009” will be substituted with the union of the three year's viral components.
  • Add ability to return vaccine pathogen component NCBI taxonomy ids to lookup function.
  • Gene lookup using the NCBI data file now also checks the nomenclature_authority column (HGNC).
    If a query symbol is not found in the official NCBI symbols or aliases,
    but IS found in the nomenclature_authority column, the NCBI official symbol is substituted.

HIPC data/curation sheet-related changes

Changes in version 1.3.0

  • Add two infection (COVID-19) studies:
    -- Odak et al., 2020
    -- Arunachalam et al., 2020
  • Create pathogen and cell_subset lookup-values for curation sheets. Format is e.g.
    -- “ncbi_taxid:11090 (Yellow fever virus 17D)”
    -- "CL:2000001 (peripheral blood mononuclear cell)"
  • The following formats are also accepted
    -- “ncbi_taxid:11090”
    -- "CL:2000001"
    -- “11090” (tissues only, defaults to NCBI taxonomy id)
  • Add cell ontology and NCBI taxonomy IDs to existing data as needed.

HIPC Dashboard pipeline v1.2.1

28 Sep 02:43
95b783b
Compare
Choose a tag to compare

HIPC Dashboard Pipeline

Changes in version 1.2.1

  • update some signatures to clarify comparison column entries
  • update observation summary template to match change to Dashboard special character handling (left-paren, ampersand)

HIPC Dashboard pipeline v1.2.0

23 Sep 19:40
04b629f
Compare
Choose a tag to compare

HIPC Dashboard Pipeline

Changes in version 1.2.0

  • Update column headers:
    -- baseline_time to baseline_time_event
    -- exposure_material to exposure_material_id
    -- exposure_material_text to exposure_material
    -- publication_reference to publication_reference_id
    -- submission_date to curation_date
    -- extra_comments to curator_comments
  • Remove certain columns from templates:
    -- remove submission_name and template_name, as values are generated in code
    -- remove addntl_time_point_units, group1 and group0 because not in use
  • Merge certain columns into cohort and remove from templates
    -- merge subgroup into cohort, remove subgroup and update code
    -- merge age_group into cohort, remove age_group
  • New columns added by pipeline now added at end of templates rather than at fixed locations
  • Use tab-delimited files rather than Excel for all input and public output files
  • Change code to extract pubmed publish date (may be epub for some),
    and remove more complex code to construct pubmed-based print publish date
  • Use publication_year column for actual print year.
    Previously retrieved from Pubmed as PubDate option but value not always present.
  • No longer reconstruct article abstract from xml.
    It preserved special characters that are not wanted.
    Only used by mSigDB code.
  • Add time_point, time_point_units, and baseline_time_event to observation summary statements to disambiguate.
  • Add additional_exposure_material to observation summary statements if it has an entry (most do not).
  • Template data cleaning:
    -- change all ontology references to type:value format
    -- remove confirmed redundant subgroup entries
  • Add changelog.txt and HIPC_Dashboard_curation_template_fields.pdf to project.
    © 2021 GitHub, Inc.

HIPC Dashboard pipeline v1.1.0

18 Mar 01:42
Compare
Choose a tag to compare

This release includes further cleaning of the curated data and also changes to the pipeline code, including:

  • changes to how descriptive terms are divided between the "response_behavior" and "comparison" fields. Terms such as "correlated" now are included in the "response_behavior", e.g. "positively correlated"
  • add choose_joining_preposition() for constructing observation summary statements
  • add trimws() on input
  • add check_response_components_overlap() - quality control to check for the case where a gene list that should be split between two rows, e.g. up and down regulated genes, ends up with one half in one row but the entire list in the other row

HIPC Dashboard pipeline v1.0.1

12 Feb 23:41
Compare
Choose a tag to compare

Add handling the case where the first author listed by a journal is a consortium, which is coded differently in the pubmed xml download. Existing library function failed on this PMID (28842433). Add periods after journal abbreviations. Remove periods at end of some comparisons, so that they slot into observation summary statements smoothly.

HIPC Dashboard pipeline v1.0.0

11 Feb 17:19
Compare
Choose a tag to compare

First production release of HIPC Dashboard pipeline code and data.