HIVE Software Engineering Principles
-
Our software development teams use a multi-institutional Agile Scrum approach to create HuBMAP technologies, which are deployed as microservices in a hybrid cloud. We run daily distributed stand-ups and two-week sprint cycles. This enables continuous deployment of new features and enhancements under permissive open source licenses.
-
The HuBMAP Portal principally uses the following core technologies, frameworks, and languages: Globus (identity federation, data flow), Python (APIs), JavaScript (UI), Neo4j (graph databases), Docker (one container per microservice), and Airflow (workflows), among others. Core storage and other high-performance services run locally at the Pittsburgh Supercomputing Center, whereas high-availability services run on Amazon Web Services.
-
Software issues, enhancements, and feature requests are tracked on a GitHub issues board that is populated directly by developers and by user feedback received through the help desk.
-
HuBMAP technology documentation resides in the Portal documentation area as well as within HuBMAP GitHub repositories. Our API documentation is also viewable on SmartAPI. We manage our documentation using Markdown.
-
HuBMAP technologies use a microservices architecture driven by the API Gateway, Provenance services, and Pipeline Container Orchestration.
-
We maintain dev, test, and production instances of most HuBMAP systems. In some areas we use continuous integration with Travis CI or GitHub CI.
HuBMAP Data Ingest
-
HuBMAP HIVE is responsible for producing and managing data ingest processes and associated software in collaboration with the Data Providers. HuBMAP Data Providers are responsible for producing data and metadata in collaboration with the HIVE. These processes are rapidly evolving into scalable pipelines.
-
The core ingest software and UI include the Data ingest tool (data and metadata, sample, assay, antibody report, and contributor upload), Manual dataset ingest utilities, the Workflow management + Common Workflow Language tool, individual data pipelines, and common coordinate framework / spatial registration via the RUI, with federated identity management and file transfer via Globus.
-
HuBMAP metadata is ingested into a Dockerized Neo4j graph database for Provenance as well as into various function-specific relational and NoSQL databases.
-
Data providers submit data using a combination of web registration forms, the tools noted above, and registration of experimental and sample protocols at protocols.io. Metadata is submitted through the ingest process as tab-separated values (.tsv) files containing sample, assay, antibody, and contributor metadata that meet HuBMAP specifications.
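As a rough illustration of what a pre-submission check on such a metadata file could look like, the sketch below reads a TSV and reports missing columns and empty required values. The column names in `REQUIRED_COLUMNS` and the function `check_metadata_tsv` are hypothetical and chosen only for this example; the actual HuBMAP schemas define their own fields.

```python
import csv
import io

# Hypothetical required columns, for illustration only.
# The real HuBMAP metadata schemas define their own field names.
REQUIRED_COLUMNS = {"sample_id", "assay_type", "contributor_email"}


def check_metadata_tsv(text):
    """Return a list of human-readable problems found in TSV metadata text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = set(reader.fieldnames or [])
    problems = [
        f"missing column: {name}" for name in sorted(REQUIRED_COLUMNS - header)
    ]
    # Data rows start at 2 because row 1 is the header.
    for row_number, row in enumerate(reader, start=2):
        for name in sorted(REQUIRED_COLUMNS & header):
            if not (row.get(name) or "").strip():
                problems.append(f"row {row_number}: empty value for {name}")
    return problems
```

An empty list means the file passed these two checks; it says nothing about whether the values themselves are valid for a given assay.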
-
The UUID API forms the basis of ID generation. Data providers use the Tissue & donor registration tool to generate donor, organ, tissue sample (including spatial data), and dataset-specific identifiers that are interlinked and displayed on the Portal.
-
We accept donor data on a HIPAA-conforming Globus site. Donor data is de-identified using professional de-identification services via manual abstraction from organ procurement organizations, DICOM data, electronic health records, and other tabular data, as available.
-
Our antibody validation database and query system (pending release) includes antibody validations organized by RRID, assay, and organ. For individual datasets, data contributors will include the RRID (and related information) for each imaging channel in antibody tab-separated values files, enabling linkage of submitted antibodies and their validation reports.
-
Each HuBMAP collection, ASCT+B table, and reference object receives its own Digital Object Identifier (DOI) using HuBMAP’s DOI registration service. Each dataset will also receive its own HuBMAP DOI in the near future. We produce protocol DOIs via protocols.io and standard publication DOIs via HuBMAP Publications.
-
The CCF RUI (Registration User Interface) is a tool that supports the registration of a three-dimensional (3D) tissue block within a 3D reference organ. The registration data is used in current versions of the Common Coordinate Framework (CCF, see CCF RUI SOP, CCF RUI GitHub repository, RUI Demo) and the CCF Exploration User Interface (EUI) developed within HuBMAP. The RUI currently supports 11 organs and is written in TypeScript using libraries such as Angular 11, Deck.gl, NGXS, Angular Material, and N3.js.
-
We will also associate ontologies for reference organs, anatomical structures, cell types, and biomarkers using CCF reference objects, ASCT+B tables, and Azimuth reference objects with the data ingest items.
HuBMAP Data Validation
-
HuBMAP Data Validation is a continuously improving process that starts with defining QC/QA standards and establishing definitions for donor, sample and assay metadata. Standards, definitions, metadata schema and data directory schema are created by teams under the Data Coordination Working Group. Metadata schemas are available here, along with Excel templates with dropdowns for data entry.
-
Data providers format their data and metadata files according to the metadata and data directory schema specifications for each assay type. Required formats for metadata field input are described on the GitHub page for each assay-specific metadata schema. Data providers also include the required QA/QC assessments of their data as components of the submission.
-
Data providers receive registration and validation guidance using HuBMAP’s data submission guide (currently v1.0) as well as Ingest tool documentation.
-
HuBMAP validation tools, written in Python, ensure that data submissions conform to HuBMAP standards. The tools are shared and documented so that data providers can run many of HuBMAP’s checks on their own prior to submission. Other services include metadata submission conversion, ingest validation, and base checks (checksum, file type, etc.), as well as assay-specific checks.
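To make the idea of a base check concrete, here is a minimal sketch of a checksum and file-type check. It is not the HuBMAP validation code: the choice of SHA-256, the `ALLOWED_SUFFIXES` allow-list, and the function names are all assumptions made for this example.

```python
import hashlib
from pathlib import Path

# Hypothetical allow-list; accepted file types would depend on the assay.
ALLOWED_SUFFIXES = {".tsv", ".tiff", ".fastq", ".gz"}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def base_checks(path, expected_sha256):
    """Return a list of problems: wrong file type and/or checksum mismatch."""
    path = Path(path)
    problems = []
    if path.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"unexpected file type: {path.suffix or '(none)'}")
    # Compare case-insensitively, since hex digests may be recorded in either case.
    if sha256_of(path) != expected_sha256.lower():
        problems.append("checksum mismatch")
    return problems
```

Reading in fixed-size chunks keeps memory use flat, which matters for imaging and sequencing files that can be many gigabytes.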
-
HuBMAP staff conduct 178 (and growing) automated and manual QA/QC checks as part of the data submission & publication process. Manual validation steps are being automated as development capacity allows.
-
Prior to publication, each dataset is formally approved by the data-providing institution and one or more HIVE members. Data providers must also confirm the quality of spatial and semantic metadata using the CCF EUI.
HIVE Data Processing
-
The following HuBMAP pipelines are run by the HIVE on data from the Data Providers with their assent to gain maximum consistency and usability of final published datasets produced by HuBMAP: CODEX (Cytokit + SPRM), “Example Pipeline”, Imaging Mass Spectrometry & MxIF, sc/snATAC-seq (SnapTools, SnapATAC, and chromVAR), sc/snRNA-seq (Salmon, Scanpy, scVelo), SPRM (Imaging pipeline), Spatial Transcriptomics (Starfish).
-
Pipelines are Dockerized by the HIVE or data providers, verified by the HIVE, and integrated with the other Portal components, including these general pipeline tools: Data ingest pipeline, Mixed datatype pipeline tools, OME.TIFF Pyramid, Pipeline visualization (CWL), Pipeline deployment. These are run by the HIVE in the process of generating datasets for publication.
-
The HuBMAP pipelines generate these data types via these tools: Sequencing (FASTQ) file tools, Sequencing (snap) file tools, Visualization pre-processing, Vitessce pre-processing, Base QA pipeline, and QA metrics service (assay-specific pipeline QA metric sharing).
-
Each of the pipelines returns data and metadata to the ingest services to enable management of publication status and controlled access to metadata and datasets. Once approved, datasets are moved to published, public status by custom code that also sets the status to public for upstream Provenance entities (e.g., samples, donors) and for downstream files (e.g., by moving data to Globus public access endpoints, unless it is protected sequence data).
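The upstream part of that status change amounts to a traversal of the provenance graph. The sketch below shows the idea on a plain dictionary rather than on the Neo4j database; the function `publish`, the `parents` mapping, and the status strings are hypothetical stand-ins for this example, not the HIVE's actual code.

```python
def publish(dataset_id, parents, status):
    """Mark a dataset and all of its upstream entities as public.

    parents maps an entity ID to the IDs it was derived from.
    status maps an entity ID to its current status and is updated in place.
    Returns the IDs whose status changed, in visiting order.
    """
    changed = []
    stack = [dataset_id]
    seen = set()
    while stack:
        entity_id = stack.pop()
        if entity_id in seen:
            continue
        seen.add(entity_id)
        # Entities shared with an already published dataset are left untouched.
        if status.get(entity_id) != "public":
            status[entity_id] = "public"
            changed.append(entity_id)
        stack.extend(parents.get(entity_id, []))
    return changed
```

Because a donor or sample can sit upstream of several datasets, the traversal skips entities that are already public instead of reporting them as changed again.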
-
We currently manually capture dataset submission and publication efforts, including active datasets’ status, target month of publication, and future datasets. We comprehensively track donor, sample, dataset, spatial, pipeline, visualization, antibody, security (identifiable sequencing), protocol, documentation, metadata and QA/QC standards compliance, and data contributors.
-
Internally, we regularly update this data in a spreadsheet and use our Sankey diagram tool to view HuBMAP’s current and planned state of dataset publication (Figure).
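A Sankey diagram is drawn from weighted links between categories, so the spreadsheet rows first have to be aggregated. The sketch below shows one way that aggregation could work; the function `sankey_links` and the column names used with it are hypothetical, and this is not the HuBMAP Sankey tool itself.

```python
from collections import Counter


def sankey_links(rows, columns):
    """Count how many rows flow between each pair of adjacent column values.

    rows is a list of dicts (one per dataset); columns is the left-to-right
    order of the diagram's stages. Returns a Counter keyed by (source, target).
    """
    links = Counter()
    for row in rows:
        for source_col, target_col in zip(columns, columns[1:]):
            links[(row[source_col], row[target_col])] += 1
    return links
```

For example, calling `sankey_links(rows, ["provider", "assay", "status"])` on rows with those three (assumed) keys would yield one weighted link per provider-to-assay and assay-to-status pair.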
HuBMAP Data Portal
-
The HuBMAP Data Portal UI is principally a Flask app, using React on the front end and primarily Elasticsearch on the back end, wrapped in a Docker container for deployment using Docker Compose. It is deployed at portal.hubmapconsortium.org. Scientists access summary data, visualizations, and data downloads by dataset on the Portal. Globus facilitates file transfer for local use of data.
-
The HuBMAP Portal Style Guide is used for the Data Portal and other HuBMAP sites.
-
While HuBMAP published datasets are openly accessible, HuBMAP consortium level access is managed via the HuBMAP profile system and uses Globus authentication for credential checking.
-
The Vitessce Viewer is a visual integration tool for exploration of spatial single cell experiments. Its modular design is optimized for scalable, linked visualizations that support the spatial and non-spatial representation of tissue-, cell- and molecule-level data. Vitessce integrates the Viv library to visualize highly multiplexed, high-resolution, high-bit depth image data directly from OME-TIFF files and Bio-Formats-compatible Zarr stores.
-
The data can be queried through several mechanisms: General Search (Elasticsearch), Query tools and Facets (integrated in the UI), and Semantic query (not yet available to Portal users), including by Gene, Cell, Spatial, and Multidimensional. The CCF EUI provides a detailed look at different parts of the human body, including the heart, kidney, and spleen, and supports spatial data query.
-
HuBMAP’s APIs support registration and loading of data that complies with HuBMAP data standards and ingest formats, as well as core functions underpinning the Portal UI itself. Data Search: the Search API is a thin wrapper around Elasticsearch that handles data indexing and reindexing into the backend Elasticsearch. Identity system: the uuid-api service is a RESTful web service used to create and query UUIDs used across HuBMAP.
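Since the Search API wraps Elasticsearch, a client request ultimately becomes an Elasticsearch query body. The sketch below builds such a body using standard Elasticsearch query DSL clauses (`bool`, `multi_match`, `term`, `match_all`). The helper `build_search_body` and the field name `entity_type` are assumptions for illustration; the fields actually indexed and the Search API's own request format are not described here.

```python
def build_search_body(text, filters=None, size=10):
    """Build an Elasticsearch query body from free text and exact-match filters.

    filters is a dict of field name to required value (field names here are
    hypothetical). No request is sent; this only constructs the JSON-ready dict.
    """
    # Fall back to match_all when no search text is given.
    must = [{"multi_match": {"query": text}}] if text else [{"match_all": {}}]
    filter_clauses = [
        {"term": {field: value}} for field, value in sorted((filters or {}).items())
    ]
    return {"size": size, "query": {"bool": {"must": must, "filter": filter_clauses}}}
```

Putting exact-match conditions in `filter` rather than `must` keeps them out of relevance scoring, which is the usual way to implement facets alongside free-text search.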
-
The HuBMAP Portal provides access to cutting-edge tools to help analyze the data. The ASCT+B Reporter includes a partonomy tree that presents relationships between anatomical structures and substructures, combined with their respective cell types and biomarkers via a bimodal network. Azimuth is a Shiny app demonstrating a query-reference mapping algorithm for single-cell data. The Cells API is available as a backend, a JS client, and a Python client. Other tools are coming, such as the Knowledge Graph and associated Schema for Ontology ingest & API services and application and biomedical ontologies.
-
The HIVE monitors HuBMAP Portal activity, including usage, downloads, and limited demographic factors, using Monitoring services. See also the Current State FAIRness Assessment.
HuBMAP Governance & Due Diligence
-
HuBMAP consortium policies are located on the consortium website and cover associate membership, consent, data sharing, data use, material transfer, publication, and NIH-applicable Genomic Data Sharing with HuBMAP data.
-
We use three categories of permissions for securing access to HuBMAP data: protected, consortium, and public.
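One simple way to model these three categories is as ordered levels. The sketch below assumes the levels are nested, so that clearance for protected data also grants access to consortium and public data; that nesting, the `Access` enum, and the `can_read` helper are assumptions for this example and are not stated by the policy above.

```python
from enum import IntEnum


class Access(IntEnum):
    """Access categories, ordered from least to most restricted (assumed nesting)."""

    PUBLIC = 0
    CONSORTIUM = 1
    PROTECTED = 2


def can_read(user_clearance, data_level):
    """Return True if a user with the given clearance may read data at data_level."""
    return user_clearance >= data_level
```

For instance, `can_read(Access.CONSORTIUM, Access.PUBLIC)` is true while `can_read(Access.CONSORTIUM, Access.PROTECTED)` is false under this model.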
-
Consortium-level access is driven by an integrated user registration tool that collects and associates credentials among Members’ institutions, the Globus file transfer service, GitHub code repositories, Google Drive document storage, and other services presented via the WordPress-based HuBMAP consortium website.
-
Any identifiable sequencing data is made accessible via dbGaP within 6 months of initial publication on the HuBMAP Portal, in order to ensure secure access to this sensitive data. For details, see the Sequencing data dbGaP submission tool.
-
Data providers and the HIVE are responsible for secure loading and storage of identifiable sequencing data. Generally, the data providers manage administrative interaction with dbGaP, and the HIVE (IEC) manages technical interaction and data loading of identifiable sequencing datasets.
-
We also automatically collect and display HuBMAP-generated and referenced publications using Google Scholar.