OpenCGA Storage Overview

Overview

OpenCGA Storage is the submodule in charge of ingesting biological big data and providing a query language over it. This storage engine handles the most common file formats for NGS.

The aim of this module is to provide a query engine that serves as the source of data for analysis.

Bioformats

OpenCGA supports an increasing number of biological formats, covering the most common steps of a typical NGS pipeline. Among these formats, we focus on genomic variants because of their complexity and analysis capabilities.

Variants

Study oriented
Cohort definition

More information

Alignments

Coverage calculation
Alignment stats

More information

Other

Sequence (FASTA)
Feature formats (GFF, BED, BigWig, ...)

Multiple Storage Engine Implementations

All storage engines share a common definition. Depending on the requirements and resources of the study, there are multiple implementations using different technologies. This document specifies the core functionality that all implementations must share. Technical details and customization parameters are explained in the plugin-specific section.
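The idea of one shared contract with several interchangeable backends can be sketched as follows. This is a minimal illustration in Python and not OpenCGA's actual API: the names StorageEngine, ListEngine, IndexedEngine, load, query and region_query are all hypothetical, and the two toy backends only stand in for "different technologies".

```python
from abc import ABC, abstractmethod


class StorageEngine(ABC):
    """Hypothetical common contract; not the real OpenCGA interface."""

    @abstractmethod
    def load(self, records):
        """Store an iterable of (chromosome, position, alternate) tuples."""

    @abstractmethod
    def query(self, chromosome, start, end):
        """Return sorted records on `chromosome` with start <= position <= end."""


class ListEngine(StorageEngine):
    """Toy backend: one flat list, scanned in full on every query."""

    def __init__(self):
        self._records = []

    def load(self, records):
        self._records.extend(records)

    def query(self, chromosome, start, end):
        return sorted(r for r in self._records
                      if r[0] == chromosome and start <= r[1] <= end)


class IndexedEngine(StorageEngine):
    """Toy backend: records grouped by chromosome to narrow each query."""

    def __init__(self):
        self._by_chromosome = {}

    def load(self, records):
        for record in records:
            self._by_chromosome.setdefault(record[0], []).append(record)

    def query(self, chromosome, start, end):
        candidates = self._by_chromosome.get(chromosome, [])
        return sorted(r for r in candidates if start <= r[1] <= end)


def region_query(engine, records, chromosome, start, end):
    """Caller code depends only on the shared contract, not on the backend."""
    engine.load(records)
    return engine.query(chromosome, start, end)
```

Because region_query only relies on the abstract methods, either backend can be passed in and the same region query returns the same records. The backend can then be chosen by the resources available rather than by the calling code.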

Pipeline

📄 -(transform)-> intermediate -(load)-> DB -(query)-> result
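The three stages in the diagram above can be sketched as plain functions. This is an assumption-laden toy and not how OpenCGA implements them: the input is a handful of VCF-like tab-separated lines, the intermediate form is JSON lines, the "DB" is a Python list, and the function names transform, load and query simply mirror the arrows in the diagram.

```python
import json


def transform(vcf_lines):
    """Transform step: parse VCF-like lines into intermediate JSON strings.

    Assumes each data line has at least five tab-separated columns
    (CHROM, POS, ID, REF, ALT); malformed lines raise ValueError.
    """
    intermediate = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        intermediate.append(json.dumps({
            "chromosome": chrom,
            "position": int(pos),
            "reference": ref,
            "alternate": alt,
        }))
    return intermediate


def load(intermediate, db):
    """Load step: decode intermediate records and append them to the toy DB."""
    for line in intermediate:
        db.append(json.loads(line))
    return db


def query(db, chromosome, start, end):
    """Query step: return the variants in a region, ordered by position."""
    hits = [v for v in db
            if v["chromosome"] == chromosome and start <= v["position"] <= end]
    return sorted(hits, key=lambda v: v["position"])
```

A call such as query(load(transform(lines), []), "1", 1, 200) chains the stages in the same order as the diagram. Keeping the intermediate form separate means the transform output can be produced once and loaded later, or into a different backend.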

Configuration file

All the configuration needed to work with OpenCGA Storage is centralized in the storage-configuration.yml file, usually located in the configuration folder.

This file contains the configuration for all the storage engines, the database connections, and other CellBase and server information.

The Storage Configuration page gives an extended explanation of the file structure and all the parameters.

REST/gRPC Servers

OpenCGA is an open source project and it is freely available.
