Index Pipelines
An index pipeline is the process of ingesting data into an OpenCGA-Storage backend. We define a general pipeline that is used and extended by each of the supported formats. The pipeline can be extended with additional enrichment steps, which depend heavily on the file format. At the end, the data can be filtered for visualization or used as input for analysis.
This concept is also represented in Catalog, which helps track the indexing status of each file.
The indexing pipeline consists of three steps: transforming the raw input data into an intermediate format, loading it into the selected database (which depends on the implementation), and enriching the loaded data by calculating statistics or adding extra information such as annotation.
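The ordering of the three steps, together with the status tracking mentioned above, can be sketched as follows. This is only an illustration: `IndexStatus`, its values and `run_pipeline` are hypothetical names invented for this example, not the actual Catalog or OpenCGA-Storage API.

```python
from enum import Enum
from typing import Callable, List, Tuple


class IndexStatus(Enum):
    # Hypothetical status values; the real Catalog statuses may differ.
    NONE = "NONE"
    TRANSFORMED = "TRANSFORMED"
    LOADED = "LOADED"
    ENRICHED = "ENRICHED"
    ERROR = "ERROR"


def run_pipeline(
    steps: List[Tuple[IndexStatus, Callable[[], None]]]
) -> List[IndexStatus]:
    """Run the steps in order and return the history of statuses reached.

    The pipeline stops at the first failing step, so a later step never
    runs on the output of a step that did not finish.
    """
    history = [IndexStatus.NONE]
    for status_on_success, step in steps:
        try:
            step()
        except Exception:
            history.append(IndexStatus.ERROR)
            break
        history.append(status_on_success)
    return history
```

Each entry pairs a step with the status recorded when it succeeds, so a caller would pass the transform, load and enrichment callables in that order and could store the last element of the returned history as the file's current status.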
- Transform
The first step, and one of the most important, is the transformation. At this point, the pipeline ensures that the input file is valid and can be loaded into the database. The input file is read and converted into the OpenCB models, defined in Biodata. See Data Models for more info.
Depending on the input data, the process will be more or less complex. At the end, the file is written to disk using a serialization schema such as Avro or Protobuf.
The next steps cannot start until the transformation has completely finished, which avoids loading invalid data. Only a correctly transformed file is guaranteed to be loadable, so only then can the next steps take place.
This step is shared between all the storage engine implementations of the same bioformat, so the result should be valid for any implementation.
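A minimal sketch of such a transform step is shown below. It makes several simplifying assumptions that are not part of OpenCGA: the input is a four-column tab-separated file, the model is a plain dictionary rather than a Biodata class, and JSON Lines stands in for Avro or Protobuf. The function names are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Iterable


def parse_line(line: str) -> Dict[str, object]:
    """Validate one tab-separated line and convert it into a dict model.

    The four-column layout (chromosome, position, reference, alternate)
    is a simplification chosen for this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    chrom, pos, ref, alt = fields
    if not pos.isdigit() or int(pos) < 1:
        raise ValueError(f"invalid position: {pos!r}")
    return {"chromosome": chrom, "start": int(pos),
            "reference": ref, "alternate": alt}


def transform(lines: Iterable[str], output_path: str) -> int:
    """Validate and convert every line, then publish the intermediate file.

    Records go to a temporary file that is renamed to output_path only
    after the whole input has been processed, so a failed transformation
    never leaves a file for the load step to pick up.
    """
    directory = os.path.dirname(os.path.abspath(output_path))
    count = 0
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out:
            for line in lines:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank lines and header/comment lines
                out.write(json.dumps(parse_line(line)) + "\n")
                count += 1
        os.replace(tmp_path, output_path)
    except Exception:
        os.remove(tmp_path)
        raise
    return count
```

Writing to a temporary file and renaming it at the end is one way to enforce the rule that later steps only ever see a completely transformed file; the output does not depend on any particular storage engine.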
- Load
This step is intended to be as fast as possible, to avoid unnecessary database downtime caused by the workload. For this reason, all conversion and validation operations are performed in the previous step.
Most storage engines will not load the OpenCB models directly, so some additional engine-dependent transformations are still expected here.
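The load step described above could look roughly like the following sketch. It assumes the intermediate file holds one JSON record per line (a stand-in for Avro or Protobuf) and that the storage engine exposes some bulk-insert callable; `to_engine_document`, `insert_many` and the `_id` key format are hypothetical and chosen only for illustration.

```python
import json
from typing import Callable, Iterable, Iterator, List


def batches(items: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Group items into lists of at most `size` elements."""
    batch: List[dict] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def to_engine_document(record: dict) -> dict:
    """Hypothetical engine-specific conversion: add a compact primary key."""
    key = "{chromosome}:{start}:{reference}:{alternate}".format(**record)
    return {"_id": key, **record}


def load(intermediate_lines: Iterable[str],
         insert_many: Callable[[List[dict]], None],
         batch_size: int = 1000) -> int:
    """Read already-validated records and insert them in batches.

    No validation happens here; the only per-record work is the
    engine-specific conversion.
    """
    records = (json.loads(line) for line in intermediate_lines if line.strip())
    loaded = 0
    for batch in batches(records, batch_size):
        insert_many([to_engine_document(r) for r in batch])
        loaded += len(batch)
    return loaded
```

Batching keeps the number of round trips to the database low, and because validation already happened during the transformation, the loop does nothing beyond deserializing and converting each record.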
Two-step load
- Enrichment
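This step adds more information to the loaded data, for example by calculating statistics or adding annotation.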
- Transform
- Variant Normalization
- Validation
- Load
- Stats calculation
- Annotation
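As an illustration of the stats calculation step listed above, the sketch below counts alleles for a single variant from its sample genotypes. It assumes VCF-style diploid genotype strings and bi-allelic variants; the function name and the output field names are hypothetical and do not reflect the statistics OpenCGA actually stores.

```python
from typing import Dict, List


def allele_stats(genotypes: List[str]) -> Dict[str, float]:
    """Count reference and alternate alleles in genotype strings.

    Genotypes use the VCF-style notation "0/1" or "0|1", where 0 is the
    reference allele and 1 the alternate. Any other allele code,
    including ".", is counted as missing in this bi-allelic sketch.
    """
    ref_count = alt_count = missing = 0
    for genotype in genotypes:
        for allele in genotype.replace("|", "/").split("/"):
            if allele == "0":
                ref_count += 1
            elif allele == "1":
                alt_count += 1
            else:
                missing += 1
    called = ref_count + alt_count
    return {
        "refAlleleCount": ref_count,
        "altAlleleCount": alt_count,
        "missingAlleles": missing,
        # Frequency over called alleles only; 0.0 when nothing was called.
        "altAlleleFreq": alt_count / called if called else 0.0,
    }
```

Because this runs on data that is already loaded, it can be repeated or extended later without transforming or loading the file again.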
OpenCGA is an open source project and it is freely available.