Architecture

This section describes the internal architecture of OpenCGA and the main ideas behind its implementation. If you want to contribute to the project, this is the place to start reading! :)

Data model

We believe it is important to keep the databases mostly unaware of the format in which the data was originally stored. A reference to this format is stored only for specific purposes involving file transfers.

A data model for variants and alignments has been designed and implemented in Java. It explicitly specifies the most commonly used fields, and at the same time provides mechanisms for preserving all the remaining information specific to a given format. For a variant, the specified fields are (among others) chromosome, position, reference and alternatives; if a VCF file is being stored, columns such as INFO are also saved in a key-value data structure.
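
To illustrate the idea, here is a minimal sketch in Java. It is not the actual OpenCGA model: the class name VariantSketch, its accessors and the fromVcfLine helper are hypothetical, and only the first eight VCF columns are handled. The explicitly specified fields are plain members, while ID, QUAL, FILTER and the INFO entries are kept in a key-value map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified variant model; not the actual OpenCGA class. */
final class VariantSketch {
    // Fields specified explicitly, whatever the input format
    private final String chromosome;
    private final int position;
    private final String reference;
    private final List<String> alternatives;

    // Format-specific information, preserved as key-value pairs
    private final Map<String, String> attributes = new LinkedHashMap<>();

    VariantSketch(String chromosome, int position, String reference, List<String> alternatives) {
        this.chromosome = chromosome;
        this.position = position;
        this.reference = reference;
        this.alternatives = Collections.unmodifiableList(new ArrayList<>(alternatives));
    }

    /** Builds a variant from the first eight tab-separated columns of a VCF data line. */
    static VariantSketch fromVcfLine(String line) {
        String[] cols = line.split("\t");
        if (cols.length < 8) {
            throw new IllegalArgumentException("Expected at least 8 columns, found " + cols.length);
        }
        VariantSketch variant = new VariantSketch(
                cols[0], Integer.parseInt(cols[1]), cols[3], Arrays.asList(cols[4].split(",")));
        variant.attributes.put("ID", cols[2]);
        variant.attributes.put("QUAL", cols[5]);
        variant.attributes.put("FILTER", cols[6]);
        if (!cols[7].equals(".")) {
            // INFO holds semicolon-separated KEY=VALUE pairs or bare flags
            for (String entry : cols[7].split(";")) {
                int eq = entry.indexOf('=');
                if (eq < 0) {
                    variant.attributes.put(entry, "");
                } else {
                    variant.attributes.put(entry.substring(0, eq), entry.substring(eq + 1));
                }
            }
        }
        return variant;
    }

    String getChromosome() { return chromosome; }
    int getPosition() { return position; }
    String getReference() { return reference; }
    List<String> getAlternatives() { return alternatives; }
    Map<String, String> getAttributes() { return Collections.unmodifiableMap(attributes); }
}
```

Code that only needs coordinates and alleles can rely on the typed fields, while anything specific to VCF remains reachable through getAttributes().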

Data storage

Like many organizations involved in Big Data projects, we decided to use technologies such as Hadoop and MongoDB. Relatively small datasets can be stored in a Mongo-only backend, while those expecting very large volumes of data can opt for a combination of both technologies, which we decided to name Monbase :)

Mongo-only backend

This schema uses Mongo for storing all the raw data from the file. The Mongo collection establishes the relationships between variants and sources (typically files). It also stores the most commonly accessed variant statistics for each source. When translated to JSON, the schema would be something similar to the following:

{
        "_id" : ObjectId("53207bebe41ada68993d75de"),
        "id" : "rs12345",        
        "chromosome" : "X",
        "start" : 60034,
        "end" : 60034,
        "assembly": "GRCh37",
        "length" : 1,
        "ref" : "C",
        "alt" : "A",
        "type" : "SNV",
        "hgvs" : [
                        {"name": "X:g.64224271C>T", "type": "genomic"},
                        {"name": "ENST00000581797.1:c.-73G>A", "type": "RNA"}
        ],
        "effect" : [
                        {"geneName": "", "so": "regulatory_region_variant"},
                        {"geneName": "BRCA2", "so": "coding_sequence_variant"}
        ],       
        "files" : [
                {
                        "fileId" : "chrX",
                        "studyId" : "1000G",
                        "attributes" : {
                                "QUAL" : "256",
                                "FILTER" : "PASS",
                                "AA" : "...",
                                "AC" : "117",
                                "AF" : "0.05",
                                "AFR_AF" : "0.09",
                                "AMR_AF" : "0.05",
                                "AN" : "2184",
                                "ASN_AF" : "0.07",
                                "AVGPOST" : "0.9664",
                                "ERATE" : "0.0027",
                                "EUR_AF" : "0.02",
                                "LDAF" : "0.0610",
                                "RSQ" : "0.7797",
                                "THETA" : "0.0087",
                                "VT" : "INDEL"
                        },
                        "samples" : [
                                {
                                    "id": "NA20818",
                                    "attributes": {
                                        "GL" : "0.00,-1.20,-22.90",
                                        "GT" : "C|C",
                                        "DS" : "0.000"
                                    }
                                },
                                {
                                    "id": "NA20819",
                                    "attributes": {
                                        "GL" : "0.00,-2.10,-34.30",
                                        "GT" : "C|C",
                                        "DS" : "0.000"
                                    }
                                },
                                {
                                    "id": "NA20826",
                                    "attributes": {
                                        "GL" : "0.00,-0.60,-11.40",
                                        "GT" : "C|A",
                                        "DS" : "0.000"
                                    }
                                }
                        ],
                        "stats" : {
                                "maf" : 0.0535714291036129,
                                "mgf" : 0.002747252816334367,
                                "alleleMaf" : "A",
                                "genotypeMaf" : "A|A",
                                "missAllele" : 0,
                                "missGenotypes" : 0,
                                "genotypeCount" : {
                                        "0/0" : 978,
                                        "0/1" : 111,
                                        "1/1" : 3
                                }
                        }
                }
        ]
}

Storing samples, statistics and variant effects is optional and can be configured.

Monbase (Mongo + HBase) backend

This schema uses HBase for storing raw data and Mongo as an index for the HBase database.

In HBase, variants are stored in a table whose row key has the form chromosome:position. There are two column families: one (data, d) for storing the variant and sample information, and another (info, i) for storing the variant statistics of each file.

Row Key              Column Family: Data (d)              Column Family: Info (i)
chromosome:position  columns in the input file            study statistics
1:123456             d:f1_ref = { A }                     i:f1_stats = { MAF : 0.05, MGF : 0.20, miss : 10 }
                     d:f1_alt = { C }                     i:f2_stats = { MAF : 0.04, MGF : 0.15, miss : 2 }
                     d:f1_NA001 = { GT : A/C, DP : 5 }
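
The naming scheme in the table can be sketched as follows. This is only an illustration using plain Java maps, not the HBase client API; the class MonbaseRowSketch and its methods are hypothetical names. It assumes that column qualifiers have the form fileId_field, as in d:f1_ref and i:f1_stats above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper mirroring the row layout of the table above. */
final class MonbaseRowSketch {
    static final String DATA_FAMILY = "d";  // variant and sample information
    static final String INFO_FAMILY = "i";  // per-file statistics

    /** Row key of the form chromosome:position, e.g. "1:123456". */
    static String rowKey(String chromosome, long position) {
        return chromosome + ":" + position;
    }

    /** Column name of the form family:fileId_field, e.g. "d:f1_ref". */
    static String column(String family, String fileId, String field) {
        return family + ":" + fileId + "_" + field;
    }

    /** Cells written for one variant of one file, keyed by column name. */
    static Map<String, String> cells(String fileId, String ref, String alt,
                                     Map<String, String> sampleData, String stats) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put(column(DATA_FAMILY, fileId, "ref"), ref);
        row.put(column(DATA_FAMILY, fileId, "alt"), alt);
        // One column per sample, named after the sample identifier
        for (Map.Entry<String, String> sample : sampleData.entrySet()) {
            row.put(column(DATA_FAMILY, fileId, sample.getKey()), sample.getValue());
        }
        row.put(column(INFO_FAMILY, fileId, "stats"), stats);
        return row;
    }
}
```

Prefixing every qualifier with the file identifier lets several files share the same row for a position. Since HBase orders rows by byte-wise comparison of their keys, a real implementation might pad the position with zeros to keep rows in genomic order; the sketch keeps the literal form shown in the table.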

Storing pre-calculated statistics allows global information about a file to be retrieved very quickly. Statistics for a subset of the samples in a file, or for combinations of samples from multiple studies, must be calculated on demand and are not stored afterwards; saving all of them would be prohibitively expensive.
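
As a rough idea of what an on-demand calculation involves, the sketch below counts the genotypes of a chosen subset of samples and derives the minor allele frequency from those counts. The class and method names are hypothetical, and it assumes that MAF means the frequency of the least frequent observed allele, with missing alleles (".") ignored; the real implementation may define it differently.

```java
import java.util.Collection;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical on-demand statistics for a subset of samples. */
final class SubsetStatsSketch {
    /** Counts genotypes (e.g. "0/1") for the requested samples only. */
    static Map<String, Integer> countGenotypes(Map<String, String> genotypeBySample,
                                               Collection<String> samples) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sample : samples) {
            String genotype = genotypeBySample.get(sample);
            if (genotype != null) {
                // Phasing is irrelevant for counting, so "0|1" is treated like "0/1"
                counts.merge(genotype.replace('|', '/'), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Counts each allele across all genotypes, skipping missing alleles ("."). */
    static Map<String, Integer> countAlleles(Map<String, Integer> genotypeCounts) {
        Map<String, Integer> alleles = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : genotypeCounts.entrySet()) {
            for (String allele : entry.getKey().split("[/|]")) {
                if (!allele.equals(".")) {
                    alleles.merge(allele, entry.getValue(), Integer::sum);
                }
            }
        }
        return alleles;
    }

    /** Frequency of the least frequent observed allele; 0 if fewer than two alleles are seen. */
    static double maf(Map<String, Integer> genotypeCounts) {
        Map<String, Integer> alleles = countAlleles(genotypeCounts);
        if (alleles.size() < 2) {
            return 0.0;
        }
        int total = 0;
        int min = Integer.MAX_VALUE;
        for (int count : alleles.values()) {
            total += count;
            min = Math.min(min, count);
        }
        return (double) min / total;
    }
}
```

Applied to the genotype counts of the Mongo-only example (978 of 0/0, 111 of 0/1 and 3 of 1/1), this gives 117 / 2184, which agrees with the AC, AN and maf values stored there.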

The Mongo collection establishes the relationships between variants and files. It also stores the most commonly accessed variant statistics for each file. When translated to JSON, the schema would be something similar to the following:

{
  "position" : "1:123456",
  "sources" : [
    {
      "sourceId" : "f1",
      "sourceName" : "file.vcf",
      "ref" : "A",
      "alt" : [ "AT", "TT" ],
      "stats" : {
        "MAF" : 0.05,
        "allele_maf" : "A",
        "missing" : 2,
        "genotype_count" : {
          "0/0" : 12,
          "0/1" : 23,
          "./." : 2
        }
      }
    },
    { "sourceId" : "f2", ...  }
  ]
}

Fault Tolerance

In order to keep memory usage under control, input files are processed and stored in batches of a few thousand variants. Since the process can fail in any batch, a mechanism for fault tolerance and rollback is necessary.

Failures can occur both while inserting raw data in HBase and while creating "indexes" in Mongo. When the insertion of a variant fails, the system retries it (note: this behaviour could be made configurable through an argument).

If the system can't recover from the failure after several attempts, the whole operation must be rolled back. To do so, it is necessary to know which records have been inserted up to that moment, so an entry must be written to a log file every time a record is successfully stored in the database.
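
The mechanism described above could look roughly like the following sketch. Everything in it is an assumption made for illustration: the Store interface stands in for the HBase and Mongo write paths, records are plain strings, the number of attempts is a constructor parameter, and the log is an in-memory list rather than a file.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical loader illustrating batching, retries and rollback. */
final class BatchLoaderSketch {
    /** Stand-in for the HBase or Mongo write path; not a real client API. */
    interface Store {
        void insert(String record) throws Exception;
        void delete(String record);
    }

    private final Store store;
    private final int batchSize;
    private final int maxAttempts;
    // In-memory stand-in for the log file of successfully stored records
    private final List<String> insertedLog = new ArrayList<>();

    BatchLoaderSketch(Store store, int batchSize, int maxAttempts) {
        if (batchSize < 1 || maxAttempts < 1) {
            throw new IllegalArgumentException("batchSize and maxAttempts must be positive");
        }
        this.store = store;
        this.batchSize = batchSize;
        this.maxAttempts = maxAttempts;
    }

    /** Returns true if every record was stored, false if the load was rolled back. */
    boolean load(Iterator<String> source) {
        insertedLog.clear();
        while (source.hasNext()) {
            // Hold only one batch in memory at a time
            List<String> batch = new ArrayList<>(batchSize);
            while (batch.size() < batchSize && source.hasNext()) {
                batch.add(source.next());
            }
            for (String record : batch) {
                if (!insertWithRetries(record)) {
                    rollback();
                    return false;
                }
                insertedLog.add(record);  // logged only after a successful insert
            }
        }
        return true;
    }

    /** Tries the insert up to maxAttempts times. */
    private boolean insertWithRetries(String record) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.insert(record);
                return true;
            } catch (Exception e) {
                // Swallow the failure and try again until attempts run out
            }
        }
        return false;
    }

    /** Deletes every logged record, most recent first. */
    private void rollback() {
        for (int i = insertedLog.size() - 1; i >= 0; i--) {
            store.delete(insertedLog.get(i));
        }
        insertedLog.clear();
    }
}
```

Writing the log entry only after a successful insert means the rollback never tries to delete a record that was not stored. A file-based log would additionally survive a crash of the loading process itself, which this in-memory version does not.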

OpenCGA is an open source project and it is freely available.
