Lunaris is an app to extract and aggregate data from multiple files, with particular focus on location-sorted block-gzipped tabix-indexed files, which are typically too large to fit into memory.
Lunaris can be run in batch mode or as a web server. In batch mode, it reads a request from a file (or Google Cloud Storage object) and writes the output to one or more files (or Google Cloud Storage objects). As a web service, the request is submitted via HTTP POST, and the output is sent back as the response. By default, the server expects requests at /lunaris/query and offers a WebUI at /lunaris/lunaris.html to edit requests (with examples) and view responses.
A typical use-case of batch mode is extracting data from files stored in a Terra workspace, but any files on local disk or in Google Cloud Storage can be used.
Because these files are too large to fit into memory, Lunaris relies on stream processing: it traverses all location-sorted files in parallel, pointing at the same genomic location in each file at any given time, and aggregates the data from the different files as it goes. Lunaris also accepts unsorted support files, as long as they are small enough to be loaded comfortably into memory.
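As a rough illustration of the web-service mode, the following Python sketch prepares an HTTP POST for the default /lunaris/query endpoint. The helper name build_query_request, the server address in the usage note and the Content-Type header are assumptions made for this example, not documented Lunaris behavior.

```python
import json
import urllib.request


def build_query_request(base_url, request):
    """Build (but do not send) an HTTP POST carrying a Lunaris request as JSON."""
    body = json.dumps(request).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/lunaris/query",
        data=body,
        # Assumption: the server accepts a JSON body labeled as application/json.
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

With a server running at a (hypothetical) address such as http://localhost:8080, passing the result to urllib.request.urlopen would send the request and return the response.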
Lunaris Version 1.2.1 (c) 2020 Broad Institute
Usage: lunaris batch|server ...
Lunaris is a stream processor to extract, combine and munge genomics-related data from
location-sorted block-gzipped tabix-indexed files.
Files can be local, or on Google Cloud Storage, including on Terra.
-h, --help Show help message
-v, --version Show version of this program
Subcommand: batch
Loads a request from file or Google Cloud Storage object and executes it.
-r, --request-file <arg> Location of file containing request in JSON.
-h, --help Show help message
Subcommand: server
Web service: accepts HTTP POST requests at http://<host>/lunaris/query
and offers a WebUI at http://<host>/lunaris/lunaris.html
-h, --host <arg> Host to bind to, e.g. localhost, 0.0.0.0
-p, --port <arg> Port to bind to, e.g. 80, 8080
--help Show help message
For more or more updated information, check https://github.com/broadinstitute/lunaris
Lunaris accepts a JSON file as input. Output file(s) can be specified to be TSV or JSON.
The input file contains a JSON object with properties id, regions and recipe.
The id field is an arbitrary String chosen by the requester to identify the request.
The regions field contains an object with a key for every chromosome or sequence, each pointing to an array of regions, each of which has a begin and an end.
The recipe field points to an object whose properties represent the steps necessary to produce the desired output.
Status: working
A minimal request: read data from one block-gzipped, location-sorted, tabix-indexed file and export the data as tab-separated values (TSV). It contains two steps, read and write. Each step specifies a tool to describe what is being done in this step and some arguments.
The read step uses the IndexedRecordReader tool to read from a location-sorted block-gzipped tabix-indexed file into a stream of objects. Required arguments are the file and idField, the name of the id column. If the index file is omitted, it is assumed to be the data file plus the suffix ".tbi".
The write step uses the TSVWriter tool to write a stream of objects to a TSV file. The from argument contains the name of the step that produces the stream of objects to be written. file is the file to be written to.
A file in Lunaris can be a local file, or a file stored on Google Cloud Storage.
{
"id" : "requestMinimalTsv",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"read" : {
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/variants.tsv.gz",
"idField" : "varId",
"tool" : "IndexedRecordReader"
},
"write" : {
"from" : "read",
"file" : "examples/responses/responseMinimalTsv.tsv",
"tool" : "TSVWriter"
}
}
}
#varId chromosome position alt maf amino_acids ...
1:10583:G:A 1 10583 A 0.15852650000000001 ...
#varId chromosome position alt maf amino_acids ...
1:103905:A:G 1 103906 G 0.066208 ...
1:106544:C:G 1 106545 G 0.1955385 ...
1:106699:A:G 1 106700 G ...
...
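A request like the minimal one above can also be assembled programmatically. The sketch below builds the same two-step recipe and saves it as JSON, so that it could be passed to batch mode with --request-file. The function names and their parameters are hypothetical helpers for this example.

```python
import json


def minimal_request(data_file, id_field, output_file):
    """Build a two-step recipe: read an indexed file, write it out as TSV."""
    return {
        "id": "requestMinimalTsv",
        "regions": {"1": [{"begin": 100000, "end": 200000}]},
        "recipe": {
            "read": {"file": data_file, "idField": id_field, "tool": "IndexedRecordReader"},
            "write": {"from": "read", "file": output_file, "tool": "TSVWriter"},
        },
    }


def save_request(request, path):
    """Write the request as JSON to the given path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(request, handle, indent=2)
```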
Status: working
Like the previous example, but instead of writing the output file as tab-separated values (TSV), we write it in JavaScript Object Notation (JSON).
This is achieved by using, in the write step, the JSONWriter tool (instead of the TSVWriter).
{
"id" : "requestMinimalJson",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"read" : {
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/variants.tsv.gz",
"idField" : "varId",
"tool" : "IndexedRecordReader"
},
"write" : {
"from" : "read",
"file" : "examples/responses/responseMinimalJson.json",
"tool" : "JSONWriter"
}
}
}
#varId chromosome position alt maf amino_acids ...
1:10583:G:A 1 10583 A 0.15852650000000001 ...
{
"1:103905:A:G" : {
"varId" : "1:103905:A:G",
"chromosome" : "1",
"position" : 103906,
"alt" : "G",
"maf" : "0.066208",
"amino_acids" : "",
...
},
"1:106544:C:G" :
...
}
Status: working
The previous example only extracted region 1:100000-200000. Here is an example that extracts data from four regions: 1:100000-200000, 1:300000-400000, X:0-100000, X:400000-500000.
{
"id" : "requestMoreRegions",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
},
{
"begin" : 300000,
"end" : 400000
}
],
"X" : [
{
"begin" : 0,
"end" : 100000
},
{
"begin" : 400000,
"end" : 500000
}
]
},
"recipe" : {
"read" : {
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/variants.tsv.gz",
"idField" : "varId",
"tool" : "IndexedRecordReader"
},
"write" : {
"from" : "read",
"file" : "examples/responses/responseMoreRegions.json",
"tool" : "JSONWriter"
}
}
}
#varId chromosome position alt maf amino_acids ...
1:10583:G:A 1 10583 A 0.15852650000000001 ...
{
"1:103905:A:G" : {
"varId" : "1:103905:A:G",
"chromosome" : "1",
"position" : 103906,
"alt" : "G",
"maf" : "0.066208",
"amino_acids" : "",
...
},
"1:106544:C:G" :
...
}
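When there are many regions, writing the regions object by hand gets tedious. The following sketch converts region strings in the notation used above (for example 1:100000-200000) into the regions object of a request. The function name regions_from_strings is an assumption for this example and not part of Lunaris.

```python
from collections import defaultdict


def regions_from_strings(specs):
    """Turn strings like "1:100000-200000" into a regions object keyed by chromosome."""
    regions = defaultdict(list)
    for spec in specs:
        # Split on the last colon, in case a sequence name itself contains one.
        chromosome, span = spec.rsplit(":", 1)
        begin, end = span.split("-")
        regions[chromosome].append({"begin": int(begin), "end": int(end)})
    return dict(regions)
```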
Status: working
In the previous examples, all data fields were turned into JSON strings except for position. Lunaris knows from the index which is the position field (or, alternatively, the start and end fields), and it knows that these fields are integers. To assign other types, we use the types argument of the IndexedRecordReader. Lunaris types include "String", "File", "Int", "Float", "Bool" as well as arrays and objects.
{
"id" : "requestTypingFields",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"read" : {
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/variants.tsv.gz",
"idField" : "varId",
"types" : {
"maf" : "Float"
},
"tool" : "IndexedRecordReader"
},
"write" : {
"from" : "read",
"file" : "examples/responses/responseTypingFields.json",
"tool" : "JSONWriter"
}
}
}
#varId chromosome position alt maf amino_acids ...
1:10583:G:A 1 10583 A 0.15852650000000001 ...
{
"1:175810:T:A" : {
"varId" : "1:175810:T:A",
"chromosome" : "1",
"position" : 175810,
"alt" : "A",
"maf" : 0.654
},
...
}
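To make the effect of the types argument concrete, here is a small sketch of how raw TSV strings could be converted according to a types mapping. It only covers the scalar types and is not Lunaris's own implementation; in particular, treating the text "true" as the only true Bool value is an assumption.

```python
# Converters for scalar type names; arrays and objects are left out of this sketch.
CONVERTERS = {
    "String": str,
    "Int": int,
    "Float": float,
    # Assumption: a Bool is written as "true" or "false" in the data file.
    "Bool": lambda text: text.strip().lower() == "true",
}


def apply_types(raw_record, types):
    """Convert raw string values; fields without a declared type stay strings."""
    return {
        key: CONVERTERS[types.get(key, "String")](value)
        for key, value in raw_record.items()
    }
```

For example, applying {"position": "Int", "maf": "Float"} to a record with "maf" given as "0.654" yields the number 0.654 instead of a string.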
Status: working
The following example filters records, only retaining records where the field "phenotype" has the String "T2D" as value.
IndexedRecordReader reads the data, RecordsFilter filters, and the TSVWriter writes it.
{
"id" : "requestFilterTsv",
"regions" : {
"1" : [
{
"begin" : 0,
"end" : 1000000
}
]
},
"recipe" : {
"read" : {
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/associations.tsv.gz",
"idField" : "varId",
"tool" : "IndexedRecordReader"
},
"filter" : {
"from" : "read",
"field" : "phenotype",
"stringValue" : "T2D",
"tool" : "RecordsFilter"
},
"write" : {
"from" : "filter",
"file" : "responseFilterTsv.tsv",
"tool" : "TSVWriter"
}
}
}
TODO
TODO
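Conceptually, the filter step keeps a record only if the named field holds the given string. A minimal sketch of that behavior on a stream of records, assuming exact string equality (the function name and the sample data in the usage note are hypothetical):

```python
def filter_records(records, field, string_value):
    """Lazily yield only the records whose field equals string_value."""
    for record in records:
        if record.get(field) == string_value:
            yield record
```

Because this is a generator, records are processed one at a time, so a stream does not need to fit into memory to be filtered.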
Status: in progress
In the previous examples, all fields were included, but it is possible to pick which fields are included using the fields argument. However, id, chromosome, position, start and end fields are never discarded.
{
"id" : "exampleTyping",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"read" : {
"tool" : "IndexedRecordReader",
"file" : "gs://yeoldegooglebucket/variants.tsv.gz",
"idField" : "varId",
"fields" : ["maf", "p_value"],
"types" : {
"maf" : "Float",
"p_value" : "Float"
}
},
"write" : {
"tool" : "JSONWriter",
"from" : "read",
"file" : "sweetLocalFile.json"
}
}
}
#varId chrom pos maf p_value is_coding
1:175810:T:A 1 175810 0.654 0.0034 false
{
"1:175810:T:A" : {
"varId" : "1:175810:T:A",
"chrom" : "1",
"pos" : 175810,
"maf" : 0.654,
"p_value" : 0.0034
},
...
}
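The selection rule can be sketched as follows: the requested fields are kept, and the core fields are kept regardless. The function select_fields and its core_fields parameter are assumptions for this example; in Lunaris the core fields are determined from idField and the index.

```python
def select_fields(record, fields, core_fields):
    """Keep the requested fields plus the core fields, preserving record order."""
    wanted = set(fields) | set(core_fields)
    return {key: value for key, value in record.items() if key in wanted}
```

Applied to the sample record above with fields maf and p_value, the is_coding field is dropped, while varId, chrom and pos are retained.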
Status: in progress
Objects from multiple streams can be joined into a single object to obtain a new stream of joined objects. This, however, only works under the following conditions:
Each stream contains no more than one object for any given id.
Objects with the same id must also have the same chromosome and position (or start and end) across all streams to be joined.
For every id, all objects with that id are combined into a new object with the same id and all the fields of the original objects.
{
"id" : "exampleTyping",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"readVariants" : {
"tool" : "IndexedRecordReader",
"file" : "gs://yeoldegooglebucket/variants.tsv.gz",
"idField" : "varId",
"types" : {
"maf" : "Float",
"is_coding" : "Bool"
}
},
"readAssociations" : {
"tool" : "IndexedRecordReader",
"file" : "gs://yeoldegooglebucket/associations.tsv.gz",
"idField" : "varId",
"types" : {
"p_value" : "Float"
}
},
"join" : {
"tool" : "ObjectJoiner",
"from" : [ "readVariants", "readAssociations" ]
},
"write" : {
"tool" : "JSONWriter",
"from" : "join",
"file" : "sweetLocalFile.json"
}
}
}
#varId chrom pos maf is_coding
1:175810:T:A 1 175810 0.654 false
#varId chrom pos p_value
1:175810:T:A 1 175810 0.0034
{
"1:175810:T:A" : {
"varId" : "1:175810:T:A",
"chrom" : "1",
"pos" : 175810,
"maf" : 0.654,
"p_value" : 0.0034,
"is_coding" : false
},
...
}
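The joining rule can be illustrated with a small in-memory sketch. Unlike Lunaris, which joins streams as it traverses them, this version collects everything into a dictionary, so it is only suitable for small inputs. It also assumes that an id present in only some of the streams still produces an output object, which the description above does not specify.

```python
def join_objects(streams, id_field):
    """Combine objects with the same id from several streams into one object per id."""
    joined = {}
    for stream in streams:
        for obj in stream:
            # Fields shared between streams (id, chromosome, position) are
            # expected to hold equal values, so overwriting them is harmless.
            joined.setdefault(obj[id_field], {}).update(obj)
    return joined
```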
Todo: More examples
Status: in progress
A realistic example of how T2D portal data would be extracted.
{
"id" : "examplePortal",
"regions" : {
"1" : [
{
"begin" : 100000,
"end" : 200000
}
]
},
"recipe" : {
"readVariants" : {
"tool" : "IndexedRecordReader",
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/variants.tsv.gz",
"idField" : "varId",
"fields" : ["varId", "chromosome", "position", "alt", "maf"],
"types" : {
"varId" : "String",
"chromosome" : "String",
"position" : "Int",
"maf" : "Float"
}
},
"readAssociations" : {
"tool" : "IndexedRecordReader",
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/associations.tsv.gz",
"idFields" : ["varId", "phenotype"],
"fields" :
[
"varId", "chromosome", "position", "phenotype", "pValue", "beta",
"stdErr", "zScore", "n"
],
"types" : {
"varId" : "String",
"chromosome" : "String",
"position" : "Int",
"phenotype" : "String",
"pValue" : "Float",
"beta" : "Float",
"stdErr" : "Float",
"zScore" : "Float",
"n" : "Int"
}
},
"groupAssociations": {
"tool" : "GroupAsObject",
"from" : "readAssociations",
"subIdField" : "phenotype",
"groupedFields" : ["pValue", "beta", "stdErr", "zScore", "n"],
"path" : "phenotypes"
},
"readPosteriors" : {
"tool" : "IndexedRecordReader",
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/posteriors.tsv.gz",
"idFields" : ["varId", "phenotype"],
"fields" :
[
"varId", "chromosome", "position", "phenotype", "probability",
"credible_set_id"
],
"types" : {
"varId" : "String",
"chromosome" : "String",
"position" : "Int",
"phenotype" : "String",
"probability" : "Float",
"credible_set_id" : "String"
}
},
"groupPosteriors": {
"tool" : "GroupAsObject",
"from" : "readPosteriors",
"subIdField" : "phenotype",
"groupedFields" : ["probability", "credible_set_id"],
"path" : "phenotypes"
},
"readRegions" : {
"tool" : "IndexedRecordReader",
"file" : "gs://fc-6fe31e1f-2c36-411c-bf23-60656d621184/data/t2d/regions.tsv.gz",
"idFields" : ["varId", "regionId"],
"fields" : [ "varId", "chromosome", "position", "regionId" ],
"types" : {
"varId" : "String",
"chromosome" : "String",
"position" : "Int",
"regionId" : "Int"
}
},
"readRegionTable" : {
"tool" : "LoadTSVTable",
"idField" : "regionId",
"fields" :
[
"regionId", "method", "annotation", "tissue", "tissue_description"
],
"types" : {
"regionId" : "Int",
"method" : "String",
"annotation" : "String",
"tissue" : "String",
"tissue_description" : "String"
}
},
"joinRegions" : {
"tool" : "JoinWithTable",
"idField" : "regionId",
"path" : "region",
"from" : "readRegions",
"table" : "readRegionTable"
},
"filterRegions" : {
"tool" : "FilterByValues",
"from" : "joinRegions",
"path" : "region/annotation",
"values" : ["GenePrediction"]
},
"groupRegions" : {
"tool" : "GroupAsArray",
"from" : "filterRegions",
"field" : "region",
"path" : "regions"
},
"join" : {
"tool" : "ObjectJoiner",
"from" : [
"readVariants", "groupAssociations", "groupPosteriors", "groupRegions"
]
},
"write" : {
"tool" : "JSONWriter",
"from" : "join"
}
}
}
{
"1:175810:T:A" : {
"varId" : "1:175810:T:A",
"chromosome" : "1",
"position" : 175810,
"maf" : 0.654,
"phenotypes" : {
"T2D" : {
"pValue" : 0.003,
"beta": 3.4,
"stdErr" : 1.2,
"zScore" : 1.2,
"n" : 1000,
"probability" : 0.45,
"credible_set_id" : "chr1:185921-4829384"
}
},
"regions" : [
{
"method" : "ABC",
"annotation" : "GenePrediction",
"tissue" : "CL:0000127",
"tissue_description" : "astrocyte"
},
{
"method" : "ABC",
"annotation" : "GenePrediction",
"tissue" : "CL:0000129",
"tissue_description" : "microglial cell"
}
]
},
...
}
Each location-sorted block-gzipped tabix-indexed file is read as a sorted stream of records. A record in Lunaris is an ordered list of key/value pairs, where the key is a String and the value can be of one of a number of types. Lunaris' types are very similar to those of JSON but make a few more distinctions; for example, Lunaris has an Int type.
Each record has at least four core fields representing the id, the chromosome or sequence, begin and end. These can have any keys. begin and end may be the same field, in which case the actual end is the begin plus one, in line with Tabix specs. id and chromosome are of type String, while begin and end are of type Int.
The stream of records is sorted by genomic location, which means it is sorted by chromosome, begin and end, in this order. Multiple records in a stream can have the same id, but any records with the same id must also have the same chromosome, begin and end. Some operations, such as joining streams or writing records to JSON, require that a stream has only one record per id. This can be ensured by grouping, which collapses all records with the same id into one. It can also be achieved by filtering, depending on the data.
Typically, the id is a variant id, but it does not need to be.
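To illustrate what traversing several location-sorted streams together means, the sketch below merges already sorted record streams into one stream ordered by chromosome, begin and end, reading each input lazily. The field names and the explicit chromosome_order argument are assumptions for this example; this is not Lunaris's internal implementation.

```python
import heapq


def merge_sorted_streams(streams, chromosome_order):
    """Lazily merge location-sorted record streams into one sorted stream."""
    rank = {name: index for index, name in enumerate(chromosome_order)}

    def location(record):
        # Sort key: chromosome (by its rank in the given order), then begin, then end.
        return (rank[record["chromosome"]], record["begin"], record["end"])

    return heapq.merge(*streams, key=location)
```

Since heapq.merge only looks at the next record of each input, memory use stays small no matter how large the inputs are.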
The recipe is a JSON object with each property representing a step consisting of a tool and a set of arguments. Some of these arguments are references to other steps, which means this step consumes the output of the other step, typically a stream of records.
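Because steps refer to each other by name, a recipe forms a dependency graph. The sketch below derives an order in which the steps could be evaluated, using graphlib from the Python standard library (Python 3.9 or later). Treating from and table as the arguments that hold references is an assumption based on the examples in this document.

```python
from graphlib import TopologicalSorter

# Assumption: these are the arguments that refer to other steps.
REFERENCE_ARGS = ("from", "table")


def step_order(recipe):
    """Return the step names so that every step comes after the steps it consumes."""
    graph = {}
    for name, step in recipe.items():
        dependencies = []
        for arg in REFERENCE_ARGS:
            value = step.get(arg, [])
            # A reference is either a single step name or a list of step names.
            dependencies.extend([value] if isinstance(value, str) else value)
        graph[name] = dependencies
    return list(TopologicalSorter(graph).static_order())
```

A recipe with circular references makes static_order raise graphlib.CycleError, which is one way such a mistake could be detected early.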
This tool reads a stream of records from a given file, which can be a local file, or an object on Google Cloud Storage.
file (required): The location of the local file or Google Cloud Storage object to read from.
index (optional): The location of the index. If not given, the index is assumed to be the file plus the ".tbi" suffix.
idField (required): the name of the column containing the record id.
fields (optional): the fields to be read. If missing, read all fields. Core fields are always included, regardless of this argument: idField is given separately, and the chromosome, begin and end are taken from the index file.
types (optional): the types of the fields. The types of core fields are fixed to String for id and chromosome and Int for begin and end, and cannot be changed. All fields not explicitly typed are treated as String.
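The rules for required arguments and the default index location can be sketched as a small validation helper. The function name resolve_reader_args is hypothetical; only the argument names and the ".tbi" default come from the description above.

```python
def resolve_reader_args(step):
    """Check required arguments and fill in the default index location."""
    for required in ("file", "idField"):
        if required not in step:
            raise ValueError(f"IndexedRecordReader step is missing '{required}'")
    resolved = dict(step)
    # If no index is given, it is assumed to be the data file plus ".tbi".
    resolved.setdefault("index", step["file"] + ".tbi")
    return resolved
```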
Merges all records with the same id in a stream into a single record. Core fields, which are assumed the same if the id is the same, are kept intact. Other fields are moved to sub-objects according to the specified path followed by the value of the subIdField.
from (required): reference to the step that provides the stream of records as input.
subIdField (required): field whose values are used as the keys of the sub-objects containing the grouped fields.
groupedFields (required): fields to be grouped.
path: path to the object that holds, for each value of subIdField, the sub-object containing the grouped fields.
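The grouping described above can be sketched in a few lines. This version collects all records in memory and treats path as a single key, both of which are simplifications for illustration; the function name and the id_field parameter are assumptions, and the sample values in the usage note are made up.

```python
def group_as_object(records, id_field, sub_id_field, grouped_fields, path):
    """Merge records with the same id, moving grouped fields into sub-objects."""
    grouped = {}
    for record in records:
        # The first record of an id provides the fields that stay at the top level.
        target = grouped.setdefault(
            record[id_field],
            {
                key: value
                for key, value in record.items()
                if key != sub_id_field and key not in grouped_fields
            },
        )
        sub_object = target.setdefault(path, {}).setdefault(record[sub_id_field], {})
        for field in grouped_fields:
            sub_object[field] = record[field]
    return list(grouped.values())
```

For two records of the same variant with phenotypes T2D and BMI, the result is a single record whose phenotypes object has one sub-object per phenotype. Since records with the same id share a location and the stream is sorted by location, a streaming variant could emit each merged record as soon as the location changes.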
TODO: document all tools