README sqlslurp

This script can ingest an XML-dump of an EGA submitter box (as produced by its sister, ega-xml-dl) into an SQLite database for easier querying.

The SQL tables follow the EGA naming scheme where possible, with two notable exceptions:

given the many "IDs" and "accessions", the IDs used for crosslinking are named after their prefixes, which this author finds less-confusing and seems to be the only terminology that is consistent throughout the entire ecosystem.
- EGA* for the ID's used for the public-facing identifiers we all know and love.
- ER* for the internal representation, which seems to be a historical inheritance from the ENA/Webin days.
- The star/* is one of:
  
  Datatype EGA* ER* Note
  
  Study S P from Study and Project, respectively
  
  Sample N S Careful to not confuse ERS with Study (ERP)
  
  Run R R
  
  Analyses Z Z
  
  Dataset D D
IDs referencing another datatype are prefixed with XREF_

Usage

./sqlslurp.py ega-box-NNN

The resulting SQLite database is placed in the top level of the corresponding XML directory: $HOME/ega-xml/ega-box-NNN/box-contents.sqlite.

example queries

Overview of all the "runs" in a Study

"Runs" are usually fastq files, but can also be (usually unaligned) bam-files, as well as contain some other more exotic filetypes. The guiding principle is that Runs should be "not interpreted", so free of human/processing biases/decisions.

Note that this query completely ignores any "analyses" that may or may not be part of the same dataset.

-- Details for fastq files (Runs) associated with a Study.
SELECT
  s.EGAS,
  n.submitter_id,
  n.EGAN,
  r.EGAR,
  r.forward_filename,
  r.forward_md5,
  r.reverse_filename,
  r.reverse_md5
FROM
  studies s
  LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
  LEFT JOIN samples n     ON x.XREF_ERS = n.ERS
  LEFT JOIN runs r        ON r.XREF_ERX = x.ERX
WHERE
  s.EGAS = 'EGAS00001003953'
ORDER BY
  n.EGAN,
  r.EGAR;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README-sqlslurp.md

README-sqlslurp.md

README sqlslurp

Usage

example queries

Overview of all the "runs" in a Study

Datatype	EGA*	ER*	Note
Study	`S`	`P`	from Study and Project, respectively
Sample	`N`	`S`	Careful to not confuse `ERS` with Study (`ERP`)
Run	`R`	`R`
Analyses	`Z`	`Z`
Dataset	`D`	`D`

Files

README-sqlslurp.md

Latest commit

History

README-sqlslurp.md

File metadata and controls

README sqlslurp

Usage

example queries

Overview of all the "runs" in a Study