EGA XML downloader

This is a collection of scripts to build a local information store of an EGA submitter account.

ega-xml-dl uses the WebIn API to download the complete metadata of a submitter account. This covers metadata only, not the actual genome data, so expect a handful of megabytes rather than hundreds of gigabytes.

sqlslurp.py will read ('slurp') the most pertinent fields of an XML dump into an SQLite database, for easier querying.

EGA XML downloader

This simple script lets you batch-download the contents of an ENA/EGA submission box at the EBI. It fetches the XML representation of all objects in the submission box.

# for ega-boxes
bash ega-xml-dl ega-box-NNN
# for ENA boxes
bash ega-xml-dl Webin-NNNNN

For more info, please see the detailed ega-xml-dl README.

SQL Slurp

Given the interconnected nature of EGA data, the XML representation makes it very hard to answer questions such as 'which files are in dataset X' or 'which samples have been published under study Y'. SQL, with its support for JOINs, is far more suitable. Thus, sqlslurp.py will take a box's XML dump, and parse the most pertinent fields into an SQLite database:

python3 sqlslurp.py ega-box-NNN
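
To illustrate the general idea of slurping XML into SQLite, here is a minimal sketch. It is not the actual sqlslurp.py code: the XML snippet, its element and attribute names, the accession values, and the reduced `samples` table layout are all simplified assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified sample XML (assumption);
# a real box dump contains many more elements and fields.
SAMPLE_XML = """
<SAMPLE_SET>
  <SAMPLE accession="ERS0000001" alias="patient-A-tumor">
    <TITLE>Tumor sample A</TITLE>
  </SAMPLE>
  <SAMPLE accession="ERS0000002" alias="patient-A-control">
    <TITLE>Control sample A</TITLE>
  </SAMPLE>
</SAMPLE_SET>
"""


def slurp_samples(xml_text, conn):
    """Parse <SAMPLE> elements and store a few fields in a 'samples' table."""
    # Assumed table layout, chosen only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(ERS TEXT PRIMARY KEY, submitter_id TEXT, title TEXT)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (s.get("accession"), s.get("alias"), s.findtext("TITLE"))
        for s in root.iter("SAMPLE")
    ]
    # INSERT OR REPLACE keeps the sketch safe to re-run on the same input.
    conn.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
inserted = slurp_samples(SAMPLE_XML, conn)
sample_rows = conn.execute(
    "SELECT ERS, submitter_id, title FROM samples ORDER BY ERS"
).fetchall()
print(inserted, sample_rows)
```

Once the fields of interest sit in flat tables like this, relations between objects become plain JOINs instead of cross-references between XML files.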

You can then query this database with queries like:

-- Find all samples in a study, as linked via experiments.
SELECT
  s.EGAS,
  n.submitter_id,
  n.EGAN
FROM
  studies s
  LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
  LEFT JOIN samples n     ON x.XREF_ERS = n.ERS
WHERE
  s.EGAS = 'EGAS00001003953'
ORDER BY
  n.EGAN;
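
The same kind of query can also be run from Python with the standard `sqlite3` module. The sketch below is an illustration under stated assumptions: it builds a tiny in-memory database containing only the tables and columns named in the query above, filled with made-up rows, because the file name and full schema of the database that sqlslurp.py produces are not specified here. The helper `samples_in_study` is hypothetical.

```python
import sqlite3

# Toy in-memory database (assumption): only the columns used by the query,
# with made-up accession values for the ERP/ERS/EGAN identifiers.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE studies     (EGAS TEXT, ERP TEXT);
CREATE TABLE experiments (XREF_ERP TEXT, XREF_ERS TEXT);
CREATE TABLE samples     (ERS TEXT, EGAN TEXT, submitter_id TEXT);

INSERT INTO studies     VALUES ('EGAS00001003953', 'ERP000001');
INSERT INTO experiments VALUES ('ERP000001', 'ERS000002');
INSERT INTO experiments VALUES ('ERP000001', 'ERS000001');
INSERT INTO samples     VALUES ('ERS000001', 'EGAN00000000001', 'sample-A');
INSERT INTO samples     VALUES ('ERS000002', 'EGAN00000000002', 'sample-B');
""")

QUERY = """
SELECT s.EGAS, n.submitter_id, n.EGAN
FROM studies s
  LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
  LEFT JOIN samples n     ON x.XREF_ERS = n.ERS
WHERE s.EGAS = ?
ORDER BY n.EGAN
"""


def samples_in_study(conn, egas):
    """Return (EGAS, submitter_id, EGAN) rows for one study accession."""
    # The study accession is passed as a bound parameter, not pasted into the SQL.
    return conn.execute(QUERY, (egas,)).fetchall()


result = samples_in_study(conn, "EGAS00001003953")
for row in result:
    print(row)
```

To query a real dump, replace `":memory:"` with the path of the SQLite file written by sqlslurp.py and drop the toy `CREATE`/`INSERT` statements.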

For more info, please see the detailed ega-xml-dl README.

Feedback welcome

These scripts started out as (and still are) a small internal tool at the Omics IT & Datamanagement Core Facility at the German Cancer Research Centre (DKFZ), a publicly funded body. They are made openly available under the MIT license, following the philosophy of "public money, public code".

If you use them and have ideas, suggestions, or improvements to contribute, feel free to open issues or merge requests upstream:

https://gitlab.com/DKFZ-ODCF/ega-xml-dl

(Remember: asking questions is a way to improve it too!)