This is a collection of scripts to build a local information store of an EGA submitter account.

- `ega-xml-dl` uses the Webin API to download the complete metadata of a submitter account (metadata only, not the actual genome data, so a handful of megabytes rather than hundreds of gigabytes).
- `sqlslurp.py` reads ('slurps') the most pertinent fields of an XML dump into an SQLite database, for easier querying.
This simple script lets you batch-download the contents of an ENA/EGA submission box at the EBI. It fetches the XML representation of all objects in the submission box.
```sh
# for EGA boxes
bash ega-xml-dl ega-box-NNN

# for ENA boxes
bash ega-xml-dl Webin-NNNNN
```
For more info, please see the detailed ega-xml-dl README.
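For orientation, the download step amounts to little more than authenticated HTTP GETs per object type, saved to disk. Here is a minimal Python sketch of that idea; it is not ega-xml-dl itself, and the endpoint URL, route layout, and object-type list are hypothetical placeholders:

```python
#!/usr/bin/env python3
"""Minimal sketch of batch-downloading a submission box's XML.

NOTE: this is NOT ega-xml-dl. The BASE URL, the route layout, and the
object-type list are illustrative assumptions, not the real API.
"""
import os
import pathlib

import requests  # third-party: pip install requests

BOX = "ega-box-NNN"                      # the submission box to dump
BASE = "https://example.org/webin-api"   # HYPOTHETICAL endpoint
TYPES = ["studies", "samples", "experiments", "runs", "datasets"]

session = requests.Session()
# box credentials, e.g. taken from the environment
session.auth = (BOX, os.environ["BOX_PASSWORD"])

for obj_type in TYPES:
    out_dir = pathlib.Path(BOX) / obj_type
    out_dir.mkdir(parents=True, exist_ok=True)
    # fetch the XML dump for this object type (hypothetical route)
    resp = session.get(f"{BASE}/{obj_type}", params={"format": "xml"})
    resp.raise_for_status()
    (out_dir / "dump.xml").write_bytes(resp.content)
```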
Given the interconnected nature of EGA data, the XML representation makes it very hard to answer questions such as 'which files are in dataset X' or 'which samples have been published under study Y'. SQL, with its support for JOINs, is far more suitable.
Thus, sqlslurp.py will take a box's XML dump and parse the most pertinent fields into an SQLite database:
```sh
python3 sqlslurp.py ega-box-NNN
```
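Under the hood, a 'slurp' like this boils down to walking the XML with a parser and inserting selected fields into SQLite tables. The following sketch shows the general shape of such a transformation for samples only; it is not the real sqlslurp.py, and the input path, element names, and table layout are assumptions:

```python
#!/usr/bin/env python3
"""Minimal sketch of slurping XML fields into SQLite.

NOTE: not the real sqlslurp.py; the input path, the element names
(SAMPLE/accession/alias) and the table layout are illustrative guesses.
"""
import sqlite3
import xml.etree.ElementTree as ET

con = sqlite3.connect("ega-box-NNN.sqlite")
con.execute(
    """CREATE TABLE IF NOT EXISTS samples (
           EGAN TEXT PRIMARY KEY,   -- accession
           submitter_id TEXT        -- submitter's own alias
       )"""
)

tree = ET.parse("ega-box-NNN/samples/dump.xml")
for sample in tree.getroot().iter("SAMPLE"):
    con.execute(
        "INSERT OR REPLACE INTO samples (EGAN, submitter_id) VALUES (?, ?)",
        (sample.get("accession"), sample.get("alias")),
    )
con.commit()
con.close()
```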
You can then interrogate this database with queries like:
```sql
-- find all Samples in Study, as linked via eXperiment.
SELECT
    s.EGAS,
    n.submitter_id,
    n.EGAN
FROM
    studies s
    LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
    LEFT JOIN samples n ON x.XREF_ERS = n.ERS
WHERE
    s.EGAS = 'EGAS00001003953'
ORDER BY
    n.EGAN;
```
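You can run such queries in the sqlite3 shell, or straight from Python's standard library. A short usage sketch (the database filename is an assumption; point it at whatever file sqlslurp.py produced for your box):

```python
import sqlite3

# the filename is an assumption; adjust it to sqlslurp.py's output
con = sqlite3.connect("ega-box-NNN.sqlite")
query = """
    SELECT s.EGAS, n.submitter_id, n.EGAN
    FROM studies s
    LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
    LEFT JOIN samples n ON x.XREF_ERS = n.ERS
    WHERE s.EGAS = ?
    ORDER BY n.EGAN
"""
for egas, submitter_id, egan in con.execute(query, ("EGAS00001003953",)):
    print(egas, submitter_id, egan)
con.close()
```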
For more info, please see the detailed sqlslurp.py README.
These scripts started out as (and still are) a small internal tool at the Omics IT & Data Management Core Facility at the German Cancer Research Center (DKFZ), a publicly funded body. They are made openly available under the MIT license, in the spirit of "public money, public code".
If you use them and have any ideas, suggestions, or improvements, feel free to contribute or open issues upstream at:
https://gitlab.com/DKFZ-ODCF/ega-xml-dl
(Remember: asking questions is a way to improve them, too!)