batch download everything in your EGA or ENA submission account

DKFZ-ODCF/ega-xml-dl

EGA XML downloader

This is a collection of scripts to build a local information store of an EGA submitter account.

ega-xml-dl will use the WebIn API to download the complete metadata of a submitter account (metadata only, not the actual genomic data, so a handful of megabytes, not hundreds of gigabytes).

sqlslurp.py will read ('slurp') the most pertinent fields of an XML dump into an SQLite database, for easier querying.

EGA XML downloader

This simple script lets you batch-download the contents of an ENA/EGA submission box at the EBI. It fetches the XML representation of all objects in the submission box.

# for ega-boxes
bash ega-xml-dl ega-box-NNN
# for ENA boxes
bash ega-xml-dl Webin-NNNNN

For more info, please see the detailed ega-xml-dl README.

SQL Slurp

Given the interconnected nature of EGA data, the XML representation makes it very hard to answer questions such as 'which files are in dataset X' or 'which samples have been published under study Y'. SQL, with its support for JOINs, is far more suitable. Thus, sqlslurp.py will take a box's XML dump, and parse the most pertinent fields into an SQLite database:

python3 sqlslurp.py ega-box-NNN

You can then query this database with queries like:

-- find all Samples in Study, as linked via eXperiment.
SELECT
  s.EGAS,
  n.submitter_id,
  n.EGAN
FROM
  studies s
  LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
  LEFT JOIN samples n     ON x.XREF_ERS = n.ERS
WHERE
  s.EGAS = 'EGAS00001003953'
ORDER BY
  n.EGAN;

For more info, please see the detailed ega-xml-dl README.
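
The same query can also be run from Python with the standard-library sqlite3 module. The sketch below is illustrative only. It builds a tiny in-memory database containing just the tables and columns named in the query above, because the real schema and the file name of the database produced by sqlslurp.py are not described here. All identifiers other than the study accession are made-up placeholders, and the helper names (samples_in_study, demo_connection) are hypothetical.

```python
import sqlite3

# Same join as the SQL example above, with the study accession as a bound parameter.
QUERY = """
SELECT s.EGAS, n.submitter_id, n.EGAN
FROM studies s
  LEFT JOIN experiments x ON x.XREF_ERP = s.ERP
  LEFT JOIN samples n     ON x.XREF_ERS = n.ERS
WHERE s.EGAS = ?
ORDER BY n.EGAN
"""


def samples_in_study(conn, egas):
    """Return (EGAS, submitter_id, EGAN) rows for one study accession."""
    return conn.execute(QUERY, (egas,)).fetchall()


def demo_connection():
    """Build an in-memory stand-in database.

    Assumption: only the columns used by the query are modelled; the real
    database written by sqlslurp.py will contain more tables and fields.
    All ERP/ERS/EGAN values below are placeholders, not real accessions.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE studies     (EGAS TEXT, ERP TEXT);
        CREATE TABLE experiments (XREF_ERP TEXT, XREF_ERS TEXT);
        CREATE TABLE samples     (ERS TEXT, EGAN TEXT, submitter_id TEXT);

        INSERT INTO studies VALUES ('EGAS00001003953', 'ERP000001');
        INSERT INTO studies VALUES ('EGAS00000000000', 'ERP000002');

        INSERT INTO experiments VALUES ('ERP000001', 'ERS000001');
        INSERT INTO experiments VALUES ('ERP000001', 'ERS000002');
        INSERT INTO experiments VALUES ('ERP000002', 'ERS000003');

        INSERT INTO samples VALUES ('ERS000001', 'EGAN00000000002', 'sample-B');
        INSERT INTO samples VALUES ('ERS000002', 'EGAN00000000001', 'sample-A');
        INSERT INTO samples VALUES ('ERS000003', 'EGAN00000000003', 'sample-C');
        """
    )
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    for row in samples_in_study(conn, "EGAS00001003953"):
        print(row)
```

To query a real dump, replace demo_connection() with sqlite3.connect() pointed at the database file that sqlslurp.py wrote for your box. Passing the accession as a bound parameter rather than formatting it into the SQL string keeps the query reusable across studies.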

Feedback welcome

These scripts started out as (and still are) a small internal tool at the Omics IT & Datamanagement Core Facility at the German Cancer Research Centre (DKFZ), a publicly funded body. They are made openly available under the MIT license, following the philosophy of "public money, public code".

If you use them and have ideas or suggestions, or wish to contribute improvements, feel free to contribute or open issues upstream:

https://gitlab.com/DKFZ-ODCF/ega-xml-dl

(Remember: asking questions is a way to improve it too!)
