This script downloads and processes genome sequences and annotations for a particular species from an Ensembl release, e.g.
download_ensembl mouse 94 <your_email> <output_directory>
The script:
- Downloads top-level sequences for the species' genome in FASTA format (primary assembly sequences in the case of human and mouse; see the download sketch after this list),
- Downloads gene annotations in GTF format,
- Creates a genome index for the STAR mapping tool,
- Builds Salmon indexes for the transcripts defined in the GTF file (see the indexing sketch after this list),
- Downloads gene information (gene ID, name, chromosome, description, Entrez ID, gene biotype) into a file 'genes.tsv'. Only genes on primary assembly sequences are retained (i.e. those on patch sequences are removed),
- Downloads transcript information (transcript ID, transcript biotype, gene ID, chromosome) into a file 'transcripts.tsv'. Only transcripts of genes on primary assembly sequences are retained,
- Downloads information on mouse and human orthologous genes to files '_orthologs.tsv'.
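For illustration, a minimal sketch of the download steps, assuming mouse and Ensembl release 94 (the URLs follow the standard Ensembl FTP layout; this is not the script itself):

# Fetch the primary assembly sequence and the gene annotations for
# mouse from the Ensembl release 94 FTP site.
RELEASE=94
BASE=ftp://ftp.ensembl.org/pub/release-${RELEASE}

wget "${BASE}/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz"
wget "${BASE}/gtf/mus_musculus/Mus_musculus.GRCm38.${RELEASE}.gtf.gz"
gunzip Mus_musculus.GRCm38.dna.primary_assembly.fa.gz Mus_musculus.GRCm38.${RELEASE}.gtf.gz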
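And a sketch of the index-building steps, using STAR's genomeGenerate mode and Salmon's index command. The directory names are illustrative, and transcript sequences are extracted here with gffread; the script may derive them differently:

# Build a STAR genome index from the assembly and the annotations.
STAR --runMode genomeGenerate \
     --genomeDir STAR_indices/primary_assembly_2.6.1d \
     --genomeFastaFiles Mus_musculus.GRCm38.dna.primary_assembly.fa \
     --sjdbGTFfile Mus_musculus.GRCm38.94.gtf \
     --sjdbOverhang 100

# Extract transcript sequences from the genome and build a Salmon index.
gffread -w transcripts.fa -g Mus_musculus.GRCm38.dna.primary_assembly.fa Mus_musculus.GRCm38.94.gtf
salmon index -t transcripts.fa -i SALMON_indices/primary_assembly_0.12.0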
After the script completes, the output directory will look like this (example for mouse, Ensembl release 94):
|-- BOWTIE2_indices
| |-- primary_assembly_2.3.4
| `-- primary_assembly_2.3.4.3
|-- KALLISTO_indices
| |-- primary_assembly -> primary_assembly_0.45.0/kallisto_index
| |-- primary_assembly_0.43.1
| `-- primary_assembly_0.45.0
|-- Mus_musculus.GRCm38.94.gtf
|-- README
|-- SALMON_indices
| |-- primary_assembly -> primary_assembly_0.12.0
| |-- primary_assembly_0.12.0
| `-- primary_assembly_0.8.2
|-- STAR_indices
| |-- primary_assembly -> primary_assembly_2.6.1d
| |-- primary_assembly_2.5.3a
| `-- primary_assembly_2.6.1d
|-- download_ensembl.sh
|-- mouse_primary_assembly.fa
|-- primary_assembly
| |-- 1.fa
| |-- 10.fa
| |-- .........
`-- transcripts_ref
    |-- transcripts
    |-- transcripts.chrlist
    |-- transcripts.grp
    |-- transcripts.idx.fa
    |-- transcripts.n2g.idx.fa
    |-- transcripts.seq
    |-- transcripts.ti
    `-- transcripts.transcripts.fa
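The unversioned 'primary_assembly' entries above are symlinks that select a default version of each index. A hypothetical sketch of how such a link might be maintained (the real script may manage versions differently):

# Point the default STAR index at the 2.6.1d build.
ln -sfn primary_assembly_2.6.1d STAR_indices/primary_assembly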
This script pre-processes sample read files into a format suitable for uploading to ENA. The procedure is as follows (a sketch of the per-sample steps appears after the list):
- generate md5sums for the read files before copying them to the output folder
- copy the read files to the output folder
- check the md5sums after copying
- for each sample:
    - find all read1/read2 files
    - check that the same number of read1 and read2 files was found
    - for each of read1 and read2:
        - unzip the files and concatenate them into one merged file
        - check that the number of lines in the merged file equals the sum of the line counts of the pre-merge read files
        - compress the merged file
        - move the compressed, merged file to the output folder
        - check that the compressed, merged file has the same number of lines as the uncompressed, merged file
    - check that the merged, compressed read1 and read2 files have the same number of lines
- create a file in the output folder containing md5sums for the merged, compressed read files
- finish
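A minimal sketch of the per-sample merge-and-check steps, assuming gzipped FASTQ files named <sample>.read1.*.fastq.gz and <sample>.read2.*.fastq.gz (the naming scheme and variable names are assumptions, not the script's actual conventions):

sample=$1
out_dir=$2

for read in read1 read2; do
    parts=( "${sample}".${read}.*.fastq.gz )

    # Sum the line counts of the individual (uncompressed) read files.
    expected=0
    for p in "${parts[@]}"; do
        expected=$(( expected + $(zcat "$p" | wc -l) ))
    done

    # Unzip and concatenate into one merged file.
    merged=${sample}.${read}.fastq
    zcat "${parts[@]}" > "$merged"

    # The merged file must contain the sum of the pre-merge line counts.
    [ "$(wc -l < "$merged")" -eq "$expected" ] || exit 1

    # Compress, move to the output folder, and re-check the line count.
    gzip "$merged"
    mv "${merged}.gz" "${out_dir}/"
    [ "$(zcat "${out_dir}/${merged}.gz" | wc -l)" -eq "$expected" ] || exit 1
done

# Paired read1/read2 files must have the same number of lines.
n1=$(zcat "${out_dir}/${sample}.read1.fastq.gz" | wc -l)
n2=$(zcat "${out_dir}/${sample}.read2.fastq.gz" | wc -l)
[ "$n1" -eq "$n2" ] || exit 1

# Record md5sums for the merged, compressed read files.
(cd "$out_dir" && md5sum "${sample}".read[12].fastq.gz > "${sample}.md5")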
This script executes an XML query against a particular version of BioMart (see the sketch below).
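As an illustration, such a query can be POSTed to a BioMart martservice endpoint with curl. Everything below is an assumed example, not necessarily what the script sends; the archive host shown corresponds to Ensembl release 94:

# Write an XML query retrieving gene IDs and names for mouse.
cat > query.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="0" uniqueRows="0" count="" datasetConfigVersion="0.6">
  <Dataset name="mmusculus_gene_ensembl" interface="default">
    <Attribute name="ensembl_gene_id" />
    <Attribute name="external_gene_name" />
  </Dataset>
</Query>
EOF

# POST the query to the BioMart service for the Ensembl 94 archive.
curl -s --data-urlencode query@query.xml \
    "http://oct2018.archive.ensembl.org/biomart/martservice" > gene_names.tsv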