USMortality/Megahit-SARS-CoV-2

Megahit SARS-CoV-2

This repo contains a one-click script that configures an Amazon Linux server with all the tools needed to run the de novo assembly of the claimed SARS-CoV-2 sequence with Megahit.

Usage

Either run ./megahit.sh on an Amazon EC2 Linux instance, or build the Docker image with docker build . -t megahit -f ./Dockerfile --platform linux/amd64.

Output

The out/ folder contains the output of a run carried out on a 32-core Amazon c5 instance; the assembly took about 15 minutes.

Result

Megahit result: 29463 contigs, total 14438186 bp, min 200 bp, max 29802 bp, avg 490 bp, N50 458 bp
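The summary line above can be recomputed from the contigs file. The following is a minimal sketch, not part of this repo: `contig_stats` is a hypothetical helper name, and it assumes a plain FASTA input. N50 is taken here as the length of the shortest contig among the longest contigs that together cover at least half of all assembled bases.

```shell
# contig_stats FILE: print count, total, min, max, avg and N50 for the records of a FASTA file.
# Hypothetical helper for illustration; not part of megahit.sh.
contig_stats() {
  awk '
    /^>/ { if (len) print len; len = 0; next }   # header: emit the length of the previous record
         { len += length($0) }                   # sequence line: accumulate bases
    END  { if (len) print len }                  # emit the last record
  ' "$1" | sort -rn | awk '
    { l[NR] = $1; total += $1 }                  # lengths arrive sorted longest first
    END {
      if (NR == 0) exit 1
      half = total / 2
      for (i = 1; i <= NR; i++) {
        run += l[i]
        if (run >= half) { n50 = l[i]; break }   # first contig at which the running sum reaches half
      }
      printf "%d contigs, total %d bp, min %d bp, max %d bp, avg %d bp, N50 %d bp\n", NR, total, l[NR], l[1], total / NR, n50
    }'
}
```

Assuming the run wrote Megahit's default final.contigs.fa into out/, it could be called as contig_stats out/final.contigs.fa. The average is truncated to an integer, which may or may not match how Megahit rounds.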

The longest contig generated by Megahit is 29,802 bp long.
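To pull that longest contig out of a multi-record FASTA file, something like the sketch below could be used. `longest_contig` is a hypothetical helper name, and the input file name in the usage note is an assumption about where the assembly was written.

```shell
# longest_contig FILE: print the longest record of a multi-FASTA file as a two-line FASTA record.
# Hypothetical helper for illustration; not part of megahit.sh.
longest_contig() {
  awk '
    # Remember the current record if it is longer than the best one seen so far.
    function keep_if_longer() {
      if (length(seq) > best) { best = length(seq); best_hdr = hdr; best_seq = seq }
    }
    /^>/ { keep_if_longer(); hdr = $0; seq = ""; next }   # new header: close the previous record
         { seq = seq $0 }                                 # join wrapped sequence lines
    END  { keep_if_longer(); if (best) { print best_hdr; print best_seq } }
  ' "$1"
}
```

For example, longest_contig out/final.contigs.fa > longest.fasta would write the longest contig to a file named longest.fasta (both file names are assumptions).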

This run does not reproduce the exact claimed SARS-CoV-2 sequences that Wu et al. (2020) published (30,473 bp / 29,875 bp / 29,903 bp).

Run

aws ec2 run-instances --image-id ami-026b57f3c383c2eec --count 1 --instance-type c5a.8xlarge --key-name ben --ebs-optimized --block-device-mappings "[ { \"DeviceName\": \"/dev/xvda\", \"Ebs\": { \"VolumeSize\": 100 } } ]"

vim ~/.ssh/config

Host megahit
  Hostname 18.232.52.37
  Port 22
  User ec2-user
  IdentityFile ~/.ssh/id_rsa
  IdentitiesOnly yes

scp megahit.sh megahit:/home/ec2-user/.
ssh megahit
chmod +x megahit.sh
screen
./megahit.sh

scp megahit:/home/ec2-user/out.zip .

Download original genomes

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'>MN908947.1.fasta
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.2'>MN908947.2.fasta
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>MN908947.3.fasta

Show differences

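# Align versions 1 and 3 with MAFFT and unwrap the aligned FASTA to one line per header/sequence
cat MN908947.1.fasta MN908947.3.fasta | mafft - | awk '{printf(/^>/?"\n%s\n":"%s"),$0}' | grep . > temp.aln
# Print every alignment column where the two sequences differ: position, base in .1, base in .3
paste <(sed -n 2p temp.aln | grep -o .) <(sed -n 4p temp.aln | grep -o .) | awk '$1!=$2{print NR,$1,$2}'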
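The comparison above relies on bash process substitution. As a rough alternative, the column-by-column check can be sketched in plain awk. `aln_diff` is a hypothetical helper name, and the sketch assumes its input is an already aligned FASTA file whose first two records have equal length (for example, MAFFT output saved to a file).

```shell
# aln_diff FILE: list the columns at which the first two records of an aligned FASTA differ.
# Output: one line per differing column, as "position base1 base2".
# Hypothetical helper for illustration; assumes both records have the same aligned length.
aln_diff() {
  awk '
    /^>/   { n++; next }          # each header starts a new record
    n == 1 { a = a $0 }           # join wrapped lines of record 1
    n == 2 { b = b $0 }           # join wrapped lines of record 2
    END {
      for (i = 1; i <= length(a); i++) {
        x = substr(a, i, 1); y = substr(b, i, 1)
        if (x != y) print i, x, y
      }
    }' "$1"
}
```

Positions are alignment columns, so they include gap characters and need not match coordinates in either unaligned sequence.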

Compare the produced sequence to the original MN908947.3 Wuhan-Hu-1 Isolate

# Make a BLAST database from the target sequence
makeblastdb -in MN908947.3.fasta -dbtype nucl

# Reverse-complement the contig (this is what seqtk seq -r does; it is not a plain end-to-front reversal)
seqtk seq -r k141_13590.fasta > k141_13590r.fasta

# Compare the genomes (tabular output, -outfmt 6)
blastn -query k141_13590r.fasta -db MN908947.3.fasta -evalue 1 -task megablast -outfmt 6 > k141_13590r_MN908947.3.crunch
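The .crunch file written above is a tab-separated table with the twelve default columns of BLAST's -outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore. A small sketch for reading it follows; `blast_best_hit` is a hypothetical helper name and simply labels the row with the highest bitscore.

```shell
# blast_best_hit FILE: print the highest-bitscore row of a BLAST -outfmt 6 table with labelled fields.
# Hypothetical helper for illustration; assumes the default 12-column tabular layout.
blast_best_hit() {
  awk -F '\t' '
    NR == 1 || $12 + 0 > best { best = $12 + 0; row = $0 }   # keep the row with the largest bitscore
    END {
      if (NR == 0) exit 1                                    # no hits in the file
      split(row, f, "\t")
      printf "query=%s subject=%s identity=%s%% length=%s mismatches=%s gapopen=%s q=%s-%s s=%s-%s\n", f[1], f[2], f[3], f[4], f[5], f[6], f[7], f[8], f[9], f[10]
    }' "$1"
}
```

For example, blast_best_hit k141_13590r_MN908947.3.crunch would summarize the strongest alignment between the contig and the reference.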

Random Genome and Reads Generator

This script generates a random genome and short sequence reads based upon it.

Installation

Make sure Node.js is installed and on your PATH, then run:

npm i

Usage

  • Generate genome/reads: npm run generate
  • Align reads: npm run align SRR00000001 XX000001

You can configure the settings in ./genome.ts.
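The idea behind the generator can be illustrated with a short awk-based sketch. This is not the repo's TypeScript implementation: `random_genome` and `sample_reads` are hypothetical names, bases are drawn uniformly, and reads are error-free substrings taken from random positions. The header reuses the XX000001 placeholder from the usage example above.

```shell
# random_genome LENGTH [SEED]: print a FASTA record of LENGTH uniformly random bases.
# Hypothetical helper for illustration; not the generator from this repo.
random_genome() {
  awk -v n="$1" -v seed="${2:-1}" 'BEGIN {
    srand(seed)
    printf ">XX000001 random genome\n"
    for (i = 1; i <= n; i++) printf "%s", substr("ACGT", int(rand() * 4) + 1, 1)
    printf "\n"
  }'
}

# sample_reads FASTA READ_LENGTH COUNT [SEED]: print COUNT error-free reads as FASTA,
# each taken from a random start position of a single-record FASTA file.
sample_reads() {
  awk -v len="$2" -v count="$3" -v seed="${4:-1}" '
    !/^>/ { genome = genome $0 }               # join wrapped sequence lines
    END {
      srand(seed)
      span = length(genome) - len + 1          # number of valid start positions
      if (span < 1) exit 1                     # genome shorter than one read
      for (i = 1; i <= count; i++) {
        start = int(rand() * span) + 1
        printf ">read%d pos=%d\n%s\n", i, start, substr(genome, start, len)
      }
    }' "$1"
}
```

For example, random_genome 30000 42 > genome.fasta followed by sample_reads genome.fasta 150 1000 42 > reads.fasta would produce a test genome and reads (file names and sizes are arbitrary). The exact sequence for a given seed depends on the awk implementation.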

Common Virus Sequences

  • Norovirus: AB039774
  • Ebola Virus: AF086833
  • Hantavirus: AF291704
  • Marburg Virus: AY430365
  • RSV: NC_001803
  • HCoV-229E: NC_002645
  • SARS-CoV(-1): NC_004718
  • HCoV-NL63: NC_005831
  • HCoV-OC43: NC_006213
  • HCoV-HKU1: NC_006577
  • Influenza A: NC_026431
  • Rhinovirus: NC_001617
  • Adenovirus: NC_044935
  • SARS-CoV-2: NC_045512
  • SARS-CoV-2 (JN.1): PP785598.1
  • Parainfluenza: NC_075446
  • Parvovirus: NC_075988
  • Measles virus: OK424761
  • HIV-1: NC_001802
  • HIV-2: NC_001722
  • Poliovirus 1: OQ286203
  • HPV (16): NC_001526
  • Hepatitis A: NC_008250
  • Hepatitis B: NC_003977
  • Hepatitis C: NC_038882

See the reference_genomes folder. Each genome was downloaded with a command of the form:

curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_038311' > NC_038311.fa
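Downloading many accessions one by one gets repetitive, so the same efetch request can be wrapped in a loop. The sketch below is an assumption about how that might be scripted, not how the reference_genomes folder was actually populated. `efetch_url` and `fetch_genomes` are hypothetical names.

```shell
# efetch_url ACCESSION: build the NCBI E-utilities URL for a nucleotide FASTA record.
efetch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=%s\n' "$1"
}

# fetch_genomes DIR ACCESSION...: download each accession into DIR/ACCESSION.fa,
# skipping non-empty files that are already present.
# Hypothetical helper for illustration.
fetch_genomes() {
  dir=$1; shift
  mkdir -p "$dir"
  for acc in "$@"; do
    [ -s "$dir/$acc.fa" ] && continue                              # already downloaded
    curl -fsS "$(efetch_url "$acc")" > "$dir/$acc.fa" || return 1  # -f: treat HTTP errors as failures
    sleep 1                                                        # pause between requests to respect NCBI rate limits
  done
}
```

For example, fetch_genomes reference_genomes NC_045512 NC_004718 NC_002645 would fetch three of the accessions listed above.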
