This repo contains a one-click script that configures an Amazon EC2 (Amazon Linux) server with all the tools needed to run the de novo assembly of the claimed SARS-CoV-2 sequence with Megahit.
Either run the ./megahit.sh file on an Amazon EC2 Linux instance, or build the Docker image from the Dockerfile via:
docker build . -t megahit -f ./Dockerfile --platform linux/amd64
The out/ folder contains the output of a run carried out on a 32-core Amazon c5 instance; the assembly took about 15 minutes.
Megahit result:
29463 contigs, total 14438186 bp, min 200 bp, max 29802 bp, avg 490 bp, N50 458 bp
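The summary line above lists the usual contig statistics. As a rough sketch of how such numbers can be recomputed from a FASTA file (this is not Megahit's own code; the function name contig_stats is made up for illustration, and N50 is taken here as the length of the contig at which the running total, going from longest to shortest, first reaches half of all bases):

```shell
# contig_stats FILE: print count, total, min, max, avg and N50 of a FASTA file.
# Hypothetical helper for illustration only, not part of this repo.
contig_stats() {
  # First awk: emit one sequence length per record (handles wrapped FASTA).
  awk '/^>/ { if (n) print len; n = 1; len = 0; next }
       { len += length($0) }
       END { if (n) print len }' "$1" |
  sort -rn |
  awk '{ L[NR] = $1; total += $1 }
       END {
         if (NR == 0) exit 1
         # N50: walk from longest to shortest until half of all bases are covered.
         for (i = 1; i <= NR; i++) {
           acc += L[i]
           if (acc * 2 >= total) { n50 = L[i]; break }
         }
         printf "%d contigs, total %d bp, min %d bp, max %d bp, avg %d bp, N50 %d bp\n",
                NR, total, L[NR], L[1], total / NR, n50
       }'
}
```

For example: contig_stats out/final.contigs.fa (assuming the run in out/ kept Megahit's default output file name, final.contigs.fa).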
The longest contig generated by Megahit is 29,802 bp long.
Here we are not able to reproduce the exact claimed SARS-CoV-2 sequences that Wu et al. (2020) published (30,473 bp / 29,875 bp / 29,903 bp).
aws ec2 run-instances --image-id ami-026b57f3c383c2eec --count 1 --instance-type c5a.8xlarge --key-name ben --ebs-optimized --block-device-mappings "[ { \"DeviceName\": \"/dev/xvda\", \"Ebs\": { \"VolumeSize\": 100 } } ]"
vim ~/.ssh/config
Host megahit
    Hostname 18.232.52.37
    Port 22
    User ec2-user
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
scp megahit.sh megahit:/home/ec2-user/.
ssh megahit
chmod +x megahit.sh
screen
./megahit.sh
scp megahit:/home/ec2-user/out.zip .
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.1'>MN908947.1.fasta
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.2'>MN908947.2.fasta
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=MN908947.3'>MN908947.3.fasta
cat MN908947.1.fasta MN908947.3.fasta|mafft -|awk '{printf(/^>/?"\n%s\n":"%s"),$0}'|grep .>temp.aln;paste <(sed -n 2p temp.aln|grep -o .) <(sed -n 4p temp.aln|grep -o .)|awk '$1!=$2{print NR,$1,$2}'
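The one-liner above aligns two versions with mafft, joins each wrapped sequence into a single line, and prints every alignment column where the two differ. A more readable sketch of the comparison step alone, assuming the alignment has already been written to a file with exactly two sequences (the function name aln_diff and the file name pair.aln are made up for illustration):

```shell
# aln_diff FILE: FILE is an aligned FASTA with exactly two sequences
# (for example the output of mafft). Prints "column base1 base2" for
# every alignment column where the two sequences differ.
aln_diff() {
  awk '/^>/ { n++; next }            # a header line starts a new sequence
       { seq[n] = seq[n] $0 }        # join wrapped lines into one string
       END {
         len = length(seq[1])
         if (length(seq[2]) > len) len = length(seq[2])
         for (i = 1; i <= len; i++) {
           a = substr(seq[1], i, 1)
           b = substr(seq[2], i, 1)
           if (a != b) print i, a, b
         }
       }' "$1"
}
```

Possible usage: cat MN908947.1.fasta MN908947.3.fasta | mafft - > pair.aln, followed by aln_diff pair.aln. Column numbers refer to the alignment, gaps included, not to positions in either genome.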
# Make Blast DB of target sequence
makeblastdb -in MN908947.3.fasta -dbtype nucl
# Reverse-complement the contig (seqtk seq -r reverses it and complements each base)
seqtk seq -r k141_13590.fasta > k141_13590r.fasta
# Compare the genomes
blastn -query k141_13590r.fasta -db MN908947.3.fasta -evalue 1 -task megablast -outfmt 6 > k141_13590r_MN908947.3.crunch
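Two details of the commands above are easy to miss: seqtk seq -r produces the reverse complement (not just the reversed string), and -outfmt 6 writes twelve tab-separated columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The following is only a sketch of both ideas; the helper names revcomp and best_hit are made up for illustration, and revcomp handles bare sequence lines rather than full FASTA records:

```shell
# revcomp: read bare sequence lines on stdin, print their reverse complement.
# Bases other than A, C, G, T are written as N.
revcomp() {
  awk 'BEGIN { comp["A"]="T"; comp["C"]="G"; comp["G"]="C"; comp["T"]="A" }
       { s = toupper($0); out = ""
         for (i = length(s); i >= 1; i--) {
           c = substr(s, i, 1)
           out = out ((c in comp) ? comp[c] : "N")
         }
         print out }'
}

# best_hit FILE: FILE is BLAST tabular output (-outfmt 6). Prints the hit with
# the highest bit score as: query subject pident length qstart-qend sstart-send
best_hit() {
  awk -F '\t' '$12 + 0 > best || NR == 1 {
         best = $12 + 0
         line = $1 " " $2 " " $3 " " $4 " " $7 "-" $8 " " $9 "-" $10
       }
       END { if (NR) print line }' "$1"
}
```

Possible usage: best_hit k141_13590r_MN908947.3.crunch. In BLAST tabular output for nucleotide searches, a hit with sstart greater than send lies on the minus strand of the subject, which is a quick way to check whether the contig needed reverse-complementing.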
This script generates a random genome and short sequence reads based on it.
Make sure a Node.js installation is on your PATH, then:
npm i
- Generate genome/reads:
npm run generate
- Align reads:
npm run align SRR00000001 XX000001
You can configure settings in ./genome.ts
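To illustrate the idea only (this is not the code in ./genome.ts, whose settings and output format may differ), a random genome and error-free reads sampled from it can be sketched in shell. The function names, the uniform base composition and the forward-strand-only sampling are all assumptions made for this example:

```shell
# random_genome LENGTH SEED: print one random A/C/G/T string of LENGTH bases.
random_genome() {
  awk -v n="$1" -v seed="$2" 'BEGIN {
    srand(seed)
    for (i = 0; i < n; i++) g = g substr("ACGT", int(rand() * 4) + 1, 1)
    print g
  }'
}

# sample_reads GENOME READLEN COUNT SEED: print COUNT substrings of READLEN bases
# taken from uniformly random start positions in GENOME
# (no sequencing errors, forward strand only).
sample_reads() {
  awk -v g="$1" -v len="$2" -v count="$3" -v seed="$4" 'BEGIN {
    srand(seed)
    span = length(g) - len + 1      # number of valid start positions
    if (span < 1) exit 1
    for (i = 0; i < count; i++) {
      start = int(rand() * span) + 1
      print substr(g, start, len)
    }
  }'
}
```

Possible usage: g=$(random_genome 30000 1), then sample_reads "$g" 150 1000 2 > reads.txt. A fixed seed keeps runs repeatable with the same awk implementation.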
| Virus | Accession |
| --- | --- |
| Norovirus | AB039774 |
| Ebola Virus | AF086833 |
| Hantavirus | AF291704 |
| Marburg Virus | AY430365 |
| RSV | NC_001803 |
| HCoV-229E | NC_002645 |
| SARS-CoV(-1) | NC_004718 |
| HCoV-NL63 | NC_005831 |
| HCoV-OC43 | NC_006213 |
| HCoV-HKU1 | NC_006577 |
| Influenza A | NC_026431 |
| Rhinovirus | NC_001617 |
| Adenovirus | NC_044935 |
| SARS-CoV-2 | NC_045512 |
| SARS-CoV-2 (JN.1) | PP785598.1 |
| Parainfluenza | NC_075446 |
| Parvovirus | NC_075988 |
| Measles virus | OK424761 |
| HIV-1 | NC_001802 |
| HIV-2 | NC_001722 |
| Poliovirus 1 | OQ286203 |
| HPV (16) | NC_001526 |
| Hepatitis A | NC_008250 |
| Hepatitis B | NC_003977 |
| Hepatitis C | NC_038882 |
See the reference_genomes folder. Each file was downloaded with a command of this form, for example:
curl 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=NC_038311' > NC_038311.fa
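Fetching many accessions one by one gets tedious, so the same efetch call could be wrapped in a loop. This is a sketch, not a script from this repo: the function names are made up, and the one-second pause is an assumption intended to stay well within NCBI's request limits for clients without an API key.

```shell
# efetch_url ACCESSION: print the NCBI E-utilities URL for a nucleotide FASTA record.
efetch_url() {
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&rettype=fasta&id=%s\n' "$1"
}

# fetch_references DIR ACCESSION...: download each accession into DIR/ACCESSION.fa.
fetch_references() {
  dir=$1; shift
  mkdir -p "$dir"
  for acc in "$@"; do
    # -f makes curl fail on HTTP errors instead of saving an error page.
    curl -fsS "$(efetch_url "$acc")" -o "$dir/$acc.fa" || return 1
    sleep 1   # assumption: short pause between requests
  done
}
```

Possible usage: fetch_references reference_genomes AB039774 AF086833 NC_045512.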