Create custom database #6

Open
jacodela opened this issue Jun 20, 2018 · 3 comments

Comments

@jacodela

I'm interested in mapping metagenome reads to genome bins that I previously assembled and that are not available in public databases. The documentation on creating a custom database is limited to subsetting the provided RefSeq representative database given a genome with a known accession number, and I can't find how to create a truly custom database. Is this even possible?

@palomo11

I have the same question. @jacodela Did you figure out how to do it?

@jacodela

jacodela commented Apr 19, 2019

Hi @palomo11, I never got an answer, nor did I figure out how to do it myself, so I used other tools. I would recommend taking a look at Bracken or (meta)Kallisto: they run quite fast and performed well in some tests I ran on synthetic communities. If you have NCBI taxIDs, go for Bracken; otherwise, check Kallisto.

@jfy133

jfy133 commented Dec 17, 2020

@palomo11 @jacodela I know this is a very old thread to bring up, but since the author doesn't seem to have replied, I'll leave this as a possible answer:

I saw the following commands in the toy dataset:

#!/bin/bash
echo ':::: Creating an empty database with a name "toyset"'
    sparse init --dbname toyset

echo ':::: Filling database "toyset" with 22 Salmonella complete genomes'
    sparse index --dbname toyset --seqlist Salmonella_toyset.txt

echo ':::: Building a mapping database named "Salmonella" in "toyset"'
    sparse query --dbname toyset --tag m==a | sparse mapDB --dbname toyset --mapDB Salmonella --seqlist stdin

The crucial thing, I think, is the --seqlist Salmonella_toyset.txt flag. This is simply the RefSeq TSV file you can download from the NCBI FTP (example: https://github.com/zheminzhou/SPARSE/blob/master/example/Salmonella_toyset.txt).

Presumably SPARSE reads this file to find the location and file name of each genome. I'm guessing you might be able to 'fake' the info for 'custom' genomes, as long as the file follows the same column format as the RefSeq file.

Note that this is an assumption; I have not tried it myself.

EDIT: Looking at the output, it does include NCBI taxonomy info (and SPARSE downloads the NCBI taxonomy dump). However, the clusters seem to be independent of this, so 'faking' the genomes might still work!
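
If someone wants to try the 'faking' idea, a quick sanity check before indexing might help. The sketch below is not part of SPARSE: `check_seqlist_columns` is a hypothetical helper, and `my_bins.txt` and `mybins` are made-up names. It only verifies that every row of a hand-written seqlist has the same number of tab-separated fields as the first line of the RefSeq-style example. It says nothing about what SPARSE expects inside each column, which is still untested.

```shell
#!/bin/bash
# Hypothetical helper (not part of SPARSE): check that every row of a
# hand-written seqlist has the same number of tab-separated fields as the
# first line of a reference seqlist such as Salmonella_toyset.txt.
check_seqlist_columns() {
    local reference="$1" custom="$2"
    local expected
    # Count the tab-separated fields on the first line of the reference file.
    expected=$(head -n 1 "$reference" | awk -F'\t' '{ print NF }')
    # Report every row of the custom file whose field count differs;
    # the exit status is 1 if any row was reported, 0 otherwise.
    awk -F'\t' -v n="$expected" '
        NF != n { printf "line %d: %d fields, expected %d\n", NR, NF, n; bad = 1 }
        END { exit bad }
    ' "$custom"
}

# Assumed usage, reusing only the flags from the toy script above
# (commented out so the sketch runs without SPARSE installed):
# check_seqlist_columns Salmonella_toyset.txt my_bins.txt &&
#     sparse init --dbname mybins &&
#     sparse index --dbname mybins --seqlist my_bins.txt
```

A matching column count is only a necessary condition. If `sparse index` still rejects the file, the next thing to compare is the content of individual fields against the example rows.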
