
Clarify usage for genome alignment #20

Open
ssadedin opened this issue Aug 25, 2020 · 2 comments



ssadedin commented Aug 25, 2020

Hi,

Thanks for publishing pufferfish! I was interested in trying it for genome alignment, but when I indexed GRCh38 it printed a lot of warnings, as if it were treating the genome as a transcriptome:

pufferfish index -r Homo_sapiens_assembly38.fasta -o pufferfish
[2020-08-25 22:30:24.855] [puff::index::jointLog] [info] Running fixFasta

[Step 1 of 4] : counting k-mers
[2020-08-25 22:30:34.352] [puff::index::jointLog] [warning] Entry with header [chr1] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:38.014] [puff::index::jointLog] [warning] Entry with header [chr2] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:41.037] [puff::index::jointLog] [warning] Entry with header [chr3] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:44.030] [puff::index::jointLog] [warning] Entry with header [chr4] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.
[2020-08-25 22:30:46.990] [puff::index::jointLog] [warning] Entry with header [chr5] was longer than 200000 nucleotides.  This is probably a chromosome instead of a transcript.

Is there a different procedure for indexing a genome? Or are these warnings misleading?

NB: it would also be good to know whether there is any handling for alt contigs, or whether they should be removed from the reference to avoid multimapped alignments.

Thanks!

Contributor

rob-p commented Aug 25, 2020

@ssadedin,

Thanks for trying out puffaligner, and for your feedback! These warnings are indeed misleading. The reason is that we have a unified codebase: the same pufferfish index that powers puffaligner also powers the selective alignment procedure used in our RNA quantification software, salmon. Salmon expects users to index the transcriptome, so it warns them that they may be indexing the wrong thing if they index the genome instead. Obviously, indexing chromosomal contigs is normal / expected behavior for puffaligner when the user is aligning against the genome. We will fix this on the back-end so that these warnings are only issued when the indexer is invoked from salmon.

Regarding alt contigs, you raise a good question. We have not done extensive testing of alignments to alt contigs. If you use the --bestStrata mode, puffaligner will look for all equally best alignments, so if the alt contig at a locus has the same sequence as the primary, it will return alignments to both. In the case of strict ties in alignment score, the alignment that is marked as "primary" in the SAM record is essentially random. If there are use cases where alternative / custom behavior is likely to be preferred, we'd be happy to discuss!
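If ties against alt contigs are unwanted, one option raised in the question is to drop those contigs from the reference before indexing. The following is a minimal sketch, not part of pufferfish. It assumes alt contigs can be recognised by an `_alt` suffix on the sequence name (for example `chr1_KI270762v1_alt`), which should be checked against the headers of the FASTA actually in use. The function names are hypothetical.

```python
import io


def is_alt_contig(header: str) -> bool:
    """Return True if a FASTA header line names an alt contig.

    Assumption: alt contigs carry an `_alt` suffix on the sequence name
    (the first whitespace-delimited token after '>').
    """
    tokens = header[1:].split()
    return bool(tokens) and tokens[0].endswith("_alt")


def drop_alt_contigs(src, dst) -> int:
    """Copy FASTA records from src to dst, skipping alt contigs.

    src and dst are text file objects. Returns the number of records skipped.
    """
    skipped = 0
    keep = True  # lines before the first header are passed through unchanged
    for line in src:
        if line.startswith(">"):
            keep = not is_alt_contig(line)
            if not keep:
                skipped += 1
        if keep:
            dst.write(line)
    return skipped


if __name__ == "__main__":
    # Tiny in-memory demonstration with made-up sequences.
    demo = io.StringIO(">chr1\nACGT\n>chr1_KI270762v1_alt\nTTTT\n>chr2\nCC\n")
    out = io.StringIO()
    print(drop_alt_contigs(demo, out), "record(s) skipped")
    print(out.getvalue(), end="")
```

For a real reference, the same function can be called with two files opened via `open(...)`, and the filtered FASTA passed to `pufferfish index -r` as in the original command.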

Author

ssadedin commented Aug 25, 2020

Thanks @rob-p - good to know about the warnings, and thanks for the info on the alt contig situation. I'll do some testing and see what the empirical behaviour is for the alt contigs, and let you know any further thoughts there in a separate issue if necessary.
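One way to look at the empirical behaviour is to list reads whose best-scoring alignments are tied across more than one reference sequence. The sketch below is illustrative only: it assumes the aligner writes the standard optional `AS:i:` alignment-score tag in its SAM output (not confirmed in this thread for puffaligner) and that all alignments for a read are present in the file. The function names are hypothetical.

```python
from collections import defaultdict


def alignment_score(fields):
    """Return the integer value of the AS:i: tag, or None if it is absent."""
    for tag in fields[11:]:
        if tag.startswith("AS:i:"):
            return int(tag[5:])
    return None


def tied_best_reads(sam_lines):
    """Find reads whose top alignment score is shared by several records.

    Returns a dict mapping (read name, mate bits) to the list of reference
    names that share the best score. Mate bits are FLAG & 0xC0, so the two
    ends of a pair are kept apart; single-end reads get 0.
    """
    hits = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete SAM record
            continue
        flag = int(fields[1])
        if flag & 0x4:  # unmapped
            continue
        score = alignment_score(fields)
        if score is None:
            continue
        hits[(fields[0], flag & 0xC0)].append((score, fields[2]))

    tied = {}
    for key, records in hits.items():
        best = max(score for score, _ in records)
        refs = [ref for score, ref in records if score == best]
        if len(refs) > 1:
            tied[key] = refs
    return tied
```

Filtering the result for entries where one reference name ends in `_alt` would show how often the primary/alt choice is decided by a tie.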

Thanks!
