Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request: presence/absence matrix #10

Open
SionBayliss opened this issue Oct 6, 2020 · 4 comments
Open

Request: presence/absence matrix #10

SionBayliss opened this issue Oct 6, 2020 · 4 comments

Comments

@SionBayliss
Copy link

Great tool! I have a quick feature request. An option to output a presence/absence matrix or similar output would be very useful.

On a similar note, do you know of any software that would achieve this in a scalable manner for application to large prokaryotic datasets?

@karel-brinda
Copy link
Member

Hi Sion, thanks for using ProphAsm! How exactly do you envision this presence/absence matrix to look like? Would it be N x N matrix of files with simplitigs from shared k-mers across pairs of genomes?

@SionBayliss
Copy link
Author

Hi Karel, thanks for the quick reply. I was thinking of something similar but with simpltigs x sample showing simplitig presence/absence in each sample. If this info is not calculated in prophasm it might be better achieved using external programs. I have used untig-caller to achieve this (which relies on suffix arrays (FM-index) provided by SeqAn3). Maybe a description in the README of this process would suffice?

@anwarMZ
Copy link

anwarMZ commented Mar 17, 2022

@karel-brinda any thoughts on adding this feature? something like unitig-caller would be great.
@SionBayliss did you find another tool that can generate the matrix?

@SionBayliss
Copy link
Author

@anwarMZ It was a long time ago that I worked on this so I will summarise what my (incomplete) notes say I did at the time. Please note that this was most likely insufficient for a complete analysis but might be useful to you to expand upon in the future.

I used unitig-caller to convert the workflow to one I was familiar with:

# starting with a unitig file generated from my complete bacterial genome collection using bcalm (I think) 
mkdir simplitig_pop
prophasm -i ./unitigs_unitigs.fasta -k 31 -o simplitig_pop/simplitigs.fa -s simplitig_pop/k-mer_stats.txt

# make into single line fasta - expected format
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < simplitig_pop/simplitigs.fa > simplitig_pop/simplitigs.fa1 && mv simplitig_pop/simplitigs.fa1 simplitig_pop/simplitigs.fa

# call unitigs using unitig caller - input if a list of two columns (col1 = name, col2 = path)
# this --simple flag uses string matching
unitig-caller --simple --refs unitig.named.list --unitigs simplitig_pop/simplitigs.fa --output simplitig_pop/counts

# homebrew script to convert to a PA matrix
../../scripts/convert_kmer_to_tab -i ./simplitig_pop/counts_pyseer.txt -o ./simplitig_pop/counts_pyseer.rtab -s sample.list # output in pyseer format

I believe that the limitation here was that simplitigs are effectively concatenations of unitigs (a stochastic choice of a single path in a De Brujin graph constructed from k-mers) and as such might not actually exist in any samples in your collection (or might exist but be interrupted by another genetic element). Simple string matching may not be sufficient to identify the presence of a simplitig in a sample unless it includes partial matching as it might be comprised of two or more unitigs.@karel-brinda might be able to correct me or expand upon this.

Hope that helped!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants