Replies: 1 comment
-
Hi, thanks for the very thoughtful and constructive message. I think I'm mainly in agreement with all your points. My personal outlook is:
-
Hello, community,
I was not sure where to start this conversation but I hope this is a good place.
I have tracked the progress of pangenomics over the last couple of years, and it has been a remarkable development. The Minigraph-Cactus pipeline and `pggb` are now the de facto go-to tools if you want a comprehensive representation while minimizing reference bias. With these workhorses, I think we are reaching a point where building a pangenome for a given species is a feasible task if you can manage the scale (genome size, number of haplotypes, within-population genetic diversity, etc.).

However, we are slowly getting into the space of super-pangenomes (multi-species pangenomes), apparently led by breeding/domestication interest (e.g. the Bovine, Rice, and Grape super-pangenomes), and our workhorses are not yet designed to scale and perform as well on this ride.

I am currently working on a super-pangenome of a plant group that is bigger than the examples above (4 Gb, highly repetitive), and it already feels like I am pushing the limits of what is possible with our current tools.
In addition to the scale challenge present in within-population pangenomes (e.g. human), for super-pangenomes we usually have a high degree of divergence between the different haplotypes that are supposed to compose the pangenome.
To give you an idea of what I mean, `wfmash`, the mapping component of the `pggb` pipeline, is incredibly performant in the within-population case (with low divergence), taking just a few minutes to map/align entire chromosomes. But if you increase the divergence parameter to accommodate what you can expect in the super-pangenome case, it easily jumps from minutes to days, making the all-vs-all operation of `pggb` unfeasible. Scalability is hindered by divergence when building super-pangenomes.

Similarly, `minigraph` is not intended to accommodate high degrees of divergence, or it would at least have to be tweaked to do so. By extension, and as far as my understanding goes, the Minigraph-Cactus pipeline currently cannot accommodate increased levels of divergence.

So how should we tackle this challenge? It seems that if the scale and the divergence are not too severe, `pggb` can handle it (see the grape case above), but what do we do with more complicated cases?

Here is where I think coming back to the Progressive `cactus` algorithm as a backbone for the alignment step can be a big part of the puzzle. Taking advantage of the phylogenetic relationships between your taxa to guide the alignment makes a lot of sense. I am happy to share that I have been able to make some super-pangenome graphs by chaining Progressive `cactus` $\rightarrow$ `hal2vg` $\rightarrow$ `vg convert` $\rightarrow$ gfa graph. Again, these were 7 haplotypes of 4 Gb, highly repetitive plant genomes, with up to 12.5% divergence between haplotypes.
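For anyone who wants to try the same chain, here is a minimal dry-run sketch of the three stages in Python. It only assembles and prints the command lines and does not run anything. The file names (`taxa.txt`, `superpan`, `./jobstore`) are hypothetical placeholders, and the invocation forms (positional job store, seqfile and output for `cactus`; `hal2vg` and `vg convert -f` writing to stdout) are my assumptions about the usual usage, so please check each tool's `--help` for your installed version before relying on them.

```python
import shlex


def build_pipeline(seqfile, prefix, job_store="./jobstore"):
    """Return the pipeline stages as (argv, stdout_path) pairs.

    stdout_path is None when the tool writes its own output file.
    All paths are hypothetical placeholders.
    """
    hal = f"{prefix}.hal"
    pg = f"{prefix}.pg"
    gfa = f"{prefix}.gfa"
    return [
        # Progressive Cactus: phylogeny-guided alignment, assumed to write the HAL file itself.
        (["cactus", job_store, seqfile, hal], None),
        # hal2vg: HAL -> vg graph, assumed to go to stdout (the memory-hungry step).
        (["hal2vg", hal], pg),
        # vg convert -f: vg graph -> GFA, assumed to go to stdout.
        (["vg", "convert", "-f", pg], gfa),
    ]


def render(stages):
    """Format the stages as shell lines for inspection (dry run only)."""
    lines = []
    for argv, out in stages:
        cmd = shlex.join(argv)
        lines.append(f"{cmd} > {shlex.quote(out)}" if out else cmd)
    return lines


if __name__ == "__main__":
    for line in render(build_pipeline("taxa.txt", "superpan")):
        print(line)
```

Keeping the stages as data makes it easy to swap in resource options for a single step (for example the conversion step) without touching the rest of the chain.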
However, when I tried to go for my full dataset (28 haplotypes), although the `cactus` alignment succeeded after several days, resulting in a 46 GB hal file, `hal2vg` does not scale well here (it requests too much memory). I remember @glennhickey said `hal2vg` is not designed to scale well and this can be too much, but it feels like a pity that the main alignment part was feasible and it is a "conversion" module that prevents us from getting to a graph representation.

So I have several questions:

- [...] `hal2vg` to scale better?
- [...] `hal` alignment (which I am discovering could be more useful than I thought), and cannot get a graph, what am I losing regarding what I can do with a graph super-pangenome in contrast to what I can do with the MSA? I have to dig more, and the HAL tools seem very rich in the explorations that can be done on the MSA (synteny, orthology, etc.), but I imagine there have to be advantages to having it as a graph representation (at a minimum, benefiting from the tools already designed for graphs, like `odgi` or `vg`).
- [...] `hal2vg` conversion might have some problems and not keep some properties of a proper pangenome graph. Would you mind expanding on this? Do you think this might be addressed by doing some postprocessing, like using `smoothxg` and `gfaffix`?

To me, it is clear that further algorithmic and focused development is needed to tackle super-pangenomes. I would love to hear what you think about how this space is developing and if you have any comments or related ideas.