On the challenge of building Super-pangenomes. #1386

sivico26 · 2024-05-14T12:03:49Z

Hello, community,

I was not sure where to start this conversation, but I hope this is a good place.

I have tracked the progress of Pangenomics during the last couple of years and it has been a remarkable development.

The Minigraph-Cactus pipeline and pggb are now the de facto go-to tools if you want a comprehensive representation while minimizing reference bias. With these workhorses, I think we are reaching a point where building a pangenome for a given species is a feasible task if you can manage the scale (genome size, number of haplotypes, within-population genetic diversity, etc.).

However, we are slowly getting into the space of super-pangenomes (multi-species pangenomes), apparently led by breeding and domestication interests (e.g. Bovine, Rice, and Grape Super-pangenomes), and our workhorses are not designed ~~yet~~ to scale and perform as well on this ride.

I am currently working on a super-pangenome of a plant group with larger genomes than the examples above (4 Gb, highly repetitive), and it already feels like I am pushing the limits of what is possible with our current tools.

In addition to the scale challenge present in within-population pangenomes (e.g. human), for super-pangenomes we usually have a high degree of divergence between the different haplotypes that are supposed to compose the pangenome.

To give you an idea of what I mean, wfmash, the mapping component of the pggb pipeline, is incredibly performant in the within-population case (with low divergence), taking just a few minutes to map/align entire chromosomes. But if you adjust the divergence parameter to accommodate what you can expect in the super-pangenome case, the runtime easily jumps from minutes to days, making the all-vs-all operation of pggb infeasible. Scalability is hindered by divergence when building super-pangenomes.
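As a back-of-the-envelope illustration of why this hurts, here is a small Python sketch. It is my own arithmetic, not anything taken from pggb or wfmash: it assumes an all-vs-all scheme in which every haplotype is mapped against every other one, and the per-pair runtimes are made-up placeholders (not measurements). The function names are hypothetical.

```python
def all_vs_all_pairs(n_haplotypes: int) -> int:
    """Ordered (query, target) pairs when every haplotype is mapped
    against every other one. Halve this for unordered pairs."""
    return n_haplotypes * (n_haplotypes - 1)


def serial_hours(n_haplotypes: int, minutes_per_pair: float) -> float:
    """Total serial runtime in hours, assuming (hypothetically) that
    every pair costs the same number of minutes."""
    return all_vs_all_pairs(n_haplotypes) * minutes_per_pair / 60.0


if __name__ == "__main__":
    # Placeholder per-pair costs: "a few minutes" vs. "a day".
    for n in (7, 28):
        for minutes in (5, 24 * 60):
            print(
                f"{n} haplotypes, {minutes} min/pair: "
                f"{all_vs_all_pairs(n)} pairs, "
                f"{serial_hours(n, minutes):.0f} h serial"
            )
```

The number of pairs grows quadratically with the number of haplotypes, so any per-pair slowdown caused by divergence is multiplied by that quadratic factor. Parallelism helps, but only by a constant.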

Similarly, minigraph is not intended to accommodate high degrees of divergence, or it would at least have to be tweaked to do so. By extension, and as far as my understanding goes, the Minigraph-Cactus pipeline currently cannot accommodate increased levels of divergence.

So how should we tackle this challenge? It seems that if the scale and the divergence are not too severe, pggb can handle it (see the grape case above), but how do we deal with more complicated cases?

Here is where I think coming back to the Progressive Cactus algorithm as a backbone for the alignment step can be a big part of the puzzle. Taking advantage of the phylogenetic relationships between your taxa to guide the alignment makes a lot of sense. I am happy to share that I have been able to make some super-pangenome graphs by chaining Progressive Cactus $\rightarrow$ hal2vg $\rightarrow$ vg convert $\rightarrow$ GFA graph. Again, these were 7 haplotypes of 4 Gb, highly repetitive plant genomes, with up to 12.5% divergence between haplotypes.
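For anyone who wants to try the same chain, a minimal Python sketch of how the three steps fit together is below. It only reflects the basic usage of each tool as I understand it (`cactus <jobStore> <seqFile> <out.hal>`, `hal2vg` writing the graph to stdout, and `vg convert -f` writing GFA to stdout); the file names, the `superpan` prefix, and the helper functions are hypothetical, and a real run on genomes of this size will need additional resource options that are not shown here.

```python
import shlex
import subprocess
from pathlib import Path


def build_pipeline(seqfile: str, workdir: str, prefix: str = "superpan"):
    """Return a list of (argv, stdout_path) steps.

    stdout_path is None when the tool writes its own output file,
    otherwise it is the file that should capture the tool's stdout.
    All paths and the prefix are illustrative placeholders.
    """
    out = Path(workdir)
    hal = out / f"{prefix}.hal"
    pg = out / f"{prefix}.pg"
    gfa = out / f"{prefix}.gfa"
    return [
        # Progressive Cactus: job store, seqFile (tree + genome paths), output HAL.
        (["cactus", str(out / "jobstore"), seqfile, str(hal)], None),
        # hal2vg writes the converted graph to stdout.
        (["hal2vg", str(hal)], pg),
        # vg convert -f emits GFA on stdout.
        (["vg", "convert", "-f", str(pg)], gfa),
    ]


def render(steps):
    """Render the steps as shell command lines, for inspection."""
    lines = []
    for argv, stdout_path in steps:
        cmd = shlex.join(argv)
        if stdout_path is not None:
            cmd += f" > {shlex.quote(str(stdout_path))}"
        lines.append(cmd)
    return lines


def run(steps):
    """Execute the steps in order, stopping at the first failure."""
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as fh:
                subprocess.run(argv, stdout=fh, check=True)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for line in render(build_pipeline("seqs.txt", "work")):
        print(line)
```

Keeping the command construction separate from the execution makes it easy to inspect the exact commands before committing days of compute to them.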

However, when I tried to go for my full dataset (28 haplotypes), although cactus alignment succeeded after several days, resulting in a 46 GB hal file, hal2vg does not scale well here (requesting too much memory). I remember @glennhickey said hal2vg is not designed to scale well and this can be too much, but it feels like a pity that the main alignment part was feasible and it is a "conversion" module that prevents us from getting to a graph representation.

So I have several questions:

1. Are there any plans to revamp hal2vg to scale better?
2. If I am forced to stick with the hal alignment (which I am discovering could be more useful than I thought) and cannot get a graph, what am I losing regarding what I can do with a graph super-pangenome in contrast to what I can do with the MSA? I have to dig more, but the hal tools seem very rich in the explorations that can be done on the MSA (synteny, orthology, etc.), and I imagine there have to be advantages to having it as a graph representation (at a minimum, benefiting from the tools already designed for graphs like odgi or vg).
3. On a similar front, I remember reading discussions about the usefulness of these gigantic super-pangenome graphs (which would be, for instance, too complicated to map to), especially if they are very divergent and hence probably very bubbly. It is something we are wondering ourselves. What can we say about this? Are the graphs not at least as useful as the MSA in the hal file? Can we make these graphs more amenable or accessible with some postprocessing?
4. If not through the pipeline I propose above, how should we tackle large-scale super-pangenomes? Are there plans to expand on current tools/algorithms? I think I read that @glennhickey mentioned plans to expand Minigraph-Cactus to enable the incorporation of outgroups, but I am unsure if that means opening the door to tackling super-pangenomes and if that would handle high degrees of divergence.
5. On a related issue, I have heard that the hal2vg conversion might have some problems and not keep some properties of a proper pangenome graph. Would you mind expanding on this? Do you think this might be addressed by some postprocessing, such as using smoothxg and gfaffix?

To me, it is clear that further algorithmic and focused development is needed to tackle super-pangenomes. I would love to hear what you think about how this space is developing and if you have any comments or related ideas.

glennhickey (Maintainer) · 2024-05-14T14:47:42Z

Hi, thanks for the very thoughtful and constructive message. I think I'm mainly in agreement with all your points. My personal outlook is:

- The sequence graph format and tools (such as vg) indeed do not scale well with divergence between the input haplotypes. So even if you manage to construct a very diverged pangenome, applications are currently very limited. This issue is exacerbated by the fact that more distant genomes will have inter-chromosomal events that can require much more memory for construction and some types of indexing. The 10-way great apes graph pushes minigraph-cactus to near its limit. It has very comparable coverage to Progressive Cactus alignments, which is encouraging, but performs worse than a human pangenome for read mapping (though it's possible vg giraffe can be parameterized to work better with such graphs).
- As such, I'm more focused on "comparative pangenomics" than "super-pangenome graphs" in Cactus. This means making pangenome graphs of individual species, then aligning those together with Progressive Cactus. I have some interesting preliminary results, and am hoping to share something more substantial on this subject in the coming months.
- hal2vg doesn't scale well to large progressive alignments. But I think if it doesn't run out of memory, it should produce the correct output for these cases. There may be some low-hanging fruit to improve here (or alternatives like going to taf->paf->seqwish). Likewise, I'm hoping to improve hal tools to scale better to some of these larger datasets.
