tldr: front-load a bunch of compute to warm-start the finetuning process
...the more I think about it, the more this feels like a "would probably work, but would cost more compute than it saves" kind of idea
- compute a clustering of your training data in representation space
- finetune a copy of the model on each cluster
  we can treat each finetuned model as the marginal distribution for that cluster.
- now, take any incoming finetuning task and project its data into the same representation space.
- use this projection to compute cluster responsibilities (soft assignment weights over the clusters).
- use the cluster responsibilities as merge weights: the final model is a linear combination of the menu of per-cluster marginals.
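the steps above can be sketched roughly like this. everything here is a toy stand-in: the "models" are just parameter vectors (a real merge would combine full state dicts parameter-wise), the centroids are hard-coded, and the responsibility function (softmax over negative squared distance to each centroid, averaged over the task's examples) is one plausible choice, not a fixed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "representation space": 3 cluster centroids in 2-d (hard-coded
# here; in practice these come from clustering the training data)
centroids = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

# stand-in for the per-cluster finetuned models: each "model" is just
# a 5-parameter vector (hypothetical; real models merge per-tensor)
marginals = rng.normal(size=(3, 5))

def responsibilities(task_reps, centroids, temp=1.0):
    """Soft-assign each task example to clusters via a softmax over
    negative squared distance to each centroid, then average over
    examples to get one weight vector for the whole task."""
    # (n_examples, n_clusters) squared distances
    d2 = ((task_reps[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w.mean(axis=0)

def merge(marginals, weights):
    """Linear combination of the per-cluster models."""
    return np.tensordot(weights, marginals, axes=1)

# incoming finetuning task: a few examples sitting near centroid 0
task_reps = centroids[0] + 0.05 * rng.normal(size=(8, 2))
w = responsibilities(task_reps, centroids)
merged = merge(marginals, w)
print(w)  # heavily weighted toward cluster 0
```

the temperature knob is doing real work here: as `temp -> 0` this degenerates into picking the single nearest cluster's model, and as `temp -> inf` it becomes a uniform average, so it controls how "soft" the merge is.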