tldr: front-load a bunch of compute to warm-start the finetuning process
...the more I think about it, the more this feels like a "would probably work, but would cost more compute than it saves" kind of idea
- compute a clustering of your training data in representation space
- finetune a copy of the model on each cluster
  we can treat each finetuned model as the marginal distribution for that cluster.
- now, take any incoming finetuning task and project its data into the same representation space.
- use this projection to compute cluster responsibilities (soft assignment weights over the clusters).
- use the cluster responsibilities as merge weights: the final model is a linear combination of the menu of per-cluster marginals.
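the steps above can be sketched roughly like this. everything here is a toy stand-in: the "models" are just parameter vectors (a real merge would combine full state dicts parameter-wise), the centroids are hard-coded, and the responsibility function (softmax over negative squared distance to each centroid, averaged over the task's examples) is one plausible choice, not a fixed recipe:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy "representation space": 3 cluster centroids in 2-d (hard-coded
# here; in practice these come from clustering the training data)
centroids = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])

# stand-in for the per-cluster finetuned models: each "model" is just
# a 5-parameter vector (hypothetical; real models merge per-tensor)
marginals = rng.normal(size=(3, 5))

def responsibilities(task_reps, centroids, temp=1.0):
    """Soft-assign each task example to clusters via a softmax over
    negative squared distance to each centroid, then average over
    examples to get one weight vector for the whole task."""
    # (n_examples, n_clusters) squared distances
    d2 = ((task_reps[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temp
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w.mean(axis=0)

def merge(marginals, weights):
    """Linear combination of the per-cluster models."""
    return np.tensordot(weights, marginals, axes=1)

# incoming finetuning task: a few examples sitting near centroid 0
task_reps = centroids[0] + 0.05 * rng.normal(size=(8, 2))
w = responsibilities(task_reps, centroids)
merged = merge(marginals, w)
print(w)  # heavily weighted toward cluster 0
```

the temperature knob is doing real work here: as `temp -> 0` this degenerates into picking the single nearest cluster's model, and as `temp -> inf` it becomes a uniform average, so it controls how "soft" the merge is.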