Thanks for the info, I found the model card now as well. So it seems like they went the route of grafting models together, which I find less interesting, tbh.
In the Transfusion paper, they use both discrete (text tokens) and continuous (image) signals to train a single transformer. To do this, they use a VAE to create latent representations of the images (split into patches), which are fed into the transformer in one linear sequence alongside the text tokens. They trained the whole model from scratch (the largest being a 7B model trained on 2T tokens with a 1:1 text-to-image split). The loss was a combination of the standard language modeling loss (cross-entropy on the text tokens) and the DDPM diffusion loss on the image latents.
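To make the objective concrete, here's a rough sketch of how I read the combined loss (my own paraphrase in PyTorch, not their code; the shapes and the weighting coefficient are assumptions):

```python
# Sketch of Transfusion-style combined objective: cross-entropy on text
# positions + DDPM noise-prediction MSE on image-latent positions.
import torch
import torch.nn.functional as F

def transfusion_loss(text_logits, text_targets, noise_pred, noise, lam=5.0):
    """text_logits: (B, T, V) LM head output at text positions
    text_targets:   (B, T)    next-token ids
    noise_pred:     (B, N, D) predicted noise on image-latent patches
    noise:          (B, N, D) the Gaussian noise actually added at timestep t
    lam: balancing coefficient on the diffusion term (5.0 is illustrative,
         not necessarily the paper's value)."""
    lm_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )
    # Standard epsilon-prediction DDPM objective on the continuous latents.
    ddpm_loss = F.mse_loss(noise_pred, noise)
    return lm_loss + lam * ddpm_loss
```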
There was some prior art on this, but models like Chameleon discretized the images into a fixed-size token codebook, so the images were represented by special tokens. However, that quantization incurs severe information loss, which Transfusion claims to alleviate by using the continuous latent vectors of the images.
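For contrast, the discretization step in Chameleon-style models is essentially a VQ nearest-neighbor lookup. A toy sketch (hypothetical shapes and codebook size, not their actual tokenizer) of where the information loss comes from:

```python
# Every continuous patch vector gets snapped to one of K codebook entries;
# the residual between the patch and its code is simply thrown away.
import torch

def quantize(patches, codebook):
    """patches: (N, D) continuous patch embeddings
    codebook:   (K, D) learned code vectors
    Returns token ids (N,) and the quantized vectors (N, D)."""
    dists = torch.cdist(patches, codebook)  # (N, K) pairwise distances
    ids = dists.argmin(dim=-1)              # nearest code per patch
    return ids, codebook[ids]

codebook = torch.randn(8192, 256)           # e.g. an 8192-entry codebook
patches = torch.randn(64, 256)
ids, quantized = quantize(patches, codebook)
info_loss = (patches - quantized).pow(2).mean()  # the discarded detail
```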
Training a single set of shared weights on different modalities seems more interesting going forward, in particular for emergent phenomena, imo.
Some of the authors of the Transfusion paper work at Meta, so I was hoping they had trained a larger-scale model, or released any Transfusion-based weights at all.
Anyway, exciting stuff either way.