Mixed-Modal Early-Fusion Foundation Models: Paper run-throughs for ‘Chameleon’ and ‘MoMa’

Video
Published

August 13, 2024

Text-only LLMs are great, and we've seen people bolt image support onto them here and there, but the future, it seems, is multi-modal. What does it take to train models from scratch that take in both images and text (and more)? In this video we look at two key papers from FAIR at Meta: one introducing the Chameleon approach to early-fusion training, and one (MoMa) making it more efficient with a mixture of experts.