Jet-Nemotron, Gated DeltaNet, and the slow triumph of hybrid models
Video
Reading the Jet-Nemotron paper to get a feel for how next-gen models might replace most of their attention blocks with more efficient linear-attention alternatives, achieving much higher throughput without sacrificing too much quality.
Jet-Nemotron paper: https://www.arxiv.org/abs/2508.15884
Gated Delta Networks: https://arxiv.org/abs/2412.06464
The review of hybrid linear attention variants we didn’t really chat about: https://arxiv.org/abs/2507.06457
Qwen 3 Next: https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list
A few days after the video came the official Qwen 3 Next announcement, scaling Gated DeltaNet hybrid models up a bunch (and achieving high sparsity in their MoE too). Hooray for more efficient models! Hope it’s interesting.