Paper walkthrough: rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

Video
Published

January 10, 2025

A look at a recent paper that gets impressive results on the MATH benchmark using test-time compute. I especially like the use of paired ‘preference’ data to train their equivalent of a process reward model. The video calls out a few caveats (the use of code, and the use of big models for SFT data despite the ‘no distillation’ claim) but also notes that, overall, the results are impressive. Link: https://www.youtube.com/watch?v=BoC_P1NgGTk
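To make the paired ‘preference’ idea a bit more concrete, here is a minimal sketch of a generic Bradley-Terry style pairwise ranking loss, the kind of objective commonly used to train a scorer from (preferred, rejected) pairs. This is an illustration under my own assumptions, not the paper's actual training code: the function names and the toy scores are made up, and a real setup would compute the scores with a model over reasoning steps rather than pass in plain floats.

```python
import math


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(score_preferred - score_rejected)).

    Hypothetical helper for illustration. The loss is small when the preferred
    step is scored well above the rejected one, and large when the order is wrong.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of the
    # margin so exp() never receives a large positive argument.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def batch_loss(pairs: list[tuple[float, float]]) -> float:
    """Mean loss over (preferred_score, rejected_score) pairs."""
    return sum(pairwise_preference_loss(p, r) for p, r in pairs) / len(pairs)


if __name__ == "__main__":
    # Toy scores (assumed values): the first pair is ranked correctly,
    # the second is ranked the wrong way round.
    print(batch_loss([(2.0, -1.0), (-0.5, 0.5)]))
```

The appeal of this kind of objective is that it only needs to know which of two steps is better, not a precise numeric score for each one, which is easier to get right when the underlying step-level estimates are noisy.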