Chinese Team Cracks Token Limitation, Unlocking Model Potential Three Times That of Autoregressive Models
A Chinese team found that diffusion language models learn about 3x more than autoregressive ones from the same data under token limits. A 1B-parameter model trained for 480 epochs on 1B tokens excels on the HellaSwag and MMLU benchmarks, pointing to a potential breakthrough for data-constrained language model training.