
The New Normal? Early Results from InceptionLabs' Diffusion-Based LLM Look Promising

AI Scientist & PM · 7 min read

The longer your context, the slower transformer-based LLMs get. That's not a tuning problem; it's architectural: autoregressive models produce one token at a time and attend over the full context at each step. With 20,000 tokens of context, you pay for 20,000 tokens of attention on every single generation step.

InceptionLabs' Mercury-2 uses a diffusion architecture that generates output in parallel across the full sequence, so its latency doesn't scale the same way. I benchmarked it against GPT-4.1-nano and GPT-5-nano in a RAG pipeline at two context lengths. At short context, Mercury-2 finishes last on every metric. At 21k tokens, it's 5x faster than the alternatives and the only model that stays under 1.5 seconds. The crossover is around 4,500 tokens.
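A simple two-line latency model shows why a crossover like this exists at all. The coefficients below are assumptions picked to roughly match the shape of my results, not fitted values: the autoregressive curve grows linearly with context, the diffusion curve starts higher but stays nearly flat, and the crossover is where they meet.

```python
# Illustrative latency model (assumed coefficients, not measured):
# autoregressive latency grows with context; diffusion latency is
# dominated by a larger fixed cost and barely moves.

def autoregressive_latency(ctx: int) -> float:
    return 0.3 + 0.00025 * ctx   # seconds: small base, linear in context

def diffusion_latency(ctx: int) -> float:
    return 1.4 + 0.000002 * ctx  # seconds: big base, near-flat in context

# First context length (in steps of 100 tokens) where diffusion wins.
crossover = next(ctx for ctx in range(0, 30_000, 100)
                 if diffusion_latency(ctx) <= autoregressive_latency(ctx))
print(f"crossover near {crossover} tokens")  # ~4,500 with these coefficients
```

Under this model the practical advice falls out directly: below the crossover, pick whichever model wins on quality; above it, the autoregressive tax compounds with every token of context you add.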