The New Normal? Early Results from InceptionLabs' Diffusion-Based LLM Look Promising

· 7 min read
AI Scientist & PM

The longer your context, the slower transformer-based LLMs get. It's not a tuning problem; it's architectural: autoregressive models produce one token at a time and attend over the full context at each step. At 20,000 tokens, you're paying for 20,000 tokens of attention on every single generation step.

InceptionLabs' Mercury-2 uses a diffusion architecture that generates output in parallel across the full sequence, so its latency doesn't scale the same way. I benchmarked it against GPT-4.1-nano and GPT-5-nano in a RAG pipeline at two context lengths. At short context, Mercury-2 finishes last on every metric. At 21k tokens, it's 5x faster than the alternatives and the only model that stays under 1.5 seconds. The crossover is around 4,500 tokens.
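The scaling argument can be sketched with a toy cost model: autoregressive decoding pays one full-context attention pass per output token, while a parallel diffusion decoder pays a fixed number of denoising passes over the whole sequence. The functions and constants below are illustrative assumptions, not benchmark data; real constant factors differ, which is why the measured crossover sits near 4,500 tokens rather than zero.

```python
# Toy latency model (illustrative; constants are made up, not measurements).

def autoregressive_cost(context_tokens: int, output_tokens: int) -> int:
    # One attention pass over the growing context per generated token.
    return sum(context_tokens + i for i in range(output_tokens))

def diffusion_cost(context_tokens: int, output_tokens: int, steps: int = 8) -> int:
    # A fixed number of parallel denoising passes over the full sequence.
    return steps * (context_tokens + output_tokens)

for ctx in (1_000, 5_000, 21_000):
    ar = autoregressive_cost(ctx, 200)
    df = diffusion_cost(ctx, 200)
    print(f"ctx={ctx}: autoregressive={ar}, diffusion={df}")
```

The point is the asymptotics: the autoregressive cost grows with context length on every step, while the diffusion cost grows only once per denoising pass.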

Predicting the 2026 World Cup Group Stage - A Monte Carlo Simulation Deep Dive

· 13 min read
AI Scientist & PM

Introduction

The 2026 FIFA World Cup will be the largest in history, featuring 48 teams across 12 groups. I ran a comprehensive Monte Carlo simulation — 100,000 iterations — accounting for player injuries, red cards, altitude, and H2H records to predict which teams make it out of the group stage.
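The core loop of such a simulation is simple: play out every group fixture with randomized outcomes, tally points, and count how often each team finishes in an advancing spot. The sketch below is a minimal, hypothetical version; the team strengths, win-probability formula, and tiebreaking are invented placeholders, not the post's actual model (which also factors in injuries, red cards, altitude, and head-to-head records).

```python
import random
from collections import Counter

# Placeholder strengths for an illustrative 4-team group (not real ratings).
STRENGTHS = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3}

def play(t1: str, t2: str) -> tuple[int, int]:
    # Crude win probability from relative strength; 20% of matches draw.
    p1 = STRENGTHS[t1] / (STRENGTHS[t1] + STRENGTHS[t2])
    r = random.random()
    if r < 0.8 * p1:
        return 3, 0   # t1 wins
    if r < 0.8:
        return 0, 3   # t2 wins
    return 1, 1       # draw

def simulate_group(iterations: int = 100_000) -> Counter:
    advanced = Counter()
    teams = list(STRENGTHS)
    for _ in range(iterations):
        points = Counter()
        for i, t1 in enumerate(teams):
            for t2 in teams[i + 1:]:
                p1, p2 = play(t1, t2)
                points[t1] += p1
                points[t2] += p2
        # Top two advance; ties broken arbitrarily here (a real model
        # would use goal difference and head-to-head records).
        for team, _ in points.most_common(2):
            advanced[team] += 1
    return advanced
```

Dividing each team's count by the iteration total gives its estimated probability of escaping the group.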

The Three AI Frontiers for 2026 - A Prediction and What to Look Out For

· 5 min read
AI Scientist & PM

I've lived and breathed AI and agents for 365 days of 2025 (and 2026 won't be any different). The AI boom of the last few years has been incredible to watch, but I keep coming back to the same few issues:

  1. Energy is not scaling with demand.
  2. When we say we want autonomous AI agents, we don't really want autonomous AI agents.
  3. Data is not scaling with demand and synthetic data will never be the whole story.