The New Normal? Early Results from InceptionLabs' Diffusion-Based LLM Look Promising

· 7 min read
AI Scientist & PM

The longer your context, the slower transformer-based LLMs get. It's not a tuning problem; it's architectural: autoregressive models produce one token at a time and attend over the full context at each step. At 20,000 tokens, you're paying for 20,000 tokens of attention on every single generation step.

InceptionLabs' Mercury-2 uses a diffusion architecture that generates output in parallel across the full sequence, so its latency doesn't scale the same way. I benchmarked it against GPT-4.1-nano and GPT-5-nano in a RAG pipeline at two context lengths. At short context, Mercury-2 finishes last on every metric. At 21k tokens, it's 5x faster than the alternatives and the only model that stays under 1.5 seconds. The crossover is around 4,500 tokens.
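The scaling argument can be sketched with a toy cost model: autoregressive decoding pays one full-context attention pass per output token, while a parallel diffusion decoder pays a fixed number of denoising passes over the whole sequence. The functions and constants below are illustrative assumptions, not benchmark data; real constant factors differ, which is why the measured crossover sits near 4,500 tokens rather than zero.

```python
# Toy latency model (illustrative; constants are made up, not measurements).

def autoregressive_cost(context_tokens: int, output_tokens: int) -> int:
    # One attention pass over the growing context per generated token.
    return sum(context_tokens + i for i in range(output_tokens))

def diffusion_cost(context_tokens: int, output_tokens: int, steps: int = 8) -> int:
    # A fixed number of parallel denoising passes over the full sequence.
    return steps * (context_tokens + output_tokens)

for ctx in (1_000, 5_000, 21_000):
    ar = autoregressive_cost(ctx, 200)
    df = diffusion_cost(ctx, 200)
    print(f"ctx={ctx}: autoregressive={ar}, diffusion={df}")
```

The point is the asymptotics: the autoregressive cost grows with context length on every step, while the diffusion cost grows only once per denoising pass.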

Predicting the 2026 World Cup Group Stage - A Monte Carlo Simulation Deep Dive

· 13 min read
AI Scientist & PM

Introduction

The 2026 FIFA World Cup will be the largest in history, featuring 48 teams across 12 groups. I ran a comprehensive Monte Carlo simulation — 100,000 iterations — accounting for player injuries, red cards, altitude, and H2H records to predict which teams make it out of the group stage.
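The core loop of such a simulation is simple: play out every group fixture with randomized outcomes, tally points, and count how often each team finishes in an advancing spot. The sketch below is a minimal, hypothetical version; the team strengths, win-probability formula, and tiebreaking are invented placeholders, not the post's actual model (which also factors in injuries, red cards, altitude, and head-to-head records).

```python
import random
from collections import Counter

# Placeholder strengths for an illustrative 4-team group (not real ratings).
STRENGTHS = {"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.3}

def play(t1: str, t2: str) -> tuple[int, int]:
    # Crude win probability from relative strength; 20% of matches draw.
    p1 = STRENGTHS[t1] / (STRENGTHS[t1] + STRENGTHS[t2])
    r = random.random()
    if r < 0.8 * p1:
        return 3, 0   # t1 wins
    if r < 0.8:
        return 0, 3   # t2 wins
    return 1, 1       # draw

def simulate_group(iterations: int = 100_000) -> Counter:
    advanced = Counter()
    teams = list(STRENGTHS)
    for _ in range(iterations):
        points = Counter()
        for i, t1 in enumerate(teams):
            for t2 in teams[i + 1:]:
                p1, p2 = play(t1, t2)
                points[t1] += p1
                points[t2] += p2
        # Top two advance; ties broken arbitrarily here (a real model
        # would use goal difference and head-to-head records).
        for team, _ in points.most_common(2):
            advanced[team] += 1
    return advanced
```

Dividing each team's count by the iteration total gives its estimated probability of escaping the group.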

The Three AI Frontiers for 2026 - A Prediction and What to Look Out For

· 5 min read
AI Scientist & PM

I've lived and breathed AI and agents for 365 days of 2025 (and 2026 won't be any different). The AI boom of the last few years has been incredible to watch, but I keep coming back to the same few issues:

  1. Energy is not scaling with demand.
  2. When we say we want autonomous AI agents, we don't really want autonomous AI agents.
  3. Data is not scaling with demand and synthetic data will never be the whole story.