Why RAG Quality Breaks at Scale
I've worked on a fair few RAG implementations, and one thing I've consistently found is that as RAG datasets scale, response quality can drop off quickly. Some of that comes down to how the data is presented to the model (lack of curation), missing re-ranking, chunk sizes and so on, but a new report from researchers at Google DeepMind suggests there's a much more fundamental problem.
They've proven that vector embeddings (the backbone of most RAG systems) have hard mathematical limits on what they can represent. It's not about better training data or bigger models; it's about the geometry of cramming information into fixed-dimensional vectors.
As another article on embeddings recently pointed out, embedding dimensions have grown from 200-300 in the early days to 4096+ now, and this growth was driven by GPU architecture constraints and market competition (rather than proven need). 
Even with all these extra dimensions, the DeepMind researchers show we're still hitting fundamental walls.
The researchers created a simple test dataset called LIMIT with basic queries such as "who likes apples?" Even state-of-the-art embedding models with 4096 dimensions scored less than 20% recall. Meanwhile, old-school BM25 search pretty much aced it, because its sparse, vocabulary-sized representation gives it effectively very high dimensionality.
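To make that concrete, here's a minimal, toy-sized sketch of the kind of comparison involved. It uses placeholder documents and a common open-source embedding model rather than the actual LIMIT data, along with the off-the-shelf rank_bm25 and sentence-transformers libraries:

```python
# Toy illustration only -- not the LIMIT benchmark itself.
# Assumes rank_bm25 and sentence-transformers are installed;
# the model name is just a common small embedding model.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Jon likes apples and quilting.",
    "Maya likes apples and kayaking.",
    "Priya likes oranges and chess.",
    "Leo likes pears and climbing.",
]
query = "who likes apples?"

# Sparse retrieval: BM25 scores exact term overlap, so "apples"
# cleanly picks out the two relevant documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
print("BM25 scores:", bm25.get_scores(query.lower().split()))

# Dense retrieval: one fixed-size vector per document; the query
# vector has to sit close to *all* relevant docs at once.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]
print("Dense cosine scores:", doc_vecs @ q_vec)
```

On four documents a dense model copes fine; the paper's point is that the number of distinct "who likes X?" combinations explodes as the corpus grows, and that's where the fixed dimension bites.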
The issue is that as the number of distinct combinations of relevant documents you need to retrieve grows, you hit a wall where a fixed embedding dimension simply isn't large enough to represent them all. The Google researchers show this mathematically and confirm it empirically, even when they optimize the vectors directly rather than going through a text encoder.
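You can get a feel for that "direct optimization" result with a simplified free-embedding experiment: skip text and encoders entirely and just optimize query and document vectors against a target relevance matrix, so any failure is purely down to the dimension d. The sizes and margin loss below are my own choices for a quick toy run, not the paper's exact setup; the interesting behaviour is how the solved count falls as you grow the number of documents (and hence combinations) while holding d fixed:

```python
# Simplified sketch in the spirit of the paper's free-embedding ablation.
# No encoder: the vectors themselves are the only thing being trained.
import itertools
import torch

n_docs, k, d = 12, 2, 4           # try growing n_docs, or shrinking/raising d
pairs = list(itertools.combinations(range(n_docs), k))
n_q = len(pairs)                  # one query per pair of relevant docs

# Binary relevance matrix: rel[q, i] = 1 if doc i is relevant to query q
rel = torch.zeros(n_q, n_docs)
for q, (i, j) in enumerate(pairs):
    rel[q, i] = rel[q, j] = 1.0

Q = torch.randn(n_q, d, requires_grad=True)
D = torch.randn(n_docs, d, requires_grad=True)
opt = torch.optim.Adam([Q, D], lr=0.05)

for step in range(2000):
    scores = Q @ D.T
    # Margin loss: every relevant doc should outscore every irrelevant one
    pos = scores.masked_fill(rel == 0, float("inf")).min(dim=1).values
    neg = scores.masked_fill(rel == 1, float("-inf")).max(dim=1).values
    loss = torch.relu(1.0 - (pos - neg)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# How many queries retrieve exactly their relevant pair in the top k?
topk = (Q @ D.T).topk(k, dim=1).indices
solved = sum(set(topk[q].tolist()) == set(pairs[q]) for q in range(n_q))
print(f"d={d}: {solved}/{n_q} queries solved")
```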
This explains a lot about why RAG systems struggle with complex queries that need to connect previously unrelated information. We've been pushing embeddings beyond their theoretical limits without realizing it.
So what does this mean when you're building RAG systems for enterprise with millions or hundreds of millions of documents? The single-embedding approach that works fine for smaller datasets is likely to break at scale. Cross-encoders, multi-vector approaches, hybrid sparse/dense systems, or agentic setups (where retrieval is distributed across specialized agents) are likely the way forward at this kind of scale.
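As one illustration of the hybrid direction, here's a minimal sketch of fusing a sparse and a dense ranking with reciprocal rank fusion (RRF). The retriever outputs are stubbed out with hypothetical doc IDs; in practice they'd come from your BM25 index and your vector store:

```python
# Minimal hybrid-retrieval sketch: combine rankings with reciprocal rank fusion.
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Merge several ranked lists of doc ids into one using RRF.
    k=60 is the constant commonly used in the original RRF paper."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a sparse and a dense retriever for one query
bm25_ranking = ["doc_7", "doc_2", "doc_9", "doc_4"]
dense_ranking = ["doc_2", "doc_4", "doc_1", "doc_7"]
print(rrf_fuse([bm25_ranking, dense_ranking])[:3])
```

The appeal of RRF is that it only needs ranks, not comparable scores, so the sparse and dense sides don't have to be calibrated against each other.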
Definitely worth a read if you're building RAG systems at scale: 👉 https://github.com/google-deepmind/limit 

