The industry spent five years making context windows bigger. RLMs suggest the entire framing was wrong from the start.
In 2023, the race to expand context windows felt like genuine progress. GPT-4 shipped with 8K tokens; within months, Anthropic pushed Claude to 100K. By early 2024, Google's Gemini 1.5 Pro was advertising 1 million tokens. The message was clear: bigger windows meant better models.
That narrative is starting to crack. Not because the windows stopped growing -- they did not -- but because a growing body of evidence shows that having a large context window and being able to use it effectively are two very different things. The distinction matters enormously, and Recursive Language Models (RLMs) represent the clearest articulation of why.
The most popular benchmark for long-context models is the needle-in-a-haystack test. You insert a specific piece of information at some position in a long document, then ask the model to retrieve it. Most frontier models pass this test comfortably at 100K tokens and beyond.
But retrieval is the easiest possible long-context task. Finding a single piece of information in a large document is fundamentally O(n) -- you scan until you find the needle. The attention mechanism, despite its quadratic cost, handles this because only a small region of the input is actually relevant to the answer.
The problems begin when the task requires dense reasoning across the entire input. Consider a task where you need to compare every paragraph in a 200-page document against every other paragraph to find contradictions. That is O(n^2) in information complexity. Or consider summarizing a codebase where understanding any single file requires understanding its imports, which requires understanding their imports, creating a dependency web across the entire context. These are the tasks that matter in practice, and they are precisely the tasks where large context windows fail.
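The quadratic blow-up in the contradiction-checking example is easy to make concrete. The sketch below (my own illustration, not from the paper) just counts the comparisons; a hypothetical LLM call would sit where each pair is examined.

```python
from itertools import combinations

def comparison_pairs(paragraphs):
    """Every unordered pair of paragraphs that a contradiction check
    must examine. The actual check would be an LLM call per pair;
    here we only count the work."""
    return list(combinations(range(len(paragraphs)), 2))

# A 200-page document at roughly 40 paragraphs per page (illustrative figures):
n = 200 * 40
pairs = n * (n - 1) // 2
print(pairs)  # 31,996,000 pairwise comparisons for 8,000 paragraphs
```

Even at a fraction of a second per comparison, tens of millions of pairwise checks is not something a single forward pass can absorb; the work is in the task itself, not in fitting the text into a window.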
Zhang, Kraska, and Khattab from MIT's OASYS lab demonstrated this concretely with the OOLONG-Pairs benchmark. This task requires quadratic-complexity reasoning over the full input. GPT-5, with its 272K-token context window, scores below 0.1% F1 on this task. The model has the capacity to fit the input, but it cannot reason over it effectively. The context window is not the bottleneck. The architecture is.
The self-attention mechanism in Transformers computes a weighted relationship between every pair of tokens in the input. In theory, this gives the model the ability to attend to any relevant information regardless of position. In practice, the mechanism breaks down in predictable ways as inputs grow longer.
First, there is the dilution problem. Attention weights must sum to one across the entire sequence. As the sequence length grows, each individual attention weight shrinks proportionally, making it harder for the model to concentrate on the few tokens that actually matter for a given computation. Techniques like ALiBi and RoPE help with positional encoding at long distances, but they do not solve the fundamental arithmetic of distributing a fixed attention budget over more tokens.
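The dilution arithmetic can be seen in a toy model (my own, not from the paper): fix one "relevant" token with a high logit, surround it with zero-logit distractors, and watch its softmax weight fall as the sequence grows, even though no logit changed.

```python
import math

def max_attention_weight(n, relevant_logit=5.0):
    """Softmax weight on a single high-logit token among n-1 zero-logit
    distractors. The logits are fixed; only the sequence length n changes."""
    num = math.exp(relevant_logit)
    return num / (num + (n - 1) * math.exp(0.0))

for n in (1_000, 10_000, 100_000):
    print(n, max_attention_weight(n))
# The weight on the relevant token drops roughly 10x for every 10x
# increase in sequence length.
```

The relevant token's logit advantage is constant, but the fixed attention budget is spread across ever more competitors, so the signal the next layer receives gets weaker as the input grows.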
Second, there is the serial computation constraint. A Transformer processes its entire input in a fixed number of layers -- typically 32 to 128 for frontier models. Each layer performs one round of attention and feed-forward computation. If the task requires a chain of sequential reasoning steps longer than the number of layers, the model cannot perform it in a single forward pass, no matter how many tokens fit in the window. This is not a theoretical concern: multi-hop reasoning tasks routinely exceed the depth capacity of even the largest models.
Third, there is the training distribution gap. Most long-context models are trained predominantly on shorter sequences and fine-tuned to extrapolate. The result is that the model's internal representations are tuned for the kind of patterns that appear in short-to-medium inputs. When those representations encounter a genuinely dense 200K-token input during inference, they are operating outside their comfort zone. This shows up as degraded coherence, missed dependencies, and a general decline in reasoning quality that practitioners have come to call "context rot."
Recursive Language Models take a fundamentally different view of the problem. Instead of trying to cram the entire input through the neural network in a single forward pass, RLMs treat the prompt as an external environment that the model can programmatically explore.
The input is stored as a variable in a REPL (Read-Eval-Print Loop) environment. The model does not receive the full input in its context window. Instead, it receives a description of what the input contains and the tools to examine it: it can slice the variable, read sections, write code to search it, and -- critically -- recursively invoke itself on sub-sections.
This is not a retrieval trick. The model does not index the input and look things up. It writes a program to process the input, and that program can decompose the task into sub-tasks, each of which gets its own fresh context window with only the relevant slice of input. Sub-tasks can themselves decompose further, creating a recursive tree of processing that adapts its depth and breadth to the complexity of the specific task.
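The recursive pattern can be sketched in a few lines. This is a deliberately simplified illustration, not the MIT implementation: `call_llm` is a hypothetical stub standing in for a real model invocation, and the real system has the model write its own decomposition code in the REPL rather than following a fixed binary split.

```python
CONTEXT_LIMIT = 32_000  # characters per invocation here; illustrative stand-in for a token budget

def call_llm(prompt: str) -> str:
    """Hypothetical stub for a single model invocation; in practice,
    replace with an actual API call."""
    return f"<answer over {len(prompt)} chars>"

def rlm(task: str, text: str, limit: int = CONTEXT_LIMIT) -> str:
    """Recursively decompose `text` until each piece fits a standard
    window. Each recursive call gets a fresh context containing only
    its slice; a final call merges the sub-answers."""
    if len(text) <= limit:
        return call_llm(f"{task}\n\n{text}")
    mid = len(text) // 2
    left = rlm(task, text[:mid], limit)
    right = rlm(task, text[mid:], limit)
    return call_llm(f"Combine these partial answers for: {task}\n{left}\n{right}")
```

The key property is visible even in this toy: no single invocation ever sees more than `limit` worth of input, yet the tree as a whole covers text of any length, and its depth grows only when the input demands it.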
The result is that an 8B-parameter model can effectively process 10 million tokens or more -- not by having a 10M-token context window, but by never needing one. Each individual invocation operates within a standard context window (say, 32K tokens), processing a manageable chunk. The recursive structure handles the coordination.
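A quick back-of-the-envelope on those figures (using the 10M-token input and 32K-token window from the text; the binary-tree depth is my own illustrative framing):

```python
import math

TOTAL_TOKENS = 10_000_000  # size of the input, from the text above
WINDOW = 32_000            # per-invocation context window, from the text above

# Leaf invocations needed so every chunk fits in a standard window:
leaf_calls = math.ceil(TOTAL_TOKENS / WINDOW)
print(leaf_calls)  # 313 leaf invocations

# Depth of a binary decomposition tree with that many leaves:
depth = math.ceil(math.log2(leaf_calls))
print(depth)  # 9 levels of recursion
```

A few hundred ordinary-sized calls coordinated by nine levels of recursion covers an input that no single forward pass could hold, which is the whole trade the architecture is making.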
This reframe suggests that the industry has been optimizing the wrong variable. Context window size is a measure of how much raw text you can fit into a single forward pass. But the actual constraint on performance is computational depth -- how many sequential steps of reasoning the model can perform, each informed by the results of prior steps.
A standard LLM has a fixed computational depth equal to its number of layers. A 96-layer model gets 96 steps of computation, regardless of whether the input is 1K tokens or 200K tokens. An RLM has a variable computational depth that scales with the task: a simple extraction might require two recursive calls, while a complex cross-document analysis might require dozens, each building on the results of the previous ones.
This distinction maps cleanly onto computational complexity theory. Some tasks are inherently sequential -- they require O(n), O(n log n), or O(n^2) steps that cannot be parallelized. A fixed-depth Transformer is fundamentally limited in its ability to solve such tasks, no matter how wide its context window. An RLM can adapt its computational depth to match the task, making it capable of solving problems that are provably beyond the reach of fixed-depth architectures.
None of this means context windows are useless. They still matter for tasks where the model needs to attend to multiple pieces of information simultaneously -- the kind of parallel, pattern-matching reasoning that Transformers excel at. A well-sized context window is a good thing.
But the era of treating context window size as the primary measure of a model's long-context capability should be ending. The MIT results show that a model with a modest context window, augmented with recursive self-invocation, can outperform a model with a vastly larger window on tasks that require genuine reasoning over long inputs.
The implication for practitioners is significant. Instead of waiting for a model with a window large enough to fit your data, you can use an RLM-enabled model with a standard window and get better results on most tasks that matter. Instead of paying the quadratic cost of attention over your entire input, you pay the linear cost of scanning it programmatically with recursive calls that only attend to the parts that matter at each step.
The context window race was a response to a real problem. But it was the wrong response. The right abstraction is not "how much can the model see at once" but "how deeply can the model think about what it sees." RLMs are the first clean implementation of that insight, and the benchmark results suggest the difference is not marginal. It is categorical.