Research

The paper, dissected.

"Recursive Language Models" by Alex L. Zhang, Tim Kraska, and Omar Khattab. MIT OASYS Lab. arXiv:2512.24601. Accepted at ICML 2025. This is the paper that introduced RLMs as a general inference paradigm.


The paper

What it claims and why you should believe it

The central claim: you can dramatically scale the effective input and output lengths of any LLM, at inference time, by treating the prompt as an external environment and enabling symbolic recursion.

This isn't another "we made the context window bigger" paper. It's an argument that the entire paradigm of stuffing tokens into a Transformer is wrong for information-dense tasks, and that the right abstraction is recursive self-invocation over programmatic slices of the input.
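To make the paradigm concrete, here is a minimal sketch of what that loop might look like, assuming a Python REPL environment and a generic chat API. The call_llm stub, the run_in_repl helper, and the FINAL(...) stop convention are illustrative choices, not the paper's exact interface.

```python
import re

def call_llm(messages: list[dict]) -> str:
    """Stub for a chat-completion call (e.g. OpenAI, vLLM); replace with a real client."""
    raise NotImplementedError

def run_in_repl(code: str, env: dict) -> str:
    """Execute model-written code against the externalized prompt, capturing stdout."""
    import contextlib, io
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        try:
            exec(code, env)          # a real system would sandbox this
        except Exception as e:       # surface errors back to the root model
            print(f"error: {e!r}")
    return buf.getvalue()

def rlm(question: str, long_prompt: str, max_steps: int = 20) -> str:
    # The full prompt lives only in the REPL namespace, never in the root model's context.
    env = {
        "prompt": long_prompt,
        "llm": lambda text: call_llm([{"role": "user", "content": text}]),  # recursive sub-call
    }
    history = [
        {"role": "system", "content": (
            "You are in a Python REPL. The variable `prompt` holds a long document "
            f"({len(long_prompt)} chars); here are its first 500 chars:\n{long_prompt[:500]}\n"
            "Write Python to inspect `prompt` (slice, regex, split) and call llm(text) on "
            "small pieces. When finished, print FINAL(<answer>)."
        )},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        code = call_llm(history)           # the root model only ever sees code + small outputs
        stdout = run_in_repl(code, env)
        if (m := re.search(r"FINAL\((.*)\)", stdout, re.S)):
            return m.group(1)
        history += [{"role": "assistant", "content": code},
                    {"role": "user", "content": f"REPL output:\n{stdout[:4000]}"}]
    return "no final answer produced"
```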

The evidence is strong. Four diverse benchmarks, two frontier models (GPT-5 and Qwen3-Coder-480B), multiple baselines (vanilla LLM, CodeAct, CodeAct+BM25, summary agents, CodeAct with sub-calls), and a small-scale post-training experiment. The results are consistent across all of them.

Benchmarks

Four tasks, four complexity levels

S-NIAH (Single Needle in a Haystack) -- Find a specific phrase or number in a large body of unrelated text. 50 tasks. Complexity: O(1) with respect to input length. This is the easy case -- frontier models already handle it well at moderate lengths.

BrowseComp-Plus (1K documents) -- Multi-hop question answering over 1,000 documents. Requires piecing together information from several gold/evidence documents buried in hard negatives. 150 instances. Harder than S-NIAH because it requires finding and connecting multiple documents.

OOLONG (trec_coarse) -- Transform every chunk of input semantically, then aggregate to form a final answer. 50 tasks. Complexity: O(n) -- the answer depends on nearly every entry in the dataset. This is where standard models start to break down.

OOLONG-Pairs -- A modified version requiring aggregation over pairs of chunks. 20 tasks. Complexity: O(n^2). The worst case for standard models. Frontier models essentially can't solve this at all.

Results

The numbers that matter

Selected results from Table 1 of the paper:

Method                   S-NIAH   BrowseComp+   OOLONG   OOLONG-Pairs
GPT-5 (vanilla)          92.0     *             41.1     <0.1
RLM(GPT-5)               98.0     47.3          69.5     58.0
Summary Agent (GPT-5)    --       18.0          48.8     1.5
CodeAct+BM25 (GPT-5)     98.0     41.3          24.5     <0.1
Qwen3-8B (vanilla)       *        *             low      low
RLM-Qwen3-8B             +28.3% average improvement over base Qwen3-8B

* indicates input exceeded context limits. Simplified from Table 1; see paper for full results including Qwen3-Coder-480B numbers and cost breakdowns.

The standout: OOLONG-Pairs. GPT-5 scores essentially zero. The RLM version scores 58% F1. The task requires O(n^2) semantic comparisons, which makes it effectively intractable in a single forward pass at these lengths. The RLM instead writes a nested loop that compares every pair of entries and farms each comparison out to a sub-call -- exactly the kind of thing no improvement to the attention mechanism is going to deliver.
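For intuition, the program the root model writes might look something like the following, reusing the prompt and llm variables from the earlier sketch; the entry parsing and the yes/no scoring prompt are invented for illustration.

```python
# Illustrative only: the kind of program an RLM root model might write for OOLONG-Pairs.
entries = [line for line in prompt.splitlines() if line.strip()]

hits = []
for i in range(len(entries)):               # O(n^2) pairs, but each sub-call sees only two entries
    for j in range(i + 1, len(entries)):
        verdict = llm(
            "Do these two records satisfy the condition in the question? "
            f"Answer yes or no.\nA: {entries[i]}\nB: {entries[j]}"
        )
        if verdict.strip().lower().startswith("yes"):
            hits.append((i, j))

print(f"FINAL({hits})")
```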

Scaling

Performance vs input length

Figure 1 of the paper is the money chart. It plots performance on S-NIAH, OOLONG, and OOLONG-Pairs as input length scales from 2^13 (8K) to 2^18 (262K) tokens.

For S-NIAH (constant complexity): both GPT-5 and the RLM hold roughly steady. There's little difference at shorter lengths; a modest gap opens beyond 2^14 tokens.

For OOLONG (linear complexity): GPT-5 degrades steadily. The RLM maintains strong performance throughout. The crossover happens around 2^14 tokens.

For OOLONG-Pairs (quadratic complexity): GPT-5 collapses immediately. Even at 2^13 tokens (the shortest tested), it's already struggling. The RLM maintains reasonable performance across the entire range.

Beyond 2^18 tokens (the red line in the figure -- past GPT-5's 272K context window), the base model simply can't run. The RLM keeps going.

The paper also tested at the 10M+ token scale on BrowseComp-Plus, where input corpora are 6-11M tokens. A linearly extrapolated cost for GPT-5-mini ingesting that much would be $1.50-$2.75. The RLM averaged $0.99 while outperforming all baselines by 29%+.
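The extrapolated figure is just input-token pricing times corpus size. A quick back-of-the-envelope check, assuming an input price of roughly $0.25 per million tokens for GPT-5-mini (an assumption; the price isn't stated here):

```python
# Rough check of the extrapolated ingestion cost.
# Assumes ~$0.25 per million input tokens for GPT-5-mini (assumption, not from the text above).
PRICE_PER_MILLION = 0.25
for corpus_tokens in (6_000_000, 11_000_000):
    cost = corpus_tokens / 1_000_000 * PRICE_PER_MILLION
    print(f"{corpus_tokens:>12,} tokens -> ${cost:.2f}")
# -> $1.50 and $2.75, matching the quoted range.
```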

Cost analysis

Cheaper at the median, volatile at the tail

One of the more counterintuitive findings: RLMs are often cheaper than base model calls. At the 50th percentile, RLM(GPT-5) costs less than vanilla GPT-5 across most benchmarks.

Why? Because the RLM selectively examines context. Instead of ingesting a full 200K-token prompt, it might only look at 30K tokens total across its sub-calls. You pay for what you use.
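Concretely, a trajectory over a 200K-token prompt might touch only a few keyword-matched windows. A sketch, again reusing the prompt and llm variables from the earlier example (the keyword and window size are arbitrary):

```python
import re

# Illustrative selective examination: a cheap symbolic filter costs zero LM tokens,
# and only the matched windows are ever sent to sub-calls.
windows = []
for m in re.finditer(r"quarterly revenue", prompt, re.I):      # arbitrary example keyword
    start, end = max(0, m.start() - 2000), m.end() + 2000      # ~4K-char window around each hit
    windows.append(prompt[start:end])

answers = [llm(f"Extract the figure mentioned here:\n{w}") for w in windows[:10]]
print(f"FINAL({answers})")
```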

The catch: high variance. At the 95th percentile, some RLM runs are significantly more expensive due to long trajectories. The model sometimes explores more paths than necessary. Compared to the summarization agent (which always ingests everything), RLMs are up to 3x cheaper at comparable performance levels.

Ablations

What actually matters in the design

The paper runs careful ablations:

REPL without sub-calls: Just having the prompt as an external variable (without recursive self-invocation) already helps a lot. It beats most baselines and scales beyond context limits. But on information-dense tasks (OOLONG, OOLONG-Pairs), sub-calls provide an additional 10-59% improvement.

CodeAct with sub-calls (but prompt in context): Giving an agent sub-call ability without externalizing the prompt doesn't close the gap. The prompt-in-context bottleneck is real.

Different root/sub models: Using a cheaper model for sub-calls (GPT-5-mini for the sub-calls, GPT-5 as the root) works well and reduces cost. The sub-call model doesn't need to be as capable as the root (see the sketch below).

The takeaway: both the REPL (prompt as variable) and symbolic recursion (programmatic sub-calls) contribute independently, and their combination is greater than either alone.
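As a small illustration of that root/sub split, the llm helper from the first sketch can simply be bound to a cheaper model than the one driving the loop; the chat helper and model names below are placeholders, not the paper's configuration.

```python
# Root/sub split: the root model drives the REPL loop, sub-calls go to a cheaper model.
ROOT_MODEL = "gpt-5"        # plans, writes code, sees only code + small REPL outputs
SUB_MODEL = "gpt-5-mini"    # answers leaf queries over small context slices

def make_env(long_prompt: str, chat) -> dict:
    """Build the REPL namespace: the prompt variable plus a sub-call bound to the cheap model."""
    return {
        "prompt": long_prompt,
        "llm": lambda text: chat(model=SUB_MODEL,
                                 messages=[{"role": "user", "content": text}]),
    }
```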

Related work

Where RLMs sit in the literature

RLMs draw on, and improve over, several existing lines of research:

Inference-time compute scaling -- the reasoning model paradigm (OpenAI o-series, DeepSeek-R1) showed that spending more compute at inference improves results. RLMs apply the same idea to context length rather than reasoning depth.

Coding agents (CodeAct, SWE-agent) -- these treat external files as an environment, but can't handle arbitrarily long user prompts because the prompt still goes into context.

Self-delegation (Anthropic sub-agents, Sentient AI) -- these let models invoke themselves, but autoregressively rather than programmatically, limiting the scale of delegation.

Context compaction (DSPy, OpenAI context condensation) -- useful for agent trajectories but lossy for dense reasoning tasks.

The theoretical contribution: RLMs show that with symbolic recursion and external prompt storage, you can achieve effectively unbounded input tokens, unbounded output tokens, and unbounded semantic horizon.