Fundamentals

The prompt is not an input. It's an environment.

That single design decision is what separates RLMs from every other long-context approach. Instead of feeding a million tokens into a Transformer and hoping for the best, you load the prompt into a REPL as a variable and let the model write code to interact with it.

The core insight

Why shoving everything into context doesn't work

Every LLM has a context window -- a maximum number of tokens it can process at once. The industry keeps making these bigger: 128K, 272K, 1M. But size isn't the issue. Quality is.

Even within their stated limits, models exhibit context rot: performance degrades as prompts get longer, especially on tasks that require reasoning over the entire input rather than just locating a specific fact. GPT-5 handles needle-in-a-haystack fine at 200K tokens. Ask it to aggregate information from every paragraph in a 200K-token document, and it falls apart.

The degradation gets worse as task complexity increases. Constant-complexity tasks (find one thing) survive longer contexts. Linear-complexity tasks (process every chunk) degrade faster. Quadratic-complexity tasks (reason about pairs of chunks) collapse almost immediately.

RLMs sidestep this entirely. The neural network never sees the full prompt. It only sees metadata about it -- length, a short prefix, type information -- and writes code to interact with it piece by piece.
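
To make that concrete, here is a minimal sketch of the kind of constant-size view the root model might receive. The function and field names are illustrative assumptions, not the paper's actual interface:

    # A minimal sketch of the constant-size view the root model sees instead of
    # the prompt itself. Field names are illustrative, not the paper's interface.
    def prompt_metadata(P: str, prefix_chars: int = 200) -> dict:
        return {
            "type": type(P).__name__,    # e.g. "str"
            "length": len(P),            # total size, so the model can plan chunking
            "prefix": P[:prefix_chars],  # a short peek at format and structure
        }

    # A ~10M-character prompt collapses to a few hundred bytes of metadata.
    P = "The first line of a very long document.\n" * 250_000
    print(prompt_metadata(P))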

The architecture

How an RLM actually works

An RLM wraps any base language model with an inference-time scaffold. The flow:

  1. Initialize a REPL. Given an arbitrary-length prompt P, the RLM starts a persistent programming environment (Python REPL). P is stored as a string variable inside this environment. The model also gets a function for invoking sub-RLM calls.
  2. Provide metadata, not content. The root model receives only constant-size metadata about P: its length, a short prefix, how to access slices of it. The full text of P never enters the model's context window.
  3. Model writes code. The model generates code that peeks into P, slices it, transforms it, and launches sub-RLM calls on the slices. These sub-calls are themselves full RLMs that can recurse further.
  4. Execute and observe. The REPL runs the code, updates state, and returns only metadata about stdout back to the model. Intermediate results live as variables in the REPL, not in the model's context.
  5. Aggregate and return. When the model sets a special "Final" variable in the REPL, iteration stops and that value becomes the response.

The key: at every level of recursion, the model's context window only contains constant-size turns. All the heavy data lives in REPL variables. This is what makes unbounded input processing possible.
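
The loop itself is small. Below is a rough sketch of the scaffold under stated assumptions: generate_code stands in for the base model call, and the REPL is approximated with a plain exec() namespace rather than the paper's actual harness:

    # A rough sketch of the RLM control loop. `generate_code` is an assumed
    # stand-in for the base model call; the "REPL" is a plain exec() namespace.
    import contextlib
    import io

    def rlm(P: str, question: str, generate_code, max_iters: int = 50) -> str:
        # 1. Persistent environment: the prompt is a variable, never context.
        env = {"P": P, "sub_rlm": lambda p: rlm(p, question, generate_code)}

        # 2. The model's transcript holds only constant-size metadata about P.
        history = [f"len(P)={len(P)}, prefix={P[:200]!r}", f"Question: {question}"]

        for _ in range(max_iters):
            code = generate_code(history)           # 3. model writes code
            buf = io.StringIO()
            with contextlib.redirect_stdout(buf):   # 4. execute in the environment
                exec(code, env)
            history.append(buf.getvalue()[:500])    #    feed back only truncated stdout
            if "Final" in env:                      # 5. sentinel variable ends the loop
                return str(env["Final"])
        return str(env.get("Final", ""))

A real scaffold would also cap recursion depth and cost; the sketch only shows the control flow.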

Visualized

The recursive call pattern

User Prompt P (e.g., 10M tokens)
    |
    v
[RLM Root] -- sees: len(P)=10M, P[:200]="The first..."
    |
    |-- writes: chunks = [P[i:i+8000] for i in range(0, len(P), 8000)]
    |-- writes: results = [sub_rlm(f"Summarize: {c}") for c in chunks]
    |                         |
    |                         +--[Sub-RLM 1] processes chunk 1 (8K tokens)
    |                         +--[Sub-RLM 2] processes chunk 2 (8K tokens)
    |                         +--[Sub-RLM 3] processes chunk 3 (8K tokens)
    |                         +-- ... (1,250 sub-calls for 10M tokens)
    |
    |-- writes: combined = "\n".join(results)
    |-- writes: Final = sub_rlm(f"Given these summaries: {combined}, answer: ...")
    |
    v
Response Y

Each sub-RLM is itself a full RLM that can recurse further if its input is still too large. The recursion bottoms out when chunks fit comfortably in the base model's context window.
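
Spelled out as code, the program above is just a map-reduce written inside the REPL. A sketch, reusing the P and sub_rlm variables the scaffold provides (chunk size and prompt wording are illustrative):

    # The diagram's map-reduce pattern, as code the root model might emit in the
    # REPL. P and sub_rlm already exist as variables there; the chunk size and
    # prompt wording are illustrative choices.
    CHUNK = 8_000

    # Map: slice the huge prompt and summarize each slice via a recursive sub-call.
    chunks = [P[i:i + CHUNK] for i in range(0, len(P), CHUNK)]
    results = [sub_rlm(f"Summarize the key facts in this excerpt:\n{c}") for c in chunks]

    # Reduce: the summaries are small enough to aggregate in one final call.
    combined = "\n".join(results)
    Final = sub_rlm(f"Using these summaries:\n{combined}\n\nAnswer the original question.")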

Three design choices

What makes RLMs different from "just using agents"

The paper identifies three specific design decisions that separate RLMs from existing agent scaffolds:

1. The prompt is a variable, not context. Coding agents and retrieval agents put the user prompt directly into the LLM's context window. An RLM stores it externally. This sounds trivial, but it's the entire game -- it means the size of the user input is never bounded by the model's context window.

2. Output is symbolic, not autoregressive. Standard scaffolds ask the model to generate its final answer token-by-token into the context window, which means outputs are also bounded by the window. RLMs build up the response in REPL variables, enabling unbounded output length.

3. Recursion is programmatic, not verbal. Previous self-delegation approaches (like Anthropic's sub-agent patterns) let models invoke themselves, but the sub-calls are generated autoregressively -- one at a time, limited by output length. RLMs write programs that launch sub-calls inside loops, enabling the model to invoke itself O(|P|) or even O(|P|^2) times through a few lines of code.

Point 3 is the killer. A standard agent might verbalize "now process chunk 1... now process chunk 2..." and run out of context after a dozen chunks. An RLM writes for chunk in chunks: results.append(sub_rlm(chunk)) and processes thousands.
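
The same trick extends to the quadratic case. Reusing the chunks and sub_rlm variables from the earlier map-reduce sketch (the contradiction-checking task is made up for illustration), a nested loop fans out into O(n^2) sub-calls from a few lines of code:

    # Sketch of a quadratic-complexity strategy: one sub-call per pair of chunks.
    # Continues the variables from the previous sketch; the contradiction check is
    # an illustrative task, not a benchmark's actual prompt.
    pairs = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            verdict = sub_rlm(
                "Do these two excerpts contradict each other? Answer YES or NO.\n"
                f"A:\n{chunks[i]}\n\nB:\n{chunks[j]}"
            )
            if verdict.strip().upper().startswith("YES"):
                pairs.append((i, j))
    Final = f"Contradictory chunk pairs: {pairs}"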

Complexity classes

Constant, linear, quadratic -- and why it matters

Not all long-context tasks are created equal. The paper categorizes them by how processing complexity scales with input length:

Constant complexity -- tasks like needle-in-a-haystack, where you're looking for one thing regardless of input size. Frontier models handle these reasonably well even at long contexts. RLMs help, but the gap is smaller.

Linear complexity -- tasks like OOLONG where the answer depends on processing every chunk of the input. These break standard models quickly. RLMs with GPT-5 outperform vanilla GPT-5 by 28.4% here.

Quadratic complexity -- tasks like OOLONG-Pairs where you need to reason about pairs of chunks. Vanilla GPT-5 scores below 0.1% F1. RLM(GPT-5) scores 58% F1. The gap is comical.

This hierarchy is the real insight. Context windows aren't just too small -- they're the wrong abstraction for information-dense tasks. No amount of window expansion will help a model that needs to do O(n^2) semantic work in a single forward pass.
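
A back-of-the-envelope count, using the diagram's numbers (a 10M-token prompt in 8K-token chunks), shows why:

    # Rough scale of the work, using the numbers from the diagram above.
    n_chunks = 10_000_000 // 8_000            # 1,250 chunks -> ~1,250 linear sub-calls
    n_pairs = n_chunks * (n_chunks - 1) // 2  # 780,625 pairwise comparisons
    print(n_chunks, n_pairs)

No window size changes either number; only the structure of the computation does.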