RLMs are a simple idea with nuanced implementation. This page covers the specific techniques: how the REPL works, what decomposition strategies the model uses, how sub-calls are structured, and how RLM-Qwen3-8B was trained on just 1,000 samples.
The REPL (Read-Eval-Print Loop) is where the magic happens. When an RLM receives a prompt P, it initializes a persistent Python environment in which P is stored as a variable (never loaded into the model's context directly) and a sub_rlm() function is available for launching recursive calls.
The model then generates code in iterative turns. Each turn: write code, execute it, observe metadata about the result (not the full stdout -- just its length and a prefix). This forces the model to keep heavy data in REPL variables rather than polluting its own context window.
If each turn is trimmed to c tokens, you get at most K/c root iterations (where K is the context window), each of which can launch arbitrarily many sub-calls. In practice, the model self-terminates by setting a "Final" variable when it has its answer.
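To make those mechanics concrete, here is a minimal sketch of the root loop. It assumes a hypothetical llm() helper that proposes the next code block given the conversation so far, and it truncates by characters rather than tokens; none of the names below come from the reference implementation.

```python
import contextlib
import io

def run_rlm(prompt, llm, max_turns=50, preview_chars=200):
    """Sketch of the root RLM loop: the model writes code each turn, the code runs
    in a persistent environment, and only metadata about stdout (length + prefix)
    is fed back into the model's context."""
    env = {"P": prompt, "Final": None}    # P lives in the REPL, not in the context window
    # In a fuller sketch, a sub_rlm() callable would also be injected into env here.
    history = []                          # the only thing the model ever sees

    for _ in range(max_turns):
        code = llm(history)               # hypothetical helper: model proposes this turn's code
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, env)           # execute in the persistent environment
            out = buf.getvalue()
        except Exception as e:
            out = f"Error: {e!r}"
        history.append({
            "code": code,
            "stdout_len": len(out),                # metadata only...
            "stdout_preview": out[:preview_chars], # ...never the full output
        })
        if env.get("Final") is not None:  # the model self-terminates by setting Final
            break
    return env.get("Final")
```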
The model decides its own decomposition strategy. Nobody hardcodes chunk sizes or overlap windows. The model examines metadata about P (length, prefix, type) and writes appropriate slicing code. Common patterns observed in the paper:
Fixed-size chunking -- the simplest approach. Split P into N-token chunks, process each with a sub-RLM, aggregate results. Used for straightforward aggregation tasks.
Semantic chunking -- the model peeks at P to find natural boundaries (document separators, paragraph breaks, function definitions in code) and splits on those.
Hierarchical decomposition -- for tasks requiring deep reasoning, the model might first chunk at a coarse level (documents), then have sub-RLMs further decompose within each document. True recursion, not just one level of delegation.
Targeted probing -- for search-like tasks, the model might use BM25-style keyword matching in the REPL to identify relevant sections, then only launch sub-RLMs on those sections. This is why RLMs can be cheaper than base model calls -- selective context access.
The key difference from RAG or sliding-window approaches: the model is in control. It writes the decomposition logic itself, adapting to the specific task and input structure. No one-size-fits-all chunking strategy imposed from outside.
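To ground two of those patterns, here is the kind of code the root model might write inside the REPL. P and sub_rlm come from the environment; the chunk size, query terms, and prompt wording are illustrative, and the keyword scorer is a crude stand-in for BM25.

```python
# Fixed-size chunking: split P, process each chunk with a sub-RLM, aggregate.
chunk_size = 8000
chunks = [P[i:i + chunk_size] for i in range(0, len(P), chunk_size)]
partials = [sub_rlm("Summarize the key facts in this chunk:\n" + c) for c in chunks]
Final = sub_rlm("Combine these partial summaries into one answer:\n" + "\n".join(partials))

# Targeted probing: cheap keyword scoring locally, sub-calls only on the top hits.
query_terms = {"revenue", "2023", "forecast"}          # illustrative query
def score(chunk):
    words = chunk.lower().split()
    return sum(words.count(t) for t in query_terms)
top_chunks = sorted(chunks, key=score, reverse=True)[:5]
hits = [sub_rlm("Extract anything relevant to the question from this section:\n" + c)
        for c in top_chunks]
# ...then aggregate `hits` and set Final, as above.
```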
The sub_rlm() function is what gives RLMs their recursive power: instead of reading the input itself, the root model writes a loop that fans out over pieces of P, launching a full recursive call per piece.
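A minimal sketch of that fan-out (the chunking and prompt wording are illustrative, and chunks is assumed to have been built in an earlier turn):

```python
# Illustrative code written by the root model inside the REPL. `chunks` is
# assumed to exist from an earlier turn; `sub_rlm` is provided by the environment.
answers = []
for chunk in chunks:
    # Each call below launches a fresh recursive run over just this chunk.
    answers.append(sub_rlm(f"Answer the question using only this excerpt:\n{chunk}"))
```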
Each sub_rlm() call spins up a fresh RLM instance. That instance gets its own REPL, its own context window, its own ability to recurse further. The sub-model can be the same model or a smaller/cheaper one.
In the paper's GPT-5 experiments, the root model is GPT-5 while sub-calls use GPT-5-mini -- striking a balance between capability and cost. For the Qwen3-Coder experiments, the same model is used throughout.
This is fundamentally different from autoregressive sub-agent delegation. When Anthropic's agent patterns or similar scaffolds "delegate" to a sub-agent, they verbalize the delegation in their output stream -- one sub-call per generated token sequence. An RLM writes a for loop that launches thousands of sub-calls through a few tokens of code. The semantic work scales with the program, not with the output length.
The paper's most surprising result might be how little training it takes to make a model natively recursive.
RLM-Qwen3-8B was created by fine-tuning Qwen3-8B on just 1,000 filtered trajectories. These trajectories were generated by running Qwen3-Coder-480B as an RLM with Qwen3-8B sub-calls on tasks from LongBenchPro -- so the training data shows what good RLM behavior looks like from a stronger model.
The clever insight: training a good sub-call model is roughly the same as training a good general-purpose reasoning model. You don't need to teach the model recursion at both levels simultaneously. Focus on teaching the root model how to manipulate the REPL and launch sub-calls effectively. The sub-call model just needs to be a competent reasoner, which smaller models already are.
The training domains were deliberately unrelated to the evaluation tasks. No overlap. Yet the model improved by a median of 28.3% across four benchmarks. The RLM scaffold is genuinely task-agnostic -- learning to be recursive in one domain transfers to others.
vs RAG (Retrieval-Augmented Generation) -- RAG retrieves a fixed number of relevant chunks and feeds them to the model. Great for lookup tasks, terrible for aggregation. If the answer requires reasoning across every chunk in a corpus, RAG can't help you. RLMs can.
vs Sliding Window / Context Compaction -- Summarization agents iteratively compress context as it fills up. This works okay for shallow tasks but presumes you can safely forget early details to make room for new ones. For dense reasoning tasks, that assumption is fatal. On BrowseComp-Plus, RLMs outperform the summarization baseline by over 29%.
vs CodeAct / ReAct Agents -- These agents can execute code in a loop, but they put the user prompt directly into the model's context. They inherit all the limitations of the base model's context window. Adding BM25 retrieval helps for search tasks but doesn't address aggregation.
vs CodeAct with Sub-calls -- The closest baseline. This gives the agent both code execution and the ability to invoke sub-LM calls. But because the prompt is in-context rather than in a variable, it still hits the wall on long inputs. The paper tests this ablation directly: on information-dense tasks, RLMs outperform by 10-59%.
vs Bigger Context Windows -- This is the elephant in the room. Why not just wait for 10M-token context windows? Because context rot isn't a scaling problem. It's an attention problem. Bigger windows don't help if the model can't maintain quality across them. RLMs solve the quality problem, not the size problem.
The reference implementation supports multiple REPL environments, ranging from a simple local Python interpreter to isolated sandboxes.
For production use with untrusted inputs, isolated environments are non-negotiable -- the model is writing and executing arbitrary code. But for research and controlled settings, the local REPL is fast and simple.
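As one hedged illustration of the middle ground (not the reference implementation's approach): keeping the persistent environment in a separate worker process contains crashes and hangs, but it is still not a security boundary, so untrusted inputs call for a container or VM sandbox.

```python
import contextlib
import io
import multiprocessing as mp

def _worker(conn):
    """Long-lived worker: keeps a persistent environment across turns, but in a
    separate process so model-written code can't crash the host. This is process
    isolation only -- untrusted inputs still need a container or VM sandbox."""
    env = {}
    while True:
        code = conn.recv()
        if code is None:            # shutdown signal
            break
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, env)
            conn.send(buf.getvalue())
        except Exception as e:
            conn.send(f"Error: {e!r}")

if __name__ == "__main__":
    parent, child = mp.Pipe()
    proc = mp.Process(target=_worker, args=(child,))
    proc.start()
    parent.send("x = 21")           # state persists between turns...
    parent.recv()
    parent.send("print(x * 2)")     # ...so later turns can build on earlier ones
    print(parent.recv())            # -> 42
    parent.send(None)
    proc.join()
```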
RLMs are already being adopted by major frameworks, which tells you something about how seriously the community is taking this.
DSPy (v3.1.2+) -- Stanford's programmatic LLM framework has built-in RLM support. You can initialize an RLM with dspy.RLM('articles, question -> trends: list[str]') and it handles the REPL, sub-calls, and aggregation transparently. It supports using a smaller model for sub-calls via the sub_lm parameter to reduce costs.
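A sketch of how that might look, based only on the details named above (the signature string and the sub_lm parameter); the model identifiers are placeholders and the exact constructor may differ from the released API.

```python
import dspy

# Model names here are placeholders, not recommendations from the DSPy docs.
dspy.configure(lm=dspy.LM("openai/gpt-5"))

rlm = dspy.RLM(
    "articles, question -> trends: list[str]",   # signature string from the example above
    sub_lm=dspy.LM("openai/gpt-5-mini"),         # cheaper model for sub-calls
)

articles = open("articles.txt").read()           # arbitrarily long input
result = rlm(articles=articles, question="What pricing trends emerged in 2024?")
print(result.trends)
```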
Google ADK -- Liam Connell (Google Cloud) published an enterprise-ready reimplementation of RLMs using Google's Agent Development Kit. The ADK version extends the original paper with two notable innovations: lazy file loading (the context object references files on disk or in GCS buckets rather than loading everything into memory) and parallelism (sub-calls can run concurrently rather than sequentially). The implementation pushed ADK's primitives to their limits -- the recursive nature required dropping down from LLMAgent to the bare-bones BaseAgent.
Coding agents -- As several commentators have noted, tools like Claude Code and Gemini CLI already use sub-agent patterns that resemble RLMs. The difference is that these tools don't externalize the user prompt as a REPL variable, so they're still bounded by context limits on the input side. But the conceptual overlap suggests RLMs may become the default inference pattern for coding agents.
Alex Zhang described RLMs as a "bitter-lesson-pilled approach" on X. The reference is to Rich Sutton's famous essay arguing that general methods leveraging computation always win over clever domain-specific tricks.
The insight Zhang emphasized: "LMs can often ignore most of their context for certain problems. LMs can more efficiently solve problems when only looking locally at certain parts of their input. The REPL environment provides a programmatic way for the model to peek at and infer long contexts without the model ever actually viewing it. It's a partially observable problem that you're giving the LM, where it can make logical decisions based on the structure of the task and context."
This framing matters. RLMs aren't a workaround for insufficient context windows. They're an argument that the model shouldn't see the full context in the first place -- that treating the input as a partially observable environment you interact with programmatically is fundamentally more expressive than attention over a flat token sequence.