Both use the same underlying neural networks. The difference is how they interact with long inputs. One brute-forces it. The other thinks programmatically.
| Dimension | LLM | RLM |
|---|---|---|
| Max effective input | 4K–272K tokens (degrades at scale) | 10M+ tokens (tested) |
| Processing model | Single forward pass — all tokens attend to all tokens | Recursive self-calls — decompose, process chunks, aggregate |
| How input is accessed | Loaded entirely into context window | Stored as environment variable, examined programmatically |
| Scaling behavior | O(n²) attention — quality drops as input grows | Recursive decomposition — quality maintained at any scale |
| Selective attention | No — must attend to everything | Yes — model decides what to examine |
| Code execution | Not part of inference | Central — model writes code in REPL to slice and process |
| Cost at scale | Linear or worse — paying for all tokens | Often cheaper — only processes relevant chunks |
| Failure mode | "Context rot" — gradually loses information | Can miss connections across chunks (mitigated by overlap strategies) |
| Best suited for | Short-to-medium inputs, conversational tasks | Book-length docs, codebases, legal corpora, deep research |
| Training required | Standard pretraining + fine-tuning | ~1,000 post-training samples on any base LLM |
An RLM is not a different model architecture. It's the same transformer — the same weights, the same attention mechanism — wrapped in a recursive execution framework.
Think of it this way: an LLM is a person trying to read an entire library by cramming all the books into their field of vision at once. An RLM is the same person, but now they have a desk, a notepad, and a system. They pick up one book at a time, take notes, cross-reference, and build understanding incrementally.
The key components that make this work (sketched in code after the list):
1. REPL Environment — The input becomes a variable in a code sandbox. The model doesn't "see" the full text. It writes Python to examine it.
2. Recursive Self-Calls — The model can call itself on sub-problems. Process chunk 1, get a partial answer, process chunk 2 with that context, repeat.
3. Programmatic Decomposition — The model decides how to split the input. It's not fixed chunking — it's task-aware. A summarization task splits differently than a search task.
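Putting the three components together, here is a minimal sketch of the loop in Python. Everything in it is illustrative: the llm callable, rlm_answer, and MAX_DIRECT are assumed names, and the paper's actual RLM lets the model write arbitrary code in a REPL rather than follow a fixed scaffold like this one.

```python
# Minimal sketch of the recursive loop, assuming a generic llm(prompt) -> str
# client. All names here are illustrative; the real RLM writes its own code
# in a REPL, whereas this hard-codes one decompose/map/aggregate strategy.

MAX_DIRECT = 8_000  # rough token budget one call handles well (~4 chars/token)

def rlm_answer(llm, query: str, context: str) -> str:
    # Base case: the input fits comfortably in a single forward pass.
    if len(context) // 4 <= MAX_DIRECT:
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

    # 1. Task-aware decomposition: the model picks the chunking, not a fixed rule.
    plan = llm(
        f"For the task '{query}' over a {len(context)}-char document, "
        "reply with chunk size and overlap in chars as two integers: size,overlap"
    )
    size, overlap = (int(x.strip()) for x in plan.split(","))

    # 2. Recursive self-calls: solve the task on each chunk independently.
    step = max(size - overlap, 1)
    partials = [
        rlm_answer(llm, query, context[i : i + size])
        for i in range(0, len(context), step)
    ]

    # 3. Aggregate: recurse on the (much shorter) partial answers until they fit.
    return rlm_answer(llm, query, "\n---\n".join(partials))
```

Note that the overlap between chunks is exactly what mitigates the "missed connections" failure mode from the table above.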
The MIT OASYS lab evaluated two RLM variants, RLM(GPT-5) and RLM-Qwen3-8B (an 8B-parameter model post-trained on just 1,000 samples), on four long-context benchmarks. Here is RLM(GPT-5) against vanilla GPT-5:
| Benchmark | GPT-5 | RLM(GPT-5) | Delta |
|---|---|---|---|
| S-NIAH (retrieval) | High | Comparable | ≈0% |
| OOLONG-Pairs (quadratic) | <0.1% F1 | 58% F1 | >580× |
| BrowseComp-Plus | Moderate | Strong | Significant |
| CodeQA | Degrades | Maintained | +28.3% avg |
The OOLONG-Pairs result is the most striking. This benchmark requires comparing information across the entire input — the kind of task where attention mechanisms fundamentally struggle at scale. GPT-5 essentially fails. The RLM version handles it because it doesn't try to attend to everything at once.
On cost: at the median, RLM calls on GPT-5 are cheaper than vanilla GPT-5, because the model selectively examines context rather than paying for attention over all tokens.
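A back-of-envelope calculation shows why. The prices and token counts below are illustrative assumptions, not figures from the paper; the only thing that matters is that a vanilla call is billed for every input token, while an RLM is billed only for the chunks it chooses to read plus some orchestration overhead.

```python
# Back-of-envelope cost model. Prices and token counts are illustrative
# assumptions, not figures from the paper.
PRICE_PER_MTOK = 1.25  # hypothetical input price in $ per 1M tokens

def vanilla_cost(input_tokens: int) -> float:
    # A vanilla LLM call is billed for every token in the prompt.
    return input_tokens / 1e6 * PRICE_PER_MTOK

def rlm_cost(chunks_examined: int, chunk_tokens: int,
             orchestration_tokens: int) -> float:
    # An RLM is billed only for the chunks it chooses to read, plus the
    # tokens spent writing and reading orchestration code.
    return ((chunks_examined * chunk_tokens + orchestration_tokens)
            / 1e6 * PRICE_PER_MTOK)

print(vanilla_cost(10_000_000))     # $12.50 for a 10M-token prompt
print(rlm_cost(12, 8_000, 40_000))  # ~$0.17 if only 12 chunks are read
```

The worst case flips, of course: a task that genuinely needs every token read pays the orchestration overhead on top.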
The two approaches aren't mutually exclusive. An RLM uses standard LLM calls internally — it just orchestrates them recursively. You can think of it as a meta-layer on top of any LLM.
DSPy v3.1.2+ ships with built-in RLM support. If you're already using DSPy for prompt programming, adding recursive processing is a configuration change.
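The exact shape of the built-in API may differ from what follows; as a rough sketch, the same recursion is easy to express with standard DSPy primitives (dspy.configure, dspy.LM, and dspy.Predict are real DSPy calls; the model name, the max_chars threshold, and the recursive_answer scaffold are assumptions):

```python
import dspy

# Sketch with standard DSPy primitives; the built-in RLM module may differ.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # model name is a placeholder

answer = dspy.Predict("question, context -> answer")

def recursive_answer(question: str, context: str, max_chars: int = 32_000) -> str:
    # Base case: the context is small enough for one direct call.
    if len(context) <= max_chars:
        return answer(question=question, context=context).answer
    # Recursive case: split in half, answer each side, then merge the partials.
    mid = len(context) // 2
    partials = [recursive_answer(question, half, max_chars)
                for half in (context[:mid], context[mid:])]
    return recursive_answer(question, "\n".join(partials), max_chars)
```

The fixed halving here is deliberately naive; the point of built-in support is that decomposition becomes the framework's job rather than yours.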
Google's Agent Development Kit (ADK) has an enterprise-ready implementation with lazy file loading and parallel sub-calls — optimized for production workloads.
The original MIT paper (arXiv:2512.24601) includes the full algorithm, training data, and post-training recipe. The RLM-Qwen3-8B weights are available on HuggingFace, and a reference implementation is on PyPI (pip install rlm).
This isn't a research curiosity anymore. It's being deployed in production systems for document processing, code analysis, and deep research applications.
Ready to go deeper? Start with how RLMs work, or see the benchmark results.