Applications

Where infinite context actually matters.

The promise of RLMs isn't just "process longer texts." It's enabling tasks that were previously impossible because they require dense reasoning over inputs that exceed any context window. Here's where that changes things.


Deep research

Reasoning over massive document corpora

The paper benchmarks RLMs on BrowseComp-Plus, a deep research task that requires reasoning over 1,000 documents to answer multi-hop questions. The documents contain gold evidence, supporting evidence, and hard negatives -- mimicking real-world research scenarios where you have a huge corpus and need to find and connect the relevant pieces.

At 6-11 million tokens of input, no standard model can even fit this in context. RAG helps for simple lookups but fails when the answer requires synthesizing information across multiple documents that wouldn't all appear in a top-k retrieval. The RLM approach lets the model systematically examine the corpus, identify relevant documents, and recursively reason over their connections.
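The paper doesn't fix the decomposition -- the root model writes its own exploration code -- but the shape it tends to take is roughly the sketch below. Here sub_rlm(query, context) is a stand-in for the recursive call described later, and the keyword filter and prompts are illustrative assumptions, not the paper's code.

    from typing import Callable

    def deep_research(question: str, corpus: dict[str, str],
                      sub_rlm: Callable[[str, str], str]) -> str:
        """Sketch of a root-RLM strategy for multi-hop QA over a large corpus.

        corpus maps document ids to full text; sub_rlm(query, context) is a
        placeholder for the recursive call -- each invocation gets a fresh
        context window.
        """
        # 1. Cheap lexical filter to shortlist candidate documents.
        keywords = [w for w in question.lower().split() if len(w) > 4]
        candidates = {doc_id: text for doc_id, text in corpus.items()
                      if any(k in text.lower() for k in keywords)}

        # 2. Dense pass: a sub-RLM reads each candidate in its own window and
        #    extracts only the evidence relevant to the question.
        notes = []
        for doc_id, text in candidates.items():
            evidence = sub_rlm(
                f"Extract any evidence relevant to: {question}. "
                "Reply NONE if the document is irrelevant.",
                text,
            )
            if evidence.strip() != "NONE":
                notes.append(f"[{doc_id}] {evidence}")

        # 3. Synthesis: reason over the distilled notes, which now fit in one
        #    context window.
        return sub_rlm(
            f"Using only the evidence notes below, answer: {question}",
            "\n".join(notes),
        )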

The business case is obvious: any organization sitting on thousands of reports, memos, research papers, or technical documents could use RLMs to answer questions that span their entire knowledge base. Not retrieval -- actual dense reasoning across everything.

Code

Understanding entire repositories

The LongBench-v2 CodeQA benchmark tests exactly this: given a code repository, answer questions that require understanding the relationships between multiple files. This is the kind of task that developers do every day when onboarding to a new codebase or debugging cross-module issues.

Current AI coding assistants typically work file-by-file or with a handful of files in context. RLMs can process an entire repository as a single prompt, recursively examining files, tracing dependencies, and building up an understanding of the codebase structure before answering specific questions about it.

The model writes code to explore the repo -- listing files, reading specific functions, tracing imports -- using the same kind of systematic exploration a human developer would. But it does it across the entire codebase simultaneously, with sub-RLMs processing individual files in parallel.
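A rough sketch of that workflow, using the same sub_rlm(query, context) placeholder for the recursive call; the traversal and import-tracing heuristics are illustrative, not taken from the paper or any benchmark trace.

    import os
    import re
    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable

    def answer_repo_question(repo_root: str, question: str,
                             sub_rlm: Callable[[str, str], str]) -> str:
        """Sketch of repo-level QA: map sub-RLMs over files, reason over the map."""
        # Enumerate sources first, the way a developer would start with ls -R.
        paths = [os.path.join(dirpath, name)
                 for dirpath, _, names in os.walk(repo_root)
                 for name in names if name.endswith(".py")]

        def describe(path: str) -> str:
            with open(path, encoding="utf-8", errors="ignore") as f:
                source = f.read()
            # Cheap import tracing before spending a model call on the file.
            imports = re.findall(r"^(?:from|import)\s+(\S+)", source, flags=re.M)
            summary = sub_rlm(
                "Summarize this file's public API and its role in the repository.",
                source,
            )
            return f"{path}\n  imports: {sorted(set(imports))}\n  {summary}"

        # Sub-RLMs run in parallel, one fresh context window per file.
        with ThreadPoolExecutor(max_workers=8) as pool:
            file_notes = list(pool.map(describe, paths))

        # The root call reasons over per-file notes instead of raw source.
        return sub_rlm(question, "\n\n".join(file_notes))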

Legal

Contract analysis at scale

Consider a due diligence review: hundreds of contracts, each dozens of pages, and you need to identify every instance of a specific clause type, compare terms across all agreements, and flag inconsistencies. This is O(n) or O(n^2) work depending on whether you need cross-document comparison.

Standard models can summarize individual contracts fine. But "find every non-compete clause across 500 employment agreements and identify which ones have terms inconsistent with the master agreement" requires dense processing of every document and comparison across all of them. That's exactly the pattern RLMs excel at -- decompose into individual documents, extract relevant clauses via sub-RLMs, then aggregate and compare.
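In code this is a plain map-reduce over contracts. The prompts and helper below are illustrative assumptions, with sub_rlm(query, context) again standing in for the recursive call.

    from typing import Callable

    def flag_inconsistent_non_competes(contracts: dict[str, str],
                                       master_agreement: str,
                                       sub_rlm: Callable[[str, str], str]) -> str:
        """Sketch of decompose / extract / aggregate for contract review."""
        # Map: one sub-RLM per contract, extracting only the clause of interest.
        clauses = {
            name: sub_rlm(
                "Quote the non-compete clause verbatim, or reply NONE if absent.",
                text,
            )
            for name, text in contracts.items()
        }

        # Baseline terms from the master agreement, extracted the same way.
        baseline = sub_rlm(
            "Quote the non-compete terms (duration, geography, scope) verbatim.",
            master_agreement,
        )

        # Reduce: compare every extracted clause against the baseline in one
        # pass over distilled text that now fits a single context window.
        report = "\n\n".join(f"--- {name} ---\n{clause}"
                             for name, clause in clauses.items()
                             if clause.strip() != "NONE")
        return sub_rlm(
            "Flag every clause below whose terms are inconsistent with this "
            f"baseline:\n{baseline}",
            report,
        )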

The OOLONG benchmark results are directly relevant here. OOLONG requires transforming each chunk of input semantically and then aggregating -- precisely what contract analysis demands. RLMs outperform vanilla GPT-5 by 28.4% on this class of task.

Books and long-form

Processing book-length texts

A typical novel is 80,000-100,000 words, roughly 100K-130K tokens. That fits (barely) in some context windows. But actually reasoning over an entire book -- tracking character arcs, identifying thematic patterns, cross-referencing plot points across chapters -- degrades rapidly even within window limits.

RLMs make book-length analysis practical. The model can recursively process chapters, extract structured information from each, and then reason over the extracted data. For literary analysis, this means genuine engagement with the full text rather than a lossy summary. For nonfiction, it means synthesizing arguments and evidence across an entire work.
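A sketch of what per-chapter extraction could look like; the JSON schema and prompts are invented for illustration, and sub_rlm(query, context) is the same placeholder as above.

    import json
    from typing import Callable

    def trace_character_arcs(chapters: list[str],
                             sub_rlm: Callable[[str, str], str]) -> str:
        """Sketch of book-length analysis: structured extraction per chapter,
        then reasoning over the extracted records rather than the raw text."""
        records = []
        for i, chapter in enumerate(chapters, start=1):
            raw = sub_rlm(
                "Return JSON with keys 'characters' (names and what changes for "
                "each) and 'plot_points' (list of events). JSON only.",
                chapter,
            )
            try:
                data = json.loads(raw)
            except json.JSONDecodeError:
                data = {"notes": raw}  # keep a lossy fallback rather than fail
            records.append({"chapter": i, "extracted": data})

        # The synthesis step sees a compact, structured view of the whole book.
        return sub_rlm(
            "Trace each character's arc across the book and note any "
            "inconsistencies between chapters.",
            json.dumps(records, indent=2),
        )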

Scale this up to multiple books -- comparative literature analysis, regulatory code spanning thousands of pages, historical archives -- and you're in territory where no other approach comes close.

Agent frameworks

Integration with Google ADK and other platforms

RLMs are designed as a drop-in replacement for standard LLM completion calls. The API surface is identical: rlm.completion(prompt, model) instead of llm.completion(prompt, model). This makes integration with existing agent frameworks straightforward.
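Roughly what the swap looks like in practice. The import path, file name, and model names below are placeholders that follow the call shape described above, not a published package spec.

    import rlm  # assumed module exposing completion(prompt, model)

    # A prompt that would overflow any single context window.
    long_report = open("all_quarterly_reports.txt", encoding="utf-8").read()
    question = "\n\nWhich quarters missed their revenue targets, and why?"

    # Before: answer = llm.completion(prompt=long_report + question, model="gpt-5")
    # After: same call shape, but the prompt is explored recursively.
    answer = rlm.completion(prompt=long_report + question, model="gpt-5-mini")
    print(answer)

Because the signature doesn't change, the switch is a one-line diff at each call site rather than a rearchitecture.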

The Google ADK (Agent Development Kit) community has already started discussing RLM integration for building agents that need to process long contexts. The pattern fits naturally: any ADK agent that currently hits context limits on long inputs could swap in an RLM completion call and immediately gain the ability to handle 10M+ token inputs.

The broader implication: as agent systems take on longer-horizon tasks (multi-day research, ongoing monitoring, large-scale analysis), the inputs they accumulate will routinely exceed any context window. RLMs provide the plumbing to handle this without rearchitecting the agent.

Liam Connell's ADK implementation also introduced lazy file loading -- instead of loading all context into memory, the RLM holds references to files on disk or in GCS/SharePoint. The model calls methods to read metadata and contents on demand. This is a practical extension that makes RLMs viable for enterprise document stores where downloading everything upfront is impossible.
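A framework-agnostic sketch of the idea (not Connell's actual code): the environment is seeded with handles that expose metadata and on-demand reads, and a GCS or SharePoint backend would simply replace the local-file access.

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class LazyDocument:
        """Reference to a document that is only read when the model asks."""
        path: Path

        def metadata(self) -> dict:
            stat = self.path.stat()
            return {"name": self.path.name, "bytes": stat.st_size,
                    "modified": stat.st_mtime}

        def read(self, start: int = 0, length: int | None = None) -> str:
            text = self.path.read_text(encoding="utf-8", errors="ignore")
            return text[start:start + length] if length is not None else text[start:]

    # Nothing is loaded until a recursive call decides a document is worth reading.
    corpus = [LazyDocument(p) for p in Path("contracts/").glob("*.txt")]
    print([doc.metadata()["name"] for doc in corpus])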

Task decomposition

Not just context splitting -- reasoning delegation

A subtlety that the community coverage has surfaced: RLMs don't just decompose context. They decompose tasks.

When an agent invokes sub_rlm(), it sets both the query (task definition) and the context that the child agent receives. This means RLMs can tackle reasoning problems that exceed a single model's capacity -- not because the input is too long, but because the reasoning chain itself is too complex for one context window.

As Liam Connell put it: "Modern LLMs reason by generating streams of text about the problem before coming to a final answer. This is inherently limited by context length. Eventually, there will be a problem that cannot be solved within the context limits of a single language model. Task decomposition via recursive delegation allows the agent to hand off tasks to other agents without burning its own precious reasoning tokens."

This opens up a second axis of scaling beyond context length: reasoning depth. The root model can delegate sub-problems that themselves require extended reasoning, each in their own fresh context window.
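A sketch of what that looks like when the bottleneck is reasoning depth rather than input length; the planning and merge prompts are illustrative, and sub_rlm(query, context) is the same placeholder used earlier.

    from typing import Callable

    def solve_with_delegation(problem: str,
                              sub_rlm: Callable[[str, str], str]) -> str:
        """Sketch of task decomposition: the root delegates whole sub-problems,
        each of which gets its own fresh reasoning budget."""
        # The root spends its tokens planning, not grinding through sub-proofs.
        plan = sub_rlm(
            "Break this problem into 3-5 independent sub-problems, one per line.",
            problem,
        )
        sub_problems = [line for line in plan.splitlines() if line.strip()]

        # Each sub-problem gets extended reasoning in its own context window.
        solutions = [
            sub_rlm("Solve this sub-problem and show your reasoning.", sp)
            for sp in sub_problems
        ]

        # The root only sees conclusions, so its own window stays mostly free.
        return sub_rlm(
            f"Combine these partial results into a final answer to: {problem}",
            "\n\n".join(solutions),
        )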

The pattern

When to reach for an RLM

RLMs are not always the right tool. The paper is honest about this: for short inputs within the model's effective context window, vanilla LLM calls are simpler and sometimes better. There's a crossover point around 2^14 tokens (16K) where RLMs start to outperform vanilla calls.

Use an RLM when:

  • Your input exceeds the model's context window
  • Your input fits in context but the task requires dense reasoning over most of it (not just finding one thing)
  • The task involves cross-referencing or comparing multiple sections of the input
  • You need to scale to millions of tokens

Don't bother with an RLM when:

  • Your input is short and the task is straightforward
  • You just need to find one specific piece of information (RAG is cheaper)
  • Latency is more important than quality (RLMs add wall-clock time from multiple calls)
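One way to act on these checklists is a thin router that only pays the RLM overhead above the crossover point. The 4-characters-per-token estimate and the function wiring below are rough assumptions; only the ~16K threshold comes from the paper.

    from typing import Callable

    CROSSOVER_TOKENS = 2 ** 14  # ~16K, the rough crossover point cited above

    def route_completion(prompt: str,
                         llm_completion: Callable[[str], str],
                         rlm_completion: Callable[[str], str]) -> str:
        """Rough router: vanilla call below the crossover, RLM above it."""
        approx_tokens = len(prompt) // 4  # crude length estimate, not a tokenizer
        if approx_tokens < CROSSOVER_TOKENS:
            return llm_completion(prompt)  # simpler, cheaper, often just as good
        return rlm_completion(prompt)      # dense reasoning over long input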