Inference-Time Compute: The New Scaling Frontier

February 2026

For roughly eight years, the dominant strategy in AI was straightforward: make the model bigger, train it on more data, and performance improves. This was the era of scaling laws, first articulated by Kaplan et al. at OpenAI in 2020 and refined by Hoffmann et al. at DeepMind with the Chinchilla paper in 2022. The prescription was clear -- given a fixed compute budget, there is an optimal balance between model size and training data, and following that prescription reliably produces better models.

That strategy has not stopped working, but it has started yielding diminishing returns. Training runs for frontier models now cost hundreds of millions of dollars. The datasets are approaching the limits of available high-quality text. And the gains from each doubling of compute are shrinking. The community has been searching for the next scaling axis, and there is growing consensus that it lies not in training but in inference.

What inference-time scaling means

The core idea is simple: instead of making the model smarter by training it longer, make it smarter by letting it think longer on each problem. A fixed model, given more computation at inference time, should be able to solve harder problems than the same model given less computation.

This is not a new observation. Chain-of-thought prompting, introduced by Wei et al. in 2022, was an early demonstration: by asking the model to show its reasoning step by step, you effectively increase the amount of computation (measured in generated tokens) between the question and the answer. The model uses the intermediate tokens as scratch space for sequential reasoning that it cannot perform in a single forward pass.

More recently, OpenAI's o1 and o3 models have made inference-time scaling a product feature. These models are trained to produce extended reasoning traces before answering, and their performance on difficult reasoning tasks scales with the length of the trace. Give o3 more time to think (more tokens in its reasoning chain), and it solves more problems correctly. In OpenAI's reported results the relationship is roughly log-linear: each doubling of inference compute yields a roughly constant increase in accuracy.
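To make "roughly log-linear" concrete, here is a minimal sketch of what such a curve implies. The function name, the baseline accuracy, and the gain per doubling are illustrative assumptions, not measurements from o1 or o3.

```python
import math


def predicted_accuracy(compute: float, base_compute: float,
                       base_accuracy: float, gain_per_doubling: float) -> float:
    """Log-linear toy model: every doubling of compute adds a fixed increment.

    All parameters are hypothetical; real curves flatten as accuracy nears 1.0,
    which is why the result is clamped to [0, 1].
    """
    doublings = math.log2(compute / base_compute)
    return min(1.0, max(0.0, base_accuracy + gain_per_doubling * doublings))


# Made-up numbers: 40% accuracy at 1,000 reasoning tokens, +5 points per doubling.
for tokens in (1_000, 2_000, 4_000, 8_000):
    print(tokens, round(predicted_accuracy(tokens, 1_000, 0.40, 0.05), 2))
```

The point of the sketch is the shape, not the values: under a log-linear law, going from 4,000 to 8,000 tokens buys the same increment as going from 1,000 to 2,000, at four times the additional cost.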

RLMs as inference-time scaling

Recursive Language Models fit squarely in this paradigm, but they push it in a different direction than chain-of-thought or extended reasoning traces. Where those approaches scale computation depth (more sequential reasoning steps within a single context), RLMs scale computation breadth and depth simultaneously through recursive decomposition.

A standard chain-of-thought approach lets the model think for more steps, but it still operates within a single context window. The reasoning happens in a linear sequence: step 1, step 2, step 3, and so on. The model cannot go back and re-examine earlier input based on what it discovers later (except by attending back over its own growing context, which becomes expensive and unreliable at scale).

An RLM, by contrast, can adaptively allocate compute based on what it discovers during processing. If section A of the input turns out to be straightforward, the model processes it in one call. If section B is complex and requires cross-referencing with section C, the model spawns additional sub-calls, each getting fresh context and fresh reasoning capacity. The total compute is proportional to the actual difficulty of the problem, not a fixed budget determined before processing begins.

This adaptive allocation is a key distinction. In chain-of-thought scaling, you typically set a compute budget in advance (a maximum number of reasoning tokens) and the model uses as much of it as it needs. In RLM scaling, the model dynamically decides how much compute to allocate based on what it finds in the input. Simple tasks get minimal recursion. Complex tasks get deep recursion with many sub-calls. The compute scales with the problem, not with a predetermined budget.
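The adaptive allocation described above can be sketched with a toy recursion. This is not the RLM implementation from the paper: `recursive_solve`, `CallStats`, and the `is_simple` check are hypothetical names, the "model" is a plain Python function, and splitting in half stands in for whatever decomposition a real model would choose.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CallStats:
    """Counts invocations so the compute spent on a task is visible."""
    calls: int = 0


def recursive_solve(chunk: Sequence[int],
                    is_simple: Callable[[Sequence[int]], bool],
                    solve: Callable[[Sequence[int]], int],
                    stats: CallStats,
                    depth: int = 0,
                    max_depth: int = 6) -> int:
    stats.calls += 1  # each visit stands in for one model call
    # Stop recursing when the chunk is judged simple, cannot be split,
    # or the depth limit is reached.
    if depth >= max_depth or len(chunk) <= 1 or is_simple(chunk):
        return solve(chunk)
    mid = len(chunk) // 2
    left = recursive_solve(chunk[:mid], is_simple, solve, stats, depth + 1, max_depth)
    right = recursive_solve(chunk[mid:], is_simple, solve, stats, depth + 1, max_depth)
    return left + right  # aggregation step; here it is just addition


def fits_in_one_call(chunk: Sequence[int]) -> bool:
    """Stand-in for the model's own judgment that no decomposition is needed."""
    return len(chunk) <= 4


easy, hard = CallStats(), CallStats()
easy_answer = recursive_solve([1, 2, 3], fits_in_one_call, sum, easy)
hard_answer = recursive_solve(list(range(16)), fits_in_one_call, sum, hard)
print(easy_answer, easy.calls)  # 6 1
print(hard_answer, hard.calls)  # 120 7
```

The same function spends one call on the small input and seven on the large one, with no budget set in advance. The only fixed limit is `max_depth`, which acts as a safety cap rather than a target.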

The economics of inference-time scaling

Training-time scaling has a brutal cost structure. You pay the full training cost up front, amortized over all future queries. The marginal cost of serving each query is relatively low (just inference compute), but the fixed cost is enormous and must be committed before you know how the model will perform.

Inference-time scaling inverts this. The model's training cost is fixed (and can be modest -- RLM-Qwen3-8B was post-trained on just 1,000 samples). The variable cost shows up at inference, where harder tasks consume more compute. This has several attractive properties:

Pay-per-difficulty. Simple tasks are cheap. Hard tasks are expensive. You only pay for the compute that a specific problem actually requires. This is more efficient than training a model that is over-provisioned for easy tasks and under-provisioned for hard ones.

Continuous improvement without retraining. As inference-time techniques improve (better decomposition strategies, more efficient aggregation, smarter recursion depth control), the same base model gets better. You do not need to retrain; you need to improve the inference-time algorithm.

Scalable quality. If a customer needs a higher quality result, they can simply authorize more inference compute. This creates a natural product tier: fast answers for routine queries, deep answers for complex ones. The same model serves both tiers, with the quality difference coming entirely from inference-time allocation.

The three axes of inference-time compute

It is useful to think about inference-time scaling along three axes, each representing a different way to spend compute at inference time.

Depth: extended reasoning. This is the chain-of-thought / o1 axis. The model generates more tokens of reasoning before producing an answer. Each additional token is a step of sequential computation. This scales the model's ability to perform multi-step logical reasoning, mathematical proofs, and planning tasks.

Breadth: parallel exploration. This is the best-of-n / self-consistency axis. The model generates multiple candidate answers and selects the best one (by voting, by verification, or by a separate judge model). Each additional candidate increases the probability of finding a correct answer. This scales the model's reliability on tasks where the correct approach is uncertain.

Structure: recursive decomposition. This is the RLM axis. The model breaks the problem into sub-problems, processes each independently, and aggregates the results. This scales the model's ability to handle problems that are larger or more complex than any single invocation can manage. Unlike depth and breadth, structural scaling adapts the compute topology to the problem structure.

These axes are complementary, not competing. An RLM can use chain-of-thought reasoning within each recursive sub-call (combining depth and structure). It can generate multiple candidate decompositions and select the best one (combining breadth and structure). The most powerful inference-time systems will likely operate along all three axes simultaneously, allocating compute adaptively across depth, breadth, and structure based on the specific problem.
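Of the three axes, breadth is the easiest to show in a few lines. The sketch below assumes the candidate answers have already been sampled (the strings are made up) and that samples are independent, which real model outputs only approximate. The function names are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Self-consistency style selection: return the most common answer
    and the fraction of candidates that agreed with it."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def prob_any_correct(p: float, n: int) -> float:
    """Chance that at least one of n independent samples is correct,
    given per-sample success probability p. This is the ceiling for
    best-of-n with a perfect verifier."""
    return 1.0 - (1.0 - p) ** n


winner, agreement = majority_vote(["42", "41", "42", "42", "17"])
print(winner, agreement)  # 42 0.6
print(round(prob_any_correct(0.3, 5), 3))
```

The two functions cover different selection rules. Voting needs no verifier but only helps when the correct answer is the most frequent one. Best-of-n can surface a rare correct answer, but only if something can recognize it.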

What the benchmarks show

The MIT results provide concrete evidence for inference-time scaling through recursive structure. RLM-Qwen3-8B -- an 8B-parameter model, roughly 50x smaller than frontier models -- achieves performance that approaches GPT-5 on three out of four long-context benchmarks. On the fourth (OOLONG-Pairs), it dramatically outperforms GPT-5.

The key variable is not model size but inference-time structure. The 8B model is not smarter than GPT-5 in any meaningful sense. It has worse representations, less world knowledge, and weaker single-pass reasoning. But when equipped with recursive processing, it can bring more effective compute to bear on the problem than GPT-5 can muster in a single call over one context window.

This is the promise of inference-time scaling: performance that scales with the compute you are willing to spend per query, not the compute you spent on training. A smaller, cheaper model with a good inference-time strategy can match or exceed a larger, more expensive model with a naive inference strategy.

Implications for the industry

If inference-time scaling continues to prove effective, several consequences follow.

First, the model size race becomes less important. Organizations will compete less on who has the largest model and more on who has the best inference-time algorithms. This is good for competition: training a 1T+ parameter model requires hundreds of millions of dollars or more, but developing novel inference-time strategies requires primarily research talent and modest compute for experimentation.

Second, deployment becomes more flexible. Smaller models with inference-time scaling can run in more environments (edge devices, cost-constrained deployments, privacy-sensitive settings) while still achieving high quality on complex tasks by selectively applying additional compute when needed.

Third, the API pricing model changes. Current API pricing is primarily based on tokens in and tokens out. Inference-time scaling introduces a third dimension: compute per query. Providers will need pricing models that account for the variable cost of recursive processing, extended reasoning, and multi-candidate generation. Some already charge per "reasoning token"; RLMs may require pricing per "recursive depth" or "sub-call count."
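As a rough illustration of what a multi-dimensional price could look like, here is a sketch of a per-query cost function. The `Rates` fields, the per-sub-call fee, and every number are hypothetical. No provider's actual pricing is represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical price sheet, in dollars."""
    input_per_mtok: float      # per million input tokens
    output_per_mtok: float     # per million output tokens
    reasoning_per_mtok: float  # per million reasoning tokens
    per_sub_call: float        # flat fee per recursive sub-call


def query_cost(rates: Rates, input_tokens: int, output_tokens: int,
               reasoning_tokens: int = 0, sub_calls: int = 0) -> float:
    """Sum the token-based charges plus a charge for recursive sub-calls."""
    million = 1_000_000
    return (rates.input_per_mtok * input_tokens / million
            + rates.output_per_mtok * output_tokens / million
            + rates.reasoning_per_mtok * reasoning_tokens / million
            + rates.per_sub_call * sub_calls)


rates = Rates(input_per_mtok=1.0, output_per_mtok=4.0,
              reasoning_per_mtok=4.0, per_sub_call=0.001)

# Same prompt and answer size, different inference-time effort.
routine = query_cost(rates, input_tokens=2_000, output_tokens=500)
deep = query_cost(rates, input_tokens=2_000, output_tokens=500,
                  reasoning_tokens=20_000, sub_calls=12)
print(f"routine: ${routine:.4f}  deep: ${deep:.4f}")
```

Under a scheme like this, the two queries are identical on the "tokens in, tokens out" bill and differ only in the effort terms, which is the extra dimension the paragraph above describes.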

The shift from training-time to inference-time scaling is not abrupt. Training still matters -- you need a good base model to run inference-time algorithms on. But the marginal dollar invested in inference-time research and infrastructure is increasingly yielding more capability improvement than the marginal dollar invested in larger training runs. RLMs are one of the clearest demonstrations of this shift, and the trend shows no signs of reversing.

References

Zhang, A., Kraska, T., & Khattab, O. (2025). Recursive Language Models. arXiv:2512.24601. MIT OASYS Lab.
Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. OpenAI.
Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556. DeepMind.
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314.
OpenAI. (2024). Learning to Reason with LLMs. openai.com/index/learning-to-reason-with-llms.
Wang, X., et al. (2023). Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023.