Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
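To make the attention-matching idea concrete, here is a toy NumPy sketch (not the paper's actual algorithm; the random-subset initialization of compact keys and the single least-squares solve for compact values are illustrative assumptions). It shows one of the closed-form subproblems the abstract alludes to: with compact keys fixed, the compact values that best reproduce the full-context attention outputs are an ordinary least-squares solution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_kv(Q, K, V, r, seed=0):
    """Toy attention-matching sketch: pick r compact keys, then solve for
    compact values so that compact attention reproduces the full-context
    attention output in the least-squares sense."""
    m, d = Q.shape
    rng = np.random.default_rng(seed)
    # Assumption: initialize compact keys from a random subset of real keys.
    K_c = K[rng.choice(len(K), size=r, replace=False)]
    target = softmax(Q @ K.T / np.sqrt(d)) @ V        # full attention outputs
    A_c = softmax(Q @ K_c.T / np.sqrt(d))             # compact attention weights
    V_c, *_ = np.linalg.lstsq(A_c, target, rcond=None)  # closed-form value fit
    return K_c, V_c
```

The value subproblem has this closed form because, once the compact attention weights are fixed, the objective is quadratic in the compact values.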
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
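The prompt-repetition trick above is simple to apply; a minimal sketch (the wording of the restatement framing is an illustrative assumption, not taken from the paper):

```python
def build_repeated_prompt(task: str) -> str:
    # Present the task twice so the model effectively re-reads it before
    # answering, without generating any extra output tokens.
    return (
        f"{task}\n\n"
        "Restating the question to be sure it is understood:\n"
        f"{task}\n\n"
        "Answer:"
    )
```

Because the duplication lives in the input, it adds prompt tokens but no generated tokens, which is why latency is unaffected.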
Large language model (LLM) based agents have shown impressive capabilities by interleaving internal reasoning with external tool use. However, as these agents are deployed in long-horizon workflows, such as coding for a large, long-term project, context management becomes a critical bottleneck. We introduce Git-Context-Controller (GCC), a structured context management framework inspired by software version control systems. GCC elevates context into a versioned memory hierarchy, much like Git. It structures agent memory as a persistent file system with explicit operations: COMMIT, BRANCH, MERGE, and CONTEXT, enabling milestone-based checkpointing, exploration of alternative plans, and structured reflection. Our approach empowers agents to manage long-term goals, isolate architectural experiments, and recover or hand off memory across sessions and agents. Empirically, agents equipped with GCC achieve state-of-the-art performance on the SWE-Bench-Lite benchmark, resolving 48.00% of software bugs and outperforming 26 competitive systems. In a self-replication case study, a GCC-augmented agent builds a new CLI agent from scratch, achieving 40.7% task resolution, compared to only 11.7% without GCC. The code is released at: this https URL
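The GCC operations described above can be pictured with a minimal in-memory sketch (the class and its internals are hypothetical illustrations; the paper's actual system persists memory as files and has richer semantics):

```python
from dataclasses import dataclass, field

@dataclass
class ContextRepo:
    """Toy Git-style agent memory with COMMIT / BRANCH / MERGE / CONTEXT."""
    branches: dict = field(default_factory=lambda: {"main": []})
    head: str = "main"

    def commit(self, note: str) -> None:
        # Checkpoint a milestone on the current branch.
        self.branches[self.head].append(note)

    def branch(self, name: str) -> None:
        # Fork the current history to explore an alternative plan.
        self.branches[name] = list(self.branches[self.head])
        self.head = name

    def merge(self, source: str) -> None:
        # Fold an experiment's commits back into the current branch.
        current = self.branches[self.head]
        current += [n for n in self.branches[source] if n not in current]

    def context(self) -> str:
        # Render the active branch's history for the agent's prompt.
        return "\n".join(self.branches[self.head])
```

Branching isolates an architectural experiment; merging consolidates it; the rendered context is what a fresh session or a different agent would resume from.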
LCM attempts to decompose the recursion from RLMs into deterministic primitives so that the control flow can be managed by an engine rather than left to the whims of the LLM. In practice, this means we replace bespoke scripts with two mechanisms: (1) A DAG-based context management system that works like paged virtual memory, except for managing conversations and files; and (2) Operator-level recursion, like "Map" for LLMs, which lets one tool call process thousands of tasks.
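The "Map" operator mentioned above can be sketched in a few lines (a hypothetical stand-in: `llm_call` represents whatever model API the engine wraps; the real system adds scheduling, retries, and context paging):

```python
from concurrent.futures import ThreadPoolExecutor

def map_operator(llm_call, tasks, max_workers=8):
    """Operator-level Map: one tool call fans a single instruction out over
    many tasks, with the engine (not the LLM) owning the control flow."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(llm_call, tasks))
```

Because the loop lives in the engine, the LLM issues one deterministic call to process thousands of tasks rather than improvising its own iteration.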
An analogy we draw in the paper is the evolution from GO-TO statements (of Dijkstra's "Considered Harmful" fame) to structured programming. RLMs are maximally expressive, but all of that power comes with the risk of things going awry. We have built a more mechanistic system, which can provide stronger guarantees when deployed in production with today's models.
Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
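The experience-reflection-consolidation loop described above can be summarized in pseudocode-like Python (the `policy.act`, `policy.reflect`, and `env.evaluate` interfaces are hypothetical stand-ins, not the paper's API; the policy-gradient update itself is omitted):

```python
def erl_step(policy, env, task):
    """One Experiential RL iteration: attempt -> feedback -> reflection ->
    refined attempt. Only the refined attempt's trajectory is reinforced
    and consolidated into the base policy, so no reflection step is
    needed at deployment time."""
    attempt = policy.act(task)
    feedback = env.evaluate(attempt)
    reflection = policy.reflect(task, attempt, feedback)
    refined = policy.act(task, hint=reflection)
    reward = env.evaluate(refined)
    return refined, reward  # trace used for the reinforcement update
```

The key design point is that reflection happens during training only: the improved behavior is distilled into the base policy, preserving gains without extra inference cost.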
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
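The deep-thinking ratio can be illustrated with a logit-lens-style sketch (a hypothetical simplification: the input is each layer's top-1 token prediction per position, and the 0.5 depth threshold is an assumed parameter, not the paper's value):

```python
import numpy as np

def deep_thinking_ratio(layer_preds, deep_start=0.5):
    """layer_preds: (L, T) array of each layer's top-1 prediction at every
    generated position. A position is a deep-thinking token if some layer
    in the deepest (1 - deep_start) fraction, excluding the final layer,
    still disagrees with the final prediction (i.e. it is revised late)."""
    L, T = layer_preds.shape
    start = int(L * deep_start)
    final = layer_preds[-1]
    deep = (layer_preds[start:-1] != final).any(axis=0)
    return float(deep.mean())
```

Under the paper's finding, sequences with a higher ratio of such late-revised tokens tend to be more accurate, which is what Think@n exploits when rejecting unpromising prefixes early.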
The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that rewards heads whose high attention correlates with relevant documents while penalizing heads whose high attention correlates with irrelevant ones. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two (O(1)) forward passes to re-rank N documents, making it substantially more efficient than generative re-ranking methods that require at least O(N) forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration of novel ways of utilizing open-weight LLMs beyond text generation.
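The scoring step of ICR can be sketched with toy tensors (a hypothetical simplification: real attention weights come from a transformer forward pass, and the aggregation over heads and layers is omitted). One pass uses the real query, one the content-free query, and each document is scored by its calibrated attention mass:

```python
import numpy as np

def icr_scores(attn_query, attn_calib, doc_spans):
    """attn_query / attn_calib: attention mass from the query tokens to every
    context position, for the real query and a content-free query (e.g. "N/A").
    Each document's score is its attention mass after calibration."""
    delta = attn_query - attn_calib          # subtract position/content biases
    return [float(delta[s:e].sum()) for s, e in doc_spans]
```

Ranking the documents by these scores always yields a well-formed permutation, and only the two forward passes producing `attn_query` and `attn_calib` are needed regardless of N.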
Observational Memory achieves the highest score ever recorded on LongMemEval — 94.87% with gpt-5-mini — while maintaining a completely stable, cacheable context window. It beats the oracle, outperforms complex multi-step reranking systems with a single pass, and scales better with model quality than existing approaches.
MaxRL is a framework that turns more compute into increasingly better approximations of the maximum likelihood objective in sampling-based tasks.
These findings characterise LLM reasoning as a versatile computational process that emerges with scale and generalises beyond training data to novel contexts, highlighting the broader potential of the compute-scaling paradigm.
This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift.
Recommended Reading Order
- Attention Is All You Need (Vaswani et al., 2017): The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only).
- The Illustrated Transformer (Jay Alammar, 2018): Great intuition builder for understanding attention and tensor flow before diving into implementations.
- BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018): Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
- Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020): Established in-context learning as a real capability and shifted how prompting is understood.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020): First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
- Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022): Demonstrated that token count matters more than parameter count for a fixed compute budget.
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023): The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
- RoFormer: Rotary Position Embedding (Su et al., 2021): Positional encoding that became the modern default for long-context LLMs.
- FlashAttention (Dao et al., 2022): Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access.
- Retrieval-Augmented Generation (RAG) (Lewis et al., 2020): Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022): The modern post-training and alignment blueprint that instruction-tuned models follow.
- Direct Preference Optimization (DPO) (Rafailov et al., 2023): A simpler and more stable alternative to PPO-based RLHF. Preference alignment via the loss function.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022): Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
- ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023): The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025): The R1 paper. Proved that large-scale reinforcement learning without supervised data can induce self-verification and structured reasoning behavior.
- Qwen3 Technical Report (Yang et al., 2025): A lightweight overview of a modern architecture. Introduced a unified MoE with Thinking Mode and Non-Thinking Mode to dynamically trade off cost and reasoning depth.
- Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017): The modern MoE ignition point. Conditional computation at scale.
- Switch Transformers (Fedus et al., 2021): Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training.
- Mixtral of Experts (Mistral AI, 2024): Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023): Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling.
- The Platonic Representation Hypothesis (Huh et al., 2024): Evidence that scaled models converge toward shared internal representations across modalities.
- Textbooks Are All You Need (Gunasekar et al., 2023): Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024): The biggest leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
- PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022): A masterclass in large-scale training orchestration across thousands of accelerators.
- GLaM: Generalist Language Model (Du et al., 2022): Validated MoE scaling economics with massive total parameters but small active parameter counts.
- The Smol Training Playbook (Hugging Face, 2025): Practical end-to-end handbook for efficiently training language models.
Bonus Material
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
- Toolformer (Schick et al., 2023)
- GShard (Lepikhin et al., 2020)
- Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
- Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)
If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.
Time to lock in, good luck ;)
If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could behave systematically differently during evaluations, ...
Creating training data for software engineering agents is difficult. Until now.
Introducing SWE-smith: Generate 100s to 1000s of task instances for any GitHub repository.
We've generated 50k+ task instances for 128 popular GitHub repositories, then trained our own LM for SWE-agent.
The result? SWE-agent-LM-32B achieves 40% pass@1 on SWE-bench Verified.
Now, we've open-sourced everything, and we're excited to see what you build with it!
Check out the tutorial below to generate 100 task instances for any GitHub repository in 10 minutes.
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
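The epsilon-greedy mechanism mentioned above, studied as a classic baseline against LLM-specific approaches, is simple to sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Classic exploration rule: with probability epsilon pick a random arm,
    otherwise act greedily on the current value estimates."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Forcing a small fraction of random actions directly counteracts the greediness failure mode the paper observes in LLM agents.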
A tracing of the history of GPT-1 and its predecessors.
Replace 'hub' with 'ingest' in any GitHub URL for a prompt-friendly text.
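The hub-to-ingest URL trick above amounts to a one-line string substitution:

```python
def to_ingest_url(github_url: str) -> str:
    # Swap the domain: github.com -> gitingest.com ('hub' -> 'ingest').
    return github_url.replace("github.com", "gitingest.com", 1)
```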
A new report reveals that OpenAI's audio transcription tool, Whisper, consistently produces "hallucinations", according to multiple studies.
Google is gearing up to unveil its latest AI language model, Gemini 2.0, in December, according to insider sources from The Verge.
Another indication of the plateau thesis: OpenAI has just confirmed that a new model, internally considered as a potential successor to GPT-4, will not be released this year, despite looming competition from Google Gemini 2.0.
Similarly, Anthropic is rumored to have put a previously announced version 3.5 of its flagship Opus model on hold due to a lack of significant progress, instead focusing on an improved version of Sonnet 3.5 that emphasizes agent-based AI.