Scaling language models to long contexts is often bottlenecked by the size of the key-value (KV) cache. In deployed settings, long contexts are typically managed through compaction in token space via summarization. However, summarization can be highly lossy, substantially harming downstream performance. Recent work on Cartridges has shown that it is possible to train highly compact KV caches in latent space that closely match full-context performance, but at the cost of slow and expensive end-to-end optimization. This work describes an approach for fast context compaction in latent space through Attention Matching, which constructs compact keys and values to reproduce attention outputs and preserve attention mass at a per-KV-head level. We show that this formulation naturally decomposes into simple subproblems, some of which admit efficient closed-form solutions. Within this framework, we develop a family of methods that significantly push the Pareto frontier of compaction time versus quality, achieving up to 50x compaction in seconds on some datasets with little quality loss.
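To make the attention-matching idea concrete, here is a toy NumPy sketch (not the paper's actual algorithm; the random-subset initialization of compact keys and the single least-squares solve for compact values are illustrative assumptions). It shows one of the closed-form subproblems the abstract alludes to: with compact keys fixed, the compact values that best reproduce the full-context attention outputs are an ordinary least-squares solution.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def compact_kv(Q, K, V, r, seed=0):
    """Toy attention-matching sketch: pick r compact keys, then solve for
    compact values so that compact attention reproduces the full-context
    attention output in the least-squares sense."""
    m, d = Q.shape
    rng = np.random.default_rng(seed)
    # Assumption: initialize compact keys from a random subset of real keys.
    K_c = K[rng.choice(len(K), size=r, replace=False)]
    target = softmax(Q @ K.T / np.sqrt(d)) @ V        # full attention outputs
    A_c = softmax(Q @ K_c.T / np.sqrt(d))             # compact attention weights
    V_c, *_ = np.linalg.lstsq(A_c, target, rcond=None)  # closed-form value fit
    return K_c, V_c
```

The value subproblem has this closed form because, once the compact attention weights are fixed, the objective is quadratic in the compact values.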
When not using reasoning, repeating the input prompt improves performance for popular models (Gemini, GPT, Claude, and Deepseek) without increasing the number of generated tokens or latency.
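The prompt-repetition trick above is simple to apply; a minimal sketch (the wording of the restatement framing is an illustrative assumption, not taken from the paper):

```python
def build_repeated_prompt(task: str) -> str:
    # Present the task twice so the model effectively re-reads it before
    # answering, without generating any extra output tokens.
    return (
        f"{task}\n\n"
        "Restating the question to be sure it is understood:\n"
        f"{task}\n\n"
        "Answer:"
    )
```

Because the duplication lives in the input, it adds prompt tokens but no generated tokens, which is why latency is unaffected.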
Large language model (LLM) based agents have shown impressive capabilities by interleaving internal reasoning with external tool use. However, as these agents are deployed in long-horizon workflows, such as coding for a large, long-term project, context management becomes a critical bottleneck. We introduce Git-Context-Controller (GCC), a structured context management framework inspired by software version control systems. GCC elevates context into a versioned memory hierarchy, much like Git. It structures agent memory as a persistent file system with explicit operations: COMMIT, BRANCH, MERGE, and CONTEXT, enabling milestone-based checkpointing, exploration of alternative plans, and structured reflection. Our approach empowers agents to manage long-term goals, isolate architectural experiments, and recover or hand off memory across sessions and agents. Empirically, agents equipped with GCC achieve state-of-the-art performance on the SWE-Bench-Lite benchmark, resolving 48.00% of software bugs and outperforming 26 competitive systems. In a self-replication case study, a GCC-augmented agent builds a new CLI agent from scratch, achieving 40.7% task resolution, compared to only 11.7% without GCC. The code is released at: this https URL
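The GCC operations described above can be pictured with a minimal in-memory sketch (the class and its internals are hypothetical illustrations; the paper's actual system persists memory as files and has richer semantics):

```python
from dataclasses import dataclass, field

@dataclass
class ContextRepo:
    """Toy Git-style agent memory with COMMIT / BRANCH / MERGE / CONTEXT."""
    branches: dict = field(default_factory=lambda: {"main": []})
    head: str = "main"

    def commit(self, note: str) -> None:
        # Checkpoint a milestone on the current branch.
        self.branches[self.head].append(note)

    def branch(self, name: str) -> None:
        # Fork the current history to explore an alternative plan.
        self.branches[name] = list(self.branches[self.head])
        self.head = name

    def merge(self, source: str) -> None:
        # Fold an experiment's commits back into the current branch.
        current = self.branches[self.head]
        current += [n for n in self.branches[source] if n not in current]

    def context(self) -> str:
        # Render the active branch's history for the agent's prompt.
        return "\n".join(self.branches[self.head])
```

Branching isolates an architectural experiment; merging consolidates it; the rendered context is what a fresh session or a different agent would resume from.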
LCM attempts to decompose the recursion from RLMs into deterministic primitives so that the control flow can be managed by an engine rather than left to the whims of the LLM. In practice, this means we replace bespoke scripts with two mechanisms: (1) A DAG-based context management system that works like paged virtual memory, except for managing conversations and files; and (2) Operator-level recursion, like "Map" for LLMs, which lets one tool call process thousands of tasks.
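The "Map" operator mentioned above can be sketched in a few lines (a hypothetical stand-in: `llm_call` represents whatever model API the engine wraps; the real system adds scheduling, retries, and context paging):

```python
from concurrent.futures import ThreadPoolExecutor

def map_operator(llm_call, tasks, max_workers=8):
    """Operator-level Map: one tool call fans a single instruction out over
    many tasks, with the engine (not the LLM) owning the control flow."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(llm_call, tasks))
```

Because the loop lives in the engine, the LLM issues one deterministic call to process thousands of tasks rather than improvising its own iteration.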
An analogy we draw in the paper is the evolution from GO-TO statements (of Dijkstra's "Considered Harmful" fame) to structured programming. RLMs are maximally expressive, but all of that power comes with the risk of things going awry. We have built a more mechanistic system, which can provide stronger guarantees when deployed in production with today's models.
Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations. We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience-reflection-consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy. This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks. These results suggest that integrating explicit self-reflection into policy training provides a practical mechanism for transforming feedback into durable behavioral improvement.
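The experience-reflection-consolidation loop described above can be summarized in pseudocode-like Python (the `policy.act`, `policy.reflect`, and `env.evaluate` interfaces are hypothetical stand-ins, not the paper's API; the policy-gradient update itself is omitted):

```python
def erl_step(policy, env, task):
    """One Experiential RL iteration: attempt -> feedback -> reflection ->
    refined attempt. Only the refined attempt's trajectory is reinforced
    and consolidated into the base policy, so no reflection step is
    needed at deployment time."""
    attempt = policy.act(task)
    feedback = env.evaluate(attempt)
    reflection = policy.reflect(task, attempt, feedback)
    refined = policy.act(task, hint=reflection)
    reward = env.evaluate(refined)
    return refined, reward  # trace used for the reinforcement update
```

The key design point is that reflection happens during training only: the improved behavior is distilled into the base policy, preserving gains without extra inference cost.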
Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
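The deep-thinking ratio can be illustrated with a logit-lens-style sketch (a hypothetical simplification: the input is each layer's top-1 token prediction per position, and the 0.5 depth threshold is an assumed parameter, not the paper's value):

```python
import numpy as np

def deep_thinking_ratio(layer_preds, deep_start=0.5):
    """layer_preds: (L, T) array of each layer's top-1 prediction at every
    generated position. A position is a deep-thinking token if some layer
    in the deepest (1 - deep_start) fraction, excluding the final layer,
    still disagrees with the final prediction (i.e. it is revised late)."""
    L, T = layer_preds.shape
    start = int(L * deep_start)
    final = layer_preds[-1]
    deep = (layer_preds[start:-1] != final).any(axis=0)
    return float(deep.mean())
```

Under the paper's finding, sequences with a higher ratio of such late-revised tokens tend to be more accurate, which is what Think@n exploits when rejecting unpromising prefixes early.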
The strong zero-shot and long-context capabilities of recent Large Language Models (LLMs) have paved the way for highly effective re-ranking systems. Attention-based re-rankers leverage attention weights from transformer heads to produce relevance scores, but not all heads are created equal: many contribute noise and redundancy, limiting performance. To address this, we introduce CoRe heads, a small set of retrieval heads identified via a contrastive scoring metric that rewards heads whose high attention correlates with relevant documents while penalizing heads whose high attention correlates with irrelevant ones. This relative ranking criterion isolates the most discriminative heads for re-ranking and yields a state-of-the-art list-wise re-ranker. Extensive experiments with three LLMs show that aggregated signals from CoRe heads, constituting less than 1% of all heads, substantially improve re-ranking accuracy over strong baselines. We further find that CoRe heads are concentrated in middle layers, and pruning the final 50% of model layers preserves accuracy while significantly reducing inference time and memory usage.
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two (O(1)) forward passes to re-rank N documents, making it substantially more efficient than generative re-ranking methods that require at least O(N) forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration of novel ways of utilizing open-weight LLMs beyond text generation.
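The scoring step of ICR can be sketched with toy tensors (a hypothetical simplification: real attention weights come from a transformer forward pass, and the aggregation over heads and layers is omitted). One pass uses the real query, one the content-free query, and each document is scored by its calibrated attention mass:

```python
import numpy as np

def icr_scores(attn_query, attn_calib, doc_spans):
    """attn_query / attn_calib: attention mass from the query tokens to every
    context position, for the real query and a content-free query (e.g. "N/A").
    Each document's score is its attention mass after calibration."""
    delta = attn_query - attn_calib          # subtract position/content biases
    return [float(delta[s:e].sum()) for s, e in doc_spans]
```

Ranking the documents by these scores always yields a well-formed permutation, and only the two forward passes producing `attn_query` and `attn_calib` are needed regardless of N.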
Observational Memory achieves the highest score ever recorded on LongMemEval — 94.87% with gpt-5-mini — while maintaining a completely stable, cacheable context window. It beats the oracle, outperforms complex multi-step reranking systems with a single pass, and scales better with model quality than existing approaches.
MaxRL is a framework that turns more compute into increasingly better approximations of the maximum likelihood objective in sampling-based tasks.
These findings characterise LLM reasoning as a versatile computational process that emerges with scale and generalises beyond training data to novel contexts, highlighting the broader potential of the compute-scaling paradigm.
This list bridges the Transformer foundations with the reasoning, MoE, and agentic shift.
Recommended Reading Order
- Attention Is All You Need (Vaswani et al., 2017): The original Transformer paper. Covers self-attention, multi-head attention, and the encoder-decoder structure (even though most modern LLMs are decoder-only).
- The Illustrated Transformer (Jay Alammar, 2018): Great intuition builder for understanding attention and tensor flow before diving into implementations.
- BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2018): Encoder-side fundamentals, masked language modeling, and representation learning that still shape modern architectures.
- Language Models are Few-Shot Learners (GPT-3) (Brown et al., 2020): Established in-context learning as a real capability and shifted how prompting is understood.
- Scaling Laws for Neural Language Models (Kaplan et al., 2020): First clean empirical scaling framework for parameters, data, and compute. Read alongside Chinchilla to understand why most models were undertrained.
- Training Compute-Optimal Large Language Models (Chinchilla) (Hoffmann et al., 2022): Demonstrated that token count matters more than parameter count for a fixed compute budget.
- LLaMA: Open and Efficient Foundation Language Models (Touvron et al., 2023): The paper that triggered the open-weight era. Introduced architectural defaults like RMSNorm, SwiGLU, and RoPE as standard practice.
- RoFormer: Rotary Position Embedding (Su et al., 2021): Positional encoding that became the modern default for long-context LLMs.
- FlashAttention (Dao et al., 2022): Memory-efficient attention that enabled long context windows and high-throughput inference by optimizing GPU memory access.
- Retrieval-Augmented Generation (RAG) (Lewis et al., 2020): Combines parametric models with external knowledge sources. Foundational for grounded and enterprise systems.
- Training Language Models to Follow Instructions with Human Feedback (InstructGPT) (Ouyang et al., 2022): The modern post-training and alignment blueprint that instruction-tuned models follow.
- Direct Preference Optimization (DPO) (Rafailov et al., 2023): A simpler and more stable alternative to PPO-based RLHF. Preference alignment via the loss function.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022): Demonstrated that reasoning can be elicited through prompting alone and laid the groundwork for later reasoning-focused training.
- ReAct: Reasoning and Acting (Yao et al., 2022 / ICLR 2023): The foundation of agentic systems. Combines reasoning traces with tool use and environment interaction.
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Guo et al., 2025): The R1 paper. Proved that large-scale reinforcement learning without supervised data can induce self-verification and structured reasoning behavior.
- Qwen3 Technical Report (Yang et al., 2025): A lightweight overview of a modern architecture. Introduced a unified MoE with Thinking Mode and Non-Thinking Mode to dynamically trade off cost and reasoning depth.
- Outrageously Large Neural Networks: Sparsely-Gated Mixture of Experts (Shazeer et al., 2017): The modern MoE ignition point. Conditional computation at scale.
- Switch Transformers (Fedus et al., 2021): Simplified MoE routing using single-expert activation. Key to stabilizing trillion-parameter training.
- Mixtral of Experts (Mistral AI, 2024): Open-weight MoE that proved sparse models can match dense quality while running at small-model inference cost.
- Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints (Komatsuzaki et al., 2022 / ICLR 2023): Practical technique for converting dense checkpoints into MoE models. Critical for compute reuse and iterative scaling.
- The Platonic Representation Hypothesis (Huh et al., 2024): Evidence that scaled models converge toward shared internal representations across modalities.
- Textbooks Are All You Need (Gunasekar et al., 2023): Demonstrated that high-quality synthetic data allows small models to outperform much larger ones.
- Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (Templeton et al., 2024): The biggest leap in mechanistic interpretability. Decomposes neural networks into millions of interpretable features.
- PaLM: Scaling Language Modeling with Pathways (Chowdhery et al., 2022): A masterclass in large-scale training orchestration across thousands of accelerators.
- GLaM: Generalist Language Model (Du et al., 2022): Validated MoE scaling economics with massive total parameters but small active parameter counts.
- The Smol Training Playbook (Hugging Face, 2025): Practical end-to-end handbook for efficiently training language models.
Bonus Material
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
- Toolformer (Schick et al., 2023)
- GShard (Lepikhin et al., 2020)
- Adaptive Mixtures of Local Experts (Jacobs et al., 1991)
- Hierarchical Mixtures of Experts (Jordan and Jacobs, 1994)
If you deeply understand these fundamentals (Transformer core, scaling laws, FlashAttention, instruction tuning, R1-style reasoning, and MoE upcycling), you already understand LLMs better than most.
Time to lock in, good luck ;)
If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could behave systematically differently during evaluations, ...
Creating training data for software engineering agents is difficult. Until now.
Introducing SWE-smith: Generate 100s to 1000s of task instances for any GitHub repository.
We've generated 50k+ task instances for 128 popular GitHub repositories, then trained our own LM for SWE-agent.
The result? SWE-agent-LM-32B achieves 40% pass@1 on SWE-bench Verified.
Now, we've open-sourced everything, and we're excited to see what you build with it!
Check out the tutorial below to generate 100 task instances for any GitHub repository in 10 minutes.
The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as epsilon-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
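The epsilon-greedy mechanism mentioned above, studied as a classic baseline against LLM-specific approaches, is simple to sketch:

```python
import random

def epsilon_greedy(q_values, epsilon=0.1, rng=random):
    """Classic exploration rule: with probability epsilon pick a random arm,
    otherwise act greedily on the current value estimates."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```

Forcing a small fraction of random actions directly counteracts the greediness failure mode the paper observes in LLM agents.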
A tracing of the history of GPT-1 and its predecessors.
Replace 'hub' with 'ingest' in any GitHub URL for a prompt-friendly text.
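The hub-to-ingest URL trick above amounts to a one-line string substitution:

```python
def to_ingest_url(github_url: str) -> str:
    # Swap the domain: github.com -> gitingest.com ('hub' -> 'ingest').
    return github_url.replace("github.com", "gitingest.com", 1)
```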
A new report reveals that OpenAI's audio transcription tool, Whisper, consistently produces "hallucinations", according to multiple studies.
Google is gearing up to unveil its latest AI language model, Gemini 2.0, in December, according to insider sources from The Verge.
Another indication of the plateau thesis: OpenAI has just confirmed that a new model, internally considered as a potential successor to GPT-4, will not be released this year, despite looming competition from Google Gemini 2.0.
Similarly, Anthropic is rumored to have put a previously announced version 3.5 of its flagship Opus model on hold due to a lack of significant progress, instead focusing on an improved version of Sonnet 3.5 that emphasizes agent-based AI.