Observations
When message history tokens exceed a threshold (default: 30,000), the Observer creates observations — concise notes about what happened.
When observations exceed their threshold (default: 40,000 tokens), the Reflector condenses them — combining related items and reflecting on patterns.
The result is a three-tier system:
- Recent messages: Exact conversation history for the current task
- Observations: A log of what the Observer has seen
- Reflections: Condensed observations when memory becomes too long
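The threshold logic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual Observational Memory implementation: the class `ThreeTierMemory`, the `observe` and `reflect` callables (standing in for the LLM-backed Observer and Reflector), and the whitespace-based `count_tokens` are all hypothetical names. The sketch also assumes that observed messages and condensed observations are dropped from their tier once summarized, which the text does not state.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Default thresholds quoted in the text above.
MESSAGE_THRESHOLD = 30_000
OBSERVATION_THRESHOLD = 40_000


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


@dataclass
class ThreeTierMemory:
    # Hypothetical stand-ins for the Observer and Reflector (LLM calls in practice).
    observe: Callable[[List[str]], str]
    reflect: Callable[[List[str]], str]
    message_threshold: int = MESSAGE_THRESHOLD
    observation_threshold: int = OBSERVATION_THRESHOLD
    messages: List[str] = field(default_factory=list)
    observations: List[str] = field(default_factory=list)
    reflections: List[str] = field(default_factory=list)

    def add_message(self, message: str) -> None:
        self.messages.append(message)

        # Tier 1 -> tier 2: message history is over budget, so write an observation.
        if sum(count_tokens(m) for m in self.messages) > self.message_threshold:
            self.observations.append(self.observe(self.messages))
            self.messages = []  # assumption: observed messages leave the window

        # Tier 2 -> tier 3: observations are over budget, so condense them.
        if sum(count_tokens(o) for o in self.observations) > self.observation_threshold:
            self.reflections.append(self.reflect(self.observations))
            self.observations = []  # assumption: condensed observations are replaced
```

With tiny thresholds and trivial string-building callables in place of model calls, the cascade from messages to observations to reflections can be exercised without any LLM.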
We were able to demonstrate a “Top-5” LongMemEval result with minimal modifications to dspy.RLM, adding only some helper functions to process the “multi-chat” sessions.
Observational Memory achieves the highest score ever recorded on LongMemEval — 94.87% with gpt-5-mini — while maintaining a completely stable, cacheable context window. It beats the oracle, outperforms complex multi-step reranking systems with a single pass, and scales better with model quality than existing approaches.
OpenAI recently introduced their bespoke in-house AI data agent, a GPT-5.2-powered tool designed to help employees navigate and analyze over 600 petabytes of internal data across 70,000 datasets. By translating natural language questions into complex data insights in minutes, the agent enables teams across the company to bypass manual SQL debugging and quickly make data-driven decisions.
Searching code is an important part of every developer's workflow. We're trying to make it better.
Abstract page for arXiv paper 2310.06770v2: SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
A post for developers about the new Claude 3.5 Sonnet and the SWE-bench eval
Chip Huyen's 8,000 word practical guide to building useful LLM-driven workflows that take advantage of tools. Chip starts by providing a definition of "agents" to be used in the piece …
Intelligent agents are considered by many to be the ultimate goal of AI. The classic book by Stuart Russell and Peter Norvig, Artificial Intelligence: A Mode...
A post for developers with advice and workflows for building effective AI agents