Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
Collaborating with experts on 18 research problems, an advanced version of Gemini Deep Think helped resolve long-standing bottlenecks across algorithms, ML and combinatorial optimization, information theory, and economics. Highlights from our “Accelerating Research with Gemini” paper (with corresponding section numbers given in the paper) include:
We were able to demonstrate a “Top-5” LongMemEval result with minimal modifications to dspy.RLM: just a few helper functions to process the “multi-chat” sessions.
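The post does not show what those helper functions look like. As a rough illustration only, the sketch below flattens a list of chat sessions into a single tagged transcript; the function name, data shapes, and tagging scheme are hypothetical and are not part of dspy's actual API.

```python
def flatten_sessions(sessions):
    """Flatten a list of chat sessions into one transcript string.

    Each session is a list of {'role': ..., 'content': ...} dicts.
    Each turn is tagged with its session index so a model reading the
    transcript can tell the separate conversations apart.
    """
    lines = []
    for i, session in enumerate(sessions):
        for turn in session:
            lines.append(f"[session {i}] {turn['role']}: {turn['content']}")
    return "\n".join(lines)
```

A preprocessing step along these lines would let a single-context method consume LongMemEval's multi-session histories without changing the underlying module.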
Humans always remain in the loop, but work at a different layer of abstraction than we used to. We prioritize work, translate user feedback into acceptance criteria, and validate outcomes. When the agent struggles, we treat it as a signal: identify what is missing—tools, guardrails, documentation—and feed it back into the repository, always by having Codex itself write the fix.
Our most difficult challenges now center on designing environments, feedback loops, and control systems that help agents accomplish our goal: build and maintain complex, reliable software at scale.
The engineering team used Codex to optimize and adapt the harness for GPT‑5.3-Codex. When we started seeing strange edge cases impacting users, team members used Codex to identify context-rendering bugs and to root-cause low cache hit rates. GPT‑5.3-Codex is continuing to help the team throughout the launch by dynamically scaling GPU clusters to adjust to traffic surges and keeping latency stable.
A 61-year-old Tennessee man is finally free after spending a shocking 37 days in jail — all for posting a meme.
Of those, GVA said there were five confirmed transgender shooters, or fewer than a tenth of one per cent. (There have also been four cases of mass shootings by females in the U.S. since 1982.)
Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), while kimi-2.5 exhibits the strongest performance among open-source models.
The firm is also whitelisting a handful of market makers, including longtime crypto liquidity provider Wintermute, to facilitate trading. Meanwhile, access to BUIDL is restricted to qualified purchasers, a legal designation for those with assets of $5 million or more.
For years, Trump has claimed he had “no idea” about Epstein’s abuse of underage girls. Yet records show that in 2006, he privately told Palm Beach police that “everyone” knew about Epstein’s activities and described Ghislaine Maxwell as evil.
Trump’s call to Palm Beach police chief
According to an FBI interview conducted in October 2019 with former Palm Beach Police Chief Michael Reiter, Trump personally called him in July 2006, just as Epstein’s criminal sex charges became public. Reiter told agents that Trump said, “Thank goodness you’re stopping him, everyone has known he’s been doing this.”
Observational Memory achieves the highest score ever recorded on LongMemEval — 94.87% with gpt-5-mini — while maintaining a completely stable, cacheable context window. It beats the oracle, outperforms complex multi-step reranking systems with a single pass, and scales better with model quality than existing approaches.
"I mean, there's tons of redacted stuff. ... And [Trump's] name, I think I put his name, and it appears more than a million times. So it's all over the place."
The bottom line: "To me, this whole rollout of saying that members can come from nine to five to sit at those four computers, is just part of the coverup," Raskin asserted.
The 3 million documents that the administration has not publicly released "are the ones I'd like to see," he said.
"The administration says that these are duplicative. Well go ahead and release them then! If they're duplicative, what's the problem? We'll be the judge of that." "Epstein's lawyers synopsized and quoted Trump as saying that Jeffrey Epstein was not a member of his club at Mar-a-Lago, but he was a guest at Mar-a-Lago, and he had never been asked to leave," Raskin said. "That was redacted for some indeterminate, inscrutable reason."
Among participants who use AI, we find a stark divide in skill-formation outcomes between high-scoring interaction patterns (65%-86% quiz score) and low-scoring interaction patterns (24%-39% quiz score). High scorers either asked the AI only conceptual questions rather than requesting code generation, or asked for explanations to accompany the generated code; these usage patterns demonstrate a high level of cognitive engagement.
We develop a model of political cycles driven by time-varying risk aversion. Agents choose to work in the public or private sector and to vote Democratic or Republican. In equilibrium, when risk aversion is high, agents elect Democrats—the party promising more redistribution. The model predicts higher average stock market returns under Democratic presidencies, explaining the well-known “presidential puzzle.” The model can also explain why economic growth has been faster under Democratic presidencies. In the data, Democratic voters are more risk averse, and risk aversion declines during Democratic presidencies. Public workers vote Democratic, while entrepreneurs vote Republican, as the model predicts.
We may be on the descending portion of a productivity J-curve. As Brynjolfsson, Rock, and Syverson illustrate, when firms adopt transformative general-purpose technologies, measured productivity often initially falls because resources are diverted to investment, reorganization, and learning that do not show up as measured output.
The task-completion time horizon is the task duration (measured by expert human completion time) at which an AI agent is predicted to succeed with a given level of reliability.
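One common way to estimate such a horizon is to fit a logistic curve of success probability against log task duration and solve for the duration where predicted success equals the target reliability. The sketch below is a minimal illustration of that idea on toy data; it is not the original authors' code, and the fitting details (plain gradient descent, log2 duration scale) are assumptions for the example.

```python
import math

def fit_horizon(durations_min, successes, target=0.5, steps=5000, lr=0.1):
    """Estimate the time horizon at reliability `target`.

    Fits p(success) = sigmoid(a + b * log2(duration)) to observed
    (duration, success) pairs via gradient descent on the logistic
    loss, then solves sigmoid(a + b*x) = target for the duration.
    """
    xs = [math.log2(d) for d in durations_min]
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (p - y) / n          # gradient w.r.t. intercept
            gb += (p - y) * x / n      # gradient w.r.t. slope
        a -= lr * ga
        b -= lr * gb
    # sigmoid(a + b*x) = target  =>  x = (logit(target) - a) / b
    logit = math.log(target / (1.0 - target))
    return 2 ** ((logit - a) / b)
```

With successes on short tasks and failures on long ones, the fitted slope is negative, so raising the reliability target (say from 50% to 80%) shortens the estimated horizon, matching the definition above.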
It will automatically set all users’ accounts to a “teen-appropriate” experience unless they demonstrate that they’re adults.