Search: [benchmark]

2852 shaares
32 private links

3 results tagged benchmark

SWE-AGI Leaderboard

Among frontier models, gpt-5.3-codex achieves the best overall performance, solving 19 of 22 tasks (86.4%) and outperforming claude-opus-4.6 (15/22, 68.2%). kimi-2.5 is the strongest of the open-source models.

ai · benchmark

February 11, 2026 at 10:13:15 AM EST * · permalink

https://swe-agi.com/

Task-Completion Time Horizons of Frontier AI Models - METR

The task-completion time horizon is the task duration (measured by how long a human expert takes to complete the task) at which an AI agent is predicted to succeed with a given level of reliability.
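To make the definition concrete, here is a minimal sketch of how a time horizon could be read off a fitted success curve. It assumes (this is not stated in the note above) that predicted success follows a logistic curve in the base-2 logarithm of task duration. The function names and the coefficients `INTERCEPT` and `SLOPE` are hypothetical, illustrative values, not measured results.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def predicted_success(minutes: float, intercept: float, slope: float) -> float:
    """Assumed model: success probability falls as log2(task duration) grows."""
    z = intercept - slope * math.log2(minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(reliability: float, intercept: float, slope: float) -> float:
    """Task duration (minutes) at which predicted success equals `reliability`.

    Solves intercept - slope * log2(t) = logit(reliability) for t.
    """
    return 2.0 ** ((intercept - logit(reliability)) / slope)


# Hypothetical coefficients, chosen only for illustration.
INTERCEPT, SLOPE = 3.0, 0.5

h50 = time_horizon(0.5, INTERCEPT, SLOPE)  # duration with 50% predicted success
h80 = time_horizon(0.8, INTERCEPT, SLOPE)  # duration with 80% predicted success
print(f"50% horizon: {h50:.1f} min, 80% horizon: {h80:.1f} min")
```

Under this assumed model, a stricter reliability level always gives a shorter horizon, which is why a horizon figure is only meaningful together with its reliability level.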

ai · benchmark

February 9, 2026 at 2:16:16 PM EST * · permalink

https://metr.org/time-horizons/

GiggleScore.com

sbc · benchmark

January 1, 2020 at 5:49:49 PM EST · permalink

https://gigglescore.com/#!
