How much useful work does each minute of compute produce?
Source: METR Time Horizons v1.1
Most attention goes to the METR Time Horizons headline number: how hard a task can a model solve? But the benchmark data also records how much work the model needed to get there. Opus 4.6 solves harder tasks, more reliably, with considerably less compute.
Figure: Release date vs. IY for frontier models. Each dot is a model; higher is better.