METR AI Benchmark: Clarifying Limitations of Time Horizon
Clarifying limitations of time horizon - METR
Notes
Rough/unpolished research updates and speculation
Clarifying limitations of time horizon
In the 9 months since the METR time horizon paper (during which AI time horizons have increased by ~6x), the paper has generated a lot of attention as well as various criticisms. As one of the main authors, I often see misinterpretations of our work. While I still believe in the core results, I think many people both overstate the precision of our time horizon measurements and draw conclusions that I don’t think the evidence fully supports.
Therefore, I’d like to clarify some of my beliefs about the limitations of our methodology, and of time horizon more broadly, and then clarify what I think are the key conclusions directly supported by our results.
Despite these limitations, what conclusions do I still stand by?
See e.g. the DeepSeek R1 paper: https://arxiv.org/abs/2501.12948