newsence

@OpenAIDevs: The standard for frontier coding evals is changing with model maturity. We now recommend reporting ...

Twitter

The standard for frontier coding evals is changing with model maturity. We now recommend reporting SWE-bench Pro and are sharing more detail on why we’re no longer reporting SWE-bench Verified as we work with the industry to establish stronger coding eval standards. SWE-bench Verified was a strong benchmark, but we’ve found evidence it is now saturated due to test-design issues and contamination from public repositories. https://t.co/3GeAsnUHdC
