Conference Session

Benchmarks vs economics - the AI capability measurement gap

5:00pm - 5:19pm | Benchmarks vs economics: the AI capability measurement gap

Speaker: Joel Becker, Researcher, METR

Topic: Reconciling lab and field evidence on AI capabilities and what this means for automated AI R&D

Slides

Slide: 16-42

Key Point: Outline slide for a three-part talk: measuring AI ability to complete long tasks, measuring the impact of early-2025 AI on the productivity of experienced open-source developers, and reconciling the gap between those two lines of evidence.

Literal Content:

  • Title: “Outline”
  • METR logo in top right
  • Three bullet points:
    • “Measuring AI Ability to Complete Long Tasks”
    • “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity”
    • “Reconciling the gap”

Slide: 16-48

Key Point: Shows that the time horizon of software engineering tasks that LLMs can complete 50% of the time has followed a remarkably steady exponential trend, rising from seconds (GPT-2) to over an hour (the latest models shown).

Literal Content:

  • Title: “Remarkably steady exponential”
  • METR logos
  • Subtitle: “Time-horizon of software engineering tasks different LLMs can complete 50% of the time”
  • Graph showing task duration vs LLM release date from 2020-2025
  • Y-axis (logarithmic): from 4 sec to 1 hour, with examples like “Answer question”, “Count words in passage”, “Find fact on web”, “Train classifier”, “Train adversarially robust image model”
  • Shows progression: GPT-2, GPT-3, GPT-3.5, GPT-4, Claude 3.5 Sonnet (Old), Qwen2-72B, GPT-5.1-Codex-Max
  • Dotted trend line showing exponential growth
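
Illustration: a trend that is exponential in release date appears as a straight line on a logarithmic y-axis, so it can be summarized by a doubling time. The sketch below fits log2(time horizon) against release date by ordinary least squares and reports the implied doubling time. It is a minimal sketch under stated assumptions, not METR's method: the function name fit_doubling_time is hypothetical, and the data points are synthetic values generated from an assumed doubling time, not numbers read from the slide.

```python
import math


def fit_doubling_time(years, horizons_seconds):
    """Least-squares fit of log2(horizon) against release date.

    Returns (doubling_time_years, slope, intercept), where slope is the
    number of doublings per year.
    """
    n = len(years)
    if n < 2 or n != len(horizons_seconds):
        raise ValueError("need at least two (year, horizon) pairs of equal length")
    logs = [math.log2(h) for h in horizons_seconds]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 1.0 / slope, slope, intercept


# SYNTHETIC data for illustration only (NOT measurements from the talk):
# horizons generated to double every 0.5 years, starting from 4 seconds.
ASSUMED_DOUBLING_YEARS = 0.5
synthetic_years = [2020.0, 2021.0, 2022.0, 2023.0, 2024.0, 2025.0]
synthetic_horizons = [
    4.0 * 2 ** ((y - 2020.0) / ASSUMED_DOUBLING_YEARS) for y in synthetic_years
]

doubling_time, slope, intercept = fit_doubling_time(synthetic_years, synthetic_horizons)
print(f"fitted doubling time: {doubling_time:.3f} years")
```

Because the synthetic points lie exactly on an exponential, the fit recovers the assumed doubling time. With real measurements the points would scatter around the line, and the fitted slope would carry uncertainty that this sketch does not estimate.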