Expert Biography

Joel Becker

Member of Technical Staff at METR, leading groundbreaking research on AI capability measurement and evaluation. Previously a PhD student in Economics at NYU and a researcher at Harvard, bringing rigorous empirical methods to AI safety research.

Bridging the Benchmark-Economics Gap

Joel Becker has emerged as a critical voice in AI capability measurement, challenging the field’s reliance on laboratory benchmarks that often fail to predict real-world economic impact. His research reveals fundamental disconnects between how AI progress is measured and how it actually creates value.

Current Work

As Member of Technical Staff at METR (Model Evaluation & Threat Research), Joel leads evaluation research focused on understanding AI capabilities through both controlled testing and field evidence. His work spans:

  • Capability measurement frameworks - Building evaluation methods that capture real-world performance, not just benchmark scores
  • Developer productivity research - an RCT finding that AI tools increased task completion time by 19% for experienced open-source developers (a toy version of this estimate is sketched after this list)
  • Long-horizon task evaluation - Research on measuring AI ability to complete complex, multi-step tasks
  • AI R&D capabilities - RE-Bench, which evaluates frontier AI systems against human experts on research engineering tasks
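
To make the headline number from that RCT concrete, here is a minimal sketch, using entirely hypothetical completion times, of how a relative slowdown is computed from a two-arm comparison of AI-assisted and unassisted work. It is an illustration only, not METR's actual study design or analysis.

```python
# Minimal sketch: estimating a relative change in task completion time from
# a two-arm comparison. All numbers are hypothetical, chosen only to
# illustrate a roughly 19% slowdown; they are not study data.
import statistics

ai_assisted = [75, 90, 95, 70, 85, 85]  # completion times (minutes) with AI tools
unassisted = [60, 70, 80, 65, 75, 70]   # completion times (minutes) without

mean_ai = statistics.mean(ai_assisted)
mean_baseline = statistics.mean(unassisted)

# Positive values mean tasks took longer with AI assistance.
relative_change = (mean_ai - mean_baseline) / mean_baseline
print(f"Estimated change in completion time: {relative_change:+.0%}")
```

The real study involves randomization, many more tasks, and uncertainty estimates; the point here is only that the headline figure is a relative change in completion time, not a benchmark score.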

Joel also runs Qally’s, applying similar evaluation rigor to healthcare technology.

Background

Previously a PhD student in Economics at New York University (2020-2022) and a researcher at Harvard University (2018-2022), Joel brings academic economics methodology to AI safety research. His work has been published in Nature Human Behaviour and accepted to ICML 2025 (Spotlight), with coverage in Reuters, The Atlantic, WIRED, MIT Technology Review, Financial Times, and The Economist.

Philosophy on Capability Measurement

Joel’s approach challenges conventional AI evaluation:

Field evidence over laboratory optimism - Deployment data provides ground truth that controlled benchmarks cannot capture. Real-world constraints reveal hidden dependencies and failure modes invisible in lab settings.

Economic impact as capability measure - True capability assessment requires understanding actual value creation, not just raw performance scores. Some high-benchmark systems produce minimal practical value, while modest improvements in overlooked capabilities drive significant economic impact.
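
As a toy illustration of this point (invented tasks and dollar values, not data from Joel's research), a system can post a strong benchmark-style accuracy while capturing almost none of the economic value at stake:

```python
# Toy illustration: benchmark accuracy vs. value-weighted performance.
# Tasks, outcomes, and dollar values are invented for illustration only.
tasks = [
    # (task, solved_by_model, economic_value_usd)
    ("format unit-test boilerplate", True, 5),
    ("answer textbook algorithms quiz", True, 2),
    ("rename variables across a module", True, 10),
    ("debug race condition in production service", False, 2_000),
    ("migrate billing schema without downtime", False, 5_000),
]

benchmark_accuracy = sum(solved for _, solved, _ in tasks) / len(tasks)
value_captured = sum(value for _, solved, value in tasks if solved)
value_available = sum(value for _, _, value in tasks)

print(f"Benchmark-style accuracy: {benchmark_accuracy:.0%}")                       # 60%
print(f"Share of economic value captured: {value_captured / value_available:.1%}")  # ~0.2%
```

Weighting outcomes by the value of the task, rather than counting correct answers, is one crude way to operationalize the distinction; the argument above is that serious capability assessment needs measures of this second kind, grounded in deployment data.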

Implications for automated AI R&D - As AI systems begin doing their own research, what should they optimize for? Benchmark scores or real-world utility? Joel’s work suggests the benchmark-economics gap has profound implications for how autonomous research systems should be designed.

Conference Appearance

  • Event: AI Engineering Code Summit 2025
  • Date: November 21, 2025
  • Time: 5:00 PM - 5:19 PM
  • Session: Benchmarks vs economics: the AI capability measurement gap

Joel presented on reconciling laboratory benchmarks with field evidence on AI capabilities, exploring what this means for automated AI R&D. His talk challenged the field’s fixation on benchmark scores and emphasized the need for evaluation frameworks grounded in real-world economic impact.
