Member of Technical Staff at METR, leading groundbreaking research on AI capability measurement and evaluation. Previously a PhD student in Economics at NYU and a researcher at Harvard, bringing rigorous empirical methods to AI safety research.
Bridging the Benchmark-Economics Gap
Joel Becker has emerged as a critical voice in AI capability measurement, challenging the field’s reliance on laboratory benchmarks that often fail to predict real-world economic impact. His research reveals fundamental disconnects between how AI progress is measured and how it actually creates value.
Current Work
As a Member of Technical Staff at METR (Model Evaluation & Threat Research), Joel leads evaluation research focused on understanding AI capabilities through both controlled testing and field evidence. His work spans:
- Capability measurement frameworks - Building evaluation methods that capture real-world performance, not just benchmark scores
- Developer productivity research - A randomized controlled trial finding that AI tools increased task completion time by 19% for experienced open-source developers
- Long-horizon task evaluation - Research on measuring AI ability to complete complex, multi-step tasks
- AI R&D capabilities - RE-Bench, which evaluates frontier AI R&D capabilities against human experts
Joel also runs Qally’s, applying similar evaluation rigor to healthcare technology.
Background
Previously a PhD student in Economics at New York University (2020-2022) and a researcher at Harvard University (2018-2022), Joel brings academic economics methodology to AI safety research. His work has been published in Nature Human Behaviour and accepted to ICML 2025 (Spotlight), with coverage in Reuters, The Atlantic, WIRED, MIT Technology Review, Financial Times, and The Economist.
Philosophy on Capability Measurement
Joel’s approach challenges conventional AI evaluation:
Field evidence over laboratory optimism - Deployment data provides ground truth that controlled benchmarks cannot capture. Real-world constraints reveal hidden dependencies and failure modes invisible in lab settings.
Economic impact as capability measure - True capability assessment requires understanding actual value creation, not just raw performance scores. Some high-benchmark systems produce minimal practical value, while modest improvements in overlooked capabilities drive significant economic impact.
Implications for automated AI R&D - As AI systems begin doing their own research, what should they optimize for? Benchmark scores or real-world utility? Joel’s work suggests the benchmark-economics gap has profound implications for how autonomous research systems should be designed.
Key Publications
- Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity - Randomized controlled trial finding that AI tools increased task completion time by 19%
- Measuring AI Ability to Complete Long Tasks - Framework for evaluating complex multi-step capabilities
- RE-Bench: Evaluating frontier AI R&D capabilities - Comparing language model agents against human experts
- Resource profile of the Polygenic Index Repository - Nature Human Behaviour (2021)
Conference Appearance
Event: AI Engineering Code Summit 2025
Date: November 21, 2025
Time: 5:00 PM - 5:19 PM
Session: Benchmarks vs economics: the AI capability measurement gap
Joel presented on reconciling laboratory benchmarks with field evidence on AI capabilities, exploring what this means for automated AI R&D. His talk challenged the field’s fixation on benchmark scores and emphasized the need for evaluation frameworks grounded in real-world economic impact.