How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
AI models are crushing benchmarks. SWE-bench scores are climbing, and METR's measured time horizons are rising rapidly. Yet when we deployed these same models in a field study with experienced developers, they didn't speed up work. What's going on? Are benchmarks misleading us about AI capabilities? Are we missing something about how AI performs in the real world? In this talk, we'll reconcile lab and field evidence on AI capabilities. Drawing from METR's time horizon measurements and developer productivity RCT, we'll explore why impressive benchmark performance doesn't always translate to real-world impact. We'll examine potential explanations—from reliability requirements to task distribution to capability elicitation—and discuss what this means for automated AI R&D.
https://x.com/joel_bkr
Identity for AI Agents - Patrick Riley & Carlos Galan, Auth0
Implementing secure identity and access management for AI agents with Okta!
https://twitter.com/patjriley
https://twitter.com/CarlosFGalan
OpenAI + @Temporalio: Building Durable, Production Ready Agents - Cornelia Davis, Temporal
Everyone is building AI Agents, and everyone is looking for ways to build them more easily. Earlier this year, OpenAI released the OpenAI Agents SDK to bring the patterns they have found to work for building agents to the developer community. With the SDK you can define AI agents by supplying instructions (prompts), specifying which model to use (OpenAI or not), listing the tools they use (including MCP), and much more. The OpenAI Agents SDK encourages a paradigm of orchestrated micro-agents, which themselves may contain micro-orchestrations through the use of handoffs. It’s an elegant and powerful model.
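The "orchestrated micro-agents with handoffs" idea can be sketched in plain Python. This is NOT the OpenAI Agents SDK API — the `Agent` class, agent names, and the `HANDOFF:` convention below are invented for illustration; only the control flow (a triage agent delegating to specialists) mirrors the paradigm described above.

```python
# Conceptual sketch of micro-agent orchestration via handoffs.
# Plain Python, not the OpenAI Agents SDK; `handle` stands in for an LLM call.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    name: str
    instructions: str                      # the agent's prompt
    handle: Callable[["Agent", str], str]  # stand-in for the model invocation
    handoffs: list["Agent"] = field(default_factory=list)


def run(agent: Agent, request: str) -> str:
    """Run an agent; if its output names a handoff target, delegate to it."""
    output = agent.handle(agent, request)
    for target in agent.handoffs:
        if output == f"HANDOFF:{target.name}":
            return run(target, request)
    return output


billing = Agent("billing", "Answer billing questions.",
                handle=lambda a, req: f"[{a.name}] resolved: {req}")
support = Agent("support", "Answer support questions.",
                handle=lambda a, req: f"[{a.name}] resolved: {req}")
triage = Agent("triage", "Route requests to the right specialist.",
               handle=lambda a, req: ("HANDOFF:billing" if "invoice" in req
                                      else "HANDOFF:support"),
               handoffs=[billing, support])

print(run(triage, "Where is my invoice?"))      # routed to billing
print(run(triage, "The app crashes on start"))  # routed to support
```

In the real SDK the model itself decides when to hand off; here a keyword check plays that role purely to keep the sketch runnable.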
But a good AI Agents programming model is not enough. These agents are ultimately wildly distributed systems and are plagued with all of the problems such systems bring.
- How can they persevere through flaky networks?
- How can they function when LLMs are rate limited?
- How can they run for long periods of time (hours, days, weeks, months) when infrastructure is rarely stable that long?
In this workshop, we’ll show you how. Temporal is an open source (MIT license) durable execution framework that brings resilience to AI agents, and in this workshop we’ll show you how it’s done with the OpenAI Agents SDK. Spoiler: OpenAI and Temporal have done all of the heavy lifting for you with an integration announced earlier this year.
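The simplest slice of the reliability problems listed above can be sketched with a retry loop. This is NOT Temporal code — durable execution goes much further, persisting every step's state so a workflow survives process crashes and can wait out rate limits for hours or weeks — but it shows the kind of failure handling the framework takes off your hands; the flaky-endpoint simulation is invented for illustration.

```python
# Toy sketch of retrying a flaky "activity" (e.g. an LLM call).
# Plain Python, not Temporal: durable execution would also checkpoint state.

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(activity: Callable[[], T], attempts: int = 5,
                 backoff: float = 0.0) -> T:
    """Retry an activity until it succeeds or attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return activity()
        except Exception as exc:  # flaky network, 429 rate limit, ...
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("activity failed") from last_error


# Simulate an LLM endpoint that fails twice before answering.
calls = {"n": 0}

def flaky_llm_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "agent output"

print(with_retries(flaky_llm_call))  # "agent output" after 3 attempts
```

A retry loop only helps while the process stays alive; the point of durable execution is that the same guarantee holds even when it doesn't.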
Oh, and OpenAI themselves use Temporal to help make several of their products production ready (image gen and Codex, for example).
Not using the OpenAI Agents SDK? Do come anyway; the foundational concepts carry over to different agent frameworks (and more integrations are coming all the time).
https://twitter.com/cdavisafc
https://www.linkedin.com/in/corneliadavis
Your MCP Server is Bad (and you should feel bad) - Jeremiah Lowin, Prefect
Too many MCP servers are simply glorified REST wrappers, regurgitating APIs that were designed for SDKs, not agents. This leads to confused LLMs, wasted tokens, and demonstrably poor performance. If you've ever pointed an MCP generator at an OpenAPI spec and called it a day, this talk is your intervention.
Like any product, great MCP servers are the result of careful design. This talk shares the hard-won lessons from creating FastMCP, the most popular framework for building MCP servers (and yes, for generating them, too). The secret is to stop thinking about endpoints and start thinking about products. We will cover the three pillars of agent-native product design—Discovery, Iteration, and Context—providing an actionable framework for curating context into small, highly effective surface areas that lead to better AI outcomes.
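The contrast between an endpoint-mirroring server and a curated, agent-native surface can be sketched in plain Python. This is NOT FastMCP code, and every tool name, endpoint, and data record below is hypothetical — the sketch only illustrates the design argument: fewer, task-shaped tools that return ready-to-use context.

```python
# Conceptual sketch (plain Python, not FastMCP). All names are hypothetical.

from typing import Optional

# Anti-pattern: one tool per REST endpoint, mirroring an OpenAPI spec.
# The agent must discover, page through, and join raw responses itself.
rest_wrapper_tools = [
    "list_projects", "get_project", "list_issues", "get_issue",
    "list_comments", "get_comment", "list_users", "get_user",
]

# Agent-native design: one high-level tool shaped around the agent's task,
# doing the joining and filtering server-side.
def find_open_issues(project: str, assignee: Optional[str] = None) -> list:
    """One curated tool replacing several raw endpoints (hypothetical data)."""
    issues = [
        {"project": "api", "assignee": "ana", "title": "500 on login", "open": True},
        {"project": "api", "assignee": "ben", "title": "Slow search", "open": False},
    ]
    return [i for i in issues
            if i["open"] and i["project"] == project
            and (assignee is None or i["assignee"] == assignee)]

agent_native_tools = {"find_open_issues": find_open_issues}

# The curated surface is smaller, and each call returns compact context.
assert len(agent_native_tools) < len(rest_wrapper_tools)
print(find_open_issues("api"))
```

The saving is not just token count: a small surface is easier for the model to discover and reason about, which is the Discovery pillar in the framework above.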
Jeremiah Lowin, CEO of Prefect
https://twitter.com/jlowin
https://www.linkedin.com/in/jlowin
https://github.com/jlowin
Spec-Driven Development: Agentic Coding at FAANG Scale and Quality — Al Harris, Amazon Kiro
In the AI coding era, we have powerful tools, but tools still require honing to work effectively. Spec-Driven Development enables reproducible, reliable delivery, and spending time up front to improve the spec process yields the best results. Learn how the Kiro team does this, and how you can too!
https://www.linkedin.com/in/al-harris-7a755640/
DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners
Applications developed for the enterprise need to be rigorous, testable, and robust. The same is true for applications that use AI, but LLMs can make this challenging. In other words, you need to be able to program with LLMs, not just tweak prompts. In this talk we'll cover why DSPy really is all you need for building applications with LLMs. We'll dive into real-world examples where we have successfully automated manual work using an opinionated DSPy-first approach to structuring applications, covering everything from simple modules to using SoTA optimizers to measurably improve performance.
https://x.com/kmad/
**Summary**
Kevin Madura, a consultant at AlixPartners, argues that building robust enterprise AI applications requires shifting from brittle "prompt engineering" to "programming with LLMs" using **DSPy**. He contends that prompts should be treated as implementation details optimized by the system, while developers focus on defining typed interfaces (Signatures) and modular logic (Modules). The session moves from a conceptual overview of DSPy's primitives—Signatures, Modules, Adapters, and Optimizers—to a live code walkthrough. Madura demonstrates real-world use cases, including a complex pipeline that routes files by type (SEC filings vs. contracts) and a "boundary detector" that uses visual layout to segment legal documents. The talk concludes with a demonstration of how Optimizers (like MIPRO) can automatically tune these programs to outperform manual baselines, followed by a Q&A on production costs and feedback loops.
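The core idea — a typed interface whose prompt is an implementation detail — can be illustrated with a toy re-implementation of the shorthand-signature notation. This is NOT DSPy code; `parse_signature` and `render_prompt` are invented for illustration, and real DSPy Signatures, Adapters, and Optimizers do far more (typed fields, format control, automatic prompt tuning).

```python
# Toy sketch (plain Python, not DSPy) of a shorthand signature like
# "question, context -> answer": the signature declares WHAT the LM maps
# between; the prompt wording stays a swappable implementation detail.

def parse_signature(spec: str) -> tuple:
    """Split a shorthand signature into input and output field names."""
    lhs, rhs = spec.split("->")
    inputs = [f.strip() for f in lhs.split(",") if f.strip()]
    outputs = [f.strip() for f in rhs.split(",") if f.strip()]
    return inputs, outputs


def render_prompt(spec: str, template: str, **values: str) -> str:
    """Fill a prompt template from signature inputs. The template, not the
    caller, decides the wording -- so an optimizer could rewrite it without
    touching any calling code."""
    inputs, outputs = parse_signature(spec)
    missing = [f for f in inputs if f not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    fields = "\n".join(f"{f}: {values[f]}" for f in inputs)
    return template.format(fields=fields, outputs=", ".join(outputs))


sig = "question, context -> answer"
prompt = render_prompt(
    sig,
    template="Given:\n{fields}\nProduce: {outputs}",
    question="Who wrote DSPy?",
    context="DSPy came out of Stanford NLP.",
)
print(prompt)
```

In DSPy proper, an optimizer such as MIPRO searches over instructions and demonstrations for that template automatically, which is why the talk treats prompts as outputs of the system rather than inputs to it.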
**Timestamps**
00:00 Introduction & The Enterprise AI Challenge
07:12 The 6 Core Concepts of DSPy (Signatures, Modules, Adapters)
13:23 Deep Dive: Class-based vs. Shorthand Signatures
19:57 Adapters: Controlling the Prompt Format (JSON vs. BAML)
24:17 Optimizers: The "Killer Feature" for Transferability
31:08 Code Walkthrough: Setup & Model Mixing
36:24 Handling Documents: "Poor Man's RAG" with Attachments
42:10 Adapter Comparison: Improving Token Efficiency with BAML
47:20 Optimizers in Practice: Creating Datasets & Metrics
51:13 Complex Pipeline: Routing & Classifying Arbitrary Files
56:00 Advanced Use Case: PDF Boundary Detection via Visuals
01:01:22 Analyzing Optimization Results & The "DSPy Hub" Concept
01:09:02 Q&A: Handling Delayed Feedback & Online Learning
01:13:00 Conclusion
