How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR
AI models are crushing benchmarks. SWE-bench scores are climbing, and METR's measured time horizons are rising rapidly. Yet when we deployed these same models in a field study with experienced developers, they didn't speed up work. What's going on? Are benchmarks misleading us about AI capabilities? Are we missing something about how AI performs in the real world? In this talk, we'll reconcile lab and field evidence on AI capabilities. Drawing from METR's time horizon measurements and developer productivity RCT, we'll explore why impressive benchmark performance doesn't always translate to real-world impact. We'll examine potential explanations—from reliability requirements to task distribution to capability elicitation—and discuss what this means for automated AI R&D.
https://x.com/joel_bkr
Identity for AI Agents - Patrick Riley & Carlos Galan, Auth0
Implementing secure identity and access management for AI agents with Okta!
https://twitter.com/patjriley
https://twitter.com/CarlosFGalan
OpenAI + @Temporalio: Building Durable, Production Ready Agents - Cornelia Davis, Temporal
Everyone is building AI Agents, and everyone is looking for ways to build them more easily. Earlier this year, OpenAI released the OpenAI Agents SDK to bring the patterns they have found to work for building agents to the developer community. With the SDK you can define AI agents by supplying instructions (prompts), specifying which model to use (OpenAI or not), listing the tools they use (including MCP), and much more. The OpenAI Agents SDK encourages a paradigm of orchestrated micro-agents, which themselves may contain micro-orchestrations through the use of handoffs. It’s an elegant and powerful model.
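The "orchestrated micro-agents with handoffs" idea can be sketched in plain Python. This is NOT the OpenAI Agents SDK API — the `Agent` class, agent names, and the `HANDOFF:` convention below are invented for illustration; only the control flow (a triage agent delegating to specialists) mirrors the paradigm described above.

```python
# Conceptual sketch of micro-agent orchestration via handoffs.
# Plain Python, not the OpenAI Agents SDK; `handle` stands in for an LLM call.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Agent:
    name: str
    instructions: str                      # the agent's prompt
    handle: Callable[["Agent", str], str]  # stand-in for the model invocation
    handoffs: list["Agent"] = field(default_factory=list)


def run(agent: Agent, request: str) -> str:
    """Run an agent; if its output names a handoff target, delegate to it."""
    output = agent.handle(agent, request)
    for target in agent.handoffs:
        if output == f"HANDOFF:{target.name}":
            return run(target, request)
    return output


billing = Agent("billing", "Answer billing questions.",
                handle=lambda a, req: f"[{a.name}] resolved: {req}")
support = Agent("support", "Answer support questions.",
                handle=lambda a, req: f"[{a.name}] resolved: {req}")
triage = Agent("triage", "Route requests to the right specialist.",
               handle=lambda a, req: ("HANDOFF:billing" if "invoice" in req
                                      else "HANDOFF:support"),
               handoffs=[billing, support])

print(run(triage, "Where is my invoice?"))      # routed to billing
print(run(triage, "The app crashes on start"))  # routed to support
```

In the real SDK the model itself decides when to hand off; here a keyword check plays that role purely to keep the sketch runnable.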
But a good AI Agents programming model is not enough. These agents are ultimately wildly distributed systems and are plagued with all of the problems such systems bring.
- How can they persevere through flaky networks?
- How can they function when LLMs are rate limited?
- How can they run for long periods of time (hours, days, weeks, months) when infrastructure is rarely stable that long?
In this workshop, we’ll show you how. Temporal is an open source (MIT license) durable execution framework that brings resilience to AI agents, and in this workshop we’ll show you how it’s done with the OpenAI Agents SDK. Spoiler: OpenAI and Temporal have done all of the heavy lifting for you with an integration announced earlier this year.
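The simplest slice of the reliability problems listed above can be sketched with a retry loop. This is NOT Temporal code — durable execution goes much further, persisting every step's state so a workflow survives process crashes and can wait out rate limits for hours or weeks — but it shows the kind of failure handling the framework takes off your hands; the flaky-endpoint simulation is invented for illustration.

```python
# Toy sketch of retrying a flaky "activity" (e.g. an LLM call).
# Plain Python, not Temporal: durable execution would also checkpoint state.

import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(activity: Callable[[], T], attempts: int = 5,
                 backoff: float = 0.0) -> T:
    """Retry an activity until it succeeds or attempts are exhausted."""
    last_error = None
    for attempt in range(attempts):
        try:
            return activity()
        except Exception as exc:  # flaky network, 429 rate limit, ...
            last_error = exc
            time.sleep(backoff * (2 ** attempt))  # exponential backoff
    raise RuntimeError("activity failed") from last_error


# Simulate an LLM endpoint that fails twice before answering.
calls = {"n": 0}

def flaky_llm_call() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "agent output"

print(with_retries(flaky_llm_call))  # "agent output" after 3 attempts
```

A retry loop only helps while the process stays alive; the point of durable execution is that the same guarantee holds even when it doesn't.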
Oh, and OpenAI themselves use Temporal to help make several of their products production ready (image gen and Codex, for example).
Not using the OpenAI Agents SDK? Do come anyway; the foundational concepts carry over to different agent frameworks (and more integrations are coming all the time).
https://twitter.com/cdavisafc
https://www.linkedin.com/in/corneliadavis
Your MCP Server is Bad (and you should feel bad) - Jeremiah Lowin, Prefect
Too many MCP servers are simply glorified REST wrappers, regurgitating APIs that were designed for SDKs, not agents. This leads to confused LLMs, wasted tokens, and demonstrably poor performance. If you've ever pointed an MCP generator at an OpenAPI spec and called it a day, this talk is your intervention.
Like any product, great MCP servers are the result of careful design. This talk shares the hard-won lessons from creating FastMCP, the most popular framework for building MCP servers (and yes, for generating them, too). The secret is to stop thinking about endpoints and start thinking about products. We will cover the three pillars of agent-native product design—Discovery, Iteration, and Context—providing an actionable framework for curating context into small, highly effective surface areas that lead to better AI outcomes.
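The contrast between an endpoint-mirroring server and a curated, agent-native surface can be sketched in plain Python. This is NOT FastMCP code, and every tool name, endpoint, and data record below is hypothetical — the sketch only illustrates the design argument: fewer, task-shaped tools that return ready-to-use context.

```python
# Conceptual sketch (plain Python, not FastMCP). All names are hypothetical.

from typing import Optional

# Anti-pattern: one tool per REST endpoint, mirroring an OpenAPI spec.
# The agent must discover, page through, and join raw responses itself.
rest_wrapper_tools = [
    "list_projects", "get_project", "list_issues", "get_issue",
    "list_comments", "get_comment", "list_users", "get_user",
]

# Agent-native design: one high-level tool shaped around the agent's task,
# doing the joining and filtering server-side.
def find_open_issues(project: str, assignee: Optional[str] = None) -> list:
    """One curated tool replacing several raw endpoints (hypothetical data)."""
    issues = [
        {"project": "api", "assignee": "ana", "title": "500 on login", "open": True},
        {"project": "api", "assignee": "ben", "title": "Slow search", "open": False},
    ]
    return [i for i in issues
            if i["open"] and i["project"] == project
            and (assignee is None or i["assignee"] == assignee)]

agent_native_tools = {"find_open_issues": find_open_issues}

# The curated surface is smaller, and each call returns compact context.
assert len(agent_native_tools) < len(rest_wrapper_tools)
print(find_open_issues("api"))
```

The saving is not just token count: a small surface is easier for the model to discover and reason about, which is the Discovery pillar in the framework above.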
Jeremiah Lowin, CEO of Prefect
https://twitter.com/jlowin
https://www.linkedin.com/in/jlowin
https://github.com/jlowin
Spec-Driven Development: Agentic Coding at FAANG Scale and Quality — Al Harris, Amazon Kiro
In the AI coding era, we have powerful tools, but tools still require honing to work effectively. Spec-Driven Development enables reproducible, reliable delivery, and spending time up front to improve the spec process yields the best results. Learn how the Kiro team does this, and how you can too!
https://www.linkedin.com/in/al-harris-7a755640/
DSPy: The End of Prompt Engineering - Kevin Madura, AlixPartners
Applications developed for the enterprise need to be rigorous, testable, and robust. The same is true for applications that use AI, but LLMs can make this challenging. In other words, you need to be able to program with LLMs, not just tweak prompts. In this talk we'll cover why DSPy really is all you need for building applications with LLMs. We'll dive into real-world examples where we have successfully automated manual work using an opinionated DSPy-first approach to structuring applications, covering everything from simple modules to using SoTA optimizers to measurably improve performance.
https://x.com/kmad/
**Summary**
Kevin Madura, a consultant at AlixPartners, argues that building robust enterprise AI applications requires shifting from brittle "prompt engineering" to "programming with LLMs" using **DSPy**. He contends that prompts should be treated as implementation details optimized by the system, while developers focus on defining typed interfaces (Signatures) and modular logic (Modules). The session moves from a conceptual overview of DSPy's primitives—Signatures, Modules, Adapters, and Optimizers—to a live code walkthrough. Madura demonstrates real-world use cases, including a complex pipeline that routes files by type (SEC filings vs. contracts) and a "boundary detector" that uses visual layout to segment legal documents. The talk concludes with a demonstration of how Optimizers (like MIPRO) can automatically tune these programs to outperform manual baselines, followed by a Q&A on production costs and feedback loops.
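The core idea — a typed interface whose prompt is an implementation detail — can be illustrated with a toy re-implementation of the shorthand-signature notation. This is NOT DSPy code; `parse_signature` and `render_prompt` are invented for illustration, and real DSPy Signatures, Adapters, and Optimizers do far more (typed fields, format control, automatic prompt tuning).

```python
# Toy sketch (plain Python, not DSPy) of a shorthand signature like
# "question, context -> answer": the signature declares WHAT the LM maps
# between; the prompt wording stays a swappable implementation detail.

def parse_signature(spec: str) -> tuple:
    """Split a shorthand signature into input and output field names."""
    lhs, rhs = spec.split("->")
    inputs = [f.strip() for f in lhs.split(",") if f.strip()]
    outputs = [f.strip() for f in rhs.split(",") if f.strip()]
    return inputs, outputs


def render_prompt(spec: str, template: str, **values: str) -> str:
    """Fill a prompt template from signature inputs. The template, not the
    caller, decides the wording -- so an optimizer could rewrite it without
    touching any calling code."""
    inputs, outputs = parse_signature(spec)
    missing = [f for f in inputs if f not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    fields = "\n".join(f"{f}: {values[f]}" for f in inputs)
    return template.format(fields=fields, outputs=", ".join(outputs))


sig = "question, context -> answer"
prompt = render_prompt(
    sig,
    template="Given:\n{fields}\nProduce: {outputs}",
    question="Who wrote DSPy?",
    context="DSPy came out of Stanford NLP.",
)
print(prompt)
```

In DSPy proper, an optimizer such as MIPRO searches over instructions and demonstrations for that template automatically, which is why the talk treats prompts as outputs of the system rather than inputs to it.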
**Timestamps**
00:00 Introduction & The Enterprise AI Challenge
07:12 The 6 Core Concepts of DSPy (Signatures, Modules, Adapters)
13:23 Deep Dive: Class-based vs. Shorthand Signatures
19:57 Adapters: Controlling the Prompt Format (JSON vs. BAML)
24:17 Optimizers: The "Killer Feature" for Transferability
31:08 Code Walkthrough: Setup & Model Mixing
36:24 Handling Documents: "Poor Man's RAG" with Attachments
42:10 Adapter Comparison: Improving Token Efficiency with BAML
47:20 Optimizers in Practice: Creating Datasets & Metrics
51:13 Complex Pipeline: Routing & Classifying Arbitrary Files
56:00 Advanced Use Case: PDF Boundary Detection via Visuals
01:01:22 Analyzing Optimization Results & The "DSPy Hub" Concept
01:09:02 Q&A: Handling Delayed Feedback & Online Learning
01:13:00 Conclusion
