
The Week AI Agents Outscored ER Doctors — Jun 19, 2026

Show notes

Two AI agents outscored emergency doctors on real cases, then ordered twice the bloodwork.

Run time: 5:31

In today's episode:

  1. Two AI agents match or beat doctors in Nature
  2. Google's AMIE outscores physicians on disease management
  3. AI flags small pancreatic tumors on routine CT
  4. Joint Commission rolls out first hospital AI governance cert
  5. Doctors split on patients reading scans with AI
  6. Ambient AI scribes head for every VA hospital
  7. Anthropic says Fable and Mythos return within days
  8. Researchers reframe AI agents as expanding, not replacing, engineers

TL;DR:

  • Two landmark Nature papers (June 17) put autonomous AI agents at or above doctor level — but on constructed cases with known answers, no real patients, and one padded its lead by over-ordering tests.
  • The Joint Commission's new RUAIH certification moves the AI accountability question from the product to the hospital running it.
  • Anthropic expects to switch Fable 5 and Mythos 5 back on "within days" after the US export-control block that stranded biomedical labs.

Sources cited:

Subscribe: YouTube

medAI Times is for educational and informational purposes only. The content does not constitute medical advice, diagnosis, treatment recommendation, or professional clinical guidance. Consult qualified healthcare professionals and refer to official sources before making clinical, research, regulatory, or business decisions.

Transcript

Auto-generated from the episode audio.

Two AI agents outscored emergency doctors on real cases, then ordered twice the bloodwork. Welcome to MedAI Times Podcast, your daily update on medical AI. Don't forget to like and subscribe. Here's what crossed the desk since Monday.

Two autonomous AI agents matched or beat doctors in back-to-back Nature papers. Google's AMIE outscored physicians on disease management. An AI flagged small pancreatic tumors on routine CT.

The Joint Commission rolled out the first AI governance certificate built for hospitals. Doctors are split on patients reading their own scans with AI. Ambient scribes are heading for every VA medical center. Anthropic says Fable and Mythos come back within days.

And researchers are arguing AI agents expand software engineering rather than replace it. On Monday, we said frontier models were already beating the specialized clinical tools. This week, two of them stopped answering quiz questions and started ordering the labs.

The big one. On June 17th, Nature published two papers that move medical AI from talking to doing. The first is MIRA, an autonomous agent from Jacob Cather's group built on GPT-4o.

It was let loose on 500 real emergency department cases with 11 tools and more than 85,000 possible actions. It could pull the history, read the exam, order labs, blood cultures, scans, medications, even triage for admission.

Overall diagnostic accuracy was 87.8% against 78.1% for the board-certified physicians. On appendicitis, it hit 100%. Here's the catch the headline skipped.

The independent experts noted that MIRA's edge came mostly from conditions with clean test results, and that it ordered roughly twice as many blood tests as the humans to get there. None of this touched a real patient. Signal, but with an asterisk.

These are constructed cases where the answer was already known, and overtesting is not free in a real waiting room. The second Nature paper is AMIE from Google DeepMind, and it goes after the harder problem: managing a patient over time, not just naming the diagnosis.

In a randomized, blinded virtual exam across multiple visits, AMIE scored 96% on treatment preciseness against 62% for primary care doctors, and 93% on guideline alignment by the third visit.

Specialists and patient actors preferred it 47% of the time versus 7% for the physicians. Same caveat as MIRA: simulated cases, scripted actors. Impressive on the bench, unproven at the bedside.

Quieter but worth your time. A team at Kobe University publishing in Radiology on June 16th found AI models matched physicians at spotting pancreatic cancer on CT and held up best on the small early tumors that humans miss.

Pancreatic cancer is the one where catching it weeks earlier genuinely changes survival. The result echoes Mayo's pancreatic detection work from last month, this time from a different group on a different data set, which is exactly the kind of replication this field has been short on.

On the rules, the Joint Commission, which accredits most U.S. hospitals, launched its Responsible Use of AI in Healthcare Certification. Read the fine print, because the target changed. It does not certify any individual AI product.

It certifies the hospital's governance around AI, data management, bias checks, monitoring, and disclosure. The accountability question is shifting from the vendor that built the model to the institution that switched it on.

Voluntary for now. These things rarely stay voluntary. One that cuts close to the exam room. A new survey out of the imaging world finds radiologists genuinely split on patients using AI to interpret their own scan results.

Some see a more informed patient walking in with good questions. Others see a frightened one misreading a chatbot's confident guess about an incidental nodule. Patients are already doing it, with or without permission.

So the open question is whether the report gets written for the clinician or the person it's actually about. Now the Claude story everyone asked about. After the U.S. government's export control order forced Anthropic to pull Fable 5 and Mythos 5 for all customers three days after launch, the company now says the lights come back on soon.

Speaking in Seoul on June 18th, Anthropic's managing director for international said they are very confident the models return in the coming days. That matters for medicine, because Mythos was the version a handful of biology labs were using for molecular hypotheses and drug design.

Those programs have been stranded for a week. Spotlight: the thread tying this week together is agentic medical AI, systems that don't just answer but act on an electronic record, choosing from tens of thousands of possible orders.

The promise is obvious. So is the failure mode the experts keep naming: an agent that looks accurate while quietly ordering twice the tests on cases where someone already knew the answer. That's the week.

Every paper, survey, and source link is in the description. We'll be back Monday. And if an AI agent reads this to you first, tell it to order fewer labs. Thanks for listening. Find us on YouTube and your favorite podcast app.

See you tomorrow.