The Week AI Beat Doctors, Then Hit a Wall — Jun 22, 2026
Show notes
AI agents beat real doctors on diagnosis this week, then flunked the actual ward.
Run time: 6:12
In today's episode:
- AI agents MIRA and AMIE out-diagnose doctors on test cases
- Biggest real-world benchmark: top model flunks half of clinical tasks
- o3 cracks 18 unsolved rare-disease cases
- Utah lets AI renew prescriptions; oversight thin
- FDA's cleared-AI list passes 1,500
- Hetairos AI out-classifies five neuropathologists
- Fable 5 leaves Anthropic's free plans tomorrow
- GPT-5.6 and Gemini 3.5 Pro still stuck in preview
TL;DR:
- The week's whole story in one line: autonomous agents (MIRA, AMIE) beat doctors on constructed cases, but BRIDGE — the largest real-world clinical benchmark yet — shows the same model class scores 92% on exams and 44.8% on actual EHR tasks across nine languages. Lab brilliance, bedside gap.
- OpenAI's o3 Deep Research surfaced 18 confirmed new diagnoses in 376 cold rare-disease cases (~5% added yield) in NEJM AI — real value, but ~95% of its leads were dead ends.
- Regulation is splitting in two directions: Utah now lets AI autonomously renew prescriptions with thin federal oversight, while the FDA's cleared-AI list reached 1,524, about 96% of it via the 510(k) "looks like an older device" pathway.
Sources cited:
- Nature / MIRA
- Nature / AMIE
- Mass General Brigham
- OpenAI / NEJM AI
- Nature Medicine
- Cardiovascular Business
- The Imaging Wire
- Nature Cancer
- Anthropic
- Snowflake
Subscribe: YouTube
medAI Times is for educational and informational purposes only. The content does not constitute medical advice, diagnosis, treatment recommendation, or professional clinical guidance. Consult qualified healthcare professionals and refer to official sources before making clinical, research, regulatory, or business decisions.
Transcript
Auto-generated from the episode audio.
AI agents beat real doctors on diagnosis this week, then flunked the actual ward. Welcome to MedAI Times Podcast, your daily update on medical AI. Don't forget to like and subscribe. Here's what crossed the medical AI desk since Friday.
Two autonomous agents out-diagnosed emergency room doctors on hundreds of cases. A new benchmark says the same class of model handles fewer than half of real clinical tasks. A reasoning model cracked 18 rare disease cases that had stumped specialists for years.
Utah is letting an AI renew prescriptions on its own, and the oversight rules are thin. The FDA's cleared AI tally crossed 1,500. And Anthropic's most powerful public model leaves the free plan tomorrow.
On Friday, we watched two agents out-diagnose ER doctors. This week, the bill came due. Start with those agents, because they framed the whole week. In Nature, Jakob Kather's group put a system called MIRA on 500 real emergency department cases.
They gave it 11 tools and more than 85,000 possible actions. Order labs, image, prescribe, admit. Overall diagnostic accuracy came in at 87.8%, against 78.1% for physicians.
On appendicitis, it hit 100%. The same day, Google DeepMind published AMIE, an agent for ongoing disease management, not just the first guess. In a blinded virtual exam against 21 primary care doctors, it scored 96% on treatment precision versus 62%.
Striking numbers. Now, the fine print the Science Media Center flagged: these were constructed cases with known answers. The lead came mostly from clear-cut conditions, and MIRA ordered about twice as many blood tests as the doctors did.
So, signal, but read it carefully. The agents shine on clean textbook cases. The bedside is messier, and that's exactly where the next story lands. Because Mass General Brigham then published the largest clinical AI benchmark to date in Nature Biomedical Engineering.
It's called BRIDGE, and it scored 95 language models on real records: actual electronic health records, case reports, and patient-doctor consultations across nine languages and 14 specialties, not multiple choice.
The headline number is brutal. The single best model scored 92% on standardized medical exams and 44.8% on BRIDGE, fewer than half of real clinical tasks. Performance fell hardest for non-English-speaking patients.
The takeaway for anyone buying these tools: the leaderboard score a vendor quotes tells you almost nothing about how the thing performs on your patients. One place the reasoning models genuinely delivered: cold cases.
In NEJM AI, Boston Children's, Harvard, and OpenAI ran the o3 Deep Research model over 376 rare pediatric cases that stayed unsolved after full specialist workup. The model didn't diagnose; it generated evidence-linked gene hypotheses, and physicians confirmed 18 new diagnoses in the lab, about a 5% bump in yield on cases everyone had given up on.
The honest caveat from the authors: roughly 95% of the model's leads went nowhere. So it's a tireless idea generator for the hardest cases, not an oracle, and someone still has to chase every lead.
Now the policy edge of all this autonomy. A Nature Medicine piece this month dissected Utah's clinical AI sandbox, where the state cleared a company to let software autonomously renew prescription refills, the first program of its kind.
The authors, including Harvard's Aaron Kesselheim, make the uncomfortable point. Traditional FDA review is a one-time snapshot, but these systems drift over time, and a patchwork of state sandboxes is now filling the gap where federal oversight of live, learning clinical
AI doesn't exist. Autonomous prescribing is here. The monitoring framework for it mostly isn't. Meanwhile, the clearance machine keeps humming. The FDA refreshed its authorized AI list on June 16th.
1,524 cleared algorithms as of late March. Radiology still owns it with 1,163. Cardiology added 22 more to reach 225. Pathology, the field everyone calls the next frontier, has just nine.
And remember, about 96% of these come through the 510(k) pathway, which tests whether a device resembles an older one, not whether it helps a patient. Quantity is not the same as evidence. Two quick notes from the general AI world.
Anthropic's Fable 5, its most capable public model, the one some biomedical labs were using for molecular hypotheses before a government export control order yanked it offline this month, comes off the free Pro Max and Team plans tomorrow, June 23rd.
The company says it'll restore it as a standard subscription model when capacity allows. Translation: enjoy it while it lasts. And the long-promised model flood is running late. GPT-5.6 and Google's Gemini 3.5 Pro were both penciled in for June.
Both are still in limited preview as the month closes. For clinical work, the spec that matters is context length. Gemini's 2-million-token window is the one to watch for feeding whole patient records into a single prompt.
Spotlight: regulatory sandboxes for autonomous clinical AI. A sandbox lets a state grant temporary, supervised permission for a tool that doesn't fit existing rules, like Utah's autonomous refills.
The promise is faster learning under watch. The risk is that 50 states write 50 different rule books while no one monitors whether the model still works six months in. That's the wrap.
Every study, clearance, and source is linked in the description. Go read the caveats yourself. They're where the real story lives. Thanks for listening. Find us on YouTube and your favorite podcast app. See you tomorrow.