Anthropic's Mythos AI Designs Drugs 10x Faster — Jun 10, 2026
Show notes
Anthropic just handed drug labs a model that designs molecules ten times faster.
Run time: 13:30
In today's episode:
- Ambient AI scribe cuts clinician burnout in randomized trial
- Philips survey: AI saves sixteen workdays, training lags
- AIRS SwiftMR wins wider FDA MRI clearance
- Mount Sinai launches AI cancer trial-matching tool
- Nature Medicine: AI diagnosis accuracy swings fourfold
- Topol says every mammogram should use AI
- Anthropic ships Fable 5, its strongest public model
- Claude Code adds parallel-subagent dynamic workflows
- GPT-5.6 and Gemini 3.5 Pro expected this month
TL;DR:
- A pragmatic RCT of an ambient AI scribe cut clinician exhaustion and documentation time without degrading note quality or billing accuracy — the kind of evidence the field keeps promising and rarely delivers.
- Anthropic released Claude Fable 5 (and the unlocked Mythos 5 for vetted bio/security researchers); scientists preferred Mythos 5's molecular-biology hypotheses over Opus-class output ~80% of the time, with a documented ~10x drug-design speedup.
- Philips' Future Health Index says AI already saves clinicians 16 working days a year — while 70% of those same clinicians call their AI training inadequate or nonexistent.
Sources cited:
- NEJM AI
- GlobeNewswire
- Diagnostic Imaging
- Mount Sinai
- Nature Medicine
- Ground Truths
- Anthropic
- Releasebot
- Essa Mamdani
- Primer
Subscribe: YouTube
medAI Times is for educational and informational purposes only. The content does not constitute medical advice, diagnosis, treatment recommendation, or professional clinical guidance. Consult qualified healthcare professionals and refer to official sources before making clinical, research, regulatory, or business decisions.
Transcript
Auto-generated from the episode audio.
Anthropic just handed drug labs a model that designs molecules 10 times faster. Welcome to MedAI Times Podcast, your daily update on medical AI. Don't forget to like and subscribe. Ambient AI scribe cuts clinician burnout in randomized trial.
Philips survey: AI saves 16 workdays, training lags. AIRS SwiftMR wins wider FDA MRI clearance. Mount Sinai launches AI cancer trial-matching tool. Nature Medicine: AI diagnosis accuracy swings fourfold.
Topol says every mammogram should use AI. Anthropic ships Fable 5, its strongest public model. Claude Code adds parallel sub-agent dynamic workflows. GPT 5.6 and Gemini 3.5 Pro expected this month. Yesterday we looked at an eye scan reading your heart.
Today a randomized trial asked whether the AI writing your doctor's notes actually helps. And for once the answer got measured. Yeah, we finally have some real data. Right. So we have a clear mission for today's deep dive, which is separating rigorous evidence from all the self-reported claims we see in medical AI.
And then we're going to unpack Anthropic's new dual model release. It's this strategy that massively accelerates drug design for the private lab while keeping tight safeguards on the public model. The through line here is accountability, really.
I mean, the era of the shiny vendor slide deck is ending. Whether we're evaluating a microphone in a patient room or, you know, a trillion parameter model generating biology hypotheses, the industry is finally being forced to show its math.
Let's start right there at the bedside. The New England Journal of Medicine AI just published this paper on an ambient AI scribe out of the University of Wisconsin. The NCT06517082 trial.
Yeah, that's the one. But I've got to push back right out of the gate here. We've been hearing that ambient scribes cure burnout for like two solid years now. Right. Every vendor claims their tool saves doctors hours.
So why are we treating this Wisconsin study like it's breaking news? Well, because it's an actual pragmatic, randomized, controlled trial. They didn't just hand a new toy to some friendly early adopters and send out a survey asking, do you feel better? Right. Which happens all the time.
Exactly. A pragmatic trial means they tested this in the messy, completely unpredictable reality of actual clinical practice. They actually randomized which physicians got the AI tool and which ones didn't, and then rigorously measured the outcomes.
OK, but how do they measure that? Because burnout is so subjective. You can't just draw blood to see if a doctor is exhausted. No, you can't. So they used a mix of standardized psychological instruments and hard system data. For the subjective side, they measured domains like practitioner work exhaustion and interpersonal disengagement.
Which basically means, are you looking at your patient or just staring at your keyboard? Pretty much. But they paired that with objective data pulled right from the EHR audit logs. They tracked literal minutes spent in the charting system, specifically pajama time. The hours doctors spend finishing notes at home after the kids are asleep.
Right. And the numbers showed a genuine drop. Exhaustion went down, disengagement went down, and the actual minutes spent typing dropped. But wait, if the AI is suddenly doing the heavy lifting of writing the note, doesn't the doctor get kind of complacent?
I mean, if a machine summarizes the 15 minute complex conversation, you'd assume something gets lost or hallucinated. That's the most crucial detail of the whole trial. They achieved this reduction in documentation time without degrading diagnosis accuracy or hurting billing compliance or dropping the quality of the notes.
Wow. Really? Yeah. The mechanism relies on the physician remaining the final editor. The AI parses the transcript and maps it out, but the doctor still has to review and sign it. The trial proved that this review step doesn't fall prey to massive automation bias. At least not enough to degrade the quality of care.
Right. They even published a playbook detailing exactly how a health system should monitor this in live practice, because you absolutely cannot just turn these ambient tools on and look away. Yeah. But solving the burnout problem in a tightly monitored Wisconsin trial doesn't necessarily reflect what happens at scale, right?
Like, let's look at the broader adoption numbers from the new Philips Future Health Index. It's a massive survey of clinicians and they're reporting saving the equivalent of 16 working days a year using AI tools. Half say they're seeing about eight more patients a week and 39 percent claim the AI actually flagged or prevented a medical error recently.
We really need a heavy dose of skepticism with those specific numbers, though. You think so? Yeah. Remember the source. This is self-reported data from a major device maker. Humans are notoriously terrible at estimating how much time software actually saves them. They answer based on how it makes them feel, not, you know, by staring at a stopwatch. Fair enough. We should treat those 16 days saved as a directional signal,
not an audited reality like the Wisconsin data. 70 percent of those surveyed clinicians say their AI training is inadequate or just completely nonexistent.
Wow. Yeah. It's like handing someone the keys to a ridiculously fast car. But 70 percent of them admit they never took driver's ed. Exactly. They don't know how the brakes work. It's a massive recipe for clinical liability.
If you're seeing eight extra patients a week, but you don't understand the failure modes of the algorithm, you're just increasing the speed at which you might make a catastrophic error. Right. But saving time at the desk doesn't help if the physical machines in the hospital are the actual bottleneck.
Let's look at the hardware side quickly. AIRS Medical just secured an expanded FDA 510(k) clearance for their SwiftMR software. Right. Which allows their software to sit on top of the deep learning reconstruction tools already built into MRI scanners.
And this is the operational lever that hospital admins just obsess over. Oh, absolutely. Historically, if you speed up an MRI scan, you gather less data and you get a blurry image. SwiftMR uses AI to mathematically predict and fill in that missing data.
So you run a much faster scan and the AI reconstructs it to look like it took 45 minutes. Exactly. It's a pure throughput play. You get more patients through the same multimillion dollar magnet every day. Makes total sense. And over at Mount Sinai, they're trying to unbottleneck a different kind of pipeline. Clinical trials.
They just rolled out an AI platform to match oncology patients to open cancer trials. Trial recruitment is such a chronic failure point in medicine. The eligibility criteria for modern oncology trials are labyrinthine.
You're not just looking for breast cancer. You're looking for a specific genetic mutation, a specific prior treatment failure. And making sure they haven't taken some contraindicated steroid in the last 60 days. Exactly. And traditional keyword searches just completely break down under that level of nested logic. A keyword search sees the word steroid and flags the file completely missing that it says patient denied taking any steroid.
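The keyword failure described here can be sketched in a few lines. This is a minimal illustration, not Mount Sinai's system: the function names, the cue list, and the five-token window are all assumptions made for the example, and the negation check is a crude rule in the spirit of NegEx-style heuristics rather than a trained NLP model.

```python
import re

# Hypothetical cue list for illustration; real systems use far richer rules or trained models.
NEGATION_CUES = {"denied", "denies", "no", "not", "never", "without"}


def keyword_flag(note: str, term: str) -> bool:
    """Naive keyword search: flags the note if the term appears anywhere."""
    return term.lower() in note.lower()


def negation_aware_flag(note: str, term: str, window: int = 5) -> bool:
    """Flags the note only if the term appears without a negation cue
    among the `window` tokens immediately before it."""
    tokens = re.findall(r"[a-z']+", note.lower())
    for i, token in enumerate(tokens):
        if token.startswith(term.lower()):
            preceding = tokens[max(0, i - window):i]
            if not any(p in NEGATION_CUES for p in preceding):
                return True
    return False


negated_note = "Patient denied taking any steroid in the last 60 days."
affirmed_note = "Patient started a steroid taper two weeks ago."

print(keyword_flag(negated_note, "steroid"))          # keyword search flags it
print(negation_aware_flag(negated_note, "steroid"))   # negation check does not
print(negation_aware_flag(affirmed_note, "steroid"))  # genuine mention is flagged
```

Even this toy rule has obvious gaps: the window can reach across sentence boundaries, and it knows nothing about dates or dosages. That is part of why nested eligibility criteria call for models that read context rather than match strings.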
Right. The context is totally lost. But the NLP models Mount Sinai is using are designed to understand context and negation buried inside messy physician notes. Okay. But if Mount Sinai is trusting an AI to parse that kind of complex logic for high stakes cancer trials, how do we know the AI is actually getting it right?
Because we have this new perspective piece in Nature Medicine showing diagnostic accuracy swinging wildly. From 25 percent to 98 percent. Right. You cannot run a health system on a tool that might be 98 percent accurate or might be 25 percent accurate, depending on the day.
Well, that fourfold spread tells us almost nothing about the models themselves, but it tells us everything about how deeply flawed our evaluation designs are. The authors are pointing out data set shift. Okay. Break that down for me. So an AI model might score 98 percent on a perfectly curated data set in a lab.
But when you deploy that exact same model into a real hospital where the lighting is different, the population is diverse and the records are full of typos. Accuracy just plummets. Absolutely. A single headline accuracy number is totally meaningless unless you know the specific evaluation conditions.
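A small sketch of why a single headline number hides the evaluation conditions. The counts below are invented for illustration: they borrow the 25 percent and 98 percent endpoints mentioned in the episode, but they are not data from the Nature Medicine piece, and the condition names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_condition(results):
    """Group (condition, was_correct) pairs and return accuracy per condition."""
    totals = defaultdict(lambda: [0, 0])  # condition -> [correct, total]
    for condition, correct in results:
        totals[condition][0] += int(correct)
        totals[condition][1] += 1
    return {cond: hits / n for cond, (hits, n) in totals.items()}


# Invented counts: the same hypothetical model scored under two evaluation conditions.
results = (
    [("curated_lab", True)] * 98 + [("curated_lab", False)] * 2
    + [("live_hospital", True)] * 25 + [("live_hospital", False)] * 75
)

per_condition = accuracy_by_condition(results)
pooled = sum(correct for _, correct in results) / len(results)
spread = per_condition["curated_lab"] / per_condition["live_hospital"]

print(per_condition)          # accuracy reported separately for each condition
print(f"pooled={pooled:.3f}")  # one blended number that describes neither setting
print(f"spread={spread:.2f}x")
```

The pooled figure sits between the two conditions and matches neither, which is the practical argument for reporting accuracy alongside the setting it was measured in.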
That context makes Eric Topol's latest argument kind of jarring then. He's publicly stating that routine AI in screening mammography should absolutely be the standard of care right now. Just use it. Yeah.
How does he justify that, given the evaluation uncertainty we literally just talked about? Because the evidence tier for breast screening AI is in a completely different universe. Mammography isn't relying on a small retrospective lab test.
It has actual trial data. Massive, large scale, prospective, randomized evidence. He's citing a trial involving over 100,000 women. At that scale, you effectively neutralize the data set shift problem. Oh, wow. Yeah.
When you have peer reviewed data of that magnitude, proving the AI catches cancers human radiologists miss, the "just use it" stance is incredibly grounded. OK, so we have rigorously proven models on one side and untested algorithms on the other. Let's do a signal check on Anthropic's new release strategy, which kind of straddles both. I'll just say it's a split decision. For the clinician at the bedside
this week, Fable 5 is mostly noise, but for the drug discovery lab, a model that scientists prefer four times out of five on novel biology hypotheses, that is pure signal. It really is. Anthropic has introduced a dual model architecture that forces us to rethink how these capabilities are distributed.
On the public side, you have Fable 5. It sits above their previous Opus model, priced at $10 input and $50 output. Right. And it's free for pro, team and enterprise tiers from June 9th through the 22nd. But the real story is Mythos 5.
And just to be clear, Mythos 5 is the exact same underlying neural network as Fable 5, right? They share the same brain. They are the same core model. But Mythos 5 has its safety guard rails heavily modified or lifted entirely. It's locked behind closed doors, available exclusively to vetted partners like the researchers at Project Glasswing.
I have to admit, that feels inherently risky. They built a hyper capable biology brain and they're releasing it to private scientists while like putting a blindfold on the version the public gets. The results for the vetted partners are wild, though.
They're documenting a 10x speedup in drug design. Moving from hypotheses that took weeks down to days or even hours. That's staggering. But the risk concern is exactly why they split the release. You cannot hand the public a model capable of accelerating drug design by 10x because that same biological reasoning can be inverted to design a highly lethal novel pathogen. Right. So how are they actually enforcing that split?
Are they lobotomizing the public Fable 5 model during training? Or is it like giving the model a localized immune system? The immune system metaphor is actually highly accurate. And this brings us to our spotlight topic. Constitutional classifiers.
OK, what are those? Historically, AI labs tried to bake refusal training into the core weights of the model. But if you heavily train a model to refuse virus talk, it degrades its overall ability to reason about benign biology.
It gets nervous and just overcorrects. Exactly. So instead of messing with the core DNA of Fable 5, they build lightweight guardrail models, the classifiers, that sit on the perimeter. If a user asks Fable 5 for restricted chemistry or biology, the classifier intercepts it. And Fable 5 never even sees the prompt.
Right. It automatically routes the request to a safer, older model like Opus 4.8 to safely generate the refusal. That makes a lot of sense. You don't have to spend 10 million dollars retraining the massive Fable 5 model from scratch if a new threat emerges.
Exactly. You just update the instructions for the lightweight layer. But the calibration of that immune system has to be nearly impossible. If it's too aggressive, it blocks a legitimate oncologist. But if it's too loose, someone gets synthesis instructions for something dangerous.
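The perimeter pattern described in this segment can be sketched as a small routing function. This follows only the episode's description and is not Anthropic's implementation: every name here is hypothetical, the models are stand-in functions, and the keyword-based risk scorer is a toy placeholder for what would in practice be a trained classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    handler: str  # which model handled the prompt: "primary" or "fallback"
    reply: str


def make_gateway(
    risk_score: Callable[[str], float],
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    threshold: float = 0.5,
) -> Callable[[str], Route]:
    """Build a gateway that scores each prompt before any model sees it.
    Prompts at or above the threshold never reach the primary model."""
    def handle(prompt: str) -> Route:
        if risk_score(prompt) >= threshold:
            return Route("fallback", fallback(prompt))
        return Route("primary", primary(prompt))
    return handle


# Toy stand-ins, all hypothetical.
RESTRICTED_PHRASES = {"enhance transmissibility", "synthesis route for a nerve agent"}


def toy_risk_score(prompt: str) -> float:
    text = prompt.lower()
    return 1.0 if any(phrase in text for phrase in RESTRICTED_PHRASES) else 0.0


def primary_model(prompt: str) -> str:
    return f"[primary model answer to: {prompt}]"


def fallback_model(prompt: str) -> str:
    return "I can't help with that request."


gateway = make_gateway(toy_risk_score, primary_model, fallback_model)
benign = gateway("How do statins lower LDL cholesterol?")
blocked = gateway("How would someone enhance transmissibility of a virus?")
print(benign.handler, blocked.handler)
```

The `threshold` parameter is where the calibration problem lives: lower it and legitimate clinical questions get routed to the refusal path, raise it and risky prompts reach the primary model. Updating the scorer or the threshold does not require touching the primary model at all, which is the appeal of the design.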
I mean, the vocabulary for curing a disease and creating one is practically identical. That tension is a defining open question for medicine in the generative AI era right now. Right. Well, let's quickly sweep the rest of the general AI ecosystem.
Claude Code just launched a new dynamic workflows mode in preview. Instead of the model just answering in a single threaded way, it can now fan out tens to hundreds of parallel subagents. Which fundamentally alters what a researcher can do.
Think about building a clinical literature pipeline. Previously, it was like having one extremely fast intern reading papers one by one. Dynamic workflows turn that into a fleet of 100 interns sweeping the database simultaneously. That's huge.
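The "fleet of interns" idea is a standard fan-out pattern. The sketch below shows it with Python's standard library only. It does not use or represent the Claude Code API: the worker is a hypothetical stand-in for a subagent, and the papers are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def review_paper(paper: dict) -> dict:
    """Hypothetical stand-in for one subagent screening one paper.
    A real worker would call a model; this one applies a trivial rule."""
    is_rct = "randomized" in paper["abstract"].lower()
    return {"id": paper["id"], "is_rct": is_rct}


def fan_out(papers: list, worker, max_workers: int = 8) -> list:
    """Run the worker over all papers in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, papers))


# Invented examples for illustration.
papers = [
    {"id": "p1", "abstract": "A pragmatic randomized trial of an ambient scribe."},
    {"id": "p2", "abstract": "A retrospective chart review of documentation time."},
    {"id": "p3", "abstract": "Randomized evaluation of AI-supported screening."},
]

reviews = fan_out(papers, review_paper)
rct_ids = [r["id"] for r in reviews if r["is_rct"]]
print(rct_ids)
```

Threads suit this shape of work because each worker spends most of its time waiting on a network call. The step that still needs care is the merge at the end, where many parallel judgments have to be reconciled into one answer.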
And looking ahead, GPT-5.6 and Gemini 3.5 Pro are expected this month competing directly with Fable 5. Yeah. And I'd advise entirely ignoring the benchmark leaderboard noise when they drop.
The only metric that matters for medical applications is what their expanded context limits will do to medical RAG. Retrieval augmented generation. Right. Currently, our RAG is constrained by the retrieval bottleneck. If the system doesn't retrieve the right chunks of a patient's history, it fails.
But if these new models have context windows large enough to just ingest an entire 20 year medical record at once. You completely bypass the bottleneck. The model just reasons across the whole raw data set. Exactly. That's the structural shift to watch for.
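The retrieval bottleneck can be shown with a toy retriever. This is a sketch under stated assumptions: the word-overlap scorer stands in for a real embedding-based retriever, the record entries are invented, and "full context" here simply means passing every chunk through.

```python
def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: number of query words that appear in the chunk."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def retrieve_top_k(query: str, chunks: list, k: int = 2) -> list:
    """Return the k highest-scoring chunks; everything else never reaches the model."""
    return sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)[:k]


# Invented patient record for illustration.
record = [
    "2009 visit: anaphylaxis after amoxicillin, epinephrine given",
    "2015 visit: patient asks about penicillin alternatives for travel",
    "2021 visit: penicillin allergy question on intake form left blank",
    "2023 visit: seasonal rhinitis, no medication changes",
]

query = "penicillin allergy history"
retrieved = retrieve_top_k(query, record, k=2)
full_context = list(record)  # long-context approach: hand over the whole record

print(record[0] in retrieved)     # the decisive 2009 entry is missed by retrieval
print(record[0] in full_context)  # but present when the whole record is ingested
```

The 2009 entry is the one that matters clinically, since amoxicillin is a penicillin-class antibiotic, yet it shares no words with the query and so is never retrieved. Ingesting the whole record removes that failure point, though it leaves open whether the model actually attends to the right line.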
So the ambient scribe RCT is the rare clinician-facing win with real randomized evidence. Do you want us to pull the full trial design, you know, the endpoints, the blinding, who funded it, and stack it against the earlier automation bias data we've covered, so you can see exactly where ambient AI helps and where it quietly shifts risk?
Thanks for listening. Find us on YouTube and your favorite podcast app. See you tomorrow.