Nature Medicine Stress Test Finds Frontier AI Models Not Clinically Ready

A Nature Medicine adversarial stress test found GPT-5, Claude 3.5, and Gemini 2.5 Pro ace medical benchmarks but lean on shortcuts, fabricate reasoning, and falter under small input changes, exposing a gap between lab scores and clinical readiness.

Aaron Rafferty
June 26, 2026

Key Takeaways

A Nature Medicine study stress-tested frontier multimodal models, including GPT-5, Claude 3.5, and Gemini 2.5 Pro, on medical reasoning.
The models aced ordinary benchmarks but leaned on shortcuts, fabricated reasoning, and faltered when key image details were removed or modalities swapped.
The authors tie the gap to test design rather than model size, and released perturbation tools and rubrics for others to reuse.

Leading AI models that ace medical benchmarks fall apart under small, deliberate changes to their inputs, according to an adversarial stress test published in Nature Medicine on June 26.

Researchers led by Hoifung Poon and Yu Gu put frontier multimodal models, including GPT-5, Claude 3.5, and Gemini 2.5 Pro, through perturbations designed to probe how they actually reason. Scripps Research cardiologist Eric Topol, who flagged the paper, said the models are not ready.

The systems often reached the right answer for the wrong reasons. They could guess correctly even with key inputs removed, then get confused by slight prompt changes while producing convincing but flawed reasoning traces. Removing image details or swapping modalities exposed the brittleness.

The authors tie the weakness to how models are tested, not to model size, and argue benchmark wins do not equal clinical readiness. They released their perturbation tools and rubrics so others can run the same checks on robustness and so
