AI outperformed doctors on diagnosing tough cases - but is it ready for real patients?

An artificial intelligence program put to the most rigorous tests in modern medicine aced its exams, and in fact performed better than human doctors on reasoning tasks such as making emergency room decisions and diagnosing complex cases, according to a new study published Thursday in Science.

Authors of the study, which was conducted by a network of scientists from across the country, said they were stunned by how well the AI program performed, but they made clear that their results do not demonstrate that artificial intelligence is ready to take over medicine and remove human care providers.

"Essentially we threw every single case benchmark that we had at one of the new reasoning models. And overall, the model outperformed our very large physician baseline, which included board-certified, actively practicing physicians and real, messy cases," said Arjun Manrai, a biomedical informatics expert at Harvard Medical School and an author of the study.

"I don't think our findings mean that AI replaces doctors, despite what some companies are likely to say and how they're likely to use the results," Manrai added. "I think it does mean that we're witnessing a really profound change in technology that will reshape medicine, and that we need to evaluate this technology now, in rigorously conducted clinical trials."

Artificial intelligence has exploded around the world, including in the healthcare field, where chatbots are already being deployed to provide mental healthcare and some institutions are beginning to incorporate AI into their medical record-keeping. But healthcare providers have remained skeptical of incorporating AI more broadly in medical settings, in part because the technology is so new and not yet fully vetted.

The Science study used an advanced AI program, called the OpenAI o1 series, and compared its abilities across multiple reasoning tests to earlier AI generations and hundreds of human doctors. The tests were all text-based, meaning the AI program did not interact directly with people or collect sensory information - one major limitation in the results, the authors noted.

In all, the AI program was evaluated across six types of tests, which included standard case studies used to train doctors as well as real-world cases taken from an emergency department in Massachusetts. Many of these tests were proposed more than 65 years ago when scientists first anticipated computers would reach a level of intelligence that they could be used in medicine; those tests have held up over the decades and remain a gold standard for evaluating AI.

In every test, the advanced AI program outperformed earlier generations of AI as well as human doctors. In the emergency room setting, the AI performed especially well in triage, where it was far better at rapidly diagnosing and developing treatment plans based on confusing or fragmented information.

"These models are becoming shockingly good at reasoning tasks that a few years ago would have been unthinkable for a computer," said Dr. Jonathan Chen, an expert in computational medicine and biomedical data science at Stanford who was also an author of the new paper. "Now it's not only gotten a lot better than we expected, but sometimes better than a human, and sometimes better than a human using an AI chatbot to help. It's humbling."

But Chen, too, said AI is not ready to be widely deployed in medicine. What this study suggests, he and others said, is that artificial intelligence technology is now moving faster than human scientists' ability to evaluate it - there's a need for new guardrails and new ways to test the limitations of these products.

"We'll eventually catch up and smooth things out," Chen said. "Like an industrial revolution, eventually it's better overall on average, but it's going to hurt some people along the way, so we have to rapidly catch up. This isn't something to figure out later - we have to figure it out now."

Optimistically, the authors all said there is undoubtedly a path toward incorporating AI into medicine that will improve healthcare across the board. It may be, for example, that AI is far better at diagnostics than the "Dr. House" characters who have been idolized in medicine for decades, said Dr. Adam Rodman, director of AI programs at the Shapiro Center for Research and Education in Boston, who was senior author. House was the titular character in a medical TV show, known for making incredible diagnoses of hard-to-treat patients.

Maybe, Rodman said, that job will go to AI in the not-too-distant future. "I would be shocked in 10 years if there's still the Dr. House type who's routinely making diagnoses. These models are so good at making diagnoses," he said.

But diagnosis is just one part of the job, and much of a physician's work involves nuance and decision-making that takes into account the full human experience. Rodman said he imagined a "triad" in the future - the doctor, patient and an AI program working together.

"It will take over parts of the job and it will be very very good at certain parts of the job," Manrai said. "But ultimately I think humans want humans to guide them through life or death decisions, to guide them through challenging treatment decisions that change their body and interfere with their quality of life. I want to talk to a doctor about that, and I don't think it's going anywhere."

Dr. Sumant Ranji, director of the UCSF Coordinating Center for Diagnostic Excellence (called CODEX), said he was impressed by the diagnostic abilities of the AI tool in the study, which he was not involved with. Like the authors, he said he now believes that artificial intelligence is "ready for prime-time," in the sense that "they are ready to be tested in real-world healthcare settings."

"We don't know how AI will play out in the real world," Ranji said. "AI is very impressively good at the diagnostic part of the job, but that's only one facet of what we as clinicians do. I saw 14 patients today in the hospital, and the kinds of discussions we were having boiled down to trying to build a rapport so they would trust us. A lot of times, diagnosis is just step one."

Copyright 2026 Tribune Content Agency. All Rights Reserved.