AI chatbots outperform doctors in diagnosing patients, study finds
Thursday, February 13, 2025, 12:00, by ComputerWorld
Chatbots quickly surpassed human physicians in diagnostic reasoning — the crucial first step in clinical care — according to a new study published in the journal Nature Medicine.
The study suggests that physicians with access to large language models (LLMs), which underpin generative AI (genAI) chatbots, perform better on several patient care tasks than colleagues without the technology. It also found that physicians using chatbots spent more time on patient cases and made safer decisions than those without access to the genAI tools.

The research, undertaken by more than a dozen physicians at Beth Israel Deaconess Medical Center (BIDMC), showed genAI has promise as an "open-ended decision-making" physician partner.

"However, this will require rigorous validation to realize LLMs' potential for enhancing patient care," said Dr. Adam Rodman, director of AI Programs at BIDMC. "Unlike diagnostic reasoning, a task often with a single right answer, which LLMs excel at, management reasoning may have no right answer and involves weighing trade-offs between inherently risky courses of action."

The conclusions were based on evaluations of the decision-making of 92 physicians as they worked through five hypothetical patient cases. The evaluations focused on the physicians' management reasoning, which includes decisions on testing, treatment, patient preferences, social factors, costs, and risks.

When responses to the hypothetical cases were scored, the physicians using a chatbot scored significantly higher than those using conventional resources only. Chatbot users also spent more time per case, by nearly two minutes, and they had a lower risk of mild-to-moderate harm than those using conventional resources (3.7% vs. 5.3%). Severe harm ratings, however, were similar between the groups.

"My theory," Rodman said, "[is] the AI improved management reasoning in patient communication and patient factors domains; it did not affect things like recognizing complications or medication decisions. We used a high standard for harm, immediate harm, and poor communication is unlikely to cause immediate harm."

An earlier 2023 study by Rodman and his colleagues yielded promising, yet cautious, conclusions about the role of genAI. It found the technology "capable of showing the equivalent or better reasoning than people throughout the evolution of a clinical case." That work, published in the Journal of the American Medical Association (JAMA), used a common testing tool to assess physicians' clinical reasoning.

The researchers recruited 21 attending physicians and 18 residents, who worked through 20 archived (not new) clinical cases in four stages of diagnostic reasoning, writing and justifying their differential diagnoses at each stage. The researchers then ran the same tests with ChatGPT, based on the GPT-4 LLM; the chatbot followed the same instructions and used the same clinical cases.

The results were both promising and concerning. The chatbot scored highest on some measures of the testing tool, with a median score of 10/10, compared with 9/10 for attending physicians and 8/10 for residents. And while diagnostic accuracy and reasoning were similar between humans and the bot, the chatbot had more instances of incorrect reasoning. "This highlights that AI is likely best used to augment, not replace, human reasoning," the study concluded. Simply put, in some cases "the bots were also just plain wrong," the report said.

Rodman said he isn't sure why the earlier study showed more chatbot errors. "The checkpoint is different [in the new study], so hallucinations might have improved, but they also vary by task," he said. "Our original study focused on diagnostic reasoning, a classification task with clear right and wrong answers. Management reasoning, on the other hand, is highly context-specific and has a range of acceptable answers."

A key difference from the original study is that the researchers are now comparing two groups of humans, one using AI and one not, while the original work compared AI to humans directly. "We did collect a small AI-only baseline, but the comparison was done with a mixed-effects model. So, in this case, everything is mediated through people," Rodman said.

Researcher and lead study author Dr. Stephanie Cabral, a third-year internal medicine resident at BIDMC, said more research is needed on how LLMs can fit into clinical practice, "but they could already serve as a useful checkpoint to prevent oversight."

"My ultimate hope is that AI will improve the patient-physician interaction by reducing some of the inefficiencies we currently have and allow us to focus more on the conversation we're having with our patients," she said.

The latest study involved a newer, upgraded version of GPT-4, which could explain some of the variation in results.

To date, AI in healthcare has mainly focused on tasks such as portal messaging, according to Rodman. But chatbots could enhance human decision-making, especially in complex tasks. "Our findings show promise, but rigorous validation is needed to fully unlock their potential for improving patient care," he said. "This suggests a future use for LLMs as a helpful adjunct to clinical judgment. Further exploration into whether the LLM is merely encouraging users to slow down and reflect more deeply, or whether it is actively augmenting the reasoning process, would be valuable."

The chatbot testing will now enter two follow-on phases; the first has already produced new raw data to be analyzed by the researchers, Rodman said.
The researchers will begin looking at varying user interaction, studying different types of chatbots, different user interfaces, and physician education on using LLMs (such as more specific prompt design) in controlled environments to see how performance is affected. The second phase will also involve real-time patient data, not archived patient cases.

"We are also studying [human computer interaction] using secure LLMs, so [it's] HIPAA compliant, to see how these effects hold in the real world," he said.
https://www.computerworld.com/article/3823233/ai-chatbots-outperform-doctors-in-diagnosing-patients-...