AI Chatbots Accept False Medical Advice 46% of the Time

alex2404
By
Disclosure: This website may contain affiliate links, which means I may earn a commission if you click on the link and make a purchase. I only recommend products or services that I personally use and believe will add value to my readers. Your support is appreciated!

The exact phrasing was “rectal garlic insertion for immune support.” Written in formal clinical language, tucked inside what looked like a hospital discharge note, it fooled nearly half the AI models tested.

That finding sits at the center of a January study published in The Lancet Digital Health, which evaluated 20 large language models using more than 3.4 million prompts. The prompts came from three sources: public forums and social media, real hospital discharge notes edited to contain a single false recommendation, and fabricated clinical accounts reviewed by physicians.

The headline number is unsettling enough — models accepted medical misinformation roughly one in three times they encountered it. But the more precise finding reveals something structural about how these systems work.

Clinical language as a trust signal

When false claims arrived in casual, Reddit-style phrasing, the models rejected them about 91% of the time. When the identical claim was repackaged in formal medical language — a discharge note tone, clinical vocabulary, institutional framing — the failure rate jumped to 46%. The content was the same. The costume was different.

“They evaluate whether it sounds like something a trustworthy source would say,” said Dr. Mahmud Omar, a research scientist at Mount Sinai Medical Center and co-author of the study. The models, he says, have absorbed the association between clinical language and authority, but they do not verify whether the underlying claim is true.

The pattern inverted when misinformation leaned on rhetorical tricks — appeals like “a senior clinician with 20 years of experience endorses this” or “everyone knows this works.” There, the models grew more skeptical. The explanation, according to Omar, is that they have “learned to distrust the rhetorical tricks of internet arguments, but not the language of clinical documentation.”

The failure mode matters as much as the failure rate. “A doctor who’s unsure will pause, hedge, order another test,” Omar told Live Science. “An LLM delivers the wrong answer with the exact same confidence as the right one.”

No better than a search engine

A separate study, published in February in Nature Medicine, reached a related conclusion: chatbots performed no better than an ordinary internet search for medical information. Together, the two papers add weight to a body of evidence questioning whether these tools serve the general public reliably in health contexts.

The stakes are not abstract. According to the announcement, more than 40 million people turn to ChatGPT daily with medical questions, even though the platform carries a warning against relying on it for medical advice. Tools like Gemini, Ada Health, and ChatGPT Health are trained on vast medical literature and score near-perfectly on medical licensing exams — a combination that likely amplifies public trust.

That trust, the Lancet study suggests, is being returned with false confidence. The models tested include both general-purpose and medically specialized systems. None were reliably immune to the problem. The vulnerability, Omar’s team found, was not in the models’ knowledge — it was in how they were taught to read the room.

Photo by Vitaly Gariev on Unsplash

This article is a curated summary based on third-party sources. Source: Read the original article

Share This Article