A 3-in-4 Chance ChatGPT Gets Your Drug Wrong: The Accuracy Numbers Pharma Leaders Need

ChatGPT drug information accuracy is worse than most brand teams assume. The published record: roughly 74 percent of drug answers inaccurate, incomplete, or missing, and citations routinely fabricated.

The Juncture team · 7 min read
AI accuracy, hallucination, drug information, brand safety, patient safety, answer monitoring

When a patient asks ChatGPT about your therapy and the answer is wrong, the patient does not blame the model. They blame the drug. The same is true of the physician who gets a confident, incorrect dose. The mistake belongs to OpenAI, but the name attached to it is yours. That is what makes AI accuracy a brand-safety number, and the published record on that number is bad enough that every pharma leader should know it by heart.

This piece assembles that record in one place. Not the mechanism (how an approved label gets paraphrased off-label, which we cover separately in "Your approved message is already off-label in the machine answer"), but the raw, citable measurements of how often AI gets drug information wrong. Taken together, they show that the answer engine your audience already trusts is wrong about medications often enough that not watching what it says about your brand is a safety decision, not a marketing one.

The headline number: roughly three in four drug answers were wrong or incomplete

Pharmacists at Long Island University posed 39 real drug-information questions, the kind their drug-information service fields from clinicians, to the free version of ChatGPT. Only 10 of the 39 responses were judged satisfactory. The model gave an inaccurate response, an incomplete response, or no usable response at all to the rest, meaning ChatGPT answered roughly 74 percent of the drug questions wrong, incomplete, or not at all. When the pharmacists asked the model to supply references for its answers, it generated citations that did not exist.
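The arithmetic behind the headline figure is easy to check. The sketch below uses only the two counts reported above (39 questions, 10 satisfactory answers). The 95 percent Wilson interval is our own addition, not something the study reports; it is included only to show how much uncertainty a 39-question sample carries.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts as reported in the Long Island University study described above.
TOTAL_QUESTIONS = 39
SATISFACTORY = 10

unsatisfactory = TOTAL_QUESTIONS - SATISFACTORY   # inaccurate, incomplete, or no answer
failure_rate = unsatisfactory / TOTAL_QUESTIONS
low, high = wilson_interval(unsatisfactory, TOTAL_QUESTIONS)

print(f"{unsatisfactory} of {TOTAL_QUESTIONS} unsatisfactory = {failure_rate:.1%}")
print(f"Illustrative 95% interval: {low:.0%} to {high:.0%}")
```

Even at the low end of that illustrative interval, more than half of the answers fall short, which is why the small sample does not soften the conclusion.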

That last detail matters more than the headline percentage. A wrong answer with no source can be caught. A wrong answer wrapped in a fabricated but plausible-looking citation is engineered to be believed. A 2025 analysis published in the Journal of the American College of Clinical Pharmacy reached the same finding: ChatGPT produced accurate responses to only about 10 of 39 real-world drug-information questions, with nearly three-fourths incomplete or inaccurate, and with reproducibility problems that meant the same question did not reliably get the same answer.

Hold that number next to how it is being used. This is not a tool sitting unused in a lab. Pew Research Center found in 2026 that 22 percent of U.S. adults use AI chatbots for health information at least sometimes. More than one in five adults is asking a system that, in the study above, got roughly three in four drug questions wrong or incomplete.

The citation problem is its own category of failure

If the answers were merely wrong, a savvy reader could check the sources. The deeper problem is that the sources are frequently invented, and they look real.

A BMJ-published audit of how the leading chatbots cite medical information found that average reference completeness across five chatbots was only 40 percent, with no model producing a fully accurate reference list. A separate analysis of more than 500 AI-generated medical citations in that same body of work found only 32 percent were fully accurate, with nearly half at least partly fabricated. Put plainly: when an AI hands a clinician a citation, the odds are roughly two in three that something in it is wrong, and a meaningful share of the time it points at a paper that was never written.

It gets worse in the formal literature. A review of documented large-language-model failures in pharma found that 47 percent of the citations in ChatGPT-generated medical papers were fully fabricated. The same review catalogued the kind of error that turns a citation problem into a clinical one: 29 percent accuracy on antibiotic boxed-warning questions, and a documented opioid-conversion error off by a factor of 1,000. A boxed warning is the most serious safety information a label carries. A 1,000-fold dose error is not a typo; it is a number that, acted on, harms someone.

These are not edge cases curated to look bad. They are the published, reproducible findings that any journalist, regulator, or plaintiff's attorney can pull up and attach to a brand that did not see them coming.

Patients already sense the gap, which is the trap

Confident, fluent, wrong answers do not fool everyone. The survey data says people are more skeptical than that, and the skepticism is exactly what makes the situation dangerous for a brand.

Pew found that only 18 percent of people who use AI chatbots for health information rate that information as highly accurate, compared with 65 percent who rate the information from healthcare providers as highly accurate. So users do not fully trust the machine. They use it anyway, often as a first stop, and then carry the answer into the exam room. The trap is that a low-trust channel still shapes the conversation. When a patient arrives believing your drug causes a side effect it does not, or works for a condition it is not approved for, that belief was authored by a system that only 18 percent of its own users rate as highly accurate, with your name on the output. The clinician now spends the visit correcting the machine instead of treating the patient.

The accuracy a reader cannot see is the core of the problem. Nothing in a confident wrong answer signals that it is wrong. The fabricated citation looks like a real one. The dropped contraindication leaves no gap on the page. A patient cannot distinguish an accurate answer from a confident incorrect one, and neither, in the moment, can a busy physician. That is precisely why the inaccuracy has to be measured externally, by someone watching the output, because the audience consuming it has no way to flag it themselves.

What this looks like on one brand

Take a fictional therapy, Varigel, approved for a single narrow indication with one serious contraindication. The label is exact. Now run the kind of questions the studies above used against the engines your audience actually uses.

Ask "what are the side effects of Varigel" and one engine returns a clean list that matches the label. A second adds a side effect that belongs to a different drug class, lifted from an adjacent summary, stated with full confidence. Ask "can I take Varigel with [the contraindicated medication]" and an engine, summarizing an efficacy-focused source that skipped the safety section, says nothing about the contraindication at all. Ask for a source and you may get a citation to a journal article that does not exist. None of this is malicious. Every one of these failures is the documented behavior in the studies above, reproduced on your molecule.

Here is the question the numbers force: across the engines your patients and HCPs use, how often does each of those failures happen for your brand this week, and did the rate move after your last data readout or a model update? Without an instrument pointed at the machine, you cannot answer it. The studies prove the failure rate is high in general. Only continuous monitoring tells you your rate, on your label, right now.

From a published statistic to a brand-safety function

The reason these numbers add up to a mandate rather than a curiosity is liability and attribution. When AI states something false about a medication, the harm and the reputational damage flow to the brand, regardless of who built the model. You cannot edit the model. You cannot review its output before it speaks. The only lever you control is detection: knowing, continuously, what the machine says about your drug and how far it has drifted from the approved label.

That reframes the work. Tracking AI accuracy for your brand is not a content campaign. It is the same category of activity as pharmacovigilance: a measurable safety surface that you monitor because the cost of not knowing is borne by patients and by the brand. The metric that operationalizes it is Share of Answer, paired with an accuracy check against the label, run across engines on a schedule rather than screenshotted once.
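To make "an accuracy check against the label" concrete, here is a deliberately naive sketch built on the fictional Varigel example. Everything in it is hypothetical and for illustration only: the `Label` structure, the engine names, the sample answers, the made-up co-medication "examplomycin", and the keyword matching. It is not a description of how Juncture or any real monitoring product works, and a real check would need far more robust language handling than substring matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """Hypothetical, heavily simplified stand-in for an approved label."""
    drug: str
    side_effects: frozenset
    contraindicated_with: frozenset


@dataclass(frozen=True)
class Deviation:
    engine: str
    question: str
    kind: str
    detail: str


# Phrases we assume signal that an answer mentioned the contraindication.
WARNING_CUES = ("contraindicated", "do not take", "should not be taken")


def check_answer(label, engine, question, answer, watchlist):
    """Return deviations from the label found in one engine answer."""
    found = []
    q, a = question.lower(), answer.lower()
    # Side effects that appear in the answer but not on the label.
    for term in sorted(watchlist - label.side_effects):
        if term in a:
            found.append(Deviation(engine, question, "unlabeled_side_effect", term))
    # Questions about a contraindicated drug that get no warning in reply.
    for other in sorted(label.contraindicated_with):
        if other in q and not any(cue in a for cue in WARNING_CUES):
            found.append(Deviation(engine, question, "missing_contraindication", other))
    return found


# Fictional label and fictional engine output, invented for this example.
LABEL = Label("varigel", frozenset({"nausea", "headache"}), frozenset({"examplomycin"}))
WATCHLIST = frozenset({"nausea", "headache", "hair loss", "tremor"})
ANSWERS = [
    ("engine_a", "What are the side effects of Varigel?",
     "Common side effects are nausea and headache."),
    ("engine_b", "What are the side effects of Varigel?",
     "Side effects include nausea, headache, and hair loss."),
    ("engine_b", "Can I take Varigel with examplomycin?",
     "Varigel has shown strong efficacy in trials."),
]

deviations = [
    dev
    for engine, question, answer in ANSWERS
    for dev in check_answer(LABEL, engine, question, answer, WATCHLIST)
]
for dev in deviations:
    print(f"{dev.engine}: {dev.kind} ({dev.detail})")
```

Run on a schedule and stored with a timestamp, the same list of deviations becomes a trend line: the rate per engine, per question, before and after a model update or a data readout.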

This is the seam Juncture is built for. The same approved label you clear on the inside with Pre-check is the reference that Answer Monitor measures the machine against on the outside, so an inaccurate or off-label AI answer surfaces as a deviation from a known-good source, not as a surprise in a forwarded screenshot. The published studies tell you the failure rate is high. Monitoring tells you yours, on your brand, in time to do something about it.

The accuracy record is not a reason to wait for the models to improve. It is the reason to start watching now. Bring one brand and the questions your audience actually asks, and we will show you what the machine says about it today, measured against your label, including the answers it gets confidently, citably wrong.

Sources

  1. ASHP / Long Island University, "Study Finds ChatGPT Provides Inaccurate Responses to Drug Questions," 2023. news.ashp.org
  2. BMJ Group, "Substantial amount of medical information provided by popular chatbots inaccurate and incomplete," 2025. bmjgroup.com
  3. JACCP: Journal of the American College of Clinical Pharmacy, evaluation of ChatGPT responses to real-world drug-information questions, 2025. accpjournals.onlinelibrary.wiley.com
  4. IntuitionLabs, "LLM Hallucinations in Pharma and Clinical Trial Errors," 2025. intuitionlabs.ai
  5. Pew Research Center, "Where Do Americans Get Health Information, and What Do They Trust?," 2026. pewresearch.org

People also ask

Questions this raises

How accurate is ChatGPT for drug information?

In a Long Island University study, pharmacists posed 39 real drug-information questions to free ChatGPT and judged only 10 responses satisfactory, meaning roughly 74 percent were inaccurate, incomplete, or unanswerable. A 2025 study in the Journal of the American College of Clinical Pharmacy found the same: accurate answers to only about 10 of 39 real-world drug questions, with reproducibility problems. For drug information specifically, the published accuracy record is poor.

Does ChatGPT make up medical citations?

Yes, routinely. The Long Island University pharmacists found ChatGPT generated fabricated citations when asked to support its drug answers. A BMJ-published audit found average reference completeness across five chatbots was only 40 percent, and a separate analysis of more than 500 AI-generated medical citations found only 32 percent were fully accurate, with nearly half at least partly fabricated. The invented citations look real, which is what makes them dangerous.

What percentage of AI drug answers are wrong or incomplete?

In the most-cited study, ChatGPT gave an inaccurate, incomplete, or absent response to roughly 74 percent of 39 real drug-information questions, with only about a quarter judged satisfactory. A 2025 replication in a clinical-pharmacy journal reached nearly the same figure. The rate for any individual brand will differ and can only be known by monitoring the engines directly, but the documented general failure rate is high.

Can patients trust ChatGPT for medication side effects?

The evidence says no. Studies show AI frequently omits contraindications, adds side effects from the wrong drug class, and supports claims with fabricated citations, and a confident wrong answer carries no signal that it is wrong. Pew Research found only 18 percent of AI health users rate the information as highly accurate, versus 65 percent for healthcare providers, yet 22 percent of U.S. adults use chatbots for health information anyway. Medication side-effect questions should be verified against the approved label or a clinician.

Who is liable when AI gives wrong information about a drug?

Liability is unsettled and depends on jurisdiction and facts, but the practical reality is that the harm and reputational damage attach to the brand, not the model vendor. A patient or physician who receives a wrong AI answer associates the error with the drug. Because a pharma company cannot edit the model or review its output beforehand, the only controllable lever is detection: continuously monitoring what AI says about the brand against the approved label so off-label or inaccurate answers are caught early.

See it on your brand

See Juncture run on your brand.

Bring an asset and a brand. We will pre-check the asset against the label and show how the machine answers about the brand today, inside and out.