When a patient asks ChatGPT about your therapy and the answer is wrong, the patient does not blame the model. They blame the drug. The same is true of the physician who gets a confident, incorrect dose. The mistake belongs to OpenAI, but the name attached to it is yours. That is what makes AI accuracy a brand-safety number, and the published record on that number is bad enough that every pharma leader should know it by heart.
This piece assembles that record in one place. Not the mechanism (how an approved label gets paraphrased off-label, which we cover separately in "your approved message is already off-label in the machine answer"), but the raw, citable measurements of how often AI gets drug information wrong. Taken together, they show that the answer engine your audience already trusts is wrong about medications often enough that not watching what it says about your brand is a safety decision, not a marketing one.
The headline number: roughly three in four drug answers were wrong or incomplete
Pharmacists at Long Island University posed 39 real drug-information questions, the kind their drug-information service fields from clinicians, to the free version of ChatGPT. Only 10 of the 39 responses were judged satisfactory. For the other 29, the model gave an inaccurate response, an incomplete response, or no usable response at all. That is roughly 74 percent of the drug questions answered wrong, incompletely, or not at all. When the pharmacists asked the model to supply references for its answers, it generated citations that did not exist.
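The headline percentage is simple arithmetic on the two counts reported above, and it is worth being able to reproduce it when someone challenges the figure. A minimal sketch, using only the 39 and 10 from the study (the variable names are ours):

```python
# Counts as reported for the Long Island University study cited above.
total_questions = 39
satisfactory = 10

# Everything not judged satisfactory: inaccurate, incomplete, or unanswered.
unsatisfactory = total_questions - satisfactory
unsatisfactory_rate = unsatisfactory / total_questions

print(f"{unsatisfactory} of {total_questions} = {unsatisfactory_rate:.1%}")
# prints: 29 of 39 = 74.4%
```

Note that the 74 percent lumps three different outcomes together. The counts given here do not say how the 29 split between wrong, incomplete, and unanswered.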
That last detail matters more than the headline percentage. A wrong answer with no source can be caught. A wrong answer wrapped in a fabricated but plausible-looking citation is engineered to be believed. A 2025 analysis published in the Journal of the American College of Clinical Pharmacy reached the same finding: ChatGPT produced accurate responses to only about 10 of 39 real-world drug-information questions, with nearly three-fourths incomplete or inaccurate, and with reproducibility problems that meant the same question did not reliably get the same answer.
Hold that number next to how the tool is being used. This is not a tool sitting unused in a lab. Pew Research Center found in 2026 that 22 percent of U.S. adults use AI chatbots for health information at least sometimes. More than one in five adults is asking a system that, in the study above, got drug questions wrong or incomplete most of the time.
The citation problem is its own category of failure
If the answers were merely wrong, a savvy reader could check the sources. The deeper problem is that the sources are frequently invented, and they look real.
A BMJ-published audit of how the leading chatbots cite medical information found that average reference completeness across five chatbots was only 40 percent, with no model producing a fully accurate reference list. A separate analysis of more than 500 AI-generated medical citations in that same body of work found only 32 percent were fully accurate, with nearly half at least partly fabricated. Put plainly: when an AI hands a clinician a citation, the odds are about two to one that it is not fully accurate, and a meaningful share of the time it points at a paper that was never written.
It gets worse in the formal literature. A review of documented large-language-model failures in pharma found that 47 percent of the citations in ChatGPT-generated medical papers were fully fabricated. The same review catalogued the kind of error that turns a citation problem into a clinical one: 29 percent accuracy on antibiotic boxed-warning questions, and a documented opioid-conversion error off by a factor of 1,000. A boxed warning is the most serious safety information a label carries. A 1,000-fold dose error is not a typo. It is a number that, acted on, harms someone.
These are not edge cases curated to look bad. They are the published, reproducible findings that any journalist, regulator, or plaintiff's attorney can pull up and attach to a brand that did not see them coming.
Patients already sense the gap, which is the trap
Confident, fluent, wrong answers do not fool everyone. The survey data says people are more skeptical than that, and the skepticism is exactly what makes the situation dangerous for a brand.
Pew found that only 18 percent of people who use AI chatbots for health information rate that information as highly accurate, compared with 65 percent who rate the information from healthcare providers as highly accurate. So users do not fully trust the machine. They use it anyway, often as a first stop, and then carry the answer into the exam room. The trap is that a low-trust channel still shapes the conversation. When a patient arrives believing your drug causes a side effect it does not, or works for a condition it is not approved for, that belief was authored by a system that only 18 percent of its own users rate as highly accurate, with your name on the output. The clinician now spends the visit correcting the machine instead of treating the patient.
The inaccuracy a reader cannot see is the core of the problem. Nothing in a confident wrong answer signals that it is wrong. The fabricated citation looks like a real one. The dropped contraindication leaves no gap on the page. A patient cannot distinguish an accurate answer from a confidently incorrect one, and neither, in the moment, can a busy physician. That is precisely why the inaccuracy has to be measured externally, by someone watching the output, because the audience consuming it has no way to flag it themselves.
What this looks like on one brand
Take a fictional therapy, Varigel, approved for a single narrow indication with one serious contraindication. The label is exact. Now run the kind of questions the studies above used against the engines your audience actually uses.
Ask "what are the side effects of Varigel" and one engine returns a clean list that matches the label. A second adds a side effect that belongs to a different drug class, lifted from an adjacent summary, stated with full confidence. Ask "can I take Varigel with [the contraindicated medication]" and an engine, summarizing an efficacy-focused source that skipped the safety section, says nothing about the contraindication at all. Ask for a source and you may get a citation to a journal article that does not exist. None of this is malicious. Every one of these failures is the documented behavior in the studies above, reproduced on your molecule.
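To make those two failure modes concrete, here is a deliberately crude sketch of a label check for the fictional Varigel. Everything in it is an assumption made for illustration: the `Label` structure, the side-effect terms, the placeholder name "drug-x" for the contraindicated medication, and the keyword matching itself. A real check would need clinical review, not substring tests.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical, heavily simplified stand-in for an approved label."""
    drug: str
    side_effects: set[str]
    contraindicated_with: set[str]


# Illustrative vocabulary of side-effect terms to scan for (not a real list).
KNOWN_SIDE_EFFECT_TERMS = {"nausea", "headache", "dizziness", "hair loss"}


def find_deviations(question: str, answer: str, label: Label) -> list[str]:
    """Return crude, keyword-based deviations of an answer from the label."""
    q, text = question.lower(), answer.lower()
    deviations = []
    # Failure 1: the answer names a side effect the label does not list.
    for term in sorted(KNOWN_SIDE_EFFECT_TERMS - label.side_effects):
        if term in text:
            deviations.append(f"side effect not on label: {term}")
    # Failure 2: the question names a contraindicated drug and the answer
    # never mentions a contraindication. Limitation: a phrase such as
    # "not contraindicated" would wrongly pass this test.
    for other in sorted(label.contraindicated_with):
        if other in q and "contraindicat" not in text:
            deviations.append(f"contraindication omitted: {other}")
    return deviations


label = Label("varigel", {"nausea", "headache"}, {"drug-x"})

print(find_deviations(
    "What are the side effects of Varigel?",
    "Common side effects include nausea and hair loss.",
    label,
))  # ['side effect not on label: hair loss']

print(find_deviations(
    "Can I take Varigel with Drug-X?",
    "Varigel is effective and generally well tolerated.",
    label,
))  # ['contraindication omitted: drug-x']
```

The point of the sketch is the shape of the comparison, not the matching logic: each answer is tested against a known-good reference, and an omission counts as a deviation just as much as an addition does.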
Here is the question the numbers force: across the engines your patients and HCPs use, how often does each of those failures happen for your brand this week, and did the rate move after your last data readout or a model update? Without an instrument pointed at the machine, you cannot answer it. The studies prove the failure rate is high in general. Only continuous monitoring tells you your rate, on your label, right now.
From a published statistic to a brand-safety function
The reason these numbers add up to a mandate rather than a curiosity is liability and attribution. When AI states something false about a medication, the harm and the reputational damage flow to the brand, regardless of who built the model. You cannot edit the model. You cannot review its output before it speaks. The only lever you control is detection: knowing, continuously, what the machine says about your drug and how far it has drifted from the approved label.
That reframes the work. Tracking AI accuracy for your brand is not a content campaign. It is the same category of activity as pharmacovigilance: a measurable safety surface that you monitor because the cost of not knowing is borne by patients and by the brand. The metric that operationalizes it is Share of Answer, paired with an accuracy check against the label, run across engines on a schedule rather than screenshotted once.
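What "on a schedule rather than screenshotted once" buys you is a rate you can compare over time. The sketch below covers only the accuracy-check half of that pairing. It does not compute Share of Answer, which is not defined in this piece. The record format, the engine names, and the results are all invented for illustration.

```python
from collections import defaultdict

# Each record: (run, engine, answer_deviated_from_label). Made-up data.
records = [
    ("run_1", "engine_a", False), ("run_1", "engine_a", True),
    ("run_1", "engine_b", False), ("run_1", "engine_b", False),
    ("run_2", "engine_a", True),  ("run_2", "engine_a", True),
    ("run_2", "engine_b", False), ("run_2", "engine_b", True),
]


def deviation_rates(rows):
    """Share of answers that deviated from the label, per (run, engine)."""
    counts = defaultdict(lambda: [0, 0])  # key -> [deviations, total]
    for run, engine, deviated in rows:
        counts[(run, engine)][0] += int(deviated)
        counts[(run, engine)][1] += 1
    return {key: dev / total for key, (dev, total) in counts.items()}


rates = deviation_rates(records)

# Did the rate move between the two scheduled runs?
for engine in ("engine_a", "engine_b"):
    before, after = rates[("run_1", engine)], rates[("run_2", engine)]
    print(f"{engine}: {before:.0%} -> {after:.0%}")
# engine_a: 50% -> 100%
# engine_b: 0% -> 50%
```

A single screenshot gives you one row of this table. The comparison between runs is what shows whether a data readout or a model update changed what the machine says.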
This is the seam Juncture is built for. The same approved label you clear on the inside with Pre-check is the reference that Answer Monitor measures the machine against on the outside, so an inaccurate or off-label AI answer surfaces as a deviation from a known-good source, not as a surprise in a forwarded screenshot. The published studies tell you the failure rate is high. Monitoring tells you yours, on your brand, in time to do something about it.
The accuracy record is not a reason to wait for the models to improve. It is the reason to start watching now. Bring one brand and the questions your audience actually asks, and we will show you what the machine says about it today, measured against your label, including the answers it gets confidently, citably wrong.
Sources
- ASHP / Long Island University, "Study Finds ChatGPT Provides Inaccurate Responses to Drug Questions," 2023. news.ashp.org
- BMJ Group, "Substantial amount of medical information provided by popular chatbots inaccurate and incomplete," 2025. bmjgroup.com
- JACCP: Journal of the American College of Clinical Pharmacy, evaluation of ChatGPT responses to real-world drug-information questions, 2025. accpjournals.onlinelibrary.wiley.com
- IntuitionLabs, "LLM Hallucinations in Pharma and Clinical Trial Errors," 2025. intuitionlabs.ai
- Pew Research Center, "Where Do Americans Get Health Information, and What Do They Trust?," 2026. pewresearch.org