How retrieval and summarization quietly rewrite your approved claim

Claim drift is structural. An answer engine retrieves mixed sources, then summarizes them, and a dropped qualifier makes an on-label claim read off-label. Here is the mechanism, step by step.

The Juncture team, 9 min read
Tags: claim drift, RAG, summarization, off-label, answer engines

An answer engine never copies your approved claim. It reconstructs one. Between the moment a clinician types a question and the moment a paragraph appears, two transformations run on your message, and each one is lossy by design. The first is retrieval: the system selects a handful of passages from a large index. The second is summarization: the model compresses those passages into fluent prose. A qualifier survives both steps or it does not. When it does not, an on-label claim reads off-label, and no human chose that outcome.

This is the failure most pharma teams misclassify. They see a wrong answer, assume a glitch, and wait for the model vendor to patch it. It is not a glitch. It is the expected behavior of a retrieve-then-summarize pipeline operating on your message, and it recurs every time the pipeline runs. This article walks the mechanism step by step, shows it drifting a fictional Varigel claim from on-label to off-label, and explains why the exposure is structural rather than incidental.

The pipeline has two lossy stages, not one

Most discussion of AI answers collapses into one word, hallucination, as if the model invents from nothing. In a retrieval-augmented system that is rarely what happens. Retrieval-augmented generation (RAG) grounds the model in retrieved evidence specifically to suppress free invention. The drift you should worry about is subtler and it enters at two distinct points.

Stage one is retrieval. The system embeds the user question, searches an index, and returns the top passages by similarity. Similarity is not the same as completeness. A safety paragraph and an efficacy paragraph from the same label can have very different relevance scores against a question like "what is Varigel for," and the efficacy passage usually wins. The contraindication never enters the context window, so the model cannot include what it never received. A 2025 survey of RAG architectures catalogs exactly these retrieval-side failures: noisy passages, partial coverage, and the model extrapolating beyond what the retrieved evidence justifies (Gupta et al., 2025).

Stage two is summarization. Even when the right passages are retrieved, the model must compress them into a short answer, and abstractive summarization is lossy compression. The compression target is fluency, and fluency favors the main clause over the subordinate one. Research on multi-document summarization finds that state-of-the-art models hallucinate and drop content precisely when sources are diverse or partially overlapping (ACL Findings, 2025), and faithfulness degrades further as the material to be summarized grows longer (arXiv, 2025). A qualifier is, structurally, the most droppable token in a sentence. It is subordinate, it is conditional, and removing it makes the prose read cleaner. The model is optimizing for the thing that erases your protection.

Why citation does not save you

Teams assume a cited answer is a safe answer. The link looks like grounding. It frequently is not. Recent work formalizing citation faithfulness shows that a model often produces a claim from its parametric memory and then attaches a citation that maps to a document only superficially. The authors call this post-rationalization: the attribution looks supported, but the underlying claim was not lifted from the cited source (Wallat et al., 2024). Correctness of the citation and faithfulness of the claim to that citation are different properties, and an unfaithful attribution is harder to catch precisely because it looks authoritative. A clinician who clicks the citation and sees a real label may trust a sentence the label never contained.

Watching the mechanism drift a Varigel claim

Make this concrete. Varigel is a fictional therapy approved for one narrow indication, with a contraindication in patients taking a common comorbidity medication. The approved claim, as cleared by MLR, reads in full:

Varigel is indicated for moderate to severe Condition X in adults who have not responded to first-line therapy. Varigel is contraindicated in patients receiving Medication Y.

Two clauses do regulatory work here. "Who have not responded to first-line therapy" scopes the indication. "Contraindicated in patients receiving Medication Y" is fair-balance safety information. Now run the pipeline.

Retrieval. The clinician asks "what is Varigel used for." The retriever scores passages by similarity to that question. The indication paragraph scores high. The contraindication paragraph, which never mentions "used for" and reads as safety prose, scores lower and falls below the cutoff. It is not retrieved. Already, before the model writes a token, the safety clause is gone from the context. This is the retrieval-coverage failure the survey describes, not a model defect.
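The retrieval step above can be sketched in a few lines. This is a minimal illustration, not how any particular engine works. It scores each passage by the fraction of question words it contains, a crude lexical stand-in for the embedding similarity a real retriever would use. The passage names, the blog and abstract texts, and the cutoff of three are all assumptions made up for the example.

```python
import re

def terms(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for a learned embedding."""
    return set(re.findall(r"[a-z]+", text.lower()))

def score(question: str, passage: str) -> float:
    """Fraction of the question's words that also appear in the passage."""
    q = terms(question)
    return len(q & terms(passage)) / len(q)

def retrieve(question: str, index: dict[str, str], top_k: int) -> list[str]:
    """Return the names of the top_k passages, highest score first."""
    ranked = sorted(index, key=lambda name: score(question, index[name]), reverse=True)
    return ranked[:top_k]

# Hypothetical index: the two label passages plus two invented third-party texts.
index = {
    "indication": "Varigel is indicated for moderate to severe Condition X "
                  "in adults who have not responded to first-line therapy.",
    "contraindication": "Varigel is contraindicated in patients receiving Medication Y.",
    "comparison_blog": "What is Varigel used for, and how does it compare "
                       "with other Condition X therapies?",
    "conference_abstract": "Exploratory cohort: Varigel was also used for Condition Z.",
}

retrieved = retrieve("what is Varigel used for", index, top_k=3)
print(retrieved)
print("contraindication retrieved:", "contraindication" in retrieved)
```

Under these assumptions the contraindication passage scores lowest and misses the cutoff, so nothing downstream can restore it. Raising top_k to 4 would bring it back, which is the point: in this sketch the cutoff, not the model, decides whether the safety clause is available at all.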

Summarization. The model now has the indication paragraph plus a competitor-comparison blog and a years-old conference abstract that also surfaced. It compresses. "Moderate to severe Condition X in adults who have not responded to first-line therapy" is long and conditional, so the model renders it as "Varigel treats Condition X in adults." The qualifier is dropped, not maliciously, but because the shorter clause is more fluent and the model optimizes for fluency. The abstract mentioned an exploratory second use, so the model, filling toward a complete-sounding answer, adds "and is also used for Condition Z."

The output:

Varigel is used to treat Condition X in adults, and is also used for Condition Z.

Compare it to the approved claim. The first-line-failure qualifier is gone, so the indication has silently broadened. The contraindication is gone, so fair balance is gone. An exploratory mention has been promoted to "used for," so the answer now reads off-label. Three regulated transformations, zero invented facts. Every word traces to a real source. The drift lives entirely in what was selected and what was compressed away.
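The comparison above can be expressed as a simple check. The sketch below is an assumption about how such an audit might look, not a description of any existing tool. The ApprovedClaim type, the audit function, and the list of known conditions are hypothetical names introduced for illustration. It uses exact substring matching, which is brittle: a faithfully paraphrased qualifier would be flagged as missing, so a real check would need semantic comparison and human review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApprovedClaim:
    required_phrases: tuple[str, ...]     # qualifiers that must survive in any restatement
    approved_conditions: tuple[str, ...]  # the only conditions the answer may name

def audit(answer: str, claim: ApprovedClaim, known_conditions: tuple[str, ...]) -> dict[str, list[str]]:
    """Report required phrases that are absent and named conditions that are not approved."""
    text = answer.lower()
    missing = [p for p in claim.required_phrases if p.lower() not in text]
    unapproved = [
        c for c in known_conditions
        if c.lower() in text and c not in claim.approved_conditions
    ]
    return {"missing": missing, "unapproved": unapproved}

claim = ApprovedClaim(
    required_phrases=(
        "not responded to first-line therapy",
        "contraindicated in patients receiving Medication Y",
    ),
    approved_conditions=("Condition X",),
)
known_conditions = ("Condition X", "Condition Z")

drifted = "Varigel is used to treat Condition X in adults, and is also used for Condition Z."
result = audit(drifted, claim, known_conditions)
print(result["missing"])     # both required phrases are absent
print(result["unapproved"])  # Condition Z is named but not approved
```

Run against the approved claim itself, the same function returns two empty lists, which is the baseline every generated answer is measured against.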

This pattern is not a thought experiment. Studies of ChatGPT answering real drug-information questions repeatedly find responses that are incomplete or only partially correct and that require careful human review before use (iForumRx, 2024; JACCP, 2025). Incomplete is the operative word. The failure is usually omission, the dropped clause, not fabrication.

Why this is structural, not a one-off

The strongest evidence that claim drift is built in, not incidental, comes from the domain that has worked hardest to engineer it out. Stanford researchers tested purpose-built legal research tools from LexisNexis and Thomson Reuters, systems that use RAG specifically to ground answers in authoritative sources. They still produced incorrect or misgrounded answers on roughly one in six queries, with rates of 17 percent and higher across tools, and the errors included citing a real source for a claim that source did not support (Stanford HAI, 2024). RAG lowered the error rate. It did not remove it. If a tool engineered for a regulated, citation-obsessed profession misgrounds one answer in six, a general engine paraphrasing your label is not going to do better.

Three properties make the drift recur rather than resolve.

It is non-deterministic. The same question, asked twice or asked on two engines, retrieves a slightly different passage set and samples a slightly different summary. You cannot screenshot the answer once and certify it. A claim that reads on-label today can drift next week because the index was refreshed or the model was updated.
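One way to see why a single screenshot certifies nothing is to ask the same question many times and count how often the qualifier survives. The sketch below assumes a stub engine that drops the qualifier at random. The stub, its drop probability, and the function names are invented for illustration and say nothing about how often a real engine drops a clause.

```python
import random
from typing import Callable

def retention_rate(ask: Callable[[str], str], question: str, qualifier: str, runs: int) -> float:
    """Fraction of runs whose answer still contains the qualifier."""
    kept = sum(qualifier.lower() in ask(question).lower() for _ in range(runs))
    return kept / runs

def make_stub_engine(drop_probability: float, seed: int) -> Callable[[str], str]:
    """Hypothetical engine, NOT a real model: returns the short answer at random."""
    rng = random.Random(seed)
    full = ("Varigel is indicated for moderate to severe Condition X "
            "in adults who have not responded to first-line therapy.")
    short = "Varigel treats Condition X in adults."

    def ask(question: str) -> str:
        # The question is ignored; only the random draw decides which answer comes back.
        return short if rng.random() < drop_probability else full

    return ask

engine = make_stub_engine(drop_probability=0.3, seed=0)
rate = retention_rate(
    engine,
    question="what is Varigel used for",
    qualifier="not responded to first-line therapy",
    runs=50,
)
print(f"qualifier retained in {rate:.0%} of runs")
```

The useful output is a rate over many runs rather than a pass or fail on one. Swapping the stub for a call to an actual engine would keep the measurement the same while replacing the invented probability with observed behavior.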

The droppable token is the regulated one. Qualifiers, indications, and contraindications are exactly the subordinate, conditional clauses that abstractive summarization sheds first. The compression objective is structurally biased against the parts of your message that carry legal weight.

Silence is filled, not respected. When retrieval undercovers, the model does not return a shorter, safer answer. It reaches for ambient third-party content, the comparison blog, the old abstract, the forum post, to complete the paragraph. A brand that never published a clean, machine-legible version of its claim is not met with silence. It is met with a confident reconstruction assembled from whatever was nearby.

The regulatory exposure

Reframe the Varigel output as a promotional artifact and the problem sharpens. FDA's prescription-drug advertising rule requires a fair balance of benefit and risk information and prohibits promotion of a drug for an unapproved use (21 CFR 202.1). The drifted answer dropped the contraindication, so risk information is no longer presented alongside benefit. It promoted an exploratory use to an indication, so it now reads as off-label promotion. If your brand fed or influenced the source the engine leaned on, the line from your content to that off-label sentence is short.

The uncomfortable part is ownership. MLR governs what you publish. Medical affairs governs the evidence. Neither has a standing mandate over a sentence a third-party model generates when no one from your company is in the room. The drift is a regulated communication that your governance process never saw, because your governance process was built to review documents, and this is not a document. It is a reconstruction, produced on demand, different every time.

The takeaway

Claim drift is not the model lying. It is retrieval selecting an incomplete slice of your message and summarization compressing away the qualifiers that made it on-label. Both stages are working as designed, which is why patching one bad answer does nothing: the next query runs the same lossy pipeline and produces the next drift.

Two moves follow. First, give the pipeline a clean source to prefer. Your approved indication, your contraindication, and your safety language, published as plain, well-structured, machine-legible content, are more likely to be retrieved whole and compressed faithfully than a clause buried in a PDF. You cannot stop summarization from compressing, but you can make the correct claim the easiest one to compress without losing the regulated parts. Second, measure the output continuously against the approved claim, because a baseline taken once is a screenshot of a process that re-runs on every query and every model update.
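As one possible reading of "machine-legible," the sketch below holds the approved claim as a structured record and renders it both as JSON and as a single plain-text passage. The ClaimRecord type and its field names are hypothetical, and JSON is an assumed format, not a requirement of any engine. The design intent is only that the indication, its population qualifier, and the contraindication travel together from one source of truth.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class ClaimRecord:
    product: str
    indication: str
    population: str               # the qualifier that scopes the indication
    contraindications: list[str]  # fair-balance safety information

    def to_passage(self) -> str:
        """Render the record as one plain-text passage, indication first."""
        sentences = [f"{self.product} is indicated for {self.indication} in {self.population}."]
        sentences += [f"{self.product} is contraindicated in {c}." for c in self.contraindications]
        return " ".join(sentences)

    def to_json(self) -> str:
        """Serialize every field so no clause lives only in prose."""
        return json.dumps(asdict(self), indent=2)

record = ClaimRecord(
    product="Varigel",
    indication="moderate to severe Condition X",
    population="adults who have not responded to first-line therapy",
    contraindications=["patients receiving Medication Y"],
)

print(record.to_passage())
print(record.to_json())
```

With these field values the rendered passage reproduces the approved Varigel claim word for word, so the published text and the monitoring baseline can be generated from the same record rather than maintained separately.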

Juncture is built on that join. Inside, it pre-checks the approved message before MLR, comparing each asset against the label and backing the reviewer with a 21 CFR Part 11 trail. Outside, its Answer Monitor reads what the engines actually say and flags the drift as a deviation from that same known-good claim, so a dropped contraindication or an off-label "used for" surfaces as a tracked exception rather than a surprise in a forwarded screenshot. The approved sentence you cleared on the inside is the sentence Juncture watches for on the outside.

Bring one brand and its approved claim. We will run the questions your audience asks across the engines they use, show you where the qualifier already dropped, and trace each drift back to the retrieval or summarization step that caused it.

Sources

  1. Gupta et al., "Retrieval-Augmented Generation: A Comprehensive Survey of Architectures, Enhancements, and Robustness Frontiers," arXiv, 2025. arxiv.org
  2. "How LLMs Hallucinate in Multi-Document Summarization," ACL Findings (NAACL), 2025. aclanthology.org
  3. "Hallucinate at the Last in Long Response Generation: A Case Study on Long Document Summarization," arXiv, 2025. arxiv.org
  4. Wallat et al., "Correctness is not Faithfulness in Retrieval Augmented Generation Attributions," arXiv, 2024. arxiv.org
  5. Magnus et al., "Helpful or Harmful? Using ChatGPT to Answer Drug Information Questions," iForumRx, 2024. iforumrx.org
  6. Khatri et al., "Accuracy and reproducibility of ChatGPT responses to real-world drug information questions," JACCP, 2025. accpjournals.onlinelibrary.wiley.com
  7. Magesh et al., "AI on Trial: Legal Models Hallucinate in 1 out of 6 (or More) Benchmarking Queries," Stanford HAI, 2024. hai.stanford.edu
  8. "Fair Balance and Adequate Provision in Direct-to-Consumer Prescription Drug Advertisements" (21 CFR 202.1), PMC, 2016. ncbi.nlm.nih.gov

Questions this raises

What is claim drift in an AI answer engine?
Claim drift is when an answer engine restates an approved pharma claim in a way that changes its regulatory meaning, usually by dropping a qualifier, indication, or contraindication. It happens not because the model invents facts but because retrieval selects an incomplete set of passages and summarization compresses away the subordinate clauses that kept the claim on-label. The result is an on-label claim that reads off-label, assembled entirely from real sources.

How does retrieval-augmented generation (RAG) cause claim drift?
RAG ranks source passages by similarity to the question, and similarity is not completeness. A safety or contraindication passage often scores lower than the efficacy passage and falls below the retrieval cutoff, so it never enters the model context. The model then cannot include a clause it never received, so the qualifier is lost at the retrieval stage before any text is generated.

Why do language models drop qualifiers when they summarize?
Abstractive summarization is lossy compression optimized for fluency, and qualifiers are the most droppable tokens in a sentence because they are subordinate and conditional. Removing a clause like "in patients who have not responded to first-line therapy" makes the output read cleaner, so the model that optimizes for fluency erases exactly the clause that carried regulatory weight. Studies of multi-document and long-document summarization show faithfulness degrading and content being dropped as sources diversify and lengthen.

Why is claim drift structural rather than a one-off bug?
It recurs because both pipeline stages, retrieval and summarization, are lossy by design and run fresh on every query. The output is non-deterministic, so the same question can drift differently across engines or after a model update, and the regulated clauses are precisely the droppable ones. Even purpose-built RAG tools engineered to ground answers in authoritative sources still misground roughly one query in six, which shows the failure is built into retrieve-then-summarize systems, not patchable as a single bug.

What is the regulatory exposure when an answer engine drifts a claim?
A drifted answer can drop the contraindication, removing the fair balance of benefit and risk that FDA advertising rules require, and can promote an exploratory use to an indication, which reads as off-label promotion. Under 21 CFR 202.1 both are problems, and if a brand fed or influenced the source the engine relied on, the path from its content to the off-label sentence is short. The deeper exposure is that this regulated communication was never reviewed, because MLR governs documents and a model reconstruction on demand is not a document.
