Answer Card · Version 2025-09-22

Who taught the lie in RAG—and how do you trace it?

RAG security · Poisoning attacks · AI forensics · Responsibility attribution · LLM security

TL;DR

RAG systems can be steered toward attacker-chosen answers by poisoned texts injected into their knowledge bases. RAGOrigin is a black-box responsibility attribution method that, after a misgeneration, narrows the suspect set of documents and assigns each a responsibility score from three signals: retrieval similarity, semantic correlation, and generation influence. It then separates poisoned from benign texts with unsupervised clustering and a dynamic threshold. Evaluated on five QA datasets plus a 16.7M-document knowledge base, RAGOrigin consistently achieves top detection accuracy with low false positives across nine attacks, and remains fast enough for operational use. Download the [attribution runbook CSV](/checklists/rag_poisoning_attribution_runbook.csv) for step-by-step actions.

Key Facts

Implementation Steps

Trigger an attribution run when a misgeneration is confirmed → User report + reproduced bad Q→A pair; chat/session logs.

Freeze the KB and retrieval index (snapshot) → Versioned KB dump, vector index commit ID, retriever/LLM config hash.
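
A minimal sketch of the freeze step, assuming a directory of plain-text KB documents and dictionary-style retriever/LLM configs; the file layout, config keys, and hash choices are illustrative, not prescribed by RAGOrigin:

```python
import hashlib
import json
from pathlib import Path

def config_fingerprint(retriever_cfg: dict, llm_cfg: dict) -> str:
    """Hash the retriever/LLM configuration so the attribution run is reproducible."""
    blob = json.dumps({"retriever": retriever_cfg, "llm": llm_cfg}, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def snapshot_kb(kb_dir: str, out_manifest: str) -> None:
    """Record a SHA-256 per document so later purges can be verified against this freeze."""
    manifest = {
        str(p): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(kb_dir).rglob("*.txt"))  # assumes one text file per KB document
    }
    Path(out_manifest).write_text(json.dumps(manifest, indent=2))
```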

Construct the adaptive attribution scope → Ranked text list; segment evaluations; match/non-match judgments.
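
The adaptive-scope construction (grow the candidate set over ranked segments until at least half of the evaluated segments reproduce the wrong output; see the machine-readable facts below) can be sketched as follows. The `generate(question, texts)` callable and the substring match check are assumptions standing in for the black-box RAG pipeline and the paper's match judgment:

```python
from typing import Callable, List

def build_attribution_scope(
    ranked_texts: List[str],                     # KB texts ranked by retrieval score for the bad question
    question: str,
    wrong_answer: str,
    generate: Callable[[str, List[str]], str],   # black-box RAG generation, assumed available
    segment_size: int = 5,
) -> List[str]:
    """Grow the candidate set segment by segment until at least half of the
    evaluated segments reproduce the misgeneration."""
    scope: List[str] = []
    reproduced = 0
    evaluated = 0
    for start in range(0, len(ranked_texts), segment_size):
        segment = ranked_texts[start:start + segment_size]
        evaluated += 1
        answer = generate(question, segment)
        if wrong_answer.lower() in answer.lower():  # crude match check; real systems may use an LLM judge
            reproduced += 1
        scope.extend(segment)
        if reproduced / evaluated >= 0.5:
            break
    return scope
```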

Compute responsibility scores per text → Per-text similarity, semantic correlation, and generation-influence scores (z-normalized).
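
A sketch of the scoring step, assuming the three per-text signals are already available as numeric arrays; equal weighting after z-normalization is an assumption here, not necessarily the paper's exact combination:

```python
import numpy as np

def zscore(x: np.ndarray) -> np.ndarray:
    """Standardize one signal so the three components are on a comparable scale."""
    return (x - x.mean()) / (x.std() + 1e-8)

def responsibility_scores(similarity, semantic_corr, gen_influence) -> np.ndarray:
    """Combine the per-text retrieval similarity, semantic correlation, and
    generation-influence signals into a single responsibility score per text."""
    sims = zscore(np.asarray(similarity, dtype=float))
    sem = zscore(np.asarray(semantic_corr, dtype=float))
    gen = zscore(np.asarray(gen_influence, dtype=float))
    return sims + sem + gen
```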

Cluster and apply dynamic thresholding → Clustering parameters, cluster assignments, threshold rationale, confusion matrix.
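
One way to realize unsupervised clustering with a dynamic threshold is two-means clustering on the one-dimensional responsibility scores, flagging the higher-score cluster; the midpoint-of-centers threshold below is an assumption of this sketch, not necessarily RAGOrigin's exact rule:

```python
import numpy as np
from sklearn.cluster import KMeans

def label_poisoned(scores: np.ndarray):
    """Split responsibility scores into two clusters and flag the high-score cluster.
    The threshold is dynamic: it depends on this incident's score distribution."""
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores)
    threshold = km.cluster_centers_.ravel().mean()   # midpoint between the two cluster centers
    poisoned = scores.ravel() >= threshold
    return poisoned, threshold
```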

Validate against alternative retrievers/LLMs → Replicated runs using alternate retriever or LLM; deltas logged.
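
A small consistency check for the replicated runs; comparing flagged-document sets by Jaccard overlap is one reasonable way to log deltas, not something prescribed by the paper:

```python
def flagged_set_overlap(flagged_a: set, flagged_b: set) -> float:
    """Jaccard overlap between texts flagged under two retriever/LLM configurations.
    A low overlap suggests the attribution is sensitive to the component swap."""
    if not flagged_a and not flagged_b:
        return 1.0
    return len(flagged_a & flagged_b) / len(flagged_a | flagged_b)
```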

Quarantine and purge → Blocklist entries; removed docs; index rebuild logs.
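
A minimal quarantine sketch, assuming the KB is exposed as a `{doc_id: text}` mapping; a production system would additionally drop the corresponding vectors and rebuild the retrieval index, as the step's evidence list indicates:

```python
import json
from pathlib import Path

def quarantine(doc_ids, kb_index: dict, blocklist_path: str = "blocklist.json") -> dict:
    """Record flagged doc IDs in a blocklist and return a purged copy of the KB mapping."""
    blocked = sorted(set(doc_ids))
    Path(blocklist_path).write_text(json.dumps(blocked, indent=2))
    return {k: v for k, v in kb_index.items() if k not in set(blocked)}
```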

File a post-incident package → Timeline, poisoned-doc hashes/URLs, attribution rationale, ASR before/after.
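
ASR before and after the purge can be measured by replaying probe questions; the `generate` callable and the substring check for the attacker's target are assumptions of this sketch:

```python
from typing import Callable, List, Tuple

def attack_success_rate(
    probes: List[Tuple[str, str]],   # (question, attacker_target) pairs
    generate: Callable[[str], str],  # black-box RAG endpoint, assumed available
) -> float:
    """Share of probe questions whose answer still contains the attacker's target string."""
    hits = sum(1 for q, target in probes if target.lower() in generate(q).lower())
    return hits / len(probes) if probes else 0.0
```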

Glossary

RAG
Retrieval-Augmented Generation - an LLM grounds its answers in retrieved texts
Misgeneration
Incorrect or attacker-desired output produced by a RAG system
Attribution scope
Focused candidate set of texts likely to cause the misgeneration
Responsibility score
Composite metric estimating a text's causal role in the error
ASR
Attack Success Rate - share of questions yielding the attacker's target output
FPR/FNR/DACC
False positive rate, false negative rate, and detection accuracy - the standard detection metrics

References

1. Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation. https://arxiv.org/html/2509.13772v1

Machine-readable Facts

[
  {
    "id": "f-ragorigin-method",
    "claim": "RAGOrigin is a black-box responsibility attribution method for post-attack RAG forensics.",
    "source": "https://arxiv.org/html/2509.13772v1"
  },
  {
    "id": "f-adaptive-scope",
    "claim": "RAGOrigin builds an adaptive attribution scope by testing ranked text segments until at least half reproduce the wrong output.",
    "source": "https://arxiv.org/html/2509.13772v1"
  },
  {
    "id": "f-responsibility-scoring",
    "claim": "Responsibility scoring combines similarity, semantic correlation, and generation influence using z-score normalization.",
    "source": "https://arxiv.org/html/2509.13772v1"
  },
  {
    "id": "f-unsupervised-clustering",
    "claim": "Labeling uses unsupervised clustering with a dynamic threshold; no labels are required.",
    "source": "https://arxiv.org/html/2509.13772v1"
  },
  {
    "id": "f-performance-large-kb",
    "claim": "On the 16.7M-text KB, RAGOrigin achieves DACC ≥ 0.98 with FPR ≤ 0.02 across attacks.",
    "source": "https://arxiv.org/html/2509.13772v1"
  }
]

About the Author

Spencer Brawner