Who taught the lie in RAG—and how do you trace it?
TL;DR
RAG systems are easily steered by poisoned texts in their knowledge bases. RAGOrigin introduces a black-box responsibility attribution method that, after a misgeneration, narrows the suspect set of documents and assigns each a responsibility score using three signals: retrieval similarity, semantic correlation, and generation influence. It then separates poisoned from benign texts via unsupervised clustering with a dynamic threshold. Evaluated on five QA datasets plus a 16.7M-document database, RAGOrigin consistently achieves top detection accuracy with low false positives across nine attacks, remaining fast enough for operational use. Download the [attribution runbook CSV](/checklists/rag_poisoning_attribution_runbook.csv) for actionable, step-by-step guidance.
Key Facts
RAGOrigin is a black-box responsibility attribution framework for post-attack RAG forensics.
RAGOrigin constructs an adaptive attribution scope by iteratively testing ranked text segments until ≥50% of tested subsets reproduce the incorrect response (see the sketch after these facts).
Each candidate text receives a responsibility score combining embedding similarity, semantic correlation, and generation influence; scores are z-normalized and aggregated.
Final poisoned/benign labeling uses unsupervised clustering with dynamic threshold—no labels required.
On the 16.7M-text KB, RAGOrigin attains DACC ≥0.98 with FPR ≤0.02 across all nine attacks.
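The adaptive scope construction can be pictured with a short sketch. This is a minimal illustration, not RAGOrigin's actual implementation: the function name `build_attribution_scope`, the `reproduces_misgeneration` callback, and the random subset-sampling strategy are assumptions made for readability; only the stopping rule (grow the scope until at least half of the tested subsets reproduce the incorrect response) follows the paper's description.

```python
import random
from typing import Callable, List

def build_attribution_scope(
    ranked_texts: List[str],
    reproduces_misgeneration: Callable[[List[str]], bool],
    num_subsets: int = 10,
    reproduce_ratio: float = 0.5,
) -> List[str]:
    """Grow the candidate set along the retrieval ranking until sampled
    subsets of it reproduce the misgeneration often enough.

    reproduces_misgeneration(texts) should re-run the frozen RAG pipeline
    with only `texts` as context and return True if the incorrect response
    comes back; it is a placeholder for your own test harness.
    """
    for cutoff in range(1, len(ranked_texts) + 1):
        candidates = ranked_texts[:cutoff]
        hits = 0
        for _ in range(num_subsets):
            # Test a random half of the current candidates against the pipeline.
            subset = random.sample(candidates, k=max(1, len(candidates) // 2))
            if reproduces_misgeneration(subset):
                hits += 1
        if hits / num_subsets >= reproduce_ratio:
            return candidates  # scope is large enough to explain the misgeneration
    return ranked_texts        # fall back to every ranked text
```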
Implementation Steps
Trigger an attribution run when a misgeneration is confirmed → User report + reproduced bad Q→A pair; chat/session logs.
Freeze the KB and retrieval index (snapshot) → Versioned KB dump, vector index commit ID, retriever/LLM config hash (see the config-hash sketch after these steps).
Construct the adaptive attribution scope → Ranked text list; segment evaluations; match/non-match judgments.
Compute responsibility scores per text → Per-text similarity, semantic correlation, and generation-influence scores, z-normalized (see the scoring sketch after these steps).
Cluster and apply dynamic thresholding → Clustering parameters, cluster assignments, threshold rationale, confusion matrix (see the clustering sketch after these steps).
Validate against alternative retrievers/LLMs → Replicated runs using alternate retriever or LLM; deltas logged.
Quarantine and purge → Blocklist entries; removed docs; index rebuild logs.
File a post-incident package → Timeline, poisoned-doc hashes/URLs, attribution rationale, ASR before/after.
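The sketches below illustrate three of the steps above under simplifying assumptions. For the snapshot step, the config hash can be produced with a few lines of standard-library code. The field names and values are hypothetical placeholders; the point is that the hash is computed over a canonicalized (key-sorted) serialization so it can be reproduced in the post-incident package.

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """SHA-256 over a canonical JSON serialization of the retriever/LLM config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical snapshot record; names and versions are illustrative only.
snapshot_record = {
    "kb_dump": "kb_2025-01-15.jsonl",
    "vector_index_commit": "index@a1b2c3d",
    "config_sha256": config_hash({
        "retriever": "contriever",
        "top_k": 5,
        "llm": "gpt-4o-mini",
        "temperature": 0.0,
    }),
}
print(json.dumps(snapshot_record, indent=2))
```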
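The scoring step can be sketched as follows. The three input arrays stand for the signals named above (retrieval similarity, semantic correlation, generation influence); how each is measured, and the equal-weight sum used here, are simplifications for illustration rather than the paper's exact formulation.

```python
import numpy as np

def responsibility_scores(
    retrieval_sim: np.ndarray,   # similarity of each candidate text to the query
    semantic_corr: np.ndarray,   # how strongly each text supports the incorrect answer
    gen_influence: np.ndarray,   # effect of including the text on producing the bad output
) -> np.ndarray:
    """Z-normalize each signal so no single scale dominates, then aggregate."""
    def z(x: np.ndarray) -> np.ndarray:
        return (x - x.mean()) / (x.std() + 1e-9)  # epsilon guards against zero variance

    return z(retrieval_sim) + z(semantic_corr) + z(gen_influence)
```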
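Finally, the labeling step separates poisoned from benign texts without any labeled examples. The sketch below uses a two-cluster KMeans over the one-dimensional responsibility scores as a stand-in for the paper's clustering, and places the decision threshold between the two cluster centers so it adapts to each incident's score distribution.

```python
import numpy as np
from sklearn.cluster import KMeans

def flag_poisoned(scores: np.ndarray) -> np.ndarray:
    """Cluster responsibility scores into two groups and flag the higher one."""
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scores.reshape(-1, 1))
    threshold = km.cluster_centers_.ravel().mean()  # dynamic: derived from this run's scores
    return scores >= threshold                      # True = candidate flagged as poisoned
```

Chaining `responsibility_scores` into `flag_poisoned` yields the per-text assignments and the threshold rationale recorded as evidence for this step.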
Glossary
- RAG: Retrieval-Augmented Generation; the LLM grounds its answer in retrieved texts.
- Misgeneration: Incorrect or attacker-desired output produced by a RAG system.
- Attribution scope: Focused candidate set of texts likely to have caused the misgeneration.
- Responsibility score: Composite metric estimating a text's causal role in the error.
- ASR: Attack Success Rate; the share of questions yielding the attacker's target output.
- FPR/FNR/DACC: False positive rate, false negative rate, and detection accuracy.
References
- [1] Who Taught the Lie? Responsibility Attribution for Poisoned Knowledge in Retrieval-Augmented Generation. https://arxiv.org/html/2509.13772v1
Machine-readable Facts
[
{
"id": "f-ragorigin-method",
"claim": "RAGOrigin is a black-box responsibility attribution method for post-attack RAG forensics.",
"source": "https://arxiv.org/html/2509.13772v1"
},
{
"id": "f-adaptive-scope",
"claim": "RAGOrigin builds an adaptive attribution scope by testing ranked text segments until at least half reproduce the wrong output.",
"source": "https://arxiv.org/html/2509.13772v1"
},
{
"id": "f-responsibility-scoring",
"claim": "Responsibility scoring combines similarity, semantic correlation, and generation influence using z-score normalization.",
"source": "https://arxiv.org/html/2509.13772v1"
},
{
"id": "f-unsupervised-clustering",
"claim": "Labeling uses unsupervised clustering with a dynamic threshold; no labels are required.",
"source": "https://arxiv.org/html/2509.13772v1"
},
{
"id": "f-performance-large-kb",
"claim": "On the 16.7M-text KB, RAGOrigin achieves DACC ≥ 0.98 with FPR ≤ 0.02 across attacks.",
"source": "https://arxiv.org/html/2509.13772v1"
}
]