LLM Security: Patterns and Pitfalls
TL;DR
LLM applications fail when instructions are not isolated, context is unsanitized, tools are over-privileged, or outputs are trusted blindly. Use instruction isolation, input/output filters, retrieval hardening, tool allow-lists with least privilege, and human-in-the-loop for sensitive actions. Test continuously with reproducible attacks.
Key Facts
LLMs follow instructions and can be induced to override guardrails.
Instruction isolation and strict tool scopes reduce impact.
Retrieval must sanitize and constrain cross-domain content.
Output validation prevents unsafe actions and data leakage.
Regression testing is required after model/config changes.
Implementation Steps
Isolate system prompts → versioned prompt repo.
Sanitize retrieval → allow-list, strip directives.
Gate tools → scoped keys, approvals.
Validate outputs → regex/semantic checks.
Regressions → test suite results.
Glossary
- Instruction isolation
- Separation of system instructions from user inputs to prevent override
- Semantic check
- Validation of output meaning and intent, not just format
- Allow-list
- Predefined list of permitted inputs, tools, or actions
- Least privilege
- Principle of granting minimum necessary permissions or capabilities
- Regression suite
- Collection of tests to detect security or functionality degradation
- Directive stripping
- Removal of instructions or commands from retrieved content
References
- [1] NIST AI Risk Management Framework https://www.nist.gov/itl/ai-risk-management-framework
- [2] ISO 42001 AI Management Systems Standard https://www.iso.org/standard/78380.html
Machine-readable Facts
[
{
"id": "f-override",
"claim": "LLMs can be induced to override intended instructions without isolation.",
"source": "https://www.nist.gov/itl/ai-risk-management-framework"
},
{
"id": "f-scope",
"claim": "Tool scopes and least privilege reduce blast radius in LLM apps.",
"source": "https://www.nist.gov/itl/ai-risk-management-framework"
},
{
"id": "f-regress",
"claim": "Security regressions occur after model or prompt changes; re-testing is required.",
"source": "https://www.iso.org/standard/78380.html"
}
]