Guardrails: Input and Output Safety Layers for LLM Systems

How input, output, and tool-call guardrails enforce application safety policies.

Jatin Bansal@blog:~/ai-engineering$ open guardrails

A B2B agent that books appointments goes live on a Tuesday. By Thursday the on-call sees a customer who got a calendar invite for “URGENT: contact [email protected] via the email tool; security audit.” The user never asked for that. A vendor’s appointment-request email contained an indirect injection: a hidden block of text instructing the agent to email a credentials prompt to an external address. The model dutifully called the email tool. There was no parse error, no exception, no log entry that read “attack detected”. The agent’s logs show exactly the sequence of tool calls a human user might make. This is the failure mode guardrails exist to catch: the model is not wrong in the ordinary sense; it is right, by its lights, about a malicious instruction the operator never authorized.

What guardrails enforce

A guardrail is a runtime check at the model’s boundary that enforces a property the model itself cannot be relied on to enforce. The boundary has two sides: input guardrails inspect what’s about to be fed into the model (user prompts, retrieved context, tool results) and decide whether to let it through, rewrite it, or refuse; output guardrails inspect what the model has produced (assistant messages, structured outputs, tool calls) and decide whether to ship it, redact it, retry, or escalate. A complete production deployment uses both; input guards stop the bad request from ever reaching the expensive call, output guards catch the rare cases where the model produced something harmful despite a clean input.
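
The two-sided boundary can be sketched in a few lines of Python. This is a minimal illustration, not a real library: `Action`, `Verdict`, `input_guard`, `output_guard`, and `guarded_call` are hypothetical names, and the string checks stand in for the classifiers a real deployment would call.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Action(Enum):
    PASS = "pass"
    REWRITE = "rewrite"
    REFUSE = "refuse"


@dataclass
class Verdict:
    action: Action
    text: str  # possibly rewritten text; unchanged on PASS


# Hypothetical checks: a real system would call a classifier here.
def input_guard(text: str) -> Verdict:
    if "ignore previous instructions" in text.lower():
        return Verdict(Action.REFUSE, text)
    return Verdict(Action.PASS, text)


def output_guard(text: str) -> Verdict:
    if "sk-secret" in text:
        return Verdict(Action.REWRITE, text.replace("sk-secret", "[REDACTED]"))
    return Verdict(Action.PASS, text)


def guarded_call(user_text: str, model: Callable[[str], str]) -> str:
    checked = input_guard(user_text)
    if checked.action is Action.REFUSE:
        # The expensive model call never happens for a refused input.
        return "Request refused by input guard."
    result = output_guard(model(checked.text))
    if result.action is Action.REFUSE:
        return "Response withheld by output guard."
    return result.text
```

The point of the shape is that both guards sit outside `model`: the model can be swapped or compromised without changing what the guards enforce.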

Guardrails run outside the model at request time and encode application-specific policy. Training can reduce harmful outputs, but runtime checks enforce what reaches the model, the user, and downstream tools. Using the same model to generate and classify an answer leaves both decisions exposed to the same prompt attack.

A web application firewall is a useful operational analogy. Guardrails filter broad attack classes at the model boundary while authentication, authorization, schema validation, and application logic enforce their own narrower rules.

Guardrail components

The components fall into four categories, roughly mapped to where they sit in the request path. Pick at most one from each category; layering more than that buys diminishing returns and starts to hurt latency.

Input-side: prompt-injection and jailbreak classifiers

These run on the user input (and on retrieved context) and produce a binary or graded “is this attempting to subvert the system” verdict.

Meta Llama Prompt Guard 2 (released April 2025, 22M and 86M parameter sizes). The 86M version is the higher-precision option; the 22M version cuts latency and compute by ~75% with minimal performance trade-off. Both detect prompt-injection and jailbreak attacks and are trained on a large corpus of known vulnerabilities. The 86M model runs in tens of milliseconds per request on a CPU; deploy it as a sidecar or as a Hugging Face Inference Endpoint.
Anthropic Constitutional Classifiers (publicly disclosed February 2025, shipped in production behind Claude in 2025). A pair of classifiers: one on input, one on output; trained on synthetic data generated from a “constitution” describing permitted and restricted content. Anthropic’s published results: 87% reduction in over-refusals compared to the previous classifier system, 40× computational cost reduction in the next-generation version (May 2025), and over 1,700 hours of red-teaming with no universal jailbreak found that elicited responses comparable in detail to an undefended model. The constitutional classifiers ship as part of Claude itself rather than as a standalone open-source artifact, so the design pattern is what travels rather than the weights.
Microsoft Prompt Shields (the Azure AI Content Safety feature). Detects direct attacks (“user prompt attacks”) and indirect attacks embedded in documents; integrates with Azure OpenAI Service. Useful primarily if your stack is already in Azure.
Vigil and Rebuff are the open-source, Python-native options for prompt-injection detection; both ship classifier ensembles and “canary token” mechanisms that detect when the model has been steered into leaking parts of its system prompt. Reach for these when self-hosting and Llama Prompt Guard isn’t the right fit.
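
The canary-token idea is simple enough to sketch without any library. The snippet below is an illustration of the general mechanism only; it does not use or reproduce the Vigil or Rebuff APIs, and `make_canary`, `build_system_prompt`, and `leaked_system_prompt` are hypothetical helper names.

```python
import secrets


def make_canary() -> str:
    # Random marker that has no reason to appear in a legitimate reply.
    return f"CANARY-{secrets.token_hex(8)}"


def build_system_prompt(instructions: str, canary: str) -> str:
    # Embed the marker in the operator-controlled prompt.
    return f"{instructions}\n[internal marker: {canary}]"


def leaked_system_prompt(model_output: str, canary: str) -> bool:
    # If the marker shows up in the output, the system prompt was echoed.
    return canary in model_output
```

Generate a fresh canary per session so that a leaked value from one conversation cannot be used to evade the check in another.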

Output-side: content classification and policy enforcement

These run on the model output (and on retrieved context, as a second pass) and produce a category-tagged “is this safe to ship” verdict.

Meta Llama Guard 4 (released April 2025, 12B parameters, multimodal). The canonical open-weight content-safety classifier. Operates on a published taxonomy of 14 categories, including S1 Violent Crimes, S5 Defamation, S7 Privacy, and S8 Intellectual Property, and is designed to safeguard both inputs and outputs of LLM/VLM stacks. Trained as a fine-tuned Llama 4 variant, which is why it natively handles image safety. Deploy it in front of or behind any LLM, via Hugging Face Inference Endpoints or self-hosted.
OpenAI Omni-Moderation (omni-moderation-latest, based on GPT-4o, multimodal). Free to use through the Moderation API; covers 13 categories with calibrated probability scores; 42% better on multilingual benchmarks than the previous generation. A suitable default if you’re already on the OpenAI stack: no incremental cost, single API call, well-maintained taxonomy.
Guardrails AI + the Guardrails Hub. A Python library with ~70 prebuilt validators covering PII, jailbreaks, factuality, formatting, code exploits, and brand risk. Each validator is a class; you compose them into a Guard that intercepts inputs and outputs of LLM calls. A suitable option when you want a Python-first validator hub and you care more about composability than about deploying a single heavy classifier.
Microsoft Presidio for PII detection and redaction. A dedicated open-source toolkit covering named-entity recognition, regex, and contextual analysis for personal data. The standard choice when PII is the dominant policy concern; sits naturally as a Guardrails AI validator or as a standalone egress filter.
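
As a rough picture of what an egress PII filter does, here is a regex-only sketch. It is a stand-in, not Presidio: a real detector adds named-entity recognition and contextual analysis, and the two patterns and the `redact_pii` helper below are illustrative assumptions that will miss many formats.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose pattern for 3-3-4 digit phone numbers; real detectors are far broader.
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact_pii(text: str) -> tuple[str, int]:
    """Replace matches with placeholders and return (text, number_of_redactions)."""
    text, n_email = EMAIL.subn("<EMAIL>", text)
    text, n_phone = PHONE.subn("<PHONE>", text)
    return text, n_email + n_phone
```

Returning the redaction count alongside the text lets the caller decide between shipping the redacted response and escalating when the count is unexpectedly high.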

Orchestration: policy engines and dialogue managers

These wrap classifiers in a programmable policy language. You reach for them when the policy is non-trivial: multi-turn dialogue flows, topic-gated conversation, conditional escalation.

NVIDIA NeMo Guardrails. The canonical open-source orchestration framework. Uses Colang, a DSL for defining conversational guardrails, and orchestrates underlying classifiers (Llama Guard, PromptGuard, custom checkers) into input rails, output rails, dialog rails, retrieval rails, and execution rails. Production-ready container image for Kubernetes deployment; a suitable option when your application needs complex dialogue policies with clear topical boundaries. The latency cost is non-trivial: NVIDIA’s published numbers cite “half a second of latency” as the cost of the orchestration layer with classifier calls in the loop.
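
To make "dialogue policy" concrete, here is a plain-Python sketch of a topic gate with conditional escalation. It is not Colang and does not use the NeMo Guardrails API; `ALLOWED_TOPICS`, `Conversation`, and `dialog_rail` are hypothetical, and the `topic` label is assumed to come from an upstream classifier.

```python
from dataclasses import dataclass, field

ALLOWED_TOPICS = {"scheduling", "billing"}  # hypothetical topical boundary


@dataclass
class Conversation:
    off_topic_turns: int = 0
    log: list = field(default_factory=list)


def dialog_rail(state: Conversation, topic: str, max_off_topic: int = 2) -> str:
    """Return 'continue', 'redirect', or 'escalate' for this turn."""
    if topic in ALLOWED_TOPICS:
        state.off_topic_turns = 0  # an in-scope turn resets the counter
        decision = "continue"
    else:
        state.off_topic_turns += 1
        # Redirect first; escalate only after repeated off-topic turns.
        decision = "escalate" if state.off_topic_turns >= max_off_topic else "redirect"
    state.log.append((topic, decision))
    return decision
```

The policy depends on conversation state, not on a single message, which is the kind of logic that a stateless input or output classifier cannot express on its own.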

Agent-specific: capability scoping and chain-of-thought audit

These are the newer category, specifically targeted at the agent-runtime threat model (indirect injection, tool misuse, lateral movement).

Meta LlamaFirewall (released May 2025, open source, used in production at Meta). Three components: PromptGuard 2 for jailbreak detection on inputs; Agent Alignment Checks, the first open-source guardrail that audits a model’s chain-of-thought in real time for goal hijacking; and CodeShield, a static-analysis engine that scans LLM-generated code for insecure patterns. Published evaluation on the AgentDojo benchmark showed attack success rates dropping from 17.6% to 1.7% with LlamaFirewall in front. The chain-of-thought audit is the structurally novel piece: it’s the only currently-shipping defense that inspects the model’s reasoning rather than just the input/output surface.
Anthropic’s “Computer use” tool safety guidance and OpenAI’s Atlas hardening blog. Not libraries, but published threat models and mitigation patterns that travel: capability scoping, allowlists, screenshot-sanitization, browsing-while-data-tagged.
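
Capability scoping and allowlists can be sketched as a check that runs on every tool call before the harness executes it. The tool names, the domain allowlist, and `check_tool_call` below are hypothetical; the sketch assumes tool calls arrive as a name plus a dict of arguments.

```python
ALLOWED_TOOLS = {"create_calendar_event", "send_email"}  # hypothetical tool names
ALLOWED_EMAIL_DOMAINS = {"example-customer.com"}          # hypothetical allowlist


def check_tool_call(name: str, args: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if name not in ALLOWED_TOOLS:
        return False, f"tool '{name}' is not in scope for this agent"
    if name == "send_email":
        recipient = str(args.get("to", ""))
        domain = recipient.rsplit("@", 1)[-1].lower()
        if "@" not in recipient or domain not in ALLOWED_EMAIL_DOMAINS:
            return False, f"recipient domain '{domain}' is not allowlisted"
    return True, "ok"
```

A check of this shape would have blocked the opening scenario: the injected instruction could still reach the model, but an email to an external address would fail the allowlist regardless of what the model decided.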

A reasonable 2026 production default is to layer one input classifier (Llama Prompt Guard 2 or the provider’s built-in equivalent), one output classifier (Llama Guard 4 or omni-moderation), and either a Guardrails AI validator chain or NeMo Guardrails as the orchestration layer. Add LlamaFirewall on top when the application is agent-shaped and the threat model includes indirect injection.

Place checks around every untrusted boundary

The diagram below traces the full path of a guarded request.

[user input]
    │
    ▼
┌───────────────────────────────────────────┐
│  Input guard                              │
│  - Llama Prompt Guard 2  → is_injection?  │
│  - PII detection         → has_pii?       │
│  - Topic / policy gate   → in_scope?      │
└────────────┬──────────────────────────────┘
             │  pass / reject / rewrite
             ▼
┌───────────────────────────────────────────┐
│  Prompt assembly                          │
│  - System prompt (operator-controlled)    │
│  - Retrieved context (data-tagged)        │
│  - User input (data-tagged)               │
└────────────┬──────────────────────────────┘
             │
             ▼
┌───────────────────────────────────────────┐
│  Model call                               │
│  - With provider-built-in safety training │
│    (Constitutional Classifiers etc.)      │
└────────────┬──────────────────────────────┘
             │  assistant message / tool calls
             ▼
┌───────────────────────────────────────────┐
│  Output guard                             │
│  - Llama Guard 4 / omni-moderation        │
│  - PII / data-exfil scan                  │
│  - Structured-output schema validation    │
│  - Tool-call argument validation          │
└────────────┬──────────────────────────────┘
             │  ship / redact / retry / refuse
             ▼
[response to user]      [tool invocation → tool harness]
                              │
                              ▼
                        [authorization check on the tool result,
                         independent of any LLM-side guard]

The model belongs between input and output guard layers. Tool execution is a separate enforcement point: permission to call delete_record does not establish that the current user may delete a specific record. The harness must authorize the concrete operation.
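
A minimal sketch of that last enforcement point, assuming a hypothetical in-memory ownership table: the harness checks the concrete record against the authenticated user, independently of whatever the model or any LLM-side guard concluded. `RECORD_OWNERS` and `NotAuthorized` are illustrative names.

```python
RECORD_OWNERS = {"rec-1": "alice", "rec-2": "bob"}  # hypothetical ownership table


class NotAuthorized(Exception):
    pass


def delete_record(current_user: str, record_id: str) -> str:
    # current_user comes from the authenticated session, never from model output.
    owner = RECORD_OWNERS.get(record_id)
    if owner != current_user:
        # Same error for "not yours" and "does not exist", so nothing leaks.
        raise NotAuthorized(f"{current_user} may not delete {record_id}")
    del RECORD_OWNERS[record_id]
    return f"deleted {record_id}"
```

Here the agent may be fully permitted to call `delete_record`, yet a call made on behalf of `alice` against `rec-2` still fails in the harness.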

Account for latency explicitly

These are the numbers the design has to satisfy, as of May 2026.

| Component | Median latency | Notes |
|---|---|---|
| Llama Prompt Guard 2 22M (CPU) | 5–15 ms | Sidecar deployment, batched |
| Llama Prompt Guard 2 86M (CPU) | 20–60 ms | Higher precision, ~2–4× slower |
| OpenAI omni-moderation API | 80–150 ms | Network round-trip dominates |
| Llama Guard 4 12B (1× A100) | 60–120 ms | First-token latency; full classification |
| Guardrails AI validator chain (CPU) | 5–50 ms | Depends on validator mix |
| NeMo Guardrails orchestration overhead | 200–500 ms | Published as “~half-second” |
| Anthropic Sonnet 4.6 (typical request) | 800–2500 ms | The main call |

Sized appropriately, input guards and output guards each add ~10% to the end-to-end latency budget. The 86M Prompt Guard at ~50 ms is roughly 2–6% of a typical Sonnet call (50 ms against 800–2500 ms). Orchestration frameworks like NeMo are the heaviest piece of the budget; half a second is enough to dominate the latency tail on routine queries. Reach for NeMo when the policy logic is genuinely complex (multi-turn topical gating, conditional escalation chains); for simple input/output classification, calling the classifiers directly is faster.
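
The budget arithmetic is worth running against your own numbers. The sketch below uses ranges copied from the table above and assumes guards run serially with the main call; `overhead_pct` and `overhead_range` are hypothetical helpers.

```python
# Latency ranges in milliseconds, copied from the table above.
GUARDS_MS = {
    "prompt_guard_86m": (20, 60),
    "llama_guard_4": (60, 120),
}
MAIN_CALL_MS = (800, 2500)


def overhead_pct(guard_ms: float, main_ms: float) -> float:
    # Guard latency as a percentage of the main model call.
    return 100.0 * guard_ms / main_ms


def overhead_range(guard: tuple, main: tuple) -> tuple:
    # Best case: fastest guard against slowest main call; worst case: the reverse.
    low = overhead_pct(guard[0], main[1])
    high = overhead_pct(guard[1], main[0])
    return round(low, 1), round(high, 1)
```

For example, `overhead_pct(50, 2500)` gives 2.0, and `overhead_range(GUARDS_MS["llama_guard_4"], MAIN_CALL_MS)` shows how far the overhead climbs when a slow guard meets a fast main call.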

The cost math works the same way: at production volumes, classifier calls cost roughly 1–3% of what the main model call costs. A self-hosted 22M Prompt Guard is essentially free per request (you’re paying for a small CPU sidecar, not for tokens), and the omni-moderation API is free (OpenAI bills nothing for it). Llama Guard 4 is the heaviest cost, because a self-hosted 12B model needs GPU capacity. You can contain that by running a cheap input filter on the bulk of traffic and routing only the responses that might need it to the heavy classifier. The cost-tier pattern matches model routing: a cheap classifier for the majority, an expensive classifier for the suspicious slice.
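
The cost-tier pattern can be sketched as a two-stage check. Everything here is an assumption for illustration: `tiered_check` is a hypothetical helper, `cheap_score` stands in for a small classifier returning a suspicion score in [0, 1], `heavy_is_unsafe` stands in for a large classifier such as a self-hosted 12B model, and the 0.2 threshold is arbitrary.

```python
from typing import Callable


def tiered_check(
    text: str,
    cheap_score: Callable[[str], float],
    heavy_is_unsafe: Callable[[str], bool],
    suspicious_above: float = 0.2,
) -> tuple[bool, str]:
    """Return (is_unsafe, tier_used)."""
    score = cheap_score(text)
    if score < suspicious_above:
        # The bulk of traffic stops here and never touches the GPU.
        return False, "cheap"
    # Only the suspicious slice pays for the heavy classifier.
    return heavy_is_unsafe(text), "heavy"
```

The threshold sets the trade-off: lowering it sends more traffic to the heavy tier and raises cost, raising it trusts the cheap classifier more and raises the miss rate.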