Agent Content Guardrails

Implement modular content filtering and safety checks for AI agent interactions. This guide covers rule-based phrase matching and advanced LLM-based content safety engines, showing how to prevent harmful content in both user inputs and AI outputs.

Guardrails integrate seamlessly with agent execution—configure once and they work locally (via agent.run()) or remotely (via agent.deploy() + agent.run()). The SDK automatically handles middleware injection and serialization.

Overview

Agent Content Guardrails provide modular content filtering and safety checks for AI agent interactions. They help prevent harmful content in both user inputs and AI outputs, making them essential for security-conscious organizations and developers who need content safety controls.

Guardrails work by checking content against predefined safety rules before and after AI model interactions. When unsafe content is detected, execution is halted and a warning message is returned.

Key Features

  • Multiple Engine Types: Rule-based (PhraseMatcherEngine) and LLM-based (NemoGuardrailEngine) filtering

  • Flexible Configuration: Check inputs only, outputs only, or both

  • Fail-Fast Behavior: Stops on first safety violation for immediate response

  • Agent Integration: Seamless integration with existing agent workflows

  • Optional Dependencies: Works without requiring additional packages for basic usage

Installation

Guardrails are included as an optional dependency. Install with:
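The extras name below matches the one given in the Troubleshooting section; adjust it if your distribution names the extra differently:

```shell
pip install "glaip-sdk[guardrails]"
```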

Quick Start

Basic Phrase Matching
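The SDK's exact import paths are not reproduced here; the sketch below is a self-contained stand-in that mirrors the documented `PhraseMatcherEngine` behavior (exact phrase matching against a banned list, returning a `GuardrailResult`). Case-insensitive matching is an illustrative choice, not a confirmed default:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GuardrailResult:
    """Mirrors the documented result object: is_safe, reason, filtered_content."""
    is_safe: bool
    reason: Optional[str] = None
    filtered_content: Optional[str] = None

class PhraseMatcherEngine:
    """Stand-in for the SDK's rule-based engine: blocks exact phrase matches."""
    def __init__(self, banned_phrases: List[str]):
        self.banned_phrases = banned_phrases

    def check(self, text: str) -> GuardrailResult:
        lowered = text.lower()
        for phrase in self.banned_phrases:
            if phrase.lower() in lowered:
                return GuardrailResult(is_safe=False,
                                       reason=f"banned phrase: {phrase!r}")
        return GuardrailResult(is_safe=True)

engine = PhraseMatcherEngine(banned_phrases=["credit card number", "ssn"])
print(engine.check("What is your SSN?").is_safe)   # False (blocked)
print(engine.check("Hello there").is_safe)         # True (allowed)
```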

Advanced LLM-Based Filtering
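An LLM-based engine such as `NemoGuardrailEngine` delegates the safety verdict to a model. The real model call is not reproduced here; this sketch stubs the classifier with a toy heuristic so the async check flow is runnable:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None
    filtered_content: Optional[str] = None

async def classify_with_llm(text: str) -> bool:
    """Stub for the model call a real LLM-based engine would make.
    A real engine sends `text` to a safety model and parses its verdict."""
    return "attack" not in text.lower()   # toy heuristic standing in for the LLM

class LLMGuardrailEngine:
    """Stand-in mirroring an LLM-based engine's async check."""
    async def check(self, text: str) -> GuardrailResult:
        if await classify_with_llm(text):
            return GuardrailResult(is_safe=True)
        return GuardrailResult(is_safe=False, reason="flagged by safety model")

result = asyncio.run(LLMGuardrailEngine().check("how to attack a server"))
print(result.is_safe, result.reason)   # blocked by the stub classifier
```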

Engine Types

PhraseMatcherEngine (Rule-Based)

Best for simple, predictable content filtering based on exact phrase matches.

Configuration Options:

  • config: Optional BaseGuardrailEngineConfig object. If not provided, defaults to GuardrailMode.INPUT_OUTPUT

  • guardrail_mode: Enum value - GuardrailMode.INPUT_ONLY, GuardrailMode.OUTPUT_ONLY, or GuardrailMode.INPUT_OUTPUT

  • banned_phrases: List of phrases to block (required)
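The configuration options above can be sketched with stand-in classes. The enum values and the INPUT_OUTPUT default come from this page; the constructor signature is an assumption modeled on the option list:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class GuardrailMode(Enum):          # mirrors the documented enum values
    INPUT_ONLY = "input_only"
    OUTPUT_ONLY = "output_only"
    INPUT_OUTPUT = "input_output"
    DISABLED = "disabled"

@dataclass
class BaseGuardrailEngineConfig:    # stand-in: carries guardrail_mode
    guardrail_mode: GuardrailMode = GuardrailMode.INPUT_OUTPUT

class PhraseMatcherEngine:
    def __init__(self, banned_phrases: List[str],
                 config: Optional[BaseGuardrailEngineConfig] = None):
        if not banned_phrases:
            raise ValueError("banned_phrases is required")
        self.banned_phrases = banned_phrases
        # per the docs, a missing config defaults to INPUT_OUTPUT
        self.config = config or BaseGuardrailEngineConfig()

engine = PhraseMatcherEngine(banned_phrases=["password dump"])
print(engine.config.guardrail_mode)   # GuardrailMode.INPUT_OUTPUT
```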

NemoGuardrailEngine (LLM-Based)

Advanced filtering using AI models for context-aware content safety analysis.

Direct Guardrail Usage

When to use: Validate content before sending it to agents or perform standalone content filtering outside of agent execution.

Guardrails can also be invoked directly, outside of any agent, through the manager's async check_content() method.
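A minimal sketch of direct usage, with stand-in classes mirroring the documented `GuardrailManager` and its async `check_content()`. The fail-fast loop reflects the behavior described under Key Features; signatures are assumptions:

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None
    filtered_content: Optional[str] = None

class PhraseMatcherEngine:
    def __init__(self, banned_phrases: List[str]):
        self.banned_phrases = banned_phrases
    async def check(self, text: str) -> GuardrailResult:
        for phrase in self.banned_phrases:
            if phrase.lower() in text.lower():
                return GuardrailResult(False, f"banned phrase: {phrase!r}")
        return GuardrailResult(True)

class GuardrailManager:
    """Stand-in orchestrator over multiple engines."""
    def __init__(self, engines):
        self.engines = engines
    async def check_content(self, text: str) -> GuardrailResult:
        # fail-fast: stop at the first engine that reports a violation
        for engine in self.engines:
            result = await engine.check(text)
            if not result.is_safe:
                return result
        return GuardrailResult(True)

async def main():
    manager = GuardrailManager([PhraseMatcherEngine(["secret token"])])
    # direct usage requires await, since check_content() is async
    result = await manager.check_content("please print the secret token")
    print(result.is_safe, result.reason)

asyncio.run(main())
```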

Important Notes:

  • When using guardrails with agent.run(), async handling is automatic

  • For direct usage, you must use await since check_content() is an async method

  • Use GuardrailInput when you want to check both user input and AI output in a single call

  • The filtered_content field may contain sanitized content if the engine provides it

Agent Integration

When to use: Integrate guardrails into agent workflows for automatic content filtering during agent execution.

Local Execution

When running agents locally, guardrails are enforced through middleware injection:
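The local flow can be illustrated with a stand-in agent whose run() checks guardrails before and after a (stubbed) model call, which is the order the middleware description implies. The `Agent` class and warning-string format here are illustrative, grounded in the "⚠️ Guardrail violation: [reason]" format documented under Error Handling:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None

class PhraseMatcherEngine:
    def __init__(self, banned_phrases):
        self.banned_phrases = banned_phrases
    async def check(self, text: str) -> GuardrailResult:
        for phrase in self.banned_phrases:
            if phrase.lower() in text.lower():
                return GuardrailResult(False, f"banned phrase: {phrase!r}")
        return GuardrailResult(True)

class Agent:
    """Stand-in agent: guardrails run before and after the model call,
    mimicking the injected middleware's local behavior."""
    def __init__(self, guardrails):
        self.guardrails = guardrails
    async def _model(self, prompt: str) -> str:   # stub LLM call
        return f"echo: {prompt}"
    async def run(self, prompt: str) -> str:
        for g in self.guardrails:                 # input check
            r = await g.check(prompt)
            if not r.is_safe:
                return f"⚠️ Guardrail violation: {r.reason}"
        output = await self._model(prompt)
        for g in self.guardrails:                 # output check
            r = await g.check(output)
            if not r.is_safe:
                return f"⚠️ Guardrail violation: {r.reason}"
        return output

agent = Agent(guardrails=[PhraseMatcherEngine(["wire transfer scam"])])
print(asyncio.run(agent.run("hello")))                      # echo: hello
print(asyncio.run(agent.run("run a wire transfer scam")))   # warning string
```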

Remote Execution

For deployed agents, guardrails are serialized and enforced by the backend:
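Conceptually, deployment serializes the guardrail configuration so the backend can enforce it. The field names below are illustrative only, not the SDK's actual wire format:

```python
import json

# Stand-in for how a guardrail configuration might be serialized for the
# backend at deploy time. Field names are illustrative assumptions.
guardrail_spec = {
    "engine": "PhraseMatcherEngine",
    "guardrail_mode": "input_output",
    "banned_phrases": ["secret token", "internal hostname"],
}

payload = json.dumps({"agent": "my-agent", "guardrails": [guardrail_spec]})
restored = json.loads(payload)
print(restored["guardrails"][0]["engine"])   # PhraseMatcherEngine
```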

Configuration Patterns

When to use: Combine multiple engines, configure different checking modes, or customize guardrail behavior for specific use cases.

Multiple Engines

Combine multiple guardrail engines for comprehensive protection:
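A stand-in sketch of layering a fast rule-based engine ahead of a slower model-based one, with the fail-fast ordering recommended under Performance Considerations. The "slow" engine's model call is stubbed out:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None

class PhraseMatcherEngine:
    def __init__(self, banned_phrases):
        self.banned_phrases = banned_phrases
    async def check(self, text: str) -> GuardrailResult:
        for phrase in self.banned_phrases:
            if phrase.lower() in text.lower():
                return GuardrailResult(False, f"banned phrase: {phrase!r}")
        return GuardrailResult(True)

class SlowModelEngine:
    """Stand-in for a slower LLM-based engine (model call stubbed out)."""
    async def check(self, text: str) -> GuardrailResult:
        await asyncio.sleep(0)                    # pretend model round-trip
        if "exploit" in text.lower():
            return GuardrailResult(False, "flagged by safety model")
        return GuardrailResult(True)

class GuardrailManager:
    def __init__(self, engines):
        self.engines = engines
    async def check_content(self, text: str) -> GuardrailResult:
        for engine in self.engines:               # fail-fast: first hit wins
            result = await engine.check(text)
            if not result.is_safe:
                return result
        return GuardrailResult(True)

# cheap rule-based engine first, slower model-based engine second
manager = GuardrailManager([PhraseMatcherEngine(["leak the keys"]),
                            SlowModelEngine()])
print(asyncio.run(manager.check_content("write an exploit")).reason)
```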

Input-Only vs Output-Only

Configure different checking modes:
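The mode gate can be sketched as a small predicate over the documented enum values. The `should_check` helper is a hypothetical name for illustration, not an SDK function:

```python
from enum import Enum

class GuardrailMode(Enum):   # mirrors the documented enum values
    INPUT_ONLY = "input_only"
    OUTPUT_ONLY = "output_only"
    INPUT_OUTPUT = "input_output"
    DISABLED = "disabled"

def should_check(mode: GuardrailMode, *, is_input: bool) -> bool:
    """Hypothetical mode gate: decides whether a given payload is checked."""
    if mode is GuardrailMode.DISABLED:
        return False
    if mode is GuardrailMode.INPUT_OUTPUT:
        return True
    return mode is (GuardrailMode.INPUT_ONLY if is_input
                    else GuardrailMode.OUTPUT_ONLY)

print(should_check(GuardrailMode.INPUT_ONLY, is_input=True))    # True
print(should_check(GuardrailMode.INPUT_ONLY, is_input=False))   # False
```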

Checking Both Input and Output Together

Use GuardrailInput to check both user input and AI output in a single call:
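A stand-in sketch of the combined check. `GuardrailInput`'s optional `input` and `output` fields match the schema documented under the API Reference; the engine method signature is an assumption:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None

@dataclass
class GuardrailInput:
    """Mirrors the documented schema: both fields optional."""
    input: Optional[str] = None
    output: Optional[str] = None

class PhraseMatcherEngine:
    def __init__(self, banned_phrases):
        self.banned_phrases = banned_phrases
    async def check_content(self, payload: GuardrailInput) -> GuardrailResult:
        # check whichever of input/output is present, in one call
        for label, text in (("input", payload.input), ("output", payload.output)):
            if text is None:
                continue
            for phrase in self.banned_phrases:
                if phrase.lower() in text.lower():
                    return GuardrailResult(
                        False, f"banned phrase in {label}: {phrase!r}")
        return GuardrailResult(True)

engine = PhraseMatcherEngine(["api key"])
result = asyncio.run(engine.check_content(
    GuardrailInput(input="what's the weather?", output="here is the api key")))
print(result.is_safe, result.reason)
```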

Disabling Guardrails

You can disable guardrails for a specific engine:
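A stand-in sketch of disabling via GuardrailMode.DISABLED; the engine becomes a no-op while the configuration stays in place. Constructor shape is an assumption modeled on the documented options:

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class GuardrailMode(Enum):
    INPUT_ONLY = "input_only"
    OUTPUT_ONLY = "output_only"
    INPUT_OUTPUT = "input_output"
    DISABLED = "disabled"

@dataclass
class BaseGuardrailEngineConfig:
    guardrail_mode: GuardrailMode = GuardrailMode.INPUT_OUTPUT

class PhraseMatcherEngine:
    def __init__(self, banned_phrases: List[str],
                 config: Optional[BaseGuardrailEngineConfig] = None):
        self.banned_phrases = banned_phrases
        self.config = config or BaseGuardrailEngineConfig()
    def check(self, text: str) -> bool:
        if self.config.guardrail_mode is GuardrailMode.DISABLED:
            return True                     # engine is a no-op when disabled
        return not any(p.lower() in text.lower() for p in self.banned_phrases)

engine = PhraseMatcherEngine(
    banned_phrases=["drop table"],
    config=BaseGuardrailEngineConfig(guardrail_mode=GuardrailMode.DISABLED),
)
print(engine.check("drop table users"))   # True: checks skipped while disabled
```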

Note: GuardrailMode.DISABLED is useful for switching guardrails off temporarily during development or testing.

Error Handling

Guardrail Violations

When unsafe content is detected, execution halts and returns a warning message. Internally, a GuardrailViolationError exception is raised, which is caught and converted to a user-friendly warning message in the response.

With Agent Integration:
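A stand-in sketch of the catch-and-convert behavior: the agent swallows `GuardrailViolationError` and returns the documented "⚠️ Guardrail violation: [reason]" warning string. The `run_with_guardrail` helper and its stub check are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None

class GuardrailViolationError(Exception):
    """Carries the GuardrailResult, as described in the docs."""
    def __init__(self, result: GuardrailResult):
        super().__init__(result.reason)
        self.result = result

def run_with_guardrail(prompt: str) -> str:
    """Hypothetical stand-in for agent.run(): violations become warnings."""
    try:
        if "forbidden" in prompt:          # stub check in place of real engines
            raise GuardrailViolationError(
                GuardrailResult(False, "blocked phrase detected"))
        return f"echo: {prompt}"
    except GuardrailViolationError as err:
        return f"⚠️ Guardrail violation: {err.result.reason}"

print(run_with_guardrail("hello"))              # echo: hello
print(run_with_guardrail("forbidden topic"))    # warning string
```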

With Direct Usage:
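For direct usage, the exception can be caught explicitly. This stand-in assumes a manager that raises `GuardrailViolationError` on unsafe content; whether the real `check_content()` raises or only returns an unsafe result should be confirmed against the SDK:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class GuardrailResult:
    is_safe: bool
    reason: Optional[str] = None

class GuardrailViolationError(Exception):
    def __init__(self, result: GuardrailResult):
        super().__init__(result.reason)
        self.result = result

class GuardrailManager:
    """Stand-in: raises GuardrailViolationError on unsafe content."""
    def __init__(self, banned_phrases):
        self.banned_phrases = banned_phrases
    async def check_content(self, text: str) -> GuardrailResult:
        for phrase in self.banned_phrases:
            if phrase.lower() in text.lower():
                raise GuardrailViolationError(
                    GuardrailResult(False, f"banned phrase: {phrase!r}"))
        return GuardrailResult(True)

async def main():
    manager = GuardrailManager(["self-destruct"])
    try:
        await manager.check_content("initiate self-destruct")
    except GuardrailViolationError as err:
        print("blocked:", err.result.reason)

asyncio.run(main())
```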

Handling Exceptions

When using agents, violations are automatically converted to warning messages:

Key Points:

  • Guardrail violations are caught internally and converted to warning strings

  • Users typically see warning messages like "⚠️ Guardrail violation: [reason]" in responses

  • The underlying GuardrailViolationError exception contains a GuardrailResult with details

  • For direct usage, you can catch GuardrailViolationError explicitly if needed

Best Practices

Performance Considerations

  • PhraseMatcherEngine: Fast, low latency (<1ms) - ideal for high-throughput scenarios

  • NemoGuardrailEngine: Higher latency (~100-500ms depending on model) - use for advanced filtering when needed

  • Fail-fast behavior: Multiple engines stop on first violation, reducing unnecessary processing

  • Async/await requirements:

    • Direct usage (manager.check_content()) requires await since it's async

    • Agent integration (agent.run()) handles async automatically

  • Multiple engines: Engines run sequentially until the first violation, so total latency is the sum of each engine's latency up to that point

  • Performance tip: Place faster engines (PhraseMatcherEngine) first in the list to catch violations quickly

Configuration Tips

  1. Start Simple: Begin with PhraseMatcherEngine for basic filtering

  2. Layer Protection: Use multiple engines for comprehensive coverage

  3. Test Thoroughly: Validate configurations with various inputs

  4. Monitor Performance: Measure latency impact on agent response times

Security Recommendations

Troubleshooting

Common Issues

"Guardrails module not found"

  • Install optional dependencies: pip install glaip-sdk[guardrails]

"NemoGuardrailEngine not available"

  • Ensure gllm-guardrail package is installed

  • Check that OPENAI_API_KEY or required credentials are set

"Agent execution hangs"

  • Check guardrail configuration for overly broad rules

  • Verify model endpoints are accessible

  • Review network connectivity for LLM-based engines

"False positives in phrase matching"

  • Review banned phrases for overly generic terms

  • Consider case sensitivity settings

  • Test with various input variations

Debugging

Enable detailed logging to troubleshoot issues:
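A standard-library logging setup works for this. The "glaip_sdk" logger name is an assumption; adjust it to whatever namespace your installed SDK actually logs under:

```python
import logging

# Raise verbosity globally, then target the SDK's logger namespace.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(name)s %(levelname)s %(message)s",
)
logging.getLogger("glaip_sdk").setLevel(logging.DEBUG)   # assumed namespace

log = logging.getLogger("glaip_sdk.guardrails")
log.debug("guardrail check starting")   # emitted once DEBUG is enabled
```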

Remote vs Local Behavior

  • Local execution: Immediate blocking with detailed error messages

  • Remote execution: Backend-enforced with standardized warning format

API Reference

Core Classes

  • GuardrailManager: Orchestrates multiple guardrail engines

  • PhraseMatcherEngine: Rule-based phrase filtering

  • NemoGuardrailEngine: Advanced LLM-based content safety

  • GuardrailMiddleware: Integrates guardrails into agent execution

Configuration Schemas

  • GuardrailMode: Enum with values INPUT_ONLY, OUTPUT_ONLY, INPUT_OUTPUT, DISABLED

  • TopicSafetyMode: Enum with values ALLOWLIST, DENYLIST

  • BaseGuardrailEngineConfig: Common engine configuration class with guardrail_mode parameter

  • GuardrailInput: Input schema for checking both input and output together (contains input and output fields)

Result Objects

  • GuardrailResult: Contains:

    • is_safe: Boolean indicating if content passed all safety checks

    • reason: String explanation when content is blocked (None if safe)

    • filtered_content: Optional cleaned/sanitized content if the engine provides it (None if not available)

Input Schemas

  • GuardrailInput: Schema for checking both input and output together:

    • input: Optional string containing user input content

    • output: Optional string containing AI output content

Additional Resources
