Guardrails prebuilt policies

Coralogix Guardrails includes prebuilt policies that you can apply immediately to protect your LLM applications. Each policy performs a real-time, deterministic check on model inputs or outputs and throws an exception when a violation is detected.

Available prebuilt policies

Policy	What it detects	When to use
Prompt Injection	Attempts to manipulate model behavior by injecting malicious instructions into user prompts	All production AI applications that accept user input
PII Detection	Personally identifiable information such as email addresses, phone numbers, credit card numbers, and Social Security Numbers	Data privacy compliance (GDPR, HIPAA) and sensitive data leakage prevention
Toxicity	Toxic, harmful, or offensive content including hate speech, threats, harassment, and abusive language	Customer-facing AI and brand safety

For domain-specific protection beyond the prebuilt options, see Custom Policies.

Understanding thresholds

All prebuilt policies return a score between 0 and 1.
Score range Meaning Action taken?
Closer to 1 Violation detected with high confidence Yes — exception thrown
Closer to 0 Low severity — probably not a violation No

Guardrail policies use configurable thresholds in the code. When the score meets or exceeds the threshold you set, the policy throws an exception and blocks the interaction.

Threshold: Defines the value from which a guardrail action is triggered. When the threshold is met or exceeded, the guardrail action is executed, returned through the API, and the system marks the event as an issue.
Threshold Use case
0.4–0.6 Low-risk applications
0.7 (default) General applications
0.8–0.9 High-security applications

Prompt Injection

The prompt injection detection policy identifies and blocks attempts to manipulate model behavior through malicious instructions injected into user prompts — such as attempts to ignore system instructions, leak system prompts, or perform unintended actions.

Configuration

Custom threshold

Adjust detection sensitivity (0.0 to 1.0, default 0.7):

# Lower threshold — more sensitive
await guardrails.guard_prompt(
    prompt=user_input,
    guardrails=[PromptInjection(threshold=0.5)],
)

# Higher threshold — less sensitive
await guardrails.guard_prompt(
    prompt=user_input,
    guardrails=[PromptInjection(threshold=0.9)],
)

For full setup including SDK installation and environment variables, see Prompt Injection Detection.

PII Detection

The PII detection policy identifies personally identifiable information in prompts or responses, preventing sensitive personal data from being processed or exposed by your LLM applications.

Available PII categories

Category	Enum value	Description
Email Address	`PIICategory.EMAIL_ADDRESS`	Email addresses
Phone Number	`PIICategory.PHONE_NUMBER`	Phone numbers
Credit Card	`PIICategory.CREDIT_CARD`	Credit/debit card numbers
US SSN	`PIICategory.US_SSN`	US Social Security Numbers

Configuration

Specific categories

Detect only specific PII types:

await guardrails.guard_prompt(
    prompt=user_input,
    guardrails=[PII(categories=[PIICategory.EMAIL_ADDRESS, PIICategory.CREDIT_CARD])],
)

Custom threshold

Adjust detection sensitivity (0.0 to 1.0, default 0.7):

await guardrails.guard_prompt(
    prompt=user_input,
    guardrails=[PII(threshold=0.8)],
)

For full setup including SDK installation and environment variables, see PII Detection.

Toxicity

The toxicity detection policy identifies harmful, offensive, or inappropriate content — including hate speech, threats, harassment, and abusive language — in prompts or responses before they reach users.

Configuration

Custom threshold

Adjust detection sensitivity (0.0 to 1.0, default 0.7):

await guardrails.guard_prompt(
    prompt=user_input,
    guardrails=[Toxicity(threshold=0.8)],
)

For full setup including SDK installation and environment variables, see Toxicity Detection.

Need help? Contact Support.

What's new? Find out here.

LLM? Read llms.txt.

Previous Getting Started

Next Prompt Injection Detection

Score range	Meaning	Action taken?
Closer to 1	Violation detected with high confidence	Yes — exception thrown
Closer to 0	Low severity — probably not a violation	No

Threshold	Use case
0.4–0.6	Low-risk applications
0.7 (default)	General applications
0.8–0.9	High-security applications