Eval Catalog
Overview
The AI Eval Catalog is a structured repository that stores and organizes evaluations (evals): metric-based checks that assess the security and quality of various aspects of LLM-based applications. Use the catalog to select, configure, and apply predefined evals for each application, enabling you to analyze AI models, gain valuable insights, and resolve potential issues to improve reliability and performance.
How does it work?
Here’s an overview of how the evals work, outlining the steps and processes involved in using them effectively.
- Select relevant evals from the Coralogix catalog to monitor specific issues in your AI applications.
- Apply the chosen evals to your apps.
- The system runs the applied evals on the spans passing through those apps.
- Each evaluation assigns a score, which determines whether an issue is flagged (see the sketch after this list):
- A high score indicates that a factory-set threshold has been exceeded, and the finding is marked as an issue.
- A low score (below the threshold) is not flagged as an issue.
- The high scores are displayed as issues in the relevant dashboards on the AI Center Overview and Application Overview pages, enabling comprehensive AI model analysis. In addition, both high and low scores are displayed as relevant labels for each LLM call on the LLM Calls page.
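The threshold logic above can be summarized in a short sketch. It is purely illustrative: thresholds are factory-set per eval and are not user-configurable, and the 0.8 value and the classify helper below are assumptions rather than product behavior.

```python
# Illustrative only: thresholds are factory-set per eval; 0.8 is a placeholder.
FACTORY_SET_THRESHOLD = 0.8

def classify(score: float) -> str:
    """Map an eval score to how it surfaces in the UI."""
    if score > FACTORY_SET_THRESHOLD:
        # High score: threshold exceeded, surfaced as an issue on the
        # AI Center Overview and Application Overview dashboards.
        return "issue"
    # Low score: not flagged as an issue, but still shown as a label
    # for the LLM call on the LLM Calls page.
    return "label only"

print(classify(0.93))  # -> issue
print(classify(0.12))  # -> label only
```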
The inner workings
Let’s look under the hood of the evals to understand how they process data.
The evaluation system enriches spans within LLM conversations, enabling automated quality and content assessments for observability purposes. This process ensures transparency and safety, and supports billing operations.
- Tag-based filtering. Only spans tagged with gen_ai.system are considered for evaluation. All other spans are skipped.
- Evaluation tagging. Each enrichment operation is referred to as an evaluation and is logged with the following tag format: gen_ai.<target>.evaluations.<type>.score (see the sketch after this list).
- Billing. Customers are billed based on the number of successful evaluations and the volume of LLM interactions (tokens). See Pricing Model.
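To make the filtering and tagging above concrete, here is a minimal sketch of that flow. The span structure and the example target and eval type are assumptions made for illustration; only the gen_ai.system filter and the gen_ai.<target>.evaluations.<type>.score pattern come from the list above.

```python
# Minimal sketch of the enrichment flow described above. The span shape and the
# "response"/"toxicity" example values are assumptions; only the gen_ai.system
# filter and the gen_ai.<target>.evaluations.<type>.score pattern are documented.

def enrich_span(span: dict, target: str, eval_type: str, score: float) -> dict:
    attributes = span.setdefault("attributes", {})

    # Tag-based filtering: only spans tagged with gen_ai.system are evaluated.
    if "gen_ai.system" not in attributes:
        return span  # skipped: no evaluation performed, nothing billed

    # Evaluation tagging: each enrichment is recorded under the documented pattern.
    attributes[f"gen_ai.{target}.evaluations.{eval_type}.score"] = score
    return span

llm_span = {"attributes": {"gen_ai.system": "openai"}}  # hypothetical LLM span
enrich_span(llm_span, target="response", eval_type="toxicity", score=0.91)
# llm_span["attributes"] now also contains "gen_ai.response.evaluations.toxicity.score"
```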
Eval types
Coralogix offers a comprehensive selection of predefined evaluations, ready to be applied to any of your AI applications.
Security
Eval | Description |
---|---|
Prompt injection | Detects any user attempt at prompt injection or jailbreaking. |
SQL (read-only access) | Detects any attempt to use SQL operations that require more than read-only access. |
SQL (load limit) | Detects SQL statements that are likely to cause significant system load and affect performance. |
SQL (restricted tables) | Detects the generation of SQL statements that access specific tables considered sensitive. |
SQL (allowed tables) | Detects SQL operations on tables that are not configured in the eval. |
PII | Detects the existence of PII (Personally Identifiable Information) in the user message or the LLM response based on the configured sensitive data types. |
Hallucinations
Eval | Description |
---|---|
Context adherence | Measures whether the model's response strictly follows the provided context without introducing new information. |
Context relevance | Assesses how relevant and similar the provided context is to the user query, ensuring it contains the necessary information for an accurate response. |
Completeness | Evaluates how well the model’s response includes all relevant information from the context. |
Correctness | Determines if a model's response is factually accurate. |
SQL hallucination | Detects hallucinations in LLM-generated SQL queries. |
Tool parameter correctness | Ensures the correct tools are selected and invoked with accurately derived parameters based on chat history. |
Task adherence | Detects whether the LLM's answer aligns with the given system prompt. |
Note
When running a hallucination-category eval, the context is taken from the following sources:
- Completeness – system prompt and tool call.
- Task adherence – system prompt.
- Context adherence – tool call.
- Context relevance – tool call.
Toxicity
Eval | Description |
---|---|
Sexism | Detects whether an LLM response or user prompt contains sexist content. |
Toxicity | Detects any user message or LLM response containing toxicity. |
Topics
Eval | Description |
---|---|
Restricted topics | Detects any user or LLM attempt to initiate a discussion on the topics mentioned in the eval. |
Allowed topics | Ensures the conversation adheres to specific and well-defined topics. |
Competition discussion | Detects whether any prompt or response includes references to competitors mentioned in the eval. |
User experience
Eval | Description |
---|---|
Language mismatch | Detects when an LLM is answering a user question in a different language. |
Compliance
Eval | Description |
---|---|
Restricted phrases | Ensures the LLM does not use specified prohibited terms and phrases by blocking or replacing them based on regex patterns (see the illustrative sketch below the table). |
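For a sense of what regex-based phrase restriction looks like in practice, here is a purely hypothetical sketch. The patterns, the replacement token, and the redact helper are illustrative inventions and do not describe how the Restricted phrases eval is implemented.

```python
import re

# Hypothetical patterns; real restricted phrases are configured in the eval itself.
RESTRICTED_PATTERNS = [
    re.compile(r"\bproject\s+falcon\b", re.IGNORECASE),   # made-up internal code name
    re.compile(r"\bacme\s+confidential\b", re.IGNORECASE),
]

def redact(text: str, replacement: str = "[REDACTED]") -> str:
    """Replace any restricted phrase in an LLM response with a placeholder."""
    for pattern in RESTRICTED_PATTERNS:
        text = pattern.sub(replacement, text)
    return text

print(redact("Project Falcon specs are in the ACME Confidential wiki."))
# -> "[REDACTED] specs are in the [REDACTED] wiki."
```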
Customization
Each eval offers various customization options, such as selecting where to apply it (user prompt, LLM response, or both) and specifying categories to restrict.
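As a rough mental model of what such a configuration captures (purely illustrative; these options are set through the Coralogix UI, and the dictionary shape below is an assumption, not a real config file or API payload):

```python
# Hypothetical representation of two customized evals; all field names are illustrative.
toxicity_eval = {
    "eval": "Toxicity",
    "apply_to": ["user_prompt", "llm_response"],  # user prompt, LLM response, or both
    "categories": ["self-harm", "violence"],      # categories to detect
}

restricted_topics_eval = {
    "eval": "Restricted topics",
    "apply_to": ["user_prompt"],
    "values": ["politics", "medical advice"],     # topics to restrict (illustrative)
}
```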
Managing your evals
This section explains how to access the Eval Catalog, add evals to your applications, and remove them from the apps as needed.
Accessing the Eval Catalog
- In the Coralogix UI, navigate to AI Center > Eval Catalog.
- Browse the Eval Catalog tiles to find the eval you need. If it's not visible on the main page, use the Search field to locate it quickly, or narrow the eval list by selecting a specific category or by using the dropdown in the top right to show all, used, or unused evals.
Adding an eval to your app
By adding evals to your apps, you enable the evaluation of spans passing through them, so that issues can be identified and scores (high or low) assigned. As described above, high scores appear as issues on the AI Center Overview and Application Overview dashboards, while both high and low scores appear as labels for each LLM call on the LLM Calls page.
- In the Eval Catalog, locate an eval that you intend to assign to a specific app.
- Click the Add to app button.
- If no additional configuration is needed, the eval is added to the app immediately. Otherwise, select the following relevant options:
- Specify whether to run the eval on the user prompt, LLM response, or both.
- Select the evaluation categories (e.g., self-harm, violence for the Toxicity eval), values (e.g., topics for the Restricted Topics eval), or any other available attributes for detection based on the chosen eval type.
- Click Next to continue.
- In the Add eval to application dialog box, select the app(s) you want to apply the eval to. Applications that are already using this eval will be grayed out and unavailable for selection.
- Click Done to finish.
An indication of the eval’s usage, along with the number of apps it is applied to, is displayed next to the eval type name in its catalog card.
Managing evals from your Application Catalog
Add, enable/disable, remove, or edit evals directly from your Application Catalog.
Assigning an eval from the Application Catalog
You can add an eval to your app from the Application Catalog page.
- In the Coralogix UI, navigate to AI Center > Application Catalog.
- Choose the application to which you want to assign an eval, go to Eval Configuration, and click Add Eval.
- In the Eval Catalog, locate the desired eval:
- Use the Search field to look for your eval.
- Narrow the eval list by selecting a specific category or using the dropdown in the top right to show all, used, or unused evals.
- Click Add to App to assign the selected eval to your app.
Enabling/disabling, removing, or editing evals
- In the Coralogix UI, navigate to AI Center > Application Catalog.
- Select an application to manage its evals.
- Go to Eval Configuration and select an eval.
- If needed, take any of the following actions on the selected eval:
- Toggle the State switch to enable or disable the eval. A disabled eval becomes inactive, stops running on the spans, and will not incur charges. However, its configuration is retained, allowing you to easily reactivate it when needed.
- Click the Delete icon to remove the eval from the app.
- Click the Edit button to modify the eval settings through the standard configuration dialog. Click Done to finish.