Datasets
What is a dataset?
Datasets are the fundamental building blocks of data organization within a dataspace. They allow you to logically segment your logs, spans, and other entity types, thereby improving query performance, access control, and storage clarity. For instance, datasets can be created to segment along team lines, like `frontend`, `backend`, or `security`.
Note
Only system datasets are currently supported. User-defined datasets are in development for a future release.
A dataset is a scoped collection of related data within a dataspace. Each dataset contains a specific stream of observability data (e.g., logs, traces, alerts) that inherits configuration from its parent dataspace.
Datasets are created:

- Automatically, via routing logic
- Manually, through the UI or `writeTo` queries (see the sketch below)
- Dynamically, based on values like `$d.region`, `$l.applicationname`, etc.
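For instance, a single `writeTo` query is enough to create and populate a dataset by hand. A minimal sketch, reusing the `frontend` team example from above and only the operators documented on this page:

```
// Route one team's logs into a dedicated dataset
source logs
| filter $l.applicationname == 'frontend'
| writeTo default/frontend
```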
Each dataset lives in a dataspace and is queried using:
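```
// General form; see Query syntax below for concrete examples
source <dataspace>/<dataset>
```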
Note
Datasets currently work only with archived data.
Why use datasets?
Datasets are designed to solve long-standing limitations in the traditional `logs` and `spans` model. They provide a way to logically isolate data by structure, purpose, and access level, improving control, schema hygiene, and performance.

Limitations of traditional `logs`/`spans`:
- Hardcoded pipelines: All data flows into fixed logs/spans buckets with no user control over routing.
- Schema pollution: Different sources (e.g., alerts, RUM, enrichments) get dumped into the same dataset.
- Fixed `$l` labels: System-defined labels like `application` or `subsystem` create naming conflicts and ambiguous semantics.
- Immutable archive paths: Changing storage configuration renders old data inaccessible unless manually copied.
- Cross-contaminated data: Mixed data types (e.g., metrics vs. alerts vs. enrichments) degrade schema clarity and performance.
Benefits of datasets:
- No pollution: Each dataset tracks its own schema, reducing collisions and ambiguity.
- Dynamic labels (`$l`): Labeling is now fully user-defined, e.g., `$l.env`, `$l.cluster`, `$l.region` (see the sketch after this list).
- Flexible storage: Datasets can route to different buckets or prefixes, and their location can change without needing to copy data.
- Improved performance: Queries run faster because datasets only contain semantically related data.
- Reusable outputs: You can write query results into new datasets, supporting summary tables, derived views, or archival logic.
- Future-proof: Logs and spans will eventually migrate to dynamic datasets with full support for `$l`-based routing and labeling.
In short: Datasets offer structure, separation, and flexibility.
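User-defined labels are easiest to see in a query. A minimal sketch, assuming the `default/frontend` dataset from the earlier example and that its producers attach an `$l.env` label:

```
// Filter on a fully user-defined label
source default/frontend
| filter $l.env == 'production'
```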
Key capabilities of datasets
| Capability | Description |
|---|---|
| Dynamic creation | Datasets are created on the fly based on routing rules or labels like `$l.applicationname`. No manual setup required. |
| Scoped performance | Segmented datasets reduce schema collisions and improve query speed by narrowing the search space. |
| Granular control | Apply retention, access, routing, and enrichment policies at the dataset level. |
| Reusability | You can write query results into datasets and retrieve them later for dashboards, joins, or long-term analytics. |
| Clarity and structure | Datasets make data easier to organize and reason about, by team, service, environment, or data type. |
Example: writing to and reading from a dataset
Note
Duplicated data created by queries will count towards your quota.
```
// Write query results to a dataset
source logs
| filter status_code >= 500
| writeTo default/high_errors
```
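Reading the results back is an ordinary `source` query against the new dataset. A minimal sketch; the dataset name and field carry over from the write above:

```
// Read the stored results back, narrowing further if needed
source default/high_errors
| filter status_code >= 503
```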
This workflow is especially helpful for recurring reports, dashboards, and trend analyses.
Query syntax

If you're in the `default` dataspace, you can omit the dataspace prefix: `source default/logs` and `source logs` are equivalent. When used alone, `source logs` is implicitly querying `default/logs`.
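The same rule as a runnable sketch, using only the operators shown above:

```
// These two queries read the same dataset from the default dataspace
source default/logs
source logs
```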
System datasets
Coralogix includes several system datasets in the `system` dataspace. These are read-only and auto-generated.
| Dataset | Description |
|---|---|
| `system/aaa.audit_events` | Stores audit logs for compliance and access monitoring. |
| `system/alerts.history` | Records alert evaluation and trigger metadata. |
| `system/cases` | Models each case from creation and acknowledgement through resolution. |
| `system/engine.queries` | Historical record of user queries for introspection and optimization. |
| `system/engine.schema_fields` | Tracks field-level schema evolution over time. |
| `system/labs.limit_violations` | Records each time a configured limit is exceeded. |
| `system/notification.deliveries` | Captures the lifecycle of outbound alert notifications. |
| `system/notification.requests` | Captures metadata for each incoming notification request. |
These datasets power features like schema visualization, alert performance tracking, and auditing. See System datasets for more information.
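Once enabled, a system dataset is queried like any other. A minimal sketch using a name from the table above:

```
// Inspect the historical record of user queries
source system/engine.queries
```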
Dataset schemas
Each dataset has an associated schema, influenced by its pillar (logs, spans, etc.) and entity type (e.g., `alerts`, `browserLogs`, `cpuProfiles`).
| Pillar | Entity type | Example schema |
|---|---|---|
| logs | alerts | `{ alert_name, severity, status, triggered_at }` |
| logs | browserLogs | `{ user_agent, page_url, timestamp }` |
| logs | text | `{ text: "..." }` |
| spans | spans | OpenTelemetry-formatted span objects |
| metrics | metrics | `{ __name__, value, labels... }` |
| binary | sessionRecordings | Metadata + link to binary |
| binary | files | File metadata (e.g., name, size, uploaded_by) |
Schema docs for common datasets:
Managing datasets
With Dataset Management, you can manage your datasets from the UI by navigating to:
Data Flow > Dataset Management
Here, you can:
- View all active datasets
- Enable/disable `system` datasets
- Apply configuration rules
- View schema definitions
- Inspect sample documents
Enabling and disabling datasets
Datasets, especially system datasets, must be manually enabled. Once enabled:
- All users can query them
- They count toward your daily quota
- Previously generated data remains accessible, even if the dataset is later disabled

Disabling a dataset stops its ingestion, not its storage.