Datasets

What is a dataset?

Datasets are the fundamental building blocks of data organization within a dataspace. They let you logically segment your logs, spans, and other entity types, thereby improving query performance, access control, and storage clarity. For instance, datasets can be created to segment data along team lines, such as frontend, backend, or security.

Note

Only system datasets are currently supported. User-defined datasets are in development for a future release.

A dataset is a scoped collection of related data within a dataspace. Each dataset contains a specific stream of observability data (e.g., logs, traces, alerts) that inherits configuration from its parent dataspace.

Datasets are created:

  • Automatically — via routing logic
  • Manually — through the UI or writeTo queries
  • Dynamically — based on values like $d.region, $l.applicationname, etc.
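
For example, manual creation through a writeTo query might look like the following sketch (the dataset name frontend_errors is hypothetical; the syntax mirrors the writeTo example later on this page):

// Write matching results into a user-defined dataset
// (frontend_errors is a hypothetical dataset name)
source logs
| filter status_code >= 500
| writeTo default/frontend_errors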

Each dataset lives in a dataspace and is queried using:

source <dataspace>/<dataset> | ...

Note

Datasets currently work only with archived data.


Why use datasets?

Datasets are designed to solve long-standing limitations in the traditional logs and spans model. They provide a way to logically isolate data by structure, purpose, and access level—improving control, schema hygiene, and performance.

Limitations of traditional logs/spans:

  • Hardcoded pipelines: All data flows into fixed logs/spans buckets with no user control over routing.
  • Schema pollution: Different sources (e.g., alerts, RUM, enrichments) get dumped into the same dataset.
  • Fixed $l labels: System-defined labels like application or subsystem create naming conflicts and ambiguous semantics.
  • Immutable archive paths: Changing storage configuration renders old data inaccessible unless manually copied.
  • Cross-contaminated data: Mixed data types (e.g., metrics vs. alerts vs. enrichments) degrade schema clarity and performance.

Benefits of datasets:

  • No pollution: Each dataset tracks its own schema, reducing collisions and ambiguity.
  • Dynamic labels ($l): Labeling is now fully user-defined—e.g., $l.env, $l.cluster, $l.region.
  • Flexible storage: Datasets can route to different buckets or prefixes, and their location can change without needing to copy data.
  • Improved performance: Queries run faster because datasets only contain semantically related data.
  • Reusable outputs: You can write query results into new datasets—supporting summary tables, derived views, or archival logic.
  • Future-proof: Logs and spans will eventually migrate to dynamic datasets with full support for $l-based routing and labeling.

In short: Datasets offer structure, separation, and flexibility.
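
As a minimal sketch of the dynamic $l labeling described above (the label $l.env and its value are hypothetical), a filter on a user-defined label might look like:

// Filter on a user-defined label
// ($l.env and the value 'production' are illustrative)
source default/logs
| filter $l.env == 'production'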


Key capabilities of datasets

| Capability | Description |
|---|---|
| Dynamic creation | Datasets are created on the fly based on routing rules or labels like $l.applicationname. No manual setup required. |
| Scoped performance | Segmented datasets reduce schema collisions and improve query speed by narrowing the search space. |
| Granular control | Apply retention, access, routing, and enrichment policies at the dataset level. |
| Reusability | You can write query results into datasets and retrieve them later for dashboards, joins, or long-term analytics. |
| Clarity and structure | Datasets make data easier to organize and reason about, whether by team, service, environment, or data type. |

Example: writing to and reading from a dataset

Note

Duplicated data created by queries will count towards your quota.

// Write query results to a dataset
source logs
| filter status_code >= 500
| writeTo default/high_errors
// Reuse it later
source default/high_errors
| groupby path agg count()

This workflow is especially helpful for recurring reports, dashboards, and trend analyses.
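
Building on the same pattern, a derived dataset can itself feed another dataset, for example a summary table (the dataset name high_errors_summary is hypothetical):

// Aggregate the derived dataset and persist the summary
// (high_errors_summary is a hypothetical dataset name)
source default/high_errors
| groupby path agg count()
| writeTo default/high_errors_summary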


Query syntax

source <dataspace>/<dataset>

If you're in the default dataspace, you can omit the prefix:

source logs

These are equivalent:

source logs
source default/logs

And when used alone:

| filter status_code >= 500

implicitly queries default/logs.


System datasets

Coralogix includes several system datasets in the system dataspace. These are read-only and auto-generated.
| Dataset | Description |
|---|---|
| system/aaa.audit_events | Stores audit logs for compliance and access monitoring. |
| system/alerts.history | Records alert evaluation and trigger metadata. |
| system/cases | Models each case from creation and acknowledgement through resolution. |
| system/engine.queries | Historical record of user queries for introspection and optimization. |
| system/engine.schema_fields | Tracks field-level schema evolution over time. |
| system/labs.limit_violations | Records each time a configured limit is exceeded. |
| system/notification.deliveries | Captures the lifecycle of outbound alert notifications. |
| system/notification.requests | Captures metadata for each incoming notification request. |

These datasets power features like schema visualization, alert performance tracking, and auditing. See System dataset for more information.
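
For instance, a query against a system dataset, assuming it has been enabled, might look like the sketch below (the grouping field alert_name is illustrative; check the dataset's schema for actual field names):

// Query a system dataset by its full <dataspace>/<dataset> path
// (alert_name is an assumed field name)
source system/alerts.history
| groupby alert_name agg count()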


Dataset schemas

Each dataset has an associated schema, influenced by its pillar (logs, spans, etc.) and entity type (e.g., alerts, browserLogs, cpuProfiles).
| Pillar | Entity type | Example schema |
|---|---|---|
| logs | alerts | { alert_name, severity, status, triggered_at } |
| logs | browserLogs | { user_agent, page_url, timestamp } |
| logs | text | { text: "..." } |
| spans | spans | OpenTelemetry-formatted span objects |
| metrics | metrics | { __name__, value, labels... } |
| binary | sessionRecordings | Metadata + link to binary |
| binary | files | File metadata (e.g., name, size, uploaded_by) |
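
As an illustration of how the entity type shapes queries, a hypothetical logs dataset with the alerts example schema above might be queried like this (the path default/alerts and the value 'critical' are assumptions):

// Hypothetical dataset path; fields come from the alerts example schema above
source default/alerts
| filter severity == 'critical'
| groupby status agg count()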


Managing datasets

With Dataset Management, you can manage your datasets from the UI by navigating to:

Data Flow > Dataset Management

Here, you can:

  • View all active datasets
  • Enable/disable system datasets
  • Apply configuration rules
  • View schema definitions
  • Inspect sample documents

Enabling and disabling datasets

Datasets, especially system datasets, must be manually enabled. Once enabled:

  • All users can query them
  • They count toward your daily quota
  • Previously generated data remains accessible, even if later disabled

Disabling a dataset stops its ingestion — not its storage.

