Fleet Management

Overview

Coralogix Fleet Management provides tracking and control for OpenTelemetry agents deployed into customer infrastructure, powered by the Open Agent Management Protocol (OpAMP). It is a centralized control plane that enables customers to monitor, configure, upgrade and diagnose agents at scale, remotely and securely.

Benefits

With a small number of agents, managing versions, patching bugs, updating config and ensuring uptime is simple. As the fleet of agents scales, so too does the operational overhead. Your agents form a fundamental part of your telemetry pipeline, and as such, are essential to the continued observability of your production workloads.

Fleet management aims to simplify the process of tracking, upgrading and maintaining your agent estate, so that changes can be easily rolled out, bugs can be simply patched and your engineers can remain focused on maintaining and improving your production workloads.

Use it to:

Track the uptime and functionality of your fleet of agents.
Ensure consistent configuration across all of your agents.
Track the resource consumption of your agents that run on production workloads, to minimise the Noisy Neighbour problem.
Ensure full agent coverage of your applications, by understanding the servers to which your agents are deployed today.

Prerequisites

OpAMP enabled in your OpenTelemetry Agent

OpAMP needs to be enabled in your OpenTelemetry agents and configured to communicate with the Coralogix API endpoints. While your individual configuration will vary, this is typically done by adding the opamp extension to your configuration:

extensions:
  opamp:
    server:
      ws:
        endpoint: wss://127.0.0.1:4320/v1/opamp # Placeholder; point this at the Coralogix OpAMP servers
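
Declaring the extension is not enough on its own: the Collector only starts extensions that are also listed under service.extensions. The following is a minimal sketch of the two parts together, not a complete configuration. The endpoint is the same placeholder used above, and your receivers, exporters and pipelines are assumed to be defined alongside it.

```yaml
extensions:
  opamp:
    server:
      ws:
        # Placeholder endpoint copied from the example above; substitute
        # the OpAMP endpoint for your Coralogix account.
        endpoint: wss://127.0.0.1:4320/v1/opamp

service:
  # An extension is only started if it is listed here.
  extensions: [opamp]
```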

If you've installed OpenTelemetry via the Coralogix Helm Chart, then you can simply follow the instructions in the README.md, and add the following to your values.yaml file before updating your agents:

presets:
  fleetManagement:
    enabled: true

Note

There are a number of security considerations to factor in when you integrate with any OpAMP-based solution. The most obvious is that the agent's configuration becomes visible in Coralogix. If you have secrets in your configuration, you are strongly advised to template those values into your configuration using environment variable expansion.
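
As a sketch of that advice, the OpenTelemetry Collector can expand ${env:NAME} references when it loads its configuration, so a secret does not need to be written into the file itself. The variable name CORALOGIX_PRIVATE_KEY and the exporter settings below are assumptions for illustration; adapt them to the components you actually run.

```yaml
exporters:
  coralogix:
    # Assumed domain value, for illustration only.
    domain: "coralogix.com"
    # Read from the environment at load time rather than hard-coding the key.
    # CORALOGIX_PRIVATE_KEY is a hypothetical variable name.
    private_key: "${env:CORALOGIX_PRIVATE_KEY}"
```

The variable must be set in the environment of the agent process (for example through a Kubernetes secret mounted as an environment variable) before the agent starts.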

Exploring your fleet in Coralogix

To explore fleet management in Coralogix, first navigate to the Infrastructure Explorer, and select Coralogix Agents. This brings up a list of all of your agents, alongside a number of filters and query options to help you narrow down your working set.

By default, the Coralogix Agents view shows high level information about every agent:

The agent name (often derived from the server on which the agent is hosted)
The agent type
The OS of the host server
The Status of the agent
The telemetry pipelines set up in the OTel config for that agent
The version of the agent
The name of the cluster in which this agent is deployed
The agent CPU usage
The agent memory usage
Records refused by the receiver on this agent
Send failures by the exporter on the agent

Filtering and searching

On the left-hand side of the screen is a collection of predefined filters for the most common filter dimensions. Selecting any of these filters automatically updates the working set.

Alternatively, users can use the search bar at the top of the page to perform key-value searches (using the Lucene syntax), or use free text search to look for the presence of a string in any field attached to an agent. For example, searching for gateway alone will find all agents with an agent type of gateway, but will also find any agents with 'gateway' in the name.

Note

Free text searching can yield some noise, so it is often simpler to query a specific field when you know which one you need.

Users can also alter the time range. This is particularly useful for understanding what happened to an agent that no longer appears in the list. By default, the view shows Current Time, which is the most up-to-date view of your fleet.

Queries can also be updated by selecting a field in the table.

Investigating a single agent

Selecting one of your agents opens a modal window with the following sections.

Overview - A high level reading of the live status of your agent, CPU, memory and any receiver / exporter errors
Relationships - Significant infrastructure relationships that this agent has. For example, the server on which it is deployed
Configuration - The OpenTelemetry configuration for this agent
Metrics - Explorable metrics that are being published by this collector through OpAMP

Note

The time range selector can also be altered when viewing a single agent, to track agent performance over time. The default time range is 15 minutes.

Overview

The Overview section provides high level graphs, to give an "at a glance" view of the health of your collector.

At the top, the Live Status view gives aggregate views of four golden signals for your agent. Below, the Key Metrics section shows a time-series view of the same data.

Relationships

The Relationships view connects the agent with the infrastructure on which it depends. This is useful when investigating an issue with an agent, to understand if the underlying infrastructure is the root cause.

There are two tabs:

Nodes: A Kubernetes-based view of a server, with metadata about cluster name and pod health
EC2 Instances: A view of the underlying AWS virtual machine.

Selecting the node or EC2 instance opens it in a separate Infrastructure Explorer view, where users can drill down into the details of that specific piece of infrastructure.

Note

Experienced users can click Explore to return to the Infrastructure Explorer, with filters automatically applied for the selected node or instance.

Configuration

The Configuration view gives an explorable view of the OpenTelemetry configuration for this particular agent. From this view, users can dive into the specific fields that have been set and that may be contributing to any misbehaviour or performance issues the agent is experiencing.

Users can search for a specific value within the configuration.

Using the icons to the right of the search bar, they can also copy the entire view to the clipboard or download the configuration as a YAML file.

Metrics

The metrics pane gives users the ability to freely explore the metrics produced by the agent. This can be useful for debugging and root cause analysis, where the root cause may be revealed by a more obscure metric.

From this view, users can expand and explore specific views, to better see what is going on.

Users can also search and filter using the search bar at the top of the view, which matches the search string against metric names. For example, searching for "consumer" narrows the view down to only consumer-related metrics.

To see the underlying query behind a metric, use the icon at the top right of the window for that metric. Selecting it opens a pane that shows the underlying query.