Kubernetes metrics are data points that provide insights into the performance and health of Kubernetes clusters. They help administrators understand the availability, utilization, and overall operation of resources spanning nodes, pods, and containers managed within a Kubernetes environment.
Analyzing these metrics allows for optimized resource allocation and improved system performance. Tools like Prometheus and Grafana are commonly used for monitoring these metrics in real time.
Metrics collected from Kubernetes deployments include CPU usage, memory consumption, network I/O, and storage details. These metrics are useful for maintaining cluster health by identifying bottlenecks, ensuring efficient resource usage, and triggering scaling actions.
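For example, a Prometheus deployment can discover and scrape cluster components through its Kubernetes service discovery. Below is a minimal sketch of a scrape job for node metrics; it assumes Prometheus runs in-cluster with a service account permitted to read node information, and it is not a production-ready configuration.

```yaml
# Minimal sketch: a Prometheus scrape job that discovers cluster nodes.
# Assumes in-cluster Prometheus with RBAC permission to list nodes.
scrape_configs:
  - job_name: "kubernetes-nodes"
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    kubernetes_sd_configs:
      - role: node   # one scrape target per node in the cluster
```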
This is part of a series of articles about Kubernetes Monitoring.
Monitoring Kubernetes ensures that the clusters operate efficiently and remain healthy. Continuous monitoring helps detect and resolve issues before they impact service availability or performance. It also aids in capacity planning by providing visibility into when and where resources are maxed out, which helps in proactive scaling and resource management.
Monitoring also supports security compliance and governance by logging and analyzing cluster activity. This is important for spotting unusual or unauthorized activities that could indicate security incidents. It helps ensure that Kubernetes deployments adhere to best practices for security, reducing the risk of vulnerabilities.
Kubernetes monitoring should track metrics related to several components.
Kubernetes cluster metrics provide a high-level overview of the cluster’s health and performance. These include cluster-wide resource usage like total CPU and memory consumption, which can help in understanding the workload distribution and identifying resource-intensive applications.
Metrics such as node availability and cluster state changes are also monitored to ensure high availability and fault tolerance. Cluster metrics also include performance metrics like request rates, error rates, and latencies. These can be helpful in identifying performance trends over time and making data-driven decisions about scaling and resource allocation.
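A quick way to spot-check node availability and recent cluster state changes from the command line is shown below; these are illustrative commands, not a full monitoring setup.

```sh
kubectl get nodes                                # node availability and readiness
kubectl get events -A --sort-by=.lastTimestamp   # recent cluster-wide state changes
```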
Control plane metrics focus on the health and performance of the components that run the cluster itself, such as the API Server, etcd, Scheduler, and Controller Manager. Monitoring these metrics is important because these components manage the state and configuration data of all Kubernetes objects.
For example, metrics related to etcd database performance and API Server call latencies can impact the overall responsiveness and stability of the Kubernetes control plane. Monitoring these components helps in detecting and analyzing issues that affect the orchestration and operational aspects of Kubernetes.
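The API Server exposes its own metrics in Prometheus format on the /metrics endpoint. As a rough illustration (metric names can vary across Kubernetes versions), request latency data can be sampled directly:

```sh
# Sample API Server latency metrics; requires permission to read /metrics.
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
```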
Node metrics in Kubernetes provide information about the performance and status of the worker nodes. These include CPU, memory, disk, and network metrics for each node, useful for identifying underperforming nodes or potential failures. Monitoring node status, readiness, and condition can help ensure that nodes are functioning correctly and are healthy.
This type of monitoring helps in managing the node lifecycle, from scaling up capacity to replacing nodes that are persistently problematic, which helps maintain the normal functioning of the cluster.
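Node conditions and per-node status can be inspected directly with kubectl; the node name below is a placeholder.

```sh
kubectl get nodes -o wide
# Conditions such as Ready, MemoryPressure, DiskPressure, and PIDPressure
# appear in the describe output:
kubectl describe node <node-name>
```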
Pod metrics are data points related to the operational aspects of pods within Kubernetes. This includes resource usage metrics such as CPU and memory utilization per pod, restart counts, and up/down status. These metrics are useful for troubleshooting individual pods and understanding their performance within the cluster.
By monitoring pod metrics, administrators can ensure that applications run smoothly and adhere to the configured resource limits and requests, preventing resource contention and ensuring reliable application performance.
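As a quick sketch, restart counts and per-container resource usage can be checked with kubectl; the second command assumes the Metrics Server is installed, and the pod name is a placeholder.

```sh
kubectl get pods -A                       # STATUS and RESTARTS columns per pod
kubectl top pod <pod-name> --containers   # per-container CPU and memory usage
```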
Application metrics focus on the performance of the applications running inside the pods. These could include transaction volumes, application response times, and custom application performance indicators. Application Performance Monitoring (APM) tools are used to gather these metrics, providing insights into application health.
Monitoring application metrics allows developers and operations teams to fine-tune applications, improve response times, and enhance the overall customer experience. They help ensure applications are performing as expected and efficiently using underlying resources.
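Many Prometheus-based setups collect custom application metrics via a scrape-annotation convention. Note that this is a convention honored by common Prometheus configurations, not a built-in Kubernetes API, and the names below are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-app                   # hypothetical application pod
  annotations:
    prometheus.io/scrape: "true"     # convention: opt this pod into scraping
    prometheus.io/port: "8080"       # port where the app serves metrics
    prometheus.io/path: "/metrics"   # path of the metrics endpoint
spec:
  containers:
    - name: app
      image: registry.example.com/sample-app:latest   # hypothetical image
```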
Here are some of the most important metrics to track in Kubernetes.
CPU and memory requests are the guaranteed amounts of these resources that a container will receive. Monitoring these metrics helps ensure that containers get the resources they need to run properly.
Why this metric is important: If requests are set too low, containers may not get enough resources, leading to poor performance. Setting them too high can lead to wasted resources. Tracking these metrics allows administrators to adjust resource requests dynamically, optimizing the performance and efficiency of the cluster.
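A minimal sketch of a pod spec with resource requests; the values are illustrative, not recommendations.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: requests-demo        # hypothetical pod
spec:
  containers:
    - name: web
      image: nginx:1.27      # illustrative image
      resources:
        requests:
          cpu: "250m"        # guaranteed quarter of a CPU core
          memory: "128Mi"    # guaranteed 128 MiB of memory
```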
CPU and memory limits define the maximum resources a container can use. Monitoring these metrics helps prevent any single container from consuming all the resources on a node, which could affect other containers running on the same node.
Why this metric is important: By monitoring CPU and memory limits, administrators can identify containers that frequently hit their resource limits. These limits may need to be scaled or optimized to ensure fair resource distribution and maintain cluster stability.
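Extending the sketch above with limits (values again illustrative): a container exceeding its memory limit is OOM-killed, while CPU usage above the limit is throttled.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: limits-demo          # hypothetical pod
spec:
  containers:
    - name: web
      image: nginx:1.27      # illustrative image
      resources:
        requests:
          cpu: "250m"
          memory: "128Mi"
        limits:
          cpu: "500m"        # usage above this is throttled
          memory: "256Mi"    # exceeding this gets the container OOM-killed
```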
This metric compares the number of replicas specified in the deployment configuration (desired replicas) to the number of replicas currently running (current replicas). Monitoring this helps ensure that the application is running the intended number of instances, which is crucial for maintaining high availability and load distribution.
Why this metric is important: Discrepancies between desired and current replicas can indicate issues with pod scheduling, resource shortages, or other operational problems that need immediate attention.
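A quick way to compare desired and current replicas for a deployment (the name my-app is hypothetical):

```sh
kubectl get deployment my-app        # READY shows ready/desired, e.g. 2/3 flags a gap
kubectl describe deployment my-app   # events often explain missing replicas
```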
This metric provides an understanding of how resources are being utilized relative to what was allocated. Monitoring this helps identify resource contention and underutilization.
Why this metric is important: By analyzing resource requests vs. limits metrics, administrators can optimize resource allocation, ensuring that limits are set appropriately to maximize efficiency without overcommitting resources. This enables more balanced use of the cluster’s capabilities.
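One simple way to see requested capacity relative to what a node can actually allocate (the node name is a placeholder):

```sh
kubectl describe node <node-name>
# The "Allocated resources" section lists CPU and memory requests and limits
# as a percentage of the node's allocatable capacity.
```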
Network I/O metrics track the amount of data being transmitted and received by containers. Monitoring these metrics is essential for understanding the network load and identifying potential bottlenecks.
Why this metric is important: High network I/O can indicate data-intensive applications that might need dedicated resources or optimization. Low network I/O can indicate underutilized resources. These metrics help ensure that network traffic is managed properly, supporting the overall performance and reliability of applications.
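Per-container network counters come from cAdvisor, which runs inside each kubelet. As a sketch, the raw metrics can be sampled through the API server's node proxy; the node name is a placeholder and proxy access is required.

```sh
kubectl get --raw "/api/v1/nodes/<node-name>/proxy/metrics/cadvisor" \
  | grep container_network_receive_bytes_total
```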
The scheduling attempt metric reveals the number of times the Kubernetes scheduler tries to place a pod on a node. High numbers of scheduling attempts can indicate issues with resource availability or constraints that prevent pods from being scheduled successfully.
Why this metric is important: Monitoring this metric helps identify problems in the scheduling process, such as insufficient resources, node taints, or affinity/anti-affinity rules. This allows administrators to take corrective actions to improve the efficiency and reliability of the scheduling process.
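Pods that cannot be scheduled show up as Pending; a quick sketch for surfacing them and the underlying reasons (the pod name is a placeholder):

```sh
kubectl get pods -A --field-selector=status.phase=Pending
kubectl describe pod <pod-name>    # look for FailedScheduling events and reasons
```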
Kubernetes metrics can be monitored natively in Kubernetes or via third-party tools.
The Kubernetes Metrics Server is a lightweight, scalable source of container resource metrics for Kubernetes built-in autoscaling pipelines, such as the Horizontal Pod Autoscaler and the Vertical Pod Autoscaler. It collects metrics from the Kubelets on each node and provides an aggregated view of resource usage across the cluster.
To deploy the Metrics Server, teams can use the provided manifests or Helm charts. Once deployed, it periodically polls the Kubelets for CPU and memory usage statistics from each node and pod, storing this data in memory. This allows for real-time monitoring without significant overhead.
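As an example, the kubernetes-sigs/metrics-server project publishes a components manifest that can be applied directly; verify the manifest against your cluster version before use.

```sh
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
```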
The Metrics Server can be used to query the current resource usage of pods and nodes with the kubectl top command. For example, kubectl top nodes displays the CPU and memory usage for each node, while kubectl top pods shows the same metrics for each pod.
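A few illustrative invocations (the -A flag covers all namespaces):

```sh
kubectl top nodes      # CPU and memory usage per node
kubectl top pods       # CPU and memory usage per pod in the current namespace
kubectl top pods -A    # the same across all namespaces
```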
While the Kubernetes Metrics Server is useful for basic monitoring, it has limitations that might require the use of more advanced third-party tools. The main limitation of the Metrics Server is its focus on real-time metrics with minimal historical data retention. This can be insufficient for in-depth analysis, trend forecasting, and long-term capacity planning.
Benefits of using third-party tools include:
- Long-term retention of historical metrics for in-depth analysis, trend forecasting, and capacity planning
- Richer dashboards and visualization than the built-in tooling provides
- Alerting on thresholds, trends, and anomalies
- Correlation of metrics with logs and traces for faster troubleshooting
Third-party monitoring tools help achieve a more scalable and actionable view of the Kubernetes environment, ensuring optimal performance and reliability.
Related content: Read our guide to Kubernetes Monitoring Tools
Coralogix sets itself apart in observability with its modern architecture, enabling real-time insights into logs, metrics, and traces with built-in cost optimization. Coralogix’s straightforward pricing covers all of its platform offerings, including APM, RUM, SIEM, infrastructure monitoring, and much more. With unparalleled support featuring response times of under 1 minute and resolution times of 1 hour, Coralogix is a leading choice for thousands of organizations across the globe.