Kubernetes monitoring can be complex. To do it successfully requires several components to be monitored simultaneously. First, it’s important to understand what those components are, which metrics should be monitored and what tools are available to do so.
In this post, we’ll take a close look at everything you need to know to get started with monitoring your Kubernetes-based system.
When monitoring the cluster, a full view across all areas is obtained, giving a good impression of the health of all pods, nodes and apps.
Key areas to monitor at the cluster level include:
Cluster monitoring provides a global view of the Kubernetes environment, but collecting data from individual pods is also essential. It reveals the health of individual pods and the workloads they are hosting, providing a clearer picture of pod performance at a granular level, beyond the cluster.
Key areas to monitor at the cluster level include:
To gain a higher visibility into a Kubernetes installation, there are several metrics that will provide valuable insight into how the apps are running.
These are metrics collected from the Kubernetes code, written in Golang. They allow understanding of performance in the platform at a cellular level and display the state of what is happening in the GoLang processes.
Monitoring the standard metrics from the operating systems that power Kuberntees nodes provides insight into the health of each node.
Each Kubernetes Node has a finite capacity of memory and CPU and that can be utilized by the running pods, so these two metrics need to be monitored carefully. Other common node metrics to monitor include CPU load, memory consumption, filesystem activity and usage and network activity.
One approach to monitoring all cluster nodes is to create a special kind of Kubernetes pod called DaemonSets. Kubernetes ensures that every node created has a copy of the DaemonSet pod, which virtually enables one deployment to watch each machine in the cluster. As nodes are destroyed, the pod is also terminated.
To ensure the Control Plane is communicating efficiently with each individual node that a Kubelet runs on, it is recommended to monitor the Kubelet agent regularly. Beyond the common GoLang common metrics described above, Kubelet exposes some internals about its actions that are useful to track as well.
To ensure that workloads are orchestrated effectively, monitor the requests that the Controller is making to external APIs. This is critical in cloud-based Kubernetes deployments.
To identify and prevent delays, monitor latency in the scheduler. This will ensure Kubernetes is deploying pods smoothly and on time.
The main responsibility of the scheduler is to choose which nodes to start newly launched pods on, based on resource requests and other conditions.
The scheduler logs are not very helpful on their own. Most of the scheduling decisions are available as Kubernetes events, which can be logged easily in a vendor-independent way, thus are the recommended source for troubleshooting. The scheduler logs might be needed in the rare case when the scheduler is not functioning, but a kubectl logs call is usually sufficient.
etcd stores all the configuration data for Kubernetes. etcd metrics will provide essential visibility into the condition of the cluster.
Looking specifically into individual containers will allow monitoring of exact resource consumption rather than more general Kubernetes metrics. CAdvisor analyzes resource usage happening inside containers.
The Kubernetes API server is the interface to all the capabilities that Kubernetes provides. The API server controls all the operations that Kubernetes can perform. Monitoring this critical component is vital to ensure a smooth running cluster.
The API server metrics are grouped into a major categories:
kube-state-metrics is a service that makes cluster state information easily consumable. Where the Metrics Server exposes metrics on resource usage by pods and nodes, kube-state-metrics listens to the Control Plane API server for data on the overall status of Kubernetes objects (nodes, pods, Deployments, etc) as well as the resource limits and allocations for those objects. It then generates metrics from that data that are available through the Metrics API.
kube-state-metrics is an optional add-on. It is very easy to use and exports the metrics through an HTTP endpoint in a plain text format. They were designed to be easily consumed / scraped by open source tools like Prometheus.
In Kubernetes, the user can fetch system-level metrics from various out of the box tools like cAdvisor, Metrics Server, and Kubernetes API Server. It is also possible to fetch application level metrics from integrations like kube-state-metrics and Prometheus Node Exporter.
Prometheus scrapes metrics from instrumented jobs, either directly or via an intermediary push gateway for short-lived jobs. It locally stores all scraped samples and runs rules over this data to either aggregate and record new time series from existing data or generate alerts. Grafana or other API tools can be used to visualize the collected data.
One of the most popular Kubernetes monitoring solutions is the open-source Prometheus, Grafana and Alertmanager stack, deployed alongside kube-state-metrics and node_exporter to expose cluster-level Kubernetes object metrics as well as machine-level metrics like CPU and memory usage.
Prometheus is a pull-based tool used specifically for containerized environments like Kubernetes. It is primarily focused on the metrics space and is more suited for operational monitoring. Exposing and scraping prometheus metrics is straightforward, and they are human readable, in a self-explanatory format. The metrics are published using a standard HTTP transport and can be checked using a web browser.
Apart from application metrics, Prometheus can collect metrics related to:
Prometheus can configure rules to trigger alerts using PromQL, Alertmanager will be in charge of managing alert notification, grouping, inhibition, etc.
PromQL (Prometheus Query Language) lets the user choose time-series data to aggregate and then view the results as tabular data or graphs in the Prometheus expression browser. Results can also be consumed by the external system via an API.
How does Alertmanager fit in? The Alertmanager component configures the receivers, gateways to deliver alert notifications. It handles alerts sent by client applications such as the Prometheus server and takes care of deduplicating, grouping, and routing them to the correct receiver integration such as email, PagerDuty or OpsGenie. It also takes care of silencing and inhibition of alerts.
Grafana can pull metrics from any number of Prometheus servers and display panels and dashboards. It also has the added ability to register multiple different backends as a datasource and render them all out on the same dashboard. This makes Grafana an outstanding choice for monitoring dashboards.
Logs are useful to examine when a problem is revealed by metrics. They give exact and invaluable information which provides more details than metrics. There are many options for logging in most of Kubernetes’ components. Applications also generate log data.
Digging deeper into the cluster requires logging into the relevant machines.
The locations of the relevant log files are:
/var/log/kube-apiserver.log – API Server, responsible for serving the API
/var/log/kube-scheduler.log – Scheduler, responsible for making scheduling decisions
/var/log/kube-controller-manager.log – Controller that manages replication controllers
/var/log/kubelet.log – Kubelet, responsible for running containers on the node
/var/log/kube-proxy.log – Kube Proxy, responsible for service load balancing
etcd uses the Github capnslog library for logging application output categorized into levels.
A log message’s level is determined according to these conventions:
When it comes to troubleshooting the Kubernetes cluster and the applications running on it, understanding and using logs are crucial. Like most systems, Kubernetes maintains thorough logs of activities happening in the cluster and applications, which highlight the root causes of any failures.
Logs in Kubernetes can give an insight into resources such as nodes, pods, containers, deployments and replica sets. This insight allows the observation of the interactions between those resources and see the effects that one action has on another. Generally, logs in the Kubernetes ecosystem can be divided into the cluster level (logs outputted by components such as the kubelet, the API server, the scheduler) and the application level (logs generated by pods and containers).
Use the following syntax to run kubectl commands from your terminal window:
kubectl [command] [TYPE] [NAME] [flags]
kubectl get pod pod1 # Lists resources of the pod ‘pod1’
kubectl logs pod1 # Returns snapshot logs from the pod ‘pod1’
Since Kubernetes Events capture all the events and resource state changes happening in your cluster, they allow past activities to be analyzed in your cluster. They are objects that display what is happening inside a cluster, such as the decisions made by the scheduler or why some pods were evicted from the node. They are the first thing to inspect for application and infrastructure operations when something is not working as expected.
Unfortunately, Kubernetes events are limited in the following ways:
To address these issues, open source tools like Kubewatch, Eventrouter and Event-exporter have been developed.
Kubernetes monitoring is performed to maintain the health and availability of containerized applications built on Kubernetes. When you are creating the monitoring strategy for Kubernetes-based systems, it’s important to keep in mind the top metrics to monitor along with the various monitoring tools discussed in this article.