What to Watch on EKS – a Guide to Kubernetes Monitoring on AWS

Peter Hammond
April 5, 2022

It’s impossible to ignore AWS service monitoring as a major player in the public cloud space. With $13.5 billion in revenue in the first quarter of 2021 alone, Amazon’s biggest earner is ubiquitous in the technology world. Its success can be attributed to the wide variety of services available, which are rapidly developed to match industry trends and requirements.

One service that keeps AWS ahead of the game is EKS or AWS’s monitoring tool Elastic Kubernetes Service. With customers from Snap Inc to HSBC, EKS is the most mature of the public cloud providers’ managed Kubernetes services. However, as we’ve said before, Kubernetes monitoring can be tricky. Add an extra layer of managed services to that, as with EKS, and it becomes more complex. Fortunately, at Coralogix, we’re the experts on observability and monitoring. Read on for our take on EKS and what you should be monitoring.

What is EKS? An overview

Launched in June 2018, AWS took the open-source Kubernetes project and promised to handle the control plane, leaving the nodes in the hands of the customers. Both Google’s GKE and Azure’s AKS do the same.

Of the three Kubernetes services, EKS has the fewest out-of-the-box features but favors the many organizations with pre-existing AWS infrastructure. For this reason, no doubt, EKS remains the most popular Kubernetes service.

EKS Architecture

As with any Kubernetes deployment, there are two main components in EKS – the control plane and the nodes/clusters. As mentioned, AWS runs the control plane for you, allowing your DevOps teams to focus on the nodes and clusters.

Whilst you can have the typical container compute and storage layer, or data fabric, EKS can also run on AWS Fargate. Fargate is basically AWS Lambda but for containers.

So then, EKS monitoring needs to focus on three main things. The Kubernetes objects, such as the control plane and nodes, the usage of these objects, and the underlying services that support or integrate with EKS.

Monitoring EKS Objects

So, we’ve covered what makes up a standard EKS deployment. Now, we’re going to examine some key metrics that need to be monitored within EKS. This focuses on health and availability, not usage (as we’ll get onto that later).

Cluster Status

There are a variety of cluster status metrics available on EKS which come from the Kubernetes API Server. We’ll examine the most important ones below.

Node Status

Monitoring the node status is one of the most important aspects of EKS monitoring. It returns a detailed health check on the availability and viability of your nodes. Node status will let you know if your node is ready to accept pods. It will also let you know if the node has enough disk space or is resource-constrained, which is useful for understanding whether there are too many processes running on a given node.

These metrics can be requested ad hoc, or by the configuration of heartbeats. Heartbeats are, as standard, configured to 40-second intervals. However, this may be too long for mission-critical clusters and can be altered for sub-second data insights.

Pod Status

In Kubernetes, you can describe your deployment declaratively. This can include the number of pods you wish to be running at any given point, which is likely to be contingent on both resource availability and workload requirements.

By inspecting and monitoring the delta between desired pods running, and actual pods running, you can diagnose a range of issues. If your nodes aren’t able to launch the number of pods that you’re looking for, it may point to underlying resource problems.

Control Plane Status

If AWS is running and managing the control plane, then why would you waste your time monitoring it? Simply put, the performance of the underlying control plane is going to be critical to the success and health of your Kubernetes deployment, whether you’re responsible for it or not.

Performance and Latency

Monitoring metrics like ‘apiserver_request_latencies_sum’ will give you the overall processing time on the Kubernetes API server. This is useful for understanding the performance of the control plane, and what the connectivity is like with your cluster.

You can also examine the latency from the controller to understand network constraints that might be affecting the relationship between the EKS cluster and the control plane.

Monitoring Resources for EKS

When talking about resources relevant to EKS, we are most interested in storage and compute or CPU availability. Specifically, this is likely to be EBS volumes and EC2 instances.

Memory and Disk

As one of the key resources for EKS, you must monitor several key memory-related metrics produced by Kubernetes.

Utilization

Memory utilization, or over-utilization, can be a big performance bottleneck for EKS. If a pod exceeds its predefined memory usage limit, then it will be killed by the node. Whilst this is good for resource management and workload balancing, if this happens frequently it could be harmful to your cluster overall.

Requests

When a new container is deployed, it will request a default allocation of memory if this is not already defined. The memory requests per node must not exceed the allocated memory per node, or your nodes will start killing off individual pods.

Disk

If a node is running with low available disk space, then the node will not be able to create new pods. Metrics such as ‘nodefs.available’ will give you the amount of available disk for a given node, to ensure you have adequate resources and aren’t preventing new pods from being deployed.

CPU

The other important resource aspect for EKS is the CPU. This is typically supplied by EC2 worker nodes.

Utilization

Similarly to the memory above, EKS requires the CPU allocated per core to always be more than the CPU in use. By monitoring CPU usage between nodes, you can look for performance bottlenecks, underresourced nodes, and process efficiencies.

Monitoring Services for EKS

The last piece of the EKS monitoring puzzle is the various AWS services used to underpin any EKS deployment. We’ve touched on some of them above, but we’ll go on to discuss why it’s important to look at them and what you should be looking out for.

EC2

In an EKS deployment, your worker nodes are EC2 instances. Monitoring their capacity and performance is critical because it underpins the health and capacity of your Kubernetes clusters. We’ve already touched on CPU monitoring above, but so far have focused on the Kubernetes API server-side. For best practice, monitor both the EC2 instances and the CPU performance on the API server.

EBS

If EC2 instances are the CPU, then EBS volumes are the memory aspect. EBS volumes provide the persistent volume storage required for EKS deployments. Monitoring throughput and IOPs of EBS volumes are critical in understanding if your storage layer is causing bottlenecks. Because AWS throttle EBS volume IOPs, poor performance, or high latency could indicate that you have not provisioned adequate IOPs per volume for your workload.

Fargate

As mentioned above, Fargate is the serverless service for containers in AWS. Fargate has a completely separate set of metrics for both the Kubernetes Server API and other AWS services. However, there are some parities to look out for, such as memory utilization, allocated CPU, and others. Fargate runs on a task basis, so monitoring it directly will give you an idea of the success of your tasks and therefore an understanding of the health of your EKS deployment.

Load Balancers

AWS has the option of either an Elastic Load Balancer or Network Load Balancer, with the former being the default option. Load balancers form the interface between your containers and any web or web application layer. Monitoring load balancers for latency is a great way of rooting out network and connectivity issues and catching problems before they reach your user base.

Observability for EKS

As we know, monitoring and observability are not the same things. In this article, we have discussed what you need to be monitoring, but not necessarily how that boosts observability. AWS does provide Container Insights as part of Cloudwatch, but it hasn’t been well received in the market.

That’s where Coralogix comes in.

The power of observability is reviewing metrics from across your system with context from disparate data sources. The Coralogix integration with Fluentd gives you insights straight from the Kubernetes API server. What’s more, you can cross-compare data with the AWS Status Log in real-time, ensuring that you know of any issues arising with the organizations you’re entrusting with the hosting of your control plane. You can integrate Coralogix with your load balancer and even with Cloudwatch, and then view all of this data in a dashboard of your choice.

So, EKS is complicated. But monitoring it doesn’t need to be. Coralogix has helped enterprises and startups across the world with their Kubernetes observability challenges, so we have seen it all.

What to Watch on EKS – a Guide to Kubernetes Monitoring on AWS

What is EKS? An overview

EKS Architecture

Monitoring EKS Objects

Cluster Status

Node Status

Pod Status

Control Plane Status

Performance and Latency

Monitoring Resources for EKS

Memory and Disk

Utilization

Requests

Disk

CPU

Utilization

Monitoring Services for EKS

EC2

EBS

Fargate

Load Balancers

Observability for EKS

Related Articles

Choosing the Best AWS Serverless Computing Solution

Using AWS Timestream for System Health Monitoring

What are AWS Log Insights and How You Can Use Them

Where Modern Observability and Financial Savvy Meet.

Where Modern Observability
and Financial Savvy Meet.