It’s impossible to ignore AWS as a major player in the public cloud space. With $13.5billion in revenue in the first quarter of 2021 alone, Amazon’s biggest earner is ubiquitous in the technology world. Its success can be attributed to the wide variety of services available, which are rapidly developed to match industry trends and requirements.
One service that keeps AWS ahead of the game is EKS or AWS’s Elastic Kubernetes Service. With customers from Snap Inc to HSBC, EKS is the most mature of the public cloud providers’ managed Kubernetes services. However, as we’ve said before, Kubernetes monitoring can be tricky. Add an extra layer of managed services to that, as with EKS, and it becomes more complex. Fortunately, at Coralogix, we’re the experts on observability and monitoring. Read on for our take on EKS and what you should be monitoring.
Launched in June 2018, AWS took the open-source Kubernetes project and promised to handle the control plane, leaving the nodes in the hands of the customers. Both Google’s GKE and Azure’s AKS do the same.
Of the three Kubernetes services, EKS has the fewest out-of-the-box features but favors the many organizations with pre-existing AWS infrastructure. For this reason, no doubt, EKS remains the most popular Kubernetes service.
As with any Kubernetes deployment, there are two main components in EKS – the control plane and the nodes/clusters. As mentioned, AWS runs the control plane for you, allowing your DevOps teams to focus on the nodes and clusters.
Whilst you can have the typical container compute and storage layer, or data fabric, EKS can also run on AWS Fargate. Fargate is basically AWS Lambda but for containers.
So then, EKS monitoring needs to focus on three main things. The Kubernetes objects, such as the control plane and nodes, the usage of these objects, and the underlying services that support or integrate with EKS.
So, we’ve covered what makes up a standard EKS deployment. Now, we’re going to examine some key metrics that need to be monitored within EKS. This focuses on health and availability, not usage (as we’ll get onto that later).
There are a variety of cluster status metrics available on EKS which come from the Kubernetes API Server. We’ll examine the most important ones below.
Monitoring the node status is one of the most important aspects of EKS monitoring. It returns a detailed health check on the availability and viability of your nodes. Node status will let you know if your node is ready to accept pods. It will also let you know if the node has enough disk space or is resource-constrained, which is useful for understanding whether there are too many processes running on a given node.
These metrics can be requested ad hoc, or by the configuration of heartbeats. Heartbeats are, as standard, configured to 40-second intervals. However, this may be too long for mission-critical clusters and can be altered for sub-second data insights.
In Kubernetes, you can describe your deployment declaratively. This can include the number of pods you wish to be running at any given point, which is likely to be contingent on both resource availability and workload requirements.
By inspecting and monitoring the delta between desired pods running, and actual pods running, you can diagnose a range of issues. If your nodes aren’t able to launch the number of pods that you’re looking for, it may point to underlying resource problems.
If AWS is running and managing the control plane, then why would you waste your time monitoring it? Simply put, the performance of the underlying control plane is going to be critical to the success and health of your Kubernetes deployment, whether you’re responsible for it or not.
Monitoring metrics like ‘apiserver_request_latencies_sum’ will give you the overall processing time on the Kubernetes API server. This is useful for understanding the performance of the control plane, and what the connectivity is like with your cluster.
You can also examine the latency from the controller to understand network constraints that might be affecting the relationship between the EKS cluster and the control plane.
As one of the key resources for EKS, you must monitor several key memory-related metrics produced by Kubernetes.
Memory utilization, or over-utilization, can be a big performance bottleneck for EKS. If a pod exceeds its predefined memory usage limit, then it will be killed by the node. Whilst this is good for resource management and workload balancing, if this happens frequently it could be harmful to your cluster overall.
When a new container is deployed, it will request a default allocation of memory if this is not already defined. The memory requests per node must not exceed the allocated memory per node, or your nodes will start killing off individual pods.
If a node is running with low available disk space, then the node will not be able to create new pods. Metrics such as ‘nodefs.available’ will give you the amount of available disk for a given node, to ensure you have adequate resources and aren’t preventing new pods from being deployed.
The other important resource aspect for EKS is the CPU. This is typically supplied by EC2 worker nodes.
Similarly to the memory above, EKS requires the CPU allocated per core to always be more than the CPU in use. By monitoring CPU usage between nodes, you can look for performance bottlenecks, underresourced nodes, and process efficiencies.
The last piece of the EKS monitoring puzzle is the various AWS services used to underpin any EKS deployment. We’ve touched on some of them above, but we’ll go on to discuss why it’s important to look at them and what you should be looking out for.
In an EKS deployment, your worker nodes are EC2 instances. Monitoring their capacity and performance is critical because it underpins the health and capacity of your Kubernetes clusters. We’ve already touched on CPU monitoring above, but so far have focused on the Kubernetes API server-side. For best practice, monitor both the EC2 instances and the CPU performance on the API server.
If EC2 instances are the CPU, then EBS volumes are the memory aspect. EBS volumes provide the persistent volume storage required for EKS deployments. Monitoring throughput and IOPs of EBS volumes are critical in understanding if your storage layer is causing bottlenecks. Because AWS throttle EBS volume IOPs, poor performance, or high latency could indicate that you have not provisioned adequate IOPs per volume for your workload.
As mentioned above, Fargate is the serverless service for containers in AWS. Fargate has a completely separate set of metrics for both the Kubernetes Server API and other AWS services. However, there are some parities to look out for, such as memory utilization, allocated CPU, and others. Fargate runs on a task basis, so monitoring it directly will give you an idea of the success of your tasks and therefore an understanding of the health of your EKS deployment.
AWS has the option of either an Elastic Load Balancer or Network Load Balancer, with the former being the default option. Load balancers form the interface between your containers and any web or web application layer. Monitoring load balancers for latency is a great way of rooting out network and connectivity issues and catching problems before they reach your user base.
As we know, monitoring and observability are not the same things. In this article, we have discussed what you need to be monitoring, but not necessarily how that boosts observability. AWS does provide Container Insights as part of Cloudwatch, but it hasn’t been well received in the market.
That’s where Coralogix comes in.
The power of observability is reviewing metrics from across your system with context from disparate data sources. The Coralogix integration with Fluentd gives you insights straight from the Kubernetes API server. What’s more, you can cross-compare data with the AWS Status Log in real-time, ensuring that you know of any issues arising with the organizations you’re entrusting with the hosting of your control plane. You can integrate Coralogix with your load balancer and even with Cloudwatch, and then view all of this data in a dashboard of your choice.
So, EKS is complicated. But monitoring it doesn’t need to be. Coralogix has helped enterprises and startups across the world with their Kubernetes observability challenges, so we have seen it all.