Workloads that run on Amazon’s Elastic Container Service (ECS) require regular monitoring to ensure that containerized applications on AWS run and are managed properly. In short, ECS monitoring is a must. ECS can run containers on either EC2 or Fargate. Both are compute services, but EC2 lets users configure virtually every functional aspect, while Fargate offers fewer settings and is simpler to set up. Before setting up a solution with ECS, determine whether this service best meets your needs.
When setting up an ECS service, you need to configure any monitoring and observability tools. AWS provides tools to collect and process ECS data to help monitor your solution. Developers can use Coralogix’s Log Analytics and AWS Observability platforms to provide more detailed metrics.
AWS provides metrics useful for ECS monitoring. Let’s go over several recommended metrics and evaluate how they will help ensure the health of your ECS infrastructure.
Task count measures how many tasks are active on the cluster. This metric helps analyze how busy your cluster is. When defining a service, users set the desired number of tasks to keep active, and ECS automatically works to keep that many tasks of the given task definition running. The desired count is also available in the service metrics from ECS. Users can compare the desired count with the running task count to ensure their services are executing as expected.
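The comparison between desired and running counts can be sketched in a few lines of Python. This is a minimal illustration, not an official AWS tool: it assumes each input dictionary looks like an entry of the `services` list returned by boto3's `ecs.describe_services` (fields `serviceName`, `desiredCount`, `runningCount`, `pendingCount`). The function name and the sample data are hypothetical.

```python
from typing import Dict, List


def find_underprovisioned_services(services: List[Dict]) -> List[Dict]:
    """Return services whose running task count is below the desired count.

    Assumption: each item mirrors an entry of the ``services`` list from
    boto3's ``ecs.describe_services`` response.
    """
    shortfalls = []
    for svc in services:
        missing = svc["desiredCount"] - svc["runningCount"]
        if missing > 0:
            shortfalls.append(
                {
                    "serviceName": svc["serviceName"],
                    "missing": missing,
                    "pending": svc.get("pendingCount", 0),
                }
            )
    return shortfalls


# Hypothetical sample data for illustration only.
sample_services = [
    {"serviceName": "web", "desiredCount": 4, "runningCount": 4, "pendingCount": 0},
    {"serviceName": "worker", "desiredCount": 3, "runningCount": 1, "pendingCount": 2},
]

for shortfall in find_underprovisioned_services(sample_services):
    print(shortfall)
```

In a real setup, the list would come from a call such as `boto3.client("ecs").describe_services(cluster=..., services=[...])["services"]`.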
Along with the number of running tasks, AWS tracks the number of pending tasks. Pending tasks are in a transition state while ECS waits for the container to start. Tasks can get stuck in a pending state, causing customer-facing service outages. An unresponsive Docker daemon or the ECS container agent losing connectivity with the ECS service may cause such failures during task launch.
ECS services launch tasks automatically. It is still worth configuring automatic alerts for tasks stuck in a pending state and for a running task count that is lower than required. Alerts allow development teams to intervene and limit outages on your platform in cases where automatic launches fail.
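One simple way to decide when a pending state deserves an alert is to check whether the pending count has stayed above zero for several consecutive samples. The sketch below assumes you already collect pending task counts at a fixed interval. The function name and the default of five samples are illustrative choices, not AWS recommendations.

```python
from typing import Sequence


def pending_stuck(pending_counts: Sequence[int], min_consecutive: int = 5) -> bool:
    """Return True if the last ``min_consecutive`` samples all show pending tasks.

    ``pending_counts`` is ordered oldest to newest, one sample per polling interval.
    """
    if min_consecutive <= 0:
        raise ValueError("min_consecutive must be positive")
    if len(pending_counts) < min_consecutive:
        return False  # not enough history to decide
    return all(count > 0 for count in pending_counts[-min_consecutive:])


# Hypothetical samples taken once per minute.
print(pending_stuck([0, 2, 2, 1, 1, 1], min_consecutive=5))
print(pending_stuck([2, 2, 0, 1, 1, 1], min_consecutive=5))
```

A brief spike in pending tasks during a normal deployment will not trip this check, because a single zero sample inside the window resets it.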
The total CPU reservation measures the total CPU units reserved by all the tasks running on a single ECS cluster. You can have multiple clusters in your AWS account, and each would specify its CPU reservation value.
Active tasks (Fargate) or container instances (EC2) on a cluster will use reserved CPU units. Each active instance will register a certain number of units based on its task definition. Instances that are ‘Active’ or ‘Draining’ will affect the results.
CPU utilization measures how many CPU units are used by the tasks running on your cluster’s EC2 instances. Comparing CPU reservation with utilization shows how many spare CPU units are available for new launches at any given time. If you attempt a launch without enough unreserved units, a failure may occur.
Users can also enhance metrics on EC2 instances, so CPU usage is logged per instance instead of only per cluster. AWS provides CPU capacity, reserved capacity, and usage per instance. Users must set up these metrics separately from the other cluster-based metrics.
To troubleshoot a CPU utilization close to 100%, AWS recommends rebooting the instance. If the CPU requirements are higher than you have reserved for the cluster, restarting will not fix the problem. In this case, you may need to revise your CPU requirements or convert to a new instance type with better performance.
Memory reservation measures how much memory is allocated for use in the cluster by all tasks. The reservation value is used to calculate the percentage of memory available. Cluster memory reservation is the percent ratio of the total mebibytes (MiB) of memory reserved by tasks to the total MiB of memory registered by container instances in the same cluster.
The memory utilization value shows the percentage of memory used in the cluster. It is the percent ratio of the total MiB of memory used by tasks divided by the total MiB of memory registered by container instances in the same cluster.
Cluster reservation metrics are available only for clusters with tasks using EC2 launch types. Fargate launch types don’t reserve CPU or memory the same way, so they are measured differently.
Before running a task, users create a task definition that describes the containers the task will run. ECS uses the task definition to reserve the appropriate amount of CPU, GPU, and memory. Even if a task has been defined, its reserved values are not part of this calculation if the task is not running.
The cluster CPU reservation is the percent ratio of total CPU units reserved to total CPU units registered by container instances in the cluster. In other words, it is the CPU reserved by running tasks divided by the total CPU registered to the cluster. Users cannot reserve more CPU for tasks than the cluster has available. Cluster CPU utilization is instead the percent ratio of total CPU units used by tasks in the cluster to the total CPU units registered by container instances. By comparing the CPU reservation to the CPU utilization, developers can understand how much capacity remains on their cluster and when they need to increase the size of their cluster or container definitions.
Similarly, the cluster memory reservation is the percent ratio of the total MiB of memory reserved by tasks to the total MiB of memory registered by container instances in the cluster. It is the memory reserved by running tasks divided by the total memory allocated to the cluster. Developers can compare a cluster’s memory reservation to its utilization to determine if the size of the cluster should be increased. Further, graphical analysis of the memory utilization may show memory leaks in code.
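The reservation and utilization ratios described above are straightforward to compute by hand. The following sketch applies them to CPU units, and the same arithmetic works for MiB of memory. The function names and the sample figures are hypothetical. The only outside fact assumed is that ECS counts 1,024 CPU units per vCPU.

```python
def percent_ratio(part: float, total: float) -> float:
    """Return ``part`` as a percentage of ``total``."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * part / total


def cluster_headroom(reserved: float, used: float, registered: float) -> dict:
    """Summarize reservation, utilization, and unreserved capacity for a cluster.

    All three arguments use the same unit: CPU units, or MiB of memory.
    """
    return {
        "reservation_pct": percent_ratio(reserved, registered),
        "utilization_pct": percent_ratio(used, registered),
        "unreserved": registered - reserved,
    }


# Hypothetical cluster: two container instances with 2 vCPUs each (4096 CPU units),
# running tasks that reserve 3072 units and currently use 1024 units.
print(cluster_headroom(reserved=3072, used=1024, registered=4096))
# {'reservation_pct': 75.0, 'utilization_pct': 25.0, 'unreserved': 1024}
```

A cluster like this one has a large gap between reservation and utilization, which suggests the task definitions reserve more CPU than the tasks actually need.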
Service utilization uses calculations similar to the cluster memory and CPU utilization calculations discussed above. The service CPU utilization is the percent ratio of the total CPU units used by tasks in the service to the total CPU units specified in the task definition for the number of tasks in the service. The service utilization calculations depend on the number of tasks in the service and not on the cluster itself. The memory calculation is the same, with the MiB of memory used by tasks in place of CPU units.
Service utilization is allowed to go over 100% when CPU or memory capacity is defined at the container level. When these units are defined at the task level, going over-limit is not allowed, and the task will fail.
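As a worked example of the service-level calculation, the sketch below divides the CPU units used across a service by the units specified in the task definition multiplied by the number of tasks. The function name and the figures are made up for illustration. The second call shows how the result can exceed 100% when tasks use more than the task definition specifies.

```python
def service_utilization_pct(used_total: float, per_task_specified: float, task_count: int) -> float:
    """Percent ratio of resources used by a service to resources its tasks specify.

    ``used_total`` is the sum across all tasks in the service, and
    ``per_task_specified`` comes from the task definition (CPU units or MiB).
    """
    if per_task_specified <= 0 or task_count <= 0:
        raise ValueError("per_task_specified and task_count must be positive")
    return 100.0 * used_total / (per_task_specified * task_count)


# Hypothetical service: 4 tasks, each specifying 512 CPU units.
print(service_utilization_pct(used_total=1536, per_task_specified=512, task_count=4))  # 75.0
print(service_utilization_pct(used_total=2560, per_task_specified=512, task_count=4))  # 125.0
```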
With AWS Lambda, several metrics are automatically created for each function. For example, when an error is logged, it is added to a metric automatically. This metric, in turn, can be used to set an alarm and notify the development team of issues in the running function. Such metrics are not automatically created for tasks with ECS monitoring.
Error log metrics are beneficial for tracking the health of compute functions. Other metrics specific to your platform’s use case may also be necessary. To meet any metric needs inside your function, you can create custom metrics and alarms in AWS CloudWatch or send logs to third-party systems. The benefit of third-party systems is that you can send them your logs and rely on their existing analytics to detect issues, without predefining everything you may need to track. Coralogix’s log analytics platform detects anomalies in your logs and alerts you based on its findings.
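A custom error metric could be built along the following lines. The sketch counts log lines containing the string "ERROR" and assembles keyword arguments in the shape accepted by boto3's CloudWatch `put_metric_data` call. The namespace `Custom/ECS`, the metric name `ErrorCount`, the function name, and the sample log lines are all assumptions chosen for illustration.

```python
from typing import Dict, Iterable


def build_error_metric(log_lines: Iterable[str], service_name: str,
                       namespace: str = "Custom/ECS") -> Dict:
    """Count error lines and build keyword arguments for ``put_metric_data``.

    Assumption: a line counts as an error if it contains the text "ERROR".
    """
    error_count = sum(1 for line in log_lines if "ERROR" in line)
    return {
        "Namespace": namespace,
        "MetricData": [
            {
                "MetricName": "ErrorCount",
                "Dimensions": [{"Name": "ServiceName", "Value": service_name}],
                "Value": float(error_count),
                "Unit": "Count",
            }
        ],
    }


# Hypothetical log lines from one task.
sample_logs = [
    "INFO request handled in 12ms",
    "ERROR upstream timeout",
    "ERROR could not parse payload",
]
payload = build_error_metric(sample_logs, service_name="worker")
print(payload["MetricData"][0]["Value"])  # 2.0
```

The resulting dictionary could then be passed to CloudWatch with `boto3.client("cloudwatch").put_metric_data(**payload)`.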
The metrics listed above are critical for monitoring and maintaining the health of your ECS infrastructure and for ensuring that users are not the ones who discover outages. Developers need to be alerted when problems arise so they can contain issues quickly.
Alerting can be done using different methods, either in AWS or with third-party services. AWS CloudWatch provides alarms that can trigger alerts on known issues. Metric data is sent to CloudWatch, and if it meets the criteria set in the alarm, a notification is sent to the predefined location. Third-party systems like Coralogix’s AWS Observability platform use machine learning to detect issues with little customization. Coralogix provides a method of ECS monitoring by configuring an endpoint in the ECS task definition.
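To make the CloudWatch alarm flow concrete, here is a minimal sketch that assembles the keyword arguments for boto3's `put_metric_alarm` call, targeting the service-level `CPUUtilization` metric in the `AWS/ECS` namespace. The alarm name format, the 80% threshold, the five one-minute evaluation periods, and the SNS topic ARN are assumptions for illustration, so tune them to your own workload.

```python
from typing import Dict


def build_cpu_alarm(cluster: str, service: str, sns_topic_arn: str,
                    threshold_pct: float = 80.0) -> Dict:
    """Build keyword arguments for CloudWatch ``put_metric_alarm``.

    The alarm fires when average service CPU utilization stays above
    ``threshold_pct`` for five consecutive one-minute periods.
    """
    return {
        "AlarmName": f"{cluster}-{service}-cpu-high",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 5,
        "Threshold": threshold_pct,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notification target, e.g. an SNS topic
    }


# Hypothetical names and ARN.
alarm = build_cpu_alarm(
    cluster="prod",
    service="web",
    sns_topic_arn="arn:aws:sns:us-east-1:123456789012:ecs-alerts",
)
print(alarm["AlarmName"])  # prod-web-cpu-high
```

The alarm would be created with `boto3.client("cloudwatch").put_metric_alarm(**alarm)`. The same pattern applies to `MemoryUtilization` by changing the metric name.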
AWS provides standard metrics for monitoring ECS deployments. These metrics differ when using EC2 versus Fargate launch types. Generally, teams will need to watch for CPU usage, memory usage, and the number of running tasks in a cluster. Application-level metrics such as tracking error logs in a task need to be set up manually in CloudWatch when using AWS observability tools.
ECS monitoring data can also be sent to third-party tools for analysis and to gain observability into your platform. Coralogix’s AWS Observability platform can track ECS metrics and alert users when issues arise.