
Case study

How Nutanix used alerting to drive scalability

Nutanix.com

About Nutanix

Nutanix, Inc. is a global leader in cloud software, offering organizations a single platform for running apps and data across clouds. With Nutanix® solutions, customers can reduce complexity and simplify operations, freeing them to focus on their business outcomes.

Building on its legacy as the pioneer of hyperconverged infrastructure, Nutanix is trusted by companies worldwide to power hybrid multi-cloud environments consistently, simply, and cost-effectively.

Challenges in Managing Configurations 

Nutanix must manage configurations for multiple internal service instances. Because efficiency is key to successful service delivery, the need for automation is greater than ever.

As the company expands and its internal service instances multiply, any service instances that rely on manual configuration management are scrutinized for improvement through automation.

For context, each internal service instance contains settings called gflags: advanced configuration parameters used to adjust Nutanix system configurations. New service instances must have the same gflag configuration as the existing instances on the same service application cluster.

Done manually, setting up a new service instance required synchronizing multiple systems, and a site reliability engineer (SRE) had to be involved wherever configurations did not use default values. Furthermore, if default gflag values changed, the SRE had to create a custom pipeline to update the instances of that internal service type, and its execution had to be monitored manually.

Updating existing gflag default settings manually when adding new service instances is not scalable as the number of instances increases. Nutanix was able to find a unique automation solution for this problem using the Coralogix full-stack observability platform.

Traditionally, monitoring service configurations requires analyzing logs and keywords to trigger alerts. However, Nutanix faced a different challenge: beyond validating existing configurations, they also needed to identify any gflag configurations that should have been present in the incoming logs but were missing. This prompted a novel approach involving automation and an unconventional alert-triggering method.

Driving Motivation for Coralogix Alerting Solutions

Coralogix provides alerts for observability that are highly configurable. These alerts are powered by Streama, our in-stream data analysis, which means the alerts can occur in real-time without delays for index latency or mapping dependencies.

The machine-learning methods used to build these alarms have proven to reduce false positives, which prevents alert fatigue and avoids triggering automated functions unnecessarily. As the business scaled and the number of internal service instances kept growing, the need for an efficient, automated solution became apparent.

Log analytics were already available through Coralogix, but the issue of the missing gflag configurations was a new use case. Nutanix devised an innovative approach to address this challenge by converting configurations into logs and harnessing Coralogix’s advanced alerting capabilities.

“Coralogix innovative ratio-based alerts and custom webhooks allowed us to automate gflag settings effortlessly. This streamlined our operations, ensuring substantial resource savings. Coralogix’s partnership proved pivotal for observability and is quite handy tool for any developer.”

-Shrehal Bohra, Software Developer, Nutanix, Inc.

 

Implementation and Solution

Coralogix’s alerting solution comes with several alert configurations that can be used out of the box. These include new value alerts, time-relative alerts, ratio-based alerts and unique value alerts. To meet Nutanix’s requirements for scalability and excellent performance, ratio-based alerts were applied. Once an alert is triggered, a custom webhook lets the alert take automatic action on Nutanix’s environment to fix gflag configurations.

Ratio-based alerts

Ratio-based alerts calculate the ratio between the results of two log queries and trigger when that ratio meets the configured threshold. A common use case is an error rate, obtained by comparing the number of errors to the number of requests.

Each of the two queries can be any useful log query; the ratio is the result of Query 1 divided by the result of Query 2. The rate of querying and the alert threshold are both configurable.

Developers can configure what ratio, over what duration, should trigger an alert. The configuration also includes an option that controls whether alerts trigger on infinity, which occurs when Query 2 returns zero results.

The ability to trigger on infinite ratios seems simple, but it gives Coralogix users alert options that would otherwise be unavailable. By triggering on infinite ratios, users can build alerts on logs that are simply absent.
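The ratio logic described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Coralogix’s implementation: the function names, the "greater than threshold" condition, and the handling of the case where both queries return nothing are all assumptions made for the example.

```python
import math


def ratio(query1_count: int, query2_count: int) -> float:
    """Return Query 1 results divided by Query 2 results.

    Assumption for this sketch: when Query 2 matches nothing the ratio is
    treated as infinity, and when both queries match nothing it is 0.0.
    """
    if query1_count == 0:
        return 0.0
    if query2_count == 0:
        return math.inf
    return query1_count / query2_count


def should_alert(
    query1_count: int,
    query2_count: int,
    threshold: float,
    trigger_on_infinity: bool = True,
) -> bool:
    """Decide whether a ratio alert fires for one evaluation window."""
    value = ratio(query1_count, query2_count)
    if math.isinf(value):
        # The "trigger on infinity" option decides this case on its own.
        return trigger_on_infinity
    # Hypothetical condition: fire when the ratio exceeds the threshold.
    return value > threshold
```

With these definitions, `should_alert(200, 1000, 0.1)` fires because the ratio 0.2 exceeds the threshold, while `should_alert(10, 0, 0.1, trigger_on_infinity=False)` stays silent even though Query 2 returned nothing.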

Custom alert webhooks

Coralogix’s alert platform supports both predefined and custom integrations, so users like Nutanix can run custom functions when an alert is triggered.

Coralogix allows its users to create an unlimited number of webhooks. These can each be assigned to alerts so different events can trigger different actions against your platform. 

Nutanix’s solution using Coralogix’s ratio-based alerts

Nutanix embraced Coralogix’s ratio-based alerts. In this solution, Query 1 encompasses the entire log space relevant for monitoring, while Query 2 serves as an additional filter specific to gflag configurations.

When an internal service instance lacked a particular gflag, Query 2 returned 0, and the ratio became infinite. Because Coralogix alerting can trigger on an infinite ratio, Nutanix could create alerts that detect missing configurations.
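To make the missing-gflag detection concrete, here is a minimal Python sketch of the same idea applied to configuration logs. The log shape (dictionaries with `instance` and `gflag` fields) and the function name are hypothetical; the case study does not describe Nutanix’s actual log schema or queries.

```python
from collections import Counter


def instances_missing_gflag(logs: list[dict], gflag: str) -> list[str]:
    """Return instances whose logs never mention the given gflag.

    Query 1 is modeled as "all configuration logs per instance" and
    Query 2 as "logs that carry the gflag". An instance with Query 1 > 0
    and Query 2 == 0 corresponds to an infinite ratio.
    """
    total = Counter()      # Query 1: every log, grouped by instance
    matching = Counter()   # Query 2: only logs for the requested gflag
    for entry in logs:
        total[entry["instance"]] += 1
        if entry.get("gflag") == gflag:
            matching[entry["instance"]] += 1
    return sorted(name for name in total if matching[name] == 0)


# Hypothetical sample data: svc-b never reports the "cache_size" gflag.
sample_logs = [
    {"instance": "svc-a", "gflag": "max_conn"},
    {"instance": "svc-a", "gflag": "cache_size"},
    {"instance": "svc-b", "gflag": "max_conn"},
]
missing = instances_missing_gflag(sample_logs, "cache_size")
```

Grouping by instance matters here: without it, logs from healthy instances would keep Query 2 above zero and hide the one instance that lacks the gflag.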

The next step was to pass the triggered alert’s payload to Nutanix’s internal automated pipeline. Coralogix’s custom webhook integration played a pivotal role here, transferring the alert payload information directly to that pipeline.

Another key advantage of Coralogix’s alerts was the ability to group the payload based on exact configuration metadata fields. This streamlined the analysis process, providing Nutanix with comprehensive insights to address missing configurations.
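On the receiving side, a pipeline might bucket the alert payload by those metadata fields before acting on it. The sketch below assumes a made-up JSON payload with an `entries` list and `cluster`, `gflag`, and `instance` fields; it is not the actual Coralogix webhook format, which is defined by whatever custom webhook body the user configures.

```python
import json
from collections import defaultdict


def group_alert_entries(
    payload_json: str,
    group_keys: tuple[str, ...] = ("cluster", "gflag"),
) -> dict[tuple, list[str]]:
    """Group affected instances by the chosen metadata fields.

    The payload schema used here is hypothetical and exists only
    to illustrate grouping by configuration metadata.
    """
    payload = json.loads(payload_json)
    groups: dict[tuple, list[str]] = defaultdict(list)
    for entry in payload["entries"]:
        key = tuple(entry[field] for field in group_keys)
        groups[key].append(entry["instance"])
    return dict(groups)


# Hypothetical payload describing three instances with missing gflags.
example_payload = json.dumps({
    "alert_name": "missing-gflag",
    "entries": [
        {"cluster": "c1", "gflag": "cache_size", "instance": "svc-a"},
        {"cluster": "c1", "gflag": "cache_size", "instance": "svc-b"},
        {"cluster": "c1", "gflag": "max_conn", "instance": "svc-c"},
    ],
})
grouped = group_alert_entries(example_payload)
```

Each resulting group identifies one missing gflag on one cluster together with the instances it affects, which is the unit an automated fix would most naturally operate on.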

Nutanix Updates Coralogix Configurations with GitOps 

When default gflag configurations need to be updated, a custom pipeline must be triggered to update and monitor existing internal service instances. Nutanix uses GitOps with the Coralogix Terraform provider to update the Coralogix configurations that drive their custom ratio alert setup. The existing ratio alert then determines which instances require the new default gflag configurations and triggers the custom webhook to make the required changes.

Improvements and Added Value

This case study showcases Nutanix’s internal drive for excellence, and its continuing journey toward increased efficiency and reliability. By harnessing the combined power of Coralogix’s ratio-based alerts, payload grouping, and custom webhooks, Nutanix devised an innovative solution to automate their gflag configuration management.

This resulted in proactively identifying missing configurations and ensuring data accuracy across its multitude of internal service instances. It also allowed Nutanix to lower the engineering resources required to configure and create new service instances.
