Case Study

How cult.fit Optimized Monitoring Costs at Scale with Better Visibility

200+ Engineering Users

1TB+ Avg. Daily Data Volume

90% Log Data from K8s

$50K+ Saved Per Year

About the company

cult.fit is a health and wellness company offering a one-stop shop for customers to choose from digital and offline experiences across fitness, nutrition, and mental well-being. The company aims to make fitness and health easy and accessible for millions of users.

cult.fit was founded in 2016 and saw rapid growth as it used technology to disrupt the health sector. Their digital business now serves millions of users and generates more than a terabyte of log data every single day.

To manage and monitor that data, cult.fit’s engineers use Coralogix. The Platform team is able to optimize their Total Cost of Ownership (TCO) and gain better visibility and insights into the logging patterns of their applications. Developers also have better visibility and can investigate issues more quickly thanks to Live Tail, log templating, and faster queries than their previous solution offered.

Monitoring Challenges at Scale

cult.fit started out using Amazon CloudWatch for all of their logging, and in the beginning, it did a good job of covering their use cases. Developers were very comfortable using it and were able to easily query their logs for investigations. As the team and company grew and log volumes increased, however, the overhead, performance, and cost issues became unavoidable.

About 90 percent of cult.fit’s log data is collected from Kubernetes and was previously sent straight into Amazon CloudWatch. Not only did this require manual configuration of log groups and policies for every new application being onboarded, but query performance in Amazon CloudWatch also became very slow whenever multiple pods produced high volumes of log data.

With log volumes constantly growing and Amazon CloudWatch charging a flat rate for ingestion regardless of the logs’ value, the cost was growing faster than the return on investment. So the team began to look for alternative logging frameworks that would be more flexible than Amazon CloudWatch, but still offer the same level of insights and enterprise-level features.

Driving Motivation to Adopt Coralogix

When the team at cult.fit was introduced to Coralogix by our partners at Onnivation, there were three key capabilities that answered the administrative challenges of the previous solution without reducing coverage or comfort for developers. Ultimately, the team saw the opportunity to extract more value than they were getting from Amazon CloudWatch, at a better price point.

The first motivation was the Live Tail feature, which gives real-time visibility into the logs without needing to grant permissions to production environments. Both the developers and the system admins benefit from the flexibility and power of immediate access to the logs. Amazon CloudWatch gave the team the power to search their logs, but investigations were all done by accessing the production logs directly. Now, the Platform team is able to restrict access to production without developers losing crucial visibility when troubleshooting.

The second was actually a set of capabilities, namely anomaly detection and log templating, that would help to reduce the time it took to identify and resolve issues. Coralogix provides automated anomaly detection for error volume spikes and flow anomalies, which helps to proactively identify issues that would otherwise go undetected.

cult.fit’s developers are then able to investigate much more easily using the log templates created by Coralogix’s Loggregation feature. One of the problems they faced in the past was analyzing the sheer volume of log data they were collecting. Investigations involved manually sifting through hundreds of log lines, trying to figure out which log, or which bug, was important. Loggregation gives them the ability to quickly see which log template accounts for the bulk of the volume and dive into the distribution of its variables.
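To make the idea of log templating concrete, here is a minimal sketch in Python. It is not Coralogix’s Loggregation algorithm, which the case study does not describe. It assumes a deliberately simple rule (every run of digits is a variable), and the sample log lines and the names `to_template` and `summarize` are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption for illustration only: any run of digits is a variable.
NUMBER = re.compile(r"\d+")


def to_template(line: str) -> str:
    """Replace every run of digits with a placeholder."""
    return NUMBER.sub("<*>", line)


def summarize(lines):
    """Return (count per template, variable values seen per template)."""
    counts = Counter()
    variables = defaultdict(Counter)
    for line in lines:
        template = to_template(line)
        counts[template] += 1
        for value in NUMBER.findall(line):
            variables[template][value] += 1
    return counts, variables


# Hypothetical log lines, not taken from cult.fit's systems.
sample_logs = [
    "order 101 placed by user 7",
    "order 102 placed by user 7",
    "order 103 placed by user 9",
    "cache refresh took 35 ms",
]

counts, variables = summarize(sample_logs)
top_template, top_count = counts.most_common(1)[0]
print(top_template, top_count)
print(variables[top_template].most_common(1))
```

Grouping by template shows at a glance which message dominates the volume, and the per-template value counts give a rough view of the variable distribution.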

The last, and perhaps most important, capability was the ability to optimize costs and gain even greater flexibility and control over their data. With Coralogix, cult.fit’s Platform team can split their logs into different pipelines. This way they can ensure that the logs they need are easily searchable, while the bulk of their log data, which is low value, can be converted to metrics and archived.

Migration from Amazon CloudWatch to Coralogix

One of the advantages of using Coralogix is that provisioning is automatic. Whenever the team has a new system, they don’t need to do any manual onboarding. As soon as they push it live, policies in Coralogix automatically set the correct pipeline for the data. Adding new log sources is very straightforward.

Because sending data to Coralogix was so easy, the team started seeing value on the monitoring side almost immediately. About one month after beginning the implementation process, cult.fit was able to stop sending data to Amazon CloudWatch. Within three months, cult.fit saw value beyond what they were able to achieve with Amazon CloudWatch, and at a lower cost.

Coralogix is now being used for all logging use cases and has been adopted across the organization by approximately 200 engineers.

Improving Infrastructure Visibility

cult.fit’s platform team uses Coralogix’s Logs2Metrics feature to generate trackable metrics from raw log data that they send to archive. They can then easily monitor and alert on application behavior such as volume spikes and more. For example, the team created an alert which triggers if the number of log lines in 10 minutes exceeds a particular value.
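The logic behind such a volume alert can be sketched in a few lines of Python. This is only an illustration of counting log lines per application in fixed 10-minute windows and flagging windows above a threshold. It does not use the Coralogix Logs2Metrics or alerting APIs, and the function name, application names, event data, and threshold are all hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 10 * 60  # the 10-minute window mentioned in the text


def windows_over_threshold(events, threshold):
    """Find 10-minute windows whose log line count exceeds the threshold.

    events: iterable of (unix_timestamp, application), one pair per log line.
    Returns a sorted list of (application, window_start, line_count).
    """
    counts = Counter()
    for timestamp, application in events:
        # Align each timestamp to the start of its fixed (tumbling) window.
        window_start = int(timestamp) - int(timestamp) % WINDOW_SECONDS
        counts[(application, window_start)] += 1
    return sorted(
        (app, start, n) for (app, start), n in counts.items() if n > threshold
    )


# Hypothetical events: "checkout" is quiet in the first window, noisy in the second.
events = [(t, "checkout") for t in range(0, 600, 60)]   # 10 lines in window 0
events += [(600 + t, "checkout") for t in range(50)]    # 50 lines in window 600
events += [(600 + t, "search") for t in range(5)]       # 5 lines in window 600

alerts = windows_over_threshold(events, threshold=20)
print(alerts)
```

Only the per-window counts are needed to raise the alert, which is why a metric derived from the logs is enough and the raw lines can go to the archive.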

“As an administrator, the visibility into how much each application is logging and what are the logging patterns is a game changer – we had no visibility into that before.”
Vikramaditya M – Architect, cult.fit

This particular alert has helped the Platform team identify applications that experience a log spike due to being stuck in a crash loop. In this case, the only clear symptom of the issue was the sudden increase in logs. Other times, the alert has helped to catch instances where a developer accidentally wrote a debug log to a workflow that is published millions of times.

In one example, an application that typically writes about 1,000 log lines every 10 minutes suddenly started logging almost 1,000,000 lines over the same period. The alert was triggered and immediately sent to a member of the team, who opened the Live Tail to see what had happened.

The Live Tail was full of Java stack traces, and the engineer immediately understood that there was an issue with loading the cache from one of the databases. Ultimately, this was caused by a release of new functionality that contained a typo in the database. It took about 45 minutes from detecting the issue to pushing the fix.

Without Coralogix, these types of issues could run for several days before the team noticed that events were not being processed or received a cost alert due to the massive increase in log volume.

Optimizing Costs with Advanced TCO Usage

The Platform team at cult.fit is responsible for managing the monitoring infrastructure for many applications built by different teams of developers with varying logging practices. To optimize costs with Coralogix’s TCO Optimizer feature, the team created policies for each application.

Core applications, such as the log management system, the payment system, and others for which quick log searches are critical, are typically kept in the Frequent Search pipeline. This way, whenever there is an issue where payments are affected or people are unable to place orders, the developers can immediately investigate and resolve it. For this use case, Coralogix is much stronger than the previous solution, with real-time visibility into the logs as well as much better search performance.

Log data from the majority of the applications is sent to the Monitoring pipeline, where the team can use Logs2Metrics to generate metrics that they can monitor and alert on without needing to index the raw logs. The Platform team uses this to monitor log volumes from each application, and each team is also able to easily set up metrics and alerts for themselves.

Finally, there are many applications that are now legacy. They are no longer actively deployed and are relatively stable, but every once in a while the team does need to look into their logs. These logs are sent to the Compliance pipeline for archive storage and then queried using the archive query function when needed.

Whenever there is an ongoing outage or any investigation, teams can easily select the application and subsystem and change the TCO policy from Monitoring or Compliance to Frequent Search until they finish the investigation. Essentially, they create an override using the TCO Optimizer API and then delete it afterwards.
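The policy-plus-override pattern described in this section can be modeled with a small Python sketch. It is a local illustration only and does not call or reproduce the Coralogix TCO Optimizer API. The three pipeline names come from the text, while the `PolicyTable` class, its methods, the application and subsystem names, and the choice of Monitoring as the fallback pipeline are assumptions made for the example.

```python
from enum import Enum


class Pipeline(Enum):
    FREQUENT_SEARCH = "Frequent Search"
    MONITORING = "Monitoring"
    COMPLIANCE = "Compliance"


class PolicyTable:
    """Hypothetical model: per-application defaults plus temporary overrides."""

    def __init__(self, defaults, fallback=Pipeline.MONITORING):
        self._defaults = dict(defaults)
        self._overrides = {}
        self._fallback = fallback  # assumption: unlisted apps go to Monitoring

    def resolve(self, application, subsystem):
        """An active override wins; otherwise use the default or the fallback."""
        key = (application, subsystem)
        if key in self._overrides:
            return self._overrides[key]
        return self._defaults.get(key, self._fallback)

    def set_override(self, application, subsystem,
                     pipeline=Pipeline.FREQUENT_SEARCH):
        """Temporarily route an application's logs to another pipeline."""
        self._overrides[(application, subsystem)] = pipeline

    def clear_override(self, application, subsystem):
        """Delete the override once the investigation is finished."""
        self._overrides.pop((application, subsystem), None)


# Hypothetical applications and subsystems.
policies = PolicyTable({
    ("payments", "api"): Pipeline.FREQUENT_SEARCH,
    ("legacy-reports", "batch"): Pipeline.COMPLIANCE,
})

before = policies.resolve("legacy-reports", "batch")
policies.set_override("legacy-reports", "batch")
during = policies.resolve("legacy-reports", "batch")
policies.clear_override("legacy-reports", "batch")
after = policies.resolve("legacy-reports", "batch")
print(before.value, during.value, after.value)
```

Keeping overrides separate from the defaults means that deleting an override restores the original policy without anyone having to remember what it was.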

Summary

In five years, cult.fit’s business grew to serve millions of customers. Customer experience and the overall stability and reliability of their applications are crucial to the success of their business. With more than a terabyte of log data being generated each day, the cost and performance issues of other solutions impacted the company’s bottom line.

With Coralogix, the team at cult.fit has reduced the cost of their log monitoring while gaining better visibility into their logs, specifically from an infrastructure point of view, and improving developer productivity.
