How continuous profiling cut our cloud spend
At Coralogix, we’re always looking to evolve the measurements we take to better understand the efficiency of our infrastructure. We constantly assess and investigate sources of cost in our cloud infrastructure to ensure we’re getting the best return on investment. This activity, often referred to as FinOps, is becoming a cornerstone of engineering teams.
Our FinOps teams are pushing new boundaries, using Continuous Profiling, a tool typically reserved for debugging and issue diagnosis, to build an extremely detailed view of where our costs truly come from.
Why is FinOps so important?
The cloud has been around for a long time, and yet our spend keeps rising rapidly. In 2024 alone, cloud spend increased by about 20%. Gartner reports that of 200 IT leaders surveyed, 69% exceeded their allocated cloud budget. Not only is the cloud expensive, it is unpredictable and difficult to track: 78% of companies reported that between 21% and 50% of their cloud spending was wasted, and G2 highlights that 28% of annual cloud budgets are typically wasted.
This paints a grim picture: complex, unpredictable costs that regularly exceed our budgets and that, on retrospective analysis, contain almost a third of pure waste.
So what can FinOps teams do, and how is Coralogix tackling this problem?
In the same Gartner survey, the organizations that managed to stay within budget cited effective resource optimization as a chief reason their cloud spend didn’t get out of control. Yet FinOps teams often sit at the periphery of delivery, which makes it challenging for them to effect impactful change.
At Coralogix, we realised that if we were going to truly own our spend, we needed to expand our toolset. We needed a new level of detail, and that’s when we turned to Continuous Profiling.
What is Continuous Profiling?
Continuous profiling, in simple terms, is a microscope for your applications. Rather than telling you that a particular application is using a lot of resources (like CPU, memory, network bandwidth, or even battery on a mobile device), it goes several levels deeper and indicates which particular line of code is resource-hungry. It shows you where the bottlenecks are.
This toolset is typically reserved for engineers who are attempting to debug an issue or optimize their application, but when we applied it to the problem of cloud spend, the results were dramatic.
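As a minimal illustration of what a profiler surfaces, here is a sketch using Python’s standard-library cProfile. The function names and workload below are invented for the example; a real continuous profiler samples live services with low overhead rather than instrumenting a script like this, but the output is the same idea: time attributed to specific functions and lines.

```python
import cProfile
import io
import pstats

def parse_payload(payload: str) -> str:
    # Deliberately wasteful: character-by-character string concatenation
    out = ""
    for ch in payload:
        out += ch
    return out

def handle_requests(n: int) -> None:
    # Simulates a request loop that calls the wasteful parser
    for _ in range(n):
        parse_payload("x" * 2000)

profiler = cProfile.Profile()
profiler.enable()
handle_requests(200)
profiler.disable()

# Sort by cumulative time so the hot function floats to the top of the report
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats()
report = stream.getvalue()
print(report)
```

The report immediately points the finger at `parse_payload`, which is exactly the kind of answer that per-host CPU metrics alone can never give you.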
An impending event, and a nerve-wracking load test
Coralogix is built to handle enormous scale from customers producing hundreds of terabytes of data every day, but occasionally, our larger customers have some big demands. In this case, one of our largest streaming customers had a major sporting event coming up and was expecting in excess of 200 million concurrent users.
This translated into around 164TB of data generated every 5 minutes, or roughly 550GB every single second. Coralogix is built on the Streama© architecture, which processes telemetry in-stream with no storage dependency, making us uniquely set up to handle this kind of volume. We knew we were ready for the load test, but we wanted to push things to the next level, so we turned to continuous profiling.
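The per-second figure follows directly from the per-window volume; using decimal units (1TB = 1000GB), the arithmetic works out like this:

```python
tb_per_window = 164             # TB generated per window
window_seconds = 5 * 60         # 5-minute window = 300 seconds
gb_per_second = tb_per_window * 1_000 / window_seconds
print(f"{gb_per_second:.0f} GB/s")  # ~547 GB/s, i.e. roughly 550GB every second
```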
Extra servers vs leaner software
Our load test revealed spikes in CPU utilization that kept recurring as volume increased, and we realised there was a serious bottleneck hidden somewhere in our applications. We had narrowed it down to a specific application, but beyond that, traditional telemetry wasn’t much help.
At this stage, we had a choice. With the event coming up, we could scale up to increase the total number of cores available and, in doing so, deliver a great quality of service for our customer. We knew we had this option, but the additional cost ran into the thousands of dollars, and we wanted to see if there was a way to avoid it. So we turned to continuous profiling.
JSON Parsing was killing us
After we investigated the problem using our profiler, the issue became clear. Our service was using io.circe as its JSON parser. io.circe offers a clean interface, but it has some known performance issues. We extracted the data into our dashboard and saw immediately what was going on. Below is an extract of that data (we can’t show the original for security reasons):

This particular library was driving some serious spikes in utilization. After surfacing this, we realised that the correct solution was not more hardware: it was to let the continuous profiling data guide us toward leaner software.
A small patch, and thousands of dollars saved
We swapped it out for another JSON parsing library (we don’t disclose internal libraries, for security reasons) and immediately saw utilization drop sharply: around a 50% reduction in CPU usage and cloud spend for this service, which, in a telemetry pipeline, is a serious reduction.
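We can’t name the replacement library, but the validation pattern behind a swap like this is straightforward: micro-benchmark each candidate parser on representative payloads before committing. Here is a hedged sketch of that harness in Python (the payload shape and iteration count are invented; our real comparison was between Scala JSON libraries):

```python
import json
import timeit

# A representative payload; in practice, replay real (sanitized) traffic
sample = json.dumps({
    "service": "ingest",
    "events": [{"ts": i, "value": i * 2.5, "tags": ["a", "b"]} for i in range(200)],
})

def parse_stdlib() -> dict:
    # One candidate parser; wrap each library under test the same way
    return json.loads(sample)

# Time 10,000 parses; run the identical harness against every candidate
elapsed = timeit.timeit(parse_stdlib, number=10_000)
print(f"stdlib json: {elapsed:.3f}s for 10k parses")
```

Running the same harness against each candidate, on payloads that mirror production traffic, gives a defensible before/after number rather than a guess.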
Continuous profiling is for everyone, from FinOps to DevOps
While profiling tools are typically used by engineers to find and fix problems in their applications, this event taught us that they have applications everywhere. Coralogix is determined to help businesses make better decisions, and to do that, they need the best tools available.
That is why Coralogix has released Continuous Profiling: to help engineers bridge the gap between the metrics their servers are reporting and the reality of their application’s consumption. Profiling saved us thousands of dollars in a single investigation, and we’re excited for you to find out what it can do for you.