Here at Coralogix, one of our main goals is what we call “Make Big Data small”. The idea is to let our users view a narrow set of log patterns instead of just indexing terabytes of data and providing textual search capabilities.
As cloud computing and open source have lowered the bar for companies to create distributed systems at high scale, the amount of data they have to deal with has become overwhelming.
Most companies collect all their log data and index it, but they have no idea what to search for, where to search and, most of all, when to search.
To test our assumption, we did a little research on the data we have from 50 customers who generate over 2TB of log data on a daily basis. The disturbing lessons we’ve learned are below.
One definition I have to make before we start is “Log Template”.
What we call a log template is basically similar to the printf call you have in your code: it contains the words of the log entry, split into constants and variables.
For instance, if I have these 3 log entries:
- User Ariel logged in at 13:24:45 from IP 22.214.171.124
- User James logged in at 12:44:12 from IP 126.96.36.199
- User John logged in at 09:55:27 from IP 188.8.131.52
Then the template for these entries would be:
- User * logged in at * from IP *
And I can say that this template arrived 3 times.
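To make the definition concrete, here is a minimal Python sketch of how a template could be derived from entries like the ones above. It is not how Coralogix clusters logs: it assumes the entries already belong to the same group and have the same number of whitespace-separated tokens, and the function name `extract_template` is made up for this illustration.

```python
from collections import Counter


def extract_template(lines):
    """Keep tokens that are identical in every line; replace the rest with '*'.

    Illustrative assumption: all lines split into the same number of tokens.
    """
    token_rows = [line.split() for line in lines]
    if len({len(row) for row in token_rows}) != 1:
        raise ValueError("this sketch only handles lines with the same token count")

    template = []
    for column in zip(*token_rows):
        # A position is a constant if every line has the same token there.
        template.append(column[0] if len(set(column)) == 1 else "*")
    return " ".join(template)


entries = [
    "User Ariel logged in at 13:24:45 from IP 22.214.171.124",
    "User James logged in at 12:44:12 from IP 126.96.36.199",
    "User John logged in at 09:55:27 from IP 188.8.131.52",
]

template = extract_template(entries)
# Count how many times the template "arrived".
arrivals = Counter({template: len(entries)})
print(template, arrivals[template])
```

The user name, the time and the address differ between entries, so they become variables, while the words that never change stay in the template as constants.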
Now that we are on the same page, here’s what we’ve learned from clustering terabytes of log data every day for over 50 customers:
1) 70% of the log data is generated by just 5 different log templates. This demonstrates how our software usually has one main flow that is used frequently, while other features and capabilities are rarely in use. We kept going and found that 95% (!) of your log data is generated by 10 templates, meaning you are basically analyzing the same log records over and over, not even knowing 99% of your log data.
2) To support fact #1, we found that over 90% of the queries run by users are on the top 5 templates. These statistics show that we are so blinded by the dominance of these templates that we simply ignore other events.
3) 97% of your exceptions are generated by less than 3% of the exceptions you have in your code. You know those “known errors” that always arrive? They create so much noise that we fail to see the real errors in our systems.
4) 0.7% of your templates are at the Error or Critical level, and they generate 0.025% of your traffic. This demonstrates just how easy it is to miss these errors, not to mention that most of them are generated by the same exceptions.
5) Templates that arrive fewer than 10 times a day are almost never queried (1 query every 20 days on average, by all 50 customers together!). This is an amazing detail that shows how companies keep missing those rare events and only encounter them once they become a widespread problem.
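If you want to check how concentrated your own logs are, the measurements behind these numbers can be sketched in a few lines of Python. This is only an illustration: the templates and counts below are invented, not taken from our research, and the helper names `top_k_share` and `rare_templates` are hypothetical.

```python
from collections import Counter


def top_k_share(template_counts, k):
    """Fraction of all log entries produced by the k most frequent templates."""
    total = sum(template_counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in template_counts.most_common(k))
    return top / total


def rare_templates(template_counts, threshold=10):
    """Templates that arrived fewer than `threshold` times in the period."""
    return sorted(t for t, count in template_counts.items() if count < threshold)


# Invented daily counts per template, for illustration only.
daily_counts = Counter({
    "User * logged in at * from IP *": 700,
    "Request * completed in * ms": 250,
    "Cache miss for key *": 45,
    "Payment * failed with code *": 4,
    "Disk * is almost full": 1,
})

print(f"top 1 share: {top_k_share(daily_counts, 1):.0%}")
print(f"top 2 share: {top_k_share(daily_counts, 2):.0%}")
print("rare templates:", rare_templates(daily_counts))
```

In this made-up data set, two templates account for almost all of the traffic, while the two templates that arrive fewer than 10 times a day are exactly the ones you would want to notice.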
The facts above show how our current approach to logging is shaped by the variance of the logs rather than by our own perspective. We react to our data instead of proactively analyzing it according to our needs, because the masses of data are so overwhelming that we can’t see past them.
By automatically clustering log data back to its original structure, we allow our users to view all of their log data in a fast and simple way, and to quickly identify suspicious events that they might otherwise ignore.