Byju’s started in 2015 and is India’s largest ed-tech company. It provides curated learning programs for students in grades K-12 and for competitive exams like JEE and IAS. It has 50 million registered students and 3.5 million paid subscriptions. In the last few years, Byju’s has made a series of ed-tech acquisitions, including Osmo (2019), WhiteHat Jr (2020), Great Learning (2021), Aakash (2021), Gradeup (2021), and Toppr (2021).
Each of Byju’s acquisitions brought its own tech stack, and the central DevOps leadership needed to manage the diversity of tools at the cloud infrastructure and site reliability level. Observability is critical because it creates an important feedback loop that helps the team understand their systems better and plan what can be done to improve them.
We wanted to implement a specific stack with minimal effort from the development teams and required observability to be application agnostic and run on any kind of system.
Today, almost all the engineering teams in Byju’s portfolio use Coralogix as their unified observability platform for logs, metrics, and traces. Notably, the tech stack itself is not unified: different teams run on different compute and orchestration platforms, such as Amazon ECS, Amazon EC2, or Kubernetes. Coralogix is used across this diverse set of systems and engineering teams, with all the monitoring data piped to a single platform.
Previously, the monitoring solution each subsidiary chose depended on its underlying public cloud (such as AWS), so the group ended up with a diverse set of monitoring tools. To address this, Byju’s first ran a cost-benefit analysis and found that using too many observability tools carried a financial downside as well as a lack of standardization. They then drafted a set of technical requirements to address their needs and were introduced to Coralogix by Onnivation.
“At an organizational level you need to manage multiple vendors and deal with higher costs because you have a situation where you’re not utilizing any of these observability systems optimally.” – Hitesh Kumar, Principal Member Of Technical Staff – Byju’s
With every project using a different tool, developers also faced a new learning curve each time they moved from one system to the next. What Byju’s really liked about Coralogix was the Open Source (OS) support. They felt it allowed developers to engage more with the code base to understand what’s happening, make better decisions, and build custom solutions.
Real-time monitoring was a key requirement, which meant they needed a robust full-stack observability platform that surfaced logs, metrics, and traces as soon as they were produced. They needed a strong correlation feature so that contextual data would not be missed while resolving an issue. They wanted customized alerting where, depending on the severity and type of alert, they could fan it out to the responsible team through multiple receiver points such as phone, SMS, Slack, or Jira. Lastly, they wanted strong ownership from a cost perspective, with the teams that produce the logs also managing their optimization.
Meeting these requirements is why Byju’s opted for Coralogix, which additionally offered dashboards and visualizations for anomaly detection, scalability, and integration with other open-source tools. Every one of these criteria was essential; missing any of them would have been a deal breaker.
Byju’s felt that ~80% of the logs generated were not very useful. Once they started shipping logs that had previously gone to other vendors into Coralogix, they were immediately able to identify the logs that were not beneficial and move them to a low TCO bucket.
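The alert fan-out requirement can be illustrated with a small routing table. This is only a sketch of the idea, not Coralogix's alerting API or Byju's actual configuration: the `ROUTES` table, the `Alert` type, the severity and alert-type labels, and the team name in the usage note are all hypothetical. Only the receiver channels (phone, SMS, Slack, Jira) come from the text.

```python
from dataclasses import dataclass

# Hypothetical routing table: (severity, alert type) -> receiver channels.
# The channel names mirror the receivers mentioned in the text; the
# severities, alert types, and rules themselves are illustrative assumptions.
ROUTES = {
    ("critical", "availability"): ["phone", "sms", "slack"],
    ("critical", "cost"): ["slack", "jira"],
    ("warning", "availability"): ["slack"],
    ("warning", "cost"): ["jira"],
}
DEFAULT_CHANNELS = ["slack"]  # assumed fallback for unmapped combinations


@dataclass(frozen=True)
class Alert:
    severity: str
    kind: str
    team: str
    message: str


def fan_out(alert: Alert) -> list[tuple[str, str]]:
    """Return the (channel, team) pairs that should receive this alert."""
    channels = ROUTES.get((alert.severity, alert.kind), DEFAULT_CHANNELS)
    return [(channel, alert.team) for channel in channels]
```

Under these assumed rules, a critical availability alert owned by a hypothetical "payments" team would be delivered to that team over phone, SMS, and Slack, while an unmapped combination falls back to Slack only.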
“We were able to migrate quickly because one of the Coralogix features we used is the Total Cost Optimizer, which none of the other vendors offered.” – Hitesh Kumar, Principal Member Of Technical Staff – Byju’s
Coralogix helped standardize the bulk of the monitoring data that was transferred from other stacks and helped build a unified dashboard across services and diverse systems. Byju’s could see tangible progress within a month.
Byju’s started off with ELK on AWS, then used Logstash with Fluent, and also had Amazon CloudWatch, Datadog, New Relic, PLD, and AppSignal in the system to manage their logs.
They were fully aware of the OS technologies used in the industry, so when they looked at the Coralogix documentation, it was familiar. They immediately shared it with the teams, and Byju’s central DevOps team offered support through the migration. They trained a few DevOps team members on the implementation, highlighting issues they were likely to face, as the team had rigorous prior experience in implementing such solutions. There are now 200+ active developers using Coralogix, and thousands of services have been onboarded over the last few months.
Whenever we faced a challenge that could not be solved internally, the Coralogix Sales Engineering and Support teams stepped up and helped the developer 1-on-1 to resolve issues. Their response time is flawless.
Byju’s uses Coralogix’s APM features, which allow them to filter out requests that perform within the SLA. For a system that performs well 90% of the time, they are interested only in the other 10% that show errors or anomalies, and have those sent to the console. As Coralogix is built on OS, Byju’s can apply this level of custom optimization, setting rules that drop the success logs and push only the traces that fall outside the SLA, e.g. traces that take longer than a certain threshold time or that errored due to some other attribute.
According to Byju’s, between 10% and 20% of the logs are critical to getting the right metrics. For example, access logs help identify the stability of the services. Their teams also expose business metrics from their applications, such as the number of transactions happening, the source of each transaction, and the receipt of each transaction. They actively ingest those metrics across multiple services, plot them for visualization, and configure alerts against them. From a business point of view, costs are associated with the metric’s value. In debugging scenarios, the Coralogix traces are the most important signal, used in correlation with the logs.
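The filtering rule described here (keep only traces that are slow or errored) can be sketched in a few lines. This illustrates the logic only and is not Coralogix configuration or a Coralogix API: the `Trace` type, the function names, and the 500 ms threshold are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical SLA threshold; the source does not state the real value.
SLA_THRESHOLD_MS = 500.0


@dataclass(frozen=True)
class Trace:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def breaches_sla(trace: Trace, threshold_ms: float = SLA_THRESHOLD_MS) -> bool:
    """A trace is of interest if it errored or ran longer than the threshold."""
    return trace.is_error or trace.duration_ms > threshold_ms


def select_for_export(traces: list[Trace],
                      threshold_ms: float = SLA_THRESHOLD_MS) -> list[Trace]:
    """Drop traces that met the SLA; keep only the slow or errored ones."""
    return [t for t in traces if breaches_sla(t, threshold_ms)]
```

With this rule, a successful 120 ms request is dropped, while a 900 ms request or a fast request that returned an error is kept for investigation.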
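As a rough illustration of turning application events into a business metric and alerting on it, the sketch below counts transactions per source and flags sources that fall below a minimum. The event shape (`type` and `source` keys), the function names, and the threshold are hypothetical; the source does not describe how Byju's teams implement this.

```python
from collections import Counter


def transactions_by_source(events: list[dict]) -> Counter:
    """Count transaction events per source, ignoring other event types."""
    return Counter(
        e["source"] for e in events if e.get("type") == "transaction"
    )


def sources_below_minimum(counts: Counter,
                          expected_sources: list[str],
                          minimum: int) -> list[str]:
    """Return the expected sources whose transaction count is under the minimum."""
    return sorted(s for s in expected_sources if counts.get(s, 0) < minimum)
```

A check like `sources_below_minimum` is the kind of condition an alert could be configured against, for example to notice that a source has stopped producing transactions.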
With an eye on future growth, Byju’s would like to segregate the view across services so that individual teams can view only the data relevant to them, and set custom thresholds that may vary across teams. They plan to build more sophisticated dashboards to better visualize the data being ingested into Coralogix.
Beyond the TCO Optimizer, the value derived from using Coralogix was the ability to ingest only those logs that they wanted to keep for the long term. They saw avoidable data repetition across logs, metrics, and traces, and fed this back to the teams to reduce overall volumes. Inputs were provided directly to the teams managing the application layer so that the repetition could be fixed at the code level. Coralogix also had availability in the AWS India region, which other vendors did not, and this reduced data transfer costs (the systems are primarily in Mumbai and Singapore).
Data ownership is also a big factor and all the data ingested by Coralogix eventually ends up within Byju’s ecosystem in an Amazon S3 account or GCP bucket.
Byju’s also saves on people resources because teams no longer have to manage their own observability stack. Onboarding a new service now takes days where it previously took weeks. This delivers savings of thousands of dollars across resource costs and time. Now that the tooling is standardized, teams can efficiently work on cleaning up the logs, which helps in debugging.
With teams coming together across acquisitions and locations, it was initially hard to get alignment. However, after creating a clear observability strategy and aligning the leadership on both cost and operational benefits, Byju’s has been able to get top-down buy-in to use standardized tooling. The DevOps team ran multiple training sessions within the organization to communicate best practices for logging, how to avoid pushing excessive logs, cost optimization, and other important information related to the management of logs, metrics, and traces. The developers now see value in the approach because the data and strategy presented were well researched.