How Khatabook Built Observability into their System, Processes, and Culture


Case Study

Avg. Daily Data Volume: 1TB+
Engineering Users: 60+
Days Saved Per Month: ~2
Yearly Savings: $100K+

khatabook.com/

About the company

Khatabook is India’s fastest-growing SaaS company, enabling micro, small, and medium businesses to increase efficiency and profitability through safe and secure digital solutions. With a user base spread across almost every district in India, Khatabook intends to become the most accessible FinServ platform for Indian merchants.

Contributing towards building the infrastructure for better financial access, the Khatabook platform will provide digital touchpoints to avail and manage credit, savings, and cash flows. Founded in January 2019, Khatabook today is available in 13 languages and has 10M+ monthly active users. In May 2021, Khatabook acquired Biz Analyst, a real-time business intelligence app integrated with Tally accounting software, with 100K+ premium users globally.

Observability is a primary concern for the team at Khatabook, as ensuring the highest level of reliability and stability to protect users’ financial data and transactions is the top priority.

Overview

The Khatabook platform supports more than 50 million customers’ financial activities. If the system goes down, it directly affects people’s ability to complete their business transactions. Hence, the team has a huge responsibility to ensure a seamless user experience.

Observability is a central part of everyday life at Khatabook, where it’s been built into both the architecture and the culture. It’s critical for the business that the system is highly available and reliable, and that the team is agile enough to quickly identify and resolve any abnormalities.

To ensure continued success and optimize processes, the engineering team at Khatabook decided to migrate from their former in-house log management system to Coralogix. Since the integration, the team has optimized their log data to extract deeper insights, reduced the time spent maintaining observability, and significantly improved MTTI and MTTR.

Achieving High Adoption of the Coralogix Platform

One of the challenges of moving from an in-house observability system to a third-party one is developers’ resistance to new tools. Observability is important, and it’s just as important that engineers use all of the tools at their disposal to achieve it.

If something goes wrong in the middle of the night, developers have to wake up and work on the issue. Given the sensitive nature of the applications, the team must be able to identify, investigate, and resolve problems as soon as possible.

When a new system, like Coralogix, is being added to Khatabook’s tech stack, it’s important to evangelize the tool and ensure that the DevOps team is fully trained on the platform. Both of these steps are crucial for widespread adoption in the organization.

First, everybody must be on the same page about why the system is good, and how it helps developers and everybody in the organization. Second, once the developers begin to adopt and use the solution, they must have resources available to them to answer deeper questions about how to use the system for very specific use cases.

In this case, the Coralogix team works closely with Khatabook’s DevOps engineers to streamline their logs and provide in-depth onboarding sessions so that any time a developer has a question, they can reach out to someone within the organization to get a quick answer. The team is also leveraging the 24/7 in-app chat feature to get quick answers to any question that comes up.

In the end, although adoption was a concern when moving from an in-house to a third-party system, the team’s usage of the platform has been very high, helped by the support that Coralogix offered: training and onboarding sessions, round-the-clock in-app support chat, and even occasional Zoom calls to get to the bottom of any issue. Today, more than 60 engineers use Coralogix to monitor and maintain the reliability of Khatabook’s systems.

Improving Incident Response During On-Call Rotations

The high adoption and usage of Coralogix at Khatabook has led to significant improvements in developer productivity, reduced maintenance effort, and more reliable troubleshooting.

The majority of Khatabook’s SLAs, especially in cases when the impact is high, are within minutes. If the systems go down, in some cases they have a response SLA of 30 minutes, which is very aggressive.

There is no separate team watching how the systems behave or whether something goes wrong. It’s the responsibility of each developer who writes the code to make sure that, once deployed, it runs the way it is supposed to in production.

With the previous in-house observability solution, the team often faced challenges because the system ran out of memory or VMs. As a result, developers frequently struggled because they didn’t have sufficient data to investigate a problem. The biggest impact was that there was no real way to understand the root cause.

Since moving to Coralogix, the team has a streamlined way to investigate issues and doesn’t need to worry about the system or their data being unavailable. This has greatly improved the developer experience and increased productivity.

The Infra team was putting a significant amount of effort and energy into building the right systems and then keeping them up and running. The team constantly had to worry about the availability of VMs, memory, and CPU for their monitoring systems. Migrating to Coralogix’s fully managed platform has saved the team the equivalent of around half a day of work each week.

Measuring Custom Application Metrics with Coralogix

The team monitors all the standard performance metrics, such as CPU, RAM, and network bandwidth, but application observability involves a lot more than CPU. The team also generates custom data points from their logs, which they can then monitor in their dashboards and add to their alerting.

Coralogix enables the team to build those monitoring points very easily using Logs2Metrics. So, developers are coming up with more data points to create monitoring or alerting around new behaviors and use cases.

In the Khatabook system, because there are many connections between internal and external services like payment partners or banking systems, it’s very easy for transactions to get blocked. This has nothing to do with CPU or RAM. The team monitors transaction statuses for individual services and has proactive alerting set up for issues relating to OTP, DNS, and Loan-API services.
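The idea of deriving a metric from log data can be illustrated with a minimal Python sketch. This is not the Logs2Metrics API or Khatabook's actual setup; the JSON log format, the field names (service, status), and the function name are all assumptions made for illustration.

```python
import json
from collections import Counter

# Hypothetical log lines; the JSON shape and field names are assumptions.
SAMPLE_LOGS = [
    '{"service": "otp", "status": "ok"}',
    '{"service": "otp", "status": "blocked"}',
    '{"service": "loan-api", "status": "blocked"}',
    '{"service": "loan-api", "status": "blocked"}',
    'not json at all',
]


def count_by_service_and_status(lines):
    """Aggregate raw log lines into a (service, status) -> count metric."""
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not structured logs
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        service = record.get("service")
        status = record.get("status")
        if service is None or status is None:
            continue  # skip records missing the fields we aggregate on
        counts[(service, status)] += 1
    return counts


metric = count_by_service_and_status(SAMPLE_LOGS)
blocked_loan_api = metric[("loan-api", "blocked")]
```

Counting per service and status keeps the metric small enough to chart and alert on, even when the underlying log volume is large.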

When one of these numbers grows too high or too fast, the team needs to be able to respond immediately. For example, Khatabook settles millions of merchants’ money into their accounts at midnight; if something suddenly goes wrong, everybody’s transactions are blocked.

Without the ability to monitor that metric and respond proactively, the next morning there will be a flood of calls to customer care asking, “Where is my money?” This is a simple example. It’s just one data point, but it’s critical for day-to-day operations.
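The "too high or too fast" condition described above can be sketched as a simple rule over a series of metric samples. This is a hypothetical illustration, not Coralogix's alerting configuration; the function name, parameters, and threshold values are assumptions.

```python
def should_alert(samples, max_value, max_growth_per_interval):
    """Decide whether to alert on a series of blocked-transaction counts.

    samples: counts ordered oldest first, one per monitoring interval.
    max_value: absolute ceiling for the latest sample ("too high").
    max_growth_per_interval: allowed increase between the last two
        samples ("too fast").
    """
    if not samples:
        return False
    latest = samples[-1]
    if latest > max_value:
        return True  # the number is too high
    if len(samples) >= 2 and latest - samples[-2] > max_growth_per_interval:
        return True  # the number is growing too fast
    return False


# Hypothetical thresholds: alert above 1000 blocked transactions,
# or on a jump of more than 100 between consecutive intervals.
steady = should_alert([10, 12, 15], max_value=1000, max_growth_per_interval=100)
spike = should_alert([10, 12, 500], max_value=1000, max_growth_per_interval=100)
```

Checking the rate of change as well as the absolute value lets a sudden midnight spike raise an alert before the count reaches the absolute ceiling.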

Addressing the Cost of Observability

A common assumption is that moving from a homegrown solution to a SaaS product imposes a higher cost, but Rajeev Kumar Sharma, Director of Engineering at Khatabook, insists that we must first understand, at a higher level, what ‘cost’ means.

If you look only at the direct cost, which is what you pay for third-party software, then yes, it is more costly. But many things are more important than the direct cost.

For example, at the time that Khatabook migrated to Coralogix, there was only one DevOps engineer in the organization. Before the migration, a significant amount of time was going just to maintain the observability system. It’s just not economical to use your strongest engineers to build and maintain monitoring tools rather than focusing on the systems that are core to the business. Ultimately, that’s a much higher cost than adopting third-party software.

“If you look at the holistic picture of what the cost is, the answer is pretty simple. I reduced the cost overall as opposed to increasing it.”

This is before even addressing the rising costs of analyzing the growing volume of data produced by Khatabook’s systems. Like any modern company, Khatabook is facing exponential data growth. It uses Coralogix to optimize the analysis and storage of that data, extracting powerful insights and maintaining query access to compliance data without needing to store it in expensive indices.

Using the TCO Optimizer, Khatabook is able to designate the majority of their data to the Compliance use case in the Coralogix platform so they can parse and enrich their data, view it in the Live Tail, and then archive it to their S3 bucket with full query capabilities. The data that is relevant for monitoring the health of their systems is analyzed in-stream and used to generate trackable metrics and templates for streamlined troubleshooting.
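As a rough illustration of the routing idea described above, the following Python sketch assigns each log record to one of two use cases. It is not the TCO Optimizer's policy format or Khatabook's actual policy; the record fields, the severity values, the service list, and the rule itself are assumptions chosen only to show the shape of such a decision.

```python
from enum import Enum


class UseCase(Enum):
    COMPLIANCE = "compliance"  # parsed, enriched, then archived with query access
    MONITORING = "monitoring"  # analyzed in-stream to produce metrics and alerts


# Hypothetical: services whose logs feed health monitoring.
MONITORED_SERVICES = {"otp", "dns", "loan-api"}
# Hypothetical: severities considered relevant for monitoring.
MONITORED_SEVERITIES = {"error", "critical"}


def assign_use_case(record):
    """Route a log record (a dict) to a use case; default is compliance."""
    service = record.get("service")
    severity = record.get("severity")
    if service in MONITORED_SERVICES and severity in MONITORED_SEVERITIES:
        return UseCase.MONITORING
    return UseCase.COMPLIANCE


routed = assign_use_case({"service": "otp", "severity": "error"})
archived = assign_use_case({"service": "ledger", "severity": "info"})
```

Defaulting to the compliance use case mirrors the text's point that the majority of data is archived, with only the health-relevant subset analyzed in-stream.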

Rajeev Kumar Sharma
Director of Engineering, Khatabook

If you look at the holistic picture of what the cost is, the answer is pretty simple. I reduced the cost overall as opposed to increasing it.

Summary

As a company that is responsible for the financial success of millions of businesses, Khatabook must prioritize observability and ensure that they maintain the highest level of quality and reliability in their systems.

With Coralogix, observability is built into the workflows and culture of the entire engineering organization. Since the migration, the team has optimized their data to extract deeper insights, reduced the time spent maintaining observability, and significantly improved MTTI and MTTR.
