API Error Tracking
Overview
Consistently monitoring the errors collected by Coralogix is essential for maintaining your system's health. When there are many individual error events, it becomes hard to prioritize errors for troubleshooting. API Error Tracking simplifies debugging of backend services by assembling thousands of similar API errors into a single group, enabling you to:
- Follow, grade, and resolve fatal errors.
- Categorize similar errors into error groups. For example, organize all errors with HTTP status 502 Bad Gateway into one group, or collect all errors with gRPC status 5 - NOT_FOUND into another. These groupings help you identify and prioritize the API errors that are most impactful, reduce noise, and minimize service downtime.
- Track issues over time to determine when they started, whether they are ongoing, and how frequently they occur.
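The grouping idea described above can be sketched in a few lines of Python. This is an illustration only, not Coralogix's implementation: the `Span` record, the `is_error` rule (HTTP status 400 and above, or any non-zero gRPC status), and the choice of protocol plus status code as the grouping key are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical, simplified span record used only for this sketch."""
    operation: str
    protocol: str      # "http" or "grpc"
    status_code: int


def is_error(span):
    # Assumption: HTTP 4xx/5xx and any non-zero gRPC status count as errors.
    if span.protocol == "http":
        return span.status_code >= 400
    return span.status_code != 0  # gRPC status 0 is OK


def group_errors(spans):
    """Collect error spans into groups keyed by (protocol, status code)."""
    groups = defaultdict(list)
    for span in spans:
        if is_error(span):
            groups[(span.protocol, span.status_code)].append(span)
    return dict(groups)


spans = [
    Span("GET /cart", "http", 502),
    Span("GET /cart", "http", 502),
    Span("GET /cart", "http", 200),
    Span("CartService/GetItem", "grpc", 5),
    Span("CartService/GetItem", "grpc", 0),
]
groups = group_errors(spans)
for key, members in groups.items():
    print(key, len(members))
```

Here the two 502 Bad Gateway spans fall into one group and the single gRPC 5 - NOT_FOUND span into another, while successful spans are ignored.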
Availability
To enable API error tracking using span metrics, follow the instructions here.
Data sources
Service error data is extracted from spans during the interval selected in the time picker, according to HTTP or gRPC status codes.
Track your service errors
- Navigate to APM > Service Catalog. Click on a service of interest.
- On the service page, go to the API Errors tab.
- View aggregated information related to service errors:
- Number of error groups
- Total number of API errors
- Percentage of API errors in relation to the total number of service requests
- Click on the Errors chart to display a modal with a detailed view of the error occurrences. Use it to better understand error dynamics in the service and to pinpoint error spikes. The chart presents errors over time (count and percentage) for the top error groups that affect most of your service operations.
Scroll down to the detailed Error Groups summary to study the following:
- Error messages and related operations for each group
- The first and last appearances of this error within the selected time range
- Total number of occurrences and error percentage
Use this information to cut down on noise and improve the visibility of the error data. Easily locate a specific error group using the Search field above the Error Groups tables.
Click on an error group to display a modal with a detailed view of all error occurrences.
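To make the figures shown on the API Errors tab concrete, the following sketch computes the same kinds of values from a list of error events: number of error groups, total errors, overall error percentage, and per-group first and last appearance. It is a rough illustration under stated assumptions, not the product's actual calculation: the `summarize` function, the event format, and the choice to express each group's percentage relative to total service requests are all hypothetical.

```python
from datetime import datetime


def summarize(error_events, total_requests):
    """Summarize error events given as (group_key, timestamp) pairs.

    Assumption: percentages are relative to the total number of
    service requests in the selected time range.
    """
    groups = {}
    for key, ts in error_events:
        g = groups.setdefault(key, {"count": 0, "first_seen": ts, "last_seen": ts})
        g["count"] += 1
        g["first_seen"] = min(g["first_seen"], ts)
        g["last_seen"] = max(g["last_seen"], ts)

    def pct(count):
        # Guard against an empty time range with no requests.
        return 100.0 * count / total_requests if total_requests else 0.0

    for g in groups.values():
        g["error_pct"] = pct(g["count"])

    return {
        "error_groups": len(groups),
        "total_errors": len(error_events),
        "error_pct": pct(len(error_events)),
        "groups": groups,
    }


events = [
    ("HTTP 500", datetime(2024, 1, 1, 9, 0)),
    ("HTTP 500", datetime(2024, 1, 1, 9, 30)),
    ("HTTP 503", datetime(2024, 1, 1, 9, 15)),
]
summary = summarize(events, total_requests=200)
print(summary["error_groups"], summary["total_errors"], summary["error_pct"])
```

With three errors out of 200 requests, the sketch reports two error groups and an overall error rate of 1.5%.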
Typical use cases
The API Errors tab focuses on profiling errors to uncover their specific impact on service operations. Its time-based analysis lets you slice the data by various dimensions, showing how often each error group occurs and exactly which parts of the service it affects. The use cases below focus on understanding error patterns, prioritizing their resolution, and minimizing their operational impact.
Isolating errors by component
Identify the specific endpoint or outgoing call causing particular error types, enabling you to take precise, targeted actions to resolve the issue efficiently.
Scenario
You notice 14 Unavailable errors associated with a specific service endpoint, hipstershop.CartService/GetCar.
Solution
Debugging reveals configuration issues within the region, which are quickly identified and resolved.
Profiling and grouping errors
Understand error categories, such as HTTP 500 or gRPC UNIMPLEMENTED, to pinpoint specific issues and their operational impact.
Scenario
Frequent HTTP 500 Internal Server Errors are observed in the checkout API.
Solution
Profiling the error reveals the issue stems from a misconfigured database operation. This insight enables swift optimization and resolution of the issue.
Time-based analysis
Use error trends to identify recurring patterns, such as spikes triggered by deployments or traffic surges.
Scenario
Spikes in HTTP 503 Service Unavailable errors are detected every morning between 9 and 10 a.m.
Solution
Through time analysis, you can correlate the issue with an automated backup process that is overloading the API. This insight helps pinpoint the root cause, enabling you to address the overload effectively.
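The kind of recurring pattern in this scenario can be illustrated with a small sketch that buckets error timestamps by hour of day and flags hours that stand out. This is only an example of the idea, not how the Errors chart is computed: the `spike_hours` function and its threshold (an hour counts as a spike when it exceeds three times the average per-hour count) are assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime


def errors_by_hour(timestamps):
    """Count errors per hour of day (0-23) across all days."""
    return Counter(ts.hour for ts in timestamps)


def spike_hours(timestamps, factor=3.0):
    """Return hours whose error count exceeds `factor` times the
    average count per hour of day. The threshold is an assumption."""
    counts = errors_by_hour(timestamps)
    if not counts:
        return []
    mean = sum(counts.values()) / 24
    return sorted(h for h, c in counts.items() if c > factor * mean)


# Three days of sample data: 20 errors each morning between 9 and 10,
# plus one unrelated error each afternoon.
timestamps = [datetime(2024, 1, d, 9, m) for d in (1, 2, 3) for m in range(20)]
timestamps += [datetime(2024, 1, d, 14, 0) for d in (1, 2, 3)]

print(spike_hours(timestamps))
```

A recurring hour in the result points to something scheduled, such as the automated backup in the scenario above, rather than a one-off incident.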
Prioritize error resolution based on impact and urgency
Prioritize resolving errors that have the greatest operational impact to minimize service disruptions and maintain optimal performance.
Scenario
An authentication service is reporting frequent HTTP 401 Unauthorized errors, while fewer HTTP 500 Internal Server Errors are having a notable impact on user logins.
Solution
Prioritizing the resolution of HTTP 500 errors is crucial, as fixing them first restores critical functionality and ensures uninterrupted user logins.
Additional resources
Documentation | Application Performance Monitoring: Components, Metrics, and Practices |
Tutorial | Introduction to APM |
Support
Need help?
Our world-class customer success team is available 24/7 to walk you through your setup and answer any questions that may come up.
Feel free to reach out to us via our in-app chat or by sending us an email to [email protected].