Tracing, or more specifically distributed tracing or distributed request tracing, is the ability to follow a request through a system, joining the dots between all the individual system calls required to service a particular request.
Although tracing has been around for some time, the trend toward distributed architectures, microservices, and containerization has elevated it from a nice-to-have to an essential piece of the observability puzzle.
Software developers, SRE teams, and DevOps engineers increasingly manage complex, distributed systems made up of dozens of interdependent services, with multiple instances of each deployed via containers to a mix of ephemeral cloud-based infrastructure and on-site servers.
In that environment, a mechanism that lets you understand the flow of logic and data is essential for maintenance and troubleshooting.
Traces form one of the three pillars of observability, alongside logs and metrics. Whereas logs provide details about specific events, such as an individual database call, a write to a file, or an error statement, a trace connects a chain of related events. A single trace provides the details of each step the request went through, the amount of time spent on each step, and the total time taken to handle the request.
Imagine an HTTP request to a web server: a single request may spawn multiple sub-requests to different services, which need to be completed before the response can be served.
These might include fetching the relevant HTML file, executing server-side scripts, and running database queries to populate certain page elements. While logs provide granular details about the activities performed by each service in the process, a trace provides a complete record of the chain of events triggered by the initial request.
This information is invaluable for debugging complex issues, identifying performance bottlenecks, and optimizing the design of your system.
A trace is a set of related events triggered by an input, such as a user clicking a button or a request made by an external system. An individual trace comprises multiple spans, and each span represents an operation performed due to the initiating request.
When a trace is started, a parent or root span is generated to encapsulate the entire transaction. For example, a user clicking a button to place an order on an e-commerce site would typically constitute such a transaction, and a root span would represent the entire transaction.
If a transaction comprises multiple operations, such as checking that the user is authenticated, querying the database to confirm the product is available, and retrieving the user’s account details, a child span is created for each operation.
If an operation breaks down further, such as a request for an authorization service, each descendant is represented by a grandchild span. You can think of the spans as a hierarchical tree structure that starts from the root span, with descendants for each dependent operation and branches for operations that run in parallel.
When the root span is generated, it is given a context or identifier, which is propagated to each child span. In addition to the identifier, each span typically records the start and end time of the operation, together with optional additional attributes, such as the service instance identifier. This process makes it possible to identify each request, call, and process that made up a particular transaction, the order in which they occurred, and the time each stage took.
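The span hierarchy described above can be sketched as a small data structure. This is a minimal illustration rather than the API of any real tracing library: the `Span` class, the `start_trace` and `walk` helpers, and the operation names are all hypothetical, chosen to mirror the e-commerce example.

```python
from __future__ import annotations

import secrets
import time
from dataclasses import dataclass, field


@dataclass
class Span:
    """One operation within a trace (illustrative only)."""
    name: str
    trace_id: str                       # shared by every span in the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: str | None = None        # None marks the root span
    start: float = field(default_factory=time.monotonic)
    end: float | None = None
    attributes: dict[str, str] = field(default_factory=dict)
    children: list[Span] = field(default_factory=list)

    def child(self, name: str, **attributes: str) -> Span:
        """Create a child span that inherits this span's trace identifier."""
        span = Span(name, self.trace_id, parent_id=self.span_id,
                    attributes=attributes)
        self.children.append(span)
        return span

    def finish(self) -> None:
        """Record the end time of the operation."""
        self.end = time.monotonic()


def start_trace(name: str) -> Span:
    """Start a new trace by creating its root span with a fresh identifier."""
    return Span(name, trace_id=secrets.token_hex(16))


def walk(span: Span, depth: int = 0):
    """Yield (depth, span) pairs in depth-first order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Root span for the whole transaction, with child and grandchild spans.
root = start_trace("place_order")
auth = root.child("check_authentication", service_instance="auth-1")
token = auth.child("authorization_service_request")
token.finish()
auth.finish()
stock = root.child("query_product_availability")
stock.finish()
account = root.child("fetch_account_details")
account.finish()
root.finish()

for depth, span in walk(root):
    elapsed_ms = (span.end - span.start) * 1000
    print(f"{'  ' * depth}{span.name} trace={span.trace_id[:8]} {elapsed_ms:.3f} ms")
```

Because every span carries the same `trace_id` plus its own `parent_id`, a backend can rebuild the tree even when the spans arrive separately and out of order.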
One of the main advantages of distributed tracing is the ability to follow the path of a transaction from beginning to end. In a traditional monolithic architecture, following the course of a request is relatively straightforward.
However, the move towards distributed, containerized architectures — while offering many advantages in terms of scalability, ease, and speed of deployment — has made this much more difficult.
By implementing distributed tracing, you can map the entire flow of a request, even across network boundaries and security contexts, and identify its constituent parts. When debugging complex issues, this trace data can help you quickly pinpoint the source of the problem and identify which logs to focus your attention on, reducing your mean time to resolution (MTTR).
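For a trace to survive a network hop, the calling service has to send the trace context along with the request, and the receiving service has to read it back. The sketch below assumes the W3C Trace Context `traceparent` HTTP header, a widely used format that this article does not otherwise cover; the `inject` and `extract` helpers are hypothetical names, and real instrumentation libraries handle this step for you.

```python
import re

# traceparent layout: version "00", 32 hex chars of trace id,
# 16 hex chars of parent span id, and 2 hex chars of flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Write the current trace context into outgoing request headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(headers: dict):
    """Read the trace context from incoming headers, or return None if absent or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    # All-zero identifiers are defined as invalid by the format.
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),
    }


# The caller adds the header; the downstream service recovers the same trace id
# and uses the caller's span id as the parent of its own span.
outgoing = {}
inject(outgoing, "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
context = extract(outgoing)
print(outgoing["traceparent"])
print(context)
```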
Because each span in a trace records the start and end time of its operation, traces are particularly effective for identifying dependencies and performance bottlenecks. Once you have used traces to find the slowest requests, you can drill down to see which stages of the process take the most time and determine whether dependencies on other services contribute to latency.
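One way to drill down is to compute each span's self time, that is, its duration minus the time spent in its direct children. The sketch below uses made-up span data and a hypothetical `self_times` helper, and it assumes that the children of a span run one after another; if they run in parallel, this subtraction overstates the time spent in children.

```python
from collections import defaultdict

# Hypothetical finished spans: (span_id, parent_id, name, start_ms, end_ms)
spans = [
    ("a", None, "place_order", 0, 480),
    ("b", "a", "check_authentication", 5, 45),
    ("c", "a", "query_product_availability", 50, 400),
    ("d", "c", "inventory_db_query", 60, 390),
    ("e", "a", "fetch_account_details", 405, 470),
]


def self_times(spans):
    """Return {name: duration minus time spent in direct children}."""
    child_time = defaultdict(int)
    for _span_id, parent_id, _name, start, end in spans:
        if parent_id is not None:
            child_time[parent_id] += end - start
    return {
        name: (end - start) - child_time[span_id]
        for span_id, _parent_id, name, start, end in spans
    }


times = self_times(spans)
slowest = max(times, key=times.get)

# List operations from most to least self time.
for name, ms in sorted(times.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {ms} ms")
print("bottleneck:", slowest)
```

In this data the availability check looks slow at first glance, but almost all of its time is spent in the database query beneath it, which is where optimization effort would pay off.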
One of the advantages of distributed architectures is that software development teams can develop and deploy changes in parallel. While this has many benefits, it can contribute to silos within the enterprise.
With less visibility and understanding of the other elements that make up the application, it becomes more difficult for teams to understand how their changes impact other parts of the system. Introducing distributed tracing can help to mitigate this by providing a clearer picture of how the individual pieces form a whole.
Finally, with distributed tracing in place as part of your observability solution, you can start to reap the benefits of faster incident response. Increased confidence in your ability to detect and respond to issues quickly means you can release changes more frequently and use this as an opportunity to conduct experiments, such as trialing new features with a subset of users.
While distributed tracing offers many advantages, there are some potential issues to bear in mind.
Generating identifiers, propagating them to child spans, and sending that span data to a datastore can impact performance and generate considerable volumes of data. For this reason, tracing a representative sample of requests is recommended rather than tracing every transaction.
Sampling can be implemented in two ways: head-based and tail-based. With head-based sampling, the decision whether to trace a transaction is made when the transaction is initiated. If the transaction is selected, the trace identifier and the sampling decision are propagated to each span. How the decision is made depends on the sampling configuration.
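A common head-based configuration is ratio-based sampling keyed on the trace identifier, so that every service reaches the same decision for the same trace without coordinating. The following is a minimal sketch under that assumption; `should_sample` is a hypothetical helper, not the sampler of any particular tracing library.

```python
import secrets


def should_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the start of a transaction whether to trace it.

    The decision depends only on the trace id, so it is deterministic
    and repeatable across services.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Treat the low 64 bits of the hex trace id as a uniformly random number
    # and keep the trace if it falls below the chosen fraction of the range.
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound


# Sample roughly 10% of 10,000 randomly generated trace identifiers.
trace_ids = [secrets.token_hex(16) for _ in range(10_000)]
kept = [tid for tid in trace_ids if should_sample(tid, 0.10)]
print(f"kept {len(kept)} of {len(trace_ids)} traces")
```

The trade-off is that the decision is made before anything is known about the outcome, so an unsampled request that later fails leaves no trace behind.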
With tail-based sampling, tracing is applied to every transaction and the decision on which requests to sample is made after the fact. While this temporarily increases storage, you can apply some analysis before choosing which traces to keep.
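The analysis applied after the fact is typically a set of simple rules evaluated once a trace is complete. The sketch below shows one possible policy: keep every trace that contains an error or exceeds a latency threshold, plus a small random baseline of ordinary traces. The `keep_trace` function, the dictionary layout, and the threshold values are illustrative assumptions.

```python
import random


def keep_trace(trace, latency_threshold_ms=500, baseline_ratio=0.05, rng=random):
    """Decide, once a trace has completed, whether to retain it."""
    # Always keep traces in which any span reported an error.
    if any(span["error"] for span in trace["spans"]):
        return True
    # Always keep unusually slow traces.
    if trace["duration_ms"] >= latency_threshold_ms:
        return True
    # Otherwise keep a small random share as a baseline for comparison.
    return rng.random() < baseline_ratio


# Hypothetical completed traces held in temporary storage.
completed = [
    {"id": "t1", "duration_ms": 120, "spans": [{"name": "place_order", "error": False}]},
    {"id": "t2", "duration_ms": 950, "spans": [{"name": "place_order", "error": False}]},
    {"id": "t3", "duration_ms": 80, "spans": [{"name": "charge_card", "error": True}]},
]

retained = [t["id"] for t in completed if keep_trace(t, baseline_ratio=0.0)]
print(retained)
```

With `baseline_ratio=0.0`, only the slow trace and the failed trace are retained, which is exactly the data a head-based sampler might have discarded.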
As with logs and metrics, you need to instrument your application code to emit traces and send them to a storage backend to make them available for analysis.
As distributed tracing developed as a practice within the software development community, the need for standardized data formats and instrumentation to avoid being locked in to a particular tool was soon recognized.
OpenTelemetry, formed from the merger of the OpenCensus and OpenTracing projects and managed by the Cloud Native Computing Foundation, provides that open standard. OpenTelemetry includes instrumentation libraries for the most commonly used programming languages and support for exporting traces (and other observability data) to multiple open-source and proprietary backends.
Choosing a tool that adheres to the OpenTelemetry standard means you can more easily change vendors in the future without having to change your tracing implementation. Popular open-source examples include Zipkin and Jaeger. If you’re using AWS to host your distributed system, you may want to consider Coralogix’s full-stack observability tool.
Building observability into your systems can deliver a host of benefits. Combining logs and metrics with distributed tracing will provide you with a rich dataset that you can use to observe and understand how your applications are behaving.
Armed with these deeper insights into how your system operates, you’ll be able to respond far more quickly when something goes wrong. Once you have identified that a problem is emerging, traces connect the dots between operations so that you can drill down to the source of the problem and home in on the logs that will provide the additional detail you need to debug the issue.
Good visibility of your system’s health and the ability to respond to issues faster also reduce the risk of deploying changes, opening up opportunities to innovate and experiment.