Error logs are the first port of call for any outage. Great error logs provide context and cause to a mysterious, 3am outage. Engineers often treat…
First of all, don’t ask this! Instead of asking what to log, we should start by asking “what questions do we want to answer?” Then, we can determine which data needs to be logged and monitored in order to best answer these questions.
Once a question comes up, we can answer it using only the data and knowledge that we have on hand. In emergent situations such as an unforeseen system failure, we cannot change the system to log new data to answer questions about the current state of the system.
This means that we must do our best to anticipate what information we’ll need to answer comprehensive questions about our system in the future.
When we do not have precise questions, we can fall back on community knowledge and log the data that helps answer the questions commonly asked about systems similar to ours.
This log data will be invaluable if our questions turn out to be the questions posed frequently in the community. If not, it can still provide valuable insights and guidance to determine improvements to the data being logged as well as the system itself.
It’s important to be proactive in collecting the data that you need, but when we fall short of full coverage, we need to be agile enough to modify and redeploy the system so that it logs the missing data.
In this post, we will explore common aspects of modern-day systems that are interesting to monitor based on community knowledge. We will also explore common questions pertaining to these aspects and the data needed to answer these questions.
Before we begin, a word of caution. While gathering more data may be helpful, excessive data collection can be detrimental to the performance of the system and, more importantly, raise concerns and risks about user security and privacy. So, exercise caution when gathering nice-to-have data.
The core aspects of any software system that should be monitored are functional correctness, performance, reliability, and security. Other domain-specific aspects include things such as scale and privacy. Each respective domain has its own set of questions that need to be answered. These are only a few examples:
1. Functional Correctness
Assuming we agree these aspects are commonly relevant, we will now dig a bit deeper into each aspect and corresponding questions.
In the following exposition, we will present ideas in the context of service-based systems where components/services serve as the basic compositional units and every request serviced by a component has a unique id. For clarity, we will use the term component instead of service.
With functional correctness, we are interested in a component doing what it promised to do.
More precisely, we are interested in the operational aspect and the data aspect of what a component does. For example, consider a component that promises to charge $25 to credit card X. The component exhibits operational correctness if it charges a credit card. The component exhibits data correctness if the involved amount is $25 and the credit card is X.
In general, a component exhibits operational correctness if it performs the expected operations to service a request and data correctness if it consumes and produces the expected data when servicing a request.
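The distinction between the two kinds of correctness can be sketched as a check over the operations observed for a single request. This is a minimal illustration, not a prescribed method: the function name `check_charge`, the field names `op`, `amount_cents`, and `card`, and the choice to treat "exactly one charge" as operationally correct are all assumptions made for the example.

```python
def check_charge(expected, observed_operations):
    """Compare what a component promised with what it was observed to do.

    expected: dict with hypothetical keys "amount_cents" and "card".
    observed_operations: list of dicts, one per operation performed.
    """
    charges = [op for op in observed_operations if op["op"] == "charge"]

    # Operational correctness: the expected operation happened.
    # Assumption: exactly one charge per request is the expected behavior.
    operational = len(charges) == 1

    # Data correctness: the operation used the expected data.
    data = (
        operational
        and charges[0]["amount_cents"] == expected["amount_cents"]
        and charges[0]["card"] == expected["card"]
    )
    return {"operational": operational, "data": data}


expected = {"amount_cents": 2500, "card": "X"}
print(check_charge(expected, [{"op": "charge", "amount_cents": 2500, "card": "X"}]))
print(check_charge(expected, [{"op": "charge", "amount_cents": 3000, "card": "X"}]))
print(check_charge(expected, []))
```

A charge of the wrong amount is operationally correct but fails the data check, while a missing charge fails both.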
To monitor the functional correctness of a component, we need to monitor both the operational correctness and data correctness of the component. Consequently, we need to collect data to answer the following questions.
Since a component may depend on other components in a system to service a request, monitoring its data and operational correctness boils down to tracking what data (request payloads) it consumed and provided and what operations it requested from other components while servicing a request. In other words, track every operation performed and every piece of data consumed and provided to service a request.
To answer the above questions, log every request, its id, and the corresponding response.
For the logged data to be useful, capture additional data that allows relating requests to each other (e.g., request qx triggered requests qy and qz), relating requests to responses (e.g., response sy corresponds to request qy), and recreating the global order of requests and responses (e.g., request qx was followed by request qy, which was followed by response sy, which was followed by request qz).
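One possible shape for such log records is sketched below. The field names (`seq`, `parent_id`, and so on) and the helper functions are hypothetical. The counter only orders records within a single process, so a real system would need some other source of global order, such as a central log or logical clocks.

```python
import itertools
import time
import uuid

# Assumption: a per-process counter stands in for a global ordering source.
_sequence = itertools.count(1)


def new_request_id():
    return uuid.uuid4().hex


def make_record(kind, request_id, parent_id=None, payload=None):
    """Build one structured log record for a request or a response."""
    return {
        "seq": next(_sequence),    # position in the (local) order of events
        "ts": time.time(),         # wall-clock time; may drift across hosts
        "kind": kind,              # "request" or "response"
        "request_id": request_id,  # ties a response to its request
        "parent_id": parent_id,    # the request that triggered this one
        "payload": payload,
    }


def children_of(records, request_id):
    """Ids of the requests that were triggered by the given request."""
    return [r["request_id"] for r in records
            if r["kind"] == "request" and r["parent_id"] == request_id]


def response_for(records, request_id):
    """The response record for a request, or None if none was logged."""
    for r in records:
        if r["kind"] == "response" and r["request_id"] == request_id:
            return r
    return None


# The scenario from the text: qx triggers qy and qz; sy answers qy.
qx, qy, qz = new_request_id(), new_request_id(), new_request_id()
log = [
    make_record("request", qx, payload={"op": "checkout"}),
    make_record("request", qy, parent_id=qx, payload={"op": "charge"}),
    make_record("response", qy, parent_id=qx, payload={"status": "ok"}),
    make_record("request", qz, parent_id=qx, payload={"op": "ship"}),
]
```

Sorting by `seq` recovers the order of events, `parent_id` recovers which request triggered which, and the shared `request_id` pairs each response with its request.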
Since requests and response payloads may contain sensitive (e.g., private) information, make sure such sensitive information is appropriately handled and processed by the logging system and any downstream systems.
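As one way to handle sensitive fields before a payload reaches the log, the sketch below masks values by key name. The set of key names is an assumption for illustration. A real system would derive it from its own data model and may need stricter handling (for example, dropping fields entirely).

```python
# Hypothetical field names; replace with the ones your payloads actually use.
SENSITIVE_KEYS = {"card_number", "cvv", "password"}


def redact(payload):
    """Return a copy of payload with sensitive values masked.

    Recurses into nested dicts and lists; the input is left unmodified.
    """
    if isinstance(payload, dict):
        return {
            key: "[REDACTED]" if key in SENSITIVE_KEYS else redact(value)
            for key, value in payload.items()
        }
    if isinstance(payload, list):
        return [redact(item) for item in payload]
    return payload


request_payload = {
    "amount": 25,
    "card": {"card_number": "4111111111111111", "cvv": "123"},
}
print(redact(request_payload))
```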
With performance, we are interested in a component being fast enough in doing what it promised to do.
In the context of service-based systems, this aspect has two facets: latency and throughput. Latency is the time taken by a component to service a request, i.e., the time between the component accepting a request and responding to the request. Throughput is the number of requests serviced (responded to) by a component in a second.
To monitor the performance of a component, we need to collect data to answer the following questions.
Unlike functional correctness, latency and throughput are local to a component, i.e., these aspects of a component can be measured independently of the other components involved in the component’s function. However, since a component can depend on other components to service a request, we need to consider these aspects of the other components to determine if and how they affect these characteristics of the dependent component.
To answer the above questions, for every request, log its id, the time when it was received, and the time when the corresponding response was provided.
As in the case of functional correctness, capture additional data that helps relate requests to their dependent requests. This data will help determine how the latency of one request directly or indirectly affects others within the system.
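Given records with the two timestamps described above, latency and throughput can be derived after the fact. The sketch below assumes hypothetical field names `received_ms` and `responded_ms` holding integer milliseconds, which avoids floating-point rounding in the arithmetic.

```python
from collections import Counter

# Hypothetical log records: one per request, timestamps in milliseconds.
records = [
    {"request_id": "q1", "received_ms": 10_000, "responded_ms": 10_200},
    {"request_id": "q2", "received_ms": 10_100, "responded_ms": 10_900},
    {"request_id": "q3", "received_ms": 10_500, "responded_ms": 11_300},
]


def latencies_ms(records):
    """Latency per request: time between accepting it and responding to it."""
    return {r["request_id"]: r["responded_ms"] - r["received_ms"]
            for r in records}


def throughput_per_second(records):
    """Number of requests responded to in each one-second window."""
    return Counter(r["responded_ms"] // 1000 for r in records)


print(latencies_ms(records))
print(throughput_per_second(records))
```

Throughput is counted by response time here because the text defines it as the number of requests responded to in a second.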
With reliability, we are interested in the possibility of a component (not) failing to do what it promised to do in ways that impact dependent components.
Functional correctness and reliability are dual aspects. While functional correctness is about what happens when a component successfully services a request (e.g., by providing an error response), reliability is about what happens when a component fails to service a request (e.g., crashes before responding).
Given the focus is on the fallibility of a component, we are interested in questions that help understand and address failures and also help devise a strategy to mitigate and prevent future occurrence of failures. Specifically,
To answer the above questions, for every request, log its id, the time when it was received, if the component failed to service it, and the time when the service failure was detected.
The logged data suffices to answer the first two questions. To answer the third question, log additional data as a component’s failure may stem from the failure of other components, the request received by the component, the state of the component, or the state of the component’s execution environment. This additional data will be specific to the request, the component, and its execution environment.
Exercise caution while logging additional data. Specifically, ensure the appropriate handling of sensitive information. Also, strike a balance between the data needed to understand failures and the volume of log data. Being smart in how the log data is processed (batch vs. stream) and stored (aggregated vs. raw) to understand failures can help reduce the volume of log data.
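A minimal sketch of per-request failure logging follows. It writes a "received" event before doing any work, so a request with no later outcome event points to a failure that the component could not report itself, such as a crash. The event names, the `serve` wrapper, and the `unresolved` helper are assumptions for illustration, and the sketch treats an unhandled exception as a detected service failure.

```python
import time


def serve(events, request_id, handler, clock=time.time):
    """Run handler for a request, logging receipt and outcome as events."""
    events.append({"request_id": request_id, "event": "received", "at": clock()})
    try:
        result = handler()
    except Exception as exc:
        # Failure detected in-process: record when and what kind.
        events.append({
            "request_id": request_id,
            "event": "failed",
            "at": clock(),
            "error_type": type(exc).__name__,
        })
        raise
    events.append({"request_id": request_id, "event": "completed", "at": clock()})
    return result


def unresolved(events):
    """Ids of requests that were received but have no outcome event."""
    resolved = {e["request_id"] for e in events
                if e["event"] in ("completed", "failed")}
    return [e["request_id"] for e in events
            if e["event"] == "received" and e["request_id"] not in resolved]
```

Logging only the error type, and not the full request, keeps the record small and avoids copying sensitive payload data into the failure log.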
With security, we are interested in a component doing what it promised to do without causing harm.
The definition of harm is often tightly coupled with the domain and the system in which a component operates. For example, theft of funds would constitute harm in the context of financial systems. In contrast, loss or incorrect alteration of medical records would constitute harm in the context of medical systems.
Even so, almost all domains that heavily depend on software and automation have a security framework composed of requirements (e.g., standards) and processes (e.g., certifications, inspections, audits) as protection against harm.
With such frameworks, we can view the security aspect through the more general lens of whether a component is compliant with the security framework native to its domain. Consequently, we arrive at the following security-related questions.
To answer the above questions, for every request that may have security implications, log the request, the time it was accepted and serviced, the authorization and authentication information associated with the request, and the data associated with it. In addition, log additional data to demonstrate compliance as required in the domain of application.
For example, if a component services requests to modify sensitive records, then it should log authorization and authentication information about the modifier, the modified parts of the record (or pointers to them), the modifications (or pointers to them), the origin of the request, and the time of the request. Similar information should be logged even when the request for modification completes without modifying anything (due to errors or because there was nothing to change).
As with earlier aspects, capture additional data to recreate the relation and ordering between requests. Ensure appropriate handling of sensitive data during logging; this may entail logging only a pointer to the data and not the actual data.
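The record for a modification request might look like the sketch below, which logs pointers (a record reference and the names of the changed fields) instead of the values themselves. The `AuditRecord` type, its field names, and the three outcome labels are hypothetical. The fields a real audit log must carry depend on the compliance requirements of the domain.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    at: float                        # time of the request
    actor: str                       # authenticated identity of the modifier
    auth_method: str                 # how the actor was authenticated
    roles: Tuple[str, ...]           # authorization info used for the decision
    source: str                      # origin of the request
    record_ref: str                  # pointer to the record, not its contents
    changed_fields: Tuple[str, ...]  # pointers to the modified parts
    outcome: str                     # "modified", "no_change", or "error"


def audit_modification(request_id, at, actor, auth_method, roles, source,
                       record_ref, old, new, error: Optional[str] = None):
    """Build an audit record naming what changed without copying the values."""
    if error is not None:
        changed, outcome = (), "error"
    else:
        changed = tuple(sorted(
            key for key in set(old) | set(new) if old.get(key) != new.get(key)
        ))
        outcome = "modified" if changed else "no_change"
    return AuditRecord(request_id, at, actor, auth_method, tuple(roles),
                       source, record_ref, changed, outcome)


record = audit_modification(
    "q1", 1700000000.0, "dr_lee", "mfa", ["physician"], "10.0.0.7",
    "patient/42", {"dose": "5mg"}, {"dose": "10mg"},
)
print(json.dumps(asdict(record)))
```

A record is produced for every request, including those that end in an error or change nothing, so the log also shows attempted modifications.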
Being proactive about logging will surface the next questions that you need to answer. But remember to exercise caution when gathering data, especially in large volumes, since excessive collection can hurt system performance and raise risks for user security and privacy. Keep the main aspects of your software system in mind and expand your logging at a comfortable and efficient pace.