Running a successful company relies on current and accurate information about the underlying systems. Much of this information is contained within your application logs. By investing in your log monitoring solution, you can unlock these crucial insights and access a wealth of powerful data. This post presents a series of goals that will allow you to make the best possible use of your application logs.
Definition: Logging is an activity in almost all present-day applications, whether web, desktop, or mobile apps, or services. Application logs capture timestamped data comprising decisions taken, actions initiated, runtime characteristics, and error states.
Such log data is rich with information that can enable and support management, engineering, and operational tasks at all levels of an organization. It can be difficult to work out where to begin. Below are several goals to target when looking for how to measure the maturity and success of your logging solution.
When applications are composed of multiple services, service logs capture data that is similar to the data captured in application logs but specific to services. Also, the logs will typically capture auxiliary yet essential data (e.g., request-id) that, along with timestamps, can help collate the actions performed by different services in the context of a specific action performed by an application.
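To make the role of the request-id concrete, here is a minimal sketch of collating log records from several services into one ordered trace per request. The record layout (`ts`, `service`, `request_id`, `msg`), the service names, and the sample data are all assumptions made for illustration; real log schemas will differ.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical log records emitted by three services (field names are assumptions).
LOGS = [
    {"ts": "2024-01-01T10:00:00.120", "service": "payments", "request_id": "r-1", "msg": "charge authorized"},
    {"ts": "2024-01-01T10:00:00.010", "service": "gateway", "request_id": "r-1", "msg": "request received"},
    {"ts": "2024-01-01T10:00:00.050", "service": "gateway", "request_id": "r-2", "msg": "request received"},
    {"ts": "2024-01-01T10:00:00.300", "service": "orders", "request_id": "r-1", "msg": "order stored"},
]


def trace_by_request(records):
    """Group records by request_id and order each group by timestamp."""
    traces = defaultdict(list)
    for record in records:
        traces[record["request_id"]].append(record)
    for entries in traces.values():
        entries.sort(key=lambda r: datetime.fromisoformat(r["ts"]))
    return dict(traces)


traces = trace_by_request(LOGS)

# Print everything the services did on behalf of request r-1, in time order.
for entry in traces["r-1"]:
    print(entry["ts"], entry["service"], entry["msg"])
```

Sorting by timestamp alone assumes the services' clocks are reasonably synchronized; where they are not, an explicit sequence number or parent-request field would be a safer ordering key.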
Almost all paid services are accompanied by service-level agreements (SLAs) that codify the quality of service offered to customers. Service providers ensure that they honor SLAs by tracking if their services are fulfilling service level objectives (SLOs) derived from SLAs and defined in terms of service-level indicators (SLIs). Almost always, SLIs stem from the data about services/applications that are either logged by services/applications or captured and logged by the infrastructure running/supporting the services/applications. For example, the service latency can be derived from the arrival time of a request and the departure time of the corresponding response, which can be easily captured and logged by the infrastructure running/supporting the service.
Since SLAs are closely connected to log data via SLOs and SLIs, analyzing log data can be a great means of keeping tabs on the quality of the offered service and ensuring the promised quality is indeed provided.
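As a rough illustration of deriving an SLI from logged data, the sketch below computes request latency from logged arrival and departure times, then checks the share of fast requests against an SLO target. The field names, the 300 ms threshold, and the 70% target are assumptions chosen only to keep the example small.

```python
from datetime import datetime

# Hypothetical request log: arrival and departure times as the infrastructure might record them.
REQUESTS = [
    {"request_id": "r-1", "arrived": "2024-01-01T10:00:00.000", "departed": "2024-01-01T10:00:00.120"},
    {"request_id": "r-2", "arrived": "2024-01-01T10:00:01.000", "departed": "2024-01-01T10:00:01.450"},
    {"request_id": "r-3", "arrived": "2024-01-01T10:00:02.000", "departed": "2024-01-01T10:00:02.080"},
    {"request_id": "r-4", "arrived": "2024-01-01T10:00:03.000", "departed": "2024-01-01T10:00:03.200"},
]

LATENCY_THRESHOLD_MS = 300  # assumed: a request counts as "fast" at or below this latency
SLO_TARGET = 0.70           # assumed: at least 70% of requests should be fast


def latency_ms(record):
    """Latency = departure time of the response minus arrival time of the request."""
    arrived = datetime.fromisoformat(record["arrived"])
    departed = datetime.fromisoformat(record["departed"])
    return (departed - arrived).total_seconds() * 1000


def latency_sli(records, threshold_ms):
    """SLI: fraction of requests served within the latency threshold."""
    fast = sum(1 for r in records if latency_ms(r) <= threshold_ms)
    return fast / len(records)


sli = latency_sli(REQUESTS, LATENCY_THRESHOLD_MS)
slo_met = sli >= SLO_TARGET
print(f"SLI={sli:.2f} SLO met: {slo_met}")
```

In practice the SLI would be computed over a rolling window (for example, the last 30 days) rather than a fixed list of records.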
Modern applications deal with external data in some form, be it storing user data in the cloud (e.g., email services) or using locally stored data in conjunction with remote services (e.g., smart IDEs). Consequently, such applications need to prove compliance with rules and regulations governing the handling of external data. This goal is accomplished via audits of how an application, directly and indirectly, handles external data. Often, many bits of the data used in such audits are captured in logs that are related to the actions taken by an application (e.g., details about data access performed by an application) or in the context of an application (e.g., changes to a security group related to an application).
As logs capture data relevant to audits used to prove compliance, analyzing log data of an application is a great way to assess current practices used in the application and make changes to ensure the application is compliant.
As applications evolve, we need reference characteristics of the application to make assessments and decisions about evolution and maintenance. An application's target characteristics after the last change may seem like a good reference. However, current operating characteristics may differ from the desired characteristics due to a myriad of factors. For example, the latency of a service could have increased with the number of service users. Conversely, the same latency could have decreased because we migrated the service to a more powerful hardware platform.
Logs form an outstanding baseline for your application behaviour, but it can be difficult to derive patterns from such vast information. Coralogix provides machine learning based tooling to do this for you. Assuming application logs capture relevant data, we can analyze the logs and sketch the operating characteristics of an application.
Unlike desired characteristics of an application determined at design time, the operating characteristics of an application are fluid as a running application is exposed to its clients, service providers, and environment. Consequently, the health of an application is more prone to deterioration due to various factors such as increased client requests, failing internal components, and hardware malfunctions.
In most applications, data about such health-related factors are directly or indirectly captured in logs. So, by analyzing the logs, we can get a glimpse of the current health of the application by comparing its current operating characteristics with recent operating characteristics that were deemed healthy, e.g., using features like Ratio Alerts in Coralogix. Based on this glimpse, we can plan or take corrective and preventive action to restore the health of the application, as the goals that follow describe.
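The comparison against a healthy baseline can be sketched in a few lines. This is not how Ratio Alerts in Coralogix is implemented; it is a minimal illustration that assumes each log record carries a `level` field and that a doubling of the error ratio (`max_increase=2.0`, an arbitrary choice) signals degraded health.

```python
def error_ratio(records):
    """Fraction of log records at ERROR level."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r["level"] == "ERROR")
    return errors / len(records)


def health_check(baseline, current, max_increase=2.0):
    """Compare the current error ratio with a baseline window deemed healthy."""
    base = error_ratio(baseline)
    cur = error_ratio(current)
    # With an error-free baseline, any error in the current window counts as degraded.
    degraded = cur > base * max_increase if base > 0 else cur > 0
    return {"baseline": base, "current": cur, "degraded": degraded}


# Hypothetical windows: 5% errors when healthy, 20% errors now.
BASELINE = [{"level": "INFO"}] * 95 + [{"level": "ERROR"}] * 5
CURRENT = [{"level": "INFO"}] * 80 + [{"level": "ERROR"}] * 20

report = health_check(BASELINE, CURRENT)
print(report)
```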
Most modern-day applications either expose or rely on web-based service access points. This allows external actors to impact applications via malicious attacks. Often such attacks are characterized by unusual extrinsic behavior, such as increased network traffic and increased data ingress. They also often result in unusual network traffic and data access patterns.
Similar to external attacks, internal errors stemming from coding faults (e.g., incorrect units) or deployment time issues (e.g., incorrect configuration) can also cause application failures that result in unusual intrinsic behaviors such as increased request failures. These failures can also result in unusual network traffic and data access patterns.
Many applications and their deployments log data about network traffic, data ingress/egress and access, and the computational load generated by applications. So, we can employ techniques such as pattern mining and outlier analysis to analyze logs and identify unusual behavior and patterns. Equipped with such patterns, we can monitor the application logs for anomalies and take corrective action to prevent or mitigate failures, e.g., using simple rule-based features such as Automatic Threat Discovery or learning-based features such as Flow Anomaly Detection in Coralogix.
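As one simple form of outlier analysis, the sketch below flags unusual per-minute request counts using the modified z-score, which is based on the median absolute deviation (MAD). The sample counts are invented, and the 3.5 cutoff is a commonly used default rather than a tuned value. This is a generic statistical technique, not a description of the Coralogix features named above.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical requests-per-minute counts extracted from access logs.
REQUESTS_PER_MINUTE = [102, 98, 105, 99, 101, 97, 103, 640, 100, 104]

spikes = mad_outliers(REQUESTS_PER_MINUTE)
print("Suspicious minutes:", spikes)
```

The median-based score is used here instead of a mean and standard deviation because a single large spike inflates the standard deviation and can hide itself.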
Once an application is published, how the application is used (e.g., number of users, frequency of use of features) can affect its operating characteristics. Such effects can trigger maintenance changes, e.g., fixing bugs in existing features, addressing performance or scaling issues, and improving availability. They can also trigger feature evolution, e.g., improving the UX of existing features or adding new features.
While such changes can be driven by user feedback, they can also be implemented proactively by continuously monitoring the operating characteristics of an application and its environment. This allows you to identify improvement opportunities and plan to implement the changes before the users are affected by the absence of these changes.
As customer needs evolve and improvement opportunities are discovered, applications are modified and new versions are released/deployed. While such modifications are well-intentioned, they could turn out to be ineffective. Further, when alternative solutions are available to address an issue, measuring the effectiveness of alternative solutions (usually realized as different versions) after deployment is crucial to validate decisions.
In such situations, we can collect data that is necessary to measure the effectiveness of different versions, e.g., using the Tags feature in Coralogix. Alternatively, if we can identify the expected effects of the changes on the data that is currently logged by and about the application, then analysis of such log data can easily help us assess the effectiveness of different versions.
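If each log record is tagged with the version that produced it, comparing versions can be as simple as grouping by that tag. The sketch below assumes hypothetical `version` and `status` fields and treats HTTP 5xx responses as failures. It does not use the Coralogix Tags feature or any Coralogix API.

```python
from collections import Counter


def error_rate_by_version(records):
    """Compute the share of failed requests (status >= 500) for each version."""
    totals, errors = Counter(), Counter()
    for record in records:
        totals[record["version"]] += 1
        if record["status"] >= 500:
            errors[record["version"]] += 1
    return {version: errors[version] / totals[version] for version in totals}


# Hypothetical request logs from two deployed versions.
RECORDS = (
    [{"version": "1.4.0", "status": 200}] * 90
    + [{"version": "1.4.0", "status": 500}] * 10
    + [{"version": "1.5.0", "status": 200}] * 97
    + [{"version": "1.5.0", "status": 503}] * 3
)

rates = error_rate_by_version(RECORDS)
for version, rate in sorted(rates.items()):
    print(f"{version}: {rate:.0%} of requests failed")
```

A lower error rate for the newer version is only suggestive; with small samples or uneven traffic between versions, a significance test would be needed before drawing conclusions.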
While most changes to applications are intended to serve their users, they can have unintended detrimental effects. For example, a feature could ease the burden of authentication but, when combined with certain other features or processes, open a security vulnerability. In such cases, analyzing application logs can help uncover these second-order effects and allow us to respond quickly, with a full picture of the issue, using features such as New Error and Critical Log Detection in Coralogix.
Often application logs are rich with information about the behavior of their applications. Analyzing these logs can tell us a lot about the applications, their environments, and their clients. This information can help improve applications and provide better service to users. So, failing to analyze logs is failing to use a precious resource!