CI/CD pipelines have become a cornerstone of agile development, streamlining the software development life cycle. They allow for frequent code integration, fast testing, and rapid deployment.
Continuous Integration and Continuous Delivery (CI/CD) delivers software quickly, effectively, and reliably. As a result, CI/CD pipelines have become the mainstay of effective DevOps. But this process needs accurate, timely, contextual data if it’s to operate effectively. That critical data comes in the form of logs, and this article will guide you through optimizing logs for CI/CD solutions.
Logs offer insight into specific events at specific points in time, providing data that can be used to forensically identify issues that could cause problems in a system. But logs, especially in modern, hyperconnected ecosystem-based services, require appropriate optimization to be effective. Logging practice has evolved a new approach in response, one that optimizes the data delivered by logs and, in turn, creates actionable alerts.
JSON effectively adds “life to log data”. It’s an open-standard file format that allows data to be communicated across web interfaces. Log data delivered as JSON is easy to read and easy to visualize.
It can be filtered, grouped, tagged by type, and labeled. These features make JSON perfect for building focused queries, filtering on two or more fields, or zeroing in on a range of values within a specific field. At the end of the day, this saves developers time. It’s also straightforward to transform legacy data into JSON format, so it really is never too late.
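As a minimal sketch of what JSON logging looks like in practice, here is a formatter for Python’s standard `logging` module that emits each record as one JSON object. The field names (`timestamp`, `level`, `message`, `logger`) are illustrative choices, not a prescribed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (field names are illustrative)."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "logger": record.name,
        }
        return json.dumps(entry)

# Attach the formatter to a handler so every line is machine-parseable.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.warning("payment retry scheduled")
```

Because every line is valid JSON, downstream tooling can filter or group on any field without brittle regex parsing.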
Standardizing log severity definitions across the organization is important. These levels act as a baseline for understanding what actions should be taken. Typical severity levels for logs are:
Debug: describes a background event. These logs are not usually needed except to add context and analysis during a crisis. These logs are also useful when tuning performance or debugging.
Info: logs that represent transaction data, e.g. a transaction was completed successfully.
Warning: these logs represent unplanned events. For example, they may show that an illegal character in a username was received but is being ignored by the service. Over time, they help locate possible flaws in a system and guide changes that improve system usability.
Error: represents a failure in a process that may not materially affect the entire service. These logs are most useful when aggregated and monitored for trends that point to service issues.
Critical: these logs should trigger an immediate alert and action. A critical or fatal log implies a serious system failure that needs immediate attention. Typically, only a tiny fraction of a day’s logs (on the order of 0.01%) should be this severe.
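The severity ladder above maps directly onto the levels built into Python’s standard `logging` module; a small sketch like the following (logger name and threshold are illustrative) shows how an agreed-upon level acts as a baseline for what gets recorded:

```python
import logging

# The stdlib levels mirror the severity ladder described above; agreeing
# on these thresholds org-wide keeps dashboards and alerts consistent.
SEVERITY = {
    "debug": logging.DEBUG,        # background detail, kept for forensics
    "info": logging.INFO,          # successful transactions
    "warning": logging.WARNING,    # unplanned but tolerated events
    "error": logging.ERROR,        # failed process, service still up
    "critical": logging.CRITICAL,  # serious failure, act immediately
}

logger = logging.getLogger("orders")
logger.setLevel(SEVERITY["warning"])  # drop debug/info noise in production
```

With the threshold set to `warning`, debug and info records are filtered out before they ever reach a handler, while errors and critical events always get through.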
Once severity definitions are settled, a number of best practices further enhance the effectiveness of logging and help optimize a CI/CD pipeline.
Log communication between components
Services are often ecosystems of interconnected components. All events in the system should be logged, including those that cross the boundary between components. Record the full event lifecycle: what happens at each component as well as during communication between components.
Log communications with external APIs
The API economy has facilitated the extended service ecosystem, but API events often happen outside your own organization. Optimized logging records what is communicated across the API layer. For example, if your offering uses a service such as SendGrid to send communication emails to users, you need to know whether critical alert emails are actually being sent; if not, that needs to be addressed. Log everything up to the point of handing off to an API, and log the response from the other component, to achieve a comprehensive view.
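The pattern can be sketched as follows. The provider call here is a stand-in stub (a real integration would use the provider’s SDK or HTTP API); the point is the two log lines that bracket the external hand-off:

```python
import json
import logging

logger = logging.getLogger("email-service")

def send_via_provider(payload):
    """Stand-in stub for an external email API (e.g. a SendGrid-style provider)."""
    return {"status_code": 202, "body": "accepted"}  # simulated response

def send_alert_email(to, subject):
    payload = {"to": to, "subject": subject}
    # Log everything up to the point of handing off to the external API...
    logger.info("outbound email request: %s", json.dumps(payload))
    response = send_via_provider(payload)
    # ...and log the provider's response, so delivery failures are visible.
    if response["status_code"] >= 400:
        logger.error("email provider rejected request: %s", json.dumps(response))
    else:
        logger.info("email provider accepted request: %s", json.dumps(response))
    return response
```

With both sides of the hand-off logged, a missing critical email shows up as a gap or an error in your own logs rather than a silent failure inside someone else’s system.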
Add valuable metadata to your log
In modern service ecosystems, many stakeholders need access to logs, including Business Intelligence teams, DevOps, support engineers, and others. Logs should include rich metadata, e.g. location, service name, version, environment name, and so on.
You may not be the only one reading the logs. As companies scale, other stakeholders often need access in order to change code or investigate issues, and centralizing the log data becomes a necessity.
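A simple way to guarantee that metadata is present on every line is to merge a shared block of service fields into each event before it is emitted. In this sketch, the metadata values and field names are illustrative:

```python
import json

# Static metadata attached to every log line (values are illustrative).
SERVICE_METADATA = {
    "service": "billing-api",
    "version": "2.4.1",
    "environment": "production",
    "region": "eu-west-1",
}

def enrich(event: dict) -> str:
    """Merge shared service metadata into a log event; event fields win on conflict."""
    return json.dumps({**SERVICE_METADATA, **event})

print(enrich({"level": "info", "message": "invoice generated", "invoice_id": "inv-1"}))
```

Because the metadata is merged centrally, a BI analyst or support engineer can filter any log by service, version, or environment without each developer remembering to add those fields.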
Combine textual and metric fields
Improve at-a-glance understanding of logs by combining explanatory text (e.g., “failed to do xxx”) with metric fields that provide more actionable insights. This gives readers a snapshot view of the issue before they drill into performance data.
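Concretely, a combined line might look like the following sketch, where the message carries the human-readable story and the numeric fields (names are hypothetical) support charts and filters:

```python
import json

def log_with_metrics(message, **metrics):
    """Pair a human-readable message with numeric fields for querying (sketch)."""
    line = json.dumps({"message": message, **metrics})
    print(line)
    return line

# The text gives the at-a-glance story; the fields back it with queryable data.
log_with_metrics("failed to renew subscription",
                 duration_ms=5230, retry_count=3, status_code=504)
```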
The Art and Science of Alerts
Logs are a leading indicator of issues and can be used for more than postmortem analysis. While infrastructure metrics only present the outcome of problematic code or performance, logs are the first encounter with code in use. As such, logs offer an easy way to spot problems before they are felt at the user end. This is key to CI/CD enhancement.
Logs need to be accurate and have context (unlike metrics). Making alerts specific and contextual is a way to tease greater intelligence out of them. Classifying alerts ensures that your organization gets the most from them without overreaching.
Immediate alert: These alerts point to a critical event and are generated from critical and fatal logs. They require immediate attention to fix a critical issue.
“More than” alert: sent out if something happens more than a predefined number of times. For example, if normally fewer than 1 in 1,000 users fail to pay, an alert might fire when more than 10 users fail to pay within a given window. If these alerts are properly defined and sent to the right channel, they can be acted upon and are highly effective.
“Less than” alert: These tend to be the most proactive alert type. They are sent when something is NOT happening.
If these alerts are channeled correctly, have context, and can be interpreted easily, they will be actionable and useful.
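A “more than” check can be sketched as a simple threshold over a window of recent log events. The threshold, event names, and window contents below are illustrative, not a prescribed configuration:

```python
from collections import Counter

ALERT_THRESHOLD = 10  # e.g. more than 10 payment failures in the window

def check_more_than(events, event_type, threshold=ALERT_THRESHOLD):
    """Fire when an event occurs more often than expected in a window (sketch)."""
    count = Counter(e["type"] for e in events)[event_type]
    return count > threshold

# Simulated window: 12 failures among 500 successes should trip the alert.
window = [{"type": "payment_failed"}] * 12 + [{"type": "payment_ok"}] * 500
if check_more_than(window, "payment_failed"):
    print("ALERT: payment failures above threshold")
```

A “less than” alert is the mirror image: fire when the count falls below a floor, which catches things that have silently stopped happening.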
Coralogix goes further, adding extra granularity to alerts:
Dynamic alert: Set a dynamic threshold for criteria.
Ratio alert: Allows an alert based on a ratio between queries.
Classification of alerts is one criterion; another is alert structure, which ensures that alerts reach the right person or team. For example, if you use Slack to collate alerts, a dedicated area can be set up to collate logging and metrics, and alerts can then be directed to specific teams. This technique also helps alleviate alert fatigue.
To Push or Not to Push an Alert
The decision to push an alert, or not, is an important aspect of creating an effective ‘alert culture’. Ask yourself: how would you feel if you received a specific type of alert at 2 am, and what would you do? If you would be angered by getting this alert at that time, don’t push it. If, instead, your reaction is to say, “I must look at this tomorrow”, surface the alert on your dashboard, but DO NOT push it. If your reaction is to stop what you are doing and respond, push the alert.
This simple logic goes a long way toward making alerts actionable and workable, reducing alert fatigue and keeping the team happy.
Coralogix is a full-stack observability platform that drastically reduces logging costs while improving your ability to query, monitor, and manage log data, and extends the value of that data even further by turning logs into long-term metrics. Start a conversation with our logging experts today by clicking the chat button in the bottom right corner, or start a free trial.