Adding Observability to Your CI/CD Pipeline in CircleCI

The simplest CI/CD observability pipeline consists of three stages: build, test, and deploy. 

In modern software systems, it is common for several developers to work on the same project simultaneously. Siloed working with infrequent merging of code in a shared repository often leads to bugs and conflicts that are difficult and time-consuming to resolve. To solve this problem, we can adopt continuous integration.  

Continuous integration is the practice of writing code in short, incremental bursts and pushing it to a shared project repository frequently so that automated build and testing can be run against it. This ensures that when a developer’s code gets merged into the overall project codebase, any integration problems are detected as early as possible. The automatic build and testing are handled by a CI server.

If passing the automated build and testing results in code being automatically deployed to production, that is called continuous deployment. 

All the sequential steps that need to be automatically executed from the moment a developer commits a change to it being shipped to production is referred to as a CI/CD pipeline.  CI/CD pipelines can range from very simple to very complex, depending on the needs of the application.

Important considerations when developing CI/CD pipelines

Building a CI/CD pipeline is no simple task. It presents numerous challenges, some of which include:

Automating the wrong processes

The whole premise of CI/CD is to increase developer productivity and optimize time-to-market. This goal gets defeated when the CI/CD pipeline has many steps in it that aren’t necessary or that could be done faster manually. 

When developing a CI/CD pipeline, you should:

  • consider how long a task takes to perform manually and whether it is worth automating
  • evaluate all the steps in the CI/CD pipeline and only include those that are necessary
  • analyze performance metrics to determine whether the pipeline is improving productivity
  • understand the technologies you are working with and their limitations as well as how they can be optimized so that you can speed up the build and testing stages.

Ineffective testing

Tests are written to find and remove bugs and ensure that code behaves in the desired manner. You can have a great CI/CD pipeline in place but still get bug-ridden code in production because of poorly written, ineffective tests. 

To improve the effectiveness of a CI/CD pipeline, you should:

  • write automated tests during development, ideally by practicing test-driven development (TDD)
  • examine the tests to ensure that they are of high quality and suitable for the application
  • ensure that the tests have decent code coverage and cover all the appropriate edge cases

Lack of observability in CI/CD pipelines

Continuous integration and continuous deployment underpin agile development. Together they ensure that features are developed and released to users quickly while maintaining high quality standards.  This makes the CI/CD pipeline business-critical infrastructure.

The more complex the software being built, the more complex the CI/CD pipeline that supports it. What happens when one part of the pipeline malfunctions? How do you discover an issue that is causing the performance of the CI/CD pipeline to degrade?

It is important that developers and the platform team are able to obtain data that answers these critical questions right from the CI/CD pipeline itself so that they can address issues as they arise.

Making a CI/CD pipeline observable means collecting quality and performance metrics on each stage of the CI/CD pipeline and thus proactively working to ensure the reliability and optimal performance of this critical piece of infrastructure. 

Quality metrics

Quality metrics help you identify how good the code being pushed to production is. While the whole premise of a CI/CD pipeline is to increase the speed at which software is shipped to get fast feedback from customers, it is also important to not be shipping out buggy code.

By tracking things like test pass rate, deployment success rate, and defects escape rate you can more easily identify where to improve the quality of code being produced. 

Productivity metrics

An effective CI/CD pipeline is a performant one. You should be able to build, test, and ship code as quickly as possible. Tracking performance-related metrics can give you insight into how performant your CI/CD pipeline is and enable you to identify and fix any bottlenecks causing performance issues.

Performance-based metrics include time-to-market, defect resolution time, deployment frequency, build/test duration, and the number of failed deployments. 

Observability in your CI/CD pipeline

The first thing needed to make a CI/CD pipeline observable is to use the right observability tool. Coralogix is a stateful streaming analytics platform that allows you to:

The observability tool you choose can then be configured to track and report on the observability metrics most pertinent to your application.

When an issue is discovered, the common practice is to have the person who committed the change that resulted in the issue investigate the problem and find a solution. The benefit of this approach is that it makes team members have a sense of complete end-to-end ownership of any task they take on as they have to ensure it gets shipped successfully. 

Another good practice is to conduct a post-mortem reviewing the incident to identify what worked to resolve it and how things can be done better next time. The feedback from the post-mortem can also be used to identify where CI/CD pipeline can be improved to prevent future issues. 

Example of a simple CircleCI CI/CD pipeline

There are a number of CI servers you can use to build your CI/CD pipeline. Popular ones include Jenkins, CircleCI, Gitlab and a newcomer Github Actions.

Coralogix provides integrations with CircleCI, Jenkins, and Gitlab that enable you to quickly and easily send logs and metrics to Coralogix from these platforms. 

The general principle of most CI servers is that you define your CI/CD pipeline in a yml file as a workflow consisting of sequential jobs. Each job defines a particular stage of your CI/CD pipeline and can consist of multiple steps. 

An example of a CircleCI CI/CD pipeline for building and testing a python application is shown in the code snippet below.

To add a deploy stage, you can use any one of the deployment orbs CircleCI provides. An orb is simply a reusable configuration package CircleCI makes available to help simplify your deployment configuration. There are orbs for most of the common deployment targets, including AWS and Heroku. 

The completed CI/CD pipeline with deployment to Heroku is shown in the code snippet below.

Having created this CI/CD pipeline you would think that you are done, but in fact, you have only done half the job. The above CI/CD pipeline is missing a critical component to make it truly effective: observability. 

Making the CI/CD pipeline observable

Coralogix provides an orb that makes it simple to integrate your CircleCI CI/CD pipeline. This enables you to send pipeline data to Coralogix in real-time for analysis of the health and performance of your pipeline. 

The Coralogix orb provides four endpoints:

  • coralogix/stats for sending the final report of w workflow job to Coralogix
  • coralogix/logs for sending the logs of all workflow jobs to Coralogix for debugging
  • coralogix/send for sending 3rd party logs generated during a workflow job to Coralogix
  • coralogix/tag for creating a tag and a report for the workflow in Coralogix

To add observability to your CircleCI pipeline:

  1. In your Coralogix account, go ahead and enable Pipelines by navigating to Project Settings -> Advanced Settings -> Pipelines and turn it on
  2. Add the Coralogix orb stanza at the top of your CircleCI configuration file
  3. Use the desired Coralogix endpoint in your existing pipeline

The example below shows how you can use Coralogix to debug a CircleCI workflow. Adding the coralogix/logs job at the end of the workflow means that all the logs generated by CircleCI during the workflow will be sent to your Coralogix account, which will allow you to debug all the different jobs in the workflow. 

Conclusion

CI/CD pipelines are a critical piece of infrastructure. By making your CI/CD pipeline observable you turn it into a source of real-time actionable insight into its health and performance.

Observability of CI/CD pipelines should not come as an after-thought but rather something that is incorporated into the design of the pipeline from the onset. Coralogix provides integrations for CircleCI and Jenkins that make it a reliable partner for introducing observability to your CI/CD pipeline. 

Continuously Manage Your CircleCI Implementation with Coralogix

For many companies, success depends on efficient build, test and delivery processes resulting in higher quality CI/CD solutions. However, development and deployment environments can become complex very quickly, even for small and medium companies.

A contributing factor to this complexity is the high adoption rate of microservices. This is where modern CI/CD solutions like CircleCI come in to provide greater visibility. In this post, we’ll walk through the Coralogix-CircleCI integration and how it provides data and alerts to allow CircleCI users to get even more value from their investment.

See this post for how to set up the integration.

Once completed, the integration ships CircleCI logs to Coralogix. Coralogix leverages our product capabilities and ML algorithms providing users with deep insight and proactive management capabilities. It allows them to continuously adjust, optimize and improve their CI processes. To borrow from CI/CD vocabulary, it becomes a CM[CI/CD], or a Continuously Managed CI/CD, process.

In addition, using the CircleCI Orb will automatically tag each new version you deploy allowing you to enjoy our ML Auto Version Benchmarks. 

The rest of this document provides examples of visualizations and alerts that will help implement this continuous management of your CircleCI implementation.

Visualizations

Top committers

This table shows the top committers per day. The time frame can be adjusted based on specific customer profiles. It uses the user.name’ field to identify the user.

Jobs with most failures

This visualization uses the ‘status’ and ‘workflows.job_name’ fields. It shows the jobs that had the most failures per time frame specified by the customer profile.

Class distribution

Class usage relates to memory and resource allocation. This visualization uses the field  ‘picard.resource_class.class’, part of the log’s picard object. Its optional values are ‘small’, ‘medium’, and ‘large’. The second field used is ‘workflows.workflow_name’ that holds the workflow name. Although classes are set by configuration and do not change dynamically, they are tightly related to your credit charges. It will be good to have this monitored to identify if a developer unexpectedly runs a new build that will impact your quota.

For the same reasons mentioned above, it could be beneficial to see a view of the class distribution per executors like the following example:

Average job runtime

The following visualizations help identify trends in job execution time, per build environment, and per executor. They give different perspectives on the average job runtime.

The first one gives an executor perspective for each job environment. It uses the ‘picard.build_agent.executor’ and the ‘build_time_millis’ to calculate the average per environment and per day (in this example day is the aggregation unit).

Depending on your needs, you can change the time frame for calculating the averages. It is important to note that the visualization should calculate the average time of successful job runs based on the filter ‘status:success’. 

In a very similar way, this visualization shows the average job runtime per workflow, using the ‘workflows.workflow_name’ field:

This visualization shows the average job runtime for each job:

Depending on preference these visualizations can be configured to show a line graph. This is applicable to companies with higher frequency runs:

Job runs distribution per workflow

This visualization gives information about the number of runs for each workflow’s job. It can alert the analyst or engineer to situations where a job has more than the usual number of runs due to failures or frequent builds:

Number of workflows

These two are quite simple. Again, it is the user specific circumstances that will determine the time range for data aggregation.

Job status ratio

This visualization shows the distribution of job completion statuses. There are four optional values; ‘success’, ‘onhold’, ‘failed’, and ‘cancelled’. The ‘onhold’ and ‘cancelled’ values are very rare. It is important to get visibility into the ratio as an indicator of when things actual do go wrong.

Alerts

In this section we will show how you can transform managing your deployment environment to be more proactive by using some of Coralogix’s innovative capabilities. Such as, its ability to identify changes from normal application behavior and the easy-to-use alert engine. Here are a few examples of alerts covering different aspects of monitoring CircleCI insights.

This tutorial explains how to use and define Coralogix alerts.

Job failure

This is an operational alert. It sends a notification when a job named ‘smtt’ fails.

Alert query:

workflows.job_name:smtt AND status:failed

If we have more than one job with the same name, running in different workflows, we can add the workflow name to the query, ‘AND workflows.workflow_name:CS’. The alert query can also be changed to capture a set of alerts based on a query language. For example ‘workflows.job_names:sm*’ or ‘workflows.job_names:/sm[a-z]+/’.

The alert condition is an immediate alert

Job duration

This is another example of an operational alert that sends a notification if a job runtime is larger than a threshold. We use ‘build_time_millis.numeric’. This is a numeric version Coralogix creates for every field. Since every job is different, the query for this alert can come with a job or workflow name. It can also look for outlier values like in this example:

Alert query:

build_time_millis.numeric:>20000

Alert condition:

Ratio of ‘failure’ compared to to all job runs is above the threshold

In this operational alert, a user will get a notification when the ratio of failed job runs to overall number of runs, is over a certain threshold. In this purpose, we’ll use the Coralogix ‘Ratio alerts’ type. For this alert type, users define two queries and then alert on the ratio between the number of logs in both queries’ results. Our query example counts the overall number of jobs and the number of failures:

Alert query 1:

status:*

Alert query 2:

status:failed

This condition alerts the user if failures are more than 25% of job outcomes:

SSH Disabled

This is a security alert. The field ‘ssh_disabled’ is a boolean field. When false, it indicates that users are running jobs and workflows remotely using SSH. For some companies, SSH runs will be considered a red flag.

Alert query:

ssh_disabled:false

Alert condition:

If choosing specific fields for the notification using the Coralogix Alert Notification Content, make sure you include ‘ssh_users’. Its value is an array of strings that includes the user names of the SSH users. 

You can of course set security alerts based on other key-value pairs like ‘user.login’, ‘user.name’, or ‘picard.build_agent.instance_ip’.

As an example, this query will create an alert if picard.build_agent.instance_ip does not belong to a group of approved IP addresses that start with 170:

Alert query:

NOT picard.build_agent.instance_ip.keyword:/170.d{1,3}.d{1,3}.d{1,3}/

To learn more about keyword fields and how to use regular expressions in queries see our queries tutorial.

As you know each company has its own build schedule and configuration. One of the powers of Coralogix is its ease of use and flexibility, allowing you to take the concepts and examples found in this document and adapt them to your own environment and needs. We are always an email or intercom chat away.