Understanding the Three Pillars of Observability

Data observability and its implementation may mean different things to different people. But underneath all the varying definitions is a single, clear concept:

Observability enables you to understand what is happening in your software by looking at externally available information.

Most software run today uses microservices or a loosely coupled distributed architecture. While this design makes scaling and managing your system more straightforward, it can make troubleshooting issues more difficult.

The three pillars of observability are different methods for tracking software systems, especially microservices: event logs, metrics, and traces. Using the three pillars together rather than individually significantly increases DevOps teams’ productivity and gives your users a better experience interacting with your system.

Let’s dive into each of the three pillars, the insights they offer, and their drawbacks. We will also examine how using them in combination vastly improves your system’s observability.

Working with Logs

A log is a timestamped record of an event that occurred in your software. This record is the most granular information available in any of the three pillars. It is up to the developers to implement logging in code, ideally using some kind of standard. Logs are easy to implement since most software libraries and languages provide built-in support. 

The DevOps team might want logs to be:

  1. Plaintext or Unstructured: some free-form human-readable string
  2. Structured: formatted in a consistent manner such as JSON objects

The format chosen depends on how DevOps teams will use the logs in troubleshooting. Plaintext logs are commonly used when first prototyping a system or mocking data. These logs are helpful because they are easy for developers to read and create. Structured logs are the preferred format for modern software development since structures like JSON lend themselves well to analytics.
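To make the difference concrete, here is a minimal sketch of structured logging in Python using only the standard library; the logger name and field set are illustrative choices, not a prescribed schema:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""
    def format(self, record):
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Structured: every field is individually queryable by an analytics tool
logger.info("payment processed")
# The plaintext equivalent: easy to read, hard to aggregate
print("2021-10-01 12:01:33 INFO payment processed")
```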

Insights from Logs

DevOps professionals need to know what happened in the software to troubleshoot issues at the system or software level. Logs provide insights into what happened before, during, and after a problem occurred. A trained eye monitoring logs can tell what went wrong during a specific time segment in a specific piece of software.

Logs allow for analysis at the most granular level of any of the three pillars. Use logs to uncover root causes for your system’s issues and find why incorrect, unpredictable, or suboptimal system behaviors occur.

Limitations of Logs

Logs can show what is happening in a specific piece of software. For companies running microservices, the issue may not lie within a given service but in how different functions are connected. To understand the linkages between microservices, DevOps professionals need to look at another of the three pillars of observability: traces.

In some situations, DevOps and business teams may need to define the urgency of a problem based on how often it occurs. Logs alone may show the problem but not how often it has occurred. To solve this issue, DevOps professionals again need to look to another of the three pillars of observability: metrics.

Retaining logs over long periods can increase costs due to the amount of storage required to save all the information. Similarly, spinning up new containers or instances to handle increases in client activity means increasing logging volume and storage cost. That’s why a platform like Coralogix is indispensable, especially with its new pricing model and philosophy that helps you categorize your data and pay by usage.

The Value of Metrics

Metrics represent system performance data measured over time as numerical values. A metric conveys information about a defined and measurable attribute of your system, such as a service-level indicator (SLI).

Since metrics are numerical values fluctuating over time, teams often choose to represent them graphically. Graphical analysis allows DevOps professionals to quickly see how aspects of the system are behaving over time. Often, different tools are required for collecting and displaying metrics. Prometheus, an open-source metrics tool, can send data to Grafana, a popular visualization tool.

Metrics can trigger alerts when their value crosses a preset threshold. Typical metric measuring tools such as Prometheus have built-in alerting capabilities. Alerts give DevOps teams knowledge of when the system needs maintenance or troubleshooting and what issues have arisen at any given time. 
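As a rough illustration, the sketch below uses the Python prometheus_client library to expose a request counter and a latency histogram that Prometheus could scrape; the metric names, labels, and port are arbitrary choices for the example:

```python
import time
from prometheus_client import Counter, Histogram, start_http_server

# Labels add dimensionality; a Prometheus alert rule could then fire on an
# expression like: rate(http_requests_total{status="500"}[5m]) > 1
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["method", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

@LATENCY.time()  # observes how long each call takes
def handle_request():
    REQUESTS.labels(method="GET", status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # serves the /metrics endpoint for scraping
    while True:
        handle_request()
        time.sleep(1)
```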

Insights from Metrics

Unlike logs, metrics do not necessarily scale up in cost and storage requirements as your client activity increases. Since they are just summations or averages of measurable system attributes, the numbers simply shift with changing activity rates. Adding more containers or instances to your system may add more dimensions to your metrics, but because metrics are so compact, this should not significantly affect cost.

Metrics are better suited for alerting than logs since they are already numerical values to which a simple threshold can be applied. They also lend themselves to statistical analysis, which makes the data useful in visualization and troubleshooting.

Some log analysis and observability tools are also adding features that convert log data to metrics since it’s more scalable and easier to monitor and alert on.

Limitations of Metrics

Metrics tend to include a name and key-value pairs containing metadata relevant to the metric. The metadata values are called labels in Prometheus. The more labels provided, the more detail you have about what the metric means in your system. Another way of saying this is that the data has dimensionality. Without labels, the metric data has no context and is more challenging to use when troubleshooting system issues. 

Some labels may need to contain high-cardinality data: a label with many distinct values, such as a unique user identifier in a system with thousands or millions of users. High-cardinality data is difficult to query and can cause time delays and efficiency issues for the tools processing your metric data.

For metrics to be helpful, you first need to identify what to track. Ops teams will track system aspects such as availability, request rate, system utilization, and error rates. It’s up to you to work out the best metrics to track for your system.

Ops teams must also take care not to track too many items; otherwise, they will have more data than they can effectively analyze.

Understanding Tracing

Tracing is a valuable paradigm for any distributed system. Traces use a universally unique identifier for each piece of data. This unique identifier travels with the data, allowing you to track its lifecycle as it moves through your microservices.

The concept of tracing was introduced as distributed computing and microservices rose in popularity. In systems built on stateless computing, it can quickly become difficult to track data sent to multiple services for subsequent processing. Tracing is useful whenever multiple components exist in a system and data is passed between them.
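As a minimal sketch of the idea, the snippet below uses the OpenTelemetry Python SDK to create a parent and child span that share one trace ID; the service and span names are hypothetical, and a real deployment would export spans to a collector rather than the console:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout; swap in an OTLP exporter for real use
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")  # hypothetical service name

def check_inventory(order_id: str):
    # Child span: automatically carries the parent's trace ID
    with tracer.start_as_current_span("check_inventory"):
        pass  # call the inventory service here

def place_order(order_id: str):
    # Root span: a new trace ID is generated and propagated downward
    with tracer.start_as_current_span("place_order"):
        check_inventory(order_id)

place_order("order-42")
```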

Insights from Traces

The trace picks up data that allows DevOps teams to understand what path the data has taken, how long it takes to traverse that path, and the data’s structure at each step. With this information, ops teams can identify bottlenecks in your system and debug steady-state problems with data flows.

Limitations of Traces

Tracing is the most difficult of the three pillars to implement, especially if you need to add it to an existing system. Every component or function along the data’s path needs to propagate the trace data for it to be effective. For large codebases, this can mean developers need to update many functions.

If your system uses different coding languages or frameworks, tracing can also be complicated to implement. Since all functions need to propagate traces, developers may need to find different tracing methods for each language or framework used in a system. Systems that use the same language and framework can be retrofitted with tracing more easily than heterogeneous systems.

Greater than the Sum of its Parts

The three pillars of observability are event logs, metrics, and traces. Each pillar provides different insights into the health of your system. Implementing only some of the three pillars means your teams will not have complete insight into the system’s functions and cannot troubleshoot or enhance the system efficiently. Implement the three pillars of observability together to give your system the best possible outcome.

By taking a holistic approach to observability, teams can be both proactive and reactive in maintaining their system’s health. They can proactively receive alerts when metrics move outside of known thresholds. They can effectively react to alerts and customer feedback by looking at high-cardinality traces and high-granularity logs to understand what happened in the system at any given time. Having all this data together reduces the mean time to resolution for any given problem.

Other support tools use machine learning to identify thresholds and issues that DevOps or business teams have not foreseen. These algorithms learn your system’s behavior over time and detect when abnormal logs, traces, or metrics are present. Coralogix’s Machine Learning tools, such as its Flow Anomaly detector, can alert DevOps teams to investigate issues expeditiously.

What is Observability?

Data observability is a term that is becoming commonplace in both startups and enterprises. Log observability is different from monitoring: it provides visualized metrics from a variety of systems in a single-pane-of-glass view. This is invaluable for organizations seeking to understand the interdependencies and links between external events and internal performance.

The need for observability has been driven by increasingly complex systems and the link between user experience and platform performance. In this article, we are going to explain what observability is, and why it’s so important.

Defining Observability

Observability is different from the concept of monitoring because it both provides context for the insights being examined and surfaces those insights without you having to define what you’re looking for.

Observability requires three constituent parts to work: logs, metrics, and traces. However, observability goes much further than traditional monitoring because of how it combines these three elements.

Logs for Observability

Of the three elements of observability, logs are the most granular in detail. They are time-stamped records of events occurring in a system or with software. Logging is commonplace nowadays, although there are still best practices that can be followed. Logs come in structured or unstructured forms, and whilst structured logs may be seen as the newer (and perhaps preferred) form, software engineers will choose the format of their logs based on their requirements.

The reason logs are so useful for observability is that they provide an actual timeline of events so that you can analyze what went wrong and when. Logs alone, however, are not a perfect answer to understanding your system. Logs are costly to store, and only relay information regarding one component of a system. 

Metrics for Observability

Different from logs, metrics are dynamic numerical values that change over time. They might relate to varying disk capacity, network performance, or even data from marketing systems. They are best viewed in a graph or with a data visualization tool, given the discrete nature of the data. Metrics usually form the basis for alerts, as a defined numerical threshold can act as a trigger for a DevOps engineer to intervene.

The issue with metric-based alerts alone is that you are only notified when a threshold is hit. If the alert threshold is at the wrong level, you might hear about an issue too late. By themselves, metrics don’t allow you to diagnose a problem, only make you aware of a problem.

Traces for Observability

Tracing is the last of the three constituent aspects of observability. While logs are discrete data and metrics are continuous data, traces use unique data “tags” to follow data throughout a process or application workflow. System traceability is complex to implement but invaluable for identifying performance bottlenecks across an entire distributed system.

Traces give you the ability to understand where problems are taking place, which is difficult in a microservices architecture. Distributed tracing gives visibility in a complex architecture, but its complexity to implement is directly linked to the complexity of your system.

The Power of Three 

Observability is the combination of the three constituent elements of log monitoring, metrics, and traces. Each of these three elements provides different insights for separate aspects of your system. 

The power of observability comes from the successful implementation, collation, and analysis of these three elements. Observability is the extrapolation of the inherent strengths of a good monitoring strategy. A true observability platform allows you to view, analyze, and interrogate data that gives you real insights into your system’s health and performance. By collecting holistic data from disparate elements of your system, and viewing them in a single pane of glass, you’ll be empowered to optimize your system performance and diagnose any problems.

Benefits and Challenges of Observability

Like any newer practice or principle, observability has both positives and negatives. The positives far outweigh the negatives, and most of the challenges come from the unsuccessful or incomplete implementation of an observability platform. 

Benefits of doing things differently

Organizations that fully invest in an observability process and platform like Coralogix see a tangible value-add. The ability to compare and analyze metrics from marketing systems alongside system performance can allow you to spot performance bottlenecks or even a misconfigured load balancer. 

Translating this sort of diagnosis into a cost-saving or customer experience improvement is a clear benefit. 

Traditional log monitoring means viewing data in isolation or having to switch between systems and log files to diagnose a problem. Observability allows metrics and traces to be overlaid with log data so that you spot where things could be improved without the need to dig through reams of logs.

Where challenges can arise

Most challenges that crop up in relation to observability come from not embracing it fully.  For example, if your applications are built in various programming languages, you’ll find that it will be costly and time-consuming to implement full traceability. By only having traces on some aspects of your system, you’re likely to receive an unbalanced or misrepresented view of your system’s performance.

An additional challenge may appear in the nature of observability tooling itself. SaaS observability tools are still behind the curve of progress, in comparison with the tools that they’re monitoring. Fortunately, Coralogix can use S3 as a log repository to make a single destination for disparate applications’ logs.

Getting Started with Log Observability

We’ve covered what makes up good observability practice, as well as some things to keep in mind to avoid any pitfalls. To summarise, it’s important to have your log monitoring, traces, and metrics correctly configured and implemented before looking at observability. These three pillars form the foundation of good observability practice. 

You also need to keep in mind the limitations of the data observability that your system may produce, and embrace the right tools to overcome those challenges. Be it logs from on-prem applications, or traces not being applied to all elements of your system, Coralogix is well placed to pull everything together and display the insights you need.

Observability and Cyber Resiliency – What Do You Need To Know?

Observability is one of the biggest trends in technology today. The ability to know everything, understand your system, and analyze the performance of disparate components in tandem is something that has been embraced by enterprises and start-ups alike.

What additional considerations need to be made when factoring in cyber resiliency? A weekly review of the headlines reveals a slew of news covering data breaches, insider threats, or ransomware. The latter is particularly of concern, with multinationals and government institutions having their precious data ransomed for millions of dollars of cryptocurrency by faceless hackers.

In this article, we’ll examine two stances on the relationship between observability and cyber resiliency. First, we will look at the cyber resiliency considerations you need to be mindful of when implementing and configuring your observability solution. Then, we’ll move on to examining how a correctly deployed observability solution can empower your organization’s cyber resiliency.

Log data – cyber resiliency considerations

Logs are the indicators of what’s going on with your system. When they aren’t inside an observability platform, they are just rows and rows of raw textual data in a database. In the NIST Cyber Security Framework, logs are a big part of both the ‘identify’ and ‘detect’ pillars. The reason for this is simple: logs are critical when it comes to identifying when a system is compromised.

How long should you keep your logs?

According to IBM, companies take an average of 197 days to identify a security breach. This means that a hacker could reside in your system, undetected, for well over half a year. If your log data is going to be a pivotal part of identifying an attack, as well as diagnosing one, you need to keep hold of it. As a guide, then, any critical system logs (as well as logs from any web- or application-facing systems) should be retained for at least a year.

Naturally, with any storage requirement, there will be costs associated with prolonged log retention. There are several ways that you can offset these additional costs, and Coralogix can help you understand the total cost of ownership associated with longer retention.

However, with the average cost of a data breach in 2020 coming in at $3.86 million, and knowing that you can save $1 million by containing a breach in under 30 days, it might be worth spending a little more on storage for your logs.

Securing your logs

If logs are one of the primary ways of detecting a security breach or understanding the impact of a security breach, then you have to ensure they are stored securely. This enables easier forensic analysis of all relevant data if you suspect a security breach has taken place. Some cyber insurance companies will even require that your logs be stored in a certain way as a part of your agreement.

Log Air-Gapping

Traditionally, for ease and cost, organizations will store log data on the same system or a connected system to the one being monitored. Unfortunately, this practice can do more harm than good. As hacking attacks have become more sophisticated, hackers are increasingly altering or removing logs to cover their tracks. Air-gapping your logs off to a cloud object store, like AWS S3, is the best practice for a cyber-resilient system.

Audit Log Immutability 

For the same reason as above, keeping your logs safe is important. Taking further steps, such as implementing immutability for log files, is a key consideration for cyber resilience.

Immutable audit logs are often requirements of security audits such as SOC2. Because audit logs are often indicators of which user accessed which database or application, or changed fundamental configurations, you should consider making them more resilient with immutability.
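As a rough sketch of what immutability can look like in practice, the snippet below writes an audit log to S3 with Object Lock in compliance mode via boto3; the bucket name and retention period are assumptions, and the bucket must have been created with Object Lock enabled:

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

with open("audit.log", "rb") as body:
    s3.put_object(
        Bucket="audit-logs-archive",  # hypothetical bucket, Object Lock enabled
        Key="2021/10/01/audit.log",
        Body=body,
        # In COMPLIANCE mode, no user (including root) can delete or
        # overwrite the object until the retention date passes
        ObjectLockMode="COMPLIANCE",
        ObjectLockRetainUntilDate=datetime.now(timezone.utc) + timedelta(days=365),
    )
```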

Sensitive Data in Logs

Keeping sensitive data in logs is a big no-no. Log data with credit card details, passwords, or other personal information that isn’t natively obscured or encrypted can cause you big issues. There are a few different ways that you can avoid security breaches in log data.

Password hashing, encryption, and salting are all options for decreasing the possible sensitivity of your log data. However, it may take a few more serious cybersecurity incidents before organizations treat log data with the same security considerations as they do production data.
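One common mitigation is to scrub log lines before they are shipped or stored. The sketch below shows the idea with two illustrative regular expressions; real patterns would need tuning to whatever your services actually emit:

```python
import re

# Hypothetical patterns: card-like digit runs and password fields
PATTERNS = [
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD REDACTED]"),
    (re.compile(r"password=\S+", re.IGNORECASE), "password=[REDACTED]"),
]

def scrub(line: str) -> str:
    """Mask sensitive values in a log line before it leaves the service."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line

print(scrub("user=jo password=hunter2 card=4111 1111 1111 1111"))
# -> user=jo password=[REDACTED] card=[CARD REDACTED]
```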

Observability for Cyber Resiliency – DDoS attacks

From 2019 to 2020, there was a 20% rise in DDoS attacks. Netscout went as far as to attribute this rise to pandemic conditions, with users overly reliant on eCommerce and streaming services, which are easy prey for such attacks.

A DDoS (distributed denial of service) attack is effectively where one malicious actor or group seeks to overwhelm a platform or service with traffic from an unmanageable and unforeseen number of sources. Cloudflare is widely regarded as one of the leaders in enterprise-grade DDoS protection. It can be used in conjunction with other aspects of your system, such as network load balancers, to manage and mitigate such attacks.

Coralogix for DDoS detection and prevention

Observability, when properly deployed and understood, has the opportunity to be a huge asset in your cyber resilience toolkit. Coralogix has integrations with Cloudflare and Cloudflare’s audit log function. By combining data from Cloudflare with other relevant metrics, you can effectively enable an early DDoS warning system. 

For example, you might seek to use the Security Traffic Analyzer (STA) to collect data from the load balancers within AWS. This data can provide you with a full readout on the security posture of your organization’s AWS usage. 

The combination of STA and Cloudflare data in the Coralogix dashboard gets to the heart of why observability is powerful for cyber resiliency. The ability to cross-analyze these metrics in a visualization tool gives you real-time insights into threats like DDoS attacks, allowing you to react effectively.

Observability for Cyber Resiliency – the AI advantage

As discussed earlier in this article, hackers can exist in your system undetected for months at a time. One of the aspects of the NIST Cyber Security Framework’s ‘detect’ pillar is to “establish baseline behaviors for users, data, devices, and applications.” This is because, in order to identify nefarious activity, you need to know what normal looks like.

The problem with a loosely coupled multi-component microservices architecture is that there are a huge number of parts, devices (in the case of IoT), applications, platforms, users, and networks that are all generating data. Manually, it would be nearly impossible to compile this data and establish a baseline intelligent enough to deal with various fluctuations.

Coralogix and AI as-a-Service

As part of the Coralogix platform, you can benefit from state-of-the-art AI tooling to help with anomaly detection in monitoring data. Error Volume Detection creates a baseline of all errors and bad API responses across all components in a system and correlates those metrics in relation to time. 

Flow Anomaly analyzes the typical flow of logs as they are returned and alerts the user if that ratio or pattern is broken. Again, this AI-powered insight tool creates an intelligent and flexible baseline automatically, ensuring that if things are amiss you’re alerted immediately. 

Both of these tools, inherent in Coralogix, give you AI-powered insights into what is normal in even the most complex of systems. This level of baselining is critical for cyber resiliency and harnesses all of the benefits of observability to get you there.

Summary

In this article, we’ve talked about how you need to handle monitoring data (in particular logs) to be more cyber resilient. Logs are going to be your best friend when you’re hunting for an attacker or diagnosing a security breach. That’s why they should be handled with the same care and caution as production data (although they rarely are). Organizations that identify cyber-attacks in under 30 days save on average $1 million, and it’s your logs (stored securely and in full) that are going to propel you toward that outcome.

We’ve also looked at how you can harness monitoring data to empower your organization’s cyber resiliency. By monitoring everything, not just firewalls and security components, you can get real insight into when things might be awry in your system. This is one of the fundamental principles of observability.

Observability, when done right, gives you the ability to analyze disparate components and their performance within your system. This is invaluable for cyber resiliency, such as in the instance of a DDoS attack.

Additionally, observability platforms like Coralogix have built-in AI technology that baselines your system’s “normal” and highlights when something deviates from the norm. Manual approaches simply cannot carry out that level of analysis or detection on so many sources of data, especially not in real-time.

Adding Observability to Your CI/CD Pipeline in CircleCI

The simplest CI/CD pipeline consists of three stages: build, test, and deploy.

In modern software systems, it is common for several developers to work on the same project simultaneously. Siloed working with infrequent merging of code in a shared repository often leads to bugs and conflicts that are difficult and time-consuming to resolve. To solve this problem, we can adopt continuous integration.  

Continuous integration is the practice of writing code in short, incremental bursts and pushing it to a shared project repository frequently so that automated build and testing can be run against it. This ensures that when a developer’s code gets merged into the overall project codebase, any integration problems are detected as early as possible. The automatic build and testing are handled by a CI server.

If passing the automated build and testing results in code being automatically deployed to production, that is called continuous deployment. 

All the sequential steps that are automatically executed from the moment a developer commits a change to the moment it ships to production are referred to as a CI/CD pipeline. CI/CD pipelines can range from very simple to very complex, depending on the needs of the application.

Important considerations when developing CI/CD pipelines

Building a CI/CD pipeline is no simple task. It presents numerous challenges, some of which include:

Automating the wrong processes

The whole premise of CI/CD is to increase developer productivity and optimize time-to-market. This goal is defeated when the CI/CD pipeline contains steps that aren’t necessary or that could be done faster manually.

When developing a CI/CD pipeline, you should:

  • consider how long a task takes to perform manually and whether it is worth automating
  • evaluate all the steps in the CI/CD pipeline and only include those that are necessary
  • analyze performance metrics to determine whether the pipeline is improving productivity
  • understand the technologies you are working with and their limitations as well as how they can be optimized so that you can speed up the build and testing stages.

Ineffective testing

Tests are written to find and remove bugs and ensure that code behaves in the desired manner. You can have a great CI/CD pipeline in place but still get bug-ridden code in production because of poorly written, ineffective tests. 

To improve the effectiveness of a CI/CD pipeline, you should:

  • write automated tests during development, ideally by practicing test-driven development (TDD)
  • examine the tests to ensure that they are of high quality and suitable for the application
  • ensure that the tests have decent code coverage and cover all the appropriate edge cases

Lack of observability in CI/CD pipelines

Continuous integration and continuous deployment underpin agile development. Together they ensure that features are developed and released to users quickly while maintaining high quality standards.  This makes the CI/CD pipeline business-critical infrastructure.

The more complex the software being built, the more complex the CI/CD pipeline that supports it. What happens when one part of the pipeline malfunctions? How do you discover an issue that is causing the performance of the CI/CD pipeline to degrade?

It is important that developers and the platform team are able to obtain data that answers these critical questions right from the CI/CD pipeline itself so that they can address issues as they arise.

Making a CI/CD pipeline observable means collecting quality and performance metrics on each stage of the CI/CD pipeline and thus proactively working to ensure the reliability and optimal performance of this critical piece of infrastructure. 

Quality metrics

Quality metrics help you identify how good the code being pushed to production is. While the whole premise of a CI/CD pipeline is to increase the speed at which software is shipped to get fast feedback from customers, it is also important to not be shipping out buggy code.

By tracking things like test pass rate, deployment success rate, and defect escape rate, you can more easily identify where to improve the quality of code being produced.

Productivity metrics

An effective CI/CD pipeline is a performant one. You should be able to build, test, and ship code as quickly as possible. Tracking performance-related metrics can give you insight into how performant your CI/CD pipeline is and enable you to identify and fix any bottlenecks causing performance issues.

Performance-based metrics include time-to-market, defect resolution time, deployment frequency, build/test duration, and the number of failed deployments. 

Observability in your CI/CD pipeline

The first thing needed to make a CI/CD pipeline observable is to use the right observability tool. Coralogix is a stateful streaming analytics platform that analyzes your logs, metrics, and security traffic in real time without relying on storage or indexing.

The observability tool you choose can then be configured to track and report on the observability metrics most pertinent to your application.

When an issue is discovered, the common practice is to have the person who committed the change that resulted in the issue investigate the problem and find a solution. The benefit of this approach is that it gives team members a sense of complete end-to-end ownership of any task they take on, as they have to ensure it gets shipped successfully.

Another good practice is to conduct a post-mortem review of the incident to identify what worked to resolve it and how things can be done better next time. The feedback from the post-mortem can also be used to identify where the CI/CD pipeline can be improved to prevent future issues.

Example of a simple CircleCI CI/CD pipeline

There are a number of CI servers you can use to build your CI/CD pipeline. Popular ones include Jenkins, CircleCI, GitLab, and a newcomer, GitHub Actions.

Coralogix provides integrations with CircleCI, Jenkins, and Gitlab that enable you to quickly and easily send logs and metrics to Coralogix from these platforms. 

The general principle of most CI servers is that you define your CI/CD pipeline in a yml file as a workflow consisting of sequential jobs. Each job defines a particular stage of your CI/CD pipeline and can consist of multiple steps. 

An example of a CircleCI CI/CD pipeline for building and testing a Python application is shown in the code snippet below.
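The original snippet isn’t reproduced here, so below is a minimal reconstruction of what such a config could look like, assuming a pytest-based project and the circleci/python orb; pin versions to whatever is current in the orb registry:

```yaml
version: 2.1

orbs:
  python: circleci/python@2.1.1  # assumed orb version

jobs:
  build-and-test:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - python/install-packages:  # installs from requirements.txt by default
          pkg-manager: pip
      - run:
          name: Run tests
          command: pytest

workflows:
  main:
    jobs:
      - build-and-test
```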

To add a deploy stage, you can use any one of the deployment orbs CircleCI provides. An orb is simply a reusable configuration package CircleCI makes available to help simplify your deployment configuration. There are orbs for most of the common deployment targets, including AWS and Heroku. 

The completed CI/CD pipeline with deployment to Heroku is shown in the code snippet below.
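Again as a hedged reconstruction rather than the original snippet: the circleci/heroku orb’s deploy-via-git job reads the HEROKU_API_KEY and HEROKU_APP_NAME environment variables from your project settings, so a completed pipeline might look like this:

```yaml
version: 2.1

orbs:
  python: circleci/python@2.1.1   # assumed orb versions
  heroku: circleci/heroku@2.0.0

jobs:
  build-and-test:
    docker:
      - image: cimg/python:3.11
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
      - run:
          name: Run tests
          command: pytest

workflows:
  main:
    jobs:
      - build-and-test
      - heroku/deploy-via-git:  # deploys only after tests pass
          requires:
            - build-and-test
          filters:
            branches:
              only: main
```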

Having created this CI/CD pipeline, you might think that you are done, but in fact you have only done half the job. The above CI/CD pipeline is missing a critical component to make it truly effective: observability.

Making the CI/CD pipeline observable

Coralogix provides an orb that makes it simple to integrate your CircleCI CI/CD pipeline with Coralogix. This enables you to send pipeline data to Coralogix in real-time for analysis of the health and performance of your pipeline.

The Coralogix orb provides four endpoints:

  • coralogix/stats for sending the final report of a workflow job to Coralogix
  • coralogix/logs for sending the logs of all workflow jobs to Coralogix for debugging
  • coralogix/send for sending 3rd party logs generated during a workflow job to Coralogix
  • coralogix/tag for creating a tag and a report for the workflow in Coralogix

To add observability to your CircleCI pipeline:

  1. In your Coralogix account, enable Pipelines by navigating to Project Settings -> Advanced Settings -> Pipelines and turning it on
  2. Add the Coralogix orb stanza at the top of your CircleCI configuration file
  3. Use the desired Coralogix endpoint in your existing pipeline

The example below shows how you can use Coralogix to debug a CircleCI workflow. Adding the coralogix/logs job at the end of the workflow means that all the logs generated by CircleCI during the workflow will be sent to your Coralogix account, which will allow you to debug all the different jobs in the workflow. 
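The snippet below is a sketch of that setup rather than a verbatim Coralogix example; the orb reference and version are assumptions, so check the CircleCI orb registry and the Coralogix documentation for the current name and any required credentials:

```yaml
version: 2.1

orbs:
  # Assumed orb reference; verify against the CircleCI orb registry
  coralogix: coralogix/coralogix@1.0.0
  heroku: circleci/heroku@2.0.0

workflows:
  main:
    jobs:
      # build-and-test is the job defined in the earlier snippets
      - build-and-test
      - heroku/deploy-via-git:
          requires:
            - build-and-test
      # Runs last, so the logs of every job above are forwarded to Coralogix
      - coralogix/logs:
          requires:
            - heroku/deploy-via-git
```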

Conclusion

CI/CD pipelines are a critical piece of infrastructure. By making your CI/CD pipeline observable you turn it into a source of real-time actionable insight into its health and performance.

Observability of CI/CD pipelines should not come as an afterthought but rather something that is incorporated into the design of the pipeline from the outset. Coralogix provides integrations for CircleCI and Jenkins that make it a reliable partner for introducing observability to your CI/CD pipeline.

Announcing our $55M Series C Round Funding to further our storage-less data vision

It’s been an exciting year here at Coralogix. We welcomed our 2,000th customer (more than doubling our customer base) and almost tripled our revenue. We also announced our Series B Funding and started to scale our R&D teams and go-to-market strategy.

Most exciting, though, was last September when we launched Streamaⓒ – our stateful streaming analytics pipeline.

And the excitement continues! We just raised $55 million for our Series C Funding to support the expansion of our stateful streaming analytics platform and further our storage-less vision.

Streamaⓒ technology

Streamaⓒ technology allows us to analyze your logs, metrics, and security traffic in real-time and provide long-term trend analysis without storing any of the data. 

The initial idea behind Streamaⓒ was to support our TCO Optimizer feature which enables our customers to define how the data is routed and stored according to use case and importance.

“We started with 3 very big international clients spending half a million dollars a year for our service, and we reduced that to less than $200,000. So, we created massive savings, and that allowed them to scale,” CEO Ariel Assaraf explains. “Because they already had that budget, they could stop thinking about whether or not to connect new data. They just pour in a lot more data and get better observability.”

Then we saw that the potential of Streama goes far beyond simply reducing costs. We are addressing all of the major challenges brought by the explosive growth of data. When costs are reduced, scale and coverage are more attainable. Plus, Streamaⓒ is only dependent on CPU and automatically scales up and down to match your requirements so we can deliver top-tier performance in the most demanding environments.

What’s next for Coralogix

Moving forward, our goal is to advance our storage-less vision and use Streamaⓒ as the foundation for what we call the data-less data platform.

There are two sides to this vision. On the one hand, we have our analytics pipeline, which provides all of the real-time and long-term insights you need to monitor your applications and systems without storing the data. On the other hand, we’re providing powerful query capabilities for archived data that has never been indexed.

So, imagine a world where you can send all of your data for analysis without thinking about quotas, without thinking about retention, without thinking about throttling. Get best-in-class analytics with long-term trends and be able to query all the data from your own storage, without any issues of privacy or compliance.

With this new round of funding, we’re planning to aggressively scale our R&D teams and expand our platform to support the future of data.

Thank you to our investors!

We’re proud to partner with Greenfield Partners, who led this round, along with support from our existing investors at Red Dot Capital Partners, StageOne Ventures, Eyal Ofer’s O.G. Tech, Janvest Capital Partners, Maor Ventures, and 2B Angels.

We have a lot of ambitious goals that we expect to meet in the next few quarters, and this funding will help us get there even faster.

Learn more about Coralogix: https://coralogixstg.wpengine.com/

Intro to AIOps: Leveraging AI and Machine Learning in DevOps

AIOps is a DevOps strategy that brings the power of machine learning to bear on observability and system management. It’s not surprising that an increasing number of companies are now adopting this approach.  

AIOps first came onto the scene in 2015 (coincidentally the same year as Coralogix) and has been gaining momentum for the past half-decade. In this post, we’ll talk about what AIOps is, and why a business might want to use it for their log analytics.

AIOps Explained

AIOps reaps the benefits of fantastic advances in AI and machine learning in recent decades. Because enterprise applications are complex yet predictable systems, AI and machine learning can be used to great effect to analyze their data and extract patterns. The AIOps Manifesto spells out five dimensions of AIOps:

  1. Data set selection – machine learning algorithms can parse vast quantities of noisy data and provide Ops teams with a curated sample of clean data. It’s then much easier to extract trustworthy insights and make effective business decisions.
  2. Pattern discovery – this generally occurs after a data set has been appropriately curated. It involves using a variety of ML techniques to extract patterns, from rule-based systems to neural networks that involve supervised and unsupervised learning.
  3. Inference – AIOps uses a range of inference algorithms to draw conclusions from patterns found in the data. These algorithms can make causal inferences about system processes ‘behind the data.’ Combining expert systems with pattern-matching neural networks creates highly effective inference engines.
  4. Communication – for AIOps to be of value, it’s not enough for the AI to have the knowledge; it needs to be able to explain its findings to a human engineer. AIOps has a variety of strategies for doing this, including visualization and natural language summaries.
  5. Automation – AIOps achieves its power by automating problem-solving and operational decisions. Because modern IT systems are so complex and fast-changing, automated systems need to be intelligent. They need machine learning to respond to quickly changing conditions in an adaptive fashion.

Why IT needs AIOps

As IT has advanced, it has shouldered more and more of the essential processes of business organizations.  Not only has technology become more sophisticated, it has also woven itself into business practice in increasingly intricate ways.

The ‘IT department’ of the ‘90s, responsible for a few niche business applications, has virtually gone. 21st century IT lives in the cloud. Enterprise applications are virtual, consisting of thousands of ephemeral components.  Businesses are so dependent on them that many business processes are IT processes.

This means that DevOps has had to upgrade. Automation is essential to managing the fast-changing complexity of modern IT. AIOps is an idea whose time has come. 

How companies are using AIOps

Over the past decade, AIOps has been adopted by many organizations. In a recent survey, OpsRamp found that 68% of surveyed businesses were experimenting with AIOps due to its potential to eliminate manual labor and extract data insights.

William Hill, COTY, and KPN are three companies that have chosen the way of AIOps and their experience makes fascinating reading:

AIOps Case Study: William Hill

William Hill started using AIOps to combat game and bonus abuse. As a betting and gaming company, their revenues depended on people playing by the rules and with so many customers, a human couldn’t keep track of the data.

William Hill’s head of Capacity and Monitoring Engineering, Andrew Longmuir explains the benefits of adopting AIOps.  First, it helped with automation, and in particular what Andrew calls “silo-busting”. AI and machine learning allowed William Hill to integrate nonstandard data sources into their toolchain.

Andrew uses the analogy of a jigsaw. Unintegrated data sources are like missing pieces of a puzzle. Using machine learning allows William Hill to bring them back into the fold and create a complete picture of the system.

Second, AIOps enables William Hill’s team to solve problems faster. Machine learning can be used to window data streams, reducing alert volumes and eliminating operational noise. It can also detect correlations between alerts, helping the team prevent problems before they arise.

Finally, incorporating AI and Machine Learning into William Hill’s IT strategy has even improved their customer experience. This results from them leveraging insights extracted from their analytics data to improve the design of their website.

Andrew has some words of wisdom for other organizations considering AIOps. He recommends focusing on a use case that is central to your company.  Teams need to be willing to trial multiple different solutions to find the optimum setup.

AIOps Case Study: COTY

COTY adopted AIOps to take the agility and scalability of their IT strategy to the next level. COTY is a major player in the cosmetics space, with brands that include Max Factor and Calvin Klein. As a dynamic business, they rely on flawless and versatile performance from their IT infrastructure to manage everything from payrolls to wireless networks.

With over 4,000 servers and a cloud-based infrastructure, COTY’s IT system is far too complex for traditional DevOps strategies to handle. To deal with it they’ve chosen AIOps.

AIOps has improved the way COTY handles and analyzes data. Data sources are integrated into a ‘data lake’, and machine learning algorithms can crunch its contents to extract patterns.

This has allowed them to minimize noise, so their operations department isn’t bombarded with irrelevant and untrustworthy information. 

AIOps has transformed the way COTY’s DevOps team thinks about visibility. Instead of a traditional events-based model, they now use a global, service-orientated model.  This allows the team to analyze their business and IT holistically.

COTY’s Enterprise Management Architect, Dan Ellsweig, wants to take things further. Dan is using his AIOps toolchain to create a dashboard for executives to view. For example, the dashboard might show the CTO what issues are being dealt with at a particular point in time.

AIOps Case Study: KPN

KPN is a Dutch telecoms business with operating experience in many European countries.  They adopted AIOps because the amount of data they were required to process was more than a human could handle.

KPN’s Chief Product Owner Software Tooling, Arnold Hoogerwerf, explains the benefits of using AIOps. First, leveraging AI and machine learning can increase automation and reduce operational complexity. This means that KPN’s DevOps team can do more with the same number of people.

Secondly, AI and machine learning can speed up the process of investigating problems. With traditional strategies, it may take weeks or months to investigate a problem and find the root cause. The capacity of AI tools to correlate multiple data sources allows the team to make crucial links in days that otherwise would have taken weeks.

Finally, Hoogerwerf has a philosophical reason for using AIOps.  He believes that while data is important, it’s even more important to keep sight of what’s going on behind the data.

Data on its own is meaningless if you don’t have the knowledge and wisdom with which to interpret it.

Implementing AIOps with Coralogix

Although the three companies we’ve looked at are much larger than the average business, AIOps is not just for big companies. The increasing number of platforms and vendors supporting AIOps tooling means that any business can take advantage of what AIOps has to offer.

The Coralogix platform launched two years after the birth of AIOps and our philosophy has always paralleled the principles of AIOps.  As Coralogix’s CEO Ariel Assaraf explains, organizations are burdened with the need to analyze increasing quantities of data. They often can’t do this with existing infrastructure, resulting in more than 99% of data remaining completely untapped.

In this context, the Coralogix platform is a game-changer. It allows organizations to analyze data without relying on storage or indexing. This enables significant cost savings and greater data coverage. Adding machine learning capabilities on top of that makes Coralogix much more powerful than any alternative in the market. Instead of cherry-picking data to analyze, stateful stream analysis occurs in real-time.  

How Coralogix can help with pattern discovery

One of the five dimensions of AIOps is pattern discovery. Due to the ability of machine learning to analyze large quantities of data, the Coralogix platform is tailor-made for discovering patterns in logs. As a case in point, gaming company AGS uses Coralogix to analyze 100 million logs a day.

The patterns extracted have allowed their DevOps team to reduce MTTR by 70% and their development team to create enhanced user experiences that have tripled their user base.

Another case is the neural science and ML company Biocatch. With exponentially increasing log volumes, their plight was a vivid illustration of the complexity that 21st century DevOps teams increasingly face.

Coralogix could handle these logs by clustering entries into patterns and finding connections between them. This allowed Biocatch to handle bugs and solve problems much faster than before.

How Coralogix can communicate insights

Once patterns have been extracted, DevOps engineers receive automated insights and alerts about anomalies in the system behavior.  Coralogix achieves this by integrating with a variety of dashboards and visualization solutions such as Prometheus and CloudWatch.

Coralogix also implements a smarter alerting system that flags anomalies to DevOps engineers in real time.  Conventional alerting systems require DevOps engineers to set alerting thresholds manually. However, as we saw at the start of this article, modern IT is too complex and fast-changing for this approach to work.

Coralogix solves this with dynamic alerts. These use machine learning to adjust thresholds in response to data.  This enables a much more effective approach to anomaly detection, one that is tailored to the DevOps landscape of the 21st century.

Wrapping Up

The increasing complexity and volumes of data faced by modern DevOps teams mean that humans can no longer handle IT operations without help.  AIOps aims to leverage AI and machine learning with a view to converting high-volume data streams into insights that human engineers can act on.

AIOps fits with Coralogix’s own approach to DevOps, which is to use machine learning to help organizations effectively use the increasing volumes of data they generate.  Observability should be for the many, not just a few.

Tutorial: Set Up Event Streams in CloudWatch

When building a microservices system, configuring events to trigger additional logic using an event stream is highly valuable. One common use case is receiving notifications when errors are seen in one of your APIs. Ideally, when errors occur at a specific rate or frequency, you want your system to detect that and send your DevOps team a notification.

Since AWS APIs often use stateless functions like Lambdas, you need to build a tracking mechanism yourself to send these notifications. Amazon saw a need for a service that would help development teams trigger events under custom conditions. To fill this need, they developed CloudWatch Events and subsequently EventBridge.

Introduction to CloudWatch Events

CloudWatch Events and EventBridge are AWS services that deliver data to a target upon occurrence of certain system events. They work on the same backend functionality, with EventBridge having a few more implemented features. System events supported include operational changes, logging events, and scheduled events.

CloudWatch Events will trigger a subsequent event when a system event occurs, sending data to another service based on your setup. Triggered services can include calling Lambda functions, sending SNS notifications, or writing data to a Kinesis Data Stream. 

Event Triggers

AWS represents all events with JSON objects that have a similar structure. They all have the same top-level fields that help the Events service determine if an input matches your requested pattern. If an event matches your pattern, it will trigger your target functionality.

You can use commands to write directly to EventBridge from AWS services like Lambda. Some AWS services like CloudTrail and external tools can also automatically send data to EventBridge. External sources with AWS integrations can also be used as event triggers.
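For illustration, here is the kind of JSON envelope involved, shown as a Python dict; the source, detail-type, and detail payload are hypothetical custom values, while the top-level field names are the standard EventBridge envelope:

```python
# Rules match on these top-level fields and on the "detail" payload
event = {
    "version": "0",
    "id": "6a7e8feb-b491-4cf7-a9f1-bf3703467718",
    "detail-type": "API Error",   # hypothetical custom event type
    "source": "my.api",           # hypothetical source name
    "account": "111122223333",
    "time": "2021-10-01T12:00:00Z",
    "region": "us-east-1",
    "resources": [],
    "detail": {"statusCode": 400, "path": "/orders"},
}
```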

Event Buses

Event buses receive events from triggers. Event triggers and event rules both specify which bus to use so events can be separated logically. Event buses also have associated IAM policies that specify what can write to the bus and update or create event rules and event targets. Each event bus can support up to 100 rules. If you require more event rules, you must use another event bus. 

Event Rules

Event rules are associated with specific event buses. Each rule determines whether events meet certain criteria. When they do, EventBridge sends the event to the associated target. Each rule can send events to up to five different targets, which process the event in parallel.

AWS provides templates to create rules based on data sources. Users can also set up custom rules which further filter data based on its contents. For a complete list of available filtering operations, see the AWS specification for content-based filtering.

Event Targets

Event targets are AWS endpoints triggered by events matching your configured pattern. A target may simply receive some of the event trigger data directly for processing.

For example, you can trigger an AWS Lambda function with the incoming event data, using Lambda to process the event further. Targets can also be specific commands like terminating an EC2 instance.

How to Set Up CloudWatch Events in EventBridge

Now that we have covered some parameters of CloudWatch Events, let’s walk through an example of how to set up an event trigger and target.

In this example, we will use the EventBridge interface to set up a rule. The EventBridge interface is very similar to the interface available in CloudWatch. The rule we make will trigger a Lambda when an API Gateway is hit with invalid input. DevOps teams commonly see invalid inputs when nefarious users are trying to get into your API.

1. Create a New Event Bus

This step is optional since AWS does provide a default event bus to use. In this example, we will create a new event bus to use with our rule. Since rules apply to only one event bus, it is common to group similar rules together on a bus. 

[Screenshot: creating a new event bus in EventBridge]

2. Name and Apply a Policy to the New Bus

To create your bus, add a name and a policy. There is an AWS template available for use by clicking the load template button, as shown below.

This template shows three common cases that could be used for permissions depending on the triggers and targets used. For more information about setting up the IAM policy, see the AWS security page for EventBridge.

The example below shows permissions for an account to write to this event bus. When ready, press the create button to finish creating your event bus.

[Screenshot: naming the new event bus and applying a policy in EventBridge]

3. Navigate to the Rules Section in the Amazon EventBridge Service 

In the Rules section, create a new rule. Add a name and, optionally, a description for the new rule.

[Screenshot: creating a new rule in EventBridge]

4. Select Event Pattern 

Here there is a choice between two types of rule: event pattern and schedule. Use an event pattern when you want to trigger the rule whenever a specific event occurs. Use a schedule when you want to trigger the rule periodically or on a cron expression.

5. Select Custom Pattern

Here there is a choice between two types of pattern matching. AWS will route all data from the source through your event bus when you use a pre-defined pattern by service.

Since we want only specific events from the Lambda behind our API, we will choose custom pattern. The pattern below will look at event values sent from our Lambda function to the event bus. If the event matches our requirements, EventBridge sends the event to the target.

[Screenshot: defining a custom event pattern in EventBridge]
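The console screenshot defines the pattern interactively; roughly the same rule could be created programmatically with boto3, as sketched below (the pattern, rule name, and bus name are illustrative):

```python
import json
import boto3

events = boto3.client("events")

# Match only "API Error" events from our API's Lambda with a 400 status
events.put_rule(
    Name="api-invalid-input",
    EventBusName="my-api-bus",  # hypothetical bus created in step 2
    EventPattern=json.dumps({
        "source": ["my.api"],
        "detail-type": ["API Error"],
        "detail": {"statusCode": [400]},
    }),
)
```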

6. Select the Event Bus

Select the event bus for this rule. In this case, we will select our custom bus created in Step 2.

[Screenshot: selecting the event bus in EventBridge]

7. Select Targets

Select targets for your rule by selecting the target type and then the associated instance of the type. In this case, a Lambda function will be invoked when an event matching this rule is seen.

By selecting Matched events, the entire event content will be sent as the Lambda input. Note there is also the capability to set retry policies for events that cause errors in the target functions. After this step, press Create Rule to complete the EventBridge setup.

[Screenshot: selecting targets in EventBridge]

Once the event bus and rule are created as above, writing to EventBridge inside the API’s Lambda function will trigger your target Lambda. If you are using a serverless deployment, the AWS SDK can be used to accomplish this, as sketched below.
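This is a rough sketch with boto3; the names match the illustrative pattern above, so adjust them to your own bus and event shape:

```python
import json
import boto3

events = boto3.client("events")

def handler(event, context):
    """Hypothetical Lambda behind the API Gateway."""
    # On invalid input, publish an event for the rule above to match
    events.put_events(Entries=[{
        "EventBusName": "my-api-bus",
        "Source": "my.api",
        "DetailType": "API Error",
        "Detail": json.dumps({"statusCode": 400,
                              "path": event.get("path", "/")}),
    }])
    return {"statusCode": 400, "body": "invalid input"}
```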

Processing in the target Lambda should track when errors occur. Developers can create metrics from the errors and track them using custom microservices or third-party tools like Coralogix’s metrics analytics platform.

You can also send raw data to Coralogix for review by directly writing to their APIs from EventBridge instead of hitting a Lambda first. EventBridge supports outputs that directly hit API Gateways, such as the one in front of Coralogix’s log analytics platform.

Wrap Up

Amazon enhanced CloudWatch Rules, creating a unique tool called EventBridge. EventBridge allows AWS users to process events from many different sources selectively. Processing data based on content is useful for processing large, disparate data sets.

Information tracked in EventBridge can also be used for gaining microservice observability. EventBridge uses triggers to send data to an event bus. Event rules are applied to each bus and specify which targets to invoke when an event matches the rule’s pattern. 

In the example above, EventBridge’s configuration will detect invalid API call events. This data is helpful, but at scale will need further processing to differentiate between a nefarious attack and simple errors.

Developers can send data to an external tool such as Coralogix to handle the analysis of the API data and to detect critical issues.

Why Are SaaS Observability Tools So Far Behind?

Salesforce was the first of many SaaS-based companies to succeed and see massive growth. Since it started out in 1999, Software-as-a-Service (SaaS) tools have taken the IT sector and, well, the world, by storm. For one, they mitigate bloatware by moving applications from the client’s computer to the cloud. Plus, the sheer ease of use brought by cloud-based, plug-and-play software solutions has transformed all sorts of sectors.

Given the SaaS paradigm’s success in everything from analytics to software development itself, it’s natural to ask whether its Midas touch could improve the current state of data observability tools.

Heroku and the Rise of SaaS

Let’s start with a system that we’ve previously talked about, Heroku. Heroku is one of the most popular platforms for deploying cloud-based apps. 

Using a Platform-as-a-Service approach, Heroku lets developers deploy apps in managed containers with maximum flexibility. Instead of apps being hosted in traditional servers, Heroku provides something called dynos.

Dynos are like cradles for applications. They utilize the power of containerization to provide a flexible architecture that takes the hassle of on-premises configuration away from the developer. (We’ve previously talked about the merits of SaaS vs Hosted solutions.)

Heroku’s dynos make scalability effortless. If developers want to scale their app horizontally, they can simply add more dynos. Vertical scaling can be achieved by upgrading dyno types, a process Heroku facilitates through its intuitive dashboard and CLI.

Heroku can even take scaling issues off the developer’s hands completely with its auto-scaling feature. This means that software companies can focus on their mission, providing high-quality software at scale without worrying about the ‘how’ of scalability or configuration.

Systems like Heroku give us a tantalizing glimpse of the power and convenience a SaaS approach can bring to DevOps. The hassles of resource management, configuration, and deployment are abstracted away, allowing developers to focus solely on coding.

SaaS is making steady inroads into DevOps. For example, Coralogix (which integrates with Heroku and is also available as a Heroku add-on), operates with a SaaS approach, allowing users to analyze logs without worrying about configuration details.

Not So SaaS-y Tooling

It might seem that nothing is stopping SaaS from being applied to all aspects of observability tooling. After all, Coralogix already offers a SaaS log analytics solution, so why not just make all logging as SaaS-y as possible?

Log collection is the fly in this particular ointment.  Logging data is often stored in a variety of formats, reflecting the fact that logs may originate from very different systems.  For example, a Linux server will probably store logs as text data while Kubernetes can use a structured logging format or store the logs as JSON.

Because every system has its own logging format, organizations tend to collect their logs on-premises, and that is a big roadblock to the smooth uptake of SaaS. In reality, the variety of systems, along with the option to build your own, is symptomatic of a slower move toward observability in the enterprise. However, this range of options doesn’t mean that log analysis is limited to on-prem systems.

The upshot is that organizations are missing out on SaaS observability tooling. Why is this the case, when SaaS tools and platforms are so widespread? The perceived complexity of varying formats, combined with potential cloud-centric security concerns, might have a role to play.

Moving to Cloud-Based Log Storage with S3 Bucket

To pave the way to Software as a Service log collection, we need to stop storing logs on-prem and move them to the cloud.  Cloud computing is the keystone of SaaS. Applications can be hosted on centralized computing resources and piped to thousands of clients.

AWS lets you store logs in the cloud in S3 buckets. S3 is short for Simple Storage Service. As the name implies, S3 is a service provided by AWS that is specifically designed to let you store and access data quickly and easily.

Pushing Logs to S3 with Logstash and FluentD

For those who aren’t already logging to AWS, output plugins allow users to push existing log records to S3. Two of the most popular logging solutions are FluentD and Logstash, so we’ll look at those here. (Coralogix integrates with both FluentD and Logstash.)

FluentD Plugin

FluentD contains a plugin called out_s3. This enables users to write pre-existing log records to the S3 Bucket.  Out_s3 has several cool features.

For one, it splits files using the time event logs were created. This means the S3 file structure accurately reflects the original time ordering of log records and not just when they were uploaded to the bucket.

Another thing out_s3 allows users to do is incorporate metadata into the log records.  This means each log record contains the name of its S3 Bucket along with the object key. Downstream systems like Coralogix can then use this info to pinpoint where each log record came from.

At this point, I should mention something that could catch new users out. FluentD’s plugin automatically creates files on an hourly basis. This can mean that when you first upload log records, a new file isn’t created immediately, as it would be with most systems.

While you can’t rely on new files being created immediately, you can change how frequently they are created by configuring the time key condition.
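A minimal out_s3 configuration might look like the sketch below, assuming Fluentd v1 buffer syntax; the match tag, bucket name, and region are placeholders.

<match app.logs>
  @type s3
  s3_bucket my-log-bucket      # placeholder bucket name
  s3_region eu-west-1          # placeholder region
  path logs/
  <buffer time>
    @type file
    path /var/log/fluent/s3-buffer
    timekey 3600               # slice files hourly, as described above
    timekey_wait 10m           # grace period before a closed slice is flushed
  </buffer>
</match>

Lowering timekey makes files appear more often; raising it batches more records into each S3 object.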

Logstash Plugin

Logstash’s S3 output plugin is open source and comes under an Apache 2.0 license, meaning you are free to use it with very few restrictions. It uploads batches of Logstash events in the form of temporary files, which by default are stored in the operating system’s temporary directory.

If you don’t like the default save location, Logstash gives you a temporary_directory option that lets you stipulate a preferred save location.
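Putting both options together, a hedged sketch of the output section might look like this; the region, bucket, and directory are placeholders.

output {
  s3 {
    region => "eu-west-1"                       # placeholder region
    bucket => "my-log-bucket"                   # placeholder bucket name
    temporary_directory => "/tmp/logstash-s3"   # overrides the default OS temp dir
  }
}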

Securing Your Logs

Logs contain sensitive information. A crucial question for those taking the S3 log storage route is making sure S3 Buckets are secure.  Amazon S3 default encryption enables users to ensure that new log file objects are encrypted by default.
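For example, turning on default encryption for a bucket can be done from the AWS CLI roughly as follows (the bucket name is a placeholder):

aws s3api put-bucket-encryption \
  --bucket my-log-bucket \
  --server-side-encryption-configuration '{"Rules": [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]}'

Once this is set, any new log objects written to the bucket are encrypted with SSE-S3 without any change to your log shippers.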

If you’ve already got some logs in an S3 bucket and they aren’t yet encrypted, don’t worry. S3 has a couple of tools that let you encrypt existing objects quickly and easily.

Encryption through Batch Operations

One tool is S3 Batch Operations. Batch Operations are S3’s mechanism for performing operations on billions of objects at a time. Simply provide S3 Batch Operations with a list of the log files you want to encrypt, and the API performs the appropriate operation on each of them.

Encryption can be achieved by using the copy operation to copy unencrypted files to encrypted files in the same S3 Bucket location.

Encryption through Copy Object API

An alternative tool is the Copy Object API. This tool works by copying a single object back to itself using SSE encryption and can be run using the AWS CLI.

Although Copy Object is a powerful tool, it’s not without risks. You’re effectively replacing your existing log files with encrypted versions so make sure all the requisite information and metadata is preserved by the encryption. 

For example, if you are copying log files larger than the multipart_threshold value, the Copy Object API won’t copy the metadata by default. In this case, you need to specify the metadata you want to keep using the --metadata parameter.
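As a sketch with the AWS CLI (bucket and prefix are placeholders), copying a prefix over itself with SSE applied looks like this:

aws s3 cp s3://my-log-bucket/logs/ s3://my-log-bucket/logs/ \
  --recursive --sse AES256

Spot-check a few objects afterwards to confirm that timestamps, metadata, and content survived the copy.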

Integrating S3 Buckets with Coralogix

Hooray! Your logs are now firmly in the cloud with S3. Now, all you need to do is analyze them.  Coralogix can help you do this with the S3 to Coralogix Lambda.

This is a Lambda function that sends log data from your S3 bucket to Coralogix, where the full power of machine learning can be applied to uncover insights. To use it, you need to define five parameters:

  • S3BucketName: the name of the S3 bucket storing the CloudTrail logs.
  • ApplicationName: a mandatory metadata field that is sent with each log and helps to classify it.
  • CoralogixRegion: the region in which your Coralogix account is located. This can be Europe, US, or India, depending on whether your Coralogix URL ends with .com, .us, or .in.
  • PrivateKey: a parameter that can be found in your Coralogix account under Settings -> Send your logs. It is located in the upper left corner.
  • SubsystemName: a mandatory metadata field that is sent with each log and helps to classify it.

The S3 to Coralogix Lambda can be integrated with AWS’s automation framework through the Serverless Application Model. SAM is an AWS framework that provides resources for creating serverless applications, such as shorthand syntax for APIs and functions.
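If you deploy the Lambda with SAM, supplying those five parameters could look roughly like this hedged sketch (the stack name and values are placeholders):

sam deploy \
  --stack-name s3-to-coralogix \
  --parameter-overrides \
    S3BucketName=my-log-bucket \
    ApplicationName=my-app \
    SubsystemName=s3-logs \
    CoralogixRegion=Europe \
    PrivateKey=<your-private-key>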

The code for the Lambda is also available in the S3 to Coralogix Lambda GitHub repository. As with the Logstash plugin, it’s open source under the Apache 2.0 License, so you are free to use and adapt it.

To Conclude

Software as a Service is a paradigm that is transforming every part of the IT sector, including DevOps. It replaces difficult-to-configure on-premises architecture with uniform and consistent services that remove scalability from the list of an end user’s concerns.

Unfortunately, SaaS observability tooling is still falling behind the curve, but largely because organizations are still maintaining a plethora of systems (and therefore a variety of formats) on-prem. 

Storing your logs in S3 lets you bring the power and convenience of SaaS to log collection. Once your logs are in S3, you can leverage Coralogix’s machine learning analytics to extract insights and predict trends.

Unlocking Hidden Business Observability with Holistic Data Collection

Why do organizations invest in data observability?

Because it adds value. Sometimes we forget this when we’re building our observability solutions. We get so excited about what we’re tracking that we can lose sight of why we’re tracking it.

Technical metrics reveal how systems react to change. What they don’t give is a picture of how change impacts the broader business goals. The importance of qualitative data in business observability is often overlooked.

Data-driven insights which only include technical or quantitative modeling miss the big picture. Pulling data from holistic sources unlocks the full power of business observability.

Including holistic data collectors in your observability stack grants visibility not just into what’s happening, but why it happened, and the impact it has on your business outside of your systems.

What are Holistic Data Collectors?

Holistic data collectors draw from unusual or uncommon sources. They log information that wouldn’t usually be tracked, enabling businesses to get a holistic view of their systems. A holistic view means observability of all interconnected components.

A data strategy that includes the collection of holistic data empowers effective business observability. By logging these hidden pockets of data, much clearer insight can be generated, and data-backed strategic decisions become much better informed.

The list of data sources it is possible to include is potentially limitless. With the proper application of collectors, any communication platform or user service can become a source of data and insight. Holistic data collectors exist for code repositories such as GitHub, collaboration tools like Teams or Slack, and even marketing data aggregators such as Tableau.

When and How to Use Holistic Data Collection

With some creative thinking and technical know-how, almost anything can be a source of holistic data. Common business tools, software, and platforms can be mined for useful data and analysis.

Below are some examples that illustrate the usefulness of this approach to data-driven insight.

Microsoft Teams

Microsoft Teams has become a vital communication channel for the modern enterprise. As one of the most popular internal communication platforms on the market, Teams can be an invaluable source of holistic data from within your workforce.

Integrating webhooks into Teams enables you to monitor and track specific activity. Webhooks are a simple HTTP callback, usually in JSON format. They’re one of the simplest ways to connect your systems to an externally hosted channel such as Teams.
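As a minimal sketch in Python using Flask (the route and payload fields are assumptions about how you would wire up a Teams outgoing webhook, not a definitive integration), an endpoint that receives Teams activity and forwards the interesting parts might look like:

from flask import Flask, request

app = Flask(__name__)

# Register this URL as an outgoing webhook endpoint in your Teams channel
@app.route("/teams-events", methods=["POST"])
def teams_events():
    event = request.get_json(force=True)  # webhook payloads arrive as JSON
    sender = event.get("from", {}).get("name")
    text = event.get("text")
    # In a real setup, forward these fields to your log analytics platform here
    print(sender, text)
    # Reply with a simple message object so Teams knows the call succeeded
    return {"type": "message", "text": "received"}

if __name__ == "__main__":
    app.run(port=8080)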

Pushing holistic, qualitative Teams data to a third party platform using webhooks enables correlation of non-numeric insight with finance and marketing figures and system metrics. 

As Teams is most often used internally, this is an invaluable asset for understanding how changes to your organization are reflected in the morale of your staff. Many developers use Teams to discuss issues and outages. Having visibility of your IT team’s responses identifies which tasks take up most of their time and which knowledge blindspots exist in their collective technical skillset.

PagerDuty

PagerDuty is transforming the way IT teams deal with incidents. Integrating holistic data collectors greatly expands the level of insight gained from this powerful tool.

Highly specific criteria on alerts and push notifications enable effective prioritization of support. IT teams can manage large sets of alerts without risking an overwhelming amount of notifications.

As with Teams, webhooks are one of the most common and simplest methods of collecting holistic data from PagerDuty. By collecting enriched event data around incidents and outages, how your IT teams respond can be analyzed in the context of the wider business organization.

GitHub

Scraping GitHub provides great insight into the performance of your dev teams. What’s not as widely known is the business insight that can be gained by correlating GitHub commit activity with the rest of your data.

Commits are changes to code in GitHub. Each commit comes with a message and appears in the repository log. Keeping GitHub commits under the eye of your monitoring stack reveals a fascinating hidden pocket of data that could change the way your teams approach outages and troubleshooting.

Outages occur for many reasons. Some require more work and code changes than others. Bad or lazy coding creates a lot of outages. Tracking and logging GitHub commits will reveal both the kinds of outages and the specific chunks of code that take up the most time for your engineers.

GitOps

Monitoring GitOps declarations and version edits pinpoints not only where weaknesses exist at an architectural level, but when they were implemented and whether the problem is isolated or part of a wider trend.

Tableau

Tableau is an invaluable source of marketing data. Integrating Tableau with your log analytics platform opens up valuable insight.

Digital marketing is an essential business aspect of modern enterprises. An effective digital marketing presence is key to success, and Tableau is the go-to tool for many organizations.

Tableau is useful for market strategy. It’s when Tableau data is included as part of a wider, holistic picture that organizational intelligence and insight can be gained from it. Scraping Tableau for data such as page bounce rates and read time allows you to see how technical measurements correlate with your marketing metrics.

Silo-Free Reporting

Say you’ve experienced a sudden dip in online sales despite an advertising drive. Your first reaction could be to blame the marketing material.

By including Tableau data in your analytics and reporting you can see that the advertising campaign was successful. Your website had a spike in visitors. This spike in traffic led to lag and a poor customer experience, visible due to an equally large spike in bounce rate. 

Scraping Tableau as a source of holistic data reveals that the sales dip was down to issues with your systems. Your strategy can then be to improve your systems so they can keep up with your successful marketing and large digital presence.

Jira

Integrating your analytics and monitoring platform with Jira can yield powerful results, both in alerting capabilities and collecting insight-generating data.

Using webhooks, your integrated platform can create Jira tickets based on predefined criteria. These criteria can be defined based on data from other sources in your ecosystem, as the issue is being raised and pushed from a third party platform.
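A hedged sketch in Python of raising such a ticket through Jira’s REST API (the site URL, credentials, and project key are placeholders):

import requests

JIRA_URL = "https://example.atlassian.net"   # placeholder Jira site
AUTH = ("bot@example.com", "<api-token>")    # placeholder credentials

def raise_ticket(summary: str, description: str) -> None:
    # Create an issue when predefined alert criteria are met
    payload = {
        "fields": {
            "project": {"key": "OPS"},       # placeholder project key
            "issuetype": {"name": "Bug"},
            "summary": summary,
            "description": description,
        }
    }
    response = requests.post(
        f"{JIRA_URL}/rest/api/2/issue", json=payload, auth=AUTH, timeout=10
    )
    response.raise_for_status()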

Automating this process enables your IT team to deploy themselves both efficiently and with speed. It allows them to concentrate on resolving issues without getting bogged down in logging and pushing the alerts manually.

By having tickets raised by an ML-powered platform with observability over your whole infrastructure, your engineers won’t be blindsided by errors occurring in areas they may not have predicted.  

Use Case: Measuring the Business Impact of Third Party Outages with Holistic Collectors

Third party outages are one of the most common reasons for lost productivity and revenue. Many enterprises rely on third party software, systems, or services to deliver their own. Reliance on one or more third parties is the norm for many sectors and industries.

A third party will inevitably experience an outage at some point.  There’s no way to avoid this. Even the most reliable provider can fall victim to unforeseen circumstances such as power cuts or emergency server maintenance from time to time.

Third party outages can have huge ramifications. These services are often business-critical, and the financial costs of even brief outages can reach far past the hundred thousand dollar mark. 

Lost revenue is one way to measure the impact of such an outage. While financial data helps understand the initial impact of unforeseen downtime, alone it doesn’t provide full visibility of long-term consequences or the reaction of the business and consumers.

Collecting holistic data alongside the usual logging metrics helps to fill in the blanks. It allows a business to answer questions like:

  • Has this affected our public reputation?
  • Are users speaking positively about our response?
  • Did we respond faster or slower to this outage than usual?
  • Was our response effective at returning us to BAU as soon as possible?
  • Is this affecting the morale of our staff?
  • How much work is this making for my IT team?
  • Has web traffic taken a hit?

This leaves any business or organization in a much better position to control and mitigate the long-term impact of a third party outage. 

A study by Deloitte found that direct losses due to third party mismanagement can cost businesses as much as $48m, and that is before indirect losses are factored in. An outage could have minor immediate financial ramifications but damage long-term prospects through reputational damage. It would be almost impossible to gain the insight to prevent this using financial or systems metrics alone.

A Complete Holistic Data Solution

The Coralogix observability platform is a monitoring solution that enables holistic data collection from sources such as Teams, GitHub, PagerDuty, Slack, Tableau, and many others.

Collecting and logging data from multiple sources, both traditional and unorthodox, can be difficult to manage. Business intelligence and organizational insight are difficult to gain if information is stored and reported from dozens of sources in as many formats.

Coralogix’s observability platform provides a holistic system and organized view in a single pane. Using machine learning, our platform creates a comprehensive picture of your technical estate that comes with actionable business context. For visibility and insight that extends beyond the purely technical, the Coralogix platform is the solution your business needs.

The Untapped Power of Key Marketing Metrics

Marketing and Site Reliability teams rarely meet in most organizations. It’s especially rare outside the context of product marketing sessions or content creation. With observability now pivotal to success, we should be looking to bring the two together for technical and commercial gains.

In this piece, we’re going to explore the meaning of observability and its relevance to marketing metrics. We’ll cover which metrics you might want to observe, and how you might look to drive revenue through technical measurements and key marketing metrics.

Observability and Key Marketing Metrics

Without covering the topic of observability in its totality, we can say that observability is deep insight and overall visibility of distributed systems and applications. Monitoring is essentially data collection, and observability is the ability to use that to produce contextual insights and analysis.

What Does This Have To Do With Marketing?

There are a whole host of key marketing metrics that are collected by organizations in data aggregation tools like Tableau. These measurements include things like click rate, monthly traffic numbers, conversion rate, and time spent on page. 

With the principles of observability in mind, these different data points can be monitored and aggregated to provide insights. With key marketing metrics stored within your observability platform, you can understand what is and isn’t working for users based on otherwise siloed metrics.

What Do Technical Measurements Bring to Marketing?

Correlating technical measurements with marketing metrics can reveal problems on your site or platform. In this section, we’ll discuss a few scenarios which show the benefit of viewing technical or business measurements with added context from your key marketing metrics. 

Scenario 1: Poor Conversion

Picture this: you are the WebOps lead for a growing eCommerce business with high and growing web traffic week-on-week. While your traffic numbers are high and click-through rates remain strong, your conversion rates are much lower than projected. Conversion rates remain one of the key marketing metrics in eCommerce.

Business teams often focus on traditional causes for low conversion (shipping fees, load times, etc.). By comparing conversion rates with technical measurements you can uncover the true root cause. 

With data pulled from Tableau (aggregating all of your marketing metrics) alongside your platform latency times and error logs for your payment solutions provider, you can see that the low conversion is easily attributable to slow checkout load times and failure to authorize a secure connection to your payments platform. 

Scenario 2: Low Engagement 

Engagement with your website or platform may be indicated in different ways. Clicking on content, downloading resources, or opening an application can all show that users or potential customers are engaging with your platform or product.

Bounce rate is a key marketing metric that is a good indicator of engagement, showing the number of single-page sessions. The higher the bounce rate, the fewer users are exploring your site or platform.

If you’re confident in your content, looking at technical measurements together with marketing metrics can help indicate what’s going wrong. For example, suppose you have a high bounce rate on a page that links to other parts of your platform. Comparing which pages have a high bounce rate against benchmark reports could alert you to errors or UI problems caused by recent changes.

In this way, marketing metrics can act as a diagnostic tool for deeper technical problems, which otherwise may have been caught much later.

Mixing Technical Measurements and Marketing Metrics for Revenue Growth

In addition to using marketing and technical metrics for issue diagnosis, the same principles can be applied to grow revenue.

Data relating to page depth shows how much traffic to your site is distributed amongst different pages. While closely tied to bounce rate, page depth focuses more on time spent moving between different pages.

This marketing metric is great for understanding which of your pages are the most attractive to site visitors, and which are successful in pushing them to another part of the site. Serverside errors are some of the biggest technical blockers in achieving higher page depth.

By analyzing the correlation between purchases, bounce rate, and page depth, you can optimize your site for increased revenue. For example, if you have a landing page that generates high page depth, consider linking this to a product page or payment gateway. 

Summary

One of the best things about observability is that it provides an open foundation for further exploration.

By combining marketing metrics and technical measurements in one observability platform, you turn your site users into diagnostic tools. In a world where numerous releases per day are common for the modern enterprise, this is invaluable. 

On top of that, the cross-pollination of marketing and technical data may allow you to boost your revenue. From a ‘single pane of glass view’, you can see what pages work for you, why they work the best (serverside performance), and what that translates to in terms of revenue.

Coralogix is now working to enable this kind of cross-organization observability. It will provide the ability for collaboration between technical and marketing teams in a way that hasn’t been seen before. As always with observability, with the right platform helping you understand your data, anything is possible. 

PromQL Tutorial: 5 Tricks to Become a Prometheus God

For the seasoned user, PromQL confers the ability to analyze metrics and achieve high levels of observability. Unfortunately, PromQL has a reputation among novices for being a tough nut to crack.

Fear not! This PromQL tutorial will show you five paths to Prometheus godhood. Using these tricks will allow you to use Prometheus with the throttle wide open.

Aggregation

Aggregation is a great way to construct powerful PromQL queries. If you’re familiar with SQL, you’ll remember that GROUP BY allows you to group results by a field (e.g. country or city) and apply an aggregate function, such as AVG() or COUNT(), to values of another field.

Aggregation in PromQL is a similar concept. Metric results are aggregated over a metric label and processed by an aggregation operator like sum().

Aggregation Operators

PromQL has twelve built-in aggregation operators that allow you to perform statistics and data manipulation.

Group

What if you want to aggregate by a label just to get that label’s values? Prometheus 2.20 introduced the group() operator for exactly this purpose. Using it makes queries easier to interpret and means you don’t need to resort to bodges.
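For instance, to list which instances currently expose a metric without caring about its values, something like this works (using a metric that appears later in this tutorial):

group by(instance) (node_filesystem_size_bytes)

Every element in the result has the value 1, so you get one row per instance rather than a meaningless sum.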

Count those metrics

PromQL has two operators for counting up elements in a time series: count() simply gives the total number of elements, while count_values() counts the elements that share each distinct value, recording that value in a new label. For example, we could count the number of binaries running each build version with the query:

count_values("version", build_version)

sum() does what it says: it takes the elements of a time series and simply adds them all together. For example, if we wanted to know the total HTTP requests across all our applications, we can use:

sum(http_requests_total)

Stats

PromQL has 8 operators that pack a punch when it comes to stats. 

Avg() computes the arithmetic mean of values in a time series.

min() and max() calculate the minimum and maximum values of a time series. If you want to know the k highest or lowest values of a time series, PromQL provides topk() and bottomk(). For example, if we wanted the 5 largest HTTP request counts across all instances, we could write:

topk(5, http_requests_total)

quantile() calculates an arbitrary upper or lower portion of a time series. It uses the idea that a dataset can be split into ‘quantiles’ such as quartiles or percentiles. For example, quantile(0.25, s) computes the lower quartile of the time series s.

Two powerful operators are stddev(), which computes the standard deviation of a time series, and stdvar(), which computes its variance. These operators come in handy when you’ve got application metrics that fluctuate, such as traffic or disk usage.

By and Without

The by and without clauses enable you to choose which dimensions (metric labels) to aggregate along. by tells the query to include labels: the query sum by(instance) (node_filesystem_size_bytes) returns the total node_filesystem_size_bytes for each instance.

In contrast, without tells the query which labels not to include in the aggregation. The query sum without(job) (node_filesystem_size_bytes) returns the total node_filesystem_size_bytes aggregated over job, keeping every other label.


Joining Metrics

SQL fans will be familiar with joining tables to increase the breadth and power of queries. Likewise, PromQL lets you join metrics. As a case in point, the multiplication operator can be applied element-wise to two instant vectors to produce a third vector.

Let’s look at this query, which joins instant vectors a and b.

a * b

This produces a resultant vector with elements a1b1, a2b2, ..., anbn. It’s important to realise that if a contains more elements than b or vice versa, the unmatched elements won’t be factored into the resultant vector.

This is similar to how an SQL inner join works; the resulting vector only contains values in both a and b.

Joining Metrics on Labels

We can change the way vectors a and b are matched using labels. For instance, the query a * on (foo, bar) group_left(baz) b matches vectors a and b on the metric labels foo and bar. (group_left(baz) means the result also contains baz, a label belonging to b.)

Conversely, you can use ignoring to specify which label you don’t want to join on. For example, the query a * ignoring (baz) group_left(baz) b joins a and b on every label except baz. Let’s assume a contains labels foo and bar, and b contains foo, bar, and baz. The query will join a to b on foo and bar and therefore be equivalent to the first query.

Later, we’ll see how joining can be used in Kubernetes.

Labels: Killing Two Birds with One Metric

Metric labels allow you to do more with less. They enable you to glean more system insights with fewer metrics.

Scenario: Using Metric Labels to Count Errors

Let’s say you want to track how many exceptions are thrown in your application. There’s a noob way to solve this and a Prometheus god way.

The Noob Solution

One solution is to create a counter metric for each given area of code. Each exception thrown would increment the metric by one.

This is all well and good, but how do we deal with one of our devs adding a new piece of code? In this solution we’d have to add a corresponding exception-tracking metric.  Imagine that barrel-loads of code monkeys keep adding code. And more code. And more code.

Our endpoint is going to pick up metric names like a ship picks up barnacles.  To retrieve the total exception count from this patchwork quilt of code areas, we’ll need to write complicated PromQL queries to stitch the metrics together.

The God Solution

There’s another way. Track the total exception count with a single application-wide metric and add metric labels to represent new areas of code. To illustrate, if the exception counter was called “application_error_count” and it covered code area “x”, we can tack on a corresponding metric label.

application_error_count{area="x"}

As you can see, the label is in braces. If we wanted to extend application_error_count’s domain to code area “y”, we simply emit the same counter with a different label value:

application_error_count{area="y"}

And when we want both areas at once, a regex matcher such as {area=~"x|y"} selects the two series together.

This implementation allows us to bolt on as much code as we like without changing the PromQL query we use to get total exception count. All we need to do is add area labels.

If we do want the exception count for individual code areas, we can always slice application_error_count by area with an aggregate query such as:

sum by(area) (application_error_count)

Using metric labels allows us to write flexible and scalable PromQL queries with a manageable number of metrics.

Manipulating Labels

PromQL’s two label manipulation commands are label_join and label_replace.  label_join allows you to take values from separate labels and group them into one new label. The best way to understand this concept is with an example.

label_join(up{job="api-server",src1="a",src2="b",src3="c"}, "foo", ",", "src1", "src2", "src3")

In this query, the values of three labels, src1, src2, and src3, are joined into the label foo using the “,” separator. foo now contains the respective values of src1, src2, and src3: “a,b,c”.

label_replace writes a new label based on an existing one. Let’s examine the query

label_replace(up{job="api-server",service="a:c"}, "foo", "$1", "service", "(.*):.*")

Despite the name, this query doesn’t delete “service”. It adds a new label “foo” whose value is captured from service by the regex: “(.*):.*” pulls the “a” out of “a:c”, so foo becomes “a”. One use of label_replace is writing cool queries for Kubernetes.

Creating Alerts with predict_linear

Introduced in 2015, predict_linear is PromQL’s metric forecasting tool.  This function takes two arguments. The first is a gauge metric you want to predict. You need to provide this as a range vector. The second is the length of time you want to look ahead in seconds.

predict_linear takes the metric at hand and uses linear regression to extrapolate forward to its likely value in the future. As an example, let’s use PromLens to run the query: 

predict_linear(node_filesystem_avail_bytes{job="node"}[1h], 3600).

PromLens renders a graph showing the predicted value an hour from the current time.

[Screenshot: predict_linear graph in PromLens]

Alerts and predict_linear

The main use of predict_linear is in creating alerts. Let’s imagine you want to know when you run out of disk space.  One way to do this would be an alert which fires as soon as a given disk usage threshold is crossed. For example, you might get alerted as soon as the disk is 80% full. 

Unfortunately, threshold alerts can’t cope with extremes of disk usage growth. If disk usage grows slowly, it makes for noisy alerts. An alert telling you to urgently act on a disk that’s 80% full is a nuisance if disk space will only run out in a month’s time.

If, on the other hand, disk usage fluctuates rapidly, the same alert might be a woefully inadequate warning. The fundamental problem is that threshold-based alerting knows only the system’s history, not its future.

In contrast, an alert based on predict_linear can tell you exactly how long you’ve got before disk space runs out. Plus, it’ll even handle curveballs such as sharp spikes in disk usage.

Scenario: predict_linear in action

This wouldn’t be a good PromQL tutorial without a working example, so let’s see how to implement an alert which gives you 4 hours notice when your disk is about to fill up. You can begin creating the alert using the following code in a file “node.rules”.

- name: node.rules
  rules:
  - alert: DiskWillFillIn4Hours
    expr: predict_linear(node_filesystem_free{job="node"}[1h], 4*3600) < 0
    for: 5m
    labels:
      severity: page

The key to this is the fourth line.

expr: predict_linear(node_filesystem_free{job="node"}[1h], 4*3600) < 0

This is a PromQL expression using predict_linear. node_filesystem_free is a gauge metric measuring the amount of free filesystem space. The expression performs linear regression over the last hour of filesystem history and extrapolates the free space four hours ahead. If the predicted value is less than zero, the alert is triggered.

The line after this is a failsafe: for: 5m means the expression must stay true for five minutes before the alert fires, so a brief spike or race condition doesn’t produce a false positive.

Using PromQL’s predict_linear function leads to smarter, less noisy alerts that don’t give false alarms and do give you plenty of time to act.

Putting it All Together: Monitoring CPU Usage in Kubernetes

To finish off this PromQL tutorial, let’s see how PromQL can be used to create graphs of CPU-utilisation in a Kubernetes application.

In Kubernetes, applications are packaged into containers, and containers run in pods. Pods specify how many resources a container can use. When an application needs more resources than one pod provides, it is scaled out across additional pods.

This means that a candidate PromQL query needs the ability to sum over multiple pods to get the total resources for a given container. Our query should come out with something like the following.

Container                 CPU utilisation per second
redash-redis              0.5
redash-server-gunicorn    0.1

Aggregating by Pod Name

We can start with container_cpu_usage_seconds_total, a metric of cumulative CPU usage for the whole system. To get the CPU utilisation per second for a specific namespace within the system, we use the following query, built on PromQL’s rate function:

rate(container_cpu_usage_seconds_total{namespace="redash"}[5m])

This is where aggregation comes in. We can wrap the above query in a sum query that aggregates over the pod name.

sum by(pod_name) (
  rate(container_cpu_usage_seconds_total{namespace="redash"}[5m])
)

So far, our query is summing the CPU usage rate for each pod by name.

Retrieving Pod Labels

For the next step, we need to get the pod labels, “pod” and “label_app”. We can do this with the query:

group(kube_pod_labels{label_app=~"redash-.*"}) by (label_app, pod)

By itself, kube_pod_labels returns all existing labels. The code between the braces is a filter acting on label_app for values beginning with “redash-”.

We don’t, however, want all the labels, just label_app and pod. Luckily, we can exploit the fact that kube_pod_labels series always have a value of 1. This allows us to use group() to aggregate along the two pod labels that we want. All the others are dropped from the results.

Joining Things Up

So far, we’ve got two aggregation queries. Query 1 uses sum() to get CPU usage for each pod.  Query 2 filters for the label names label_app and pod.  In order to get our final graph, we have to join them up. To do that we’re going to use two tricks, label_replace() and metric joining.

The reason we need label replace is that at the moment query 1 and query 2 don’t have any labels in common.  We’ll rectify this by replacing pod_name with pod in query 1. This will allow us to join both queries on the label “pod”. We’ll then use the multiplication operator to join the two queries into a single vector.

We’ll pass this vector into sum(), aggregating along label_app. Here’s the final query:

sum by(label_app) (
  group by(label_app, pod) (kube_pod_labels{label_app=~"redash-.*"})
  * on(pod) group_right(label_app)
  label_replace(
    sum by(pod_name) (
      rate(container_cpu_usage_seconds_total{namespace="redash"}[5m])
    ),
    "pod", "$1", "pod_name", "(.+)"
  )
)


Hopefully this PromQL tutorial has given you a sense for what the language can do.  Prometheus takes its name from a Titan in Greek mythology, who stole fire from the gods and gave it to mortal man.  In the same spirit, I’ve written this tutorial to put some of the power of Prometheus in your hands.

You can put the ideas you’ve just read about into practice using the resources below, which include online code editors to play with the fire of PromQL at your own pace.

PromQL Tutorial Resources

PromLens

This online editor allows you to get started with PromQL without downloading Prometheus. As well as tabular and graph views, there is also an “explain” view. This gives the straight dope on what each function in your query is doing, helping you understand the language in the process.

Grafana Fundamentals

This tutorial by Coralogix explains how to integrate your Grafana instance with Coralogix, or you can use our hosted Grafana instance that comes automatically connected to your Coralogix data.

Prometheus on Coralogix

This tutorial will demonstrate how to integrate your Prometheus instance with Coralogix, to take full advantage of both a powerful open source solution and one of the most cutting edge SaaS products on the market.

5 Technical Metrics You Need for Observability in Marketing

Metrics measuring user engagement on your website are crucial for observability in marketing. Metrics will help marketing departments understand which of your web pages do not provide value for your business. Once known, developers can look at the web page’s technical metrics and determine if updates are required.

Typically, user engagement statistics, like the average time users spend on your page, are stored separately from technical site logs. Since user engagement is often affected by a web page’s technical behavior, it is crucial to compare technical and key marketing metrics for observability.

Referring to both marketing and technical data in the same environment can give companies even more insight into why users show specific behavior trends when engaging with the website. 

Tools Exist to Record Marketing Data

There are many tools available that record analytic data and provide the observability in marketing statistics required to troubleshoot website content and behavior. Tools like Tableau can support marketing directives by tracking users’ behavior. Different analytic tools track marketing metrics differently, but more important than the tools you use is your understanding of how to measure marketing success in observability.

Let’s take a look at 5 top technical metrics for observability in marketing.

1. Bounce Rate

What is Bounce Rate?

Bounce rate is a metric that compares the number of users who hit your webpage with the number of users who take absolutely no action once they get there. The user has only reached the landing page and not engaged with it at all before leaving. A bounce rate between 26% and 40% is ideal, though anywhere from 26% to 70% is typical of webpages.

What Does Having a High Bounce Rate Mean?

A high bounce rate means most of your website visitors leave before engaging with your content. The users have not found what they wanted from your landing page, or your page has not convinced them that the content is worth looking into more deeply.

What Can Cause a High Bounce Rate?

Many different things can cause a high bounce rate. The user may have gone to the wrong page, they may not have understood the page’s content and decided to go elsewhere, or your content might not have met their needs. Designers can solve each of these issues by editing your content so visitors gain value from it.

A slow loading time on your website can also cause a high bounce rate. Users who only visit the landing page may turn away if the page does not load efficiently, or if not all assets load soon after arrival.

Developers can record load times in technical logs, providing information to see if their poor-performing pages correlate with slowing load times. A blank page or page with a server-side error such as resources not being found could also cause users to leave your landing page without interacting with it.

Alternatively, developers may have deployed a new feature to your webpage, causing a scheduled page outage. This outage would raise your bounce rate for some time. It would be useful to align bounce rate with outages on a time graph to correlate these events.

What Does Having a Low Bounce Rate Mean?

A low bounce rate is generally a positive thing. However, typically when the bounce rates are too low, data from the website is not being collected properly. Well-designed and implemented web pages still usually have a bounce rate above 25%.

Low bounce rates generally mean that there is an issue measuring bounce rate. Check the setup of your analytics software to ensure it measures your metrics properly. Most analytic software will provide tips for troubleshooting your metrics to ensure they are tracking accurate data.

2. Page Depth

What is Page Depth?

Page depth, or pages per session, is a measurement of the number of pages in your website visited by a user during a session. Designers and developers use average page depth to understand how interested visitors are in your content.

Ideal page depth values will differ depending on your website and how many internal links you have. Search engines see deep pages as less critical and may not show them in searches, so effective website designs tend to keep total page depth below 5.

What Does Having a Low Average Page Depth Mean?

When you look at your page depth, you want to look at depth values for each group of pages and not necessarily for your entire website. Looking at sections will let you know which of your pages needs attention and which are performing.

A low page depth (compared to the number of available pages) per group means that users are not inspired to act on your pages, and you need to revise them to meet your goals.

What Can Cause a Low Average Page Depth?

Designers will need to identify what is blocking users from taking the next link you provide to a page. It could be that they lose interest in your content, so it needs to be revised to be more engaging. However, the issue could also be a server-side error like the webpage not being found or having too big of a data request. Look at page depth in conjunction with server-side errors in a graph to see if errors frequently occur on pages with low depth before editing content.  

3. Average Session Duration

What is the Average Session Duration?

Average session duration tells designers how long users spend interacting with your website. The measurement typically runs from when the user first engages with your site until that session ends.

If the user returns later, that is considered a new session, and the user counts as a returning visitor. Exactly when a session begins and ends depends on your analytics software. An average session duration over 3 minutes is typically considered acceptable.

What Does Having a Low Average Session Duration Mean?

Today almost everything is accessible online. Websites and apps compete for users’ attention more than any other commodity. The time users spend on your page is valuable no matter what your product.

If this time is low, you need to look at your website’s goal and how you can convey value to your users immediately.  Likewise, users have little patience for error-filled websites, and you will lower your average session duration if you have prolonged or repeated technical issues. 

What Can Cause a Low Average Session Duration?

If your average session duration is low, compare it to technical metrics to determine if there is something wrong with your website rather than your content. Server errors causing web content not to load would easily cause users to leave your page since they don’t have a chance to see what value you are trying to provide. 

Another common way to increase average session duration is to add videos to web pages. Videos tend to have longer load times due to their size. Tracking this load time alongside your average session duration shows whether your video has the desired effect or instead causes users to bounce more readily from your website.

4. Returning Visitors

How Are Returning Visitors Measured?

The measurement of returning users is dependent on which analytics software you use. For example, Google Analytics creates a client identifier and places it in a cookie on the user’s device. When the user returns, Google Analytics recognizes the client identifier and logs the user as a returning visitor.

Returning visitors are crucial to track because they are more than 70% more likely to provide successful conversions for your company than first-time visitors. 

How Can You Improve Visitor Return Rates?

There are a few ways to improve visitor return to your website. You could send out emails, use social media, or create a push notification list. Push notifications will send users a message on their desktop or mobile device. When they click on the notification, the notification will automatically direct users to the pre-set webpage.

Suppose you are using a service for push notifications, like AWS Simple Notification Service (SNS). In that case, you will want to keep track of technical logs showing any errors in push notifications or your logic surrounding them.

If you set up notifications expecting a rise in visitor return rates, a simple item to check first would be if your service sent the message as expected. It is also useful to see if your return rates spike immediately following this communication with your users so you can see if they are effective at getting users to return to your site. Again, combining your return rate metrics with SNS logging would be useful here.  
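As a minimal sketch in Python with boto3 (the topic ARN and message are placeholders), publishing a push notification and logging failures so they can be correlated with your return rate metrics might look like:

import logging
import boto3
from botocore.exceptions import ClientError

logging.basicConfig(level=logging.INFO)
sns = boto3.client("sns")

def send_push(topic_arn: str, message: str) -> None:
    try:
        # Publish the notification to every subscriber on the topic
        response = sns.publish(TopicArn=topic_arn, Message=message)
        logging.info("push sent, message id %s", response["MessageId"])
    except ClientError as err:
        # Log failures so they can be correlated with return-rate dips later
        logging.error("push failed: %s", err)

send_push("arn:aws:sns:eu-west-1:123456789012:site-updates", "New content is live!")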

5. Conversion Rate

What is the Site Conversion Rate?

Conversions are the ultimate goal of a webpage, what your website seeks to acquire from a user. A conversion is when a user takes some action that you require of them. Conversions could be having a user sign up for a service, make a purchase, download a whitepaper, schedule a demonstration, or some other activity.

Conversion rates compare the users who have completed the solicited action against all users who visited your page.

What Causes a Low Conversion Rate?

The cause of a low conversion rate somewhat depends on what the conversion activity is. For conversions involving a purchase, extra costs such as shipping are the biggest reason for abandoning a cart. But, no matter the action, there are common technical issues that can cause a low conversion rate. 

Users tend to leave pages that load slowly. Your site could lose 25% of its users if the load time is more than four seconds. Many customers expect that pages should take two seconds or less to load.  Correlating your conversion rate to page load time should tell you if this is the cause of the low conversion rate.

Web pages that crash or freeze also reduce conversion rates. Crashes can be caused by unhandled status codes being returned from any APIs used on your webpage, or by trying to load data that is not expected into your page. Freezes are often caused by your webpage being caught in an infinite loop or memory leak somewhere in the code behind your webpage.

Connectivity issues are especially a problem on mobile devices, where unstable network connections can interrupt page loads. Crashes can give users a poor perception of not only your website but your brand, making them less likely to return and produce a conversion at a later time.

Logging these website errors alongside conversion rates in time can show correlations between conversion rate changes and issues with your webpage.

Summary

Observability in marketing metrics is crucial for understanding which of your website’s pages are working and which need work. Seeing marketing metrics alongside technical metrics will show marketing departments which pages designers should rewrite and which developers should improve technically.

The Coralogix Observability Platform enables users to see anomalous technical issues in real time, indicating to developers that those issues could be affecting the user experience of website visitors. Coralogix supports input from any data source and type, including .NET, NodeJS, and Java, which are commonly used in web development.

If combined with marketing data tools like Tableau, developers and designers can see the real impact that specific technical issues like page load time have on their user experience.

Bounce rate, conversion rate, return visitors, average session duration, and page depth are all affected by page load time. Website errors and crashes also significantly reduce conversion rates. Showing page load time and error analyses in the same logging console as marketing metrics can help marketing departments pinpoint potential causes of less-than-ideal rates and work efficiently to fix their website and improve their business.