What is the Benefit of Including Security with Your Observability Strategy?

Observability strategies are needed to keep applications stable and performant, especially when they are backed by complex distributed environments. Large volumes of observability data are collected to support automated insights into application health. Logs, metrics, and traces are the three pillars of observability that feed these insights. Security data, however, is often kept isolated rather than combined with the data collected by existing observability tools, leaving security teams to rely on separate tools and data collection pipelines independent of the existing observability strategy.

Combining security and observability data can benefit organizations by enhancing overall system resilience, identifying potential threats, and improving incident responses. 

By utilizing the internal visibility offered by observability and integrating it with security data, businesses can expand their monitoring capabilities across every aspect of their IT environment, establishing security observability. This single pane of glass, in turn, makes it easier to identify, analyze, and respond to suspicious activity and anomalies that can come from various attack vectors.

Application performance monitoring and security

Application performance monitoring (APM) uses software tools to detect and isolate application performance issues. Couple APM with observability techniques to assess application health by tracking relevant key performance indicators (KPIs) such as load, response time, and error rate. These KPIs logically overlap with security metrics, so they can also be used to detect security events. Security data, like logs from SIEM systems, can be integrated with APM logs to detect security issues more efficiently. The SIEM data could include information like login attempts and authentication failures.

Consider a financial services web application that uses APM tools to track user experience. The APM data would detect events like users experiencing slow response times when accessing the application. If response times exceed a defined threshold, DevOps teams are alerted to the issue. When combined with security data, a matching alert might show a spike in authentication failures. When these two signals are correlated, the ops team can quickly discern that the slowdown and the authentication errors stem from a potential brute-force attack against user accounts.
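
To make this kind of correlation concrete, here is a minimal sketch in Python that joins the two signals on a shared time bucket. It assumes the APM and SIEM data have already been exported as per-minute aggregates; the field names, thresholds, and values are illustrative placeholders, not a specific product's API.

```python
import pandas as pd

# Hypothetical per-minute aggregates exported from an APM tool and a SIEM.
apm = pd.DataFrame({
    "minute": pd.date_range("2024-01-01 10:00", periods=5, freq="min"),
    "p95_response_ms": [180, 190, 950, 1100, 200],
})
siem = pd.DataFrame({
    "minute": pd.date_range("2024-01-01 10:00", periods=5, freq="min"),
    "failed_logins": [3, 4, 210, 260, 5],
})

# Join the two signals on the time bucket so they can be evaluated together.
combined = apm.merge(siem, on="minute")

# Flag windows where slow responses and failed logins spike at the same time,
# which may indicate a brute-force attempt rather than a pure performance issue.
suspicious = combined[
    (combined["p95_response_ms"] > 500) & (combined["failed_logins"] > 100)
]
print(suspicious)
```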

Real user monitoring and security

Real user monitoring (RUM) collects data about user interaction with applications. RUM data detects poor user experience, telling DevOps teams there is some issue in the stack. Tools collect details about user interaction, such as page load times and click-through rates. When combined with observability data, RUM helps teams identify the root cause of an issue so they can fix it quickly and reduce the impact on user experience. Combining this data further with security data, like logs from web application firewalls or intrusion detection systems, helps teams identify when the problem is not with the stack directly but is due to a security breach.

RUM metrics will detect a sudden increase in page load times, while security logs simultaneously show a surge in requests with potentially malicious payloads. Combining security and observability data correlates these events, revealing that the detected performance degradation is likely linked to a distributed denial of service (DDoS) attack. An early response from a linked alarm allows response teams to quickly implement security measures to mitigate the ongoing threat.

Infrastructure monitoring and security

Infrastructure monitoring collects performance data from your technology infrastructure, including servers, networks, containers, virtual machines, and databases. This monitoring aims to identify bottlenecks or anomalies in near real-time so maintenance can occur quickly, improving reliability and providing optimal performance. When combined with security metrics, infrastructure monitoring can further enhance the security of underlying IT infrastructures.

Infrastructure monitoring commonly collects CPU usage, memory consumption, and other performance-related metrics. Security metrics like SIEM logs contain information on detected security incidents, such as firewall events. If an unusual spike in network traffic is detected through infrastructure monitoring and security events show high numbers of suspicious login attempts across multiple servers, these events can be correlated, indicating a potential distributed brute-force attack. Early detection allows for a quick incident response: teams can implement security measures to thwart the attack and restore performance for real users by optimizing resources.

Anomaly detection and security

Anomaly detection in observability analyzes patterns in observability data to help predict issues. The purpose is to quickly identify and notify teams of unexpected data patterns like CPU or memory usage surges, a spike in erroneous transactions, or a sudden drop in web traffic. Observability tools offer algorithms that track such changes, including thresholding, outlier detection, and machine learning algorithms that learn your system’s typical behavior. Giving these algorithms access to security data provides more context for detecting anomalies and identifying security threats as a potential cause.

Cybersecurity teams are responsible for monitoring network traffic. Anomaly detection can detect unusual patterns in network traffic to identify potential security incidents. Machine learning algorithms recognize standard patterns and flag deviations that indicate anomalous activity. If this anomaly detection system is integrated with security data such as firewall logs and intrusion detection system logs, the ability to identify anomalies indicative of security threats increases. Unusual spikes in network traffic and security events like failed authentication attempts raise suspicion of a potential brute-force attack or a compromised user account. Early detection from these insights means the security team can mitigate the incident by blocking suspicious IP addresses or strengthening authentication controls.
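
As a simplified illustration of the thresholding and outlier-detection idea, the sketch below compares recent traffic samples against a baseline window and flags anything more than three standard deviations above it. The numbers and the cutoff are invented for the example; production systems would learn the baseline continuously.

```python
import statistics

# Hypothetical requests-per-minute samples for a network segment.
baseline = [1200, 1150, 1300, 1250, 1220, 1280]   # "normal" traffic window
recent = [5400, 5900, 1240]                        # latest samples to evaluate

mean = statistics.mean(baseline)
stdev = statistics.stdev(baseline)

# Flag recent samples more than three standard deviations above the baseline.
anomalies = [value for value in recent if (value - mean) / stdev > 3]
print(anomalies)  # [5400, 5900] stand out as a potential traffic spike
```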

Identity management and security

Identity management systems maintain a repository of user identities. These repositories include user profiles, roles, and permissions and serve as the authoritative source for managing user identities. Access controls must be defined within the system based on the principle of least privilege, where users have access only to what they need and nothing more. Policies must be configured to restrict access to sensitive resources based on assigned roles. Anomaly detection can be used in identity management systems to monitor authentication events, looking for unusual patterns such as multiple failed login attempts. Observability tools can also detect when users never access data they are permitted to see, so that unnecessary permissions can be revoked.

Combining identity management observability with security data such as SIEM logs and authentication server logs allows for the correlation of events. A sample case: anomaly detection flags a spike in failed login attempts while security logs simultaneously show unauthorized access attempts against a sensitive database from the same user account. Correlating these events signals a security incident and a likely compromised account, enabling early detection and an incident response that prevents malicious users from reaching sensitive data.
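
As a small, concrete example of revoking unused access, the sketch below compares the permissions granted in an identity repository against the resources actually seen in access logs over a review period. The data structures are hypothetical placeholders rather than any specific identity provider's API.

```python
# Hypothetical data: permissions granted per user vs. resources actually accessed.
granted = {
    "alice": {"billing-db", "reports", "customer-pii"},
    "bob": {"reports"},
}
accessed_in_last_90_days = {
    "alice": {"reports"},
    "bob": {"reports"},
}

# Permissions that were never exercised are candidates for revocation.
for user, permissions in granted.items():
    unused = permissions - accessed_in_last_90_days.get(user, set())
    if unused:
        print(f"{user}: consider revoking {sorted(unused)}")
```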

Summary

Full-stack observability is critical for providing fast incident response and preventing future incidents. When security threats cause incidents, there is added urgency: the faster an incident is thwarted, the less damage it does. Combining security and observability data into a single pane of glass allows teams to detect malicious activity faster than examining the data separately. This holds for many different monitoring techniques, including APM, RUM, and infrastructure monitoring.

Coralogix is an observability SaaS platform offering full-stack observability and security observability with SIEM and CSPM, alongside managed detection and response services. Teams can easily combine observability and security data using Coralogix’s single observability offering.

Why Observability Architecture Matters in Modern IT Spaces

Observability architecture and design are becoming more important than ever for all types of IT teams. That’s because core elements of observability architecture are pivotal in ensuring the smooth functioning, reliability, and resilience of complex software systems. And observability design can help you achieve operational excellence and deliver exceptional user experiences.

In this article, we’ll delve into the vital role of observability design and architecture in IT environments. We’ll also showcase how a well-crafted observability strategy drives continuous success.

What is observability architecture?

Observability architecture refers to the systematic design and structure of a framework that lets you gain insights into the behavior, performance, and health of complex IT systems and applications. It encompasses both technical and strategic considerations to ensure effective implementation of observability practices in the IT ecosystem. 

At its core, observability architecture aims to create a holistic view for IT teams so they can better understand how different components interact, identify potential issues and make informed decisions to optimize system performance. Let’s dive into the technical aspects and design considerations that make up a robust observability framework: 

  • Data collection: Gather metrics, logs, traces, and events from diverse sources.
  • Centralized storage: Efficiently store data for real-time analysis and retrospective study. 
  • Data correlation: Link data sources for a holistic view, enriching insights. 
  • Real-time analysis: Visualize health, trends, and anomalies using dashboards.
  • Distributed tracing: Follow request paths in microservices, optimizing performance.
  • Anomaly detection: Use thresholds, ML for swift alerts on deviations. 
  • DevOps integration: Seamlessly integrate observability into DevOps pipelines.
  • Scalability: Design for scalability with growing data volumes. 
  • Security: Implement data protection and comply with regulations. 
  • Continuous improvement: Evolve with feedback, optimizing operations.

A well-designed observability architecture combines the technical elements above to monitor, troubleshoot, optimize, and enhance the performance and reliability of systems.

Types of observability architecture 

There are two common types of observability architecture: microservice-based systems and event-driven architecture. 

  1. Microservice-based systems

Microservice architecture is a popular design pattern where an application is divided into a set of loosely coupled and independent services. Each microservice performs a specific business function and communicates with others through APIs.

Observability architecture in microservice-based systems typically involves:

  • Distributed tracing: Due to the distributed nature of microservices, distributed tracing is crucial to track transactions across various services. It enables end-to-end visibility into the flow of requests and responses, helping identify performance bottlenecks and dependencies.
  • Metrics collection: Each microservice generates metrics related to its performance, resource utilization and error rates. Observability architecture involves collecting and aggregating these metrics to gain insights into the overall health of the system.
  • Centralized logging: Logging plays a vital role in understanding the behavior of individual microservices. Centralized logging allows teams to access logs from all microservices in one place, simplifying troubleshooting and monitoring.
  • Service mesh: Observability architecture in microservice-based systems often includes a service mesh that provides transparent service-to-service communication. Service mesh also enables observability features like distributed tracing and traffic monitoring.
  2. Event-driven architecture 

Event-driven architecture (EDA) is an approach where components communicate by producing and consuming events. EDA allows for decoupled and asynchronous communication between different parts of the system. Observability architecture in event-driven systems typically involves:

  • Event tracing: In an event-driven system, events are at the core of communication. Observability architecture includes event tracing to track the flow of events and understand how events are processed and propagated.
  • Event stream processing: Observability architecture in event-driven systems may include event stream processing to analyze and process large streams of events in real time. Event stream processing also helps identify patterns, trends, and anomalies.
  • Message queues: Message queues are often used in event-driven systems to manage the flow of events and ensure reliable event delivery. Observability architecture may include monitoring message queues for performance and reliability.
  • Event logging and auditing: Logging events and auditing their processing is essential for understanding the behavior of an event-driven system. Observability architecture involves capturing and analyzing event logs for troubleshooting and compliance purposes. 

In both microservice-based and event-driven architectures, observability architecture aims to provide comprehensive insights into the system’s behavior, performance and health. The tools and practices used for observability may vary based on the specific requirements and complexities of each architecture.
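
Distributed tracing and event tracing are easier to picture with a short example. The sketch below uses the open-source OpenTelemetry Python SDK (assumed to be installed as opentelemetry-sdk) to create nested spans and print them to the console; the service and span names are illustrative, and a real deployment would swap the console exporter for one that ships spans to its observability backend.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure a tracer that prints finished spans to stdout (swap in an OTLP
# exporter to send spans to your observability backend instead).
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

# Nested spans record how a request flows across logical operations,
# capturing the latency of each step.
with tracer.start_as_current_span("handle_checkout"):
    with tracer.start_as_current_span("charge_payment"):
        pass  # call the payment service here
    with tracer.start_as_current_span("publish_order_event"):
        pass  # emit an event for downstream consumers
```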

12 advantages of observability architecture

Modern software architecture presents several issues that organizations need to address to ensure the success of their projects. Such challenges include managing complexity, ensuring security, optimizing performance, integrating diverse technologies and adapting to rapid technological advancements.

Using an observability architecture and design is one way to overcome these challenges. Here are some advantages of end-to-end observability architecture in an IT environment.

  1. Holistic view: Provides a complete perspective of applications and infrastructure interactions.
  2. Swift issue detection: Rapidly identifies anomalies, minimizing downtime.
  3. Efficient troubleshooting: Pinpoints root causes for effective issue resolution.
  4. Enhanced user experience: Optimizes performance for seamless interactions.
  5. Proactive optimization: Identifies performance bottlenecks for proactive improvements.
  6. Comprehensive insights: Understands system behavior for optimal resource allocation.
  7. Collaborative approach: Promotes teamwork between development and operations.
  8. Accurate root cause analysis: Pinpoints exact event sequences for accurate analysis.
  9. Data-driven decisions: Supports informed choices based on user behavior and trends.
  10. Continuous improvement: Encourages iterative enhancements through real-time insights.
  11. Regulatory compliance: Ensures adherence to compliance standards.
  12. Efficient capacity planning: Facilitates resource allocation based on utilization patterns.

Observability architecture is the backbone of modern IT systems, offering a strategic framework for holistic insights into system behavior, swift issue detection, and proactive optimization.

By integrating technical components with a focus on user-centric design, organizations can achieve operational excellence, enhance user experiences, and drive continuous improvement. Embracing observability architecture is paramount in navigating the complexities of IT environments, ensuring resilience, and delivering optimal performance in a dynamic digital landscape.

For more information about observability, read our full-stack observability guide.

Parquet File Format: The Complete Guide

How you choose to store and process your system data can have significant implications on the cost and performance of your system. These implications are magnified when your system has data-intensive operations such as machine learning, AI, or microservices.

And that’s why it’s crucial to find the right data format. For example, Parquet file format can help you save storage space and costs, without compromising on performance. 

This article will help you better understand the different types of data, and the characteristics and advantages of Parquet file format. To get a complete picture of Parquet’s impact on your unique application stack, get a full-stack observability platform for your business.

What is the Parquet file format?

The Parquet file format is a structured, column-oriented data format that requires less storage space and offers higher performance than text-based formats such as CSV or JSON. 

Parquet files support highly efficient compression and encoding schemes, resulting in a file optimized for query performance and minimal I/O. The Parquet file format maximizes the effectiveness of querying data with serverless technologies like Amazon Athena, BigQuery, and Azure Data Lake. Apache Parquet, an open-source project of the Apache Software Foundation, is built from the ground up using Google's record shredding and assembly algorithm, and is available to all. 

Parquet file format characteristics 

The Parquet file format stands out for its special qualities that change how data is structured, compressed and used. Let’s deep dive into the main features that make Parquet different from regular formats.

  • Column-based format

The key difference between a CSV and a Parquet file is how each one is organized. A CSV file is structured by row, while a Parquet file is organized by column, with every column independently accessible from the rest. 

Since the data in each column is expected to be of the same type, the Parquet file format makes efficient encoding, compression, and optimized data storage possible.

  • Binary format

Parquet file formats store data in binary format, which reduces the overhead of textual representation. It’s important to note that Parquet files are not stored in plain text, thus cannot be opened in a text editor.

  • Data compression

Parquet file formats support various compression algorithms, such as Snappy, Gzip, and LZ4, resulting in smaller file sizes compared to uncompressed formats like CSV. You can often expect a size reduction of nearly 75% when converting data from other formats to Parquet.

  • Embedded metadata

Parquet file formats include metadata that provide information about the schema, compression settings, number of values, location of columns, minimum value, maximum value, number of row groups and type of encoding. 

Embedded metadata helps in efficiently reading and processing the data. Any program that’s used to read the data can also access the metadata to determine what type of data is expected to be found in a given column.

  • Splittable and parallel processing

Parquet file formats are designed to be splittable, meaning they can be divided into smaller chunks for parallel processing in distributed computing frameworks like Apache Hadoop and Apache Spark.
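
To see these characteristics in practice, here is a minimal sketch using pandas and PyArrow (both assumed to be installed). It writes a small DataFrame to a Snappy-compressed Parquet file and then inspects the embedded metadata without reading the data itself; file and column names are illustrative.

```python
import pandas as pd
import pyarrow.parquet as pq

# Write a small dataset to a Snappy-compressed, column-oriented Parquet file.
df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "country": ["US", "DE", "JP"],
    "spend": [12.5, 7.0, 31.2],
})
df.to_parquet("events.parquet", compression="snappy")

# The embedded metadata describes row groups, columns, and encodings
# without reading any of the actual data.
metadata = pq.ParquetFile("events.parquet").metadata
print(metadata.num_rows, metadata.num_row_groups, metadata.num_columns)
print(metadata.row_group(0).column(0))  # per-column details such as min/max stats
```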

Parquet file format vs CSV

While CSV is widely used in major organizations, CSV and Parquet file formats are suitable for different use cases. Let’s look at the differences between these two specific formats in order to help you choose a data storage format.

  • Storage efficiency

Parquet file format is a columnar storage format, which means that data for each column is stored together. The storage mechanism enables better compression and typically results in smaller file sizes compared to row-based formats.

CSV is a row-based format, where each row is represented as a separate line in the file. The format does not offer compression, often resulting in larger file sizes.

  • Query performance

CSVs need you to read the entire file to query just one column, which is highly inefficient.

On the other hand, Parquet’s columnar storage and efficient compression make it well-suited for analytical queries that only need to access specific columns, which leads to faster query performance when dealing with large datasets. A recent survey by Green Shield Canada found that with the Parquet file format, they were able to process and query data 1,500 times faster than with CSVs (a short sketch after this comparison shows the column-pruning idea in practice). 

  • Schema evolution

Parquet file format supports schema evolution by default, since it’s designed with the dynamic nature of computer systems in mind. The format allows you to add new columns of data without having to worry about your existing dataset. 

CSV files on the other hand, do not inherently support schema evolution, which can be a limitation if your data schema changes frequently.

  • Compatibility and usability

Parquet is designed for machines and not for humans. Using Parquet file format in your project may require additional libraries or converters for compatibility with some tools or systems.

CSV is a simple and widely supported format that can be easily read and written by humans and almost any data processing tool or programming language, making it very versatile and accessible.
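
Here is the column-pruning sketch referenced above. Reading a single column from a Parquet file touches only that column's data, while the CSV path has to scan every row and column first; the file names are placeholders, and actual speedups depend on your data.

```python
import pandas as pd

# Parquet: only the requested column is read from disk, thanks to the
# columnar layout and embedded metadata.
spend = pd.read_parquet("events.parquet", columns=["spend"])

# CSV: the whole file must be read and parsed before the column can be used.
all_rows = pd.read_csv("events.csv")
spend_csv = all_rows["spend"]

print(spend["spend"].sum(), spend_csv.sum())
```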

Advantages of Parquet File Format

Parquet’s major selling point is the direct impact it has on business operations, particularly storage costs, computation costs, and analytics. In this section, we will examine how Parquet helps with these three factors.

  • Save storage costs

A Parquet file format is built to support flexible compression options and efficient encoding schemes. Data can be compressed by using one of the several codecs available; as a result, different data files can be compressed differently. 

Therefore, Parquet is good for storing big data of any kind (structured data tables, images, videos, documents). This way of storing data translates directly to savings on cloud storage space. 

  • Save computation costs

As the data within each column is of the same type, compression of each column is straightforward, making queries even faster. When querying columnar storage you can skip over the non-relevant data very quickly. As a result, aggregation queries are less time-consuming compared to row-oriented databases.

  • Efficient analytics and high-speed transactions

Parquet files are well-suited for Online Analytical Processing (OLAP) use cases and reporting workloads. With Parquet, you receive fast data access for specific columns, and improved performance in distributed data processing environments.

Parquet is often used in conjunction with traditional Online Transaction Processing (OLTP) databases. OLTP databases are optimized for fast data updates and inserts, while Parquet complements them by offering efficient analytics capabilities.

Parquet file format for better data storage

The Parquet file format is one of the most efficient data storage formats in the current data landscape, with benefits such as lower storage requirements, efficient compression, and faster query performance, as discussed above. If your system requires efficient query performance, storage effectiveness, and schema evolution, the Parquet file format is a great choice.

Pair Parquet with strong application monitoring software like Coralogix to understand the true impact of data structures. Read our full-stack observability guide to get started.

Coralogix vs Splunk: Support, Pricing and More

Splunk has become one of several players in the observability industry, offering a set of features and a specific focus on legacy and security use cases. That being said, how does Splunk compare to Coralogix as a complete full-stack observability solution?

Let’s dive into the key differences between Coralogix vs Splunk, including customer support, pricing, cost optimization, and more.

Logs, Metrics, Traces and Alerting

Coralogix and Splunk support ingesting logs, metrics, and traces. While these three data types are common across most SaaS observability platforms, Coralogix uses a unique data streaming analytics pipeline called Streama to analyze data in real-time and provide long-term trend analysis without indexing. Streama opens the door for cost-optimization that is simply impossible on other architectures. 

The Bottom Line – The Difference in Cost

Splunk prices are difficult to find without booking a meeting and receiving a quote; however, Splunk does publish prices for logs on the AWS Marketplace, which is what we’ve used as a comparison. Splunk comes in at almost 4x the cost of Coralogix.

This means Coralogix represents a 66%–76% cost saving for 100 GB of logs per day. This is powered by Coralogix’s revolutionary TCO Optimizer and the Streama© architecture.

Data Correlation and Usability 

Coralogix and Splunk ingest logs, metrics, and traces from many different sources, but Coralogix excels at bringing all this data together into a single, cohesive journey, letting users move between data types seamlessly. 

Coralogix Flow Alerts

Coralogix alerting has unique features like Coralogix Flow Alerts, which allow users to orchestrate their logs, metrics, traces, and security data into a single alert that tracks multiple events over time. Using Flow Alerts, customers can track changes in their system as they unfold.

Archiving and Archive Query

Both Splunk and Coralogix offer users the ability to archive their data into an S3 bucket. Doing so has a huge cost impact because S3 offers very low retention costs and opens the door for long-term retention of data, for historical analysis or compliance purposes.

However, only Coralogix enables customers to query their remote archive directly, without reindexing and at no additional cost. Splunk requires users to reindex their data into high-performance storage before it can be analyzed. Therefore, the so-called cost savings made through archiving in Splunk are potentially delayed costs rather than true savings. 

Cost Optimizations

  1. Coralogix

Coralogix users start by indexing the majority of their data, but over time, they tend to transfer more data to the archive. This is because it can be queried in seconds, at no additional cost.

This functionality means customers can store the majority of their data in S3 and pay at most $0.023 / GB for storage. Coupled with Coralogix’s Compliance ingest cost of $0.17 / GB, the per-GB cost for ingest and storage is $0.193 / GB for the first month and $0.023 / GB every month after that (a small worked example at the end of this section illustrates the arithmetic). Customers can cut costs by between 40% and 70%.

Compared to Splunk, Coralogix puts cost optimization entirely in the customer’s hands. Cost optimization with a Splunk deployment requires a complex analysis of different pricing plans, data ingestion approaches, and risky archiving and reindexing decisions, which could incur huge costs in the future.

  2. Splunk

For Splunk customers, there are multiple pricing plans based on ingestion volume, compute, license length and more. All of this makes for unclear, unpredictable costs that are difficult to optimize.

Coralogix doesn’t charge by cloud resources, but rather by ingestion volume. More than that, Coralogix allows customers to assign use cases to traces and logs, helping drive instant cost savings via the TCO Optimizer. These decisions are flexible, reversible, and entirely risk-free. 
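
As a back-of-the-envelope illustration of the figures above, the sketch below computes ingest-plus-storage cost for a hypothetical 100 GB/day workload using only the per-GB rates quoted earlier. It deliberately ignores retention growth beyond the first month, indexing tiers, and any plan-specific discounts.

```python
# Rates quoted above (USD per GB); the daily volume is hypothetical.
INGEST_PER_GB = 0.17    # Compliance ingest
STORAGE_PER_GB = 0.023  # S3 storage, per GB per month

daily_gb = 100
monthly_gb = daily_gb * 30

first_month = monthly_gb * (INGEST_PER_GB + STORAGE_PER_GB)
each_following_month = monthly_gb * STORAGE_PER_GB  # storage only, for that month's data

print(f"First month for {monthly_gb} GB: ${first_month:,.2f}")
print(f"Each following month (storage of that data): ${each_following_month:,.2f}")
```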

Customer Support 

Splunk does not offer global 24/7 support, even to Premium customers. Only for the most severe incidents (P1 and P2) is 24/7 support guaranteed. Otherwise, customer care is only available during office hours.

The problem with this model is simple. Incidents are often miscategorized initially. If an incident is first identified as a P3 but grows into a P1, there is potential for bureaucratic time-wasting while the incident costs you money. 

Coralogix offers all customers a median 30-second response time, an SLA measured in minutes, and 24/7 support. Coralogix also offers a median resolution time of 43 minutes. Even with the most complete support Splunk can offer, it works through issues only 10 minutes faster than Coralogix resolves them. 

Out-of-the-box Dashboards

Splunk offers some great infrastructure monitoring tools, but it is lacking in dashboards focused on specific technologies. There are no prebuilt dashboards that help to solve the biggest issues in Kubernetes or Serverless architectures. Instead, they rely on more flexible, generic dashboards. 

Coralogix has built dashboards for Kubernetes monitoring, serverless monitoring and more, while also supporting open source dashboarding solutions like Grafana, alongside its own custom dashboarding solution.

The platform’s reuse of open-source dashboards through JSON definitions, combined with the time-to-value of premade dashboards, makes its offering the best of both worlds. In addition, tools like DataMap give customers total visualization flexibility. 

Business Observability: Everything Fintech Companies Want to Know

Fintech companies operate in a complex technological and regulatory environment. They rely heavily on cloud-native technologies and microservices architectures to handle financial transactions and data, often at a massive scale. To maximize application reliability, fintech companies need full visibility into their software systems and applications.

An agile monitoring solution like observability is crucial to improving performance and user experience. Observability practices like metrics collection, distributed tracing, and log management provide companies with full visibility into the health and performance of their applications.

This article will go over what business observability is, why it’s important and five steps you need to take to achieve business observability.

What is business observability?

Business observability is how you collect and manage business data in an automated way. Business observability grants you visibility into what’s happening across your organization and teams. 

With business observability, your day-to-day operations can run smoothly. Your business will begin to monitor and track your KPIs in real-time, identify trends and opportunities, and make data-informed decisions. 

How to achieve your business observability goals

1. Data security

Data security is a major concern for fintech providers. For instance, a recent poll from Imperva reports that 45% of consumers will no longer use a service after it has experienced a data breach. Therefore, ensuring the security of your users’ sensitive financial data is critical to your business continuity.

With business observability, you can monitor transactions and transaction data in real-time to uncover fraud or other security incidents. Observability data provides audit trails to monitor access and changes to sensitive customer data, making it easier to identify, isolate, and reduce the blast radius of compromised transactions or system functions. 

And most importantly, observability data helps in root cause analysis to determine the causes of any incidents. By tracking the pathways of suspicious transactions using logs and metrics, you can better identify threats and deal with them before they compromise your system.

2. Industry regulations and compliance 

The finance industry is highly regulated due to the sensitivity of the financial data handled by companies in this industry. Fintech companies need to manage their data properly to ensure they are in compliance with all regulations—and business observability is the perfect solution. 

With observability tools, you can easily demonstrate compliance with data privacy requirements like the GDPR by tracing how data is collected, stored and used throughout the customer journey. Business observability tools also help you monitor data access and usage to determine who is accessing what data and how they are using it. The observability data provides visibility into unauthorized access or data misuse. 

Coralogix’s alerting function allows you to set up alerts for policy violations, making it easy to detect when data governance policies, such as unauthorized access, are violated. In the event of a data breach, observability data makes it easy to determine the scope and impact of the incident. Providers can use this data to meet regulatory requirements for reporting and responding to incidents in the specified timeframe.

3. Personalized user experience 

While security is always at the forefront of considerations for fintech companies, customer experience usually comes as an afterthought, according to a recent report by Zendesk. In an industry where reliability is paramount to business success, fintech companies have to rethink their approach to user experience. 

Business observability can help you understand how customers navigate your platform by analyzing logs and metrics from usage. Armed with this type of data, you can tailor the user interface for different segments of your customer base.

Business observability data also helps identify error spikes, increases in customer support usage, and customer service gaps. These datasets represent opportunities to improve how users interact with your application. 

Observability platforms can also help identify consumer trends and match your customers with the right offers, products and solutions. With Coralogix, you can collate customer complaints from multiple touchpoints like online forms, customer service calls and even branch visits. This data can be matched to customer profiles to make it easier to resolve issues without compromising data fidelity.

4. System reliability 

System reliability is another core component of user experience, especially for fintech providers. Downtime, whether it stems from minor issues like system errors or major ones like data breaches, is bad news for any fintech company.

Users do not want to bank with a company that can be offline at the moment they need to make a vital transaction. Therefore, fintech providers need to reduce the occurrence of outages and also minimize their impact if they do occur.

Observability tools provide preventive monitoring that can help you identify and prevent threats before they turn into incidents. Observability data enables development teams to gain insight into how changes and updates can impact the system, allowing them to roll back risky changes before they cause downtime or impact customers.

By using data to identify subtle changes that can compound over time, business observability allows fintech companies to move from reactive problem solving to proactive application performance management.

5. Managing third-party dependencies

Fintech companies rely on a variety of third-party systems to function efficiently, such as payment processors, banking APIs, identity verification systems, and cloud providers. A failure in any of these systems will impact the overall performance of the company.

Fintech providers can use observability tools to monitor these interdependency layers and identify failures before they impact the system. Business observability data can also be used to ensure that third-party service providers are meeting the agreements stipulated in their SLAs. By monitoring usage data around these third-party services, fintech companies can gain insight into how their users interact with them and optimize those integrations.

Take fintech to the next level 

Data visibility is the key to future success for fintech providers. Business observability provides fintechs with the functionality needed for uptime, compliance, and reliability, enabling them to move from reactive firefighting to proactive issue fixing.

Can Observability Push Gaming Into the Next Sphere?

The gaming industry is an extensive software market segment, reaching over $225 billion US in 2022. This staggering number represents gaming software sales to users with high expectations of game releases. User acquisition takes up a large part of software budgets, with $14.5 billion US spent globally in 2021. User retention is critical to the success of any game, especially where monetization depends on driving in-app purchases and ad revenue. User engagement is also a key performance indicator of your game’s success, and it can be measured using a log monitoring tool. User retention and engagement require a great game concept and a great user experience. 

Delivering a stellar user experience takes time, which is something gaming companies are always fighting against. User retention is highly dependent on releasing new features fast, but when features are released early, they can introduce bugs that negatively impact user experience. In such a cut-throat, time-sensitive market, there is no time to thoroughly test complex new features before release, as is standard in many other software sectors. Instead, gaming software depends on user feedback and tooling to help teams fix problems.

Observability tools are invaluable to game developers, customer experience specialists, and LiveOps teams. With machine-learning-enabled analytics and alerting, teams can identify issues and respond proactively. The speed of patch releases can increase, and the number of released bugs can fall. The benefit of observability is an excellent user experience, enabling increased user retention.  

Understanding Observability

Observability is the ability to quickly understand the inner workings of a software system so you can ask questions about the system without any prior knowledge. The benefit of observability is that teams are made aware of potential issues earlier, and observability tools can help teams fix issues before the user experience suffers. It is more than simply a combination of logs, traces, and metrics in a single visualization: the proper observability tool can surface defects in software before release without developers having to predict any particular issue.

Observability tools make use of data from throughout the gaming software. Information about the stability and health of the software infrastructure, databases, network interactions, and the game software itself should all be collected and shown in full context, including the user’s device and operating system. With robust monitoring data collection, observability tools can predict and pinpoint problems effectively. When developers do not need to enumerate every possible failure mode, feature releases across sprawling toolchains become more successful with the benefit of observability.

Minimizing Downtime

An overarching benefit of observability in gaming software is the minimization of downtime. Downtime in gaming is any time the user is not actively engaged in some activity. This includes necessary pauses for loading content or waiting for other players to take turns, but it also includes unintentional problems like network outages, unoptimized software, and scheduled maintenance. Game designers try to minimize this time since doing so increases user retention. It is evident when a game’s server or a console’s network is having problems, and many gamers flock to third-party sites to report the issues.

Observability tools offer faster reporting of issues than standard monitoring or user monitoring can offer. LiveOps teams can be alerted to issues quickly and respond proactively to problems requiring intervention. 

Downtime is widespread immediately after new feature releases. These releases tend to include complex software and infrastructure compiled by different teams across a company. With such sprawl, it is difficult to predict every potential issue, and inevitably, some bugs are not explicitly tested for in integration testing. Observability benefits quick feature releases by predicting and catching issues early, even those not explicitly tested for, through analysis of the collected data. 

The gaming industry must also support its software on many different hardware and software platforms. When all monitoring data is combined, it can be difficult to compile it in a way that catches issues specific to a particular device. Teams must also be able to distinguish problems in regional networks from issues in a particular server or cloud provider. Further, each user could be running a different game version, complicating troubleshooting. When given complete contextual information, including location, software version, and device, observability tools can isolate issues that would otherwise be lost in the noise when a problem affects only a particular subset of users. These insights allow your team to deliver a better gaming experience with fewer errors and less downtime.

Maximizing Insights

Insights from observability data can reveal user behaviors that would otherwise go unnoticed. These insights can provide feedback on game design to improve game development and inform new features. 

Insights can come from subtle patterns in user data. For example, if a statistically significant number of users tend to leave the game at a particular point in the storyline or gaming experience, there may be a usability or software issue to address. Other unusual behavior patterns can also be analyzed to improve the gaming experience. Keeping users engaged in the game increases the number of in-game transactions they complete.

Building Observability into Games

Observability tools must be well-integrated into your software and infrastructure to be effective. No matter which analytics tool is chosen, it will need access to three data types: logs, metrics, and distributed traces. Each of these should have sufficient contextual data to be combined for analysis in the observability tool. 

While integrating a new observability tool into a large codebase seems like an enormous undertaking, it can be done iteratively. Results will still benefit the areas already integrated with the tool, so it is best to start with a single platform suspected of having bugs. Continue integrating with other platforms until the entire codebase has observability instrumented. 

Many options for observability tools exist which could integrate with gaming platforms. Considerations when choosing an observability tool should include:

  • Cost

The cost of observability tools can include many different factors. Many solutions require a licensing fee and charge for the analytics features used to gain insights from data. With high-volume software like gaming, a large part of the cost will be data storage. 

  • Implementation time 

Consider the time needed to instrument your software to record observability data. With open-source solutions, extra time is also needed to implement and maintain the observability tool itself. Third-party solutions take some of that burden off your in-house team by maintaining the observability tool for you.

  • Speed of acquiring valuable insights 

Since the gaming industry is exceptionally fast-paced, consider the time required between game release and when you will have actionable insights displayed. The shorter this time, the faster you can use this data to improve your game.

Wrapping Up

The gaming industry requires robust observability solutions to find and fix issues efficiently before users are adversely affected. Observability tools can benefit the gaming industry by improving the process of finding issues throughout a game’s code, infrastructure, and network. The benefits of observability also reach game design decisions by finding areas where users display abnormal behavior. 

Observability must be instrumented in all game platforms but can be integrated iteratively. Each of the three data types should be included for each platform to gain valuable insights. Choose an observability tool where data can be collected and analyzed to empower LiveOps to handle bugs effectively. Consider a tool that reduces the cost of ample data storage, reduces implementation time, and optimizes the speed to deliver valuable insights into your game. 

Tracing vs. Logging: What You Need To Know

Log tracking, trace log, or logging traces…

Although these three terms are easy to interchange (the wordplay certainly doesn’t help!), compare tracing vs. logging and you’ll find they are quite distinct. Logs, metrics, and traces are the three pillars of observability, and they all work together to measure application performance effectively. 

Let’s first understand what logging is.

What is logging?

Logging is the most basic form of application monitoring and is the first line of defense for identifying incidents or bugs. It involves recording timestamped data from different applications or services at regular intervals. Since logs can get pretty complex (and massive) in distributed systems with many services, we typically use log levels to filter out important information. The most common levels are FATAL, ERROR, WARN, INFO, DEBUG, TRACE, and ALL. The amount of data logged at each level also varies based on how critical it is to store that information for troubleshooting and auditing applications.
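
As a quick refresher on how log levels filter output in practice, here is a minimal sketch using Python's standard logging module; the logger name and messages are made up, and the threshold set at configuration time determines which messages are emitted.

```python
import logging

# Configure a logger whose threshold is WARNING: DEBUG and INFO are dropped.
logging.basicConfig(
    level=logging.WARNING,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("billing-service")

log.debug("cache miss for invoice 42")        # filtered out
log.info("invoice 42 rendered in 120 ms")     # filtered out
log.warning("retrying payment gateway call")  # emitted
log.error("payment gateway unreachable")      # emitted
```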

Most logs are highly detailed, with relevant information about a particular microservice, function, or application. You’ll need to collate and analyze multiple log entries to understand how the application functions normally. And since logs are often unstructured, reading them from a text file on your server is not the best idea. 

But we’ve come far with how we handle log data. You can easily link your logs from any source and in any language to Coralogix’s log monitoring platform. With our advanced data visualization tools and clustering capabilities, we can help you proactively identify unusual system behavior and trigger real-time alerts for effective investigation.

Now that you understand what logging is, let’s look at what tracing is and why it’s essential for distributed systems.

What is tracing?

In modern distributed software architectures, you have dozens — if not hundreds — of applications calling each other. Although analyzing logs can help you understand how individual applications perform, it does not track how they interact with each other. And often, especially in microservices, that’s where the problem lies. 

For instance, in the case of an authentication service, the trigger is typically a user interaction — such as trying to access data with restricted access levels. The problem can be in the authentication protocol, the backend server that hosts the data, or how the server sends data to the front end.

Thus, seeing how the services connect and how your request flows through the entire architecture is essential. That provides context to the problem. Once the problematic application is identified, the appropriate team can be alerted for a faster resolution.

This is where tracing comes in — an essential subset of observability. A trace follows a request from start to finish, showing how your data moves through the entire system. It can record which services the request interacted with and each service’s latency. With this data, you can chain events together to analyze any deviations from normal application behavior. Once the anomaly is pinpointed, you can link log data from the events you’ve identified, the duration of the event, and the specific function calls that caused it — thereby identifying the root cause of the error within a few attempts.

Okay, so now that we understand the basics of tracing, let’s look at when you should use tracing vs. logging.

When should you use tracing vs. logging?

Let’s understand this with an example. Imagine you’ve joined the end-to-end testing team of an e-commerce company. Customers complain about intermittent slowness while purchasing shoes. To resolve this, you must identify which application is triggering the issue — is it the payment module? Is it the billing service? Or is it how the billing service interacts with the fulfillment service?

You require both logging and tracing to understand the root cause of the issue. Logs help you identify the issue, while a trace helps you attribute it to specific applications. 

An end-to-end monitoring workflow would look like this: use a log management platform like Coralogix to get alerted if any of your performance metrics degrade. You can then send a trace that emulates your customer’s behavior from start to end. 

In our e-commerce example, the trace would add a product to the cart, click checkout, add a shipping address, and so on. While doing each step, it would record the time it took for each service to respond to the request. And then, with the trace, you can pinpoint which service is failing and then go back to the logs to find any errors.
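
A toy example of how trace data narrows down the culprit: given the per-service latencies recorded along one traced checkout request, pick the slowest hop and inspect that service's logs first. The span data below is invented for illustration and much simpler than what a real tracing backend returns.

```python
# Per-service latencies (ms) recorded along one traced checkout request.
spans = [
    {"service": "cart", "duration_ms": 45},
    {"service": "billing", "duration_ms": 2300},
    {"service": "fulfillment", "duration_ms": 80},
    {"service": "payment", "duration_ms": 120},
]

# The slowest span points to the service whose logs to inspect first.
slowest = max(spans, key=lambda span: span["duration_ms"])
print(f"Investigate {slowest['service']}: {slowest['duration_ms']} ms")
```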

Logging is essential for application monitoring and should always be enabled. In contrast, tracing continuously means you’d bog down the system with unnecessary requests, which can cause performance issues. It’s better to send sample requests when the logs show behavioral anomalies. 

So, to sum up, if you have to choose tracing vs. logging for daily monitoring, logging should be your go-to! And conversely, if you need to debug a defect, you can rely on tracing to get to the root cause faster.  

Tracing vs. Logging: Which one to choose?

Although distributed architectures are great for scale, they introduce additional complexity and require heavy monitoring to provide a seamless user experience. Therefore, we wouldn’t recommend you choose tracing vs. logging — instead, your microservice observability strategy should have room for both. While logging is like a toolbox you need daily, tracing is the handy drill that helps you dig down into issues you need to fix. 

An Introduction to Kubernetes Observability

If your organization is embracing cloud-native practices, then breaking systems into smaller components or services and moving those services to containers is an essential step in that journey. 

Containers allow you to take advantage of cloud-hosted distributed infrastructure, move and replicate services as required to ensure your application can meet demand, and take instances offline when they’re no longer needed to save costs.

Once you’re dealing with more than a handful of containers in production, a container orchestration platform becomes practically essential. Kubernetes, or K8s for short, has become the de-facto standard for container orchestration, with all major cloud providers offering K8s support and their own Kubernetes managed service.

With Kubernetes, you can automate your containers’ deployment, management, and scaling, making it possible to work with hundreds or thousands of containers and ensure reliable and resilient service. 

Fundamental to the design of Kubernetes is its declarative model: you define what you want the state of your system to be, and Kubernetes works to ensure that the cluster meets those requirements, automatically adding, removing, or replacing pods (the wrapper around individual containers) as required.  

The self-healing design can give the impression that observability and monitoring are all taken care of when you deploy with Kubernetes. Unfortunately, that’s not the case. While some things are handled automatically – like replacing failed cluster nodes or scaling services – Kubernetes observability still needs to be built in and used to ensure the health and performance of a K8s deployment.

Log data plays a central role in creating an observable system. By monitoring logs in real-time, you gain a better understanding of how your system is operating and can be proactive in addressing issues as they emerge, before they cause any real damage. This article will look at how Kubernetes observability can be built into your Kubernetes-managed cluster, starting at the bottom of the stack.

Observability for K8s infrastructure

As a container orchestration platform, Kubernetes handles the containers running your application workloads but doesn’t manage the underlying infrastructure that hosts those containers. 

A Kubernetes cluster consists of multiple physical and/or virtual machines (the cluster nodes) connected over a network. While Kubernetes will take care of deploying containers to the nodes (according to the declared configuration) and packing them efficiently, it cannot manage the nodes’ health.

Your cloud provider is responsible for keeping servers online and providing computing resources on demand in a public cloud context. However, to avoid the risk of a huge bill, you’ll want to keep an eye on your usage – and potentially set quotas – to prevent auto-scaling and elastic resources from running wild. If you’ve set quotas, you’ll need to monitor your usage and be ready to provision additional capacity as demand grows.

If you’re running Kubernetes on a private cloud or on-premise infrastructure, monitoring the health of your servers – including disk space, memory, and CPU – and keeping them patched and up-to-date is essential. 

Although Kubernetes will take care of moving pods to healthy nodes if a machine fails, with a fixed set of resources, that approach can only stretch so far before running out of server nodes. To use Kubernetes’ self-healing and auto-scaling features to the best effect, you must ensure sufficient cluster nodes are online and available at all times.

Using Kubernetes’ metrics and logs

Once you’ve considered the observability of the servers hosting your Kubernetes cluster, the next layer to consider is the Kubernetes deployment itself. 

Although Kubernetes is self-healing, it is still dependent on the configuration you specify; by getting visibility into how your cluster is being used, you can identify misconfigurations, such as faulty replica sets, and spot opportunities to streamline your setup, like underused nodes.

As you might expect, the various components of Kubernetes each emit log messages so that the inner workings of the system can be observed. This includes:

  • kube-apiserver – This serves the REST API that allows you, as an end-user, to communicate with the cluster components via kubectl or a GUI application, and enables communication between control plane components over gRPC. The API server logs include details of error messages and requests. Monitoring these logs can alert you to early signs of the server needing to be scaled out to accommodate increased load or issues down the pipeline that are slowing down the processing of incoming requests.
  • kube-scheduler – The scheduler assigns pods to cluster nodes according to configuration rules and node availability. Unexpected changes in the number of pods assigned could signify a misconfiguration or issues with the infrastructure hosting the pods.
  • kube-controller-manager – This runs the controller processes. Controllers are responsible for monitoring the status of the different elements in a cluster, such as nodes or endpoints, and moving them to the desired state when needed. By monitoring the controller manager over time, you can determine a baseline for normal operations and use that information to spot increases in latency or retries. This may indicate something is not working as expected.

The Kubernetes logging library, klog, generates log messages for these system components and others, such as kubelet. Configuring the log verbosity allows you to control whether logs are only generated for critical or error states or lower severity levels too. 

While you can view log messages from the Kubernetes CLI, kubectl, forwarding logs to a central platform allows you to gain deeper insights. By building up a picture of the log data over time, you can identify trends and compare these to the latest data in real-time, using it to identify changes in cluster behavior.

Monitoring a Kubernetes-hosted application

In addition to the cluster-level logging, you need to generate logs at the application level for full observability of your system. Kubernetes ensures your services are available, but it lacks visibility into your application logic. 

Instrumenting your code to generate logs at appropriate severity levels makes it possible to understand how your application is behaving at runtime and can provide essential clues when debugging failures or investigating security issues.
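
For application-level logging in containers, a common pattern is to write structured (JSON) log lines to stdout so that a node or sidecar logging agent can collect and forward them. Below is a minimal sketch using only Python's standard library; the logger and field names are illustrative.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON line for easy parsing downstream."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)  # containers log to stdout
handler.setFormatter(JsonFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order created")
```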

Once you’ve enabled logging into your application, the next step is to ensure those logs are stored and available for analysis. By their very nature, containers are ephemeral – spun up and taken offline as demand requires. 

Kubernetes stores the logs for the current pods and the previous pods on a given node, but if a pod is created and removed multiple times, the earlier log data is lost. 

As log data is essential for establishing what normal behavior looks like, investigating past incidents, and meeting audit requirements, it’s a good idea to ship your logs to a centralized platform for storage and analysis.

The two main patterns for shipping logs from Kubernetes are to use either a node logging agent or a sidecar logging agent:

  • With a node logging agent, the agent is installed on the cluster node (the physical or virtual server) and forwards the logs for all pods on that node.
  • With a sidecar logging agent, each pod runs the application container alongside a sidecar container hosting the logging agent, which forwards the logs from that pod (see the sketch below).
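To make the sidecar pattern concrete, here is a rough sketch of a pod definition built with the Kubernetes Python client: the application container writes its logs to a shared emptyDir volume, and a sidecar container (a generic log-forwarder image, used here as a placeholder) reads from the same volume and ships the logs onward. All image names are assumptions.

```python
from kubernetes import client

# Shared volume so the sidecar can read what the application writes.
log_volume = client.V1Volume(name="app-logs", empty_dir=client.V1EmptyDirVolumeSource())
log_mount = client.V1VolumeMount(name="app-logs", mount_path="/var/log/app")

app_container = client.V1Container(
    name="app",
    image="example.com/my-app:1.0",          # placeholder application image
    volume_mounts=[log_mount],
)

sidecar_container = client.V1Container(
    name="log-forwarder",
    image="example.com/log-forwarder:1.0",   # placeholder logging-agent image
    volume_mounts=[log_mount],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="my-app-with-logging"),
    spec=client.V1PodSpec(containers=[app_container, sidecar_container], volumes=[log_volume]),
)

# The pod could then be created with client.CoreV1Api().create_namespaced_pod("default", pod).
```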

Once you’ve forwarded your application logs to a log observability platform, you can start analyzing the log data in real-time. Tracking business metrics, such as completed transactions or order quantities, can help to spot unusual patterns as they begin to emerge. 

Monitoring these alongside lower-level application, cluster, and infrastructure health data makes it easier to correlate data and drill down into the root cause of issues.

Summary

While Kubernetes offers many benefits when running distributed, complex systems, it doesn’t remove the need to build observability into your application and to monitor outputs from all levels of the stack to understand how your system is behaving.

With Coralogix, you can perform real-time analysis of log data from each part of the system to build a holistic view of your services. You can forward your logs using Fluentd, Fluent-Bit, or Filebeat, and use the Coralogix Kubernetes operator to apply log parsing and alerting features to your Kubernetes deployment natively using Kubernetes custom resources.

This Win is for Our Customers – We’ve Just Raised $142 Million in a Series D Round

While 2020 and 2021 were significant years for us, we’ve entered this year ready to give more to our users!

We’re delighted to share we have raised $142 million in a Series D funding round!

So what does that mean for you exactly? Over the past few months, our team has been working hard to build custom mapping for metrics, an advanced tracing UI, and more into our platform. The world is our oyster, and we can’t wait for you to see what else we have planned!

Streama Technology

Our Streama Technology produces real-time insights and trend analysis for all observability data with no reliance on storage or indexing, solving the challenges of building, running, and securing modern infrastructures and applications. You can read more about our Streama technology and see how it works for yourself!

“Our approach at Coralogix is to solve the fundamental challenges of ever-growing data volumes and system complexity. Our technology breaks the unit economics of observability to provide our customers with a cost-effective way to centralize and scale across the R&D organization. With this round of funding, we will be expanding our offering into further markets as we continue our journey to provide harmonious observability,” said Ariel Assaraf, CEO of Coralogix.

Our users typically see a 40 to 70 percent cost reduction while expanding their coverage and improving data insights via our Streama technology. The technology uses Kafka streams and independent Kubernetes services to analyze data in-stream. Coralogix uses components within the stream to store the system state and produce time-relative insights such as trend analysis, dynamic alerting, and anomaly detection with no dependence on any external datastore. All of the data is stored in the client’s S3 bucket, or any object storage, with direct query capabilities via our UI or CLI, giving the customer infinite retention.

Full-Stack Observability

We are committed to maintaining our status as a best-in-class observability tool – in 2022, we’ve taken the next steps in our evolution to provide users infinite insights into their logs, metrics, tracing, and security. 

Logs

Get full feature functionality and infinite retention, with transparent and predictable costs, without ever having to index your raw data. Generate trackable metrics from your logs on the fly using the Logs2Metrics feature. Automatically cluster millions of individual logs into templates to more quickly pinpoint issues. Lastly, with Coralogix Actions, you can easily jump to third-party vendors based on query results or values under specific keys identified via the Coralogix platform.

Metrics

Collect all metric data without the overhead, performance, and scaling issues of other solutions so that you can centralize alerts and visualizations, correlate across systems and environments, and leverage open platforms’ advanced mapping and visualization capabilities. With our new DataMap feature, you’ll be able to create visual mappings of your infrastructure, log-based, and business metrics data to represent and monitor the structure and health of your business. The feature offers full customization based on any perspective or goal.

Tracing 

Our tracing features allow you to pinpoint issues, drill down into spans, visualize data flows, resolve issues faster, and work with open-source instrumentation and visualization tools. The Coralogix Tracing UI can be used for querying, filtering, and visualizing your tracing data on the same screen as all your logs, metrics, and security events. You can filter traces by applications, subsystems, and duration, allowing you to easily identify traces that exceed a given threshold or fall within a min-max range.

We are an official contributor to OpenTelemetry, so your tracing data can easily be sent to the Coralogix platform using our exporter. Plus, in addition to offering our own UI, you can visualize your tracing data in a hosted Jaeger instance or bring your own and use Coralogix as a data source.

Security

The Coralogix security features offer you a modern solution that enables security as code and meets the highest security and compliance standards. With the platform, you’ll get full data coverage, cloud SIEM, cloud security posture, vulnerability assessments, network and host security monitoring, and access to a 24×7 security resource center.

What Are Our Plans?

In short, it’s been an exciting few months for us and the platform. As we move forward, and with this new round of funding, we are more committed than ever to providing our users with the best solution there is.

To that end, we will continue growing our go-to-market, product, and R&D teams and expand into 10 new markets. As we expand, we will be focused on maintaining our world-class support and adding to and improving our full-stack observability platform.

We understand the importance of efficiency, cost-effectiveness, and reliability, and we vow always to give our users the best there is!

Thank You to Our Investors!

We’re proud to partner with Advent International and Brighton Park Capital, who co-led this round, with participation from Revaia Ventures and existing investors Greenfield Partners, Red Dot Capital Partners, Eyal Ofer’s O.G. Tech, StageOne Ventures, Joule Capital Partners, and Maor Investments. 

3 Key Benefits to Web Log Analysis

Whether it’s Apache, Nginx, IIS, or anything else, web servers are at the core of online services, and web log monitoring and analysis can reveal a treasure trove of information. These logs may be hidden away in many files on disk, split by HTTP status code, timestamp, or agent, among other possibilities. Web access logs are typically analyzed to troubleshoot operational issues, but there is far more insight you can draw from this data, from SEO to user experience. Let’s explore what you can do when you really dive into web log analysis.

1. Spotting errors and troubleshooting with web log analysis

Right now, online Internet traffic is exceeding 333 Exabytes per month. This has been growing year on year since the founding of the Internet. With this increase in traffic comes the increased complexity of operational observability. Your web access logs are crucial in the fight to maintain operational excellence. While the details vary, some fields you can expect in all of your web logs include:

  • Latency
  • Source IP address
  • HTTP status code
  • Resource requested
  • Request and response size in bytes

These fields are fundamental measures for building a clear picture of your operational excellence. You can use them to capture abnormal traffic arriving at your site, which may indicate malicious activity like “bad bot” web scraping. You can also detect an outage by looking for a sudden increase in error-range HTTP status codes.
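As a rough illustration of how these fields are put to work, the snippet below parses common-log-format lines with a regular expression and flags an elevated 5xx rate. The log format, sample lines, and alert threshold are all assumptions to adapt to your own server configuration.

```python
import re
from collections import Counter

# Regex for a common-log-format line (an assumption; adjust to your server's format).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

sample_lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 1043',
    '203.0.113.7 - - [10/Oct/2023:13:55:37 +0000] "GET /checkout HTTP/1.1" 500 512',
    '198.51.100.4 - - [10/Oct/2023:13:55:38 +0000] "GET /checkout HTTP/1.1" 503 0',
]

status_counts = Counter()
for line in sample_lines:
    match = LOG_PATTERN.match(line)
    if match:
        # Bucket statuses into 2xx/4xx/5xx classes.
        status_counts[match.group("status")[0] + "xx"] += 1

total = sum(status_counts.values())
error_rate = status_counts["5xx"] / total if total else 0.0
print(status_counts, f"5xx rate: {error_rate:.0%}")

# A sudden jump in the 5xx rate is a simple, useful outage signal.
if error_rate > 0.05:  # threshold is an arbitrary example
    print("ALERT: elevated server error rate")
```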

2. SEO diagnostics with web logs

68% of online activity begins with a user typing something into a search engine. This means that if you’re not harnessing the power of SEO, you’re potentially missing out on a massive volume of potential traffic and customers. Despite this, almost 90% of content online receives no organic traffic from Google. An SEO-optimized site represents a serious edge in the competitive online market. Web access logs can give you an insight into several key SEO dimensions that will provide you with clear measurements for the success of your SEO campaigns.

42.7% of online users are currently using an ad-blocker, which means you may see serious disparities between the traffic to your site and the impressions you’re seeing on the PPC ads that you host. Web log analysis can alert you to this disparity very easily by giving you a clear overall view of the traffic you’re receiving, because web logs are recorded on the server side and don’t depend on the client’s machine to track usage.

You can also verify the IP addresses connected to your site to determine whether Googlebot is scraping and indexing your site. This is crucial because it won’t just tell you if Googlebot is present but also which pages it has accessed, using a combination of the URI field and IP address in your web access logs. 
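One common way to do this check is a reverse DNS lookup on the client IP followed by a forward lookup to confirm the result, since the Googlebot user-agent string alone is easy to spoof. The sketch below uses Python’s standard socket module; the example IP is purely illustrative.

```python
import socket


def is_googlebot(ip: str) -> bool:
    """Verify a crawler IP by reverse DNS, then confirm with a forward lookup."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)  # e.g. crawl-66-249-66-1.googlebot.com
        if not hostname.endswith((".googlebot.com", ".google.com")):
            return False
        # Forward-confirm: the hostname must resolve back to the same IP.
        return socket.gethostbyname(hostname) == ip
    except (socket.herror, socket.gaierror):
        return False


# Example IP for illustration only; feed in IPs pulled from your access logs.
print(is_googlebot("66.249.66.1"))
```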

3. Site performance and user experience insights from web log analysis

Your web access logs can also give you insight into how your site performs for your users. This is distinct from the operational challenge of keeping the site functional; it’s more of a marketing challenge, and keeping the website snappy is vital. Users form an impression of your site within the first 50ms, and if all they see is a white loading page, they’re not going to draw favorable conclusions.

The bounce rate increases sharply with increased latency. If your page takes 4 seconds to load, you’ve lost 20% of your potential visitors. Worse, those users will view an average of 3.4 fewer pages than they would if the site took 2 seconds to load. Every second makes a difference. 

Your web access logs are the authority on your latency because they record the duration of the whole HTTP connection. You can then combine these values with more specific measurements, like the latency of a database query or disk I/O. By optimizing these values, you can ensure that you’re not losing customers to slow response times.
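For example, a quick way to turn raw request durations from your access logs into something actionable is to compute percentiles rather than averages; the sketch below uses Python’s statistics module on a handful of sample durations in milliseconds (the values are made up for illustration).

```python
import statistics

# Request durations in milliseconds, as they might be pulled from access logs.
durations_ms = [120, 95, 110, 3400, 130, 105, 98, 2500, 115, 102]

cuts = statistics.quantiles(durations_ms, n=100)  # 99 percentile cut points
p50, p95, p99 = cuts[49], cuts[94], cuts[98]

print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")

# Averages hide tail latency: a handful of multi-second requests barely move the
# mean but dominate the p95/p99 values that your users actually feel.
print(f"mean={statistics.mean(durations_ms):.0f}ms")
```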

Your web access logs may also give you access to the User-Agent header. This header can tell you the browser and operating system that initiated the request. This is essential because it will give you an idea of your customers’ devices and browsers. 52.5% of all online traffic comes from smartphones, so you’re likely missing out if you’re not optimizing for mobile usage.  

Wrapping up

Web access log analysis is one of the fundamental pillars of observability; however, the true challenge isn’t simply viewing and analyzing the logs, but in getting all of your observability data into one place to correlate them with one another. Your Nginx logs are powerful, but if you combine them with your other logs, metrics, and traces from CDNs, applications, and more, they form part of an observability tapestry that can yield actionable insights across your entire system.

Using Synthetic Endpoints to Quality Check your Platform

Quality control and observability of your platform are critical for any customer-facing application. Businesses need to understand their users’ experience at every step of the app or webpage. User engagement often depends on how well your platform functions, and responding quickly to problems can make a big difference in your application’s success.

AWS monitoring tools can help companies simulate and understand the user experience, alerting businesses to issues before they become problems for customers.

The Functionality of AWS Canaries

AWS Canaries are an instantiation of Amazon’s CloudWatch Synthetics. They are configurable scripts that execute automatically to monitor endpoints and APIs, following the same flows and performing the same actions as real users. The results from a canary mimic what a real user would see at any given time, allowing teams to validate their customers’ experience.

Tracked metrics using AWS Canaries include the availability and latency of your platform’s endpoints and APIs, load time data, and user interface screenshots. They can also monitor linked URLs and website content. AWS Canaries can also check for changes to your endpoints resulting from unauthorized code changes, phishing, and DDoS attacks.

How AWS Canaries Work

AWS Canaries are scripts that monitor your endpoints and APIs. The scripts follow the same flows that real customers would follow to hit your endpoints. Developers can write canary scripts using either Node.js or Python, and the scripts run on a headless Google Chrome browser using Puppeteer for Node.js scripts or Selenium for Python scripts. Canaries run scripts against your platform and log results in one of AWS’s observability tools, such as CloudWatch or X-Ray. From there, developers can export the data to other tools like Coralogix’s metrics platform for analysis.
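The production script runs inside the Synthetics runtime, but the core logic is easy to picture. The sketch below is a plain-Python approximation of an API canary (not the Synthetics runtime itself): it calls an endpoint, checks the status code, latency, and body, and raises on failure, which is what flips a canary run from passed to failed. The URL and thresholds are placeholders.

```python
import time
import urllib.request

ENDPOINT = "https://api.example.com/health"   # placeholder endpoint
MAX_LATENCY_SECONDS = 2.0                     # placeholder threshold


def check_endpoint() -> None:
    """Approximates what an API canary does: call, time, and validate an endpoint."""
    start = time.monotonic()
    with urllib.request.urlopen(ENDPOINT, timeout=10) as response:
        body = response.read()
        latency = time.monotonic() - start

        # A raised exception marks the canary run as failed.
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status}")
        if latency > MAX_LATENCY_SECONDS:
            raise RuntimeError(f"latency {latency:.2f}s exceeded threshold")
        if not body:
            raise RuntimeError("empty response body")

    print(f"check passed in {latency:.2f}s")


if __name__ == "__main__":
    check_endpoint()
```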

AWS Canaries and Observability

AWS Canaries and X-Ray

Developers can set AWS Canaries to use X-Ray for specific runtimes. When X-Ray is enabled, traces capture request latency, and the canaries send failures to X-Ray. This data is grouped explicitly for canary calls, making it easier to separate real user calls from AWS Canary calls against your endpoints.

Enabling traces can increase the canary’s runtime by up to 7%. You must also set IAM permissions that allow the canary to write to X-Ray.
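If you manage canaries programmatically, tracing can also be toggled through the Synthetics API. The boto3 sketch below assumes the ActiveTracing flag lives in the canary run configuration; the canary name is a placeholder, and the exact RunConfig fields should be checked against the current API documentation.

```python
import boto3

synthetics = boto3.client("synthetics")

# Turn on X-Ray tracing for an existing canary; "my-api-canary" is a placeholder name.
synthetics.update_canary(
    Name="my-api-canary",
    RunConfig={
        "ActiveTracing": True,      # sends canary traces to X-Ray (assumed field name)
        "TimeoutInSeconds": 60,
    },
)
```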

AWS Canaries and EventBridge

AWS EventBridge can receive notifications for various AWS Canary events, including status changes and completed runs. AWS does not guarantee delivery of all canary events to EventBridge, instead sending them on a best-effort basis; cases where data does not arrive in EventBridge are expected to be rare.

Canary events can trigger different EventBridge rules and, therefore, different subsequent functions or data transfers to third-party analytics tools. Functions can be written that allow teams to troubleshoot when a canary fails, investigate error states, or monitor workflows. 
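For instance, a rule that matches canary status changes and routes them to a Lambda function might look roughly like the boto3 sketch below. The event source and detail-type strings are assumptions to verify against the Synthetics event documentation, and the target ARN is a placeholder.

```python
import json

import boto3

events = boto3.client("events")

# Match Synthetics canary status-change events (pattern fields are assumptions).
events.put_rule(
    Name="canary-status-change",
    EventPattern=json.dumps({
        "source": ["aws.synthetics"],
        "detail-type": ["Synthetics Canary Status Change"],
    }),
    State="ENABLED",
)

# Route matched events to a Lambda function (placeholder ARN) for triage or forwarding.
events.put_targets(
    Rule="canary-status-change",
    Targets=[{
        "Id": "canary-triage-function",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:canary-triage",
    }],
)
```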

AWS Canaries and CloudWatch Metrics

AWS Canaries automatically create CloudWatch metrics. Published metrics include the percentage of entirely successful and failed canary runs, the duration of canary runs, and the number of responses in the 200, 400, and 500 ranges.

Metrics can be viewed on the CloudWatch Metrics page. They will be present under the CloudWatchSynthetics namespace.
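You can also pull these metrics programmatically. The sketch below queries an average success-percentage metric for a single canary over the past hour with boto3; the metric and dimension names under the CloudWatchSynthetics namespace are assumptions to confirm in your account, and the canary name is a placeholder.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

# Average success percentage for one canary over the last hour, in 5-minute buckets.
response = cloudwatch.get_metric_statistics(
    Namespace="CloudWatchSynthetics",
    MetricName="SuccessPercent",                                    # assumed metric name
    Dimensions=[{"Name": "CanaryName", "Value": "my-api-canary"}],  # placeholder canary
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```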

AWS Canaries and Third-Party Analytics

Since AWS Canaries automatically write CloudWatch metrics, metric streams can be used to deliver data to third-party tools like Coralogix’s scaling metrics platform. CloudWatch metric streams allow users to send only specific namespaces or all namespaces to third parties. After the stream is created, new metrics will be sent automatically to the third party without limit. AWS does charge based on the number of metric updates sent.
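Setting up such a stream can also be scripted. The sketch below creates a CloudWatch metric stream limited to the CloudWatchSynthetics namespace, delivering through a Kinesis Data Firehose; the Firehose and IAM role ARNs are placeholders, and the Firehose itself must already be configured to point at the third-party endpoint.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Stream only the Synthetics namespace to a pre-existing Firehose delivery stream.
cloudwatch.put_metric_stream(
    Name="synthetics-to-third-party",
    IncludeFilters=[{"Namespace": "CloudWatchSynthetics"}],
    FirehoseArn="arn:aws:firehose:us-east-1:123456789012:deliverystream/to-coralogix",  # placeholder
    RoleArn="arn:aws:iam::123456789012:role/metric-stream-role",                        # placeholder
    OutputFormat="json",
)
```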

Creating a Canary

When creating a Canary, developers choose whether to use an AWS-provided blueprint, the inline editor, or import a script from S3. Blueprints provide a simple way to get started with this tool. Developers can create Canaries using several tools such as the Serverless framework, the AWS CLI, or the AWS console. Here we will focus on creating a Canary using a Blueprint with the AWS console.
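For teams that prefer infrastructure-as-code over the console, the same kind of canary can be created through the Synthetics API. The boto3 sketch below assumes the canary script has already been packaged and uploaded to S3 and that an execution role with the required permissions exists; all names, ARNs, and the runtime version string are placeholders to adapt.

```python
import boto3

synthetics = boto3.client("synthetics")

# Create a canary from a script already stored in S3 (all identifiers are placeholders).
synthetics.create_canary(
    Name="my-api-canary",
    Code={
        "S3Bucket": "my-canary-scripts",
        "S3Key": "api-canary.zip",
        "Handler": "api_canary.handler",
    },
    ArtifactS3Location="s3://my-canary-artifacts/",
    ExecutionRoleArn="arn:aws:iam::123456789012:role/canary-execution-role",
    Schedule={"Expression": "rate(5 minutes)"},
    RuntimeVersion="syn-python-selenium-1.3",   # check the currently supported runtimes
)
```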

Create a Canary from the AWS Console

Navigate to the AWS CloudWatch console. You will find Synthetics Canaries in the right-hand menu under the Application monitoring section. This page loads to show you any existing Canaries you have. You can see, at a glance, how many of the most recent runs have passed or failed across all your canaries. You can also select Create canary from this page to make a new AWS Canary and test one of your endpoints.

Use a Blueprint to create a Canary quickly

In the first section, you select how to create your new Canary. Here we will choose to use an AWS blueprint, but the inline editor and importing from S3 are also options. There are six blueprints currently available. Each is designed to get developers started on a different common problem. We will use the API canary, which will attempt to call the AWS-deployed API periodically. This option is useful when you want to test an API you own in AWS API Gateway or some other hosted service.

Name your Canary

Next, choose a name for your canary. It does not need to match the name of the API, but a matching name will make it easier to analyze the results if you are testing a large number of endpoints.

Link your Canary to an AWS API Gateway

Next, AWS gives you the option to load data directly from API Gateway. If you select the checkbox, the page expands, allowing you to choose your API Gateway, stage, and hostname. The options are loaded from what is deployed in the same AWS environment. Developers are not required to select an AWS API Gateway and can still test other endpoints with canaries; in that case, the information is simply entered manually.

Setup your HTTP request

Next, add an HTTP request to your canary. If you have loaded data from AWS API Gateway, a dropdown list is provided with the resources attached to the endpoint. Choose the resource and method, and add any query strings or headers. If your endpoint requires authorization, it should be added here.

Schedule your Canary

After the HTTP setup is complete, choose the schedule for the AWS Canary. The schedule determines how often the canary function will hit your endpoint. You can choose to run the canary continually on a periodic schedule, use a custom cron expression, or run the canary just once. When selecting a schedule, remember that the canary adds traffic to your endpoint, which could incur costs depending on your setup.

Configure log retention in AWS S3

AWS next allows developers to configure where Canary event logs are stored. They will automatically be placed into S3. From the console, developers can choose which S3 bucket should store the data and how long it should be kept. Developers can also choose to encrypt data in the S3 bucket. For exporting data to third-party tools, developers can set up a stream on the S3 bucket. This can trigger a third-party tool directly, send data via AWS Kinesis, or trigger AWS EventBridge to filter the data before sending it for analysis. 

Setup an alarm

Developers can choose to set up a CloudWatch alarm on the canary results. This is especially useful in production environments to ensure your endpoints are healthy and to limit outages. The same results may also be obtained through third-party tools that use machine learning not only to see when your endpoint has crossed a pre-set threshold but also to detect irregular or insecure events.

Enable AWS XRay

Developers can choose to send the results of canaries to AWS X-Ray. To send this data, check the box in the last section of the console screen. Enabling X-Ray will incur costs on your AWS account, but it also gives you another mode of observing your metrics and another path to third-party tools that can help analyze the health of your platform.

Summary

Canaries are an observability tool in AWS. They provide a method of analyzing API endpoint behavior by periodically testing endpoints and recording the results. Checks include verifying that endpoints are available, that returned data matches expectations, and that response delays are within required limits.

AWS Canaries can log results to several AWS tools, including AWS CloudWatch, AWS X-Ray, and AWS EventBridge. Developers can also send canary results to third-party tools like the Coralogix platform to enable further analysis and alert DevOps teams when there are problems.

Coralogix’s Streama Technology: The Ultimate Party Bouncer

Coralogix is not just another monitoring or observability platform. We’re using our unique Streama technology to analyze data without needing to index it so teams can get deeper insights and long-term trend analysis without relying on expensive storage. 

So you’re thinking to yourself, “that’s great, but what does that mean, and how does it help me?” To better understand how Streama improves monitoring and troubleshooting capabilities, let’s have some fun and explore it through an analogy that includes a party, the police, and a murder!

Grab your notebook and pen, and get ready to take notes. 

Not just another party 

Imagine that your event and metric data are people, and the system you use to store that data is a party. To ensure that everyone is happy and stays safe, you need a system to monitor who’s going in, help you investigate, and remediate any dangerous situations that may come up. 

For your event data, that would be some kind of log monitoring platform. For the party, that would be our bouncer.

Now, most bouncers (and observability tools) are concerned primarily with volume. They’re doing simple ticket checks at the door, counting people as they come in, and blocking anyone under age from entering. 

As the party gets more lively, people continue coming in and out, and everyone’s having a great time. But imagine what happens if, all of a sudden, the police show up and announce there’s been a murder. Well, shit, there goes your night! Don’t worry, stay calm – the bouncer is here to help investigate. 

They’ve seen every person who has entered the room and can help the police, right?

Why can’t typical bouncers keep up?

Nothing ever goes as it should, this much we know. Crimes are committed, and applications have bugs. The key, then, is how we respond when something goes wrong and what information we have at our disposal to investigate.

Suppose a typical bouncer is monitoring our party, just counting people as they come in and doing a simple ID check to make sure they’re old enough to enter. In that case, the investigation process starts only once the police show up. At this point, readily available information is sparse: you have all of these people inside, but you don’t have a good idea of who they are.

This is the biggest downfall of traditional monitoring tools. All data is collected in the same way, as though it carries the same potential value, and then investigating anything within the data set is expensive. 

The police may know that the suspect is wearing a black hat, but they still need to go in and start manually searching for anyone matching that description. It takes a lot of time and can only be done using the people (i.e., data) still in the party (i.e., data store). 

Without a good way to analyze the characteristics of people as they’re going in and out, our everyday bouncer will have to go inside and count everyone wearing a black hat one by one. As we can all guess, this will take an immense amount of time and resources to get the job done. Plus, if the suspect has already left, it’s almost like they were never there.

What if the police come back to the bouncer with more information about the suspect? It turns out that in addition to the black hat, they’re also wearing green shoes. With this new information, this bouncer has to go back into the party and count all the people with black hats AND green shoes. It will take him just as long, if not longer, to count all of those people again.

What makes Streama the ultimate bouncer?

Luckily, Streama is the ultimate bouncer and uses some cool tech to solve this problem.

Basically, Streama technology differentiates Coralogix from the rest of the bunch because it’s a bouncer that can comprehensively analyze the people as they go into the party. For the sake of our analogy, let’s say this bouncer has Streama “glasses,” which allow him to analyze and store details about each person as they come in.

Then, when the police approach the bouncer and ask for help, he can already provide some information about the people at the party without needing to physically go inside and start looking around.

If the police tell the bouncer they know the murderer had on a black hat, he can already tell them that X number of people wearing a black hat went into the party. Even better, he can tell them that without those people needing to be inside still! If the police come again with more information, the bouncer can again give them the information they need quite easily.  

In some cases, the bouncer won’t have the exact information needed by the police. That’s fine, they can still go inside to investigate further if required. By monitoring the people as they go in, though, the bouncer and the police can save a significant amount of time, money, and resources in most situations.

Additional benefits of Streama

Since you are getting the information about the data as it’s ingested, it doesn’t have to be kept in expensive hot storage just in case it’s needed someday. With Coralogix, you can choose to only send critical data to hot storage (and with a shorter retention period) since you get the insights you need in real-time and can always query data directly from your archive.

There are many more benefits to monitoring data in-stream aside from the incredible cost savings. However, that is a big one.

Data enrichment, dynamic alerting, metric generation from log data, data clustering, and anomaly detection occur without depending on hot storage. This gives better insights at a fraction of the cost and enables better performance and scaling capabilities. 

Whether you’re monitoring an application or throwing a huge party, you definitely want to make sure Coralogix is on your list!