Centralized Log Management: Why It’s Essential for System Security in a Hybrid Workforce

Remote work surged due to Covid-19, and heading into 2023, remote and hybrid workplaces are here to stay. Surveys show 62% of US workers report working from home at least occasionally, and 16% of companies worldwide are entirely remote. With a hybrid workforce, security breaches now come from sources that were far less common with in-office work.

While working remotely, employees must consider many things they would not be concerned about within an office. This includes using personal devices for business purposes, working over an unsecured network, and even leaving a device unattended or wondering who is looking over their shoulder at the coffee shop. There are many new avenues for cybercriminals to attack, which helps explain why cybercrimes have increased by 238% since the pandemic’s start. Security threats from human error, misconfigured cloud infrastructure, and trojans rose in 2021 while work-from-home was in full swing. The rise in security breaches shows that system security is essential for businesses large and small if they want to avoid the heavy costs of recovering from a breach.

Cybercriminals are increasingly taking advantage of the transition to remote work. Companies must implement new and improved security measures, such as log monitoring, to reduce the chances of a successful infiltration.

Use IAM to Secure Access

To prevent cybercrimes, companies need to secure their employees’ work-from-home networks. Identity and access management (IAM) can secure access from home networks while still giving each employee easy access to the data required for their role. Ideally, the IAM solution is implemented with least-privilege access, so the employee only has access to what they need and nothing more.

When employees need access to critical data, ensure it is not simply downloaded to their company device. Instead, store the data in the cloud, where it can be accessed without a download. Monitoring logs and how that data is accessed is necessary to ensure bad actors are not gaining entry; authentication events can be logged and monitored for this purpose. If data does require a download, companies should provide employees with additional tools like a virtual private network (VPN) so they can access the company network securely from anywhere.
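
As a rough illustration of the kind of authentication event worth capturing, here is a minimal Python sketch that emits structured JSON log lines; the field names (user, source_ip, mfa_used) are assumptions, not a prescribed schema:

```python
import json
import logging
from datetime import datetime, timezone

logger = logging.getLogger("auth")
logging.basicConfig(level=logging.INFO)

def log_auth_event(user, source_ip, success, mfa_used):
    """Emit one authentication event as structured JSON so it can be
    shipped to a central log platform and analyzed later."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event_type": "authentication",
        "user": user,
        "source_ip": source_ip,
        "success": success,
        "mfa_used": mfa_used,
    }
    logger.info(json.dumps(event))

# Example: a successful MFA login from a home network
log_auth_event("jane.doe", "203.0.113.42", success=True, mfa_used=True)
```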

Log Access and Authentication Events

With remote work, employees use individual networks rather than a company network to access their required work. Corporate networks can set up a perimeter at an office, allowing only trusted devices. With remote work, this perimeter is easier to breach, and cybercriminals are taking advantage. Once inside the network, they can take nefarious actions such as launching ransomware attacks.

Using a VPN is a secure way for employees to connect to a corporate network. But VPNs are only secure if appropriately implemented with multi-factor authentication and up-to-date security protocols. So, even when using a VPN, bad actors may still gain access to your network.

To reduce the risk of a security breach, logs and log analysis can be used to detect a bad actor in your network. Logging authentication and authorization events allows for data analysis. Machine-learning analytics can detect bad actors in your system so you can take action to prevent downtime and ransomware attacks.
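
To show how even simple analysis of centralized authentication logs can surface a bad actor, here is a hedged sketch that counts failed logins per user from a hypothetical JSON-lines file (auth_events.jsonl) and flags brute-force-like spikes; real deployments would use windowed, machine-learning-driven detection rather than a fixed threshold:

```python
import json
from collections import Counter

FAILED_LOGIN_THRESHOLD = 10  # arbitrary; tune to your environment

def failed_login_counts(path):
    """Count failed authentication events per user from a JSON-lines log file."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("event_type") == "authentication" and not event.get("success"):
                counts[event.get("user")] += 1
    return counts

for user, failures in failed_login_counts("auth_events.jsonl").items():
    if failures >= FAILED_LOGIN_THRESHOLD:
        print(f"ALERT: {failures} failed logins for {user} - possible brute-force attempt")
```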

Centralize Log Storage to Enable Fast Analysis

Extra logging needs to be enabled to better secure networks that allow remote access. The logs also need to be monitored for the logging to be useful in preventing security breaches. This is extremely difficult when logs are stored separately, forcing IT teams to monitor logs in multiple locations. Centralized log storage and management make it easier to get the insights you need to detect security breaches.

Once logs are combined, IT teams can adequately monitor events. They can also use the logs to assess security risks, respond to incidents, investigate past events, and run a secure software development lifecycle. Centralized logs also lend themselves well to custom dashboard setups that allow IT professionals to monitor logs more efficiently.

Centralize logs from different parts of your system to ensure they can be analyzed appropriately. This includes logs from IAM tools, network devices, and VPNs. Once logs are combined, they can be analyzed by machine learning tools to detect specific security breaches. These analyses can catch issues as they happen, hastening responses and mitigating risk to your stored data and product.
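
The sketch below illustrates one way such centralization can look in practice: normalizing VPN and IAM events into a shared schema and appending them to a single store. The raw field names and the file-based "central store" are placeholders for whatever your sources and log platform actually provide:

```python
import json
from datetime import datetime, timezone

def normalize_vpn_log(raw):
    """Map a (hypothetical) VPN log record onto a shared schema."""
    return {
        "timestamp": raw["time"],
        "source": "vpn",
        "user": raw["username"],
        "action": raw["event"],
        "ip": raw["client_ip"],
    }

def normalize_iam_log(raw):
    """Map a (hypothetical) IAM log record onto the same schema."""
    return {
        "timestamp": raw["eventTime"],
        "source": "iam",
        "user": raw["userIdentity"],
        "action": raw["eventName"],
        "ip": raw["sourceIPAddress"],
    }

def ship_to_central_store(event, path="central_log.jsonl"):
    """Append the normalized event to one central store; in production this
    would be a call to your log management platform instead of a local file."""
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")

ship_to_central_store(normalize_vpn_log(
    {"time": datetime.now(timezone.utc).isoformat(), "username": "jane.doe",
     "event": "connect", "client_ip": "203.0.113.42"}))
```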

Example: Detecting Ransomware Through Log Management

When an employee clicks on a malicious link, ransomware can be downloaded to their computer. The goal of the download is to install without the employee’s knowledge. The ransomware then sends information to another server controlled by cybercriminals, who can use that server to direct the infected device or encrypt its data.

Since the employee’s computer needs to connect to this external server for the ransomware to run, an attack can be detected by monitoring network traffic on the employee’s computer. Depending on the ransomware, different logs may be relevant to detecting the breach, including web proxy logs, email logs, and VPN logs. Since these logs come from different sources and in different formats, combining them in a single location helps IT teams detect the security risk.
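
As a toy example of the kind of analysis centralized proxy logs enable, the following sketch flags outbound connections to domains outside a known allowlist; the log format and the KNOWN_DOMAINS set are hypothetical, and real command-and-control detection would rely on threat intelligence feeds rather than a hard-coded list:

```python
import json

# Hypothetical allowlist of domains the business expects outbound traffic to reach.
KNOWN_DOMAINS = {"example-corp.com", "office365.com", "slack.com"}

def suspicious_destinations(proxy_log_path):
    """Flag outbound connections to domains outside the allowlist, a crude
    stand-in for the traffic analysis a log management platform would run."""
    flagged = []
    with open(proxy_log_path) as f:
        for line in f:
            entry = json.loads(line)  # assumes proxy logs already parsed to JSON
            domain = entry.get("destination_domain", "")
            if domain and domain not in KNOWN_DOMAINS:
                flagged.append((entry.get("source_host"), domain))
    return flagged

for host, domain in suspicious_destinations("web_proxy.jsonl"):
    print(f"Investigate: {host} contacted unrecognized domain {domain}")
```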

Summary

The increase in remote workers has changed how cybercriminals can attack company servers. Ransomware, malware, data theft, and trojans have all significantly increased since the start of the Covid-19 pandemic. Companies must find new ways to mitigate these security risks for remote workers. 

Implementing safeguards is critical to a company’s security. Use IAM to authenticate users and limit their access to only what they need to work. Using a VPN is essential for remote workers who need access to sensitive data.

Since there is always a risk of a breach even when stringent safeguards are in place, centralized log management adds another layer of mitigation. By collecting logs in a single location, analytics can be employed to quickly detect security breaches so IT teams can take corrective action sooner. SaaS offerings like Coralogix can provide centralized log management and analytics to detect security breaches.

Cloud Configuration Drift: What Is It and How to Mitigate it

More organizations than ever run on Infrastructure-as-Code (IaC) cloud environments. While the migration brings unparalleled scale and flexibility, it also introduces unique security and ops issues many don’t foresee.

So what is one of the biggest IaC ops and security vulnerabilities? Configuration drift.

Cloud config drift isn’t a niche concern. Both global blue-chips and local SMEs have harnessed coded infrastructure. However, many approach their system security and performance monitoring the same way they would for a traditional, hardware-based system.

Knowing how to keep the deployed state of your cloud environment in line with the planned configuration is vital. Without tools and best practices to mitigate drift, the planned infrastructure and the as-is code inevitably diverge. This creates performance issues and security vulnerabilities.

Luckily, IaC integrity doesn’t have to be an uphill struggle. Keep reading if you want to keep config drift at a glacial pace.  

What is config drift?

In simple terms, configuration drift is when the current state of your infrastructure doesn’t match the IaC configuration as determined by the code.

Even the most carefully coded infrastructure changes after a day of real-world use. Every action creates a change to the code. This is manageable at a small scale, but it becomes a constant battle when you have 100+ engineers, as many enterprise-level teams do. Every engineer is making console-based changes, and every change causes drift. While many of these changes are small, they quickly add up at the operational scale of most businesses. The same flexibility that prompted the great enterprise migration to the cloud can also cause vulnerability.

Config changes in your environment will be constant, both deliberate and accidental, especially in large organizations where multiple teams (of varying levels of expertise) are working on or in the same IaC environment. Over time, these changes mount up and lead to drift.
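
For teams using Terraform, one lightweight way to check whether the deployed state still matches the code is the plan command’s detailed exit code. A minimal sketch, assuming Terraform is installed and the directory holds an initialized configuration:

```python
import subprocess

def check_for_drift(working_dir):
    """Run 'terraform plan -detailed-exitcode' in an initialized configuration.
    Exit code 0 = no changes, 2 = the deployed state differs from the code."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=working_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 2:
        print("Drift detected: deployed state no longer matches the configuration")
    elif result.returncode == 0:
        print("No drift: deployed state matches the configuration")
    else:
        raise RuntimeError(f"terraform plan failed:\n{result.stderr}")

check_for_drift("./infrastructure")
```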

Why does cloud config drift happen?

Cloud infrastructure allows engineers to do more with fewer human hours and pairs of hands. Environments and assets can be created and deployed daily (in the thousands, if scale demands it). Many automatically update, bringing new config files and code from external sources. Cloud environments are constantly growing and adapting, with or without human input.

However, this semi-automated state of flux creates a new problem. Think of it as time travel in cinema: a small action in the past makes an entirely different version of the present. With IaC, a slight change in the code can lead to a deployed as-is system that’s radically different from the planned configuration your engineers are working from.

Here’s the problem: small changes in IaC code always happen. Cloud environments are flexible and create business agility because the coded infrastructure is so malleable. Drift is inevitable, or at least it can feel that way if you don’t have a solution that adequately mitigates it.

What makes config drift a performance and security risk

Traditional monitoring approaches don’t work for cloud environments. Monitoring stacks could be mapped to config designs with minimal issues in a monolithic system. It would be difficult for a new machine or database to appear without engineers noticing when they’d require both physical hardware and human presence to install. The same can’t be said for coded infrastructure.

If your system visibility reflects plans and designs, instead of the actual deployed state, the gap between what your engineers see and what’s actually happening widens every hour. Unchecked config drift doesn’t create blind spots; it creates deep invisible chasms.

For performance, this causes problems. Cloud systems aren’t nearly as disrupted by high-activity peaks as the physical systems of decades ago, but they’re not entirely immune. The buildup of unoptimized assets and processes leads to noticeable performance issues no matter how airtight your initial config designs are.

Security without full system visibility is a risk that shouldn’t need explaining, yet that is exactly what config drift leads to. Config drift doesn’t just open a back door for cybercriminals; it gives them the keys to your digital property.

Common causes of IaC config drift

Configuration drift in IaC can feel unavoidable. However, key areas are known to create drift if best practices and appropriate tooling aren’t in place.

Here are some of the most common sources of config drift in cloud environments. If your goal is to maintain a good security posture and an IaC system free of drift-related disruption, addressing the following is an excellent place to start.

Automated pipelines need automated discovery

Automation goes hand-in-hand with IaC. While automated pipelines bring the flexibility and scale necessary for a 21st-century business, they’re a vulnerability in a cloud environment if you rely on manual discovery and system mapping.

Once established, a successful automated pipeline will generate and deploy new assets with little-to-no human oversight. Great for productivity, potentially a nightmare if those assets are misconfigured, or there are no free engineering hours to confirm new infrastructure is visible to your monitoring and security stacks.

IaC monitoring stacks need to incorporate AI-driven automated discovery, which reduces the need for manual system mapping. Manual discovery is tedious on a small scale and becomes a full-time commitment in a large cloud environment that changes daily.

More importantly, automated discovery ensures new assets are visible from the moment they’re deployed. There’s no vulnerable period in which a recently deployed asset is active but still undiscovered by your monitoring and security stacks. Automated discovery doesn’t just save time; it delivers better results and a more secure environment.

An automated pipeline is only one poorly written line of code away from being a conveyor belt of misconfigured assets. Automated discovery ensures pipelines aren’t left to drag your deployed systems further and further from the configured state your security and monitoring stacks operate by.
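
As a small taste of what automated discovery involves, the sketch below uses boto3 to list running EC2 instances and report any that a (hypothetical) monitoring inventory has never seen. A real discovery tool would cover far more asset types and feed results straight into the monitoring stack:

```python
import boto3

# Hypothetical inventory of instance IDs your monitoring stack already knows about.
KNOWN_INSTANCE_IDS = {"i-0123456789abcdef0"}

def find_unknown_instances(region="us-east-1"):
    """List running EC2 instances and report any the monitoring inventory
    has never seen - a very small slice of automated discovery."""
    ec2 = boto3.client("ec2", region_name=region)
    unknown = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if instance["InstanceId"] not in KNOWN_INSTANCE_IDS:
                    unknown.append(instance["InstanceId"])
    return unknown

for instance_id in find_unknown_instances():
    print(f"Undiscovered asset: {instance_id} is running but not in the monitoring inventory")
```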

Resource tagging isn’t just to make sysadmins’ lives easier

The nature of automated deployment means untagged assets are an ever-present risk. Vigilance is essential, especially when operating at scale. This is where real-time environment monitoring becomes a security essential.

Every incorrect or absent tag drifts your deployed state further from the planned config. Given the scale of automated deployment, it’s rarely just a single asset, either. Over time, the volume of these unaccounted-for “ghost” resources in your system multiplies.

This creates both visibility and governance issues. Ghost resources are almost impossible to monitor and pose significant challenges for optimization and policy updates. Unchecked, these clusters of invisible, unoptimized resources create large security blind spots and environment-wide config drift.

A real-time monitoring function that scans for untagged assets is crucial. Platforms like Coralogix alert your engineers to untagged resources as they’re deployed. From here, they can be de-ghosted with AI/ML automated tagging or removed entirely. Either way, they’re no longer left to build up and become a source of drift or security posture slack.
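
A very simple version of that scan can be written against the AWS APIs. The sketch below checks EC2 instances for a set of required tag keys; the tag policy shown is an example, not a standard:

```python
import boto3

REQUIRED_TAGS = {"owner", "environment", "service"}  # example policy, adjust to your own

def find_untagged_instances(region="us-east-1"):
    """Report EC2 instances missing any of the required tags so they can be
    re-tagged or removed before they become 'ghost' resources."""
    ec2 = boto3.client("ec2", region_name=region)
    ghosts = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tag_keys = {t["Key"].lower() for t in instance.get("Tags", [])}
                missing = REQUIRED_TAGS - tag_keys
                if missing:
                    ghosts.append((instance["InstanceId"], sorted(missing)))
    return ghosts

for instance_id, missing in find_untagged_instances():
    print(f"{instance_id} is missing tags: {', '.join(missing)}")
```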

Undocumented changes invite configuration drift, no exception

Changes are constant in coded infrastructure. Documenting them all, no matter how small or trivial, is critical.

One undocumented change probably won’t bring your systems to a halt (although this can and has happened). However, a culture of lax adherence to good practice rarely means just one undocumented alteration. Over time all these unregistered manual changes mount up.

Effective system governance is predicated on updates finding assets in a certain state. If the state is different, these updates won’t be correctly applied (if at all). As you can imagine, an environment containing code that doesn’t match what systems expect to find means the deployed state moves further from the predefined configuration with every update.

A simple but effective solution? AI/ML-powered alerting. Engineers can easily find and rectify undocumented changes without issue if your stack includes functionality to bring it to their attention. Best practice and due diligence are key, but they rely on people. For the days when human error rears its head, real-time monitoring and automated alerts stop undocumented manual changes from building up to drift-making levels.

That being said, allow your IaC to become living documentation

While AI/ML-powered alerting should still be part of your stack, a culture shift away from overreliance on documentation also goes a long way toward mitigating IaC drift. With coded infrastructure, you can always ask yourself, “do I need this documented outside the code itself?”

Manually documenting changes was essential in traditional systems. Since IaC cloud infrastructure is codified, you can drive any changes directly through the code. Your IaC assets contain their own history; their code can record every change and alteration made since deployment. What’s more, these records are always accurate and up to date.

Driving changes through the IaC allows you to harness code as living documentation of your cloud infrastructure, one that’s more accurate and up-to-date than a manual record. Not only does this save time, but it also reduces the drift risk that comes with manual documentation. There’s no chance of the human error that leads to a change being documented incorrectly (or not at all).

Does config drift make IaC cloud environments more hassle than they’re worth?

No, not even remotely. Despite config drift and other IaC concerns (such as secret exposure through code), cloud systems are still vastly superior to the setups they replaced.

IaC is essential to multiple technologies that make digital transformations and cloud adoption possible. Beyond deployment, infrastructures built and managed with code bring increased flexibility, scalability, and lower costs. By 2022 these aren’t just competitive advantages; entire economies are reliant on businesses operating at a scale only possible with them.

Config drift isn’t a reason to turn our back on IaC. It just means that coded infrastructure requires a contextual approach. The vulnerabilities of an IaC environment can’t be fixed with a simple firewall. You need to understand config drift and adapt your cybersecurity and engineering approaches to tackle the problem head-on.

Observability: the essential concept to stopping IaC config drift

What’s the key takeaway? Config drift leads to security vulnerabilities and performance problems because it creates blind spots. If your monitoring stacks can’t keep up with the speed and scale of an IaC cloud environment, they’ll soon be overwhelmed.

IaC environments are guaranteed to become larger and more complex over time. Drift is an inevitable by-product of use. Every action generates new code and changes code that already exists. Any robust security or monitoring for an IaC setting needs to be able to move and adapt just as quickly. An AI/ML-powered observability and visibility platform, like Coralogix, is a vital component of any meaningful IaC solution, whether for security, performance, or both.

In almost every successful cyberattack, vulnerabilities were exploited outside of engineer visibility. Slowing drift and keeping the gap between your planned config and your deployed systems narrow keeps these vulnerabilities to a manageable, mitigated minimum. Prioritizing automated, AI-driven observability of your IaC that grows and changes as your systems do is the first step towards keeping them drift-free, secure, and operating smoothly.

4 Different Ways to Ingest Data in AWS OpenSearch

AWS OpenSearch is a project based on Elastic’s Elasticsearch and Kibana projects. Amazon created OpenSearch from the last open-source version of Elasticsearch (7.10), and it is now part of the AWS monitoring ecosystem. The key differences between the two are a topic for another discussion, but the most significant point to note before running either distribution is the difference in licenses: Elasticsearch now runs under a dual-license model, while OpenSearch remains open source.

Like Elasticsearch, OpenSearch can store and analyze observability data, including logs, metrics, and traces. Elasticsearch primarily uses Logstash to load data, while OpenSearch users can choose from several services to ingest data into indices. Which service is best suited for OpenSearch ingestion depends on your use case and current setup.

Ingestion Methods for AWS OpenSearch

Data can be written to OpenSearch using the OpenSearch client and a compute function such as AWS Lambda. To write to your cluster directly, the data must be clean and formatted according to your OpenSearch mapping definition. This requirement may not be ideal for writing observability data with formats other than JSON or CSV. 

Data must also be batched appropriately so as not to overwhelm your OpenSearch cluster. The cluster setup significantly impacts the cost of the OpenSearch service and should be configured as efficiently as possible. Each of the methods described below requires the cluster to be running before you start streaming data.
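
For instance, a batched write with the opensearch-py client might look like the following sketch. The endpoint, credentials, index name, and document shape are placeholders; production code would typically use IAM-signed requests rather than basic auth:

```python
from opensearchpy import OpenSearch, helpers  # pip install opensearch-py

# Connection details are placeholders for your own domain and credentials.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("user", "password"),
    use_ssl=True,
)

log_events = [
    {"timestamp": "2022-11-01T12:00:00Z", "level": "ERROR", "message": "payment timeout"},
    {"timestamp": "2022-11-01T12:00:05Z", "level": "INFO", "message": "retry succeeded"},
]

# Batch the documents into one bulk request instead of one request per log line.
actions = [{"_index": "app-logs", "_source": event} for event in log_events]
success, _ = helpers.bulk(client, actions)
print(f"Indexed {success} documents")
```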

AWS allows users to stream data directly from other AWS services into an OpenSearch index without an intermediate step through a compute function. 

AWS Kinesis Firehose

AWS Kinesis is a streaming service that collects, processes, and analyzes data in real time. It is a scalable service and will scale itself up and down based on current requirements. AWS Kinesis Firehose builds on the Kinesis streaming service and also allows users to extract and transform data within the stream itself before outputting the data to another service.

Firehose can automatically write data to other AWS services, like AWS S3 and AWS OpenSearch, as it delivers the streamed data. It can also send data directly to third-party vendors that work with AWS to provide observability services, like Coralogix. The Kinesis stream’s output data can be handled separately from these automatic writes to other services.

Firehose uses an AWS Lambda function to apply any changes requested to the streamed data. Developers can set up a custom Lambda to process streamed data or use one of the blueprint functions provided by AWS. These changes are not required but are helpful for observability data, which is often unformatted. Recording data in a JSON format makes analytics simpler for any third-party tools you may utilize. Some tools, like Coralogix’s log analytics platform, also have built-in parsers that can be used if changing data at the Kinesis level is not ideal.
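
A Firehose transformation Lambda follows a fixed record contract: decode each base64 record, transform it, and return it with a result status. Here is a minimal sketch that wraps raw log lines in a JSON envelope (the envelope fields are illustrative only):

```python
import base64
import json

def lambda_handler(event, context):
    """Kinesis Firehose transformation Lambda: decode each record, wrap the raw
    line in a JSON envelope, and hand it back to Firehose for delivery."""
    output = []
    for record in event["records"]:
        raw = base64.b64decode(record["data"]).decode("utf-8")
        transformed = json.dumps({"message": raw.strip(), "source": "firehose"}) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # use "Dropped" or "ProcessingFailed" as appropriate
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```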

Kinesis Firehose is an automatically scalable service. If your platform requires large volumes of data to flow into OpenSearch, it is a wise infrastructure choice and will likely be more cost-effective than other AWS services, assuming none of them are already in use.

AWS CloudWatch

AWS CloudWatch is a service that collects logs from other AWS services and makes them available for display. Compute functions like AWS Lambda and AWS Fargate can send log data to CloudWatch for troubleshooting by DevOps teams. From CloudWatch, data can be sent to other services within AWS or to observability tools to help with troubleshooting. Logs are essential for observability but are most helpful when used in concert with metrics and traces. 

To send log data to OpenSearch, developers first need to set up a subscription. CloudWatch subscriptions consist of a log stream, the receiving resource, and a subscription filter. Each log stream must have its own subscription filter; a log stream is the set of logs from a single Lambda or Fargate task, and different invocations of the same function do not need a new setup. The receiving resource is the service to which the logs will be sent, in this case an OpenSearch cluster. Lastly, the subscription filter is a simple setup inside CloudWatch that determines which logs should be sent to the receiving service. You can add filters so that only logs containing particular keywords or data are recorded in OpenSearch.
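
Programmatically, a subscription filter can be created with boto3’s put_subscription_filter call. In the sketch below, the log group name, filter pattern, and destination ARN are placeholders; the destination shown is a hypothetical Lambda that forwards matching events to the OpenSearch cluster:

```python
import boto3

logs = boto3.client("logs", region_name="us-east-1")

# The destination here is a (hypothetical) Lambda that forwards events to
# OpenSearch; the console's OpenSearch subscription creates a similar
# delivery function behind the scenes.
logs.put_subscription_filter(
    logGroupName="/aws/lambda/checkout-service",
    filterName="errors-to-opensearch",
    filterPattern="?ERROR ?Exception",  # only forward lines that look like errors
    destinationArn="arn:aws:lambda:us-east-1:123456789012:function:logs-to-opensearch",
)
```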

With this setup, developers may end up with a filter that writes a large volume of logs to OpenSearch. The cost of CloudWatch can be complex to calculate, but the more you write, the more it will cost, and the price can increase very quickly. Streaming data to another service will only add to the cost of running your platform. Before using this solution, determine whether the cost is worth it compared to the other solutions presented here.

LogStash

Logstash is a data processing pipeline developed by Elastic. It can ingest, transform, and write data to an Elasticsearch or OpenSearch cluster. When using Elastic, the ELK stack includes Logstash automatically, but in AWS, Logstash is not automatically set up. AWS uses the open-source version of Logstash to feed data into OpenSearch. A plugin needs to be installed and deployed on an EC2 server. Developers then configure Logstash to write directly to OpenSearch. AWS provides details on the configuration that sends data to AWS OpenSearch. 

Since Logstash requires the setup of a new server on AWS, it may not be a good production solution for AWS users. Using one of the other listed options may be less expensive, especially if any other listed services are already in use. It can also reduce the amount of engineering setup required. 

AWS Lambda

AWS Lambda is a serverless compute function that allows developers to quickly build custom functionality on the cloud. Developers can use Lambda functions with the OpenSearch library to write data to an OpenSearch cluster. Writing to OpenSearch from Lambda opens the opportunity to write very customized data to the cluster from many different services.

Many AWS services can trigger Lambdas, including DynamoDB streams, SQS, and Kinesis Firehose. Triggering a Lambda to write data directly also means developers can clean and customize data before it is written to OpenSearch. Having clean data means that observability tools can work more efficiently to detect anomalies in your platform. 

A common use case might be the need to update a log in OpenSearch with metadata whenever a DynamoDB entry is written or updated. Developers can configure a stream to trigger a Lambda on changes to DynamoDB, and the stream can send either the new data alone or both the new and old data. A data model with pertinent metadata is formed from this streaming information, and the Lambda can write it directly to OpenSearch for future analysis.
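
A hedged sketch of such a handler is shown below. It assumes the DynamoDB stream is configured to send new images, and the OpenSearch endpoint, index name, and document shape are placeholders:

```python
from opensearchpy import OpenSearch  # packaged with the Lambda deployment

# Placeholder connection; a real function would use IAM-signed requests.
client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

def lambda_handler(event, context):
    """Triggered by a DynamoDB stream: index the new image of each changed
    item, plus a little metadata, into OpenSearch."""
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            document = {
                "table_event": record["eventName"],
                "keys": record["dynamodb"]["Keys"],
                "item": record["dynamodb"].get("NewImage", {}),
            }
            client.index(
                index="dynamodb-changes",
                id=record["dynamodb"]["SequenceNumber"],
                body=document,
            )
    return {"processed": len(event["Records"])}
```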

AWS IoT

AWS IoT is a service that allows developers to connect hardware IoT devices to the cloud. AWS IoT Core supports different messaging protocols, like MQTT and HTTPS, to publish data from the device and store it in various AWS cloud services.

Once data is in AWS IoT, developers can configure rules to send the data to other services for processing and storage. The OpenSearch action will take MQTT messages from IoT and store them in an OpenSearch cluster.  

Machine Learning and Observability

When putting logs into OpenSearch, the goal is to get better insights into how your SaaS platform functions. DevOps teams can catch errors, delays in processing, or unexpected behaviors with the correct setup in observability tools. Teams can also set up alerts to notify one another when it’s time to investigate errors. 

Setting up log analysis or machine learning in AWS OpenSearch is not a simple task; there is no switch to flip to gain insight into your platform. It takes significant engineering resources to use OpenSearch for observability with machine learning, and teams would need to build a custom solution. If this type of processing is critical for your platform, consider using an established system like Coralogix that can provide log analysis and alerting to inform you when your system is not performing at its best.

Summary

AWS OpenSearch is an AWS-supported, open-source alternative to Elasticsearch. Being part of the AWS environment, it can be fed data by multiple different AWS services like Kinesis Firehose and Lambda. Developers can use OpenSearch to store various data types through customized mappings, including observability data. DevOps teams can query logs using associated Kibana dashboards or AWS compute functions to help with troubleshooting and log analysis. For a fast setup of machine learning log analytics without needing specialized engineering resources, consider also utilizing the Coralogix platform to maintain your system. 

Proactive Monitoring vs. Reactive Monitoring

Log monitoring is a fundamental pillar of modern software development. With the advent of modern software architectures like microservices, the demand for high-performance monitoring and alerting shifted from useful to mandatory. Combine this with an average outage cost of $5,600 per minute, and you’ve got a compelling case for investing in your monitoring capability. However, many organizations are still simply reacting to incidents as they see them, and they never achieve the next stage of operational excellence: proactive monitoring. Let’s explore the difference between reactive and proactive monitoring and how you can move to the next level of modern software resilience.

Proactive vs. Reactive: What’s the difference?

Reactive monitoring is the classic model of software troubleshooting. If the system is working, leave it alone and focus on new features. This represents a long history of monitoring that simply focused on responding quickly to outages and is still the default position for most organizations that are maintaining in-house software.

Proactive monitoring builds on top of your reactive monitoring practices. It uses many of the same technical components. Still, it has one key difference: rather than waiting for your system to reach out with an alarm, it allows you to interrogate your observability data to develop new, ad hoc insights about your platform’s operational and commercial success.

Understanding Reactive Monitoring

Reactive monitoring is the easiest to explain. If your database runs out of disk space, a reactive monitoring platform would inform you that your database is no longer working. If a customer calls to explain that the website is no longer working, you may use your reactive monitoring toolset to check for apparent errors.

Reactive monitoring looks like alerts that monitor key metrics. If these metrics exceed a threshold, this will trigger an alarm and inform your engineers that something is broken. You may also capture logs, metrics, and traces, but never look at them until one of your alarms has told you you need to. These behaviors are the basic tenets of reactive monitoring. 

So what are the limitations of reactive monitoring?

The limitations of reactive monitoring are clear, but a reactive-only strategy has some more subtle consequences. The obvious implication is that you’re reacting to incidents rather than preventing them. This leads to service disruptions and customer impact. However, it also means more time troubleshooting issues. Interruptions can constitute up to 6 hours of your time on a typical working day. These interruptions, plus potentially expensive outages, can add up to a lot of lost revenue and reputational damage that may impact your ability to attract talented employees and new customers.

What is Proactive Application Monitoring?

So what is proactive monitoring? We’ll go into more detail below, but proactive monitoring looks like this:

  • Multiple alert levels for different severity events. For example, Google advises three levels – notify, ticket, and page.
    • Notify is simply a piece of information, like the database using more CPU than usual.
    • Ticket is an issue that isn’t pressing but should be dealt with soon. 
    • A page alert is an actual alarm when something is broken. 
  • Predictive alarms tell you when something is going to happen rather than when it has already happened. Prometheus supports this with its predict_linear functionality, for example (see the sketch after this list). This is a straightforward implementation, but it illustrates the idea perfectly.
  • Interrogating your observability data regularly to understand how your system is behaving. For example, using Lucene to query your Elasticsearch cluster or PromQL to generate insights from your Prometheus data.
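
To make the predict_linear idea concrete, here is a small sketch that asks the Prometheus HTTP API whether any root filesystem is on track to fill within the next hour. The Prometheus URL and the node_exporter metric name are assumptions about your setup:

```python
import requests

PROMETHEUS_URL = "http://localhost:9090"  # adjust to your Prometheus server

# predict_linear extrapolates disk usage one hour (3600s) ahead; a negative
# predicted value means the filesystem is on track to run out of space.
QUERY = 'predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 3600) < 0'

response = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY})
response.raise_for_status()

for series in response.json()["data"]["result"]:
    labels = series["metric"]
    print(f"Disk predicted to fill within an hour on {labels.get('instance', 'unknown')}")
```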

Machine learning is a powerful component of proactive monitoring

AIOps has found its place in the observability tech stack, and machine learning-driven alerts are a cornerstone of proactive monitoring. With a traditional alert, you’re setting alarms around your “known knowns” and your “known unknowns.” These are the issues that you’re either aware of, for example a spike in database CPU usage following increased user traffic, or an issue you know about but haven’t diagnosed yet, like a sudden increase in HTTP latency.

You’re more broadly looking for anomalous behavior with a machine-learning alert. You can raise an alert if your system begins to exhibit behavior that it hasn’t shown before, especially around a specific metric or type of log. This is incredibly powerful because it adds a safety net to your existing traditional alerts and can detect new issues that fall into the “unknown unknown” category. An example might be an error that only manifests when a series of other criteria are true, like time of day, number of on-site users, and load on the system. 

These issues are challenging to detect and drive reactive monitoring behaviors – “we’ll just wait until it happens next time.” With a machine learning alert, you can catch these incidents in their early stages and analyze the anomalous behavior to gain new and powerful insights into the behavior of your system.
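
As a rough illustration of the underlying idea (not any vendor’s actual algorithm), the sketch below flags metric samples that deviate sharply from their recent history using a rolling z-score; real machine-learning alerting is far more sophisticated:

```python
from statistics import mean, stdev

def anomalies(values, window=30, threshold=3.0):
    """Flag points that sit more than `threshold` standard deviations away from
    the mean of the preceding window - a toy stand-in for ML-driven alerting."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(values[i] - mu) / sigma > threshold:
            flagged.append((i, values[i]))
    return flagged

# Example: steady request latencies with one sudden spike.
latencies_ms = [50 + (i % 5) for i in range(60)] + [400]
for index, value in anomalies(latencies_ms):
    print(f"Anomalous sample at position {index}: {value} ms")
```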

Summary

Proactive monitoring is the difference between firefighting and forward-fixing. Proactive monitoring approaches give your system a voice rather than relying on incidents or outages before your system surfaces data. When this approach is coupled with a machine learning strategy, you’ve got a system that informs you of undesirable (but not critical) behavior, plus alarms that will tell you about new, potentially unwanted behavior that you never considered in the first place. This allows you to leverage your observability data to help you achieve your operational and commercial goals.

Intro to AIOps: Leveraging AI and Machine Learning in DevOps

AIOps is a DevOps strategy that brings the power of machine learning to bear on observability and system management. It’s not surprising that an increasing number of companies are now adopting this approach.  

AIOps first came onto the scene in 2015 (coincidentally the same year as Coralogix) and has been gaining momentum for the past half-decade. In this post, we’ll talk about what AIOps is, and why a business might want to use it for their log analytics.

AIOps Explained

AIOps reaps the benefits of the fantastic advances in AI and machine learning of recent decades. Because enterprise applications are complex yet predictable systems, AI and machine learning can be used to great effect to analyze their data and extract patterns. The AIOps Manifesto spells out five dimensions of AIOps:

  1. Data set selection – machine learning algorithms can parse vast quantities of noisy data and provide Ops teams with a curated sample of clean data.  It’s then much easier to extract trustworthy insights and make effective business decisions.
  2. Pattern discovery – this generally occurs after a data set has been appropriately curated. It involves using a variety of ML techniques to extract patterns. These can be rule-based systems or neural networks trained with supervised or unsupervised learning (see the sketch after this list).
  3. Inference – AIOps uses a range of inference algorithms to draw conclusions from patterns found in the data. These algorithms can make causal inferences about system processes ‘behind the data.’  Combining expert systems with pattern-matching neural networks creates highly effective inference engines.
  4. Communication – for AIOps to be of value, it’s not enough for the AI to have the knowledge; it needs to be able to explain its findings to a human engineer! AIOps has a variety of strategies for doing this, including visualization and natural language summaries.
  5. Automation – AIOps achieves its power by automating problem-solving and operational decisions. Because modern IT systems are so complex and fast-changing, automated systems need to be intelligent. They need machine learning to respond to quickly changing conditions in an adaptive fashion.
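
To make the pattern-discovery dimension concrete, here is a hedged sketch that clusters a handful of raw log lines using TF-IDF vectors and k-means; it assumes scikit-learn is installed and is only a toy stand-in for production-grade log clustering:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A handful of raw log lines standing in for the noisy data an AIOps
# pipeline would ingest at far greater volume.
logs = [
    "Connection timeout while calling payment-service",
    "Connection timeout while calling user-service",
    "Disk usage at 91% on node-3",
    "Disk usage at 95% on node-7",
    "User login failed for account 1042",
    "User login failed for account 2210",
]

vectors = TfidfVectorizer().fit_transform(logs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

for cluster in sorted(set(labels)):
    print(f"Pattern {cluster}:")
    for line, label in zip(logs, labels):
        if label == cluster:
            print(f"  {line}")
```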

Why IT needs AIOps

As IT has advanced, it has shouldered more and more of the essential processes of business organizations.  Not only has technology become more sophisticated, it has also woven itself into business practice in increasingly intricate ways.

The ‘IT department’ of the ‘90s, responsible for a few niche business applications, has virtually gone. 21st century IT lives in the cloud. Enterprise applications are virtual, consisting of thousands of ephemeral components.  Businesses are so dependent on them that many business processes are IT processes.

This means that DevOps has had to upgrade. Automation is essential to managing the fast-changing complexity of modern IT. AIOps is an idea whose time has come. 

How companies are using AIOps

Over the past decade, AIOps has been adopted by many organizations. In a recent survey, OpsRamp found that 68% of surveyed businesses were experimenting with AIOps due to its potential to eliminate manual labor and extract data insights.

William Hill, COTY, and KPN are three companies that have chosen the way of AIOps and their experience makes fascinating reading:

AIOps Case Study: William Hill

William Hill started using AIOps to combat game and bonus abuse. As a betting and gaming company, their revenues depended on people playing by the rules, and with so many customers, no human could keep track of the data.

William Hill’s head of Capacity and Monitoring Engineering, Andrew Longmuir explains the benefits of adopting AIOps.  First, it helped with automation, and in particular what Andrew calls “silo-busting”. AI and machine learning allowed William Hill to integrate nonstandard data sources into their toolchain.

Andrew uses the analogy of a jigsaw. Unintegrated data sources are like missing pieces of a puzzle. Using machine learning allows William Hill to bring them back into the fold and create a complete picture of the system.

Second, AIOps enables William Hill’s team to solve problems faster. Machine learning can be used to window data streams, reducing alert volumes and eliminating operational noise. It can also detect correlations between alerts, helping the team prevent problems before they arise.

Finally, incorporating AI and Machine Learning into William Hill’s IT strategy has even improved their customer experience. This results from them leveraging insights extracted from their analytics data to improve the design of their website.

Andrew has some words of wisdom for other organizations considering AIOps. He recommends focusing on a use case that is central to your company.  Teams need to be willing to trial multiple different solutions to find the optimum setup.

AIOps Case Study: COTY

COTY adopted AIOps to take the agility and scalability of their IT strategy to the next level. COTY is a major player in the cosmetics space, with clients that include Max Factor and Calvin Klein.  As a dynamic business, they relied on flawless and versatile performance from their IT infrastructure to manage everything from payrolls to wireless networks.

With over 4,000 servers and a cloud-based infrastructure, COTY’s IT system is far too complex for traditional DevOps strategies to handle. To deal with it they’ve chosen AIOps.

AIOps has improved the way COTY handles and analyzes data. Data sources are integrated into a ‘data lake’, and machine learning algorithms can crunch its contents to extract patterns.

This has allowed them to minimize noise, so their operations department isn’t bombarded with irrelevant and untrustworthy information. 

AIOps has transformed the way COTY’s DevOps team thinks about visibility. Instead of a traditional events-based model, they now use a global, service-orientated model.  This allows the team to analyze their business and IT holistically.

COTY’s Enterprise Management Architect, Dan Ellsweig, wants to take things further. Dan is using his AIOps toolchain to create a dashboard for executives to view. For example, the dashboard might show the CTO what issues are being dealt with at a particular point in time.

AIOps Case Study: KPN

KPN is a Dutch telecoms business with operating experience in many European countries.  They adopted AIOps because the amount of data they were required to process was more than a human could handle.

KPN’s Chief Product Owner Software Tooling, Arnold Hoogerwerf, explains the benefits of using AIOps. First, leveraging AI and machine learning can increase automation and reduce operational complexity. This means that KPN’s DevOps team can do more with the same number of people.

Secondly, AI and machine learning can speed up the process of investigating problems. With traditional strategies, it may take weeks or months to investigate a problem and find the root cause. The capacity of AI tools to correlate multiple data sources allows the team to make crucial links in days that otherwise would have taken weeks.

Finally, Hoogerwerf has a philosophical reason for using AIOps.  He believes that while data is important, it’s even more important to keep sight of what’s going on behind the data.

Data on its own is meaningless if you don’t have the knowledge and wisdom with which to interpret it.

Implementing AIOps with Coralogix

Although the three companies we’ve looked at are much larger than the average business, AIOps is not just for big companies. The increasing number of platforms and vendors supporting AIOps tooling means that any business can take advantage of what AIOps has to offer.

The Coralogix platform launched two years after the birth of AIOps and our philosophy has always paralleled the principles of AIOps.  As Coralogix’s CEO Ariel Assaraf explains, organizations are burdened with the need to analyze increasing quantities of data. They often can’t do this with existing infrastructure, resulting in more than 99% of data remaining completely untapped.

In this context, the Coralogix platform is a game-changer. It allows organizations to analyze data without relying on storage or indexing. This enables significant cost savings and greater data coverage. Adding machine learning capabilities on top of that makes Coralogix much more powerful than any alternative in the market. Instead of cherry-picking data to analyze, stateful stream analysis occurs in real-time.  

How Coralogix can help with pattern discovery

One of the five dimensions of AIOps is pattern discovery. Due to the ability of machine learning to analyze large quantities of data, the Coralogix platform is tailor-made for discovering patterns in logs. As a case in point, gaming company AGS uses Coralogix to analyze 100 million logs a day.

The patterns extracted have allowed their DevOps team to reduce MTTR by 70% and their development team to create enhanced user experiences that have tripled their user base.

Another case is the neural science and ML company Biocatch. With exponentially increasing log volumes, their plight was a vivid illustration of the complexity that 21st century DevOps teams increasingly face.

Coralogix could handle these logs by clustering entries into patterns and finding connections between them. This allowed Biocatch to handle bugs and solve problems much faster than before.

How Coralogix can communicate insights

Once patterns have been extracted, DevOps engineers receive automated insights and alerts about anomalies in the system behavior.  Coralogix achieves this by integrating with a variety of dashboards and visualization solutions such as Prometheus and CloudWatch.

Coralogix also implements a smarter alerting system that flags anomalies to DevOps engineers in real time.  Conventional alerting systems require DevOps engineers to set alerting thresholds manually. However, as we saw at the start of this article, modern IT is too complex and fast-changing for this approach to work.

Coralogix solves this with dynamic alerts. These use machine learning to adjust thresholds in response to data.  This enables a much more effective approach to anomaly detection, one that is tailored to the DevOps landscape of the 21st century.

Wrapping Up

The increasing complexity and volumes of data faced by modern DevOps teams mean that humans can no longer handle IT operations without help. AIOps leverages AI and machine learning to convert high-volume data streams into insights that human engineers can act on.

AIOps fits with Coralogix’s own approach to DevOps, which is to use machine learning to help organizations effectively use the increasing volumes of data they generate.  Observability should be for the many, not just a few.