Fintech Industry: Are Your IT, DevOps, and Engineering Teams Siloed?

What is a silo, and why is it bad?

The Cambridge English Dictionary defines a silo as “a part of a company, organization, or system that does not communicate with, understand, or work well with other parts.” Siloing can exist at various organizational levels: siloed departments, siloed teams within a department, and even siloed engineers within a team.

In any industry, siloing can cause issues with alignment, communications, and overall delivery, but in fintech, there are additional risks.

The financial sector, by its nature, is heavily regulated and fault-intolerant. This means that developing anything in the field naturally costs more and comes with higher risks than building in other industries. The more costly mistakes and misalignments are to fix, the more critical it is to address siloing quickly and decisively.

Fintech is also fast-moving and highly competitive. An organization where everyone is working towards a common mission has a competitive advantage over one that is siloed, with each department working towards its own isolated goals.

Siloing in the age of remote work

Remote working has become more prevalent in the last couple of years, posing its own siloing and communication risks.

When workers are colocated, especially in an open-plan and cross-functional office setting, you get communication ‘by osmosis:’ people overhear and interject into conversations organically. In a remote setting, relying exclusively on formal meetings to discuss upcoming work often leads people to be left out of discussions that aren’t related to their team.

Compounding this issue, in a remote setting, it’s far easier to tune out of a meeting because something more interesting is happening or because of video call fatigue. This can lead to people missing out on vital information.

These issues may sound insurmountable, but with a few simple alterations to your workflow, they’re quite simple to address (more on that later).

How to identify siloing in Engineering & DevOps

The most crucial step towards addressing siloing is to identify it. There are some tell-tale signs that your fintech organization is siloed, but for this article, we’re going to focus on the red flags around the tech department.

Let’s start at the individual level. How do we know if engineers & DevOps staff are siloed from each other?

“Talk to Mike”

The first warning sign is “talk to Mike.” “Mike” might have a different name in your team, but if every query about a product or system in an organization is redirected towards one or two long-term employees, this is a warning sign that knowledge sharing isn’t happening effectively.

Silence on Slack

The next warning sign is that the Slack (or Discord or IRC) channels are quiet. This generally means the entire discussion happens in meetings where not everyone is involved or in private messages where not everyone is included.

Rubber-Stamp Pull Requests

Finally, ‘rubber stamp’ pull requests. If the expectation is that pull requests are reviewed or approved in ten minutes, then it’s doubtful that engineers or DevOps are properly reading them and understanding them. Rushed peer reviewers will quickly check for the most critical issues, rather than understand and question architectural and functional decisions.

What about at a higher level? How do we know if teams are siloed from each other?

This is quite simple: ask the engineers on one team what another team’s current goal is. If they don’t know, it’s probably because there isn’t enough communication between the teams.

Understanding whether the entire tech department is siloed is slightly more nuanced. There are three major symptoms that you can look out for, though.

“The Business”

The first, very common, red flag is that engineers refer to other departments as ‘the business.’ This phrase may sound harmless, but it suggests that your tech function sees itself as outside of ‘the business,’ just delivering functions it has no input into.

Surface-Level Business Knowledge

The second is that teams know ‘what’ they’re building but don’t know ‘why.’ Not just ‘why is this useful?’ but ‘why is this the highest priority for the company right now?’

Lack of User-Centricity

The third is that teams know ‘why’ they’re building it, from the company’s perspective, but don’t know ‘why,’ from the user’s perspective – how the feature adds value for the people using it.

How to break silos in a fintech organization

Having identified the fact that there are silos, the most crucial tool to breaking down walls – whether between individuals, teams, functions, or departments – is better communication. This doesn’t mean doubling up on video calls, which will cause more irritation for the employees we depend on to support the changes.

We’ll list some changes you can make from easiest to hardest. This isn’t an exhaustive list but a solid foundation for better communication.

Include Everyone in the Discussion

First, where possible, everyone should be encouraged to discuss things in the most public relevant forum. Instead of an engineer asking Mike how the futures trading product works in a private message, discussing in a public channel allows the entire team to share and learn simultaneously.

Introduce Living Documentation

Following up on this, introducing and encouraging the use of assets like a company wiki can be extremely valuable. The easiest way to communicate something comprehensively is to write it down once instead of writing or saying it over and over again.

Documentation takes time to create and must be actively championed, but repeating the same thing in tens or hundreds of conversations takes even more time and gets tiresome.

Regular & Useful Updates

Next, enforce regular company-wide updates. These can be a meeting or an e-mail with a short update from each team and department on what’s been accomplished recently and what it’s working on now. This is also an excellent opportunity for leadership to explain what’s coming up and why it creates value for the customer.

Knowledge Sharing Sessions

Knowledge sharing sessions can be another valuable option, where one employee (or contractor) delivers a presentation about something they’ve learned recently to the entire tech department. This offers a forum to talk about new technologies, new applications of technology, or even things like environmental impact or accessibility.

Collaborative Development

Then there’s mob programming. Have a 90-minute session once a week where an entire team works together to build a feature or fix a bug. This helps create more team communication and allows everyone to discuss things beyond the feature at hand.

Ubiquitous Language

There’s a lot of value in establishing a ubiquitous language. This is a concept taken from Domain-Driven Design, which simply means that everyone in the business speaks the same language. If the business defines a ‘quotation,’ an ‘option,’ or an ‘FX trade,’ that should mean the same in engineering.

This isn’t just about naming variables and functions. It’s about everyone growing their understanding of the business and its customers.
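
As a tiny, hypothetical illustration of that surface level, the sketch below models an FX trade using the business’s own terms; the class and field names are invented for the example, not taken from any real trading system.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical domain model: the class and field names mirror the language the
# business already uses, so an "FX trade" means the same thing in code,
# in meetings, and on the trading desk.
@dataclass
class FxTrade:
    base_currency: str    # e.g. "EUR"
    quote_currency: str   # e.g. "USD"
    notional: Decimal     # amount of the base currency being exchanged
    rate: Decimal         # agreed exchange rate for the trade

trade = FxTrade(base_currency="EUR", quote_currency="USD",
                notional=Decimal("1000000"), rate=Decimal("1.0845"))
```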

Get-Togethers

Several fully remote companies have succeeded in having a ‘team week,’ where the team gets together in one location for a few days to work together during the day and bond as a team during the evening. This can help break down barriers between teams and departments as engineers & DevOps staff socialize with people outside of tech.

Cross-Functional (or Multi-Discipline) Teams

Finally, building true cross-functional teams is a great way to increase knowledge sharing across disciplines. Essentially, they’re teams built from multiple disciplines – a product owner who understands the product from a business perspective, a quality assurance expert or developer-in-test, a platform specialist or DevOps, a UX/UI specialist, etc.

If you want knowledge to move further, swap one person out of each team regularly. This will help bring knowledge gained from one team to another team. It’s important not to do this too regularly, though, or there will be a loss in productivity from people having to keep onboarding repeatedly.

In summary

The key to breaking down silos is building an organization where people are excited to talk to each other. If your organization is heavily siloed, that will not be an overnight fix, and you will need to attach value to, and prioritize, communication and knowledge sharing.

Understanding that tech is not an independent function from sales, treasury, legal, or any other department is essential. Engineering and DevOps staff should always consider the user experience. Automattic, the creator of WordPress, famously requires all of its staff, from the CEO down, to spend a week of each year working on front-line customer support!

The world of fintech is fast-moving, with plenty of pitfalls and traps to fall into. A team working together towards a common goal, where everyone feels heard, is the most equipped to thrive in the industry.

What is Infrastructure as Code?

Cloud services were born in the early 2000s, with companies such as Salesforce and Amazon paving the way. Simple Queue Service (SQS) was the first service launched by Amazon Web Services, in November 2004. It was offered as a distributed queuing service, and it remains one of the most popular services in AWS. By 2006, more and more services had been added to the catalog.

Nearly two decades on, Infrastructure as Code (IaC) has emerged as the process for managing and provisioning cloud services, virtual machines, bare metal servers, and other configuration resources. Coding, as opposed to the usual clicking around the dashboard of your chosen cloud provider, is now the preferred way of provisioning these resources. Treating infrastructure as code and using the same tools as in software application development appeals to developers because it brings the same efficiency to building, deploying, and monitoring applications and their infrastructure.

IaC has made it easier to edit and distribute infrastructure code files, and it ensures that the same configuration is shared across different environments without discrepancies between them. IaC files also serve as official infrastructure documentation, and some tools can translate these files into architectural diagrams. CloudFormation Designer, which comes as part of the AWS CloudFormation service, is a great example. Draw.io (with its Visual Studio Code integration) and Brainboard can also support the design of your IaC documentation.

Another important advantage is version control. Infrastructure files can now be committed and pushed to your favourite version control system (VCS), such as Git. As a result, development teams can integrate infrastructure files into CI/CD pipelines and automate the deployment of the servers, security groups, databases, and other cloud resources required for the target architecture. This article introduces IaC, explains the difference between the declarative and imperative approaches, discusses the benefits of IaC and lists relevant tools, and finally looks at how IaC fits with DevOps and CI/CD.

Declarative vs. imperative approaches to IaC

Declarative and imperative are the two approaches to executing your Infrastructure as Code files. Both ultimately provide a set of instructions to the underlying platform. The imperative approach is closer to the way our brains work: it is a sequence of commands that must be executed to reach the final state or goal. Programming languages such as Java, Python, and C# follow this approach. Configuration tools such as Chef (and, partially, Ansible) follow the imperative approach.

On the other hand, the declarative approach defines the desired state of the system, e.g., the resources you need and the specific configuration for those resources. This approach is also idempotent, which means that no matter how many times you apply an identical configuration, your system will end up with exactly the same resources. AWS CloudFormation and HashiCorp Terraform are the leading IaC technologies using the declarative approach. However, it’s important to understand why you would use Infrastructure as Code in the first place.
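
To make the contrast concrete, here is a minimal Python sketch. The imperative half issues explicit commands through boto3 (the AWS SDK for Python), while the declarative half only records the desired end state and hands it to a reconcile() helper that stands in for what tools like Terraform or CloudFormation do for you. The AMI ID and the reconcile() function are illustrative assumptions, not a real tool’s API.

```python
import boto3

# Imperative: spell out each command needed to reach the goal state.
def provision_imperatively():
    ec2 = boto3.client("ec2", region_name="eu-west-1")
    # Explicitly ask for two instances; running this twice gives you four.
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",  # placeholder AMI ID
        InstanceType="t3.micro",
        MinCount=2,
        MaxCount=2,
    )

# Declarative: describe only the desired end state.
DESIRED_STATE = {
    "web_servers": {"type": "aws_instance", "instance_type": "t3.micro", "count": 2},
}

def reconcile(desired: dict) -> None:
    """Hypothetical stand-in for Terraform/CloudFormation: compare the desired
    state with what actually exists, then create, update, or delete resources
    until the two match. Applying the same state twice changes nothing --
    that is idempotency."""
    ...
```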

Benefits of IaC

After many years of managing and provisioning infrastructure manually, IaC arrived to replace this inefficient process, and it can cover any type of resource: cloud services, data centres, virtual machines, and software components. While software is released to production continuously, even daily, the infrastructure can now be deployed at the same frequency. IaC also helps meet the growing demand for scalability in your organization’s infrastructure. On top of that, IaC doubles as infrastructure documentation and reduces errors thanks to built-in validation and dry-run commands such as terraform validate and terraform plan.

Benefits:

  • Documentation
  • Automation
  • Cost reduction
  • More, even daily, deployments 
  • Error elimination
  • Infrastructure consistency across different environments, e.g., UAT, staging, and production.

Several tools have been developed to help teams with the efficient adoption of IaC. 

IaC tools

Server automation and configuration management tools can often be used to achieve IaC. There are also solutions specifically for IaC.

Some of the most popular tools are listed below:

  • Chef
  • Ansible
  • AWS CloudFormation
  • HashiCorp Terraform

In recent years, all these tools have incorporated automation; avoiding manual work is exactly what DevOps aims for. But where does IaC fit in with the CI/CD pipeline and DevOps?

IaC, CI/CD, and DevOps

As previously mentioned, IaC should be part of a CI/CD pipeline to enable automated deployments and make development teams more efficient. CI/CD falls under DevOps and, more specifically, under the pillar of automation.

CI/CD relies on ongoing automation and continuous monitoring throughout the application lifecycle, from integration and testing to delivery and deployment. Once a development team opts in to automating infrastructure deployment, manual intervention is cut off, usually by setting the appropriate privileges for all team members. Manual configuration might still be possible, but any manual change will not persist beyond the next automated IaC deployment.
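
As a rough illustration of what running IaC inside the pipeline can look like, here is a minimal Python sketch of a deployment step that shells out to Terraform. It assumes Terraform is installed on the build agent and that the configuration lives in an infra/ directory; most CI systems would express the same steps directly in their own pipeline syntax.

```python
import subprocess
import sys

def run(cmd: list[str]) -> None:
    """Run a command, stream its output, and abort the pipeline on failure."""
    print(f"+ {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        sys.exit(1)

def deploy_infrastructure(workdir: str = "infra/") -> None:
    # Standard Terraform workflow: initialise providers, preview, then apply the plan.
    run(["terraform", f"-chdir={workdir}", "init", "-input=false"])
    run(["terraform", f"-chdir={workdir}", "plan", "-input=false", "-out=tfplan"])
    run(["terraform", f"-chdir={workdir}", "apply", "-input=false", "tfplan"])

if __name__ == "__main__":
    deploy_infrastructure()
```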

To realize the benefits of IaC, teams must be aligned on DevOps practices; otherwise, inconsistencies creep in at every level: code, team communication, and deployments.

IaC helps with this alignment between various teams as they speak the same “language” of infrastructure as code. 

IaC also removes the need to maintain individual deployment environments with unique configurations that can’t be reproduced automatically, and it ensures that the production environment stays consistent.

Conclusion

Infrastructure as Code is a concept that is here to stay, thanks to the major benefits described above. There is a lot of ongoing work focused on expanding IaC technologies. The most recent advancement is writing IaC in your language of preference, e.g., Java, Node.js, and more, rather than only in YAML, JSON, or technology-specific formats such as HashiCorp Configuration Language (HCL). The current focus is on keeping the benefits while making IaC simpler for more people to adopt.

We’re Thrilled To Share – Coralogix has Received AWS DevOps Competency

At Coralogix, we believe in giving companies the best of the best – that’s what we strive for in everything we do. With that, we are happy to share that Coralogix has received the AWS DevOps Competency!

Coralogix started working with AWS observability in 2017, and our partnership has grown immensely in the years since. So, what is our new AWS DevOps Competency status, and what does it mean for you?

As stated by Ariel Assaraf, CEO of Coralogix, “The dramatic surge in data volume is forcing companies to choose between cost and coverage. Achieving the AWS DevOps Competency designation validates our stateful streaming analytics technology, which combines real-time data analysis with stateful analytics—and decouples that from storage—to reduce costs and improve performance for customers.”

This designation recognizes that Coralogix’s stateful Streama© technology helps modern engineering teams gain real-time insights and trend analysis for logs, metrics, and security data with no reliance on storage or indexing.

What Does the AWS DevOps Competency Mean For You?

What does a DevOps Competency mean in practice? It’s AWS’s validation that a partner has demonstrated deep experience and proven expertise in DevOps on AWS (more on the AWS Competency Program below).

What’s the benefit for you, as an engineer? Building a sustainable, durable, and secure cloud system isn’t easy and requires extensive skills and sustained effort. Knowing that a vendor has been audited by AWS tells you it is investing the right skills and effort into providing the best and most secure service possible, so you can work with peace of mind knowing that you’re in good hands.

About AWS Competency Program

AWS is enabling scalable, flexible, and cost-effective solutions from startups to global enterprises. To support the seamless integration and deployment of these solutions, AWS established the AWS Competency Program to help customers identify Consulting and Technology APN Partners with deep industry experience and expertise.

For more information, you can read the full press release here. You can also learn more about AWS solutions within Coralogix here.

The latest GitHub outage and how it impacts observability

Every now and then, issues occur that disrupt the very fabric of global software engineering. Chief amongst them is the recent mass outage of GitHub. GitHub is a fundamental building block of software productivity, hosting over 190 million code repositories. GitHub hosts our code and libraries, runs build pipelines, and much more. It is a central hub of activity, consumed by tens of thousands of organizations.

But is GitHub a part of your system?

It’s tempting to consider GitHub as being outside of your software system. It isn’t a service or library developed by your team; it’s just one of the many services that you depend on, right? Well, not exactly.

GitHub holds a key function in the productivity of your engineering teams. If your team uses GitHub to host code, they cannot push, review, or merge code changes while GitHub is experiencing an outage.

If your entire development team went on strike, it would be considered an existential threat to your organizational objectives. A GitHub outage of this magnitude has a very similar impact on developer output.

It gets worse.

[Image: GitHub outage details]

You’ll notice that GitHub Pages was also part of the outage. GitHub Pages literally hosts your website for you. A not insignificant number of websites have DNS records pointed directly at a GitHub Pages site. This means that a GitHub outage is also tantamount to an Availability Zone (AZ) outage in AWS: the infrastructure on which you depend has fallen away beneath you.

GitHub is a fundamental aspect of your system

GitHub is a bedrock of both your software engineering lifecycle and your production system’s ability to function. If your teams are unable to commit and push code changes, they’re unable to respond to outages. If they’re unable to respond to outages, those outages will only get worse.

The challenge for organizations now is simple: how do you monitor GitHub and other third-party tools that have become first-class citizens in your observability mission? Some of these tools expose APIs that allow you to programmatically discover their operational status. Alas, others are somewhat more mercurial.
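
GitHub happens to be one of the friendlier cases: its status page is served by the standard Statuspage API, so a few lines of code can pull the current indicator into your own telemetry. Below is a minimal sketch; how you forward the result into your observability platform is left open.

```python
import json
import urllib.request

STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"

def fetch_github_status() -> dict:
    """Return GitHub's current status indicator: none, minor, major, or critical."""
    with urllib.request.urlopen(STATUS_URL, timeout=10) as response:
        payload = json.load(response)
    # e.g. {"indicator": "none", "description": "All Systems Operational"}
    return payload["status"]

if __name__ == "__main__":
    status = fetch_github_status()
    print(f"github_status indicator={status['indicator']} description={status['description']!r}")
    # Ship this line as a log or metric so it can be correlated with the rest of your telemetry.
```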

The GitHub status page belongs to a class of low-volume, high-value data that is too often overlooked. We analyze terabytes of operating system logs to better understand our systems, but we skip over the data that provides us with context. The status of GitHub is as fundamental to observability as the status of your AWS availability zone or the network connection for your data center. It is essential.

The goal is to create a general solution that can consume data from disparate sources and bring them into one place, so that you can correlate many different conspiring events into a coherent timeline that describes the what and the why of your system.

So how can you be ready for the next outage?

Contextual data is the hidden goldmine within your organization, costing very little to store and analyze but providing a great deal of value. Coralogix provides a comprehensive suite of features to tackle this challenge. 

CPU utilization is important to understand what your system is doing, but contextual data can provide you with the why and allow you to craft alerts that deal with the complex realities of your system. 

Contextual data is more than just GitHub. It’s Slack messages, CI/CD logs and events, third-party status pages, and much more. While these sources are siloed, they are hidden, and if they are hidden, they aren’t valuable. Exposing this data is the next step toward complete observability.

How to Mitigate DevOps Tool Sprawl in Enterprise Organizations

There’s an insidious disease increasingly afflicting DevOps teams. It begins innocuously. A team member suggests adding a new logging tool. The senior dev decides to upgrade the tooling. Then it bites. 

You’re spending more time navigating between windows than writing code. You’re scared to make an upgrade because it might break the toolchain.

The disease is tool sprawl.  It happens when DevOps teams use so many tools that the time and effort spent navigating the toolchain is greater than the savings made by new tools.  

Tool Sprawl: What’s the big problem?

Tool sprawl is not something to be taken lightly.  A 2016 DevOps survey found that 53% of large organizations use more than 20 tools.  In addition, 53% of teams surveyed don’t standardize their tooling.

It creates what Joep Piscaer calls a “tool tax”, increased technical debt, and reduced efficiency which can bog down your business and demoralize your team.

Reduced speed of innovation

With tool sprawl, a DevOps team is more likely to have impaired observability as data between different tools won’t necessarily be correlated.  This ultimately reduces their ability to detect anomalous system activity and locate the source of a fault and increases both Mean Time To Detection (MTTD) and Mean Time To Repair (MTTR).

Also, an overabundance of tools can result in increased toil for your DevOps team. Google’s SRE org defines toil as “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Tool sprawl creates toil by forcing DevOps engineers to continually switch between lots of different tools which may or may not be properly integrated.  This cuts into the time spent doing useful and productive work such as coding during the day.

Finally, tool sprawl reduces your system’s scalability. This is a real blocker to businesses that want to go to the next level. They can’t scale their application and may have trouble expanding their user base and developing innovative features.

Lack of integration and data silos

A good DevOps pipeline is dependent on a well-integrated toolchain.  When tool sprawl is unchecked, it can result in a poorly integrated set of tools.  DevOps teams are forced to get round this by implementing ad-hoc solutions which decrease the resilience and reliability of the toolchain.

This reduces the rate of innovation and modernization in your DevOps architecture. Engineers are too scared to make potentially beneficial upgrades because they don’t want to risk breaking the existing infrastructure.

Another problem created by tool sprawl is that of data silos. If different DevOps engineers use their own dashboards and monitoring tools, it can be difficult (if not impossible) to pool data. This reduces the overall visibility of the system and consequently reduces the level of insights available to the team.

Data silos also cause a lack of collaboration.  If every ops team is looking at a different data set and using their own monitoring tool, they can’t meaningfully communicate.

Reduced team productivity

Engineers add tools to increase productivity, not to reduce it. Yet having too many actually has the opposite effect. 

Tool sprawl can seriously disrupt the creative processes of engineers. Being forced to pick their way through a thicket of unstandardized and badly integrated tooling breaks their flow, reducing their ability to problem solve. This makes them less effective as engineers and reduces the team’s operational excellence.

Another impairment to productivity is the toxic culture created by a lack of collaboration and communication between different parts of the team. In the previous section, we saw how data silos resulted in a lack of team collaboration.

The worst case of this is that it can lead to a culture of blame. Each part of the team, cognizant only of the information on its part of the system, tries to rationalize that information and treat its view as correct.

This leads to them neglecting other parts of the picture and blaming non-aligned team members for mistakes.

The “Dark Side” of the toolchain

In Star Wars, all living things depended on the Force. Yet the Force was double-edged; it had a light side and a dark side. Similarly, a DevOps pipeline depends on an up-to-date toolchain that can keep pace with the demands of the business.

Yet in trying to keep their toolchain beefed-up, DevOps teams constantly run the risk of tool sprawl. Tooling is often upgraded organically in response to the immediate needs of the team. As Joep warns though, poorly upgrading tooling can create more problems than it solves. It adds complexity and operational burdens.

Solving the problem of tool sprawl

Consider your options carefully

One way that teams can prevent tool sprawl is by thinking much more carefully about the pros and cons of adding a new tool.  As Joep explains, tools have functional and non-functional aspects. Many teams become sold on a new tool based on the functional benefits it brings. These could include allowing the team to visualize data or increasing some aspect of observability.

What they often don’t really think about are the tool’s non-functional aspects.  These can include performance, ease of upgrading, and security features.

If a tool were a journey, its function would be the destination and its non-functional aspects the route it takes. Many teams are like complacent passengers, saying “wake me when we get there” while taking no heed of potential hazards along the way.

Instead, they need to be like ship captains, navigating the complexities of their new tool with foresight and avoiding potential problems before they sink the ship.

Before incorporating a tool into their toolchain, teams need to think about operational issues. These can be anything from the number of people needed to maintain the tool to the repo new versions are available in.

Teams also need to consider agility. Is the tool modular and extensible? If so, it will be relatively easy to enhance functionality downstream. If not, the team may be stuck with obsolescent tooling that they can’t get rid of.

Toolchain detox

Another tool sprawl mitigation strategy is to opt for “all-in-one” tools that let teams achieve more outcomes with less tooling. A recent study advocates for using a platform vendor that possesses multiple monitoring, analytics and troubleshooting capabilities.

Coralogix is a good example of this kind of platform.  It’s an observability and monitoring solution that uses a stateful streaming pipeline and machine learning to analyze and extract insights from multiple data sources.  Because the platform leverages artificial intelligence to extract patterns from data, it has the ability to combat data silos and the dangers they bring.

Trusting log analytics to machine learning makes it possible to avoid human limitations and ingest data from all over the system.  This data can be pooled and algorithmically analysed to extract insights that human engineers might not have reached.

In addition, Coralogix can be integrated with a range of external platforms and solutions.  These range from cloud providers like AWS and GCP to CI/CD solutions such as Jenkins and CircleCI.

While we don’t advise paring down your toolchain to just one tool, a platform like Coralogix goes a long way toward optimizing IT costs and mitigating tool sprawl before it becomes a problem.

The tool consolidation roadmap

For those who are currently wrestling with out-of-control tool sprawl, there is a way out! The tool consolidation roadmap shows teams how to go from a fragmented, ad hoc toolchain to a modern one free of unnecessary tools. The roadmap consists of three phases.

Phase 1 – Plan

Before a team starts the work of tool consolidation, they need to plan what they’re going to do. The team needs first to ascertain the architecture of the current toolchain as well as the costs and benefits to tool users.

Then they must collectively decide what they want to achieve from the roadmap. Each component of the team will have its own desirable outcome and the resulting toolchain needs to cater to everybody’s interests.

Finally, the team should draw up a timeframe outlining the tool consolidation steps and how long they will take to implement.

Phase 2 – Prepare

The second phase is preparation. This requires the team to draw up a comprehensive list of use cases and map them onto a list of potential solutions. The aim of this phase is to really hash out what high-level requirements the final solution needs to satisfy and flesh these requirements out with lots of use cases.

For example, the DevOps team might want higher visibility into database instance performance.  They may then construct use cases around this: “as an engineer, I want to see the CPU utilization of an instance”.
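
To make that use case concrete, here is roughly what answering it by hand against the AWS API looks like, sketched with boto3 (the instance ID is a placeholder). Part of the roadmap is deciding which of these one-off queries a consolidated platform should absorb.

```python
from datetime import datetime, timedelta

import boto3

def average_cpu_utilization(instance_id: str, hours: int = 1) -> list:
    """Fetch average CPU utilization datapoints for one EC2 instance."""
    cloudwatch = boto3.client("cloudwatch")
    now = datetime.utcnow()
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=300,            # one datapoint every five minutes
        Statistics=["Average"],
    )
    return sorted(response["Datapoints"], key=lambda dp: dp["Timestamp"])

for point in average_cpu_utilization("i-0123456789abcdef0"):  # placeholder instance ID
    print(point["Timestamp"], round(point["Average"], 2), "%")
```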

The team can then research and inventory possible solutions that can enable those use cases.

Phase 3 – Execute

Finally, the team can put its plan into action. This step involves several different components. Having satisfied themselves that the chosen solution best enables their objectives, the team needs to deploy the chosen solution.

This requires testing to make sure it works as intended and deploying to production.  The team needs to use the solution to implement any alerting and event management strategies they outlined in the plan.

As an example, Coralogix has dynamic alerting. This enables teams by alerting them to anomalies without requiring them to set a threshold explicitly.

Last but not least, the team needs to document its experience to inform future upgrades, as well as training all team members on how to get the best out of the new solution. (Coralogix has a tutorials page to help with this.)

Wrapping Up

A DevOps toolchain is a double-edged sword. Used well, upgraded tooling can reduce toil and enhance the capacity of DevOps engineers to solve problems. However, ad hoc upgrades that don’t take the non-functional aspects of new tools into account lead to tool sprawl.

Tool sprawl reverses all the benefits of a good toolchain. Toil is increased and DevOps teams spend so much time navigating the intricacies of their toolchain that they literally cannot do their job properly.

Luckily, tool sprawl is solvable. Systems like Coralogix go a long way towards fixing a fragmented toolchain, by consolidating observability and monitoring into one platform.  We’ve seen how teams in the thick of tool sprawl can extricate themselves through the tool consolidation roadmap.

Tooling, like candy, can be good in moderation but bad in excess. 

Intro to AIOps: Leveraging AI and Machine Learning in DevOps

AIOps is a DevOps strategy that brings the power of machine learning to bear on observability and system management. It’s not surprising that an increasing number of companies are now adopting this approach.  

AIOps first came onto the scene in 2015 (coincidentally the same year as Coralogix) and has been gaining momentum for the past half-decade. In this post, we’ll talk about what AIOps is, and why a business might want to use it for their log analytics.

AIOps Explained

AIOps reaps the benefits of the fantastic advances in AI and machine learning in recent decades. Because enterprise applications are complex yet predictable systems, AI and machine learning can be used to great effect to analyze their data and extract patterns. The AIOps Manifesto spells out five dimensions of AIOps:

  1. Data set selection – machine learning algorithms can parse vast quantities of noisy data and provide Ops teams with a curated sample of clean data.  It’s then much easier to extract trustworthy insights and make effective business decisions.
  2. Pattern discovery – this generally occurs after a data set has been appropriately curated. It involves using a variety of ML techniques to extract patterns, whether rule-based systems or neural networks trained with supervised and unsupervised learning (a toy example of pattern discovery follows this list).
  3. Inference – AIOps uses a range of inference algorithms to draw conclusions from patterns found in the data. These algorithms can make causal inferences about system processes ‘behind the data.’  Combining expert systems with pattern-matching neural networks creates highly effective inference engines.
  4. Communication – For AIOps to be of value it’s not enough for the AI to have the knowledge, it needs to be able to explain findings to a human engineer! AIOps has a variety of strategies for doing this including visualization and natural language summaries.
  5. Automation – AIOps achieves its power by automating problem-solving and operational decisions. Because modern IT systems are so complex and fast-changing, automated systems need to be intelligent. They need machine learning to respond to quickly changing conditions in an adaptive fashion.
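
As a toy illustration of the pattern-discovery dimension, the sketch below collapses raw log lines into templates by masking their variable parts and counting how often each template occurs. Real AIOps platforms use far more sophisticated clustering and learning than this; treat it purely as an intuition pump.

```python
import re
from collections import Counter

def to_template(line: str) -> str:
    """Mask the variable parts of a log line so similar lines share one template."""
    line = re.sub(r"\b\d+(\.\d+)*\b", "<NUM>", line)                # numbers, IPs, durations
    line = re.sub(r"\b[0-9a-f]{8,}\b", "<HEX>", line, flags=re.I)   # ids, hashes
    return line

def discover_patterns(lines: list[str]) -> Counter:
    """Group log lines by template and count occurrences of each pattern."""
    return Counter(to_template(line) for line in lines)

logs = [
    "user 1042 logged in from 10.0.0.7 in 132 ms",
    "user 2231 logged in from 10.0.1.9 in 98 ms",
    "payment 77 failed with code 502",
]
for pattern, count in discover_patterns(logs).most_common():
    print(count, pattern)
```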

Why IT needs AIOps

As IT has advanced, it has shouldered more and more of the essential processes of business organizations.  Not only has technology become more sophisticated, it has also woven itself into business practice in increasingly intricate ways.

The ‘IT department’ of the ‘90s, responsible for a few niche business applications, has virtually gone. 21st century IT lives in the cloud. Enterprise applications are virtual, consisting of thousands of ephemeral components.  Businesses are so dependent on them that many business processes are IT processes.

This means that DevOps has had to upgrade. Automation is essential to managing the fast-changing complexity of modern IT. AIOps is an idea whose time has come. 

How companies are using AIOps

Over the past decade, AIOps has been adopted by many organizations. In a recent survey, OpsRamp found that 68% of surveyed businesses were experimenting with AIOps due to its potential to eliminate manual labor and extract data insights.

William Hill, COTY, and KPN are three companies that have chosen the way of AIOps and their experience makes fascinating reading:

AIOps Case Study: William Hill

William Hill started using AIOps to combat game and bonus abuse. As a betting and gaming company, their revenues depended on people playing by the rules and with so many customers, a human couldn’t keep track of the data.

William Hill’s head of Capacity and Monitoring Engineering, Andrew Longmuir, explains the benefits of adopting AIOps. First, it helped with automation, and in particular what Andrew calls “silo-busting”. AI and machine learning allowed William Hill to integrate nonstandard data sources into their toolchain.

Andrew uses the analogy of a jigsaw. Unintegrated data sources are like missing pieces of a puzzle. Using machine learning allows William Hill to bring them back into the fold and create a complete picture of the system.

Second, AIOps enables William Hill’s team to solve problems faster.  Machine learning can be used to window data streams, reducing alert volumes, and eliminating operational noise.  It can also detect correlations between alerts, helping the team prevent problems before they arise.

Finally, incorporating AI and Machine Learning into William Hill’s IT strategy has even improved their customer experience. This results from them leveraging insights extracted from their analytics data to improve the design of their website.

Andrew has some words of wisdom for other organizations considering AIOps. He recommends focusing on a use case that is central to your company.  Teams need to be willing to trial multiple different solutions to find the optimum setup.

AIOps Case Study: COTY

COTY adopted AIOps to take the agility and scalability of their IT strategy to the next level. COTY is a major player in the cosmetics space, with brands that include Max Factor and Calvin Klein. As a dynamic business, they relied on flawless and versatile performance from their IT infrastructure to manage everything from payrolls to wireless networks.

With over 4,000 servers and a cloud-based infrastructure, COTY’s IT system is far too complex for traditional DevOps strategies to handle. To deal with it they’ve chosen AIOps.

AIOps has improved the way COTY handles and analyzes data. Data sources are integrated into a ‘data lake’, and machine learning algorithms can crunch its contents to extract patterns.

This has allowed them to minimize noise, so their operations department isn’t bombarded with irrelevant and untrustworthy information. 

AIOps has transformed the way COTY’s DevOps team thinks about visibility. Instead of a traditional events-based model, they now use a global, service-orientated model.  This allows the team to analyze their business and IT holistically.

COTY’s Enterprise Management Architect, Dan Ellsweig, wants to take things further. Dan is using his AIOps toolchain to create a dashboard for executives to view. For example, the dashboard might show the CTO what issues are being dealt with at a particular point in time.

AIOps Case Study: KPN

KPN is a Dutch telecoms business with operating experience in many European countries.  They adopted AIOps because the amount of data they were required to process was more than a human could handle.

KPN’s Chief Product Owner Software Tooling, Arnold Hoogerwerf, explains the benefits of using AIOps. First, leveraging AI and machine learning can increase automation and reduce operational complexity. This means that KPN’s DevOps team can do more with the same number of people.

Secondly, AI and machine learning can speed up the process of investigating problems. With traditional strategies, it may take weeks or months to investigate a problem and find the root cause. The capacity of AI tools to correlate multiple data sources allows the team to make crucial links in days that otherwise would have taken weeks.

Finally, Hoogerwerf has a philosophical reason for using AIOps.  He believes that while data is important, it’s even more important to keep sight of what’s going on behind the data.

Data on its own is meaningless if you don’t have the knowledge and wisdom with which to interpret it.

Implementing AIOps with Coralogix

Although the three companies we’ve looked at are much larger than the average business, AIOps is not just for big companies. The increasing number of platforms and vendors supporting AIOps tooling means that any business can take advantage of what AIOps has to offer.

The Coralogix platform launched two years after the birth of AIOps and our philosophy has always paralleled the principles of AIOps.  As Coralogix’s CEO Ariel Assaraf explains, organizations are burdened with the need to analyze increasing quantities of data. They often can’t do this with existing infrastructure, resulting in more than 99% of data remaining completely untapped.

In this context, the Coralogix platform is a game-changer. It allows organizations to analyze data without relying on storage or indexing. This enables significant cost savings and greater data coverage. Adding machine learning capabilities on top of that makes Coralogix much more powerful than any alternative in the market. Instead of cherry-picking data to analyze, stateful stream analysis occurs in real-time.  

How Coralogix can help with pattern discovery

One of the five dimensions of AIOps is pattern discovery. Due to the ability of machine learning to analyze large quantities of data, the Coralogix platform is tailor-made for discovering patterns in logs. As a case in point, gaming company AGS uses Coralogix to analyze 100 million logs a day.

The patterns extracted have allowed their DevOps team to reduce MTTR by 70% and their development team to create enhanced user experiences that have tripled their user base.

Another case is the neural science and ML company Biocatch. With exponentially increasing log volumes, their plight was a vivid illustration of the complexity that 21st century DevOps teams increasingly face.

Coralogix could handle these logs by clustering entries into patterns and finding connections between them. This allowed Biocatch to handle bugs and solve problems much faster than before.

How Coralogix can communicate insights

Once patterns have been extracted, DevOps engineers receive automated insights and alerts about anomalies in the system behavior.  Coralogix achieves this by integrating with a variety of dashboards and visualization solutions such as Prometheus and CloudWatch.

Coralogix also implements a smarter alerting system that flags anomalies to DevOps engineers in real time.  Conventional alerting systems require DevOps engineers to set alerting thresholds manually. However, as we saw at the start of this article, modern IT is too complex and fast-changing for this approach to work.

Coralogix solves this with dynamic alerts. These use machine learning to adjust thresholds in response to data.  This enables a much more effective approach to anomaly detection, one that is tailored to the DevOps landscape of the 21st century.
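
The underlying idea is simple even if production implementations are not: learn what ‘normal’ looks like from recent data and flag values that stray too far from it. The sketch below uses a rolling mean and standard deviation as the adaptive threshold; it is a generic illustration of dynamic thresholding, not how Coralogix’s dynamic alerts are actually implemented.

```python
from collections import deque
from statistics import mean, stdev

class DynamicThreshold:
    """Flag a value as anomalous if it sits more than `k` standard deviations
    away from the rolling mean of the last `window` observations."""

    def __init__(self, window: int = 60, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def observe(self, value: float) -> bool:
        anomalous = False
        if len(self.history) >= 10:  # need some history before judging
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = sigma > 0 and abs(value - mu) > self.k * sigma
        self.history.append(value)
        return anomalous

detector = DynamicThreshold(window=120, k=3.0)
for errors_per_minute in [2, 3, 1, 2, 4, 2, 3, 2, 1, 3, 2, 48]:
    if detector.observe(errors_per_minute):
        print(f"alert: {errors_per_minute} errors/minute is outside the learned range")
```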

Wrapping Up

The increasing complexity and volumes of data faced by modern DevOps teams mean that humans can no longer handle IT operations without help.  AIOps aims to leverage AI and machine learning with a view to converting high-volume data streams into insights that human engineers can act on.

AIOps fits with Coralogix’s own approach to DevOps, which is to use machine learning to help organizations effectively use the increasing volumes of data they generate.  Observability should be for the many, not just a few.

What is GitOps, Where Did It Come From, and Why Should You Care?

“What is GitOps?” – a question which has seen increasing popularity in Google searches and blog posts over the last three years. If you want to know why, read on.

Quite simply, the coining of GitOps is credited to one individual, and a pretty smart guy at that: Alexis Richardson. He’s so smart that he’s built a multi-award-winning consultancy, Weaveworks, and a bespoke product, Flux, around the GitOps concept.

Through the course of this post, we’re going to explore what GitOps is, its relevance today, why you should be aware of it (and maybe even implement it), and what GitOps and Observability mean together. 

What is GitOps?

The concept is quite simple. A source control repository holds everything required for a deployment, including code, descriptions, instructions, and more. Every time you change what’s in the repository, an engineer can pull the change to alter or update the deployment. The same principle applies to new deployments.

Of course, it’s a little more complex than that. GitOps relies on a single source of truth, automation, and software agents to identify deltas between what is deployed and what is in source control. Then, through more automation, reconcilers, and a tool like Helm, clusters are updated or changes are rolled back, depending on a host of factors.

What’s crucial to understand is that GitOps and containerization, specifically with Kubernetes for orchestration and management, are symbiotic. GitOps has huge relevance for other cloud-native technologies but is revolutionary for Kubernetes.

The popularity of GitOps is closely linked to some of the key emerging trends in technology today: rapid deployment, containerization with Kubernetes, and DevOps.

Is GitOps Better Than DevOps?

GitOps and DevOps aren’t necessarily mutually exclusive. GitOps is a mechanism for developers to be far more immersed in the operations workflow, therefore making DevOps work more effectively. 

On top of that, because it relies on a central repository from which everything can be monitored and logged, it brings security and compliance to the heart of development and operations. Therefore, GitOps is also an enabler of good DevSecOps practices.

The Four Principles of GitOps

Like any methodology, GitOps has some basic principles which define best practice. These four principles, defined by Alexis Richardson, capture the essence of ‘GitOps best practices’:

1. The Entire System Can Be Described Declaratively 

This is a simple principle. It means that your whole system can be described and managed through declarative commands. Therefore, applications, infrastructure, and containers are not only defined in code but also declared with code. All of this is version controlled within your central repository.

2. The Canonical Desired System State Versioned in Git

Building on the first principle, as your system is wholly defined within a single source of truth like Git, you have one place where all changes and declarations are stored. That makes rollbacks (a simple git revert), upgrades, and new deployments seamless. You can make your entire workflow more secure by using SSH keys to sign changes, enforcing your organization’s security requirements.

3. Approved Changes Can Be Automatically Applied

CI/CD with Kubernetes can be difficult. This is largely due to the complexity of kubectl and the need to supply cluster credentials for every system alteration. With the GitOps principles above, the definition of your system is kept in a closed environment. That closed environment can be permissioned so that pulled changes are automatically applied to the system.

4. Software Agents for Correctness and Convergence 

Once the three above principles are applied, this final principle ensures that you’re aware of the delta (or divergence) between what is in your source control repository and what is deployed. When a change is declared, as described in the first principle, these software agents will automatically pick up on any changes in your repository and reconcile your cluster to match. 

When used with your repository, software agents can perform a vast array of functions. They ensure that there are automated fixes in place for outages, act as a QA process for your operations workflow, protect against human error or manual intervention, and can even take self-healing to another level.
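
Conceptually, such an agent is just a loop that keeps comparing the declared state in Git with the observed state of the cluster and closes any gap it finds. The Python sketch below is a deliberately simplified illustration of that loop; read_desired_state_from_git, read_live_state_from_cluster, apply, and delete are placeholders for what tools like Flux do against a real repository and a real Kubernetes API.

```python
import time

def reconcile_forever(poll_interval_seconds: int = 60) -> None:
    """Core GitOps loop: converge the live cluster towards what Git declares."""
    while True:
        desired = read_desired_state_from_git()   # e.g. parsed manifests at HEAD
        live = read_live_state_from_cluster()     # e.g. objects reported by the API server

        # The delta: everything Git declares that is missing or different in the
        # cluster, plus anything in the cluster that Git no longer declares.
        for name, spec in desired.items():
            if live.get(name) != spec:
                apply(name, spec)                 # create or update to match Git
        for name in live.keys() - desired.keys():
            delete(name)                          # prune what Git no longer declares

        time.sleep(poll_interval_seconds)

# Placeholders for the real integrations (Git checkout, Kubernetes API, Helm, ...).
def read_desired_state_from_git() -> dict: ...
def read_live_state_from_cluster() -> dict: ...
def apply(name: str, spec: dict) -> None: ...
def delete(name: str) -> None: ...
```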

GitOps and Scalability

So far, we have examined the foundational aspects of GitOps. These in themselves go some of the way in showing the inherent benefits GitOps has to offer. It brings additional advantages to organizations seeking simple scalability, in more than one way.

Traditionally, monolithic applications with multiple staging environments can take a while to spin up. If this causes delays, scaling horizontally and providing elasticity for critical services becomes very difficult. Deployment pipelines can also be single-use, wasting DevOps engineers’ time. 

Migrating to a Kubernetes-based architecture is a great step in the right direction, but this poses two problems. First, you have to build all-new deployment pipelines, which is time consuming. Second, you have to get the bulk of your DevOps engineers up to speed on Kubernetes administration. 

Scaling Deployment 

In a traditional environment, adding a new application means a new deployment pipeline, creating new repositories, and being responsible for brand new workflows. 

What GitOps brings to the table is simplicity in scalability. All you need to do is write some declarative code in your central repository, and let your software agents take care of the rest.

GitOps gives engineers, and developers in particular, the ability to self-serve when it comes to CI/CD. This means that engineers can employ both automation and scalability in their continuous delivery in much less time and without waiting for operational resources (human or technical).

Scaling Your Team

Because the principles of GitOps rely on a central repository and a pull-to-deploy mentality, it empowers your developers to act as DevOps engineers. 

Whereas you may have previously needed a whole host of Kubernetes administrators, that isn’t the case with GitOps. Version controlled configs for Kubernetes mean it’s possible for a skilled developer to roll out changes instantaneously. 

They can even roll them back again with a simple Git command. Thus, GitOps helps to create true “t-shaped” engineers, which is particularly valuable as your team or organization grows.

GitOps and Observability 

If you’ve read and understood the principles of GitOps above, then it should come as no surprise that observability is a huge consideration when it comes to adopting GitOps. But how does having a truly observable system relate to your GitOps adoption?

GitOps requires you to detail the state of the system that you want, and therefore your system has to be designed in a way that allows it to be truly understood. For example, when changes to a Kubernetes cluster are committed to your central repository, it’s imperative that that cluster remains understood at all times. This ensures that the divergence between the observed and desired cluster state can be measured and acted upon.

This really requires a purpose-built, cloud-native, and fully-wrapped SaaS observability platform. Coralogix’s Kubernetes Operator provides a simple interface between the end-user and a cluster, automatically acknowledging and monitoring the creation and cessation of resources via your code repository. With its ability to map, manage, and monitor a wide variety of individual services, Coralogix is the natural observability partner for an organization anywhere on its GitOps journey.

Summary

Hopefully, you can now answer the question posited at the start of this article. GitOps requires your system to be code-defined, stored in a central repository, and to be cloud-native to allow its pull-to-deploy functionality.

In return, you get a highly scalable, highly elastic, system that empowers your engineers to spend more time developing releases and less time deploying them. With built-in considerations for security and compliance, there’s no doubt that it’s worth adopting. 

However, be mindful that with such free-flowing spinning up and down of resources, an observability tool like Coralogix is a must-have companion to your GitOps endeavor.

DevSecOps vs DevOps: What are the Differences?

The modern technology landscape is ever-changing, with an increasing focus on methodologies and practices. Recently we’re seeing a clash between two of the newer and most popular players: DevOps vs DevSecOps. With new methodologies come new mindsets, approaches, and a change in how organizations run. 

What’s key for you to know, however, is: are they actually different? If so, how? And, perhaps most importantly, what does this mean for you and your development team?

In this piece, we’ll examine the two methodologies and quantify their impact on your engineers.

DevOps: Head in the Clouds

DevOps, the synergizing of Development and Operations, has been around for a few years. Adoption of DevOps principles has been common across organizations large and small, with elite performance through DevOps practices up 20%.

The technology industry is rife with buzzwords, and saying that you ‘do DevOps’ is not enough. It’s key to truly understand the principles of DevOps.

The Principles of DevOps

Development + Operations = DevOps. 

There are widely accepted core principles to ensure a successful DevOps practice. In short, these are: fast and incremental releases, automation (the big one), pipeline building, continuous integration, continuous delivery, continuous monitoring, sharing feedback, version control, and collaboration. 

If we remove the “soft” principles, we’re left with some central themes. Namely, speed and continuity achieved by automation and monitoring. Many DevOps transformation projects have failed because of poor collaboration or feedback sharing. If your team can’t automate everything and monitor effectively, it ain’t DevOps. 

The Pitfalls of DevOps

As above, having the right people with the right hard and soft skills is key for DevOps success. Many organizations have made the mistake of simply rebadging a department, or sending all of their developers on an AWS course and all their infrastructure engineers on a Java course. This doesn’t work – colocation and constant communication (whether in person or via Slack or Trello) are the first enablers in breaking down silos and enabling collaboration.

Not only will this help your staff cross-pollinate their expertise, saving on your training budget, but it also enables an organic and seamless workflow. No two organizations or tech teams are the same, so no “one size fits all” approach can be successfully applied.

DevSecOps: The New Kid On The Block

Some people will tell you that they have been doing DevSecOps for years, and they might be telling the truth. However, DevSecOps as a formal and recognized doctrine is still in its relative infancy. If DevOps is the merging of Development and Operations, then DevSecOps is the meeting of Development, Security, and Operations. 

Like we saw with DevOps adoption, it’s not just as simple as sending all your DevOps engineers on a security course. DevSecOps is more about the knowledge exchange between DevOps and Security, and how Security can permeate the DevOps process. 

When executed properly, the “Sec” shouldn’t be an additional consideration, because it is part of each and every aspect of the pipeline.

What’s all the fuss with DevSecOps?

The industry is trending towards DevSecOps, as security dominates the agenda of every board meeting of every big business. With the average cost of a data breach at $3.86 million, it’s no wonder that organizations are looking for ways to incorporate security at every level of their technology stack.

You might integrate OWASP vulnerability scanning into your build tools, use Istio for application and container-level security and alerting, or just enforce the use of Infrastructure as Code across the board to stamp out human error.

However, DevSecOps isn’t just about baking Security into the DevOps process. By shifting security left in the process, you can avoid compliance hurdles at the end of the pipeline. This ultimately allows you to ship faster. You also minimize the amount of rapid patching you have to do post-release, because your software is secure by design.

As pointed out earlier, DevOps is already a successful methodology. Is it too much of a leap to enhance this already intimidating concept with security as well? 

DevOps vs DevSecOps: The Gloves Are Off

What is the difference between DevOps and DevSecOps? The simple truth is that in the battle royale of DevOps vs DevSecOps, the latter, newer, more secure contender wins. Not only does it make security more policy-driven, more agile, and more enveloping, it also bridges organizational silos that are harmful to your overall SDLC.

The key to getting DevSecOps right lies in two simple principles – automate everything and have all-encompassing monitoring and alerting. The reason for this is simple – automation works well when it’s well-constructed, but it still relies on a trigger or preceding action to prompt the next function.

Every single one of TechBeacon’s 6 DevSecOps best practices relies on solid monitoring and alerting – doesn’t that say a lot?

Coralogix: Who You Want In Your Corner

Engineered to support DevSecOps best practices, Coralogix is the ideal partner for helping you put security at the center of everything.

The Alerts API allows you to feed ML-driven DevOps alerts straight into your workflows, enabling you to automate more efficient responses and even detect nefarious activity faster. Easy-to-query log data, combined with automated benchmark reports, ensures you’re always on top of your system health. Automated Threat Detection turns your web logs into part of your security stack.

With battle-tested software and a team of experts servicing some of the largest companies in the world, you can rely on Coralogix to keep your guard up.

How to Implement Effective DevOps Change Management

A decade ago, DevOps teams were slow, lumbering behemoths with little automation and lots of manual review processes. As explained in the 2020 State of DevOps Report, new software releases were rare but required all hands on deck. Now, DevOps teams embrace Agile workflows and automation. They release often, with relatively few changes in each release. High-quality DevOps change management is no longer a nice-to-have; it’s a must.

For a lot of DevOps teams, this is easier said than done. For the State of DevOps Report, over 2000 DevOps professionals were surveyed on topics including the implementation of change management.

The survey findings uncovered four main change management styles, with stark contrasts in their characteristics and effectiveness. Successful DevOps teams were three times more likely than others to have superior change management.

We’ve divided those styles into the good, the bad, and the ugly. Let’s take a look.

The Good

 First, we’ll look at two approaches that both exemplify high-quality change management by putting the emphasis on engineering.

Engineering-Focused

DevOps teams at cutting edge technology companies tend to implement engineering-driven change management. The stand-out feature of this strategy is that it leverages automation to facilitate a rapid deployment cycle.

Changes as Code

Rather than using traditional documentation to record software changes, engineering-driven DevOps tries to express as much as possible in code.

For example, infrastructure as code enables DevOps teams to apply the same principles to application infrastructure that they would to software. They can take full advantage of Git-style versioning, central repositories to push their changes to, and Continuous Integration.
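As a rough sketch of what ‘changes as code’ can look like, imagine describing infrastructure as a declarative structure that lives in Git, with a CI step that computes what applying a proposed change would actually do. The resource shapes and the diffing logic below are simplified stand-ins for what real tools such as Terraform, Pulumi, or Chef Infra provide.

```python
"""Sketch: infrastructure described as code, with a CI 'plan' step.

The resource shapes and diffing logic are simplified stand-ins for what real
infrastructure-as-code tools do.
"""

# Desired state lives in Git, so every infrastructure change is a commit that
# can be reviewed, versioned, and run through Continuous Integration.
DESIRED = {
    "web-frontend": {"type": "service", "replicas": 4, "image": "frontend:1.8.2"},
    "ledger-db": {"type": "database", "encrypted": True, "replicas": 2},
}

# In reality this would be queried from the cloud provider or cluster API.
CURRENT = {
    "web-frontend": {"type": "service", "replicas": 2, "image": "frontend:1.8.1"},
    "reports-cache": {"type": "cache", "size_gb": 8},
}

def plan(desired, current):
    """Summarize what applying the change would create, update, or delete."""
    actions = []
    for name, spec in desired.items():
        if name not in current:
            actions.append(f"CREATE {name}: {spec}")
        elif current[name] != spec:
            actions.append(f"UPDATE {name}: {current[name]} -> {spec}")
    for name in current:
        if name not in desired:
            actions.append(f"DELETE {name}")
    return actions

if __name__ == "__main__":
    for action in plan(DESIRED, CURRENT):
        print(action)
```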

As well as harnessing the full power of automation in the CI/CD cycle, changes as code allow DevOps teams to sidestep ITIL review processes. Because traditional review processes are manual and labor-intensive, cutting them out of the loop is like taking the brakes off CI/CD.

This investment in automation pays off.  The State of DevOps Report found that 75% of companies surveyed had efficiency that was either high or very high and 67% had high or very high implementation success.

Operationally Mature

DevOps teams at large, established companies face a different set of challenges than those at smaller startup companies. These teams often work in traditional sectors, like energy, which came relatively late to the digital party.

Because these sectors have a mature customer base and strong regulatory requirements, DevOps teams must strike a balance between managing rapid change and taking as few risks as possible. Operationally mature change management pairs high levels of automation with heavy use of manual, established review processes.

The State of DevOps Report found that while operationally mature companies score low in efficiency (68% had very low efficiency), they score much higher in performance sentiment: 73% of respondents rated it high or very high.

While traditional approval processes cause reductions in efficiency, they’re generally considered necessary in tightly regulated sectors such as finance. Later we’ll see how automated platforms are designed to ensure ‘Continuous Compliance’, meaning the work of an approval committee can now be done with code.
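To illustrate the idea, here is a hypothetical ‘compliance as code’ gate: instead of a weekly approval board, a script checks that each change record carries the evidence the board would have asked for. The change-record fields are invented for illustration; real platforms define their own.

```python
"""Sketch: 'Continuous Compliance' as a pipeline gate instead of a committee.

The change-record fields below are invented for illustration.
"""

REQUIRED_EVIDENCE = ("ticket", "rollback_plan", "tests_passed", "peer_reviewed")

def approve(change):
    """Return (approved, missing_evidence) for a single change record."""
    missing = [field for field in REQUIRED_EVIDENCE if not change.get(field)]
    return (not missing, missing)

if __name__ == "__main__":
    change = {
        "id": "CHG-1042",
        "ticket": "FIN-881",
        "rollback_plan": "redeploy the previous image tag",
        "tests_passed": True,
        "peer_reviewed": False,   # must be true before the deployment proceeds
    }
    ok, missing = approve(change)
    print(f"{change['id']}: {'approved' if ok else 'blocked, missing: ' + ', '.join(missing)}")
```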

The Bad

Ad Hoc

This is how not to do change management. Ad hoc change management is typically used by DevOps teams in small companies that haven’t quite ‘got with the program’ of 21st century DevOps. In contrast to engineering-focused companies, they have very little automation. And in contrast to operationally mature companies, they have few standardized approval processes.

These shortcomings, both in automation and in formal processes, lead to low performance sentiment. The State of DevOps Report found that 80% of respondents doing ad hoc change management had low or very low performance sentiment.

The Ugly…

Governance-Focused

This approach tends to be used at large companies that have large revenues and employee head counts.  The report found that a third of companies surveyed had revenues above $1 billion. Nearly half had over 5,000 staff on their payroll.

As the name implies, governance-focused change management puts the majority of the emphasis on traditional approval and manual review processes.  Automation is often deemphasized.  This results in inefficiency and low morale. The State of DevOps Report found that 70% of respondents had low or very low performance sentiment and 62% had low or very low implementation success.

The Trouble with Toil

Change management in DevOps is a forest full of traps. One of these traps is toil. Identified by Googler Vivek Rau, toil is “the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows.”

Examples of toil can include responding to pager alerts and manually running a script every time you want to deploy to a cluster.  Toil is work that has to be done, again and again, simply to keep a service up and running.

In many ways, toil is a degenerate form of change management. It’s reactive, involving little strategy. It doesn’t add value, tending to leave the system in the same state it was in before. Typically, toil results from poor system design and lack of appropriate automation.

The State of DevOps Report found that companies employing each of the four DevOps change management styles had different levels of toil. 

Engineering-focused companies had the lowest toil, a statistic that makes sense given their commitment to automation. These were followed by ad hoc and governance-focused DevOps teams. The worst offenders were operationally mature companies. 63% of respondents admitted that toil took up 30% or more of their time.  This correlates with their relatively high emphasis on manual review and traditional approval processes.

Toil is detrimental to a healthy DevOps culture because it takes away from engineering time. Because toil grows linearly with service size, there’s the danger that it could end up consuming 100% of DevOps engineers’ work hours. Not only is this detrimental for mission-critical workloads, but it can also sap DevOps team morale and discourage new hires from staying with the team.

Smart Change Management with Automation

The best way to minimize toil is the intelligent use of automation platforms. Google is one of the leaders in automating DevOps change management: the global scale of its systems, coupled with the sheer volume of traffic and scope for failure, has led it to develop unique solutions such as running MySQL on Borg, its cluster management system.

These days, DevOps teams have many choices when it comes to DevOps automation platforms. Below are two of the most popular platforms.

Puppet

With customers including RBS and Staples, Puppet is the industry standard for automation in DevOps change management. It offers many solutions tailored to a DevOps team’s needs.

Puppet Enterprise is an integrated DevOps platform that can handle a range of change management processes such as Continuous Delivery, Patch Management, and Continuous Compliance.

Puppet Forge enables teams to automate DevOps workflows and innovate quickly through the use of standardized components. It’s essentially a communal repository of Puppet modules. Each module performs a specific function, such as installing MySQL. Teams can use these modules as building blocks to create customized DevOps procedures. This allows engineers to bring the powerful principles of code reuse and DRY (Don’t Repeat Yourself) to DevOps.

Bolt is an open-source solution that is optimized for automating DevOps workflows. With it, teams can update their system, troubleshoot servers, and deploy applications.

Chef

Chef is used by customers like IBM, Facebook, and Capital One. It’s adapted for use in a highly regulated DevOps environment. This makes it particularly useful for large operationally mature and governance-focused companies, such as those in the financial sector.

The essence of Chef is its Chef Enterprise Automation Stack, a set of technologies that automates a company’s DevOps processes. It contains several platforms, each tailored for a different facet of DevOps change management. Let’s look at two of them.

Chef Infra automates infrastructure management by implementing the infrastructure as code paradigm. This makes your application easily scalable and portable, as well as encouraging best practices such as Test-Driven Development. Security is baked into Chef Infra through its agent-based approach, making it attractive for financial organizations.

Chef Habitat handles the configuration, management, and behavior of your app.  It’s technology agnostic, so developers don’t have to worry about infrastructural details when deploying new releases. It’s well-suited to applications with plenty of dependencies and frequent updating requirements.

People Power and DevOps Change Management

While automation is important, we can’t lose sight of the crucial role played by a well-functioning DevOps team. According to the State of DevOps Report, organizations that keep their employees in the change management loop are five times more effective than those that don’t.

Automation requires employee involvement. Organizations implementing automation need to involve their team members through feedback loops.

Wrapping Up

Automation is essential to innovation in DevOps change management. As 2020’s State of DevOps Report points out, it builds team confidence in the change management process and minimizes toil, allowing developers to focus on creative engineering projects. 

Whatever the size of your DevOps team or organization, embracing automation and keeping your personnel in the loop is the key to change management success.

Where is Your Next Release Bottleneck? 

A typical modern DevOps pipeline includes eight major stages, and unfortunately, a release bottleneck can appear at any point:

[Figure: the eight major stages of a DevOps pipeline]

Bottlenecks slow down productivity and limit a company’s ability to progress. They can also damage its reputation, especially if a bug fix needs to be deployed into production immediately.

This article will cover three key ways that data gathered from your DevOps pipeline can help you find and alleviate bottlenecks.

1. Increasing the velocity of your team

To improve velocity in DevOps, it’s important to understand the end-to-end application delivery lifecycle to map the time and effort in each phase. This mapping is performed using the data pulled directly and continuously from the tools and systems in the delivery lifecycle. It helps detect and eliminate bottlenecks and ‘waste’ in the overall system. 

Teams gather data and metrics from build, configuration, deployment, and release tools. This data can include release details, the duration of each phase, whether each phase succeeded, and more. However, no single one of these tools paints the whole picture.

By analyzing and monitoring this data in aggregate, DevOps teams benefit from an actionable view of the end-to-end application delivery value stream, both in real-time and historically. This data can be used to streamline or eliminate the bottlenecks that are slowing the next release down and also enable continuous improvement in delivery cycle time.
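As a simple illustration of what analyzing this data in aggregate can mean, here is a sketch that merges per-stage timing records pulled from different tools and flags the slowest stage as the likely bottleneck. The record layout is invented; in practice it would come from your build, deploy, and release tooling.

```python
"""Sketch: aggregate per-stage pipeline timings to locate a bottleneck.

The records below are invented; real data would be exported from build,
deployment, and release tools.
"""
from collections import defaultdict
from statistics import median

# Each record: (release_id, stage, duration_seconds), pulled from different tools.
RECORDS = [
    ("r101", "build", 240), ("r101", "test", 1900), ("r101", "deploy", 300),
    ("r102", "build", 250), ("r102", "test", 2100), ("r102", "deploy", 280),
    ("r103", "build", 230), ("r103", "test", 2400), ("r103", "deploy", 310),
]

def stage_medians(records):
    by_stage = defaultdict(list)
    for _, stage, seconds in records:
        by_stage[stage].append(seconds)
    return {stage: median(durations) for stage, durations in by_stage.items()}

if __name__ == "__main__":
    medians = stage_medians(RECORDS)
    for stage, seconds in sorted(medians.items(), key=lambda kv: -kv[1]):
        print(f"{stage:>7}: median {seconds / 60:.1f} min")
    bottleneck = max(medians, key=medians.get)
    print(f"Likely bottleneck: the '{bottleneck}' stage")
```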

2. Improving the quality of your code

Analyzing data from the testing performed for a release enables DevOps teams to see the quality issues in new releases and remediate them before they are implemented into production, ideally preventing post-implementation fixes.

In modern DevOps environments, most (if not all) of the testing process is achieved with automated testing tools. Different tools are usually used for ‘white box’ testing versus ‘black box’ testing. While the former aims to cover code security, dependencies, comments, policy, quality, and compliance testing, the latter covers functionality, regression, performance, resilience, penetration testing, and meta-analysis like code coverage and test duration.

Again, none of these tools paints the whole picture, but analyzing the aggregate data enables DevOps teams to make faster and better decisions about overall application quality, even across multiple QA teams and tools. This data can even be fed into further automation (a minimal sketch follows the list). For example:

  • Notify the development team of failing code which breaks the latest build. 
  • Send a new code review notification to a different developer or team.
  • Push the fix forward into UAT/pre-prod.
  • Implement the fix into production.
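Here is a hedged sketch of that kind of automation: a go/no-go gate built on aggregated test results, where each outcome dispatches one of the actions above. The result format and the notify/promote steps are placeholders to be wired up to your own CI, chat, and deployment tooling.

```python
"""Sketch: turn aggregated test results into automated follow-up actions.

The result format and the notify/promote steps are placeholders.
"""

TEST_RESULTS = [
    {"suite": "unit", "passed": 412, "failed": 0},
    {"suite": "integration", "passed": 88, "failed": 2, "owner_team": "payments"},
    {"suite": "performance", "passed": 12, "failed": 0},
]

def notify(team, message):
    print(f"[notify:{team}] {message}")          # placeholder for a chat or webhook call

def promote(environment):
    print(f"[deploy] promoting build to {environment}")   # placeholder deployment trigger

def process(results):
    failing = [r for r in results if r.get("failed", 0) > 0]
    if failing:
        for r in failing:
            notify(r.get("owner_team", "dev"),
                   f"{r['failed']} failure(s) in the {r['suite']} suite broke the latest build")
    else:
        promote("uat")   # everything green: push the candidate forward automatically

if __name__ == "__main__":
    process(TEST_RESULTS)
```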

There are some great static analysis tools that can be integrated into the developers’ pipeline to validate the quality, security, and unit test coverage of the code before it even gets to the testers. 

This ability to ‘shift left’ to the Test stage in the DevOps loop (to find and prevent defects early) enables rapid go/no-go decisions based on real-world data. It also dramatically improves the quality of the code that is implemented into production by ensuring that failing or poor-quality code is fixed or improved, thereby reducing ‘defect’ bottlenecks in the codebase.

3. Focusing in on your market

Data analytics from real-world customer experience enables DevOps teams to reliably connect application delivery with business goals. While technical teams need data on timings like website page rendering speeds, the business needs data on the impact of new releases.

This includes metrics like new users versus closed accounts, completed sales versus items abandoned in the shopping cart, and revenue. No single source provides a complete view of this data, as it’s scattered across multiple applications, middleware, web servers, mobile devices, APIs, and more.
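A hedged sketch of what connecting releases to business data might look like: compare a business metric, pulled from whichever systems hold it, in the windows before and after a release. The metric values, release timing, and alert threshold are all invented for illustration.

```python
"""Sketch: compare a business metric before and after a release.

The values and threshold are invented; in practice the data is scattered
across applications, web servers, APIs, and analytics tools.
"""
from statistics import mean

# Hourly completed-checkout counts in the windows before and after a release.
BEFORE = [118, 124, 121, 130, 127, 119]
AFTER = [96, 101, 94, 99, 103, 97]
ALERT_THRESHOLD = 0.10   # flag drops of more than 10%

def relative_change(before, after):
    return (mean(after) - mean(before)) / mean(before)

if __name__ == "__main__":
    change = relative_change(BEFORE, AFTER)
    print(f"Completed checkouts changed by {change:+.1%} after the release")
    if change < -ALERT_THRESHOLD:
        print("Business-impact regression: consider rolling back and failing fast on this release")
```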

Fail Fast, Fail Small, Fail Cheap!

Analyzing and monitoring the aggregate data to generate business-relevant impact metrics enables DevOps teams to:

  • Innovate in increments 
  • Try new things
  • Inspect the results
  • Compare with business goals
  • Iterate quickly

Blocking low-quality releases can ultimately help contain several bottlenecks that are likely to crop up further along in the pipeline. This is the key behind ‘fail fast, fail small, fail cheap’, a core principle behind successful innovation.

Bottlenecks can appear anywhere in a release pipeline. Individually, many different tools each paint part of the release picture, but only by analyzing and monitoring the data from all of these tools together do you get a full, end-to-end, data-driven view that can help eliminate them. This improves the overall effectiveness of DevOps teams by enabling increased velocity, improved code quality, and increased business impact.

Are your customers catching production problems 🔥 before you do?

Availability and quality are the biggest differentiators when people opt for a service or product today. You should be aware of the impact of your customers alerting you to your own problems, as well as how to stop this from becoming the norm. To make sure you don’t become an organization known for its bugs, you need to understand the organizational changes required to deliver a stable service. If, as Capers Jones tells us, at best around 85% of bugs are caught pre-release, it’s important to differentiate yourself with the service you provide.

The Problem

It’s simple to understand why you don’t want unknown bugs going out to your customers in a release. But to truly appreciate the scale of the problem, we need to define the impact of shipping problematic code before we look at the solutions.

Problem 1: Your customers’ perception

No one wants to buy a faulty product. You open yourself up to reputational risks – poor reviews, client churn, lack of credibility – when you get a name for having buggy releases. This has three very tangible costs to your business. First, your customers will cease to use your product or service. Second, any new customers will become aware of your pitfalls sooner or later. Lastly, it can have a negative impact on staff morale and direction, and you’ll run the risk of losing your key people.

Problem 2: The road to recovery

Once a customer makes you aware of a bug, you don’t have a choice but to fix it (or you’ll be faced with the problem above). The cost of doing this post-production is only amplified by the time it takes to detect the problem, or MTTD (mean time to detect). In the 2019 State of DevOps Report, surveyed “Elite” performing businesses took on average one hour or less to deliver a service restoration or fix a bug, against up to one month for “Low” performing businesses in the same survey. The problem compounds with time: the longer it takes to detect the problem, the more time it takes for your developers to isolate, troubleshoot, fix, and then patch. Of all those surveyed in the 2019 State of DevOps Report, the top performers were at least twice as likely to exceed their organizational SLAs for feature fixes.
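To make MTTD (and its close cousin, the time to restore service) concrete, here is a small sketch that computes both from a list of incident timestamps. The incident data is invented for illustration.

```python
"""Sketch: compute mean time to detect (MTTD) and mean time to restore (MTTR)
from incident records. The incidents below are invented."""
from datetime import datetime
from statistics import mean

INCIDENTS = [
    # (introduced, detected, restored)
    (datetime(2023, 3, 1, 9, 0), datetime(2023, 3, 1, 9, 40), datetime(2023, 3, 1, 10, 5)),
    (datetime(2023, 3, 8, 14, 0), datetime(2023, 3, 9, 8, 30), datetime(2023, 3, 9, 11, 0)),
    (datetime(2023, 3, 20, 22, 0), datetime(2023, 3, 20, 22, 10), datetime(2023, 3, 20, 22, 35)),
]

def mean_hours(deltas):
    return mean(d.total_seconds() for d in deltas) / 3600

if __name__ == "__main__":
    mttd = mean_hours([detected - introduced for introduced, detected, _ in INCIDENTS])
    mttr = mean_hours([restored - detected for _, detected, restored in INCIDENTS])
    print(f"MTTD: {mttd:.1f} hours, MTTR: {mttr:.1f} hours")
    # The longer detection takes, the longer isolating, troubleshooting, and patching take too.
```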

Problem 3: Releasing known errors

Releases often go out to customers with “known errors” in them. These are errors that have been assessed to have little impact, or only occur under highly specific conditions, and therefore are unlikely to affect the general release. However, this is just the coding and troubleshooting you’ll have to do later down the line, because you wanted to make a release on time. This notion of technical debt isn’t something new, but with many tech companies doing many releases per day, the compounded work that goes into managing known errors is significant. 

The Solution

Organizations can deliver more stable releases to their customers without an enormous amount of upheaval. The research points to a number of practices that can greatly enhance your stability while keeping your overheads down.

Solution 1: What your team should be doing

Revisiting the 2019 State of DevOps Report, we can see that delivering fast fixes (and keeping MTTD low) depends on two important factors within your team.

Test automation is increasingly viewed as a “golden record” when it comes to validating code for release. It positively impacts continuous integration and continuous deployment, and where deployment is automated, bugs are caught and remediated far faster, before they ever reach customers.

“With automated testing, developers gain confidence that a failure in a test suite denotes an actual failure just as much as a test suite passing successfully means it can be successfully deployed.”

Solution 2: Where you can look for help

The State of DevOps Report also tells us that we can’t expect our developers to monitor their own code by hand while also using the outputs of infrastructure and code monitoring to make decisions.

This is where Coralogix comes in. Coralogix’s unified UI pools log data from applications, infrastructure, and networks into one simple view. This helps your developers better understand the impact of releases on your system and spot bugs early on. Both are critical in reducing the time it takes to detect and repair problems, which leads to direct savings for your organization.

Coralogix also provides advanced solutions to flag “known errors”, so that if they do go out for release, they aren’t just consigned to a future fix pile. By stopping known errors from slipping through the cracks, you are actively minimizing your technical debt whilst increasing your dev team’s efficiency.

Lastly, Loggregation uses machine learning to benchmark how your code behaves, building an intelligent baseline that identifies errors and anomalies faster than anyone – even the most eagle-eyed of customers.

How DevOps Monitoring Impacts Your Organization

DevOps monitoring didn’t simply become part of the collective engineering consciousness. It was built, brick by brick, by practices that have continued to grow and flourish with each new technological innovation.

Have you ever been forced to sit back in your chair, your phone buzzing incessantly, SSH windows and half-written commands dashing across your screen, and admit that you’re completely stumped? Nothing is behaving as it should and your investigations have been utterly fruitless. For a long time, this was an intrinsic part of the software experience. Deciphering the subtle clues left behind in corrupted log files and overloaded servers became a black art that turned many away from the profession.

DevOps observability is still evolving, and different teams will score how observable a system is in different ways. Within DevOps monitoring, though, three capabilities have proven themselves key in every organization: monitoring, logging, and alerting.

By optimizing for these capabilities, we unlock a complex network of software engineering successes that can change everything from our attitude to our risk exposure to our deployment process.

DevOps Monitoring – What is it?

Along the walls of any modern engineering office, televisions and monitors flick between graphs of a million different variations. DevOps Monitoring is something of an overloaded term, but here we will use it to describe the human-readable rendering of system measurements.

In a prior world, it would have been enough to show a single graph, perhaps accompanied by a few key metrics. These measurements would give engineers and support staff the necessary overview of system health. Back then, all of the complexity was baked into a single application, so the DevOps monitoring burden was lighter. Then, microservices became the new hotness.

We could scale individual components of our system independently, giving us flexible, intelligent performance. Builds and deployments were less risky because each change impacted only a fraction of our system. Alas, simply seeing what was going on became an exponentially more difficult challenge. Faced with this new problem, our measurements needed to become far more sophisticated.

If you have five services, there are now five network hops that need to be monitored. Are they sufficiently encrypted? Are they running slowly? Perhaps one of the services responds with an error for 0.05 seconds while another service that it depends on restarts. Maybe a certificate is about to expire. A handful of high-level measurements isn’t going to lead your investigating engineers to the truth when so much more is going on.

Likewise, if a single request passes through 10 different services on its journey, how do we track it? We call this property traceability and it is an essential component of your monitoring stack.

[Figure: a distributed trace visualized in Jaeger]

Tools such as Jaeger (pictured above) provide a view into this. It takes some work from engineers to make sure that specific values are passed between requests, but the payoff is huge. Tracking a single request and seeing which components are slowing it down gives you an immediate tool to investigate your issues.
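What ‘passing specific values between requests’ usually means is propagating a trace or correlation ID with every call, so that each service’s logs and spans can be stitched back together. Here is a minimal, framework-free sketch of the idea; the header name and the in-process ‘services’ are illustrative, and real systems would use instrumentation libraries and actual HTTP or RPC calls.

```python
"""Sketch: propagating a correlation/trace ID across service calls.

The header name and in-process 'services' are illustrative only.
"""
import uuid

TRACE_HEADER = "X-Trace-Id"   # illustrative header name

def log(service, headers, message):
    # Every log line carries the trace ID, so one request can be followed end to end.
    print(f"trace={headers[TRACE_HEADER]} service={service} {message}")

def billing_service(headers):
    log("billing", headers, "charging card")
    return "charged"

def checkout_service(headers):
    log("checkout", headers, "received order")
    result = billing_service(headers)   # the same headers (and trace ID) travel downstream
    log("checkout", headers, f"billing returned '{result}'")
    return "order confirmed"

if __name__ == "__main__":
    incoming = {TRACE_HEADER: str(uuid.uuid4())}   # normally generated at the edge if absent
    checkout_service(incoming)
```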

The DORA State of DevOps 2019 report even visualized the statistical relationship between monitoring (capturing concepts we cover here) and both productivity and software delivery performance. It is clear from this diagram that good DevOps monitoring actually trends with improved organizational performance.

[Figure: the DORA State of DevOps 2019 model relating monitoring to productivity and software delivery performance]

Alas, monitoring rests on a single assumption: that when the graphs change, someone is there to see them. This isn’t always the case, and as your system scales, it becomes uneconomical for your engineers to sit and stare at dashboards all day. The solution is simple: information needs to leap out and grab our attention.

Alerting

A great DevOps alerting solution isn’t just about sounding the alarms when something goes wrong. Alerts come in many different flavors, and not all of them need to be picked up by a human being. Think of alerting as more of an immune system for your architecture. Monitoring will show the symptoms, but the alerts will begin the processes that fight off the pathogen. HTTP status codes, round-trip latencies, unexpected responses from external APIs, sudden user spikes, malformed data, queue depths… they can all be measured according to risk, alerted on, and dealt with.

Google splits alerts up into three different categories – ticket, alert and page. They increase in severity, but the naming is quite specific. A ticket-level alert might simply add a GitHub issue to a repository, or create a ticket on a Trello board somewhere. Taking that a step further, it might trigger bots that will automatically resolve and close the issue, reducing the overall human toil and freeing up your engineers to focus on value creation.
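A hypothetical sketch of that severity split: route each alert to a ticket, a chat alert, or a page depending on its category. The three destinations are placeholders for real integrations such as an issue tracker, a chat webhook, or a paging service; only the routing pattern is the point.

```python
"""Sketch: routing alerts by severity into tickets, chat alerts, or pages.

The destinations are placeholders for real integrations.
"""
from enum import Enum

class Severity(Enum):
    TICKET = 1   # can wait: file it and let a bot or a human pick it up later
    ALERT = 2    # needs attention soon: surface it in the team's channel
    PAGE = 3     # wake someone up

def route(name, severity):
    if severity is Severity.TICKET:
        print(f"[tracker] opened issue: {name}")
    elif severity is Severity.ALERT:
        print(f"[chat] posted alert: {name}")
    else:
        print(f"[pager] paging the on-call engineer: {name}")

if __name__ == "__main__":
    route("certificate expires in 20 days", Severity.TICKET)
    route("error rate above 2% for 10 minutes", Severity.ALERT)
    route("checkout service is down", Severity.PAGE)
```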

In the open-source world, Alertmanager offers a straightforward integration with Prometheus for alerting. It hooks into many existing alerting solutions, such as PagerDuty. As part of a package, organizations like Coralogix offer robust and sophisticated alerting alongside log collection, monitoring, and machine learning analytics, so you can remain part of one ecosystem.


Good alerts will speed up your “mean time to recovery”, one of the four key metrics that trend with organizational performance. The quicker you know where the problem is, the sooner you can get to work. When you couple this with a sophisticated set of monitoring tools that enables traceability across your services, and with efficient, consistent logs that give you fine-grained information about your system’s behavior, you can quickly pin down the root cause and deploy a fix. When your fix is deployed, you can even watch those alerts resolve themselves. The ultimate stamp of a successful patch.

Logging

When we hear the word “logging”, we are immediately transported to its cousin, the “logfile”. A text file, housed in an obscure folder on an application server, filling up disk space. This is logging as it was for years, but software has laid down greater challenges, and logging has answered with vigor. Logs now make up the source of much of our monitoring, metrics, and health checks. They are a fundamental cornerstone of observability. A spike in HTTP 500 errors may tell you that something is wrong, but it is the logs that will tell you the exact line of broken code.

As with DevOps monitoring, modern architectures require modern solutions. The increased complexity of microservices and serverless means that basic features, such as log collection, are now non-negotiable. Machine learning analysis of log content is now beyond its infancy. As the complexity of your system increases, so too does the operational burden. A DevOps monitoring tool like Coralogix, with its automatic classification of logs into common templates and its version benchmarks, makes it possible to see immediately whether a change or version release was the cause of an otherwise elusive bug.
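The core trick behind clustering logs into ‘common templates’ is masking the variable parts of each message so that structurally identical lines group together. The regex-based masking below is a deliberately simplified, illustrative stand-in for the machine learning that production tools apply to the same problem.

```python
"""Sketch: grouping log lines into templates by masking variable fields.

A toy, regex-based stand-in for ML-driven log clustering.
"""
import re
from collections import Counter

LOGS = [
    "user 1041 failed login from 10.2.3.4",
    "user 2203 failed login from 10.9.8.7",
    "payment 9f3a timed out after 3000 ms",
    "payment 77bc timed out after 5000 ms",
    "user 1041 failed login from 10.2.3.4",
]

def template(line):
    line = re.sub(r"\b\d+\.\d+\.\d+\.\d+\b", "<ip>", line)                  # IP addresses
    line = re.sub(r"\b(?=[0-9a-f]*[a-f])[0-9a-f]{4,}\b", "<hex>", line)     # hex-like IDs
    line = re.sub(r"\b\d+\b", "<num>", line)                                # remaining numbers
    return line

if __name__ == "__main__":
    counts = Counter(template(line) for line in LOGS)
    for tmpl, count in counts.most_common():
        print(f"{count:>3}  {tmpl}")
```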

[Figure: an anomaly alert notification]

Conclusion

Monitoring is a simple concept to understand, but a difficult capability to master. The key is to decide the level of sophistication that your organization needs and to be ready to iterate when those needs change. By understanding what your system is doing and building up those layers of sophistication, you’ll be able to know, at a glance, if something is wrong.

Combined with alerts, you’ve got a system that tells you when it is experiencing problems and, in some cases, will run its own procedures to solve the problem itself. This level of confidence, automation, and DevOps monitoring changes attitudes and has been shown, time and time again, to directly improve organizational performance.

Over the course of this article, we’ve covered each of these capabilities and how they complement the monitoring of your system. Now, primed with the knowledge and possibilities that each of these solutions offers, you can begin to look inward and assess how your system behaves, where the flaws are, and where you can improve. It is a fascinating journey, and when done successfully, it will pay dividends.