What We Learned About Enterprise Cloud Services From the 2021 Azure Outage

  • Tom Russell
  • June 30, 2021

Azure, AWS, and GCP cloud services are invaluable to their enterprise customers. When a provider like Microsoft is hit with DNS issues or other errors that lead to downtime, the ramifications for its users are huge. The recent Azure cloud services outage was a good example of that.

In this post, we’ll look at that outage and examine what it can teach us about enterprise cloud services and how we can reduce risk for our own applications. 

The risks of single-supplier reliance and vendor lock-in

Cloud services have gone from cutting-edge to a workplace essential in less than two decades, and the providers of those cloud services have become vital to business continuity.

Microsoft, Amazon, and Google are known as the Big 3 when it comes to cloud services. Their platforms are no longer seen as optional. Rather, they’re the tools that make modern enterprise possible. Whether a business uses them for simple external storage or an entire IaaS, losing access can be catastrophic.

Reliance on a single cloud provider has left many businesses vulnerable. Any disruption or downtime at a Big 3 cloud services provider can be a major event from which an organization may not recover. Vendor lock-in is compromising data security for many companies.

It’s not difficult to see why many enterprises, both SMEs and blue-chip, are turning to 3rd party platforms to free themselves from the risks of Big 3 reliance and vendor lock-in.

What is the most reliable cloud service vendor?

While the capabilities enabled by cloud computing have revolutionized what is possible for businesses in the 21st century, it’s not a stretch to say that we’ve now reached a point of reliance on them. Nothing is too big to fail. No matter which of the Big 3 hosts your business-critical functions, a contingency plan for their failure should always be based on when rather than if.

The ‘Big 3’ cloud providers (Microsoft with Azure, Amazon with AWS, and Google’s GCP) each support so many businesses that any service disruption causes economic ripples that are felt at a global level. None of them is immune to disruption or outages.

Many business leaders see this risk. The issue they face isn’t deciding whether or not to mitigate it, but finding an alternative to the functions and hosted services their business cannot operate without.

Once they find a trusted 3rd party platform that can fulfill these capabilities (or, in many cases, exceed them) the decision to reinvest becomes an easy one to make. If reliability is your key concern, a 3rd party platform built across the entire public cloud ecosystem (bypassing reliance on any single service) is the only logical choice.

Creating resilience-focused infrastructure with a hybrid cloud solution

Hybrid cloud infrastructure is one solution to vendor lock-in, and it can vastly increase the resilience of your systems.

By segmenting your infrastructure and keeping core business-critical functions in a private cloud environment, you reduce your vulnerability when one of the Big 3 public cloud providers experiences an outage.

Azure, AWS, and GCP each offer highly valuable services to give your organization a competitive edge. With a 3rd party hybrid solution, these public cloud functions can be employed without leaving your entire infrastructure at risk during provider-wide downtime. 
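
To make the idea concrete, here is a minimal Python sketch of how an application might choose between a public cloud backend and a private fallback. It is an illustration only, not how any particular hybrid platform works. The `Backend` type, the `pick_backend` function, and the backend names are all hypothetical, and the health probes are simple stand-ins for whatever real checks your environment supports.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Backend:
    """A hypothetical deployment target and a caller-supplied health probe."""
    name: str
    is_healthy: Callable[[], bool]


def pick_backend(backends: Sequence[Backend]) -> Optional[Backend]:
    """Return the first backend, in priority order, whose probe reports healthy.

    Returns None when no backend is healthy.
    """
    for backend in backends:
        try:
            if backend.is_healthy():
                return backend
        except Exception:
            # A probe that raises is treated the same as an unhealthy backend.
            continue
    return None


if __name__ == "__main__":
    # Simulated scenario: the public provider is down, the private cloud is up.
    backends = [
        Backend("public-cloud-a", lambda: False),
        Backend("private-cloud", lambda: True),
    ]
    chosen = pick_backend(backends)
    print(chosen.name if chosen else "no healthy backend")
```

The ordering of the list encodes the priority, so a business-critical function could list its private environment first and a public provider second, or the reverse, depending on where it normally runs.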

When the cloud fails – the 2021 Azure outages

The risk of provider-wide downtime was demonstrated in 2021 by a string of service-wide Azure outages. The largest of these occurred on April 1st, 2021, when a surge in DNS requests triggered a previously unknown code defect in Microsoft’s internal DNS service. Services like Azure Portal, Azure Services, Dynamics 365, and even Xbox Live were inaccessible for nearly an hour.

Whilst almost everyone knows the name Microsoft, many people outside of IT and the cloud have never heard of Azure. The only reason the Azure outage reached the attention of non-IT-focused media was the impact on common consumer services like Microsoft Office, Xbox Live services, Outlook, and OneDrive. An hour without these Microsoft home-user mainstays was frustrating for users and damaging for the Microsoft brand, but hardly a cause for alarm.

For Microsoft’s business customers, however, an hour without Azure-backed functionality had a massive impact. It may not seem like a long time, but for many Azure business and enterprise customers handling high data volumes, an hour without service is a huge disruption to business continuity.

Businesses affected were suddenly all too aware of just how vulnerable relying on Azure services and functions alone had made them. An error in DNS code at Microsoft HQ had left their sites and services inaccessible to both frustrated customers and the staff trying to control an uncontrollable situation.

Understanding the impact of the Azure outage

Understanding the impact of the Azure outages requires some perspective on how many businesses rely on Azure enterprise and business cloud services. According to Microsoft’s website, 95% of Fortune 500 companies ‘trust their business on Azure’.

There are currently over 280,000 companies registered as using Microsoft Azure directly. That’s before taking into account the companies that indirectly rely on Azure through other Microsoft services such as Dynamics 365 and OneDrive. Azure represents over 18% of the cloud infrastructure and services market, bringing Microsoft $13.0 billion in revenue during 2021 Q1.

Suffice to say, Microsoft’s Azure services have significant market penetration across the board. Azure business and enterprise customers rely on the platform for an incredibly wide range of products, services, and solutions. Every one of them serves a business-critical function.

During the Azure outage, over a quarter of a million businesses were cut off from these functions. When the most common Azure services include the security of business-critical data, storage of vital workflow and process documentation, and IT systems observability, it’s easy to see why the Azure outage has hundreds of businesses considering 3rd party cloud platforms.

It’s not only Azure

Whilst Azure is the most recent of the Big 3 to experience a highly impactful service outage, the solution isn’t as simple as migrating to AWS or GCP. Amazon and Google’s cloud offerings have been historically as prone to failure as Microsoft’s.

In November 2020, a large AWS outage took hundreds of websites and services offline. What caused the problem? A single Amazon service (Kinesis) responded badly to a capacity upgrade. The situation then avalanched out of control, leading many to reconsider their dependency on cloud providers.

Almost exactly a year earlier, in November 2019, Google’s GCP also experienced a major global outage. Whilst GCP’s market reach isn’t as large as its competitors’ (GCP held a 7% market share in 2020, compared to AWS’s 32% and Azure’s 19%), many business-critical tools such as Kubernetes were taken offline. More recently, in April 2021, many GCP-hosted Google services such as Google Docs and Drive were taken offline by a string of errors during a back-end database migration.

The key takeaway here is that, regardless of vendor choice, any cloud-based services used by your business will experience vendor-induced downtime. As the common cyber-security idiom goes, it’s not if but when. 

Beating vendor lock-in with 3rd party platforms

Whilst there is no way to completely avoid the impact of an industry giant like Microsoft or Amazon experiencing an outage, you can protect your most vital business-critical functions by utilizing a cross-vendor 3rd party platform. 

One area where many Azure customers felt the impact of the outage was the loss of system visibility. Many Azure business and enterprise-grade customers rely on some form of Azure-based monitoring or observability service.

During the April 2021 outage, vital system visibility products such as Azure Monitor and Azure API Management were rendered effectively useless. For many organizations using these services, their entire infrastructure went dark. During this time their valuable and business-critical data could have been breached and they’d have lacked the visibility to respond and act.
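
The difference between "the service is down" and "we can no longer see the service" can be sketched in a few lines of Python. This is a hedged illustration of the reasoning, not a description of Azure Monitor or any other product. The `Verdict` enum, the `classify` function, and the vantage-point names are hypothetical, and the majority-vote rule is just one simple choice among many.

```python
from enum import Enum
from typing import Mapping, Optional


class Verdict(Enum):
    UP = "up"        # most reporting vantage points can reach the service
    DOWN = "down"    # most reporting vantage points cannot reach it
    BLIND = "blind"  # no vantage point reported at all


def classify(reports: Mapping[str, Optional[bool]]) -> Verdict:
    """Classify a service from per-vantage-point probe results.

    Each value is True (service reachable), False (service unreachable),
    or None (the vantage point itself failed to report).
    """
    answers = [result for result in reports.values() if result is not None]
    if not answers:
        return Verdict.BLIND
    up_votes = sum(answers)
    # Strict majority is required for UP; a tie is treated as DOWN.
    return Verdict.UP if up_votes * 2 > len(answers) else Verdict.DOWN


if __name__ == "__main__":
    # All monitoring hosted with the affected provider: nothing reports.
    print(classify({"provider-a-monitor": None}))
    # An independent vantage point still reports during the outage.
    print(classify({"provider-a-monitor": None, "independent-monitor": False}))
```

With a single vantage point hosted by the affected provider, the only possible result during a provider-wide outage is `BLIND`. Adding a vantage point outside that provider is what turns an unknown into a clear `UP` or `DOWN`.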

How Coralogix protects your systems from cloud provider outages

The same loss of visibility affected AWS customers in November 2020 and GCP customers the year before. This is why many businesses are opting for a third-party platform like Coralogix to remove the risk of single-provider reliance compromising their system visibility and security.

Coralogix is a cross-vendor cloud observability platform. Because our platform draws on functionality from all 3 major cloud providers, our users protect their systems and infrastructure from the vulnerabilities of vendor lock-in and service provider outages.

As a third-party platform, Coralogix covers (and improves upon) many key areas of cloud functionality. These include observability, monitoring, security, alerting, developer tools, log analytics, and many more. Coralogix customers have the security of knowing all of these business-critical functions are protected from the impact of the next Big 3 service outage.
