The latest GitHub outage and how it impacts observability
August 11, 2021
Every now and then, issues occur that disrupt the very fabric of global software engineering. Chief among them is the recent mass outage of GitHub. GitHub is a fundamental building block of software productivity, hosting over 190 million code repositories. It hosts our code and libraries, runs our build pipelines, and much more. It is a central hub of activity, used by tens of thousands of organizations.
But is GitHub a part of your system?
It’s tempting to consider GitHub as being outside of your software system. It isn’t a service or library developed by your team; it’s just one of the many services that you depend on, right? Well, not exactly.
GitHub holds a key function in the productivity of your engineering teams. If your team uses GitHub to host code, they cannot share, review, or merge code changes while GitHub is experiencing an outage.
If your entire development team went on strike, it would be considered an existential threat to your organizational objectives. A GitHub outage of this magnitude has a very similar impact in terms of developer output.
It gets worse.
You’ll notice that GitHub Pages was also part of the outage. GitHub Pages serves websites directly from your repositories, and a significant number of websites have DNS records pointed directly at a GitHub Pages site. For those sites, a GitHub outage is tantamount to an Availability Zone (AZ) outage in AWS. The infrastructure on which you depend has fallen away beneath you.
GitHub is a fundamental aspect of your system
GitHub is a bedrock of both your software engineering lifecycle and your production system’s ability to function. If your teams are unable to commit and push code changes, they’re unable to respond to outages. If they’re unable to respond to outages, those outages will only get worse.
The challenge for organizations now is simple: how do you monitor GitHub and other third-party tools that have become first-class citizens in your observability mission? Some of these tools expose APIs that allow you to programmatically discover their operational status. Alas, others are somewhat more mercurial.
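As a rough illustration of checking operational status programmatically, here is a minimal Python sketch. It assumes that the GitHub status page exposes a Statuspage-style JSON feed at the URL below and that the response carries a `status` object with `indicator` and `description` fields; both are assumptions to verify against the live page. The function names and the shape of the returned record are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed endpoint: a Statuspage-style JSON feed for the GitHub status page.
STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"


def parse_status(payload: str) -> dict:
    """Turn a raw status payload into a small, log-friendly record."""
    data = json.loads(payload)
    status = data.get("status", {})
    indicator = status.get("indicator", "unknown")
    return {
        "source": "github-status",
        "indicator": indicator,
        "description": status.get("description", ""),
        # Anything other than "none" is treated as degraded (an assumption).
        "degraded": indicator != "none",
    }


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> dict:
    """Fetch the feed over HTTP and parse it. Performs a network call."""
    with urlopen(url, timeout=timeout) as response:
        return parse_status(response.read().decode("utf-8"))
```

Calling `fetch_status()` on a schedule and shipping the returned record alongside your other logs is one simple way to bring this context into the same place as the rest of your telemetry.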
The GitHub status page belongs to a class of low-volume, high-value data that is too often overlooked. We analyze terabytes of operating system logs to better understand our system, but we skip over the data that provides us with context. The status of GitHub is as fundamental to observability as the status of your AWS availability zone or the network connection for your data center. It is essential.
The goal is to create a general solution that can consume data from disparate sources and bring it into one place, so that you can correlate many different conspiring events into a coherent timeline that describes the what and the why of your system.
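The idea of one coherent timeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Event` type, the `build_timeline` function, and the source names are hypothetical, and a real pipeline would also need to handle clock skew, duplicates, and far larger volumes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """A single contextual event from any source (hypothetical shape)."""
    timestamp: datetime
    source: str   # e.g. "github-status", "ci", "app-logs"
    message: str


def build_timeline(*streams: Iterable[Event]) -> List[Event]:
    """Flatten events from every source and order them by time."""
    merged = [event for stream in streams for event in stream]
    # sorted() is stable, so events sharing a timestamp keep source order.
    return sorted(merged, key=lambda event: event.timestamp)
```

With status-page events, CI events, and application logs merged this way, an upstream incident appears next to the build failures and errors that followed it, rather than in a separate tool.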
So how can you be ready for the next outage?
Contextual data is the hidden goldmine within your organization, costing very little to store and analyze but providing a great deal of value. Coralogix provides a comprehensive suite of features to tackle this challenge.
CPU utilization is important for understanding what your system is doing, but contextual data can provide you with the why and allow you to craft alerts that deal with the complex realities of your system.
Contextual data is more than just GitHub. It’s Slack messages, CI/CD logs and events, third-party status pages, and much more. While these sources are siloed, they are hidden. If they are hidden, they aren’t valuable. Exposing this data is the next step in complete observability.
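To make the idea of a context-aware alert concrete, here is a small Python sketch. It is not a Coralogix feature or API; the function name, the threshold, and the returned labels are all hypothetical. It shows how one piece of context (whether GitHub is currently degraded) can change what an alert on failed builds should do.

```python
def classify_build_alert(
    failed_builds: int,
    github_degraded: bool,
    threshold: int = 5,  # hypothetical threshold, tune for your pipeline
) -> str:
    """Decide how to route an alert on failed builds, given upstream context."""
    if failed_builds < threshold:
        return "ok"
    if github_degraded:
        # Failures coincide with an upstream incident: flag the likely
        # cause instead of paging someone to debug your own pipeline.
        return "annotate-upstream-incident"
    return "page-on-call"
```

The metric alone says that builds are failing. The context suggests why, and that changes who, if anyone, needs to be woken up.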