The latest GitHub outage and how it impacts observability

Chris Cooney
August 11, 2021

Every now and then, issues occur that disrupt the very fabric of global software engineering. Chief among them is the recent mass outage of GitHub. GitHub is a fundamental building block of software productivity, hosting over 190 million code repositories. It hosts our code and libraries, runs build pipelines, and much more. It is a central hub of activity, used by tens of thousands of organizations.

But is GitHub a part of your system?

It’s tempting to consider GitHub as being outside of your software system. It isn’t a service or library developed by your team; it’s just one of the many services that you depend on, right? Well, not exactly.

GitHub holds a key function in the productivity of your engineering teams. If your team uses GitHub to host code, they cannot push, review, or merge code changes while GitHub is experiencing an outage.

If your entire development team went on strike, it would be considered an existential threat to your organizational objectives. A GitHub outage of this magnitude has a very similar impact on developer output.

It gets worse.

You’ll notice that GitHub Pages was also part of the outage. GitHub Pages literally hosts your site for you. A significant number of websites have DNS records pointed directly at a GitHub Pages site. For those sites, a GitHub outage is tantamount to an Availability Zone (AZ) outage in AWS: the infrastructure on which you depend has fallen away beneath you.

GitHub is a fundamental aspect of your system

GitHub is a bedrock of both your software engineering lifecycle and your production system’s ability to function. If your teams are unable to commit and push code changes, they’re unable to respond to outages. If they’re unable to respond to outages, those outages will only get worse.

The challenge for organizations now is simple: how do you monitor GitHub and other third-party tools that have become first-class citizens in your observability mission? Some of these tools expose APIs that allow you to programmatically discover their operational status. Alas, others are somewhat more mercurial.

The GitHub status page belongs to a class of low-volume, high-value data that is too often overlooked. We analyze terabytes of operating system logs to better understand our system, but we skip over the data that provides us with context. The status of GitHub is as fundamental to observability as the status of your AWS availability zone or the network connection for your data center. It is essential.
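As a rough illustration of how little work this kind of data takes to capture, here is a minimal Python sketch that turns a status page response into a structured log event. It assumes the GitHub status page exposes the usual Statuspage-style JSON endpoint at the URL below, with a `status.indicator` field whose healthy value is `none`. The function names, the event fields, and the `degraded` flag are all made up for this example and are not part of any GitHub or Coralogix API.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

# Assumption: the status page serves a Statuspage-style JSON document here.
STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"


def fetch_status(url=STATUS_URL, timeout=10):
    """Download and decode the status document (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def to_log_event(payload, source="github-status", now=None):
    """Convert a status payload into a flat event that can be shipped as a log line.

    The field names are hypothetical; adapt them to your own log schema.
    """
    status = payload.get("status", {})
    indicator = status.get("indicator", "unknown")
    now = now or datetime.now(timezone.utc)
    return {
        "timestamp": now.isoformat(),
        "source": source,
        "indicator": indicator,
        "description": status.get("description", ""),
        # Anything other than "none" (including a missing field) counts as degraded.
        "degraded": indicator != "none",
    }


if __name__ == "__main__":
    # Print one event; a scheduler such as cron could run this every minute.
    print(json.dumps(to_log_event(fetch_status())))
```

Run on a schedule, a script like this produces a handful of log lines per hour, which is exactly the low-volume, high-value profile described above.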

The goal is to create a general solution that can consume data from disparate sources and bring it into one place, so that you can correlate many different conspiring events into a coherent timeline that describes the what and the why of your system.
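To make the idea of a single timeline concrete, here is a small Python sketch that merges events from several sources into one chronological list. The `Event` type, its fields, and the source names are hypothetical and chosen only for illustration; a real pipeline would do this inside your observability platform rather than in a script.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import chain


@dataclass(frozen=True)
class Event:
    """One contextual event. The shape is an assumption made for this sketch."""
    timestamp: datetime
    source: str
    message: str


def build_timeline(*streams):
    """Combine any number of event streams into one list ordered by time.

    Assumes every timestamp is comparable with the others, i.e. all are
    timezone-aware or all are naive.
    """
    merged = chain.from_iterable(streams)
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda event: event.timestamp)


def render(timeline):
    """Format the timeline as one readable line per event."""
    return [f"{e.timestamp.isoformat()} [{e.source}] {e.message}" for e in timeline]
```

Feeding it a status page stream, a CI/CD stream, and an alert stream yields one ordered view in which a failed build sits right next to the status change that explains it.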

So how can you be ready for the next outage?

Contextual data is the hidden goldmine within your organization, costing very little to store and analyze but providing a great deal of value. Coralogix provides a comprehensive suite of features to tackle this challenge.

CPU utilization is important to understand what your system is doing, but contextual data can provide you with the why and allow you to craft alerts that deal with the complex realities of your system.

Contextual data is more than just Github. It’s Slack messages, CI/CD logs and events, third-party status pages, and much more. While they are siloed, they are hidden. If they are hidden, they aren’t valuable. Exposing this data is the next step in complete observability.
