Confessions of a software engineer who enjoyed being paged at 5am

It’s 5:14 a.m., and I wake up to the squawking geese sound of my PagerDuty alert (anyone else have this sound? No?). I’m four months into working for my new team as a junior software engineer, and this is my first time being paged in the middle of the night.

Most software engineers probably dread this moment, but I kind of love it. Agile ceremonies and Jira tickets suddenly don’t matter, and you’re fully focused on stopping a customer-impacting fire.

Joining the on-call roster was one of the first things I put my hand up for when I started. Now, with a few years of being on-call under my belt, I wanted to share why I relished the opportunity, and offer some thoughts on how you can make on-call more enjoyable for your team.

What I loved about being on-call

The rush

The adrenaline of being paged in the middle of the night and having to figure out what is going on is pretty exciting. When you’re sitting at a desk all day, attending meetings and responding to emails, the rush of needing to solve a new problem quickly is invigorating.

I also enjoyed the teamwork: those moments in an incident when everyone is hyper-focused and working together to figure out what is going on.

The valuable skills I gained

On-call forces clarity. When you’re paged into an incident in the middle of the night, your brain is foggy, multiple alerts are pinging, and other teams and stakeholders are waiting for your updates. My communication style quickly improved to ensure I was conveying the most pertinent information in the clearest possible way.

When I had the opportunity to be an incident commander, I quickly realized the need to exercise leadership. Instead of hunting for the root cause from the narrow perspective of your own product, your main priority is to direct others: making sure everyone knows their role, that people are communicating their updates, and that customer-facing teams have the relevant information about the impact.

Psychological safety

The more senior members of my team created this, and I always felt comfortable paging the secondary on-call if needed, or asking for feedback afterwards. I appreciated how, alongside learning your software and observability systems and building trust in them, you also build trust in your teammates.

Despite all this, I realize I’m probably in the minority in terms of actually enjoying on-call. So while you might not be able to get your engineers to love it, how can you get them to find it more enjoyable? Here are some of the key things I noticed over the years that make a big impact. 

Essential ingredients for an on-call experience that people actually enjoy

A simple ‘first five minutes’ runbook

The runbook before the runbook

Having runbooks for specific situations and specific products is essential, but when you’re just starting to be on-call, what’s the very first thing you do? I remember looking at the first few alerts and pages thinking, where do I even start?

If you get paged from another team into an incident, how do you start figuring out what’s going on? What’s the first thing you check? 

A ‘first five minutes’ runbook would cover topics like:

  • Where to go when you first get an alert
  • Which dashboards to check first
  • How to introduce yourself if you get paged into an already running incident
  • How to identify the most recent releases
  • Where any pre-written queries are

This might seem intuitive when you’re experienced. But for someone new to on-call, having this as a starting point means they feel less flustered when they first get an alert.
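
To make this concrete, here’s a minimal Python sketch of how a ‘first five minutes’ checklist could be encoded so a bot or script can drop it into the incident channel the moment a page fires. Every URL, dashboard name, and query location below is a hypothetical placeholder – swap in your team’s real ones.

```python
# first_five_minutes.py: a sketch of a "first five minutes" checklist.
# Every URL and name below is a hypothetical placeholder; replace them with
# your team's real dashboards, release register, and saved-query library.

FIRST_FIVE_MINUTES = [
    "Acknowledge the page so it doesn't escalate while you look.",
    "Open the team overview dashboard: https://example.com/dashboards/team-overview",
    "Check the error-rate and latency panels for the affected service.",
    "List the most recent releases: https://example.com/releases?window=24h",
    "Open the pre-written triage queries: https://example.com/queries/triage",
    "If you were paged into another team's incident, introduce yourself: "
    "'Hi, I'm the on-call for <service>; here's what I can see from our side.'",
]


def format_checklist() -> str:
    """Render the checklist as a numbered message for an incident channel."""
    steps = [f"{i}. {step}" for i, step in enumerate(FIRST_FIVE_MINUTES, start=1)]
    return "First five minutes:\n" + "\n".join(steps)


if __name__ == "__main__":
    print(format_checklist())
```

A plain wiki page works just as well; the point is that the very first steps live in one obvious place.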

Unified observability 

The less tab-switching required, the more your engineers can think

When it’s the middle of the night, the last thing you want to do is juggle three monitoring platforms, a release register, and dashboards from other products in the company, all while trying to align timestamps and find irregularities. In that situation, half the battle is just finding the right data, and it quickly becomes a bottleneck.

Unifying all your telemetry into one platform streamlines the process for your engineers and reduces the cognitive load so they can focus on problem solving (the enjoyable part!).

Cutting the noisy alerts

The quickest way to burn out

The quickest way to get sick of on-call is to be constantly paged for things that turn out not to be important, or for alerts where the issue resolves itself after a few minutes and there isn’t anything you can do. It’s easy to start feeling resentful and apathetic towards your systems – not a great energy for when something serious goes wrong.

Investing time in refining alerts pays off. One way to do this is to combine logs, traces, and metrics into a ‘flow’ of alert conditions. Instead of getting alerted every time there’s a CPU spike that resolves itself, you get alerted only when there’s a CPU spike and an increased error rate for something specific, like “100 4xx responses returned in the last 10 minutes”. As a bonus, you already have more context around what might have happened when you start investigating. Win-win!
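
To sketch what that ‘flow’ could look like, here’s a rough Python example of a compound alert condition. The query helpers (query_max_cpu, count_http_responses) and the thresholds are hypothetical stand-ins for whatever your observability platform actually exposes – the idea is simply that nobody gets paged unless both signals fire together.

```python
# compound_alert.py: a sketch of a compound ("flow") alert condition.
# query_max_cpu and count_http_responses are hypothetical stand-ins for
# whatever query API your metrics and logs backends actually expose.

from datetime import timedelta

CPU_SPIKE_THRESHOLD = 0.9      # 90% CPU utilization (illustrative)
ERROR_COUNT_THRESHOLD = 100    # 4xx responses within the window (illustrative)
WINDOW = timedelta(minutes=10)


def query_max_cpu(service: str, window: timedelta) -> float:
    """Placeholder: return max CPU utilization for `service` over `window`."""
    raise NotImplementedError("wire this up to your metrics backend")


def count_http_responses(service: str, status_prefix: str, window: timedelta) -> int:
    """Placeholder: count responses whose status code starts with `status_prefix`."""
    raise NotImplementedError("wire this up to your logs or traces backend")


def should_page(service: str) -> bool:
    """Page only when a CPU spike coincides with a real error surge.

    A CPU spike that resolves itself, or a handful of 4xxs on their own,
    wakes nobody up: both conditions have to hold at the same time.
    """
    cpu_spike = query_max_cpu(service, WINDOW) > CPU_SPIKE_THRESHOLD
    error_surge = count_http_responses(service, "4", WINDOW) >= ERROR_COUNT_THRESHOLD
    return cpu_spike and error_surge
```

In practice you’d express this in your alerting platform’s own rule language, but the shape of the logic is the same: combine the signals first, then page.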

Taking advantage of AI

To easily turn the chaos into resolution

Anomaly and pattern recognition are fantastic use cases for AI. It’s exciting to see how AI models embedded in observability platforms are reshaping incident response and reducing mean time to recovery.

I remember a time while on-call when we knew exactly what the root cause of an issue was, but we just couldn’t find where in the platform it was coming from. The fix was obvious, but it took hours to locate the source. An AI that could correlate data across the platform and point us in the right direction would have been a game-changer. 

I’m incredibly excited about Olly, Coralogix’s AI observability assistant. Instead of going back and forth between different dashboards, writing multiple complex queries, and piecing together data from a whole lot of different places, Olly puts it all together for you and lets you ask questions in natural language. You can read more in the official Olly documentation.

Healthy post-incident review culture

Blameless and improvement-focused

Easy to agree to in theory, but something that takes consistent leadership to maintain. 

I remember a time I made a mistake that caused a minor outage. I apologized as soon as I realized, but my senior engineer immediately said “Don’t apologize. It’s not your fault, that was bound to happen. It shows we need a better process so it’s not possible next time.” It really stuck with me.

A healthy post-incident culture treats outages and incidents as opportunities to improve the system or the process. With that mindset, engineers are less fearful of stepping up to resolve incidents.

Being on-call taught me more than how to fix production issues. It taught me how to stay calm under pressure, communicate clearly, and trust my teammates. I think this makes it one of the fastest ways to grow as an engineer. If you’re thinking about going on-call for the first time soon, lean into the chaos and excitement!

And if you want more ways to make on-call more enjoyable, come chat. I’d love to share how Coralogix is building the kind of tooling I wish I’d had back then.
