It’s 3 AM. You’re the developer on call and you get woken up by an alert. It’s the third time you’ve been on call this month, and even though you’ve managed every alert that has come your way so far, it’s still really stressful. All you want is to deal with it quickly and get back to sleep.
This time, the alert is for an SQS issue.
It’s not your area of expertise, but you think you can handle it. You open Git in one tab, Datadog in another, and Slack in a third, trying to figure out who released what, and when. Time passes and you realize you have zero context for what might have caused the issue. Stressed, frustrated, and annoyed, you wake up the next senior person on call. More time passes, and the two of you still can’t figure it out.
You ultimately realize you need to wake up the one person in the organization who knows everything and anything about the system. In about 10 minutes, he solves the problem. Three hours have passed since the alert. Tired and embarrassed, you go back to sleep.
I’m sure you’re familiar with this kind of story. I am. It’s a painful, common experience for many developers.
On-call today is moving from ops/SRE teams to developers. Most developers, and even SREs, don’t have the right context or tools to troubleshoot. Yet they’re expected to provide 24/7 coverage for systems and to respond quickly and effectively in real time, even though it’s outside their expertise and not what they were hired to do.