It’s 3 AM. You’re the developer on call and you get woken up by an alert. It’s the third time you’re on call this month, and even though you’ve been able to manage all the alerts that have come your way so far, it’s still really stressful. All you want is to deal with it quickly and get back to sleep.
This time, the alert is for an SQS issue.
It’s not your area of expertise, but you think you can handle it. You open Git in one tab, Datadog in another, and Slack in a third, trying to figure out who released what, and when. Time passes and you realize you have zero context for what might have caused the issue. Stressed, frustrated, and annoyed, you wake up the next senior person on call. More time passes, and the two of you still can’t figure it out.
You ultimately realize you need to wake up the one person in the organization who knows everything and anything about the system. In about 10 minutes, he solves the problem. Three hours have passed since the alert. Tired and embarrassed, you go back to sleep.
I’m sure you’re familiar with this kind of story. I am. It’s a painful, common experience for many developers.
On-call today is moving from ops/SRE teams to developers. Most developers, and even SREs, don’t have the right context or tools to troubleshoot. Yet they’re expected to provide 24/7 coverage for systems and to respond quickly and effectively in real time, even though it’s not within their expertise and not what they were hired to do.
Complexity is growing
Today’s systems are distributed, complex, and changing rapidly. The information needed to operate them effectively is scattered across different tools and teams, all of which need to coordinate smoothly and which affect one another when they don’t. With the number of alerts tripling over the last three years, troubleshooting has become a chaotic process that wastes precious time and requires a deep understanding of a system’s dependencies, activities, and metrics – an understanding that only a few highly paid people in any given organization have.
Troubleshooting becomes mission impossible, especially when 85% of incidents can be traced to system changes, according to a recent Gartner report.
It’s a chaotic process.
How are companies dealing with the pain of on-call?
When it comes to solving real issues, you need context across the entire system.
The SRE who ultimately solved the problem described above wasn’t a better developer than the others who spent hours working on it – he was simply able to check the usual suspects quickly, as he had seen the issue before. He was one of the few people in the organization who was familiar with the system’s structure.
Training – putting time & resources towards the problem
It’s possible for a developer to be as effective as someone who already has context on the system – companies can provide better training and allocate more energy to supporting their employees on call. Some companies do go down this route and focus on improving the process side of being on call. With enough training, you’re going to get good outcomes. However, it requires a tremendous amount of resources and time – for developers and the company.
Tools – installing different mechanisms, but are they truly effective?
Other companies choose to invest resources in on-call tools. They develop enhanced dashboards, sophisticated alerts, detailed playbooks, and lengthy post-mortem procedures. Identifying and implementing appropriate tools can be a full-time job, as it takes constant work to keep on-call monitoring under control. But it’s not so simple – there are many black holes in a system, and it can be difficult to correlate information between different tools. Since there are typically very few people with complete knowledge and an accurate overview of how a system is built, it’s hard to set up the monitoring tools effectively. An alert raised in one service can originate in another, and it can be difficult to trace an alert back to its origin. To implement the appropriate tools, a comprehensive, detailed overview of the entire system is required.
In reality, many companies invest in neither the training nor the tools needed to solve the on-call problem.
So, what would be helpful?
First steps to solving the problem
The first step is to admit we have a problem. The tech sector needs to acknowledge that the current way of dealing with on-call is not working. It’s ineffective, broken, and causes a lot of anxiety and stress for developers asked to take on the burden of making sure everything is OK. With nearly 35% of DevOps engineers leaving their jobs due to burnout, this is a real problem.
When it comes to alerts, it’s commonly understood that in the middle of the night you have to solve the problem as fast as possible, and the next day you need to make sure it never happens again. This impacts developers, managers, and the entire development team.
We know the current way of troubleshooting is not working and that there’s a better way to handle the problem.
That’s why we started Komodor.
Stay tuned for more news from us.