While everything as code may be where the industry is headed, effective troubleshooting still requires context and visibility.
Way, way back in 2011, famed VC investor Marc Andreessen made waves when he said that software was eating the world.
As an early investor in Facebook and LinkedIn, Andreessen correctly predicted that software, and the code used to build it, would replace traditional industries and take center stage in the business world.
Fast forward nearly a decade, and code is at the core of nearly everything we do as humans. We order food, seek out love, and entertain ourselves almost entirely through applications.
But code does not operate in a vacuum.
It is supported by infrastructure that can be affected by changes across your various environments. If someone on your team tweaks the servers or takes some other action that affects delivery or production, then the resulting issue does not stem from the code at all.
Adding to the mix is the fact that changes are happening more frequently than they used to. As organizations have moved to the cloud and embraced microservices, they are pushing out new versions of software at a faster rate.
At the same time, many changes are going on behind the scenes, outside of the code, that need to be accounted for when you start troubleshooting issues that arise. What you are looking for are context and visibility.
Some of the factors that can impact the product include:
- Configuration Changes: These can cause every kind of impact imaginable to your production.
- Feature Flags: If these otherwise useful tools are misconfigured or toggled in the wrong direction, they can severely limit the availability of your service.
- Bugs in the Jobs: Scheduled jobs are a staple of every operation. But if there's a bug in one of these processes, it can cause serious headaches. Take, for instance, a job tasked with running back-ups at midnight that instead deletes all the data rather than copying it. The code was fine, but the process handling the code went haywire and caused the incident. Cases like these can make for some frustrating surprises in the morning.
- Infrastructure Changes: A crowd favorite, one of a million changes in your Kubernetes clusters or AWS console can lead to a series of alerts being hurled your way.
- 3rd Party Tools: Changes in the 3rd party tools that you use for managing your processes can impact the functioning of your production, particularly if they have functionality beyond read-only (e.g. Auth0, Cloudflare). The lack of visibility outside of your comfy silo is particularly stark here, as these products decidedly sit outside of your walls. This makes connecting them to the issues they may be causing even trickier.
Suffice it to say that there are plenty of spheres outside the scope of your code itself where things can go awry.
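To make the backup-job scenario concrete, here is a minimal sketch (in Python, with hypothetical file paths) of the safeguard such a job was missing: copy first, verify the copy, and only then remove the original, so a bug anywhere earlier in the process leaves the source data intact.

```python
import os
import shutil

def backup_then_prune(src_path: str, backup_dir: str) -> str:
    """Copy a file into backup_dir, verify the copy, then delete the original.

    The deletion only runs after the copy has been verified, so a failure
    in any earlier step never destroys the source data.
    """
    dest = shutil.copy2(src_path, backup_dir)
    if os.path.getsize(dest) != os.path.getsize(src_path):
        raise RuntimeError("backup verification failed; source left intact")
    os.remove(src_path)  # safe: the verified copy already exists
    return dest
```

The point is ordering: the job described above effectively ran the delete without the copy-and-verify steps ever succeeding, which is exactly the class of process bug that never shows up in a code review of the application itself.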
Why not just do everything as code, you may ask?
It’s 2021 and code is king, so why are we doing anything at all that is not simply code?
Working everything as code may be the direction the industry is moving in, but it’s not where it is now. While it may be best practice to manage all of these external processes as code, most organizations just are not there. They have to contend with a technical gap: most third parties, as well as the current state of the system itself, cannot be described as a static file, so approaching them as such is not much of an option.
In light of this reality, we need to consider practical solutions that will help organizations identify the source of their issues faster and achieve visibility across all of the different tools and environments they actually work in outside of their repositories.
What is needed is a way to track changes across everything that you are using and communicate those changes to all the other teams.
Slack has become a go-to for many organizations for sharing information about new changes across their environments. In this case, teams connect their Jenkins, Kubernetes, or AWS to Slack with scripts and dump those changes into the chats. These notifications can be hard to follow, if not a bit spammy.
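As a rough sketch of that pattern (assuming Slack's standard incoming-webhook API; the URL and event details below are placeholders, not real endpoints), such a script formats each change event and POSTs it as JSON:

```python
import json
import urllib.request

def build_change_message(source: str, change: str, author: str) -> dict:
    """Format a change event as a Slack incoming-webhook payload."""
    return {"text": f"[{source}] {change} (by {author})"}

def post_to_slack(webhook_url: str, payload: dict) -> None:
    """POST the payload to a Slack incoming webhook."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Hypothetical usage (placeholder webhook URL):
# post_to_slack("https://hooks.slack.com/services/...",
#               build_change_message("Jenkins", "deployed new build to prod", "dana"))
```

Every source gets its own script and its own channel, which is exactly why the resulting stream of messages tends to become spammy and hard to correlate.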
In other cases, some developers are choosing to write their own tools to track changes, going about it DIY style.
A third option we see is sending change notifications to monitoring tools. However, beyond alerting you to the fact that there is a problem, these still lack the information necessary to be useful when time is at a premium.
Context and overarching visibility are key
For developers called in to handle an incident, getting an accurate picture of all the changes across their various environments is critical: it supplies the context and visibility they need to do their jobs.
Most developers, who are increasingly joining the on-call roster, are familiar with their specific silo. And for the most part, that is going to be the code. But as we have seen, oftentimes the issue will not be in the code but in a totally different environment with which they have little familiarity, let alone domain knowledge.
Instead of expecting every developer to be a domain expert where they are not, we need to provide them with the tools to get up to speed and pinpoint the root cause of the problem faster, ideally without requiring them to open too many dashboards and scroll through lines on a graph to locate when the guilty change occurred. In short, we need to give them quick access to the relevant information and the overview they need to be effective: a tool that helps them understand the context of a given issue.
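One way such a tool can work, sketched here with a hypothetical change-event record (the source names and window size are illustrative assumptions, not a real product's API), is to collect timestamped changes from every environment and, when an alert fires, surface the ones that landed just before it:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ChangeEvent:
    source: str        # e.g. "k8s", "feature-flags", "Jenkins"
    description: str
    timestamp: datetime

def suspects(changes: list[ChangeEvent], alert_time: datetime,
             window: timedelta = timedelta(hours=1)) -> list[ChangeEvent]:
    """Return changes that landed within `window` before the alert,
    most recent first -- the likeliest culprits to inspect."""
    recent = [c for c in changes
              if alert_time - window <= c.timestamp <= alert_time]
    return sorted(recent, key=lambda c: c.timestamp, reverse=True)
```

The on-call developer then starts from a short, ranked list of recent changes across all environments instead of hunting through each dashboard in turn.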
What we need is not just another cook in the kitchen, but a solid kitchen manager who can step in, direct, and make sure that the soup gets out to customers without too much spilling on the floor.