While "everything as code" may be what the future holds, context and visibility are still required for effective troubleshooting.
Way back in 2011, famed venture capitalist Marc Andreessen made waves when he said that software was eating the world.
As an early investor in Facebook and LinkedIn, Andreessen correctly predicted that software, and the code used to build it, would replace traditional industries and take center stage in the business world.
Fast forward nearly a decade, and code is at the core of nearly everything we do. We order food, seek out love, and entertain ourselves almost entirely through applications.
But code does not operate in a vacuum.
It is supported by infrastructure that can be affected by changes across your various environments. If someone on your team tweaks the servers or takes some other action that affects delivery or production, the cause of the resulting issue does not lie in the code.
Adding to the mix, changes now happen more often than they used to. As organizations have moved to the cloud and embraced microservices, they are pushing out new versions of software at a faster rate.
At the same time, many changes happen behind the scenes, outside of the code, and they need to be accounted for when you start troubleshooting issues that arise. What you are looking for is context and visibility.
Some of the factors that can impact the product include:
- Configuration Changes: These can affect your production in almost any way imaginable.
- Feature Flags: If these otherwise useful tools are misconfigured or toggled in the wrong direction, they can severely limit the availability of the service.
- Bugs in the Jobs: Jobs are a standard part of every operation. But if there’s a bug in one of these processes, it can cause serious headaches. Take, for instance, a job tasked with running backups at midnight that deletes all the data instead of copying it. The code was fine, but the process of handling the code went haywire and caused the incident. Cases like these can make for some frustrating surprises in the morning.
- Infrastructure Changes: A crowd favorite. Any one of a million changes in your Kubernetes clusters or AWS console can lead to a series of alerts being hurled your way.
- 3rd Party Tools: Changes in the 3rd party tools that you use for managing your processes can impact the functioning of your production, particularly if they have functionality beyond read-only (e.g. Auth0, Cloudflare, etc.). The lack of visibility outside of your comfy silo is particularly stark here, as these products decidedly sit outside of your walls. This makes connecting them to the issues that they may be causing even trickier.
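To make the idea of context a little more concrete, here is a minimal sketch of how changes from the sources listed above could be pulled into one timeline and filtered down to those that happened shortly before an incident. Everything in it is hypothetical and for illustration only: the `ChangeEvent` type, the `changes_before_incident` function, the source labels, the two-hour lookback window, and the sample events are all assumptions, not the API of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """One recorded change, whatever system it came from (hypothetical type)."""
    source: str          # e.g. "config", "feature_flag", "job", "infrastructure", "third_party"
    description: str
    timestamp: datetime


def changes_before_incident(events, incident_time, lookback=timedelta(hours=2)):
    """Return the changes made within `lookback` before the incident, newest first."""
    window_start = incident_time - lookback
    recent = [e for e in events if window_start <= e.timestamp <= incident_time]
    return sorted(recent, key=lambda e: e.timestamp, reverse=True)


# Made-up sample data: one change per kind of source.
events = [
    ChangeEvent("feature_flag", "new-checkout flag toggled on", datetime(2020, 5, 1, 23, 40)),
    ChangeEvent("job", "midnight backup job ran", datetime(2020, 5, 2, 0, 0)),
    ChangeEvent("infrastructure", "node pool resized", datetime(2020, 4, 30, 9, 15)),
    ChangeEvent("third_party", "CDN rule updated", datetime(2020, 5, 2, 0, 20)),
]

incident_time = datetime(2020, 5, 2, 0, 30)
for event in changes_before_incident(events, incident_time):
    print(f"{event.timestamp:%Y-%m-%d %H:%M}  [{event.source}]  {event.description}")
```

The filtering itself is trivial. The hard part in practice is getting every one of these sources to report its changes into a single place, which is exactly the visibility gap described above.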