The life of a developer these days is more complicated than ever, as they are increasingly required to expand their knowledge across the stack, understand abstract concepts, and own their code end-to-end.
A major (and very frustrating) part of a developer’s day is dedicated to fixing what they’ve built – scouring logs and code lines in search of a bug. This search becomes even harder in a distributed Kubernetes environment, where the number of daily changes can be in the hundreds.
Kubernetes makes it really easy to deploy microservices, but when something inevitably breaks, the developer tasked with fixing the issue is left stranded with zero context, not even knowing where to begin.
Fixing issues can be divided into two approaches – troubleshooting and debugging. Although they are similar concepts and are often used interchangeably by mistake, they are not the same, and each requires its own methodology and toolkit.
Over the next paragraphs I will try to dispel the confusion between troubleshooting and debugging, and share some best practices for incorporating these methods into your workflow.
What is Troubleshooting?
Troubleshooting is the strategic process of finding the root cause of issues in a system at a macro level. It involves understanding many components and how they interact with each other, finding out which cascading failures are leading to the issue, and analyzing both the symptoms shown in the monitoring tools and the feedback from end users.
Troubleshooting usually includes debugging as part of its process, stretches over more than one session and impacts multiple stakeholders.
The troubleshooting process is likely to uncover many bugs that can be surgically isolated and fixed (more on that in a bit). The real goal, however, is to identify deep-rooted problems in the system by inspecting the infrastructure, pipelines, permissions, third-party apps and services, architecture, and even human processes and culture.
What is Debugging?
Debugging is the more tactical approach aimed at fixing local issues or exceptions, more commonly referred to as ‘bugs’.
A bug can be anything that causes the program to behave in a different manner than expected. It can be a syntax issue or a problem with the logic employed in the code. Even a single typo can be considered a bug.
Unlike troubleshooting, debugging can be accomplished within a single session, where a developer identifies and isolates the issue and works out a solution. This can uncover deeper issues, and – as a result – improve the overall system’s resilience. However, this is not the goal of the process.
How Do They Differ?
As mentioned above, debugging is a subset of troubleshooting. While debugging focuses on small, local instances that can be identified and fixed in one session, troubleshooting is a holistic process that takes into account all of the components in a system, even the team’s processes, and the way they affect each other.
It is akin to a doctor prescribing you a pill to deal with a recurring headache, versus a doctor inquiring about your diet, mental state, lifestyle and conducting a full-body scan in order to understand all the various elements contributing to your current status.
This doesn’t mean that they are two separate and unrelated processes. The opposite is true – debugging lives within troubleshooting, and it’s the natural continuation of it. Troubleshooting uncovers the deeper root cause of an issue and debugging steps in to fix the thing that broke.
Best Practices to Troubleshoot & Debug K8s
So now we know the right order in which to approach the problem, but what are some of the best practices? Reduction is a core tenet of troubleshooting, and the tips below are presented accordingly. Treat them as steps in a process: cast a wide net, then gradually narrow its bounds until you’ve caught your ‘bug’.
Troubleshooting Best Practices
When an issue arises – either you have an alert or an end-user is experiencing difficulties – start with a bottom-up approach in Kubernetes by listing all pods in the cluster. Check to see if something is reported as an error, not ready, or crashing – this gives you the thread to pull on and move forward.
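As a rough illustration of this first pass, the sketch below filters a pod listing down to the pods worth pulling on. It assumes the listing was saved as JSON (for example with kubectl get pods --all-namespaces -o json) and loaded into a dict; the pod names, namespaces, and the find_unhealthy_pods helper are made up for this example.

```python
# Sample shaped like the output of `kubectl get pods --all-namespaces -o json`.
# All names below are hypothetical and only serve the illustration.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"namespace": "default", "name": "web-1"},
         "status": {"phase": "Running", "containerStatuses": [
             {"name": "web", "ready": True, "restartCount": 0,
              "state": {"running": {}}}]}},
        {"metadata": {"namespace": "default", "name": "api-2"},
         "status": {"phase": "Running", "containerStatuses": [
             {"name": "api", "ready": False, "restartCount": 7,
              "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
        {"metadata": {"namespace": "data", "name": "db-0"},
         "status": {"phase": "Pending", "containerStatuses": [
             {"name": "db", "ready": False, "restartCount": 0,
              "state": {"waiting": {"reason": "ImagePullBackOff"}}}]}},
    ]
}


def find_unhealthy_pods(pod_list):
    """Return (namespace, name, reasons) for every pod that looks unhealthy."""
    unhealthy = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        reasons = []
        # Anything other than Running or Succeeded deserves a closer look.
        if phase not in ("Running", "Succeeded"):
            reasons.append(f"phase={phase}")
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                # Waiting containers carry a reason such as CrashLoopBackOff.
                reasons.append(f"{cs['name']}: {waiting.get('reason', 'Waiting')}")
            elif phase == "Running" and not cs.get("ready", False):
                reasons.append(f"{cs['name']}: not ready")
        if reasons:
            unhealthy.append((meta.get("namespace"), meta.get("name"), reasons))
    return unhealthy


for namespace, name, reasons in find_unhealthy_pods(SAMPLE_PODS):
    print(f"{namespace}/{name}: {', '.join(reasons)}")
```

Healthy pods are dropped, so whatever remains is the thread to pull on in the next step.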
Drill down into the issue
Describe the pod (using kubectl describe pod) to get more information on its specification, the configuration that was set up, and the events that occurred. In most cases the investigation stops there.
Take an event such as: Failed to pull image "localhost:53329/nginx:latest". In this case somebody made a typo and referenced a Docker image that is unreachable from the cloud. Describing the pod and carefully examining each line would reveal the root cause and make for a quick debugging session.
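The same check can be sketched in code. The example below assumes the pod's events were exported as JSON (for instance with kubectl get events --field-selector involvedObject.name=POD_NAME -o json) and scans the warnings for image-pull failures. The sample events, the failed_image_pulls helper, and the looks_local heuristic are illustrative assumptions, not part of any Kubernetes tooling.

```python
import re

# Matches the image reference quoted in an image-pull failure message.
PULL_FAILURE = re.compile(r'Failed to pull image "([^"]+)"')

# Hypothetical events, shaped like the items of `kubectl get events -o json`.
SAMPLE_EVENTS = {
    "items": [
        {"type": "Normal", "reason": "Scheduled",
         "message": "Successfully assigned default/web-1 to node-a"},
        {"type": "Warning", "reason": "Failed",
         "message": 'Failed to pull image "localhost:53329/nginx:latest"'},
        {"type": "Warning", "reason": "Failed",
         "message": 'Failed to pull image "localhost:53329/nginx:latest"'},
    ]
}


def failed_image_pulls(event_list):
    """Return the distinct image references named in pull-failure warnings."""
    images = []
    for event in event_list.get("items", []):
        if event.get("type") != "Warning":
            continue
        match = PULL_FAILURE.search(event.get("message", ""))
        if match and match.group(1) not in images:
            images.append(match.group(1))
    return images


def looks_local(image):
    """Heuristic: the registry host points at the local machine."""
    host = image.split("/", 1)[0].split(":", 1)[0]
    return host in ("localhost", "127.0.0.1")


for image in failed_image_pulls(SAMPLE_EVENTS):
    hint = " (local registry, unreachable from the cluster?)" if looks_local(image) else ""
    print(f"could not pull {image}{hint}")
```

The heuristic only covers the localhost case from the example above; a misspelled image name or tag would still need a human to spot it.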
Start looking horizontally
It’s time to check your config maps, ingresses, secrets, volumes, and nodes, or to drill down even further by reading your app logs. The root cause could turn out to be either a Kubernetes issue or an application issue.
Configurations or Secrets used in your application might not be aligned with what the app actually needs. It’s always a good idea to check if you’re using secrets from, and in, the proper environment.
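One way to make this check less manual is to compare what a pod references against what actually exists in its namespace. The sketch below assumes the pod manifest is available as a dict (for example from kubectl get pod POD_NAME -o json) and that the names of existing Secrets and ConfigMaps were collected separately; every object name here is hypothetical.

```python
def referenced_config(pod):
    """Collect the Secret and ConfigMap names a pod spec refers to."""
    secrets, config_maps = set(), set()
    spec = pod.get("spec", {})
    for container in spec.get("containers", []):
        # Whole objects imported as environment variables.
        for source in container.get("envFrom", []):
            if "secretRef" in source:
                secrets.add(source["secretRef"]["name"])
            if "configMapRef" in source:
                config_maps.add(source["configMapRef"]["name"])
        # Single keys imported as environment variables.
        for var in container.get("env", []):
            value_from = var.get("valueFrom", {})
            if "secretKeyRef" in value_from:
                secrets.add(value_from["secretKeyRef"]["name"])
            if "configMapKeyRef" in value_from:
                config_maps.add(value_from["configMapKeyRef"]["name"])
    # Objects mounted as volumes.
    for volume in spec.get("volumes", []):
        if "secret" in volume:
            secrets.add(volume["secret"]["secretName"])
        if "configMap" in volume:
            config_maps.add(volume["configMap"]["name"])
    return secrets, config_maps


def missing_config(pod, existing_secrets, existing_config_maps):
    """Return sorted names that the pod references but the namespace lacks."""
    secrets, config_maps = referenced_config(pod)
    return (sorted(secrets - set(existing_secrets)),
            sorted(config_maps - set(existing_config_maps)))


# Hypothetical pod that still points at a staging secret.
SAMPLE_POD = {
    "spec": {
        "containers": [{
            "name": "api",
            "envFrom": [{"configMapRef": {"name": "app-config"}}],
            "env": [{"name": "DB_PASSWORD", "valueFrom": {
                "secretKeyRef": {"name": "db-credentials-staging", "key": "password"}}}],
        }],
        "volumes": [{"name": "tls", "secret": {"secretName": "tls-cert"}}],
    }
}

print(missing_config(SAMPLE_POD, ["db-credentials-prod", "tls-cert"], ["app-config"]))
```

A missing name is only one kind of mismatch; a Secret that exists but holds values from the wrong environment still has to be checked by looking at its content.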
Application logs are often the last piece of the puzzle when debugging in Kubernetes, since reading them usually requires a more intimate knowledge of the application, which the troubleshooter might not always have. They are also the most efficient: if a pod is in a CrashLoopBackOff, the application logs will usually include a stack trace of the exception that caused the container to crash.
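For a crash-looping pod, the interesting logs usually belong to the previous (crashed) container instance, which kubectl logs exposes through its --previous flag. The sketch below builds that command for each container stuck in CrashLoopBackOff, given a pod dict shaped like kubectl get pod -o json output. The crashloop_log_commands helper and the sample pod are assumptions made for illustration.

```python
def crashloop_log_commands(pod, tail=50):
    """Build one `kubectl logs --previous` command per crash-looping container."""
    meta = pod["metadata"]
    commands = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            commands.append([
                "kubectl", "logs", meta["name"],
                "--namespace", meta.get("namespace", "default"),
                "--container", cs["name"],
                "--previous",        # logs of the instance that crashed
                f"--tail={tail}",    # the stack trace is usually near the end
            ])
    return commands


# Hypothetical pod with one healthy sidecar and one crash-looping container.
SAMPLE_CRASHING_POD = {
    "metadata": {"namespace": "default", "name": "api-2"},
    "status": {"containerStatuses": [
        {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
        {"name": "sidecar", "state": {"running": {}}},
    ]},
}

for command in crashloop_log_commands(SAMPLE_CRASHING_POD):
    print(" ".join(command))
```

Each command is returned as an argument list, so it can be passed directly to subprocess.run without going through a shell.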
Debugging Best Practices
Reproducing the issue
You’ve determined the root-cause of the issue you have – now comes debugging and fixing. By reproducing the issue, you’ll have control over how to fix it.
- If it’s an app issue it’s easier to reproduce, because usually, someone in the organization would know how the application works, which business logic is expected etc. All developers should have a local environment up and running to try and reproduce the issue locally – making it easier to find out if the ‘fix’ actually fixed anything.
- If it’s a K8s issue it will be a lot harder to reproduce. Running a local Kubernetes cluster is possible but it’s very hard to get it to a level where it’s exactly identical to a production cluster.
Note: reproducing the issue is certainly the best practice, but it is not always possible. In that case, you need to resort to a trial-and-error (a.k.a. “Blackbox”) approach.
Fixing the problem
Once you reproduce the issue, start making changes and verify that they fix the problem. You have found the root cause, or a hint of what it may be, so make your changes (or ‘fixes’) one by one. This is important: if you make multiple changes at once, it will be harder to understand what actually fixed the issue.
To recap, troubleshooting is the broad process of auditing a system at a macro level, and understanding its intricacies, the very way in which the cogs of the machine interact. Debugging, on the other hand, is a process of identifying and fixing exceptions locally in isolation.
Traditionally, developers spent more time debugging than troubleshooting. With the advent of the ‘shift-left’ movement, troubleshooting responsibilities are increasingly entering the purview of developers on all levels, and are no longer solely in the hands of the sysadmins and architects of the world.
Seeing as this is a major pain point for novice developers as well as for the more senior DevOps engineers who end up doing the troubleshooting, Komodor aims to bridge the knowledge gap by simplifying Kubernetes and providing developers with all the context needed to troubleshoot issues efficiently and independently.