Avoiding microservices mess over the long run

Long before the COVID-19 pandemic made remote work the norm, software development organizations were already moving away from the monolithic style of building applications. Instead, they have shifted to the far more agile microservices environment made possible by Kubernetes. Teams can now deploy smaller applications that better address their business goals, without the long lead times and large-scale complexity that can arise under the legacy model of development. Comparing the two can feel like the difference between maneuvering a speedboat and an aircraft carrier.

However, as our organizations make the switch to Kubernetes, it is important to remember that this is only the beginning. As with all new toys, we have new issues to contend with as our teams work to keep the new process running smoothly. Welcome to Day Two Operations, and all the days that will follow if we do our jobs right.

New toys, new problems to solve in "microservices hell"

Trade-offs are a given when building and deploying software. The trick is to keep progressing toward a better system than the one we had before. Inevitably, software will break along the way; it is up to us to fix it faster, and hopefully more effectively, than before.

In moving to Kubernetes, we trade a monolith of massive applications for microservices: many smaller applications, each with its own designed purpose. The advantage of microservices is that it is easier to fix issues as they arise. The challenge, like cutting off one head of a hydra, is that we now have many more applications to keep track of; those who have to manage it often call this "microservices hell." Failures in this widely distributed system are harder to track and even harder to troubleshoot.
Rebuilding Institutional Knowledge for the New Tech Stack

The next challenge is that as we move our technology stack to a new system, we lose much of our existing knowledge about how to fix issues when they come up. That lost institutional knowledge and expertise will have to be rebuilt from scratch. Hiring staff with experience from outside your organization addresses this in part, but on the whole it means learning over time how to respond to issues in your own products as they arise. These issues can stem from:

- Load balancing challenges
- Kubernetes misconfigurations
- Unexplained downtime in the app or in others

Given these challenges, the best practices and tools we relied on under the old way of working are less relevant. We need a new way of addressing the challenges of our new tech stack, one that is a better fit for the distributed environment. In our next post, we will lay out how we see the path forward, including a list of best practices and tools that should help us get to the next stage.
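To make the kinds of issues listed above more concrete, here is a hypothetical sketch of a Kubernetes misconfiguration that commonly surfaces as "unexplained downtime": a liveness probe tuned too aggressively for a slow-starting application. All names (the `example-app` Deployment, the image, the `/healthz` endpoint) are illustrative, not taken from any real system.

```yaml
# Hypothetical Deployment fragment: the app needs ~30s to warm up,
# but the liveness probe starts checking after 5s and tolerates only
# one failure, so Kubernetes keeps killing the pod. The symptom is a
# restart loop that looks like unexplained downtime.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-app
          image: example/app:1.0   # illustrative image
          livenessProbe:
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080
            initialDelaySeconds: 5 # too short for a slow-starting app
            periodSeconds: 5
            failureThreshold: 1    # a single slow response triggers a restart
```

One way to fix a sketch like this is to raise `initialDelaySeconds` and `failureThreshold`, or to add a separate `startupProbe` so the liveness check does not begin until the application has finished warming up.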