3 clusters/~400 services
About Bitso
Bitso’s currency exchange platform is the transparent way to access and use Bitcoin and other cryptocurrencies. Bitso allows its users to buy, sell, trade, send and manage crypto instantly, all in one place. It is the first and biggest cryptocurrency trading platform in Latin America, processing tens of millions of transactions for over 3 million customers. In 2021 Bitso raised $250M in a Series C round, valuing the company at $2.2B.
The Challenge
As the business continued to scale rapidly, deploying a distributed architecture and using Kubernetes for container orchestration was the natural choice for Bitso’s engineers. Their existing monitoring stack, however, could no longer support the scale and complexity of operations. On top of this, a lack of broad Kubernetes expertise hindered the teams’ ability to respond to incidents quickly. As a fintech SaaS company, Bitso knew it had to maintain high availability, so it needed to upgrade its tech stack and give responders a way to quickly identify and troubleshoot Kubernetes issues.
The Problem
In 2021 Bitso’s dev teams were pushing over 300 updates to production each day, including configuration and infrastructure changes. Deploying at such a high rate increased the likelihood of an incident. However, the troubleshooting process the teams had in place often involved too many steps and relied on a patchwork solution. For instance, whenever an issue occurred, responders had to connect to the business VPN, log in to their authentication provider, and from there log in to AWS. Not only did this waste precious time, it also meant that, in order to troubleshoot, responders accessed the production cluster directly, potentially exposing the system to even more errors, especially given the Kubernetes knowledge gap in the team.
The Solution
Bitso needed help with Kubernetes operations to support its continued growth in both customer base and headcount. Simplifying K8s troubleshooting is a classic use case that’s well aligned with Komodor’s offering. Komodor addressed Bitso’s needs by:
- Streamlining the troubleshooting flow and providing a uniform process for the team to follow. Now, whenever something breaks, the on-call engineer receives a PagerDuty alert with a direct link to Komodor’s dashboard, showing the status and deployment history of the affected service. With one click, they can see the history of changes that led to the issue and use these insights to quickly pinpoint the root cause. In addition to dramatically decreasing the mean time to recovery (MTTR), the ability to investigate the issue from Komodor’s platform also eliminates the need to directly access K8s resources in AWS. Devs can simply pull pod logs directly from Komodor and avoid the risk of causing more issues in production.
- Providing the development teams with a bird’s-eye view of all the components of their Kubernetes environment and making it easy to understand the context of each change. By making K8s resources accessible and easy to understand, the platform helped more responders take confident action and troubleshoot issues independently.