How Bitso Empowers Its Devs to Troubleshoot K8s Independently

Juan José Mejia
Engineer Lead, Bitso
Oren Ninio
Head of Solution Architects, Komodor

You can also view the full presentation deck here.

Cody: Good morning, good afternoon, or good evening depending on where you are in the world, and welcome to today’s webinar brought to you by Techstrong and Komodor. My name is Cody J. Brown. I’m the host of Techstrong Learning. We have an exciting presentation ahead of us but first, I have a couple of housekeeping notes to cover. First, today’s session is being recorded. So, if you miss any part of our session or you’d like to rewatch or share with a friend, the on-demand recording will be made available shortly after we conclude our live session today.

If you have any questions for our speakers, we want you to submit those to the Q&A tab on the right side of the screen. And for any additional comments or if you want to engage with your fellow audience members, we want you to use the chat tab. Finally, at the conclusion of our webinar, we will be giving away 4 $25 Amazon gift cards. So, be sure to stick around to see if you’re one of our winners. I’ll also mention that we do have the slides available for download under the handouts tab, which is also on the right side of your screen. So, the webinar topic today is how Bitso empowers its devs to troubleshoot Kubernetes independently.

I’m joined today by Juan José Mejia, engineering lead at Bitso, and Oren Ninio, head of solution architects at Komodor. Oren, Juan, thank you both so much for being here with me today.
Oren, do you want to take it away from here?

Oren: Thank you very much, Cody. So, as you shared, my name is Oren, and I’m head of solution architects at Komodor. A little about Komodor: we’re a startup in the Kubernetes ecosystem. We help companies large and small make the transition to Kubernetes, as well as troubleshoot incidents inside their environments, anything from ImagePullBackOff errors to issues with their nodes or other resources. Juan?

Juan: Thank you. Thank you, Cody, and thank you, Oren. I’m Juan José Mejia, as mentioned. I’m an engineering lead at Bitso, where I run a squad of engineers. If you don’t know who we are, I’ll give you a little overview. Bitso is a cryptocurrency exchange and a unicorn; it is actually the first cryptocurrency unicorn in Latin America, and it’s the top one. We offer services so that people in Latin America can access cryptocurrency in a fast and reliable way.

Oren: So, Juan, I’m sure you know there are a lot of challenges with growing and scaling and becoming a unicorn. Before we dive into those, can you tell us a bit about your current architecture and how things are set up in your environment?

Juan: Yeah, sure. Before going into the challenges we’re going to tackle, since I think most of you are tech people and developers, I would like you to picture the field where we are landing right now, the ground we are standing on, okay? So, first, as many of you may know, Bitso is a unicorn, right? It’s a Latin American unicorn, and for the last few years it was a startup. Like every startup, it has a lot of technical challenges, and we are now in this land of migrating from a startup to a medium-sized company, okay?

So, on the engineering side we have a lot of challenges, and I would like to show you the mountains where we are right now, okay? One of them is that we have monolithic services that live on Kubernetes for now, and we are migrating and decoupling them into microservices, okay? Another mountain is that we have a lot of databases, and we need to make them more reliable as we grow as a company. For example, we are adding replicas and bringing them closer, so they can be more reliable and more available.

And another mountain we are facing and going through is moving from a synchronous architecture to an asynchronous one. For example, if you have an API and you receive a request for a bunch of data, maybe this API is calling many other APIs from your services, and this kind of thing. We need to scale that part, so we’re moving to an asynchronous architecture.

Oren: So, Juan, I’m sure you have challenges. Like you mentioned, you’re becoming a unicorn, you’re probably recruiting a lot of people, and you’re probably expanding into additional areas. One of the things that I know you have ventured into is being a fully remote company as well. What is that like?

Juan: Yeah, being a fully remote company is hard at first, when you don’t know each other, and most of us engineers don’t socialize a lot. So sometimes it’s hard, and you need a really great culture. Bitso has offered a lot of that, with a lot of participation from our engineering side too. You also get help every time. So, yeah.

Oren: So, the simple things in life, like grabbing a cup of coffee and going over to your colleague, are not possible.

Juan: Yeah, actually we have this program called ‘coffee buddies’. All the Bitsonauts (that’s what we call our Bitso members) will know it. We have a Slackbot that arranges a meeting for you with a random person, and you have a 30-minute coffee. It’s a great way to connect.

Oren: That sounds really awesome. So, let’s talk a bit about the challenges that you’re having around implementing Kubernetes. I’m sure it’s something a lot of people are facing now.

Juan: I want to give a little summary of our challenges. We have three challenges. The first one is about the tools we were using, okay? The second one is about the processes we had on the engineering side, and the last one is the lack of knowledge, the knowledge gap we got because, as you mentioned, we are a company growing every day. So, that’s it, let’s go. The first challenge is about the tools we were using. As I mentioned, Bitso began about 8 years ago. So, many years ago, services were running on virtual machines on a custom provider, not a well-known provider. So, every solution we had was a standalone one, okay?

Another challenge we had with the tools: when you are a startup and you have 20 developers, maybe it’s easy to access logs, because you don’t have much traffic from users and transactions. And developers accessed these logs via kubectl in this case. So, as I mentioned, when you have 20 developers it’s okay, there’s no problem. But when you scale from, I don’t know, 20 developers to 100 developers, you need more ways to access all the services.

And another challenge is that we had Jenkins CI for pipelines and deployments. But in the end, when there are a lot of people racing to go to market, this adds extra manual effort, okay? So, this was a challenge in our development process. And then, on the other hand, we had applications like firewalls, VPNs, and other infrastructure programs that were also installed standalone on these virtual machines. This was a problem for our infrastructure team, because they had to maintain them, they had to reinstall them, and when something went down, they had to be very alert, okay?

Oren: So, Juan, this challenge really makes sense, because you mentioned that you were a startup for quite some time, and when you’re a startup your main focus is to increase your velocity as much as possible. So, you don’t put your mind into creating the most robust solution, but rather into finding the best way to address whatever issue you’re having at the moment, whether it’s a custom provider for your VMs that answers your needs right now, or standalone VPNs for any specific service that you want to connect with. You’re not looking at the big picture right now. You want to get the next feature out to the market. You want to develop and push the next PR.

Juan: Yeah, that’s it. That’s why I love all the stages that a company goes through, and I had the luck of having this experience in other companies and seeing them grow. Every stage that a company is at has its own problems, its own growing pains and challenges. When you are a small company, like a startup, you have to reach the market so you can get your investments and grow to the next level. So yeah, we are this company growing into a medium-sized company. But we are not only focusing on growing into a medium-sized company; we are focusing on growing into a large company in the future. We don’t want to make an effort and then redo it all again.

Oren: Yeah, so Bitso right now had an explosion. You’re growing, you’re adding, you’re recruiting more people. I’m sure you’re recruiting many developers every week to join your team to join your squad, join your tribe. I’m sure there are a lot of challenges in the process as you mentioned between the different teams arranging and coordinating. Can you tell us a bit about those challenges?

Juan: Yeah, and most developers, in my experience (I have a background as a software developer), really like processes. I meet a lot of developers who are always proposing things like, I don’t know, let’s adopt a git flow, let’s agree on the definition of done, and all that stuff, which is really great to hear from developers. Because at the end of the day, it’s the way they’re going to work, the way they’re going to fail, and the way they’re going to succeed. So, developers are really interested in the process. In this case, our processes were no longer relevant, because we were facing, as you mentioned, a large growth in people. I can tell you that last year we added an average of about 20 engineers per month.

Oren: Wow.

Juan: So, we had really great growth, and it was also hard on our side, because this growth comes with friction. One of the things that happened was that when a developer wanted to put code on a development or staging environment, they needed to do it manually. So, that was a challenge we had in our process, and we started working on it together with all the senior developers. Another challenge we had in the process was that when we needed to create resources, services, pods, or whatever AWS resource, it was hard, because not everyone was used to the infrastructure as code we had newly implemented. Also, as I mentioned, the process was changing all the time.

For example, I had one process to create a new resource last month, and the next month it was different. It was hard in this regard, because we are ever-changing. And another challenge we had was the pull request interaction. As I mentioned, we had some monolithic services, and this implies sharing code responsibilities with other teams. This is like having many hands in one soup sometimes. So, this kind of challenge is hard, because it comes down to how we interact with other teams once my pull request is uploaded. You get some reviews, and you have to bring everyone into context, including people on other teams who don’t share the same context about your business, your logic, your agreements.

Oren: So, everyone just wants the best code to be pushed, and once you see a pull request that’s coming out everyone wants to see how you can refine it, how you can make it better.

Juan: Yeah.

Oren: How does it interact with my code that I’m going to push, not right now, but maybe the code that I’m going to push next week. To make it ready for that.

Juan: Yeah, there’s a lot of trade-offs in this interaction and you have to wait for a comment or for approval, and then they have to wait for your response and so this was like…

Oren: And that just doesn’t make sense. Like you said before, you want to get to market as fast as possible, you want to deliver, you want to go.

Juan: Exactly, that’s one of the things that takes a lot of time on our side. But we also have challenges that are smaller. For example, when logging into AWS, you have to log in to your authentication provider, then to AWS, then to your client. There were tiny processes like that which also take a lot of time. We like this idea of not only focusing on the big picture and the big problems, but also on improving the tiny stuff.

Oren: So, essentially improving the quality of life of the developer as well. So not only that he develops a good product, but he enjoys the process that he’s doing it.

Juan: So, that’s it. That’s what we as Bitso want to offer. A great developer experience. So, we can reach our goals and also be happy with what we do.

Oren: Absolutely and I’m sure that connecting to your cluster and having a timeout every few minutes and having to log in again. I can really relate to that frustration.

Juan: That’s it. And the last challenge we have is that we noticed a lack of Kubernetes expertise among our developers. We have a really small number of developers who have a lot of experience with Kubernetes and infrastructure. But as I always mention, we are, like most companies, a product-oriented company. We want developers who are focused on the business logic, who can build the complex algorithms for reaching our goals and delivering the core value of our business. They are also focused on the users and on seeing the users’ needs. So, dealing with Kubernetes was sometimes hard, because we have to live with Kubernetes. There is no way to escape from it. The problem is how we deal with it.

Oren: Was the challenge mainly with the existing developers who weren’t familiar with Kubernetes or with new people that joined the company who were not familiar with this infrastructure? Where did you see that challenge come from?

Juan: Yeah, in the first place, there were developers who were used to Kubernetes. But even they had to learn how all the processes were implemented and how all the infrastructure was planned, so it was also hard for those who know Kubernetes. Obviously, we don’t have access to production, and that adds a lot of extra effort, as in every company. As for those who don’t know Kubernetes so well, and this happens to a lot of developers, you can be used to Docker, to other orchestrators, or to other stuff, but if you don’t have an idea of Kubernetes, you have to go and deep dive into learning kubectl and learning how to access your services.

Yeah, those kinds of things were making our tasks take longer, because when you have a task, you first have to know how to work with your tools.

Oren: And essentially what you’re doing is you’re bringing on new developers and there’s this new infrastructure that they’re not familiar with. At the end of the day, you brought them to develop your solution. You want to develop the best of the best platform there is. And you don’t want to invest the time necessary in them learning the new infrastructure.

Juan: Exactly, yeah.

Oren: And investing their time in becoming an infrastructure engineer but rather to continue their path as a software developer.

Juan: Yeah, that’s why we have Site Reliability Engineers and DevOps engineers; that’s what they are for. So yeah, in this case, most of our developers are product-oriented developers.

Oren: So, you spoke about a really really interesting challenge that I’m sure every company that is going through a growth phase like yourself is currently facing. Transitioning from the architecture and infrastructure that served them when they were growing up in a way the best infrastructure that they needed and had for the time and transitioning to something that is more mature, more enterprise-ready for a much larger company. You spoke about internal processes and how you had to graduate from one team to a large organization as well as the challenges in learning. I’m really interested in hearing how you tackle these issues.

Juan: Yeah, here we go. As I mentioned, the first challenge was about the tools we had. We had to migrate from virtual machines on a custom provider to AWS. It was one of the first things we did, and then we also moved all of our infrastructure to Kubernetes. So, it was a really huge step, but it was almost at the beginning of the company, so it was a really long time ago. Then another thing we did is that we replaced Jenkins CI with CircleCI, so we can have everything fully automatic. For example, we can now have continuous integration and delivery, which we didn’t have before.

So, the only thing you do is push your changes, and then your whole flow runs automatically and you get notifications. So, that’s really great.

Oren: That’s really awesome.

Juan: Yeah. Another thing that we did, and that’s why we are here with you, is that we implemented Komodor. This was a really great acquisition. It helped us to troubleshoot a lot of incidents and to reduce the MTTR, and it gave us visibility into health statuses and events, and also into what changed between one deployment and another. This helped us a lot. It was really great, because we have many troubleshooting tools in just one place.

Oren: So, how did Komodor really help you with newer engineers, with people who are less familiar with Kubernetes?

Juan: Yeah, in this case, it worked like a facade between developers and Kubernetes and all that stuff. As I mentioned, we had a lot of day-to-day tasks, like going to kubectl to read the logs, or going to kubectl and doing a describe on a pod. Now, for many of those things, you just go in with your single sign-on, and you are in. That was a really great improvement.
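
For readers less familiar with the kubectl steps Juan mentions here, a minimal sketch in Python follows. It only builds the command lines for `kubectl logs` and `kubectl describe pod`; it does not run them. The pod name `payments-api-0` and the namespace `staging` are hypothetical and are not taken from Bitso’s setup.

```python
import shlex


def kubectl_logs_cmd(pod, namespace="default", tail=100):
    """Build the command that prints the last `tail` log lines of a pod."""
    return ["kubectl", "logs", pod, "--namespace", namespace, f"--tail={tail}"]


def kubectl_describe_cmd(pod, namespace="default"):
    """Build the command that shows a pod's status, events and containers."""
    return ["kubectl", "describe", "pod", pod, "--namespace", namespace]


if __name__ == "__main__":
    # Hypothetical pod and namespace, for illustration only.
    for cmd in (
        kubectl_logs_cmd("payments-api-0", "staging", tail=50),
        kubectl_describe_cmd("payments-api-0", "staging"),
    ):
        print(shlex.join(cmd))
```

Each developer needs cluster credentials and some familiarity with these commands to use them, which is the friction Juan describes.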

Oren: Awesome.

Juan: Yeah, another solution we implemented: we’re using Splunk for accessing our logs. I’d like to explain it this way, because it may be confusing: you can see logs in Komodor, and you can see logs in tools like Splunk or Kibana. Personally, I use the logs from Komodor mostly when I’m deploying a new service, troubleshooting a deployment, or creating a new service, and I want to know whether the pod is already healthy and see the first logs to check that the service started well, okay? So, that’s what I use the logs from Komodor for.

But if I need to search for a specific transaction that has an issue, for something more detailed, I go to Splunk. So, that’s the way we are using them right now.

Oren: So, I can see there’s a question from Amanda from Manchester, asking if Komodor is integrated with Splunk. So, Amanda, on that: Komodor is able to integrate with Splunk in order to give you access to your historical logs. You can see events, including historical events, within Komodor, and then go directly from the event or incident in Komodor to the historical logs in Splunk. There you can see, exactly like Juan mentioned, a specific transaction that went through, or do a deep-dive investigation into the incident historically as well.

So, Juan, one of the challenges that you mentioned was the procedural challenges between people, and being a fully remote company. I’m sure that’s a challenge on its own, but on top of that you’re growing significantly, adding 20 developers per month. You’re talking about culture. How did you handle that?

Juan: Yeah, first of all, one of the great things I have to mention about Bitso is the culture. I’ve never seen such a great, visible, and real culture of collaboration, where your voice can be heard. This is really something that’s hard to find in a company, and I think that base allowed us to make a lot of improvements. For example, I proposed some things on public channels, and the response was, hey, I like the idea, let’s go do it. And I was like, okay, I wasn’t expecting that, but yeah, let’s do it. This is really nice, and most of it comes from the developers themselves. That’s great. One of those improvements was a bot. This effort was done by DevOps, and if there is someone from DevOps in the webinar, I would like to thank them.

Oren: Everyone from DevOps worldwide for the great work you’re doing everywhere.

Juan: I see. They build our tools for us. So, they implemented that bot, and our manual processes of merging, running pipelines, and going to production with the deployments are now done automatically by the bot. You can consult the bot and ask it questions, and you can give it commands. You just run a command on the bot and it deploys stuff and does merges, and it’s really nice.

Oren: That is what was done manually before. Anyone had to do it manually until then.

Juan: Yeah, that’s the idea: to go from manual stuff, or things that take a long time, to doing them automatically.

Oren: Yeah, basically, if you do it more than twice, you should automate it.

Juan: Yeah, exactly. If it’s more than twice, you should automate. Another thing we did on the process side was to set up SIG meetings with our DevOps teams. So, for example, as I mentioned, we had this issue when you have to request a new resource. A few months ago we started using messaging architectures like Kafka, queues, or something like that. Those are the kinds of things that were really new to our infrastructure, and there are a lot of resources that are new and that we want to integrate. There are teams that want to use new tools or new applications.

So, in this case, we set up meetings that we hold frequently, weekly or bi-weekly, with DevOps. That way we can plan with them and say, hey, next month or next quarter we’re going to need this resource, please be prepared. As you know, not all the processes are already planned, because we are, as I mentioned, a startup growing into a medium-sized company. Not all the processes are formalized, and we are also struggling with a lot of informal stuff and informal processes. So, the first thing we have to do is just start planning, start the conversation. And that’s what this effort is.

Another thing that we worked on was the PR interaction, which took a lot of time. I could measure in my team that we had a lot of PRs and user stories we had to deliver to the user that took two or three sprints, and it was really crazy. We all have this kind of stuff, but what we wanted to focus on was, hey, how do we reduce this gap? What we did, we need to…

Oren: We need to deliver.

Juan: Yeah, and also this has a lot of impact. The users want features, the product cannot reach the goals, the developers get frustrated because your feature is not going to production.

Oren: So, what did you do?

Juan: Yeah, what we did on this part was to make agreements as a team before coding. For example, I have my task, I have my team. Before I code, I reach an agreement with my team: hey, how are we going to solve this? We’re going to use this tool, this framework, this strategy, this pattern, or whatever, and we all agree on that beforehand. So, when I code my solution and send the PR, there is almost no interaction needed, because we all agreed: hey, this is what we talked about, this is why this is good. The recommendations come before anyone goes to code. So, the PR interaction was reduced almost to zero. For most of the recent PRs that I got, it’s like, okay, this is easy, okay, approved.

There is no such back-and-forth anymore, and happily we now have a lot of user stories completed, meaning coding, testing, testing in environments, and going to production, in the same sprint. So, this is really great.

Oren: So basically, once you create a baseline of understanding between you and your peers, you’re ready to go and you stop clashing.

Juan: Yes, that’s it and we are all happy because we all contributed as a team. At the end of the day, that’s what most leaders want and what most organizations want. This is a team effort, not a solo effort.

Oren: Absolutely.

Juan: Yeah, and we have that in mind. And as for the tiny stuff, one thing we implemented was this: instead of using AWS key for logging in to AWS, which obviously took us a lot of time and a lot of interaction, because you have to, I don’t know, set up your two-factor authentication and grab your phone, we now use the submit to AWS tool. That was really nice, because it also has AWS key integrated, along with kubectl, so it has all the stuff in one program. It’s a command-line tool too, but we have implemented these kinds of things.

Oren: Awesome. So, let’s go back and tell us about how you address the knowledge gap that people had when joining the company or people who are already in.

Juan: Yeah, about the knowledge gap: as I mentioned, we had about 20 engineers joining per month. In my team, we got, I don’t know, two or three new engineers, and that’s a lot for one team. For every service you always have a reference person, so we would say, if you have to deal with this service, go talk with that guy. And then we had the two engineers asking the same guy over DM. We saw this situation: they were interacting a lot, but taking a lot of time from the senior engineer, and we said, okay, let’s do something. So we came up with these sessions. We have weekly sessions where we can share knowledge.

We also opened the mic to everyone. We created a safe space for engineers where they can share their knowledge. For example, we started with this: you take the last task you did, share it on your screen, and show how you solved it. Then you get a lot of interaction with other engineers, because someone might say, hey, I could do it better, and yeah, all suggestions are appreciated. It’s a really great space. Another thing we did was to open the mic to newcomers, because we love diversity, and that is really nice. The more diversity we have in our teams, when it’s well managed, the better, because new people also have a lot of background, a lot of experience, and a lot of things to bring.

We open the mic to newcomers and say, hey, you know a lot of frameworks, you are certified in that other framework, just help us and tell us how to use it better. It was really nice, and we covered topics like how to troubleshoot on Kubernetes, how to troubleshoot incidents, and how to troubleshoot this kind of service. And, for the audience: we also had the pleasure of having you, Oren, in one of the sessions, so you could tell us about Komodor, right?

Oren: Yeah, it was really great to join you on your company meeting to share about Komodor. How to troubleshoot issues in Kubernetes how to use Komodor more effectively and also hear a really active session of people asking all sorts of really insightful questions. Sharing things that they’re looking to see in the product, resources that they want to have added, unique ways to use the Komodor events timeline, and really getting all that feedback and questions from such a lively group.

Juan: Yeah, going back to that session you joined: I measured that about 80% of the engineers on my team, at least, didn’t know Komodor, because it had only recently been installed. Just a month later, I was really happy, because all the engineers were using Komodor to troubleshoot, and they were saying, okay, I’m getting this done more easily than before. So, we are looking for this kind of initiative. This knowledge transfer initiative was also adopted by other teams in the company. And I think you also participated in a huge company meeting about Komodor, right?

Oren: Yeah, so Niv from my team had the pleasure to join the full company and share the knowledge about Komodor for about, I think it was in two groups, to over 200 engineers. Which are now using Komodor, in troubleshooting issues in Kubernetes, seeing the status of the deployments when they’re coming through. And it was really an active and lively session.

Juan: Yeah, that’s what is really great about Bitso: we take initiatives, they grow, and we can make the life of engineers easier.

Oren: And I’m sure that these learning sessions are really helpful for junior people, junior software developers who are joining Bitso, and now learning how to improve their development. But also, for senior developers who are joining.

Juan: Yes.

Oren: Both to learn company culture, to learn about technologies that they might not be familiar with, or maybe even share from technologies that they’re familiar with or frameworks that they’re familiar with to people in the company on how the best way to use them.

Juan: Yeah, totally and in that part, we are all equal. We are all learning. We cannot be senior in everything that you have. So, that’s really great to know that we have a space for learning.

Oren: Absolutely. So, we spoke a bit about new technologies that you implemented, new tools, your transition to AWS, how you use Komodor to troubleshoot issues in Kubernetes. We spoke about processes and new processes that you implemented while you exploded from a startup to a unicorn and growing your team.

Juan: Yes.

Oren: And about the learning session. Is there anything more that you want to share about the process?

Juan: Yeah, about the process: as I mentioned, it’s worth saying that you have a space where your voice can be heard. The processes are always changing, and sometimes you have to deal with the sprint goals, with the interaction between teams, or with newcomers and training them on processes that are changing. But having a really supportive team and building this culture of support is really nice. The culture is, as I mentioned, the basis of it. For example, it happened to me yesterday: a teammate did a PR for me because I wasn’t near my computer, and it was really nice. I love this, but it’s all about the culture.

So, be available and be open with everyone, and you can implement the kinds of solutions we did. I love sharing this with you, because it may work for someone, as it’s working for us.

Oren: So, I’m sure that a lot of our listeners right now are going through many things that you have gone through in the process of growing the company, growing inside the company, and growing your team as well as challenges with Kubernetes. And might have questions for us. I see that there are some questions already coming in. This could be a really good opportunity for us to take some questions from the crowd. So, if you have any questions, please feel free to share. And we’ll start taking them.

Cody: Awesome, so while we wait on some more questions to come in, we will address the ones we’ve already received. So, here’s the first one. Does Bitso use a monitoring tool like Dynatrace, NewRelic, Datadog, Prometheus, or any others?

Juan: Yeah, on the monitoring part, you can use a lot of tools. Those ones are actually pretty similar; they all tend to solve observability. In our case, we use Datadog for observability. And here’s something that I learned, and I like to share learnings, because that’s why we created the knowledge transfer sessions: I learned that Prometheus calls every service and stores that information in its database, whereas with services like Datadog, your services go and hit the Datadog service.

So, in this case we implemented Datadog. Yeah, just with a few configurations I think we have two configurations to do in our service. You can expose your metrics over there and use the dashboard and add some columns or dimensions to our metrics. So, you can make it visible what is really going on in your business logic or your services.
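The pull versus push distinction Juan describes can be illustrated with a minimal in-memory Python sketch. It does not call Prometheus or Datadog; the class names, the `payments` service, and the metric name are all hypothetical and exist only to show who initiates the transfer in each model.

```python
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]


class PullCollector:
    """Pull model: the collector calls every registered service itself."""

    def __init__(self) -> None:
        self._targets: Dict[str, Callable[[], Metrics]] = {}
        self.store: List[Tuple[str, Metrics]] = []

    def register(self, service: str, read_metrics: Callable[[], Metrics]) -> None:
        # The service only exposes a way to read its metrics.
        self._targets[service] = read_metrics

    def scrape_all(self) -> None:
        # The collector decides when to fetch and stores the result.
        for service, read_metrics in self._targets.items():
            self.store.append((service, read_metrics()))


class PushCollector:
    """Push model: each service decides when to send its metrics."""

    def __init__(self) -> None:
        self.store: List[Tuple[str, Metrics]] = []

    def submit(self, service: str, metrics: Metrics) -> None:
        # The collector only receives; it never calls the service.
        self.store.append((service, dict(metrics)))


def payments_metrics() -> Metrics:
    # Hypothetical service-side metric snapshot.
    return {"transactions_completed": 42.0}
```

In the pull case the service is passive and the collector needs to know where every service lives; in the push case the service needs to know where the collector lives.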

Oren: So, we also have an integration between Datadog and Komodor. Juan, do you want to share a bit about how you use Datadog and Komodor together?

Juan: Yeah, mostly, oh sorry. I really like to use Komodor, as I mentioned with Splunk, for a quick overview of things. It’s like having this tool, this pen, right, or my cellphone to make a call. Komodor is like that, and all tools should be like that. I have a tool and I want to pick it up with no friction and without much interaction; I just want to take it from the table. Komodor is that way for me when I want to see a quick status, a quick log, the health status, or that kind of stuff.

On the other hand, I use Datadog for more complex analysis. For example, I have transactions, and the transactions fail or are completed. I want to publish that business logic to Datadog, so I created my custom metrics. For things like pod status or Java memory, a lot is already done; there are a lot of technologies that already monitor memory and so on. So, I don’t have to work on that. I need to work on the business side: what acceptance rate I have on my transactions, what the sources of the transactions are. I use Datadog for that kind of thing.
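As a rough illustration of the kind of custom business metric Juan describes, the sketch below builds a counter in the DogStatsD datagram format (`name:value|c|#tag:value,...`) and sends it over UDP to a local agent on the default port 8125. This is not Bitso’s code: the metric name `transactions.processed` and the `status` and `source` tags are assumptions chosen for the example, and a real service would more likely use an official Datadog client library than raw sockets.

```python
import socket
from typing import Dict


def format_counter(name: str, value: int, tags: Dict[str, str]) -> str:
    """Build a DogStatsD counter datagram with optional tags."""
    datagram = f"{name}:{value}|c"
    # Tags are the "dimensions" that let you slice the metric later.
    tag_part = ",".join(f"{key}:{val}" for key, val in sorted(tags.items()))
    if tag_part:
        datagram += f"|#{tag_part}"
    return datagram


def send_counter(name: str, value: int, tags: Dict[str, str],
                 host: str = "127.0.0.1", port: int = 8125) -> None:
    """Send the counter to a local agent over UDP (fire and forget)."""
    payload = format_counter(name, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


# Hypothetical usage inside the transaction-handling code:
# send_counter("transactions.processed", 1,
#              {"status": "completed", "source": "mobile"})
```

With `status` and `source` attached as tags, an acceptance rate per source can be computed on the dashboard side instead of in the service.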

Cody: So, our next question reads what is your solution for reverting complex DB changes?

Juan: Oh, that depends on where you are making those complex DB changes, whether in development, staging, or production. But in most cases, our solution is to isolate the complex DB change from the existing solution. If I make a complex DB change, I don’t try to modify the existing structure; I try to make both live together. So, if something happens, I can delete the new database structure and keep the old one, so the change can be rolled back. That’s the way of going safely.

Oren: Thank you for sharing, Juan.
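The approach Juan outlines, adding the new structure alongside the old one so it can be dropped if something goes wrong, might look like the following minimal sketch using Python’s built-in `sqlite3` module. The `accounts` and `accounts_v2` tables and their columns are hypothetical, not Bitso’s schema, and a production migration would also need to keep the two structures in sync while both are live.

```python
import sqlite3
from typing import List


def create_old_schema(conn: sqlite3.Connection) -> None:
    # The existing structure, which the change never modifies.
    conn.execute(
        "CREATE TABLE accounts (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
    )


def apply_change(conn: sqlite3.Connection) -> None:
    """Create the new structure next to the old one and backfill it."""
    conn.execute(
        "CREATE TABLE accounts_v2 ("
        "id INTEGER PRIMARY KEY, first_name TEXT NOT NULL, last_name TEXT NOT NULL)"
    )
    rows = conn.execute("SELECT id, full_name FROM accounts").fetchall()
    for account_id, full_name in rows:
        first, _, last = full_name.partition(" ")
        conn.execute(
            "INSERT INTO accounts_v2 (id, first_name, last_name) VALUES (?, ?, ?)",
            (account_id, first, last),
        )
    conn.commit()


def roll_back_change(conn: sqlite3.Connection) -> None:
    """Rolling back only removes the new structure; the old data is untouched."""
    conn.execute("DROP TABLE IF EXISTS accounts_v2")
    conn.commit()


def table_names(conn: sqlite3.Connection) -> List[str]:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```

Because `accounts` is never altered, rolling back is a single drop of `accounts_v2` rather than a reverse migration.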

Cody: Awesome. So, this next question reads I recently had a Cron Job issue that took a lot of time to troubleshoot. What is your solution for tracking and troubleshooting Cron Jobs?

Juan: If I’m not mistaken, I haven’t worked with Cron Jobs and Komodor, but you have suffered with that, right?

Oren: Yes. So, we recently launched our support for Jobs and Cron Jobs. What we do is track all the Jobs and how much time they take. We can save their information historically, so you are able to see the provenance of every Job that is running. You can look at the logs, even after the Job has been completed. You can see the average time that a Job usually takes, and when it takes longer, you can see that in Komodor as well. Another nice thing is that you’re able to see all of the different occurrences and Jobs that are happening on one timeline.

So, if you have Jobs that depend on one another, you can see the dependency between those Jobs on the timeline: how they interact with one another, as well as services that might be required for the Job to be completed. So, if your Job depends on one of the services, and that service is down and your Job fails, you can correlate between the Job failure and the service it depended on.
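The idea of comparing a Job’s latest run against its usual duration can be sketched in a few lines of Python. This is not how Komodor implements it; the function name, the job names, and the 1.5x threshold are assumptions made for illustration.

```python
from statistics import mean
from typing import Dict, List


def slow_runs(history: Dict[str, List[float]],
              latest: Dict[str, float],
              factor: float = 1.5) -> Dict[str, float]:
    """Return jobs whose latest run exceeded factor x their average duration.

    The returned value for each flagged job is its historical average,
    in the same unit as the inputs (seconds in the example below).
    """
    flagged: Dict[str, float] = {}
    for job, duration in latest.items():
        past = history.get(job)
        if not past:
            # No history yet, so there is no baseline to compare against.
            continue
        average = mean(past)
        if duration > factor * average:
            flagged[job] = average
    return flagged


# Hypothetical durations in seconds.
history = {"nightly-report": [60.0, 62.0, 58.0], "cleanup": [10.0, 10.0]}
latest = {"nightly-report": 120.0, "cleanup": 11.0, "new-job": 5.0}
print(slow_runs(history, latest))
```

Keeping per-Job history is what makes the comparison possible after the Job’s pod is gone, which is the difficulty Juan mentions with inspecting finished Cron Jobs by hand.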

Juan: We’d love to do that because the last time I had to deal with Cron Jobs was more than 6 months ago. I had to do the troubleshooting with kubectl, and it was like, okay, I have to map this Cron Job name to this one, now the Cron Job is dead, and it was really hard. I think with Komodor we will have a better time.

Cody: So, can you give an example of an issue that was discussed in the weekly knowledge sharing session and the lessons that you as a company took from that session?

Juan: Yeah, we have one that was about Datadog and metrics. I took that session to explain how to implement metrics from our services and the best practices for creating them. Creating metrics from your service logic is really easy: you just add one line of code and that’s it, you have your first metrics. But sometimes all you can see is, I don’t know, the count of an event and just that. Metrics are really insightful and really great because you can add dimensions or tags to every metric, and you can do a lot of crazy stuff and cross information between them.

I really like that one because we learned from that session; there was a lot of interaction with developers. They were really happy after the session, because they were like, oh my god, I can do all of those kinds of things, or I was doing it wrong. I learned that this is really useful for teams and for companies. And don’t only do sessions where you are always talking, like we are doing right now. We need sessions where developers are interacting. They are the main sponsors of the sessions.

We have different sessions, and I noticed that the ones where we have more interaction are the ones where we open the mic to the developers. For example, if I am bringing a topic, I will prepare the meeting in a way that they are more interested and participate more, asking more questions than when I’m just speaking. So, that’s a learning I took, and we took as a team, from these sessions that is really worth sharing.

Cody: Awesome. Thank you for that. So, one hurdle with working with Kubernetes is defining the amount of compute power you need for a service running on a cluster. Speaking of CPU and memory limits, how do you tell what is enough and what metrics do you use to configure these figures?

Juan: Compute power, okay. What counts as enough is sometimes about the technology you are using and sometimes about how much power you have on your host machine. How we define what is enough is a really hard question for me because I don’t come from the infrastructure side. We usually tackle these kinds of things as a team with the infrastructure team and the solution architect team. That’s how I tend to solve things: working together. But I have to be honest with you: from a startup perspective, most companies focus much more on delivering. And when we get to the medium-sized company that we are becoming, then we start to measure how many resources we are using.

So, at least in my opinion, this is something I’m not giving much focus. I know there are teams that are really focused on saving resources. But I prefer to have more resources in my services and be prepared if something grows. I think at Bitso we have started measuring a lot of these things because they are going to take more resources or more money than we have. I think that’s why we have Komodor to measure how our memory is used, and we have all the metrics. The idea here is to measure. Once we can measure, we can make decisions.
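To make the "measure first, then decide" point concrete, here is a minimal Python sketch that derives a memory request and limit from observed usage samples. It is one possible heuristic, not a Bitso or Komodor policy: the function name, the nearest-rank percentile for the request, and the 20% headroom over the observed peak for the limit are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple


def suggest_memory_mib(samples_mib: List[float],
                       percentile: int = 95,
                       headroom_pct: int = 20) -> Tuple[int, int]:
    """Suggest (request, limit) in MiB from observed memory usage samples.

    request: nearest-rank percentile of the samples (typical usage).
    limit:   observed peak plus a headroom percentage (room for spikes).
    """
    if not samples_mib:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples_mib)
    # Nearest-rank percentile: smallest value covering `percentile`% of samples.
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    request = math.ceil(ordered[rank - 1])
    limit = math.ceil(ordered[-1] * (100 + headroom_pct) / 100)
    return request, limit


# Hypothetical usage samples in MiB, including one spike.
samples = [100, 120, 110, 130, 150, 140, 125, 135, 145, 400]
print(suggest_memory_mib(samples, percentile=90))
```

A team that prefers to be generous with resources would raise the headroom; a team optimizing for cost would lower the percentile and watch for out-of-memory events.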

Oren: So, I can also share that we’re working with a lot of companies who are making the transition to Kubernetes now. One of the approaches our customers took is to give developers a very, very low amount of resources on their development clusters while they’re working on their application. So, the developers creating the solution need to work with a very lean pod with very low limits. That way they’re working hard to optimize their application as much as possible. And they might experience out-of-memory errors or reach the CPU limit.

Juan: They can iterate.

Oren: And they can iterate more and more, and once they have that issue, they’ll be notified by Komodor about the out of memory: why exactly it happened and what the pod’s lifecycle was until it reached the limit, so they can better understand how to optimize their own application. And then when they go out to production, they get a slightly higher limit in case they need to handle a higher load or scale or things like that.

Juan: Yeah, that’s interesting because we have these two perspectives. One is going with the minimal requirement, so you can improve from there and start saving a lot of money from the beginning. The other is being more conservative and saying, I’m going to take, I don’t know, one gigabyte of memory, and that’s okay, depending on the size of the service. We will have a lot of different perspectives, and that’s really great, because they both work, and at some point you have to start saving.

Oren: I think they even went down to giving like a bare minimum of like 200 megabytes for a service. So, just to really reduce it to the minimum, really challenge the developers to create really optimized applications.

Juan: Yeah, that’s it. I mentioned one gigabyte because sometimes we have a monolithic service or big services that have a lot of responsibilities and use a lot of resources. That may not be a standard, but sometimes it happens.

Oren: So, that’s an excellent question. Thank you.

Juan: Thank you.

Cody: So, it looks like we are on to our last question. So, if anyone in the audience has any last-second questions they’d like to send in, now is your last call. This last question reads: when you mentioned switching to AWS VPN, was this only site-to-site or are you using the AWS VPN client?

Juan: Here I also have to be honest: I’m not an expert on AWS. But as I mentioned, we had standalone VPN applications running on our servers. When we migrated to AWS, the idea was to increase reliability and uptime and to have redundancy of VPNs: if one goes down, another takes over the work. So, in this case, I can’t answer the question; that sits with our infrastructure team, and they have all the details on that.

Oren: So, I would believe that the AWS VPN client is used if you’re connecting from your local machine. Whereas, since you’re a fully remote company, I would doubt that you’re using a site-to-site, since you don’t have a site that you’re based from.

Juan: Yeah, I guess, but yeah, we received the solution from AWS, that’s the thing.

Oren: Yeah, so with VPN you really need to understand what type of solution you want to create. If you’re connecting from your personal computer, you’ll be using the AWS VPN client. And if you want to connect from your office, for example, to bridge it to the environment you’re working with in AWS, you’ll probably be leveraging AWS Site-to-Site VPN.

Juan: Okay, I think I misunderstood the question, or maybe the question was misunderstood. When I was talking about VPNs, I was talking about B2B VPNs, business VPNs with other providers. That’s our focus because we have this kind of security when connecting to some providers. That’s why I mentioned the VPN for our company. The client VPN is another thing that is part of the IT team, and I’m not sure what they’re using behind the client we have. I’m not sure if they’re using AWS VPN; I guess so. This is a matter of the company’s cybersecurity.

Oren: That will remain private.

Juan: Yeah.

Cody: Excellent. Well, that seems to be all the questions that we’ve received today. So, before we wrap things up, I’m going to let you guys have the floor in case there’s anything you want to let the audience know before we officially conclude.

Juan: Oren.

Oren: Thank you, Juan. So, yeah, as I mentioned at the beginning, I’m from Komodor. We help companies scale together with Kubernetes and help troubleshoot incidents in Kubernetes. We provide a full troubleshooting platform that helps you analyze and understand what is happening in your Kubernetes clusters, track events over time, and make Kubernetes troubleshooting more effective. We’re also leveraging Kubernetes workflows that help you troubleshoot incidents that are happening right now by analyzing what the issues are and what best practices to apply in order to resolve them.

Juan: Awesome. Yeah, from my side, I’m really thankful to all the people participating. I really love the questions and the interaction we had in the chat and in the Q&A. And thank you Oren, thank you Cody. As I mentioned, I am Jose Mejia, and right now I am leading a team at Bitso. At Bitso, our mission is making crypto useful, so people can engage from everywhere, specifically in our neighborhood, Latin America. So, people can hold cryptocurrency, exchange it, send it to other countries, and do a lot of stuff with it. Here at Bitso I really found my leadership role.

I started my leadership role here. I found out that I really love to help people develop correctly, develop better, develop happily. And I am willing every day to make their day-to-day easier, so they can code and be satisfied with and proud of what we are working on, what we’re building. So, there is a slide where you can see our open roles if you want to know them. If you want to know more about Bitso and our people, the team is working on this and they will attend to you really fast.

Cody: Juan and Oren, thank you both so much for taking the time out of your day to join us today, for putting together the PowerPoint, and for sharing your collective expert knowledge with us. We really appreciate that. I’d like to also remind our audience today that this session was recorded. You’ll receive an email with a link to access the recording on-demand, and you can also find it living on the DevOps website at slash webinars. Be sure to look in the on-demand section.

So, I do still have 4 $25 Amazon gift cards to give away. Our first winner is Kalpesh C. Our second winner is Victor N. Our third winner is Theo T. Our fourth and final winner is Naveen C. So, to the four of you, please keep an eye on your inbox to claim your gift card. But if you don’t find that email, just check your spam folder. I would like to extend our gratitude to Komodor for sponsoring today’s webinar, and my final thanks goes to you, our audience. Thank you so much for spending time with us today. We ask for just one extra moment for you to fill out a quick post-webinar survey that should pop up on your screen here in just a moment. But other than that, hope to see you at a future Techstrong Learning webinar. Everyone, have a great rest of your day. And Oren, Juan, thank you so much.

Juan: Thank you.