Defying the Odds Building Robust Safe Workloads with Aqua Security Komodor
Nir Ben-Atar,
DevOps Team Lead, Komodor
Anaïs Urlichs,
Dev Advocate, Aqua
You can also view the full presentation deck here.
[Transcript]
[Beginning of Recorded Material]
Udi: Hi everyone, and welcome to the Defying the Odds webinar with Aqua Security and Komodor. I’m Udi, Dev at Komodor. Before I clear the stage for our amazing speakers, just a bit of housekeeping. This webinar is being recorded, and after the event, you will receive an email with the recording and the presentation deck.
We will start with a high-level overview of the challenges Kubernetes poses for developers, talk about best practices and tools to tackle those challenges, and finish with a live demo of Komodor and Trivy Operator solving real-life scenarios. There will be a Q&A session at the end, so please feel free to drop your questions in the Q&A tab below or in the chat if you prefer. I hope you’ll enjoy our time together, and learn new things that will help you all to build more robust and secure workloads in Kubernetes.
And so why are we even here? So according to several recent surveys, the two most prominent blockers for Kubernetes adoption in production are, one, security, and the second one is reliability, or day-two operations. What’s interesting is that the two are very closely related, and can actually affect each other. And to unpack this quandary, we’ve partnered up with our friends at Aqua Security, who really understand their Kubernetes security. And so, without further ado, please allow me to introduce our speakers.
To cover the troubleshooting aspect, we’ve got Komodor’s very own DevOps lead, Nir Ben-Atar, also known as NBA. He’s a DevOps lead at Komodor, and he has vast experience in infrastructure and Kubernetes, and even some security background, because he used to work at Cognite before joining Komodor. And we also have with us, and she really needs no introduction, a CNCF top ambassador, a GitHub Star, the woman behind 100 Days of Kubernetes.
Formerly an SRE at Civo, and currently Dev Advocate at Aqua Security, the one and only Anais Urlichs. So I’m going to clear the stage and let the real experts talk, so please enjoy, and remember to follow Komodor and Aqua and Anais, of course, for more great Kubernetes content. And we’ll start off with Nir, so take it away.
Nir: Yes, can you guys hear me well? Can you hear me well? Anais, can you say something, so I can hear you?
Anais: I can hear you perfectly.
Nir: Yes, okay, cool. So first of all, very excited to be here, together with Anais, this is like kind of my debut webinar here in Komodor. And I’m going to start with a quick who am I, and I feel already embarrassed talking about myself, but I’m just going to go real quick. Right now, I’m a DevOps lead at Komodor.
I used to be a DevOps group manager at Cognite, like Udi said earlier. I love Kubernetes, I love automating things, and I play beach volleyball and all kinds of sports really, and that’s pretty much me. I’m not going to talk too much about myself. So I think let’s just get to it, and talk a little bit about Kubernetes troubleshooting.
So let’s talk a little bit about what’s so great about Kubernetes. I believe that most of you guys already kind of know Kubernetes, and if you don’t, maybe you can start with Anais’s series, 100 Days of Kubernetes, to get a head start. Kubernetes is really like a game changer. It became the de facto standard for deploying microservices onto cloud environments, and it’s really widely accepted for a good reason. It allows you to increase efficiency by natively distributing workloads onto nodes.
It also scales amazingly, almost out of the box, using HPAs and cluster autoscalers, and that kind of stuff. And its tooling is really going to allow teams to be super agile and apply DevOps methodologies. And I think in the era of cloud computing, it’s already been named the cloud OS, and to be frank, as users and admins, we all really kind of love it. So Kubernetes is good, this is basically the slide. I think the dark side of Kubernetes is the fact that it’s pretty complex.
And complexity is fine when things are working. I mean, I don’t mind that my AC, my air conditioning, is pretty complex as long as it keeps my room chilled on summer nights, right? I do mind when things go wrong and I wake up in the middle of the night all sweaty, and all I’m left with is an abstraction, a remote control, without real understanding of what’s going on, and how to actually solve the fact that my room is really hot.
So I think because Kubernetes abstracts so many of the concepts that we as sysadmins used to know inside out, when things go wrong, it can be very hard to trace the root cause of the issues in our workloads. So what do we do? We kind of add tooling, and then adding tooling can also get kind of complex too, because then we need to learn how to use those tools, and that also has like a learning curve.
So Kubernetes kind of natively abstracts lots of the things that we used to do manually, or automatically using like Bash scripts or Python scripts or whatever scripting language we used to support. But it does have complexity built into it, and it requires a learning curve. So what can we do to sort of simplify Kubernetes troubleshooting?
So I think trying to cover that in a webinar, giving it like ten minutes, is not going to be the whole thing that’s going to make your cluster or your Kubernetes workloads reliable, but I’m just going to give a couple of best practices. And the first one is what I’m going to call invest in cluster visibility. So for any distributed system, and especially in Kubernetes, visibility is king, and when it comes to visibility, I’m just going to mention several tools in four main domains.
So the first domain that I’m going to talk about is APM, or application performance monitoring. So these tools can be software as a service, or self-hosted. So software as a service could be like Datadog or New Relic, and self-hosted could be like a Prometheus and Grafana duo. And it increases your cluster visibility in several ways.
Okay, so by adding APM, you can add performance metrics to your application, and it’s going to allow you to identify how your application behaves in your production life cycle. How so? For example, once you have your APM in place, you can test your canary or your blue-green deployment. So I deployed an application, and I want to make sure that the new application behaves the way that I want it to behave.
If I have some metrics on, for example, the amount of requests that it’s processing, or the amount of queries that it’s performing and the timing for that, I can say that my canary is actually a good deployment, so this is one reason to add your APM. Another good reason to add an APM is you could set your autoscalers to scale on specific metrics instead of like resource consumption.
So for your autoscalers, you could say okay, whenever my workload is reaching a certain CPU threshold or a certain RAM threshold, then you’d want to auto-scale. But this is kind of vague. Sometimes you want to scale on a specific metric. So for example, the amount of requests that my processing microservice is doing when it pulls messages from a queue. So if I know that I have a surge in my requests, and this is a metric that I can really count on, I can set my HPAs and my autoscalers to make sure that they scale on that specific metric, instead of just scaling on CPU or RAM.
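As a rough illustration of scaling on a specific metric, the sketch below applies the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The function name, the queue-requests metric, and the floor of one replica are assumptions made for this example; it is not the autoscaler's actual code.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximate the documented HPA replica calculation.

    current_metric and target_metric are per-pod averages of whatever
    metric you scale on (hypothetically: requests pulled from a queue
    per second). The floor of one replica is an assumption here; a real
    HPA uses its configured minReplicas and maxReplicas.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))


if __name__ == "__main__":
    # 4 pods each handling 250 req/s against a target of 100 req/s per pod.
    print(desired_replicas(4, 250, 100))
```

The same arithmetic applies whether the metric is CPU or a request count; what changes is how directly the metric reflects the load you care about.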
So this is another reason why you’d want to add an APM. And thirdly, obviously, when things go wrong, you do want to get alerts, and you want to make sure that it’s brought to your attention as an SRE or a DevOps or a production engineer or whatever your title is right now. So when things go wrong, when your metrics kind of get to a threshold or reach a certain, I don’t know, percent, or any limit that you want to set, you’d want to get alerted.
So when my time for queries is taking too long, or it takes too long for my application to load, I’d really want to get alerted as soon as it happens, so I can get to it right away, and actually, solve my issues really quickly. So this is like the APM domain to increase your cluster visibility.
The second domain I kind of want to talk about is logging, and logging, like APMs, can be self-hosted or a software as a service solution. So a software as a service solution could be like Logz.io or Coralogix or Loki, and self-hosted could be like Fluentd, ELK, that kind of tool. I think for logging, having a system to capture or present logs is one thing, but making sure these logs are made for a distributed system is another.
So basically, you have to make sure, as an SRE or DevOps, to communicate with your developers, and to make them understand what information is required from the logs, and to make sure that the logs are standardized and the log formats are kind of the same, and the severities kind of represent the same things.
So when you look at a distributed system with lots of microservices, you have some sort of standard. Say you have like a microservice architecture and you’re handling like an event flow. Things like making sure you pass through an event ID as part of the logs are going to make handling issues so much easier, if you could just filter by an event ID instead of having to go through all the microservices looking for a specific event ID. So having a system is one thing, but having a standard of what logs are meant to be, and what they’re supposed to do, is really going to increase your cluster visibility.
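A minimal sketch of that logging standard, assuming every service emits one JSON object per line with the same severity, service, and event_id fields. The field names, the JsonFormatter class, and the filter_by_event helper are all hypothetical choices for this example, not something prescribed in the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record with the same fields, whatever the service."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            # "service" and "event_id" arrive through the `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "event_id": getattr(record, "event_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def filter_by_event(lines, event_id):
    """Return the parsed log lines that belong to one event."""
    parsed = (json.loads(line) for line in lines)
    return [entry for entry in parsed if entry.get("event_id") == event_id]


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("ingest")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    # Each service passes the event ID along with every log call.
    logger.info("pulled message from queue",
                extra={"service": "ingest", "event_id": "evt-42"})
```

With a shared field like event_id, following one event across services becomes a single filter in whichever log backend is in use.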
The third domain I kind of want to talk about is alerting tools. So alerting tools, things like PagerDuty, Opsgenie or even like a Slack sync, are really important. Once your APM has kind of hit a threshold, or once you have an alert for a specific log on your logging mechanism, you really want to make sure that you’re ready with playbooks for the specific alert type, right? So you want to make sure that you have a system in place to actually call you in the middle of the night and wake you up and actually get you to do what you’re doing, and to solve the issue. This is the third domain, alerting tools.
And the last one, last but not least, is actually specialized Kubernetes tools. So APMs and logging and alerting tools are relevant for everything, not just Kubernetes. You can have them for serverless, you can have them for, I don’t know, legacy EC2s or anything like that. So specialized Kubernetes tools, like Komodor, are actually specifically designed to give a user context for a Kubernetes resource or an event. And it’s going to provide you, the user, with actionable data to act upon before and when things go wrong. So these are kind of the four main areas that I wanted to talk about for cluster visibility.
The second best practice I kind of want to talk about is making sure you standardize your processes and tooling. So Kubernetes has like an amazing community, right? And it has amazing people developing amazing tooling. And as a DevOps engineer, my fun, or my cup of tea, is to look through DevOps newsletters and subscribe to DevOps influencers like Anais and go to conventions, and the amount of tooling for Kubernetes is almost like overwhelming.
And as a DevOps engineer, we like to try everything out, right? We deploy to local clusters and check things out, and it’s almost like in our nature. And something we have to look out for when we’re planning our production infrastructure and processes around Kubernetes is making sure that we identify tools that have the same objective. I’m just going to give like a quick example. So there’s like the objective of defining different values for your environments, right? Your prod and your staging and your tests and your load environments may have like different values and configurations or image versions or secrets, or all kinds of differentiation between your environments. And you have several tools for dealing with this objective.
You have like Helm and Kustomize, and both of them kind of try to tackle the same problem. In some cases, Helm is more suitable, and in some, Kustomize. But incorporating multiple tools answering the same objective would add a level of complexity to our development or support cycle. What do I mean? Like what’s an example if I have both Helm and Kustomize? So imagine a CI pipeline for an application, and then as a part of the CI pipeline for the application, you want to lint, to go through the YAMLs or your resource definitions, to make sure that they’re sane before they’re deployed, right?
So if I’m using Helm and Kustomize, I’d have to use two different linters, I’d have to maintain different linter configurations, and then I’d have to do extra work, and complexity becomes like a native thing that I have to deal with if I have to support an application with both solutions tackling the same problem. If I look at the deploy command, and I have to wait for a resource to be ready, I do like a --wait on Helm, and then I do a kubectl wait on a Kustomize deployment.
And then this is just for like CI operations. Imagine like day-two operations, like having to roll back. So Helm provides you with a rollback mechanism out of the box, but Kustomize doesn’t. And then what if I had to roll back a complete application with some Kustomize and some Helm in it? It really brings in so much more complexity when you have two solutions for the same objective. So this kind of applies for every single objective that you’re trying to achieve with tooling in Kubernetes.
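To make the duplicated effort concrete, here is a small sketch that only builds the command lines a CI job would need for each tool. The release, chart, overlay, and deployment names are placeholders, and the functions do not run anything. The commands themselves (helm upgrade --install with --wait, kubectl apply -k, kubectl wait, helm rollback) are standard Helm and kubectl usage.

```python
def helm_deploy(release: str, chart: str, namespace: str, timeout: str = "120s"):
    """One command: Helm can wait for readiness itself via --wait."""
    return [[
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--wait", "--timeout", timeout,
    ]]


def kustomize_deploy(overlay_dir: str, deployment: str, namespace: str,
                     timeout: str = "120s"):
    """Two commands: apply the overlay, then wait for the Deployment."""
    return [
        ["kubectl", "apply", "-k", overlay_dir, "--namespace", namespace],
        ["kubectl", "wait", "--for=condition=Available",
         f"deployment/{deployment}", "--namespace", namespace,
         f"--timeout={timeout}"],
    ]


def helm_rollback(release: str, namespace: str):
    """Helm ships a rollback command; with no revision it targets the previous release."""
    return [["helm", "rollback", release, "--namespace", namespace, "--wait"]]


# Kustomize has no equivalent single rollback command, so a pipeline that
# supports both tools needs a second, hand-built rollback path as well.

if __name__ == "__main__":
    for cmd in helm_deploy("my-app", "./chart", "prod"):
        print(" ".join(cmd))
    for cmd in kustomize_deploy("./overlays/prod", "my-app", "prod"):
        print(" ".join(cmd))
```

Every extra step in the second path is something the pipeline has to maintain, test, and explain to whoever is on call.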
So how do I manage secrets? Do I use Sealed Secrets or External Secrets? How do I scale? Do I just use a vanilla HPA or KEDA? Do I use a service mesh? Do I use Linkerd or Istio or both? I think the bottom line when we’re talking about standardizing Kubernetes resources is that experimenting is great, and performing proper POCs for tooling is essential, and frankly, it’s really super fun.
But when we’re choosing something to be run on our production infrastructure, we want to make sure that we don’t use different tools for solving the same objective. So that’s like the second part. Standardizing Kubernetes processes and tooling.
Anais: And I think that really applies to not just troubleshooting or the tools that you mentioned, but really any tool in the cloud native space. For most tools, if not every tool, there’s an alternative that does something similar, right? And the way you’re talking about Helm versus Kustomize, you can talk about it that way because you know the difference, because you had the option to look at both tools, right?
But most people who are getting started, or who are looking for a specific solution, for instance, if you’re looking for the first time at Komodor, right? And you’re looking at similar tools, you don’t know yet how to reason about Komodor versus the others, the way you reason right now about Kustomize versus Helm. This is really complex when you’re just getting started with a new tool, when you’re exploring a new tool.
Similarly, when you’re looking at security scanners, you will have to invest some time into looking at the different tools and start reasoning about what is this versus another, right? And that’s going to be something that will apply for every tool out there, right? Not just Helm versus Kustomize or tools like that, but really any tool that you want to deploy in your cluster, so you should actually go ahead and reason about, okay, what does it give me that other tools don’t give me? Versus what does it not give me.
Nir: I completely agree. I think that my point specifically was do your research, do your POCs, make sure that everything is kind of in the right place, and that you can reason about why you chose a specific solution, but make sure that you don’t experiment in production with both solutions, because trying to solve the problem is only half of the solution.
Having to support it in terms of the entire delivery and day to day operation pipeline is going to be costly, and it’s going to cause some issues if you actually support both of them as part of your production infrastructure. Do that on your test environment, do that on your staging environment, fine. But once you go to prod, make sure that you have one solution to solve a specific problem.
My next kind of best practice is to treat Kubernetes resources like code. I know we’re talking about infrastructure as code, it’s really very fluffy, everyone’s talking about it, I know. But using code best practices is really going to help you, as an SRE or a DevOps engineer, to put safeguards in place to prevent untested changes from actually rolling out to production. So things like performing PRs properly, like asking the right questions as part of your pull request.
Use your linters, have proper CI pipelines, make use of admission controllers, have multiple environments for testing purposes, and down the line, staging environments and QA environments. And then, and only then, reaching prod, specifically for Kubernetes resources, is really important, because my thesis is that things break due to many reasons, but the main reason things break is changes.
And we like changes, right? We push towards shifting left and making sure that changes are really easy for developers to do right now. But we as SREs or DevOps engineers, we’re like the protectors of the production environment, and we must practice like good code hygiene and best practices to make sure that we’re in the right place in terms of how we can treat our production environment.
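As one hedged example of such a safeguard, a CI step could run a small check over a Deployment manifest before it is allowed to merge. The two rules below (no missing or latest image tag, CPU and memory limits present) are illustrative choices for this sketch and not rules named in the talk; in practice a dedicated linter or an admission controller would cover far more.

```python
def lint_deployment(manifest: dict) -> list:
    """Return a list of human-readable problems found in a Deployment dict."""
    problems = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    containers = pod_spec.get("containers", [])
    if not containers:
        problems.append("no containers defined")
    for container in containers:
        name = container.get("name", "<unnamed>")
        # Look for a tag only in the last path segment, so a registry
        # port such as registry:5000/app is not mistaken for a tag.
        last_segment = container.get("image", "").rsplit("/", 1)[-1]
        tag = last_segment.split(":", 1)[1] if ":" in last_segment else ""
        if tag in ("", "latest"):
            problems.append(f"{name}: image tag is missing or 'latest'")
        limits = container.get("resources", {}).get("limits", {})
        for key in ("cpu", "memory"):
            if key not in limits:
                problems.append(f"{name}: no {key} limit set")
    return problems


if __name__ == "__main__":
    risky = {"spec": {"template": {"spec": {"containers": [
        {"name": "api", "image": "registry:5000/api"},
    ]}}}}
    for problem in lint_deployment(risky):
        print(problem)
```

A pipeline would fail the build whenever the returned list is non-empty, which keeps the check in the same review flow as any other code change.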
Also, a good example of treating Kubernetes resources like code is making use of RBAC. So just like you wouldn’t let anyone merge into your master branch without going through a PR and having approval, and just like you would block that right away, you shouldn’t allow just anyone to change your cluster state in your production namespace, the production environment where the customer actually is. So this is the third kind of best practice.
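A sketch of what that could look like: a read-only Role for a production namespace, built as a Python dict and printed as JSON, which kubectl accepts in place of YAML. The role name, namespace, and resource list are assumptions for the example, and the allows helper is a deliberately simplified check (it ignores API groups, wildcards, and resource names), not how the Kubernetes authorizer works. A Role also only takes effect once a RoleBinding attaches it to users or groups.

```python
import json


def read_only_role(namespace: str, name: str = "read-only") -> dict:
    """Build a namespaced Role that can look at workloads but not change them."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": name, "namespace": namespace},
        "rules": [{
            # "" is the core API group (pods, configmaps); "apps" has deployments.
            "apiGroups": ["", "apps"],
            "resources": ["pods", "pods/log", "configmaps", "deployments"],
            "verbs": ["get", "list", "watch"],
        }],
    }


def allows(role: dict, verb: str, resource: str) -> bool:
    """Simplified check: does any rule list both this verb and this resource?"""
    return any(verb in rule["verbs"] and resource in rule["resources"]
               for rule in role["rules"])


if __name__ == "__main__":
    print(json.dumps(read_only_role("production"), indent=2))
```

Changes to production then go through the pipeline's own service account, the same way merges to the main branch go through a reviewed pull request.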
In terms of a conclusion, I think ensuring the right foundation for your Kubernetes environment from the get-go is really going to ease the process of troubleshooting down the line. It’s going to help you move faster, increase ownership, and bring more value to your customer. So now after talking a million, I’m just going to hand over to Anais, who I’m a big fan of, and I’m going to let her talk a little bit about security in Kubernetes.
Anais: Awesome. I don’t think I have that much to say, I’m not sure, I will try. Can you get to the next slide please, slide maintainer? Yes, awesome, thank you. Yes, quick overview of me: actually, I didn’t start in this space, right? I got started in this space at the end of 2020, after I got tired of working in the blockchain space, but there I was working on open-source projects. Then when I joined the cloud native space, I was at first not working primarily on open-source projects.
But that changed when I got started at Aqua Security at the beginning of the year. So now I’m their open-source developer advocate. You mentioned site reliability engineering a few times. I worked as an SRE last year, taught myself that, but I missed the developer advocacy side of things, it’s kind of, for me, I don’t know, I love it.
You get paid to learn, so I got back into developer advocacy. I also have a YouTube channel and weekly newsletter, so if you’re curious about that, check it out. I didn’t do a hundred days of Kubernetes, I did a bit less, a few days, but you can still follow me for new content.
Nir: Yes, definitely.
Anais: Can you move to the next slide, please?
Nir: Yes.
Anais: Awesome. So I’m talking now about Kubernetes security. If you look online at my previous presentations, you will see I have talked about multiple different topics, from GitOps and CI/CD pipelines to all kinds of things. I got started with cloud native security; however, I think I should have gotten started way earlier, when I got started in the cloud native space, and I’m going to tell you why: because the barrier to entry, well, it is high, but it’s not as high as you might think.
It’s similar to getting started with any other cloud native tool out there, and that really sounds like a way of simplifying it. But ultimately, when you get started with observability, you might get several observability tools and you learn how to use them, but for a long time you won’t understand them inside out. And it’s similar with security: when you get started with cloud native security, at the beginning, you will not understand everything that there is to cloud native security, and you don’t have to in order to get started with it.
But because there’s kind of a stigma around it, about security being reserved for a group of people who are already experts or more proficient in a specific field, people kind of try to stay away from it, which is actually a problem, because we need more security experts, more people who understand how to use the tools and how to integrate them, including the ones I’m going to show you from Aqua, from our open-source projects, into your stack, into other tools such as Komodor. Next slide, please.
So, to touch upon that in more detail, the reason it’s so difficult to get started with cloud native security is that there are multiple different components that will factor into how you approach security for your project, and how you manage security tools. So the space overall is moving very quickly, and security is not necessarily the thing that will help you to navigate the space better, right? It’s something else that you will have to keep up with.
Versus tools such as Komodor, troubleshooting tools, observability tools, they will help you to understand what’s going on, basically. Security tools will need you to be more proactive with everything going on. You will have to proactively look into what new CVEs mean, what new misconfiguration issues appear within your environments, and so on. It’s nothing that’s going to be shown to you out of the box.
Then the other thing is that there are multiple different stakeholders, of open source and of proprietary software, that all have different interests, right? They have different priorities. And a lot of companies don’t necessarily focus on investing in security and in security teams. It’s something that might be managed at the end of a sprint, in the few days that might or might not remain, who knows, right?
And then the next thing is obviously security is considered to be very difficult to get started with, and that’s also due to a lot of the tools being very immature in this space, right? Like lots of cloud native tools, they are open source first and they are not necessarily having the best user experience, and they’re difficult to navigate, right? So that’s why you might not want to or enjoy even looking at them and using them.
Then the last two things: the tech stack that you might have originally is very complex, and might not have a security tool out of the box, so you might have to learn how to use security tools related to your tech stack. And then the other thing is that not every tool will have existing integrations available, which will require you to also build those. Next slide.
And this is kind of related to the user experience of security tools, and that’s also how I feel getting started with a lot of cloud native tools: the documentation will tell me one thing, do X, and when I try to do X, it’s not there, it doesn’t work. The instructions don’t match what I’m seeing inside of my cluster, right?
And I can tell a little bit of the story as well like preparing for this webinar. I was trying to use a tool that was less mature, and I thought oh, it’s going to be easy to just configure it, but it wasn’t there out of the box and it just didn’t make much sense on the way it’s supposed to be configured. So that actually happens a lot, right?
Nir: Yes, definitely.
Anais: And the funny thing is, when I found this GIF, it’s meant as a joke, but it’s a real problem that people tell you that you have to press this or do that, and it just doesn’t exist. It’s not there out of the box, and every tool has to be configured differently. Next slide. So here are just some tips, more in terms of soft skills and priorities, on how you can approach security in an easy way. Whether you use open-source tools from Aqua Security or other open-source tools doesn’t matter, but it’s kind of something that I would recommend you do.
Next slide, so the first thing is always look at what you use, right? There’s a lot you can do early on when you choose different resources, when you’re developing your application. So different people throughout the development life cycle, if you’re working within a team can focus on securing different aspects of your stack, of your infrastructure. But ultimately, the main theme here is that don’t just trust things you find online, lots of it is open source and you have to look into how it’s maintained, who maintains it, what are their objectives as well.
So validating what you use, how you use it, what you deploy is always better than just going ahead and trusting those tools, right? Even if a tool has millions of downloads, it could always be something. So looking at the individual components that you’re using and trying to understand their security, potential security issues will help you down the line even if you don’t understand and you don’t have to understand every detail of those potential security risks.
Next slide. Another thing that we should do, and that’s kind of related to shifting left: shifting left is a great thing. But ultimately, I have some problems with the theme itself, because a lot of times, it results in engineers and people who have a different day-to-day job having to do more work, and that shouldn’t be the goal, right? If you just say every engineer should do everything or should be empowered to do everything, then you might end up in a situation where nobody does anything.
And that’s a big issue. So definitely empower people within your team, but also assign responsibilities and ownership of those tasks. Like everybody should be able to contribute, but there should be a clear path on how to contribute, and ownership of when to contribute for example to the security of your infrastructure of your tools.
Nir: I just want to say I completely agree on the point. I think having everyone being able to contribute is one thing, but having a designated area of responsibility, specifically for security but also for other things, is really going to help you also empower the people. So once you’re going to try and contribute to a specific area, having someone to discuss with, and to talk to and get advice from on how to actually implement that, is really important for you to be able to shift left the decision tree. So I’m strongly a plus one on this point.
Anais: The thing is also like with, for example, I would use Komodor if I’m already in a situation where I have to figure things out, right? Where I have to understand things better. So it proactively helps me with my job, right? Versus security tools often they just add additional tasks to your job, which is also the difference, right? Of like how those tools have to be approached differently.
And then the last thing is that automation is great, but obviously, and if you watch this episode of the Simpsons, you see that it can also easily fail such as this little bird just flipping over and not being able to press the key anymore, and that’s also how a lot of automation pipelines might end up.
That automation is great, but only up to a certain level, and also only up to the level that you understand it, right? It won’t cater for edge cases necessarily; it won’t do the proactive tasks for you either. So those are the main themes. Now the next thing, is the demo, right?
Nir: Yes. I think specifically for automation, I think you said correctly. It’s not like once you automate one sort of task you’re done with it, right? It’s never the case. You’re like okay, you automate one use case and then your automation kind of breaks, and then you automate another part of your pipeline, and then well, there’s a different edge case which kind of breaks the whole concept.
So automation is not like a silver bullet, which kind of solves the whole problem. You’d always have to revise those automated processes. You have to make sure, especially for security, that they’re up to date and that they’re actually validating against, say, the latest versions of the CVE repositories and that kind of stuff.
You have to make sure that they’re up to date, and you have to give them some care and love. So once you automate the process, it’s not like you’re over and done with it. You have to continuously make sure that these processes are going to work and that they’re still valid.
Anais: And also understanding when those processes fail, right?
Nir: Definitely.
Anais: Yes.
Nir: So let’s do a demo. Anais, are you excited about the demo?
Anais: I am excited.
Nir: I am.
Anais: Fingers crossed.
Nir: Yes, okay. So I’m just going to give like a quick overview about how Komodor can help you, and we’re just going to talk about two different scenarios, if I could just get my screen to work, yes. Can you see my screen?
Anais: I think so.
Nir: All right, cool. Okay, so first off, this is Komodor, Komodor is quite the platform and I’m kind of like in love with it. So I’m a bit biased, but I think it’s a really good platform for someone who’s trying to understand what’s going on in their cluster, and trying to debug a specific solution. I’m going to talk with two hats.
So the first hat is like an SRE or a DevOps engineer who’s trying to look and get an overview of their system, and then the other hat is going to be a specific developer working on a microservice, trying to solve a specific problem in their environment. So I put on the DevOps hat. I’m now looking at what we call the events page, and the events page is something that’s going to allow us, as an SRE or a DevOps engineer, to make sure that we have a bird’s-eye view of everything that’s happened inside our system. This includes different clusters, I can filter them out, this includes all the namespaces in all the clusters, and includes all the services that I have.
And I have like a really cool widget here which I could select time from and to see all the events in that period of time. So I can just like pick this and this, and see a specific amount of events. So I’m just going to look at the specific event here, I’m just going to pick a random one.
And this is an availability issue, and this availability issue for a specific service I can see that there was an OOMkilled, and as part of this really cool feature that Komodor is showing, I’m going to see that the logs from where the pod actually failed. So you know how different, instead of having to log on and trying to figure out exactly what happened in my log scenario, I can get the right context of when this happened and get the right logs for that.
And I can look and see exactly what happened and see with the pod events, and all kinds of really cool stuff which going to give me a good snapshot of what happened at that time. So now if I’m just going to pick this specific service, and see all kinds of other stuff that I can see here. Let’s just pick like a deploy event. So lots of stuff that Komodor is going to help you try and get context of is what happens during a deploy, and because Komodor is an agent installed in your cluster, you get a diff of everything that’s changed between deploys.
So we get a specific diff of any annotation or environment variables or limits which are changed as part of the deploy, and then you can get visibility on to what happened when you did this change. But a really cool feature that Komodor is going to show is an integration with GitHub. So if you incorporate your GitHub say a GitShot reference as part of your CI pipeline, you can get for each deploy exactly what happened between the last deploy.
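[Editor's illustration] The deploy diff described here can be pictured as a recursive comparison of two configuration snapshots. The sketch below is only an assumption about the general idea, not how Komodor implements it; the function name `diff_specs`, the snapshot layout and the sample values (`git-sha`, `CACHE_MEM_SIZE`) are all hypothetical.

```python
def diff_specs(old, new, prefix=""):
    """Return (path, old_value, new_value) for every leaf that differs.

    Hypothetical helper: nested dicts are walked recursively, and a key
    missing on one side is reported with None for that side.
    """
    changes = []
    for key in sorted(set(old) | set(new)):
        path = f"{prefix}.{key}" if prefix else key
        before, after = old.get(key), new.get(key)
        if isinstance(before, dict) and isinstance(after, dict):
            changes.extend(diff_specs(before, after, path))
        elif before != after:
            changes.append((path, before, after))
    return changes


# Made-up snapshots of two consecutive deploys (illustrative values only).
previous = {
    "annotations": {"git-sha": "abc123"},
    "env": {"CACHE_MEM_SIZE": "256"},
    "limits": {"memory": "512Mi"},
}
current = {
    "annotations": {"git-sha": "def456"},
    "env": {"CACHE_MEM_SIZE": "512"},
    "limits": {"memory": "512Mi"},
}

for path, before, after in diff_specs(previous, current):
    print(f"{path}: {before} -> {after}")
```

Only the annotation and the environment variable are reported, because the memory limit is identical in both snapshots.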
So in this deploy, a guy who’s an engineer here at Komodor changed a cache mem size. And from this deploy, I can see what happened inside the image, which is really cool, because for this caching service I can’t see this when I just log into Kubernetes. I’d have to go to Git and understand what happened, and this is going to shorten my way to understanding what happened in this deploy. So this is my hat as an SRE, looking at the entire system, filtering specific events and trying to understand what goes on, and Komodor gives you great visibility for that.
So when I switch my hat to a developer, and I’m just going to say that I’m looking at a specific environment, let’s say I’m working on a service called eventspooler, right? Eventspooler is a deployment, and eventspooler has a configuration defined in a config map which sets the API rate limit. And let’s pretend I’m working on a feature which is going to increase the API rate limit.
So I can see that the API rate limit is actually configured in the config map, but while I was working on that feature, someone called me just in the open space and said okay, something is going on in your production infrastructure, there’s a deployment failing or something like that, and because I’m a good DevOps engineer, I just do a really quick switch to production.
And I’m trying to figure out exactly what’s going on. So I change my context, I go to deployments and see what’s going on in the failed service, right? And I check it out, I’m just checking it out, seeing what’s going on, looking at logs, seeing stuff like okay, I’m going to say something to Udi, saying look, your data sync is not there, it’s not in the image, check something out, and then I try to kind of go back to what I’m doing, right?
But as part of what I’m doing, I kind of forgot that I changed the context, and so I end up changing the config map in the production environment. So I’m just going to do that real quick. So I’m just going to change the API rate limit to something big. And then as part of doing that, I also want to kind of scale it out, because I know that my eventspooler is kind of slow. So I’m going to go to my deployment and I’m going to scale that out to like 10, why not?
So while I do that, I can go to Komodor and see what it looks like from my perspective as a developer. So I can go to the services view and look at the eventspooler, and straight away, what I can see is best-practice checks on my deployment, even without anything happening, right? I can see that I have issues with memory limits, my deployment just doesn’t have any memory limits or CPU limits.
This is really important for me when I’m looking at the deployment, it’s going to help me make my deployment better. When we don’t have memory limits or CPU limits, the quality of service that we’re getting for our workload is Burstable, which means that if our node is overcommitted, with all kinds of workloads driving up CPU and memory consumption, our workload could actually be evicted, and we really don’t like that. So we want to make sure that this is stuff that we alert on.
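[Editor's illustration] Kubernetes assigns a pod one of three QoS classes based on its containers' CPU and memory requests and limits: Guaranteed when every container has limits equal to requests for both resources, BestEffort when no container sets any of them, and Burstable otherwise. So a workload without limits ends up Burstable if it still declares requests, and BestEffort if it declares nothing. The sketch below is a simplified version of that rule; the function name `qos_class` is hypothetical, and quantities are compared as plain strings, so `"500m"` and `"0.5"` would wrongly count as different.

```python
def qos_class(containers):
    """Simplified QoS classification for a list of container specs (dicts)."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        for name in ("cpu", "memory"):
            limit = limits.get(name)
            # Kubernetes defaults a missing request to the limit, if one is set.
            request = requests.get(name, limit)
            if limit is not None or request is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


no_resources = [{"name": "eventspooler"}]
requests_only = [{"name": "eventspooler",
                  "resources": {"requests": {"cpu": "100m", "memory": "128Mi"}}}]
pinned = [{"name": "eventspooler",
           "resources": {"requests": {"cpu": "100m", "memory": "128Mi"},
                         "limits": {"cpu": "100m", "memory": "128Mi"}}}]

print(qos_class(no_resources), qos_class(requests_only), qos_class(pinned))
```

Under node pressure, BestEffort pods are generally the first candidates for eviction, then Burstable pods that exceed their requests, which is why a missing-limits warning is worth alerting on.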
The second alert is saying that the tag is not specified, and it’s true, I have a latest tag, and it’s a really bad practice. It means that whatever anyone pushes to that tag is automatically going to be pulled into the deployment, so I don’t want that. So if I was a good developer and I looked at this screen, I’d make sure that things are actually working fine.
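[Editor's illustration] A check like this second alert can be approximated by parsing the image reference and flagging anything that has no tag or uses `latest`. This is a minimal sketch under that assumption, not Komodor's actual rule; the function names and image names are hypothetical.

```python
def image_tag(image):
    """Return the tag of an image reference, or None if no tag is given."""
    name = image.split("@", 1)[0]            # drop a digest such as @sha256:...
    last_segment = name.rsplit("/", 1)[-1]   # skip a registry port like host:5000/
    if ":" in last_segment:
        return last_segment.rsplit(":", 1)[1]
    return None


def uses_floating_tag(image):
    """True when the image would resolve to whatever was pushed most recently."""
    if "@" in image:
        return False  # pinned by digest, so the content cannot change
    tag = image_tag(image)
    return tag is None or tag == "latest"


for ref in ("eventspooler",
            "eventspooler:latest",
            "localhost:5000/eventspooler",
            "registry.example.com/eventspooler:1.4.2"):
    print(ref, uses_floating_tag(ref))
```

Only the last reference, which pins an explicit version, passes the check.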
So going back to what we did earlier, we did a deploy, and we changed the number of replicas from one to ten, right? But Komodor is doing something really cool here. We’ve only changed the deployment, right? We’ve only changed the number of replicas for this deployment. But Komodor is aware that this deployment is actually using a config map for the API rate limit, and it’s going to show me that the config map was also changed for this deployment.
And it correlates that together with the deployment, to make sure that we’re actually shown the context of everything that’s changed for that specific deployment, and it’s going to give you a good idea of what’s going on. So I think that’s mostly it, what I wanted to display here in Komodor, and I’m going to give the stage over to Anais, and she’s going to show a little bit about the operator.
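[Editor's illustration] The talk does not say how Komodor links a config map to a deployment. One plausible way, shown here purely as an assumption, is to collect every config map name the pod template refers to through the standard Kubernetes fields `volumes[].configMap`, `envFrom[].configMapRef` and `env[].valueFrom.configMapKeyRef`. The function name and the manifest contents below are hypothetical.

```python
def referenced_config_maps(deployment):
    """Return the sorted names of config maps used by a Deployment manifest (dict)."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    names = set()
    for volume in pod_spec.get("volumes", []):
        if "configMap" in volume:
            names.add(volume["configMap"]["name"])
    for container in pod_spec.get("containers", []):
        for source in container.get("envFrom", []):
            if "configMapRef" in source:
                names.add(source["configMapRef"]["name"])
        for env in container.get("env", []):
            ref = env.get("valueFrom", {}).get("configMapKeyRef")
            if ref:
                names.add(ref["name"])
    return sorted(names)


# Made-up manifest: the config map name and key are illustrative only.
eventspooler = {
    "kind": "Deployment",
    "metadata": {"name": "eventspooler"},
    "spec": {
        "replicas": 10,
        "template": {"spec": {"containers": [{
            "name": "eventspooler",
            "env": [{"name": "API_RATE_LIMIT",
                     "valueFrom": {"configMapKeyRef": {
                         "name": "eventspooler-config",
                         "key": "apiRateLimit"}}}],
        }]}},
    },
}

print(referenced_config_maps(eventspooler))
```

With a mapping like this, a change to `eventspooler-config` can be shown on the same timeline as the replica change, even though the deployment object itself never mentions the new rate limit value.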
Anais: Okay.
Nir: Anais, so you want to share your screen?
Anais: Yes.
Nir: All right, fingers crossed it’s going to work just fine.
Anais: Okay. You can see my screen?
Nir: Yes.
Anais: Okay. So this is Trivy, Trivy is our all-in-one security scanner. With Trivy itself, you would usually go ahead and install the Trivy CLI, and you would use the CLI directly from your terminal, or you would use it in your CI/CD pipeline. The thing is that this is obviously focused more on engineers scanning resources for security issues. The CLI tool is not focused on cluster admins or security professionals who want to see what’s going on inside a Kubernetes cluster.
So in addition to Trivy, we have the Trivy operator. And the Trivy operator is basically a set of Kubernetes custom resource definitions, so basically Kubernetes resources that you can apply to your Kubernetes cluster to extend the Kubernetes API. That’s ultimately what we talked about at the beginning of the talk, that the power of Kubernetes is really that it’s so extendable, and that you can deploy pretty much any application to your cluster and extend and build upon the Kubernetes API to retrieve information as well as push information to your cluster.
Now for those who are new to Kubernetes operators, this is basically how it looks within your cluster. So you have your Kubernetes cluster, and then you have a controller, which is kind of the operator, living inside your cluster, and that’s deployed through a set of YAML manifests. And a controller or operator in Kubernetes usually has a specific task, it’s one of those automation tools that’s supposed to automate a task that’s usually done by somebody manually.
Now this is all done within your Kubernetes cluster, so anything that’s living within the Kubernetes cluster, it will just build upon that. So for instance, the Trivy operator lives inside of your Kubernetes cluster, and then it can scan, for example, your deployments for vulnerabilities and other issues within your cluster. Right now it’s scanning for configuration issues and vulnerability issues within your cluster. We’re going to extend it to do additional scans. But ultimately, at this point everything is within your cluster. And I’m just going to show you what that looks like, so here’s my terminal, I just have to move the screen, otherwise I don’t see you.
Let’s see, is it going to open? You did well in opening it before. Okay, so here’s my operator service, but ultimately if I go to pods, I have here all the eventspooler pods that we just scaled up. And then we have here the Trivy operator within our demo namespace. And the Trivy operator is just another Kubernetes deployment that’s living within your cluster, and whenever there is a new container image detected within the cluster, within this namespace specifically, it’s going to do a vulnerability scan.
And the result of that vulnerability scan is going to be saved as a CRD, as a Kubernetes custom resource definition, called vulnerability report. So we can query the cluster for vulnerability reports, and we see our vulnerability reports from within that namespace. So for instance, I have here, for my example application, a vulnerability report, and it has five critical vulnerabilities, 33 high vulnerabilities, 11 medium vulnerabilities and so on.
Now this is kind of an overview that you can see through a tool such as k9s, and you can also use other integrations such as Lens to get vulnerability reports. Or you can also see the operator itself through Komodor, and that’s just the beauty of it, that while you’re using the Trivy operator, you would still want to use, and you have to use, all kinds of other monitoring and troubleshooting solutions to see what’s going on within your cluster.
So the Trivy operator is kind of living within your cluster and you would want to set up alerting rules, and other rules to be notified whenever there are changes, whenever there is a new vulnerability report, for example, with critical vulnerabilities, that’s when you want to get notified. But it shouldn’t be something that you actively have to go and check, you just want to be notified.
For example, in Komodor, you want to be notified if the Trivy operator is down, because then it can’t run new vulnerability scans, and then it can’t notify you of new critical vulnerabilities within your cluster. So you want to be notified when your Trivy operator doesn’t work as expected, and you additionally want to be notified if, for instance, there are critical vulnerabilities in new reports within your cluster.
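[Editor's illustration] An alerting rule on vulnerability reports boils down to a threshold check on each report's summary. The sketch below assumes reports have already been fetched as dicts and that the summary exposes a `criticalCount` field under `report.summary`; treat that field path as an assumption to verify against the Trivy operator's CRD reference. The report names are made up, and the counts for the first one are the figures mentioned in the demo.

```python
def reports_needing_attention(reports, threshold=1):
    """Return (report name, critical count) for reports at or above the threshold."""
    flagged = []
    for report in reports:
        summary = report.get("report", {}).get("summary", {})
        critical = summary.get("criticalCount", 0)  # assumed field name
        if critical >= threshold:
            flagged.append((report["metadata"]["name"], critical))
    return flagged


# Hypothetical report objects, reduced to the fields used above.
sample_reports = [
    {"metadata": {"name": "example-app-report"},
     "report": {"summary": {"criticalCount": 5, "highCount": 33, "mediumCount": 11}}},
    {"metadata": {"name": "good-image-report"},
     "report": {"summary": {"criticalCount": 0, "highCount": 0, "mediumCount": 0}}},
]

for name, critical in reports_needing_attention(sample_reports):
    print(f"notify: {name} has {critical} critical vulnerabilities")
```

A notification hook would then fire only for the first report, so nobody has to go and check the reports by hand.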
Now in this example application that I deployed here, there are those critical vulnerabilities, but the count alone doesn’t tell you a lot. You can then also query the YAML of that vulnerability report, and here you will see the details of all of the vulnerabilities that are in it. And within the URL, you can see more details about that specific vulnerability, and ways to fix it.
Now I’ve set the operator, in this case, to only show us vulnerabilities that can already be fixed, so it will only report vulnerabilities that can already be fixed, and I fixed those in a new container image. So over here, this is the container image that’s currently deployed, it’s running within my Kubernetes cluster, and it’s the one the Trivy operator scanned. Now for the Komodor demo, I built a new container image and I just called it the good image, and we can now go ahead and update our deployment within our cluster.
So I can go ahead and say kubectl apply, and then it’s within the manifest folder I think, it’s everything, is this the Trivy? No, this is the wrong repository. Let me open this terminal, and then you can see here other things that I tried. Okay, so over here I want to now deploy the manifest update. So I’m going to say kubectl apply -f, and then it’s within deployment and manifests. And I’m just going to apply it to our demo playground, otherwise it’s not going to work.
So that’s the updated container image, and now we can go back to our cluster, and we can see the updated container image running within our cluster as part of this deployment, hopefully, and here’s already our vulnerability scan. So let’s go back to pods. Once there’s a new deployment, the Trivy operator will run a job that’s basically scanning our new deployment, in this case, our new container image that we just deployed to the cluster.
So when we go back to vulnerability reports, we can then see the new vulnerability report here. And in the updated container image, we don’t have any critical or high vulnerabilities, or any other vulnerabilities that have a fix available. Now there might be vulnerabilities within that container image that don’t yet have a fix available, meaning they’re vulnerabilities that are known of in this space, that have CVEs available, but you can’t proactively fix them, or the maintainers of the base image that I’m using can’t fix them yet, so that’s what it means ultimately. And this is what the Trivy operator does within the cluster. Any questions?
Nir: I have a question, because it’s one of the first times that I’ve seen the Trivy operator. So you mentioned that it’s not going to show a vulnerability if it can’t be fixed. How is it distinguishing between the ones that can be fixed and the ones that can’t?
Anais: So let me maybe go ahead, and what can I scan? So for instance, let me scan this container image, the previous one, and then I can show you how that works. Now was it react example up, something like this? 8.0.0. So I can take this container image, and I can use, in this case, I’m going to use the Trivy CLI. And with the Trivy CLI, I can scan any container image, I don’t have to have it locally. So for instance, before I deploy or choose container images, I can go ahead and scan them.
So you can say trivy image, and then I can scan my container image. So this is going to go ahead and pull the container image, by default from Docker Hub, but you can also choose any other container registry. And it’s going to scan that container image, and it’s going to list all of the vulnerabilities, and it’s going to classify them, whether they’re critical, high vulnerabilities, low vulnerabilities, whatever they are.
Now Trivy knows which version is installed within the container image, and it can see, based on its vulnerability database, if this vulnerability has already been fixed. And this is the fixed version, but the fixed version is not used within the container image. And that’s how it works: Trivy basically has the Trivy database, that’s also an open-source tool, so let me just move this away. So within the Aqua Security Git repository, we have Trivy and we have the Trivy database, which is a separate project, but which Trivy uses under the hood.
And it’s basically pulling vulnerabilities from a list of different sources on a six-hour period. So every six hours it’s updating its vulnerability database, and it knows then if there are new vulnerabilities, or if vulnerabilities that didn’t have a fix available before now have a fix available. And that’s how it can provide you this information. The thing is, as you can see, this list is for just one container image and it’s very long.
Now imagine you have multiple container images running within your cluster, or you have larger container images. Like if I would go ahead and scan an older version of Ubuntu, I would have lots and lots more vulnerabilities. So it would basically spam me with vulnerabilities, some of which might have a fix available, some of which might not. So whenever I say trivy image, I want to specify that I only want to see vulnerabilities that have a fix available. So I can say, go ahead.
Nir: This is a configuration you can also set for your operator, obviously?
Anais: Yes, exactly. So that’s what I actually set in the operator, I set ignore unfixed to true. I want to ignore all of the unfixed ones. It’s not going to make a difference to me if there are unfixed vulnerabilities within my cluster, that’s ultimately that. So that’s how Trivy makes a distinction between fixed and unfixed vulnerabilities, and decides which ones are shown.
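[Editor's illustration] The distinction between fixed and unfixed findings comes down to whether the vulnerability database knows a fixed version for the installed package. The sketch below applies that filter to a report shaped like Trivy's JSON output, where each entry under `Results[].Vulnerabilities[]` may carry a `FixedVersion`. The sample report is made up, and the CVE identifiers are placeholders, not real advisories.

```python
def fixable_vulnerabilities(report):
    """Return only the findings for which a fixed version is known."""
    fixable = []
    for result in report.get("Results", []):
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("FixedVersion"):
                fixable.append(vuln)
    return fixable


# Made-up report with placeholder IDs and versions.
sample_report = {
    "Results": [
        {"Target": "example-app (os packages)",
         "Vulnerabilities": [
             {"VulnerabilityID": "CVE-0000-0001", "PkgName": "libfoo",
              "InstalledVersion": "1.0.0", "FixedVersion": "1.0.1",
              "Severity": "CRITICAL"},
             {"VulnerabilityID": "CVE-0000-0002", "PkgName": "libbar",
              "InstalledVersion": "2.3.0", "FixedVersion": "",
              "Severity": "HIGH"},
         ]},
        {"Target": "package-lock.json", "Vulnerabilities": None},
    ]
}

for vuln in fixable_vulnerabilities(sample_report):
    print(vuln["VulnerabilityID"], vuln["InstalledVersion"], "->", vuln["FixedVersion"])
```

On the command line, the same effect comes from passing `--ignore-unfixed` to `trivy image`.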
Nir: I’m going to give the people in the webinar, the participants an option to ask some questions. And if not, I have another question if you don’t mind. So it looks like, sorry, I’m just going to mute.
Udi: Looks like Q&A tab is empty, last chance for anyone in the audience to raise a question to Nir or Anais, or myself. So I think Nir, you got a chance to ask your question.
Nir: I’m just wondering, my DevOps mind is thinking. So Anais, can the Trivy operator also act as an admission controller? So can I stop a deployment from entering my cluster if it has vulnerabilities? Or is it something that I should include as part of my CI pipeline, as a command line?
Anais: You would want to use that as part of your pipeline, to say if the container image has certain vulnerabilities. You can specify if it has certain, like specific types of vulnerabilities, and you can also, now with Trivy, with the latest version, you can configure like Trivy plugins. So you can say, you can specify for example additional configuration files through which Trivy can know whether or not the vulnerability also affects your specific resources.
So a lot of times, we have a lot of vulnerabilities, but they might not actually affect our resources, right? Like they might be there, especially if you use certain combination of older versions of different types of tooling, right? Then you’re not going to be affected by those vulnerabilities.
And you might want to know that, especially if it’s quite a hassle to update those resources, right? So yes, you can also check, as part of your pipeline basically, whether or not those vulnerabilities affect you. That’s what you would do. But it’s not automatically doing that, you will have to run the steps.
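[Editor's illustration] A pipeline gate of the kind discussed here can be reduced to a function that turns scan findings into an exit code. The sketch below is an assumption about one way to wire it up: the function name, the `not_affected` allow-list and the sample findings are hypothetical, and the IDs are placeholders. Trivy's CLI offers comparable built-in behavior through its `--severity` and `--exit-code` flags.

```python
def gate_exit_code(vulnerabilities, blocking=("CRITICAL", "HIGH"), not_affected=()):
    """Return 1 (fail the build) if any blocking finding remains, else 0.

    `not_affected` lists IDs the team has reviewed and decided do not
    affect their workload, so those findings never block a deploy.
    """
    blockers = [
        v for v in vulnerabilities
        if v.get("Severity") in blocking
        and v.get("VulnerabilityID") not in not_affected
    ]
    return 1 if blockers else 0


# Placeholder findings, illustrative only.
findings = [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-0000-0003", "Severity": "LOW"},
]

print(gate_exit_code(findings))
print(gate_exit_code(findings, not_affected={"CVE-0000-0001"}))
```

A CI step would call `sys.exit(gate_exit_code(...))`, so the deploy stops on the first call and proceeds on the second, where the critical finding has been reviewed and marked as not affecting the workload.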
Nir: You have to configure it like as part of your automation?
Anais: Yes.
Nir: Very cool. Udi, I think you can take the stage.
Udi: Okay. Anais, do you have a question for Nir, perhaps?
Anais: Do I have a question, I’m out of questions. What’s next for Komodor?
Nir: It’s a very big question, I don’t know if there is time to answer.
Anais: I know.
Nir: I think a really cool thing that we’re trying to do next is trying to give a DevOps engineer or developer the right context for different kind of issues.
So for example, if we’re trying to understand network issues, we’re going to give a good context of all kinds of resources that are related to network. Or if we’re looking at persistent data, and we have different kinds of resources which relate to that, we want to give a dedicated dashboard for that. That, together with showing all kinds of other resources across multiple clusters, having the ability to manage multiple clusters and seeing the difference between them, is what we’re working on right now.
I think Komodor is really exciting, and I know that the new features that are coming through are going to be super cool for anyone who’s trying to understand what’s going on in terms of troubleshooting, and I’m excited for what’s ahead. So continue logging in, look at other webinars, we’re going to show some surprises next. That’s it.
Udi: Yes. So if there are no more questions, [Inaudible 00:56:28.20]. So I just want to thank everyone for joining us, and thanks Nir, and especially Anais for joining us and bestowing everyone with your knowledge and experience.
I hope everyone that joined learned at least something new today. And like Nir said, keep checking in, we’re constantly working and changing and improving, same as Aqua. So stay in touch, follow Anais on Twitter, and we’re going to have a toast.
Nir: Wow, Anais you’re not ready for that either.
Anais: Why did you prepare without me? You should have let me know.
Nir: It was a mid-session, I don’t know, I just see like a shot.
Anais: Just bring you that?
Nir: No, I mean this is Udi, this is the perks of working with Udi.
Anais: He just brings you random stuff, yes.
Nir: He just brings you random shots, that’s part of your webinar. So cheers, Anais.
Anais: Cheers.
Anais: So with [Inaudible 00:57:35.18] and then we always have secure and reliable clusters, and [Inaudible 00:57:40.04]
Anais: Amazing.
Nir: Bye, take care.
Anais: Bye.
[End of Recorded Material]