#028 – Kubernetes for Humans Podcast with Chris Bailey (IBM Instana)

Itiel Shwartz: Hello everyone, and welcome to another episode of Kubernetes for Humans Podcast. Today with me on the show, we have Chris. Chris, I’ll be happy if you can introduce yourself.

Chris Bailey: Yeah, sure. I’m Chris Bailey. I work for IBM, and I am the CTO for Instana Observability, which is our modern full-stack observability solution for operators and SREs.

Itiel Shwartz: So, just for people who don’t know Instana, I would say it’s quite similar to Datadog, New Relic, and other players. Is that a correct comparison?

Chris Bailey: Yeah, we’re all trying to solve the same problem. There are differences in the approaches each of us takes, but yes, we’re competing products.

Itiel Shwartz: Before we talk about Instana and what you’re currently doing, I’d be happy if you could share a bit about your background—how you got into tech, observability, and in general, your journey.

Chris Bailey: My entry point into working on technology was actually working on implementations of Java and JVMs. I started by working on programming languages and runtimes for programming languages. From Java, I moved a little sideways and started working in the open-source community for Node.js in its earlier days. I led working groups on things like monitoring, diagnostics, performance benchmarking, and so on.

Itiel Shwartz: Chris, I have to ask, how did it happen—moving from Java to Node.js? It’s quite a big jump; it’s not the usual path most people take.

Chris Bailey: Some people think it’s a small jump because, you know, Java and JavaScript sound similar. But it’s more about the context. I’ve been with IBM for 23 years, working on IBM’s implementations of Java. Back in the day, this was before open source was prevalent. Sun, as the creator of Java, created the Java specification and the specification for the JVM. Other vendors could then create compliant versions, primarily for their own operating systems and platforms. IBM took Java and made it run on IBM hardware—Power, AIX, etc. As IBM became interested in other languages like Node.js, that’s where the sideways move happened. I was working on foundational runtime technology for Java, and we were interested in doing similar things for other languages. At the lowest level, there’s a lot of commonality—most runtimes are written in C, C++, or Assembler. They involve interpreting the higher-level language into a set of operation codes or opcodes. So, it’s more of an evolution than you might expect. Of course, at the programming level, yes, you’re going from a typed language to a very dynamic language, which is a bit different. But that was my first step—from Java to Node.js.

From Node.js, I then started working in the Swift community. IBM had a deep relationship with Apple around mobile development. As Swift started to replace Objective-C, we helped Apple open-source the project. They were interested in making Swift a widely adopted language, not just one inside the Apple ecosystem. Myself, some of IBM research, and some of IBM development worked to get Swift up and running on Linux, get it into server environments, create an HTTP stack for it, create a microservice framework, and build a server-side ecosystem around it. My career started with programming languages, then moved up to building frameworks and microservice frameworks around them. After that, I moved on to developer tools, CI/CD pipelines, and deployment capabilities, particularly around microservices and cloud-native.

Itiel Shwartz: That’s quite an interesting and unique route. I don’t think I’ve had many guests on the show who come from such a low-level background, working on C, C++, and Assembler, then moving up the stack to cloud-native. It feels like almost a completely different ballgame. What was your first project when you moved to building CI/CD, and what was it like?

Chris Bailey: IBM was building its own platform around Kubernetes. We were working in the Tekton community to help build out Tekton’s build capability and have that as part of our platform.

Itiel Shwartz: How long ago was that? Tekton is quite new, right? Was it four or five years ago?

Chris Bailey: Yes, about four, maybe five years ago. I was working around Tekton, Argo CD, and how you specify and lay out GitOps project structures. We were making it easy for developers and application teams to build cloud-native solutions.

Itiel Shwartz: Can I ask, what is the motivation for IBM to invest in something like Tekton? Why does IBM invest in the overall ecosystem to make it easier for companies to adopt Tekton, Kubernetes, and so on? What’s the strategy and goal behind it?

Chris Bailey: If you look at the vendors doing this—Rancher, for example—they’re creating a distribution of Kubernetes and simplifying it to make it easier for developers and operations teams. Red Hat, a subsidiary of IBM, does the same with OpenShift and Red Hat developer tools. VMware does the same with Tanzu. This is how you go from Kubernetes as a bare substrate to building a platform and ecosystem around it, making it easier for enterprise customers to adopt, run at scale, and standardize.

Itiel Shwartz: So, IBM’s endgame is to help large enterprises utilize cloud-native resources more effectively by investing in the ecosystem, making it easier for companies to adopt Kubernetes and related technologies. Is that the KPI here?

Chris Bailey: Exactly. That’s what we do through Red Hat and OpenShift. OpenShift is like Red Hat Enterprise Linux. It’s an opinionated platform, taking the core Linux kernel and adding device drivers and packages to provide a complete platform. OpenShift does the same for Kubernetes—it includes service mesh, build capabilities, and continuous deployment, ensuring these components integrate smoothly and work together. This provides a simplified solution for large enterprise customers.

Itiel Shwartz: That makes a lot of sense. Now, let’s get back to your journey. You were working on Tekton, which was seen as a potential replacement for Jenkins but still in collaboration with it. Can you give us a bit of insight into why people should use Tekton, what the alternatives are, and what the current status is?

Chris Bailey: I’ll start by saying I haven’t worked on Tekton for three or four years now. That work is being carried forward inside our OpenShift team. But back then, there were many vendors taking slightly different approaches to solving the same problem. Tekton’s big advantage was making CI/CD cloud-native. It runs on Kubernetes itself and uses CRDs to declare stages and steps. For anyone familiar with the Kubernetes ecosystem, configuring and deploying Tekton is natural—it’s an extension of what you’re already doing.
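For readers who haven’t used Tekton, here is a minimal sketch of what “CI/CD as Kubernetes CRDs” means in practice. It is a hypothetical single-step build Task created through the official `kubernetes` Python client; it assumes a cluster that already has Tekton installed, and the image, registry, and names are illustrative only.

```python
# A minimal sketch: Tekton models pipeline stages and steps as Kubernetes custom
# resources, so a CI step is created and managed like any other cluster object.
# Assumes a cluster with Tekton installed and a configured kubeconfig.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster

build_task = {
    "apiVersion": "tekton.dev/v1",
    "kind": "Task",
    "metadata": {"name": "build-image"},
    "spec": {
        "steps": [
            {
                "name": "build",
                "image": "gcr.io/kaniko-project/executor:latest",  # illustrative builder image
                "args": [
                    "--dockerfile=Dockerfile",
                    "--destination=registry.example.com/app:latest",  # hypothetical registry
                ],
            }
        ]
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="tekton.dev", version="v1", namespace="ci", plural="tasks", body=build_task
)
```

Because the Task is just another Kubernetes object, it can be versioned, templated, and deployed with the same GitOps tooling (for example Argo CD) used for the application itself, which is the point Chris makes about Tekton feeling like an extension of what you are already doing.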

Itiel Shwartz: When you left the project, what was the status in terms of adoption or features?

Chris Bailey: From our perspective, we got it to the point where it was becoming an integrated part of the platform. When I moved roles, the last thing I did was hand over the work we had been doing to Red Hat. OpenShift build pipelines are Tekton, and OpenShift continuous delivery is Argo CD. So, it’s now completely productized as a standard part of the OpenShift platform.

Itiel Shwartz: That’s super cool. Where did you go after that? What was the next step for you?

Chris Bailey: After that, I moved into observability and operations, which is where I am now. Instana was a company that IBM acquired. I worked as part of the acquisition team, evaluating Instana and working with them to see if our vision for observability aligned with their ethos and DNA. For the last four years, I’ve been working with the Instana team and now lead them from a technical perspective and strategy point of view.

Itiel Shwartz: Before we dive into Instana and what you do differently, I’d like to say that even before the acquisition, Instana was a unique piece in the ecosystem. There were legacy systems like AppDynamics and perhaps Dynatrace, which were the first widely adopted generation of APM solutions. Then came the new kids on the block like Datadog and New Relic, making APM more accessible. Instana seemed to be saying, “We’re also going to be in APM, but much more cloud-native and containerized.” They took old concepts—observability, detecting issues, investigating them, solving them—and reimagined them for the cloud-native era. I was a big fan of Instana back in the day, and they had a cute robot as a mascot, which was nice. Why did IBM acquire them? What was the synergy, and how are things going four years after the acquisition?

Chris Bailey: To your first point, a lot of the major vendors in the observability space—New Relic, Dynatrace, Datadog, AppDynamics—have been around for 15-plus years. Kubernetes hasn’t been around that long. Instana’s big advantage was that it was born in a Kubernetes world. If you look at the history of monitoring and operations tools, they’ve had to react to changes in how we develop and deliver applications and infrastructure. We used to have three-tier architectures with a monolithic database, an application server, and web apps or APIs. All of that ran on bare metal or virtual machines, so you had three or four technologies to monitor. There were specialized tools for each—database monitoring, infrastructure monitoring, application server monitoring. 

Applications were deployed once a quarter, and this was a big deal. The three teams would get in a room together, each watching their own tools, to make sure nothing went wrong. Now, application design and delivery have evolved. We moved from three-tier to service-oriented architecture, then to microservices. We went from a single database, single application server, single type of machine, and updates every few months to tens or hundreds of components, each with its own database, using the right database for the type of data. It’s no longer just Oracle; it’s Postgres, CouchDB, MongoDB, Redis, ClickHouse, Elastic—all inside one solution.

We’ve got a huge spread of technologies, using languages beyond just Java or .NET—Ruby, Python, Erlang, Go, etc. There’s an explosion of scale, and everything is dynamic: autoscaling, clustering, replicas, and on-demand releases. Delivery frequency has gone from every three months to once a month, once a week, and some companies release daily or multiple times a day. The idea of three teams of specialists understanding everything is no longer possible. The tools have to give you that information.

This is where the transition from monitoring to observability happened. It’s now a complete full-stack view of everything happening in real time. That’s where Instana’s value proposition lies.

Itiel Shwartz: That explains why Instana is needed, but why did IBM acquire it? What’s the bigger vision for IBM?

Chris Bailey: IBM has a long history of providing monitoring tools, but those tools were designed for older architectures. The switch to dynamic, large-scale complexity requires a different approach. You can try to rebuild existing technology, but sometimes it’s easier and more effective to start from a blank sheet of paper, which is what Instana did. They started from scratch, understanding the scale requirements and the need to handle containers and Kubernetes as first-class citizens. That’s what made Instana attractive to us.

Itiel Shwartz: That makes total sense. Four years later, where is Instana now? Has anything changed in the core mission or technology perspective?

Chris Bailey: One of the things about having a technology that was only three or four years old when we acquired it is that there were gaps in capability. Over the last four years, we’ve been closing those gaps. For example, we’ve added the ability to run synthetic tests—making requests to an API or endpoint from locations around the globe to ensure it’s working correctly. We’ve been building in core capabilities that people expect from observability tools.

Another big part of our acquisition case was where we wanted to take the capability. Instana has three unique differentiators. First is what we call “observability as code” automation. Instana’s goal is to automate everything. For Kubernetes, you deploy the Instana agent as a DaemonSet, via an operator, or with a Helm chart, so that an instance of the agent runs on each underlying worker node. From there, it monitors the worker node and discovers everything running on it. We auto-instrument everything, and the agents can auto-update themselves. We build a digital twin representation of everything running—namespaces, services, deployments, pods, containers, and their relationships.
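To make the “digital twin” idea more concrete, here is a toy sketch (not Instana’s agent or its APIs) that walks the Kubernetes API from a single worker node to the pods and owners running on it, using the official `kubernetes` Python client; the node name is hypothetical.

```python
# Toy illustration of building a topology view from a worker node downwards
# (not Instana's agent): list the pods scheduled on the node and group them
# by their owning controller (ReplicaSet, DaemonSet, StatefulSet, ...).
from collections import defaultdict
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

node_name = "worker-1"        # hypothetical worker node
topology = defaultdict(list)  # "namespace/owner" -> pod names on this node

pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
for pod in pods.items:
    owners = pod.metadata.owner_references or []
    owner = owners[0].name if owners else "standalone"
    topology[f"{pod.metadata.namespace}/{owner}"].append(pod.metadata.name)

for owner, pod_names in sorted(topology.items()):
    print(owner, "->", pod_names)
```

A real agent goes much further (processes, runtimes, services, and the relationships between them), but the principle of discovering the graph rather than configuring it by hand is the same.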

Because we’ve auto-instrumented the workloads, we track every single request in the system. For example, if you have a Node.js service making a call to a Java service, we track that request. We start at end-user monitoring—web and mobile front ends—and track requests through load balancers, proxies, services, databases, and back. We’re automating the end-to-end operations role to auto-detect incidents and help resolve them as quickly as possible.
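The mechanics of following a single request across services generally rely on propagating a trace context header on every downstream call. The sketch below shows the general W3C Trace Context pattern in Python with Flask and requests; it is a generic illustration, not Instana’s implementation, and the downstream URL is hypothetical.

```python
# Generic illustration of request tracking across services (W3C Trace Context style):
# a trace ID is created at the edge and forwarded on every downstream call so a
# tracer can stitch the hops into one end-to-end request. Not Instana-specific.
import uuid

import requests
from flask import Flask, request

app = Flask(__name__)

@app.route("/checkout")
def checkout():
    # Reuse the caller's trace context if present, otherwise start a new trace.
    traceparent = request.headers.get(
        "traceparent", f"00-{uuid.uuid4().hex}-{uuid.uuid4().hex[:16]}-01"
    )
    # Forward the same context to the downstream service (e.g. a Java payment service)
    # so both hops share one trace ID.
    resp = requests.get(
        "http://payments.internal/charge",  # hypothetical downstream service
        headers={"traceparent": traceparent},
        timeout=2,
    )
    return {"downstream_status": resp.status_code, "trace": traceparent}
```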

The second differentiator is that we collect 100% of every single request, along with metrics like CPU, memory, disk, and heap usage, every second. By collecting every second, we ensure we don’t miss crucial information. If you’re at 100% CPU for 5 seconds, we’ll catch that. Other vendors might sample every 10 or 30 seconds, potentially missing critical data.
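To make the sampling-interval point concrete, here is a tiny worked example with hypothetical numbers: a five-second CPU spike is visible at one-second granularity but can fall entirely between samples taken every 10 or 30 seconds.

```python
# Hypothetical numbers: one minute of ~35% CPU with a 5-second spike to 100%.
cpu_by_second = [35] * 60
for t in range(21, 26):          # spike at t = 21..25
    cpu_by_second[t] = 100

peak_1s  = max(cpu_by_second)          # 1-second collection sees the spike: 100
peak_10s = max(cpu_by_second[::10])    # samples at t = 0, 10, 20, 30, 40, 50: 35
peak_30s = max(cpu_by_second[::30])    # samples at t = 0, 30: 35

print(peak_1s, peak_10s, peak_30s)     # -> 100 35 35, the coarse samples miss it
```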

The third differentiator is our ability to do analytics. With high-quality data, we can do better analytics, which ties into our automation. We automatically detect problems in the system and analyze the flow data to understand cause and effect. For example, if a service is showing errors at the API level, we follow the requests to the database and see that the database is returning errors because it’s out of disk space. We automatically correlate everything and identify the root cause.
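The cause-and-effect walk Chris describes can be pictured as following failing calls downstream until you hit the last component that is itself failing. The sketch below uses made-up data and a deliberately simplistic rule; it is an illustration of the idea, not Instana’s analysis engine.

```python
# Simplified root-cause walk over a call graph (hypothetical data and logic):
# follow errors downstream; the deepest failing dependency is the likely cause.
calls = {                      # service -> downstream services it calls
    "api-gateway": ["orders"],
    "orders": ["postgres"],
    "postgres": [],
}
errors = {                     # observed error per component
    "api-gateway": "HTTP 500 on /orders",
    "orders": "SQLException from the database driver",
    "postgres": "disk full",
}

def root_cause(service):
    for dep in calls.get(service, []):
        if dep in errors:          # a failing dependency: look further downstream
            return root_cause(dep)
    return service                 # no failing dependency left, this is the root

print(root_cause("api-gateway"), "->", errors["postgres"])   # postgres -> disk full
```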

Itiel Shwartz: We’re almost out of time, so let me ask you a couple of final questions. First, how are you different from Datadog or Dynatrace? I’ve spoken to the VP of Engineering at Datadog, and what you’ve described sounds similar to what they offer. What do you tell people who ask this?

Chris Bailey: At our core, we’re the only vendor that collects 100% of every request going through the system. This allows us to do unique things, like knowing exactly how many requests come from mobile versus web devices and who those end-users are. We also collect application logs as part of the request, so you can trace errors back to the specific request that caused them. This level of data gives us unparalleled insights and allows us to do automatic analysis, helping you understand the cause of problems. No one else is doing this because they lack the data to do it reliably.

Itiel Shwartz: It sounds like a huge technical challenge, but we won’t have time to dive into that. My next question is, what is the biggest challenge for you right now? What do you see as the most significant challenge or opportunity?

Chris Bailey: Let’s call it an opportunity. The explosion of scale means that as an SRE, you’re expected to support an environment with five different data stores. You can’t be an expert in everything. One of the challenges in Ops has always been, even if we give you all the data and identify the problem, the next question is, what do I do about it? If you’re told there’s increasing lag on a Kafka topic, how do you fix that? For us, this is where generative AI comes in. We mine information from knowledge bases to automatically present relevant information to the Ops team. For example, if you’re running out of disk space, we can suggest expanding the volume, deleting files, or migrating files to a backup. We also provide operations tasks and automations to help implement those solutions, reducing the time to resolve problems once they’ve been identified.

Itiel Shwartz: That’s super interesting and aligns with what we’re seeing at Komodor. Last question—what does the future hold in terms of technology trends? What’s the biggest trend or shift you see in the ecosystem?

Chris Bailey: I think we’ll see more workloads moving to the edge, closer to users, whether they’re machine users or human users. We’ll end up with multi-tier topologies and solutions. Scale will continue to increase, and almost every medium to large company is trying to figure out how to add more AI into their solutions. We’ll need better ways to observe and understand how that AI is executing.

Itiel Shwartz: Chris, this has been a great episode. It’s been a pleasure having you here. Anything else you want to say to conclude?

Chris Bailey: Just thank you for inviting me and spending time to talk. It’s been one of the most interesting conversations I’ve had, from starting with Java at IBM to observability and edge applications. It’s been a pleasure.

Itiel Shwartz: Pleasure having you, Chris. Goodbye.

[Music]

Chris Bailey is an IBM Distinguished Engineer and CTO for Instana, leading the technical strategy and development of Observability and IT Automation solutions at IBM. He is a recognized technology leader across programming languages, runtimes, platforms, observability, and automated IT operations. He has pioneered projects that fundamentally enhanced open-source communities and cloud-native platforms. He is currently focused on providing comprehensive real-time observability across the entire enterprise from Business to IT, and leveraging AI and Automation capabilities to enable organizations and teams to meet their operational objectives.

Itiel Shwartz is CTO and co-founder of Komodor, a company building the next-gen Kubernetes management platform for Engineers.

He previously worked at eBay, Forter, and Rookout as the first developer.

A backend and infra developer turned ‘DevOps’, he is an avid public speaker who loves talking about infrastructure, Kubernetes, Python, observability, and the evolution of R&D culture. He is also the host of the Kubernetes for Humans Podcast.

Please note: This transcript was generated using automatic transcription software. While we strive for accuracy, there may be slight discrepancies between the text and the audio. For the most precise understanding, we recommend listening to the podcast episode.