#007 – Kubernetes for Humans Podcast with Todd Palino (LinkedIn)

Itiel Shwartz: Hello everyone, and welcome back to another episode of the Kubernetes for Humans podcast. My name is Itiel Shwartz, and I’m your host. Today, with me on the show, we have Todd Palino. Todd, happy to have you.

Todd Palino: Wonderful to be here, thanks for having me.

Itiel Shwartz: I’d be happy if you could do a quick introduction about yourself and then, as always, start with why and how you got into computers. What was your background and the path that you took?

Todd Palino: Absolutely. Right now, I’m a Principal Staff Engineer and SRE at LinkedIn. I’ve been here for coming up on ten years this December, so it’s been a while. Before LinkedIn, I was at Verisign doing operating systems standards. When I joined LinkedIn, I started in the Kafka SRE organization, which was when Kafka was still a very young technology. The founders were still at LinkedIn, and we were building it up, developing a lot of the operational knowledge around how to run Kafka. I later switched to doing capacity engineering for a while and helped build that team within LinkedIn. Now, I’m focusing on incident management and resilience, building up application testing, deployment controls, and a lot of tools and platforms to allow our service developers to focus on writing their applications and not on running them. That’s where I am today.

Itiel Shwartz: How did you come to LinkedIn? Why LinkedIn, one of the best platforms in history? What drove you to join LinkedIn, and how many people were there when you joined ten years ago?

Todd Palino: We were a much smaller organization when I joined. The SRE org was probably only about 100 people at the time. Now, under the same leadership, the site engineering organization is more like 1,200 or 1,300 people. I actually found LinkedIn through LinkedIn. I wasn’t really looking for a job at the time—I was pretty happy working at Verisign, where I’d been for about 11 years—but I was always open to a change. A recruiter reached out to me on LinkedIn to ask if I’d be interested in interviewing for the SRE org. SRE was something I’d always wanted to get into, but being in a traditional operations role on the East Coast, it was a bit different from what was happening in West Coast tech companies. I went through the interview process, ended up moving out to California for a few years, then moved back to the DC area and switched to remote work. I went remote before it became the cool thing to do during the pandemic.

When I joined LinkedIn, I didn’t know much about SRE. The head of SRE gave me a choice between two positions: one in search and one with Apache Kafka. I knew what search was, but I had no idea what Apache Kafka was, so naturally, I chose Apache Kafka. That decision turned out to be fortuitous because it allowed me to reinvent my entire career, becoming an expert in Kafka and really getting started in the SRE space, which gave me a great path for growth within LinkedIn.

Itiel Shwartz: You mentioned the word “SRE” quite a lot, and it can mean different things in different organizations. Could you give me a brief overview of what SRE means at LinkedIn, especially considering the organization’s size, and maybe go into more specifics about what you’re currently doing?

Todd Palino: Sure. The SRE org within LinkedIn has changed a lot over time. When I started, we were very much in a gatekeeping role, trying to prevent too much from breaking on the site and slowing developers down while performing operations. Over time, we’ve shifted more of our efforts to building tools and platforms that handle the work. I always say the best SREs engineer themselves out of a job because there’s always something new to do. We’ve built platforms that do automated capacity measurement and management, automatically adjusting resources over time based on site traffic projections.

We’re actually in the middle of what we call SRE 3.0, moving from SREs handling operational work to having service owners and application developers being directly responsible for running their applications. This allows SREs to focus more on platforms—building capacity platforms, internal cloud management platforms, resilience tooling, testing tooling, deployment platforms, and so on. The goal is for service owners to be directly responsible and on call for their services without the toil traditionally handled by SREs.

Itiel Shwartz: That’s super interesting. One of the things we’re doing at Komodor is helping our customers with capacity planning. It’s one of the biggest challenges in today’s environment because most developers don’t really know how to manage capacity. They just want their applications to run, without worrying about cost, reliability, or other factors. Could you give me your take on this general problem and how you tackled it at LinkedIn?

Todd Palino: One of the challenges is that we don’t have great tools to help application developers understand how many CPUs they need. We originally had a tool called “Dino” that did capacity testing in production by stress-testing applications using production traffic. The first version, which came out about seven years ago, caused site incidents and took down applications, so nobody liked using it. We developed a second version that was safer to run in production. This tool unbalances the load in a cluster, loads up one instance until it reaches its limits, and then backs off slightly and holds it there to quantify traffic levels. We got it to a point where it runs safely on almost all applications, and it’s become a foundational tool for our capacity management.

With this tool, we can measure how much traffic a single instance of an application can handle, and then we build automation on top of it that projects the overall traffic for the site and automatically scales applications up or down as needed. We also do yearly projections to inform our physical infrastructure team how many servers need to be purchased. This automation takes a lot of the work off developers, allowing them to focus on deploying their applications without worrying about how many instances they need.
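Editor's note: the measure-then-scale idea described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not LinkedIn's tooling: the names `find_sustainable_load` and `instances_needed`, the `is_healthy` probe, and the step, backoff, and headroom values are all hypothetical.

```python
import math
from typing import Callable


def find_sustainable_load(
    is_healthy: Callable[[float], bool],
    start: float = 100.0,
    step: float = 1.10,
    backoff: float = 0.95,
    max_steps: int = 100,
) -> float:
    """Ramp load on a single instance until it stops being healthy,
    then back off slightly and report that level.

    `is_healthy(load)` is a hypothetical probe that returns False once
    the instance under test reaches its limits at the given load.
    """
    load = start
    last_good = None
    for _ in range(max_steps):
        if not is_healthy(load):
            break  # the instance hit its limit; stop ramping
        last_good = load
        load *= step  # shift a bit more traffic onto the instance
    if last_good is None:
        return 0.0  # unhealthy even at the starting load
    return last_good * backoff  # hold slightly below the last good level


def instances_needed(
    projected_peak: float,
    per_instance_capacity: float,
    headroom: float = 0.2,
) -> int:
    """Instances required to serve a projected peak with spare headroom.

    The 20% default headroom is an assumption chosen for illustration.
    """
    if per_instance_capacity <= 0:
        raise ValueError("per_instance_capacity must be positive")
    return math.ceil(projected_peak * (1 + headroom) / per_instance_capacity)


if __name__ == "__main__":
    # Pretend the instance degrades above 1000 requests per second.
    capacity = find_sustainable_load(lambda qps: qps <= 1000)
    print("per-instance capacity:", capacity)
    print("instances for a 50k peak:", instances_needed(50_000, capacity))
```

The split mirrors the description: one step measures what a single instance can take, and a separate step turns a traffic projection into an instance count, so the projection can change without re-running the measurement.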

Itiel Shwartz: You mentioned that you’re primarily a Java shop. I remember working on a low-latency Java application a few years ago and reading a fascinating guide on optimizing Java applications, possibly written by someone at LinkedIn. It was hardcore tuning, which not many people need to do, but when you do, you need strong expertise. It’s not just about changing -Xmx or -Xms settings; it’s much more complex.

Todd Palino: Exactly. It looks like magic to a lot of people, which is why it’s important to have experts build tools that encode their knowledge, so everyone else can benefit. Some of our tools are open source, like JXRay, which was created by someone who now works at LinkedIn. The reason our internal tools work so well is because we have a common application framework that our developers use. This framework provides metrics, a common control plane interface, and integrates with our CI/CD systems, making everything fit together. Because of this common framework, tools like Dino work across all applications, making capacity management much easier.

Itiel Shwartz: Where does the line get drawn between the SRE team, the platform team, and developers? If I’m a developer on one of your teams, when do I need to wake up in the middle of the night for an issue, and when is it your responsibility? Who decides?

Todd Palino: We’re moving towards having application developers be directly on call for their applications, which is a shift from when SREs were on call. Our observability platform provides signals for whether an application is healthy or not. Because of our common framework, we can define a central set of signals and put reasonable alert thresholds in place. If there’s an issue, the application owner gets paged. If there’s a problem affecting multiple applications, our site operations team will try to figure out whose problem it is by looking at paging information or coordinating reports. This way, we can mitigate problems quickly, like shifting traffic between data centers, before having to debug whose issue it is.
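Editor's note: a central set of signals with shared thresholds, plus a check for problems spanning several applications, might look something like the sketch below. Everything here is assumed for illustration: the metric names, the threshold values, and the `breached` and `triage` helpers do not come from LinkedIn's observability platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Threshold:
    metric: str
    limit: float


# Hypothetical central signal set; names and limits are illustrative only.
DEFAULT_THRESHOLDS: Tuple[Threshold, ...] = (
    Threshold("error_rate", 0.05),
    Threshold("p99_latency_ms", 500.0),
)


def breached(
    metrics: Dict[str, float],
    thresholds: Tuple[Threshold, ...] = DEFAULT_THRESHOLDS,
) -> List[str]:
    """Return the names of the signals that exceed their limits."""
    return [t.metric for t in thresholds if metrics.get(t.metric, 0.0) > t.limit]


def triage(
    apps: Dict[str, Dict[str, float]],
    owners: Dict[str, str],
) -> Tuple[Dict[str, List[str]], bool]:
    """Decide who gets paged and whether to escalate.

    Returns a mapping of owner -> unhealthy applications, plus a flag
    that is True when more than one application is unhealthy, which is
    the case where a central operations team would coordinate.
    """
    pages: Dict[str, List[str]] = {}
    for app, metrics in apps.items():
        if breached(metrics):
            pages.setdefault(owners[app], []).append(app)
    unhealthy = sum(len(names) for names in pages.values())
    return pages, unhealthy > 1


if __name__ == "__main__":
    apps = {
        "feed": {"error_rate": 0.10, "p99_latency_ms": 120.0},
        "search": {"error_rate": 0.01, "p99_latency_ms": 90.0},
    }
    owners = {"feed": "feed-team", "search": "search-team"}
    print(triage(apps, owners))
```

Because every application reports the same signals, one threshold table can cover all of them, which is the benefit of the common framework described in the answer.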

Itiel Shwartz: That’s interesting. A lot of people would love to have such a system in place. I often hear from customers that they’re not sure if an issue is with infrastructure or an application. Do your alerting and triage systems work well in practice, directing issues to the relevant team most of the time?

Todd Palino: Yes, consistency is key in these things. With consistent metrics, any expert can step in and start diagnosing an issue. If that expert can then encode their knowledge into triage tooling, the tooling can automatically analyze metrics much faster and identify the likely source of the problem. Our observability team does a lot of work on the triage tooling, and they partner with data scientists to develop models. While most of us SREs aren’t data science experts, we’re fortunate to have a large and highly qualified data science organization at LinkedIn, and we leverage those folks for operational models as well.

Itiel Shwartz: That sounds super cool. Could you share something interesting from your LinkedIn experience, maybe a transformation or change that resonated with you and might resonate with our audience? How was it handled in the organization?

Todd Palino: The move to focus on platforms has been challenging for a lot of folks, especially those who came up through traditional operations where developers handed over a program, and operations ran it without much interaction. SRE operates differently, partnering closely with developers, and now we’re transitioning to have developers take on more responsibility for operations. This shift is difficult for many people because they want to be the heroes who step in when the site is on fire. However, the pandemic has helped with this transition by refocusing priorities, especially around work-life balance. At LinkedIn, our culture has always emphasized not overworking and doing the right things, but we’ve had opportunities to refine that further, like implementing half-day Fridays and no-meeting days during remote work.

Itiel Shwartz: What do you think was the biggest hurdle to overcome during this change? Was it the Ops team, the developers, or both? Who drove the change, and who was the hardest to convince?

Todd Palino: SRE definitely started this change, driving it to reduce toil and focus on platforms. The biggest challenge now is with the developers because many don’t yet understand the toil of operations and what it entails. That’s part of why we’re doing this—because the closer you are to the problems with your application, the more motivated you are to fix them. When developers are responsible for responding to incidents, they’re much more motivated to address recurring issues. However, the platforms aren’t yet where they need to be for developers to do this without significantly increasing their workload. We’re spending a lot of time building platforms to make incident management easier and faster, but we’re still in the middle of this transformation.

Itiel Shwartz: When doing such a transformation, do you think it’s more about using carrots or sticks to motivate people? What worked for you?

Todd Palino: Right now, it’s probably more of a stick than a carrot situation. Developers are taking on more work without much direct benefit, while SREs get to focus more on platforms and less on individual applications. This is better from a toil perspective for everyone, and it’s also better for SRE career paths, as SREs are specialized software engineers rather than operations folks. Getting SRE back to its roots of software engineering and platform development is a great thing for all of our engineers.

Itiel Shwartz: We’re seeing a lot of companies trying to empower developers for Kubernetes. There’s always this clash between developers caring only about their code and the reality that their code can sometimes cause issues. It’s a big mindset shift, but it’s inevitable given the number of developers and applications out there.

I see we’re close to running out of time, so I’d love to ask you what you think the future holds. Where is the world going, especially in the SRE and developer space or with Kubernetes? Give me some predictions about how life will look in a couple of years.

Todd Palino: We’re moving towards full automation of everything: deploying applications, alerting, managing resilience, and capacity. In a year or two, we’ll see these things fully automated for developers, allowing them to just write code. From an SRE point of view, this will allow us to expand what we’re looking at, like applying AI, especially generative AI, to incident management. We’re still early in this, and the tools aren’t great yet, but we’re starting to explore how AI can help, for example by developing incident summaries or enhancing triage information.

As we automate more of the toil around running applications, we can focus on making these platforms really good. This will reduce the time to detect and mitigate problems, improve the speed at which developers can deploy new features, and ultimately allow us to move faster while maintaining site reliability. That’s the tension we constantly manage—rolling out changes quickly to support the business while ensuring reliability and resilience. The more we can make reliability and resilience painless, the faster we can move, and that’s where we’re headed.

Itiel Shwartz: That’s a good prediction. We’ll check in two years from now to see how it turned out. Any last words, Todd?

Todd Palino: I think we covered it all. My focus is on resilience, so it’s all about how to do this stuff safely. We’ve got some really interesting things we’re looking at with continuous testing and deployment, and I’m hoping we’ll be able to share more about that soon.

Itiel Shwartz: Thank you for being a guest. Good luck with LinkedIn, and I hope you’ll be there for another ten years or so. Thanks a lot.

Todd Palino: Thanks very much.

[Music]

Todd Palino is a Senior Staff Engineer in Site Reliability at LinkedIn on the Capacity Engineering team, where his team is creating a framework for application capacity measurement, analysis, and change intelligence. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. In his spare time, Todd is the developer of the open-source project Burrow, a Kafka consumer monitoring tool, and is the co-author of Kafka: The Definitive Guide, now available from O’Reilly Media. Out of the office, you can find Todd at conferences like SREcon and LISA, sharing his experience from years in SRE technical leadership, and at Kafka Summit or ApacheCon talking about how to feed and water Kafka infrastructures. Or maybe out on the trails, training for the next marathon.  

Itiel Shwartz is CTO and co-founder of Komodor, a company building the next-gen Kubernetes management platform for Engineers.

He previously worked at eBay, Forter, and Rookout as the first developer.

A backend and infra developer turned ‘DevOps’, he is an avid public speaker who loves talking about infrastructure, Kubernetes, Python, observability, and the evolution of R&D culture. He is also the host of the Kubernetes for Humans Podcast.

Please note: This transcript was generated using automatic transcription software. While we strive for accuracy, there may be slight discrepancies between the text and the audio. For the most precise understanding, we recommend listening to the podcast episode.