#046 – Simulating, Scheduling, and Saving: Optimizing Kubernetes with David Morrison (Applied Research)

Dr. David Morrison

Founder & Research Scientist, Applied Computing Research Labs

Episode Overview

In this episode of Kubernetes for Humans, host Itiel Shwartz talks with Dr. David Morrison, founder of Applied Computing Research Labs and former distributed systems engineer at Yelp and Airbnb. David traces his unusual path from operations research into cluster orchestration — starting on Apache Mesos at Yelp, leading the migration to Kubernetes around the 1.14/1.15 era, and later becoming an approver on the Kubernetes Cluster Autoscaler at Airbnb. The conversation centers on SimKube (SimCube), David's open-source simulator that records production cluster traces and replays them locally against a real control plane with mocked nodes, enabling capacity planning, incident replay, and what-if analysis without spinning up real infrastructure. Itiel and David also dig into why Kubernetes utilization stays stubbornly low, why cost-saving initiatives so often stall in day-two operations, the on-prem-versus-cloud pendulum, and what the next abstraction layer above Kubernetes — the 'Python or Rust' for distributed systems — might eventually look like.

In this episode we discuss:

David's path from operations research to distributed systems: Yelp on Mesos, the migration to Kubernetes, and becoming a Cluster Autoscaler approver at Airbnb
How Yelp's open-source PaaSTA API layer let the team finish the Mesos-to-Kubernetes migration in roughly 18 months without disrupting developers
What SimKube/SimCube actually simulates: a real control plane with API server, scheduler, and controllers paired with mocked kubelets via the KWOK project
Why Kubernetes clusters typically run at only 30-40% utilization, and why scheduler/autoscaler control loops often fight each other and erase expected savings
The on-prem pendulum, the loss of capacity-planning discipline in the cloud era, and the search for the next high-level abstraction above Kubernetes

Key Takeaways

1. A well-designed internal API/abstraction layer (like Yelp's PaaSTA) is what makes a cluster-orchestrator migration tractable: the developers above it shouldn't notice the swap.

2. SimKube runs a real Kubernetes control plane against simulated nodes (using KWOK), so a 2,000-node production cluster trace can be replayed on a laptop for debugging, capacity planning, and what-if scenarios.

3. Most Kubernetes cost-savings efforts stall after the easy autoscaling wins because scheduling, autoscaling, and committed-spend instruments (RIs, savings plans) interact in non-obvious ways.

4. Low cluster utilization is largely a visibility problem: product developers double memory until things stop crashing, and platform teams can't see how those individual choices add up.

5. Kubernetes today is the C++ of distributed systems: powerful but low-level. The open question is what the Python or Rust on top of it looks like, and whether AI-assisted multi-tier programming is part of the answer.

Full Transcript

Itiel Shwartz: Hello everyone, and welcome to another episode of the Kubernetes for Humans podcast. Today we have David on the show. David, can you please introduce yourself?

Dr. David Morrison: Sure. Thanks for having me on the show. My name is David Morrison. I’m based out of California, and I’m a founder and research scientist at a small business doing Kubernetes scheduling and autoscaling. I’m excited to be here. Thanks for having me.

Itiel Shwartz: Kubernetes scheduling is always interesting. But before we jump to what you are currently doing, can you please share a bit about your journey? Where did you start, when were you first introduced to Kubernetes, and what led you to do what you do?

Dr. David Morrison: Sure. I actually didn’t start in distributed systems at all. I was in a field called operations research, which is the study of algorithms, logistics, planning, and optimization. I ended up at Yelp, which is a review website for restaurants and other places, and I landed on their distributed systems team, where they wanted me to save some money on their AWS bill. And that was sort of my...

Itiel Shwartz: So you didn’t have any experience with that? That was your task, to reduce the bill without really knowing how? Okay, that sounds like it was...

Dr. David Morrison: It was kind of a trial by fire, yeah. When I started at Yelp, we were using Apache Mesos, so I spent a whole bunch of time learning the ins and outs of Mesos and how distributed schedulers and cluster orchestrators work. My first introduction to Kubernetes was in, I think, 2017 or 2018. We switched at Yelp from Mesos to Kubernetes, like a lot of other companies did at the time. So I was spending a lot of time trying to understand how they are the same, how they are different, and how I could apply some of my skills to this brand new technology.

Itiel Shwartz: How was the migration decided, and what was your place in the process? Back then, doing this kind of migration made you an early adopter; for most companies it took three, four, maybe five more years. I guess you already had a huge setup, so share a bit about that.

Dr. David Morrison: Sure. It was a big point of discussion. When we finally migrated, I think we were on Kubernetes 1.14 or 1.15, something like that. But it was a big discussion: is Kubernetes going to be the thing that we should use, or should we stay with Mesos? My team and I had quite a few long conversations with engineering leadership about what the right direction was.

Itiel Shwartz: So how did Kubernetes win in the end?

Dr. David Morrison: We had another person on the team who had gone to KubeCon a couple of times, had played around with Kubernetes in small-scale experiments, and was in general really clued into the industry direction. His comment was, “Look, this is where everybody’s going, and Kubernetes gives us a lot of capabilities that Mesos doesn’t, so we should make this switch.” Funnily enough, I was sort of on the opposite side. I kind of wanted to stick with Mesos because it was what I knew, and I thought it was a really cool piece of tech. But in the end I’m glad that we made that decision.

Itiel Shwartz: I think Mesos is a really cool piece of tech. I’m a huge Kubernetes adopter, but I think Mesos did a lot of things right; it wasn’t a playground project. But how much time did it take you to migrate? One of Komodor’s customers has a huge Mesos setup, and I think it took them three or four years to do the migration end to end. How much time did it take you, and how painful was it?

Dr. David Morrison: It was not that painful. One of the things that Yelp did really well is they built an API layer. It’s actually open source; it’s called PaaSTA, and you can go look it up on GitHub if you want to. It’s an API layer between the cluster and the cloud and all of Yelp’s developers, and the abstractions that they implemented there were very well done. I was not involved in building that part, so I really have to commend the authors for that. What it meant was that we were able to swap out Mesos for Kubernetes without really impacting what Yelp developers did or how they did their jobs. I’m not going to say that it was painless, but I think we did the bulk of the migration in a year and a half or something like that.

Itiel Shwartz: Quite impressive, given the scale and the company. And what happened next for you and for Yelp? Okay, you found out about Kubernetes, you ditched Mesos even though you liked it. What now?

Dr. David Morrison: Yep. I spent about four and a half years at Yelp. I left in late 2020 and moved to Airbnb, and I was doing more of the same. Airbnb was all in on Kubernetes, so I spent a bunch of time there doing autoscaling. I’m actually an approver on the Kubernetes Cluster Autoscaler project and got some cool commits in there. But yeah, the same sort of work, just at a different place.

Itiel Shwartz: Okay. And then what?

Dr. David Morrison: Yep. After Airbnb, in 2023, I jumped ship and started off on my own. I have a small business, Applied Computing Research Labs, and we’re really focused on helping you do scheduling and autoscaling, save money, and increase reliability. One of the cool projects that I’ve been working on is this thing called SimKube. What it does is basically create a simulated Kubernetes cluster. You can take your production workloads, your 2,000-node cluster or whatever, record a trace of the last 24 hours of data, and then replay that in a simulated environment on your laptop.

Itiel Shwartz: One second, how does it work? Walk me through the details. Kubernetes has a lot of moving pieces, right? There’s the layer itself, there’s the scaling, there’s the traffic, there’s the network. What parts are you simulating, and what’s the main use case for that?

Dr. David Morrison: Yeah, so the main use case is trying to understand the Kubernetes control plane better. In SimKube I’m running a real control plane: I have an API server, a controller manager, and a scheduler. You can set up an autoscaler if you want, or any other custom controllers. All of that stuff’s real. What’s fake, what’s simulated, is the data plane. All of the nodes and all of the applications that are running on those nodes are mocked out. I’m using a project called KWOK, which stands for Kubernetes WithOut Kubelet. Basically, it emulates the kubelet API, but there’s no actual real hardware. There’s no Docker. There’s no code that’s running there. This is the secret sauce that lets you run 2,000 nodes on your laptop, because there’s not actually anything happening on the application side.
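To make the record-and-replay idea concrete, here is a minimal toy sketch in Python. It is not SimKube's code and does not use KWOK's API: the `FakeNode` class, the `replay` function, the first-fit placement rule, and the trace format are all hypothetical stand-ins. The only point it illustrates is that the nodes are pure bookkeeping while the placement logic runs for real against a recorded sequence of events.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """A mocked node: pure bookkeeping, no kubelet, no containers."""
    name: str
    cpu_capacity: float
    pods: dict = field(default_factory=dict)  # pod name -> CPU request

    @property
    def cpu_used(self) -> float:
        return sum(self.pods.values())


def replay(trace, node_cpu, max_nodes):
    """Replay (action, pod, cpu) events with first-fit placement.

    Hypothetical stand-in for a real scheduler and autoscaler: a pod goes on
    the first node with room, and a new fake node is added when none fits.
    """
    nodes, pending = [], []
    for action, pod, cpu in trace:
        if action == "create":
            target = next(
                (n for n in nodes if n.cpu_used + cpu <= n.cpu_capacity), None
            )
            if target is None and cpu <= node_cpu and len(nodes) < max_nodes:
                target = FakeNode(f"fake-node-{len(nodes)}", node_cpu)
                nodes.append(target)
            if target is None:
                pending.append(pod)  # unschedulable in this toy model
            else:
                target.pods[pod] = cpu
        elif action == "delete":
            for n in nodes:
                if n.pods.pop(pod, None) is not None:
                    break
    return nodes, pending


# A tiny made-up trace; a real trace would be recorded from a live cluster.
trace = [
    ("create", "web-0", 2.0),
    ("create", "web-1", 2.0),
    ("create", "batch-0", 3.0),
    ("delete", "web-0", 0.0),
    ("create", "web-2", 2.0),
]
nodes, pending = replay(trace, node_cpu=4.0, max_nodes=2)
for n in nodes:
    print(n.name, sorted(n.pods), n.cpu_used)
print("pending:", pending)
```

Because nothing runs on the fake nodes, the cost of the replay depends on the number of events rather than on the workloads themselves, which is the property David describes as letting a large cluster fit on a laptop.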

Itiel Shwartz: And who are the users? I understand why it can be helpful for someone like you, but are people who are not in that domain using it? What’s the main use case? Checking different scaling strategies, right? But what else?

Dr. David Morrison: Yeah, I think there are a few different use cases. Trying to understand something that happened yesterday is one: maybe there was an incident or an outage with your cluster and you don’t understand what went wrong. Then you can replay the events locally and really go in and debug and troubleshoot. Another cool use case is what’s going to happen tomorrow. Maybe you’re trying to do capacity planning, or your boss comes to you and says, “We’re expecting 10 times the traffic tomorrow. What’s going to happen to our cluster? Where is it going to fall over?” Based on your production data, you can actually make some predictions about what that would look like and where your bottlenecks are going to be.

Itiel Shwartz: But will it simulate the network as well, the different distributions? I spend a lot of my time doing A/B testing for large clusters, and it’s extremely hard, right? You don’t know where the [unclear] will be or which users; there are so many moving pieces. So does it capture everything, or does it mainly capture the big blocks of compute and network?

Dr. David Morrison: Right. Because, again, there’s no application code running, it’s not able to simulate network and traffic patterns. But what you can do is say, “This pod or this deployment correlates with our traffic. When we have high-traffic periods, it’s running 100 replicas, so we know that if our traffic doubles tomorrow, we’re going to need to run 200 replicas for this deployment.” It’s not perfect. You still have to think a little bit about where your traffic’s going to go and how that impacts things. But once you’ve done that, you can run that simulation and see how your cluster behaves.
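The replica arithmetic David describes can be sketched in a few lines of Python. This assumes replica counts scale linearly with traffic, which is the simplification in his example; the `scale_replicas` function and the deployment names are hypothetical and only illustrate how a what-if input for a simulation might be prepared.

```python
import math


def scale_replicas(peak_replicas, traffic_driven, multiplier):
    """Scale traffic-driven deployments by `multiplier`; leave the rest alone."""
    return {
        name: math.ceil(count * multiplier) if name in traffic_driven else count
        for name, count in peak_replicas.items()
    }


# Replica counts observed at peak (made-up numbers except the 100 from the example).
peak = {"web": 100, "metrics-agent": 3}

# "If our traffic doubles tomorrow": only the traffic-correlated deployment grows.
projected = scale_replicas(peak, traffic_driven={"web"}, multiplier=2)
print(projected)
```

Deciding which deployments belong in `traffic_driven` is the manual step David mentions: the simulation cannot infer it, because no application traffic is actually flowing.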

Itiel Shwartz: Okay, that sounds interesting. One second, though; I feel that we jumped too much into the technical aspect. You were at Airbnb and at Yelp, large startups or small enterprises, however you want to view them. Why jump ship? Why open your own company? How was the process, and how is it working for you? Share a bit about that; I’m sure a lot of our listeners might have their own cool idea in the space.

Dr. David Morrison: Yeah. This is a thing that I’ve been thinking about for a long time. Even before I was at Yelp, I was wondering what it’d be like to just run my own company. I think being at these large enterprise companies gave me the experience, the network, and the ability to take that next step, so I’m really glad that I had that opportunity. What I love about what I’m working on is that I’m not just building something for one company. I get to build things that help people all over the world, and that’s really exciting to me. So if you have listeners who are thinking about this, it’s certainly not easy. It’s been a slow, hard road. But I’m really enjoying it. This is the best work I’ve ever done.

Itiel Shwartz: Ah, that’s fun to hear. So maybe go into the details. You left, and what’s the goal? What do you do at the end of the day? Why should someone hire you, what do you provide, and what type of customers do you have?

Dr. David Morrison: Sure. Right now the bulk of my work is consulting. People come to me and say, “Hey, we want to understand why our Kubernetes clusters are so expensive,” or “We just switched from Cluster Autoscaler to Karpenter and we don’t understand how to tune Karpenter to be cost-effective.” So it’s these really targeted questions of how can we save money and how can we improve reliability. SimKube was originally designed as a tool for me to use internally. What I’ve discovered is that there are a bunch of other people who are like, “Wow, this is really cool. We’d love to use this in our environment.” So what I’m trying to do long term is build more of a business around helping people simulate their clusters and do real data analysis on what’s happening in their environments. Then we can answer the same sort of questions we talked about before, how do we scale or how do we save money, and you can take that to your CEO or your CTO and back it up with real numbers.

Itiel Shwartz: Okay, that makes sense. So when you are consulting, what are the patterns that you usually see? Let’s assume I’m a platform engineer. I have a lot of clusters. I have a lot of developers. What am I doing wrong most of the time? What should I do? Maybe share a bit more about the dirty stuff, how the world actually looks outside of Airbnb and Yelp, which I’m sure are not representative of the industry in scale, level, and a lot of other things.

Dr. David Morrison: Sure. One of the things I’ve discovered when I talk to people is that everybody’s doing stuff a little bit differently, but by and large, we’re all trying to solve the same problems. We want an application that’s reliable and useful to people, and we want an application that doesn’t break the bank. In a lot of these companies and organizations, your platform team is not your core competency. It’s a thing that’s there to support the actual business, and that puts a really funny dynamic on how you prioritize and what conversations you have, because it’s really difficult to draw a direct line between any particular change that you make to your infrastructure and how that makes your business better. That’s really one of the things I want to be able to do: come in and say, “Look, if you make this change, it affects your bottom line in this way.” I don’t know that I’ve really been anywhere where they’ve been able to answer that kind of question before.

Itiel Shwartz: Okay. So it’s about tying the business side to the technical side. I saw a study, I want to say from CAST AI but maybe I’m wrong here, that most Kubernetes clusters are only about 10% utilized. Is it something that you see with your customers as well? And if so, why? Everyone knows that if I just go from 10% to 20%, I’m saving half of my cost, right? So why do you think this happens, and what should they do? Maybe give a bit more detail on that.

Dr. David Morrison: Sure. I think 10% seems low based on what I’ve seen, but certainly 30 to 40% utilization is not uncommon, and I think there’s a bunch of reasons for this. It’s really hard if you’re a product developer somewhere. You just want your application to work, so if it runs out of memory, you’re just going to go in and double the memory. It’s going to work, and then you’re going to move on with your life, and you never really come back and think about what the implications of that are. So it’s tough as a product developer to really understand the impact that you make. And it’s really tough even as a platform engineer to understand the choices that the people you’re trying to support are making and how that impacts everything in your cluster. So I think one of the reasons why you see these low utilization numbers is that there’s no clear observability and visibility into how individual choices impact the whole.
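The "10% to 20% saves half" arithmetic from this exchange can be checked with a short Python sketch. It rests on two stated assumptions that are simplifications: cost is proportional to provisioned capacity, and the real resource usage stays the same while utilization changes. The `relative_cost` helper is hypothetical.

```python
def relative_cost(current_util, target_util):
    """Cost after the change as a fraction of cost before, for the same real usage.

    Provisioned capacity = real usage / utilization, so the ratio of the new
    cost to the old cost is current_util / target_util.
    """
    if not (0 < current_util <= 1 and 0 < target_util <= 1):
        raise ValueError("utilization must be in (0, 1]")
    return current_util / target_util


# The 10% -> 20% case from the question, and a step within the 30-40% range.
for before, after in [(0.10, 0.20), (0.30, 0.40)]:
    saving = 1 - relative_cost(before, after)
    print(f"{before:.0%} -> {after:.0%}: about {saving:.0%} cheaper")
```

Under these assumptions the same ten-point improvement is worth less the higher the starting utilization is: doubling from 10% halves the bill, while going from 30% to 40% trims roughly a quarter.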

Itiel Shwartz: Okay. I think what you’re saying makes sense. Cost saving is something that is very trendy, right? Everyone is talking about it. Almost all of our customers and most of the prospects that we meet have a huge initiative: let’s reduce cost by, I don’t know, a lot. Most of those projects, or a lot of them, are failing. Why? Your job is to consult for companies that I guess struggle with this, so share a bit more about that.

Dr. David Morrison: Yep. I think we’re in a day-two operations phase with Kubernetes and these platforms now. Most of the low-hanging fruit is gone: it’s really easy to turn on autoscaling, and then you’re saving a million dollars a year. Easy, done. Once you’ve tackled this low-hanging fruit, it’s really difficult to understand whether a specific scheduler change is actually going to save you money. There are so many different control loops and moving pieces in there. Maybe you make a scheduling change to bin-pack your pods more efficiently, but your autoscaler is trying to scale up nodes so that everything gets spread out, and you end up in a situation where you’ve got two control loops fighting with each other, so you don’t actually save any money. Then you throw in things like reserved instances or savings plans on AWS, where you have committed spend. Maybe you’re using fewer resources, but it doesn’t actually change your bill at all because you told AWS, “I’m going to spend X amount for the next five years.” Really understanding how all of the technical, social, and economic things tie together is really challenging, so it’s not surprising to me that people are struggling in this area. I think we see similar patterns.
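The committed-spend effect can be illustrated with a deliberately simplified Python model. It is not AWS's pricing formula: it ignores the difference between committed and on-demand rates and simply assumes the commitment is paid regardless of usage, with anything above it billed on top. The `hourly_bill` function and the dollar figures are hypothetical.

```python
def hourly_bill(usage_cost, commitment):
    """Simplified model: the commitment is paid no matter what; usage above it is extra."""
    return max(usage_cost, commitment)


commitment = 100.0  # committed spend per hour (made-up number)

# Compare how much the bill drops for two usage reductions.
for before, after in [(120.0, 90.0), (90.0, 70.0)]:
    saved = hourly_bill(before, commitment) - hourly_bill(after, commitment)
    print(f"usage {before:.0f} -> {after:.0f}: bill drops by {saved:.0f}")
```

In this model, cutting usage from 120 to 90 lowers the bill by only 20, and cutting it further from 90 to 70 lowers it by nothing, which is the situation David describes where using fewer resources does not change the bill.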

Itiel Shwartz: One thing: I had a very interesting discussion with one of my co-workers the other day about on-prem. Is it a thing? Are people going back to on-prem? Did they never leave on-prem? What’s your take, when we see the cloud providers gaining more and more popularity? Share a bit more about that from a cost or efficiency perspective. What do you see? What do you think?

Dr. David Morrison: Yeah, this is a great question. There are certainly a lot of splashy headlines about how “we saved money going back to on-prem.” I think there are a few factors going on. When we as an industry moved to the cloud, there was this whole job description around capacity planning that we just lost. The cloud is infinite for all practical purposes, so nobody has to think about their capacity. I think a lot of these companies that are going back to on-prem are seeing value from that as a forcing function: now we actually have to think about our capacity again. We can make tradeoffs about how many resources our application really needs instead of just treating this as an infinite resource that we’re really disconnected from. I think there’s also an argument to be made that the cloud providers exist to make a profit, and they want to make money too. Totally fine. But there’s potentially an argument that they make it a little harder to understand what you’re actually spending than would be ideal.

Itiel Shwartz: I don’t think anyone can argue with that. So do you think that 5 or 10 years from now we will see more on-prem, because it’s maybe the easy resort? Where are we going as an industry?

Dr. David Morrison: I think what we see as an industry is kind of like a pendulum. We see this in all kinds of different fields. We’re seeing this in AI right now. There was a long period where people were like, “Well, AI is meaningless or useless. It’s not ever going to do anything.” And now we’re in a big explosion of possibilities for AI. We see the same thing for infrastructure. Every company had a data center for a long time, and being able to administer that data center was a really valuable skill. Then there was the pendulum swing, and now everybody’s on the cloud. I think what we’re seeing now is the pendulum just swinging back. And I think it really depends. Some companies have done really well in the cloud and they’re going to continue doing well in the cloud. Some companies are discovering that they can do better when they’re on-prem or hybrid. It really comes down to what sort of skills you have, what sort of things you want to prioritize, and how this impacts your bottom line and your business.

Itiel Shwartz: Okay, that’s fair. I love ending with a bit of predictions. It can be around any topic, but let’s say we are talking in a couple of years: what’s going to be different? You can talk about scaling, Karpenter, new technologies, Kubernetes itself, the SIGs. Where are we going, at a high level, as a whole?

Dr. David Morrison: This is really interesting; I’ve spent a lot of time thinking about this. I think what we’ve built with Kubernetes is kind of the low-level layer. We have all of the pieces there to run and deploy an application in any sort of way that you want or can imagine. What I think is missing is the abstraction on top of that. Yelp had this with PaaSTA, and Airbnb had their own internal thing. Every company is building their own abstraction layer to make it easier to manage. I have an analogy where I compare this to programming languages. We had assembly, and we had C, and we had C++, and I think Kubernetes is like C++. But then we built Python, or Rust, or Go on top of all of this stuff. So where I want to be is: what’s the Rust or the Go that we’re going to develop for distributed systems? I think that’s a really fascinating question that we don’t really understand yet.

Itiel Shwartz: Is there anyone you see right now as an interesting contender for the new Python or the new Go?

Dr. David Morrison: I think there are a couple of really interesting directions. I actually gave a talk at KubeCon last year around multitier programming, where the idea is: let’s take your monolithic application and have automation that, under the hood, deploys it in a distributed fashion. I think that’s maybe one direction that we can go. There’s a bunch of toy projects out there right now, but I’ve not seen anybody be successful at making one production grade. Google’s Service Weaver was an example here, but that project was actually deprecated because it’s really hard to do. And I think this is maybe an area where AI might be able to come in and help as well. Maybe it’s able to understand how your program is structured and figure out which pieces we can split off and deploy elsewhere and which pieces all need to be co-located, or something along those lines.

Itiel Shwartz: Yeah, interesting. Take a programming language like Python as an example: even I, as a simple developer, can understand Python and benefit from it. When it comes to something bigger, as you described, it’s more about how to take the organizational complexity and somehow create something that is one-size-fits-all for a lot of different companies, right? Yelp and Airbnb are quite different, even, and currently what we have is like C++. So it’s an interesting take. I really wonder as well where the industry is going to go. Okay, David, is there anything else that you want to say to our listeners?

Dr. David Morrison: I don’t think so. I guess maybe I’ll just close by saying that I really love the title of your podcast. All of the tech that we’re building, all of the autoscaling or the service mesh or whatever, is there in support of humans. I really love the message that sends: we’re trying to build stuff to make humans better and to make humanity better. So I really appreciate that message around your podcast, and I hope that’s something that we can all continue.

Itiel Shwartz: I do hope that as well. Thanks a lot, David.

Dr. David Morrison: Yes. Thank you so much for having me on the show.

Itiel Shwartz: Pleasure having you.

[Music] Kubernetes for Humans.

This is an AI generated transcript of the conversation

About the Guest

Dr. David Morrison

Founder & Research Scientist, Applied Computing Research Labs

Dr. David Morrison is the founder and research scientist behind Applied Computing Research Labs (ACRL), a small business he started in 2023 to focus on Kubernetes scheduling, autoscaling, and cost optimization. His background is in operations research — algorithms, logistics, planning, and optimization — and he holds a PhD from the University of Illinois at Urbana-Champaign. Before launching ACRL, David spent about four and a half years at Yelp, where he worked on the distributed systems team migrating production workloads from Apache Mesos to Kubernetes, and then moved to Airbnb, where he became an approver on the Kubernetes Cluster Autoscaler project. He is the creator of SimKube (SimCube), an open-source tool for record-and-replay simulation of large Kubernetes clusters.

Resources Mentioned

Applied Computing Research Labs

SimKube (SimCube)

PaaSTA (Yelp's open-source PaaS/API layer)

Kubernetes Cluster Autoscaler

KWOK (Kubernetes WithOut Kubelet)