#059 – From Early K8s to the Edge: Shifting Compute Left with Dave Aronchick

Dave Aronchick

Co-founder & CEO, Expanso

Episode Overview

Host Itiel Shwartz sits down with Dave Aronchick, co-founder and CEO of Expanso, for a tour through Kubernetes' early days and where compute is heading next. Dave reflects on joining Google in 2014 as the first non-founding PM for Kubernetes, leading GKE, and co-founding Kubeflow, then unpacks the philosophy that let Google give away the top of the stack while still building a commercial business. The second half pivots to his current mission: shifting data pipelines and AI workloads left, all the way out to where data is generated, so teams can filter, govern, and add context before anything moves to a data warehouse. Along the way: solar inverters, mining analogies, edge video analytics, and why lineage and provenance are quietly the most important features for any AI-driven system.

In this episode we discuss:

How Google navigated open source in 2014–2015 to make Kubernetes the industry standard
Running GKE alongside upstream Kubernetes without forking — the conformance philosophy
The origin story of Kubeflow and why declarative ML pipelines needed their own framework
Why data, the speed of light, and security all argue for shifting compute left to the edge
How Bacalhau and Expanso help teams run rich data and AI pipelines next to the source

Key Takeaways

1. Google's Kubernetes strategy worked because the team refused to fork or carry heavy patches; the differentiation was running infrastructure better, not capturing the API.

2. Kubeflow was built on the bet that Kubernetes alone wasn't enough; ML teams needed declarative pipelines layered on top to be productive.

3. Three forces push compute toward the edge: data is growing everywhere, latency and bandwidth are fixed by physics, and the most secure data is data you never collected.

4. Shifting ETL left (filtering, averaging, and governing data right next to inverters, cameras, or point-of-sale devices) keeps downstream warehouses like Snowflake and Databricks dramatically cleaner and cheaper.

5. AI agents fail mostly from missing context, not missing intelligence; investing in lineage and provenance at collection time is what makes downstream agents reliable.

Full Transcript

Itiel Shwartz: Hello everyone, and welcome to another episode of the Kubernetes for Humans podcast. Today I have David on the show. David, do you want to introduce yourself?

Dave Aronchick: Yeah, hi. I’m Dave Aronchick. I’m co-founder of Expanso. In a previous life I led a distributed data company. In a previous life I was the first non-founding PM for Kubernetes at Google. I led the GKE project. I co-founded Kubeflow. And I’ve co-founded the open source project Bacalhau, which does compute-over-data. I’ve also happened to work at several of the major hyperscale clouds.

Itiel Shwartz: Yeah, I have to be honest: before the call I checked out your LinkedIn profile, and it seems like you started in marketing and somehow ended up in all of the big companies. So maybe share a bit about your start and your journey, and we’ll take it from there.

Dave Aronchick: Yeah, I’ve never done marketing. Way back in the day I wanted to be a doctor, and at the time I was doing a lot of stuff around fMRI. I’d always been a nerd and I always loved computers, but that was purely an off-handed thing. I just liked doing it. But I went to work at the NIH, which is a big government-run research institution here in the US, where I ended up doing a lot of fMRI, that’s functional MRI, on what at the time was very large data, doing scans and things like that. And so I used Sun systems and Linux systems and so on, and taught myself a whole bunch of it. Then I did two startups, one startup and then another, where I was CTO and director of internet operations. Then I started at Microsoft, where I was in the server and tools group doing product management. I did another startup, which I led as CEO for six years. And then I did Amazon and Chef, the DevOps platform, before I landed at Google, where I started on Kubernetes. Kubernetes had already been founded, but they asked me to come on and lead Google’s open source presence in the Kubernetes community from the product management perspective, not on the engineering side. Obviously, there were terrific engineers there. But then I also led the GKE product, which was the hosted Kubernetes.

Itiel Shwartz: Okay, so maybe take us back then, right? Just cuz it’s super interesting timing. I feel that for some of the listeners it might even sound like ancient history. But back then, Kubernetes wasn’t really the standard, right? It’s not like you said Kubernetes and everyone was like, “Yeah, it’s like Linux, whatever.” So take us through those days, maybe.

Dave Aronchick: No, it’s really funny, cuz they offered me the job in the middle of 2014, and I ended up saying, “Look, I’d love to work with you, but I’d love to go over to Chef.” I’m really a big fan of this declarative, or excuse me, they were imperative DevOps and things like that, which Chef was really at the cutting edge for. And one of the reasons was Google just hadn’t really established itself as an open source community. They were doing enormous amounts of open source, but only in other projects, right? They obviously were core contributors to the Linux kernel, they contributed to Hadoop, and they had done Android at the time, which they bought. They had Angular, I believe. Anyhow, several open source projects. But Google wasn’t really famous as an open source company. Now again, that doesn’t mean there aren’t phenomenal engineers, but I was interested in doing some stuff in open source. Anyhow, after Chef had some acquisitions and things like that, I was like, “Ah, I’m not sure there’s a space here.” And so I went back to Google to work on this. And even at the time, Google was still really trying to figure it out. Google is a phenomenal place to ship software, but at the time it was a phenomenal place to ship software internally. They really hadn’t got the motions right for how to do open source product planning and how to let the community lead. Google can vote on it, but the community is going to lead, even when Google has very, very deep thoughts about what to do. So that was a really confusing time. In addition to that, some of the most core components were not owned by Google at all. You had Docker, obviously, from a container perspective. Now, Google had no opinion. We did try to do a bunch of different things around networking, and again, we were trying to be very disciplined about not just saying, “Hey everyone, go do this.” And so we looked at projects like Weave and Calico and others around that. We had no opinion on disk as well, right? We were looking at MinIO and GlusterFS, and there were a bunch of other things as well. So it was a really interesting and challenging time. What I kept going back to the team with, and I kept pitching, is like, “Look, our job is certainly to do the work, right? And we’re going to do the work. Selfishly we wanted people to run it on Google Cloud, but we have to be happy with people building on Kubernetes even if it directly competes with us, right? That is our whole bit. We just want people to build and run software in this way, cuz we think it’s the best way to run software in this microservices world. And we have to establish these base workflows for the community to say, ‘Here’s my voice. I have the same weight as any other contributor here. All that matters is showing up and doing the work and carrying water and all that kind of stuff.’” And that’s what we did. And again, I feel incredibly lucky to have been there. Again, I want to re-stress, I was just this much of the solution. It just so happened that we had a bunch of people internally on the team at Google who knew about open source, and I really don’t want to downplay the work that Red Hat and IBM and, as I mentioned, Tigera and Calico did, and I mean, the list goes on and on and on. No matter how long I spent on this podcast naming all the people who contributed, it would be too short. So it was really, really something, and one of the best times of my life.

Itiel Shwartz: So it sounds like a lot of fun, and a lot of things happened simultaneously. But I must ask: I understand the open source front, right? You guys want people to use Kubernetes as much as possible, and it makes sense. But at the same time, according to what you say, you also led GKE, right? Which is basically a commercial product, right? Like, here we do want some money. So weren’t those roles a bit competing? How was it?

Dave Aronchick: You know, it competed a lot less than you would think. We tried very, very, very hard to run only upstream-compliant work. And during my time there, we spent a bunch of time on Kubernetes conformance and things like that, because we wanted to say to every cloud, not just us, “Look, we’re going to eat our own dog food and do what we say.” Any cloud that wants to claim it is running Kubernetes, which we held the trademark for, the foundation held the trademark for, excuse me, has to be compliant, and we will come up with a conformance structure for you to go and do this. And so that’s what we did. And there were certainly things, there’s still things today, where at Google we were like, “We think we can run this better,” and we will implement them when you run on Google Cloud. So for example, again going back to networking, we’ll provision networking for you in a Google-native way, and that will tie to your authentication, and we’ll provision disk for you, and all those kind of things. But we’re not going to do it with any kind of proprietary stuff. Excuse me, proprietary interfaces. This will be all about making your totally compliant Kubernetes cluster run better on us. And our whole philosophy basically came down to two things. One, if we fork in some way, if we carry too many patches or anything like that, that means that people who are running their Kubernetes cluster on prem or on other clouds will not be able to move to us, because that will involve some form of significant migration. So we’re going to try and keep this as simple as possible, and we’re really going to focus on the infrastructure. And the second half of it is a fundamental belief that the stuff that people were doing on prem or on other clouds, we felt like competitively we would be able to do it better. And that really didn’t come down to Kubernetes-level changes; it came down to, we just think we can run our disk and our CPU cycles and our networking more efficiently than other people. And as a result, that’s where we’ll get our margin and our differentiation. That was the philosophy, and it’s one of the reasons why it was so easy for us to give all this top-level stuff away, because we knew at the end of the day the thing you can’t give away, the thing you can’t open source, is a server box that needs electricity to go in and value to come out. And so that’s what we would be selling.

Itiel Shwartz: Okay, that sounds good. So one would guess that after that you’d go do some cool Kubernetes startup or something like that, but what was your next step after Google?

Dave Aronchick: Well, so I was still at Google when I left the Kubernetes project to go start Kubeflow. And the idea behind Kubeflow was: I love Kubernetes, I still love Kubernetes, I think it’s really transformative, but people need declarative frameworks to run on top of this that run in a Kubernetes-friendly way. And that was the entire point of Kubeflow. It was like, here’s the standard pipelines that people need when they’re training ML or running inference. And we feel like we can build this, again in an open source friendly way, such that people are able to run it more easily on Kubeflow, or excuse me, on Kubernetes. And that’s what we built. We co-founded it with a Kubernetes engineer and an internal Google ML engineer, and the three of us together joined up and launched it in, I think, December 2017. It’s obviously done really, really well. And then Microsoft really wanted to work on this, so I left Google to go work on that at Microsoft, along with some of the other open source activities they were doing with ONNX and some of their early Databricks and machine learning work. I did that there until I decided to go on and solve the next problem that I’m working on, which I think is the next big thing.

Itiel Shwartz: Okay, so Kubeflow obviously became the most popular project, I want to say, in the data pipeline space, and a lot of things tried to copy it and build on top of it, right?

Dave Aronchick: Yeah.

Itiel Shwartz: But what was your next stop afterwards?

Dave Aronchick: So after that I was at Microsoft, and I worked in the Azure ML group and in the office of the CTO. I was very lucky to work for one of my technical heroes, Mark Russinovich, who I’ve known and really admired for years and years and years. And I was still very, very passionate about machine learning, and so I did some work there. It’s hard to imagine now, but it was in the days before ChatGPT, before machine learning was so well understood, and so a lot of people were still really trying to figure out exactly how the pieces were going to fit together in machine learning. I have an opinion, and I still think this is the case, that people want to do debugging on their local laptop and use notebooks, things like that, and then ultimately move it to the cloud, but I think things have changed quite a bit with the advent of Claude Code and so on. But after that, I continued to the next thing, which is Expanso.

Dave Aronchick: I went to work at a company called Protocol Labs, where they funded my work around compute-over-data, in support of what they were doing, which was a distributed storage system backed by IPFS and Filecoin. And one of the biggest problems was: now that we have all this compute in the world, how do you rationally think about moving these compute jobs to where the data is being created? Because I think there are three truisms, right? Thing one is data is getting bigger. Again, not really controversial, but it is getting bigger everywhere. It’s not getting bigger next to your data warehouse. It’s getting bigger on prem, it’s getting bigger across clouds, all the way out to the far edge, manufacturing and satellites and so on. Your data is growing. Video, sensor data, you name it. Thing two is the speed of light isn’t getting any faster, right? Bandwidth is constrained, and latency will be constant. In 10,000 years’ time, it’ll still be, you know, a 40 millisecond ping time between New York and London, right? There’s just no way to shorten that. And so if you want to do things on a more regular basis with lower latency, you’re going to have to start thinking about what you can do before you move your data. And then the third big concept is around security, right? The more you move your data, the more you open yourself up to security vulnerabilities. The most secure data you can have is the data you never collected in the first place. And so if you can apply filters, apply governance, apply everything as far left as you can before it touches your bronze tier or your data warehouse, you’re going to be in much better shape. That doesn’t mean you get rid of your bronze tier or your data warehouse, right? Keep using them. They’re wonderful. But do stuff really far to the left. And that’s what we’re working on as well. And so, with those three things together, that’s what we built, starting at Protocol Labs, in the open source compute-over-data platform called Bacalhau, which we built a company on called Expanso. And it’s designed to do exactly that: give you rich data pipelines all the way to the left that are totally compatible with all your downstream work.
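
Editor's sketch (not part of the conversation): the "speed of light" point above can be checked with back-of-the-envelope arithmetic. The distance figure and the fiber slowdown factor below are assumptions chosen for illustration, not numbers from the episode.

```python
# Rough physical lower bound on round-trip time (RTT) between two cities.
C_VACUUM_KM_S = 299_792.458   # speed of light in vacuum, km/s (exact by definition)
FIBER_FACTOR = 2 / 3          # assumption: light in optical fiber travels at about 2/3 c
NY_LONDON_KM = 5_570          # assumption: approximate great-circle distance


def min_rtt_ms(distance_km: float, speed_km_s: float) -> float:
    """Smallest possible round trip in milliseconds: there and back at the given speed."""
    return 2 * distance_km / speed_km_s * 1000


vacuum_rtt_ms = min_rtt_ms(NY_LONDON_KM, C_VACUUM_KM_S)
fiber_rtt_ms = min_rtt_ms(NY_LONDON_KM, C_VACUUM_KM_S * FIBER_FACTOR)

print(f"vacuum bound: {vacuum_rtt_ms:.1f} ms, fiber bound: {fiber_rtt_ms:.1f} ms")
```

Under these assumptions the vacuum bound comes out at roughly 37 ms and the fiber bound at roughly 56 ms, so the "40 millisecond" figure quoted above sits close to the physical floor. No amount of engineering on the network path removes it, which is the argument for doing work before the data moves.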

Itiel Shwartz: So maybe explain it to me, cuz I understand data, right? Or at least I try to understand compute and pipelines, right? In my head, you have the data, right? And you have the pipelines, and they are good friends, right? But what you’re describing is that they are becoming much closer, right? So maybe like

Dave Aronchick: Actually, it’s kind of the reverse. We want them to be far more separated and give you the option. So let’s take a very, very simple example. We’re doing some work with some solar companies. And the solar companies have thousands of solar panels out there in the field. And each solar panel is generating data on an inverter, right? The sun is coming in and the inverter is converting it into voltage, and then they’re using this. Now, this same power company has gas turbines in other locations that are sitting idle, and they want to say, “Oh, the solar productivity is going down, so we need to spin up that gas turbine that’s way over there.” And we need to do that thoughtfully, right? We don’t want to do it too soon, cuz that means wasting gas and polluting the atmosphere and so on and so forth. So they need that data. Now, the problem is the default setting on these inverters, and there is no way to configure this, is to send data every 1 second about the total amount of voltage that’s going on. Now, these things are spread all over the place. They’re connected via Starlink or cable modems or very poor connectivity. And in order to do your modeling around this, you don’t need data every 1 second. You need data every 10 minutes, right? That is more than enough of a window. And so we give them the power to very easily run a portion of this data pipeline right next to the inverter, right there on site. It’s not a Raspberry Pi, but you can do it on a Raspberry Pi that can process thousands of machines. And all we’re doing is averaging that together and giving them reliable, provenance- and lineage-based data that they can push forward. And that means, previously, if they were sending all the data through, you’re talking about 600 signals per second per inverter, and every solar panel has an inverter on it, right? So you’re talking about an enormous amount of data, which you don’t need. But if you’re able to move a portion of your data pipeline to the left, you can now be much smarter downstream. Now, it still lands on their Databricks or Snowflake or Redshift or whatever, but they’ve taken a portion of that ETL pipeline and shifted it left, meaning they can apply more governance and intelligence upstream and get even faster results. Cuz otherwise, they would have to wait potentially days until they were able to download all the data and take action.
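
Editor's sketch (not part of the conversation): the inverter example above amounts to windowed averaging with provenance attached to each summary. The snippet below is a minimal plain-Python illustration of that idea; it is not Bacalhau or Expanso code, and the `Reading` type, the field names, and the `site_id` label are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean

WINDOW_SECONDS = 600  # 10-minute windows, as in the example above


@dataclass(frozen=True)
class Reading:
    inverter_id: str   # hypothetical identifier for one inverter
    timestamp: int     # Unix time in seconds
    voltage: float


def downsample(readings, site_id, window=WINDOW_SECONDS):
    """Average per-second readings into one record per inverter per window."""
    buckets = defaultdict(list)
    for r in readings:
        window_start = r.timestamp - (r.timestamp % window)
        buckets[(r.inverter_id, window_start)].append(r)

    summaries = []
    for (inverter_id, window_start), group in sorted(buckets.items()):
        summaries.append({
            "inverter_id": inverter_id,
            "window_start": window_start,
            "window_seconds": window,
            "avg_voltage": fmean([r.voltage for r in group]),
            # Provenance: enough detail to trace the summary back to its inputs.
            "provenance": {
                "site_id": site_id,
                "sample_count": len(group),
                "first_ts": min(r.timestamp for r in group),
                "last_ts": max(r.timestamp for r in group),
                "transform": "mean",
            },
        })
    return summaries


# 20 minutes of per-second data from one inverter collapses to two records.
raw = [Reading("inv-1", t, 240.0 + (t % 2)) for t in range(1200)]
for record in downsample(raw, site_id="site-A"):
    print(record["window_start"], record["avg_voltage"], record["provenance"]["sample_count"])
```

Only the summaries would cross the constrained link; the `sample_count` and timestamp bounds let a downstream consumer notice, for instance, a window built from far fewer samples than expected.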

Itiel Shwartz: Okay. I think I understand it now. So what you’re saying is that a portion of the compute pipeline now runs in these solar thingies, right? Like in the end.

Dave Aronchick: Yeah, and it really does apply to just about anything. Whether you’re cross-cloud, whether you’re on prem, whether you’re on vehicles or manufacturing or industrial. We do a lot of work with power, a lot of work with telco. It’s not on these devices, cuz a lot of times these devices are pretty locked down. Usually it will be right next to those devices, where you can collect all this data in a really elegant way, and then you move only a subset of that data over. So, for example, the analogy I always give is: imagine you have a mine, right? You’re a mining company and you want to go and collect a bunch of gold from 10 different mines. You could, if you want, pull all the dirt out, put it in the back of a dump truck, and drive it a thousand kilometers away until you can sort through it, right? But that seems pretty wasteful. Wouldn’t it be nice to filter, to sort that earlier, not in the mine but right next to it? Filter it all out and say, look, all this stuff is dirt, this is not worth moving. Oh, but this thing looks kind of like a gold nugget, so I’m going to move that. And you just keep doing all that right next to it. And then your dump truck can move a lot more candidate material, because it’s not moving all the raw dirt. And then, again, there are all these hard problems. So now let’s say, oh, you know what? I’ve looked at all these candidate materials and it turns out there’s also silver in here. How do I now send an instruction to every mine that, hey, don’t just look for things that look like gold, also look for things that look like silver, right? You can do that as well. Or let’s say one of the dump trucks comes back and there’s a little bit of radioactivity in there. Like, oh no, one of these mines hit something bad. Well, how do you know where that came from unless you have a full lineage, you know what dump truck delivered it, all that kind of stuff? We can help apply that as well. So it’s more about thinking about all the places you are collecting the data and moving portions of your jobs out there, because that helps you have richer, more reliable, less dirty data, not to use the pun, than what you’re doing today.

Itiel Shwartz: Okay, it makes total sense, and with the ever-growing amount of data I understand the need. But we’re almost 20-something minutes into the call and you didn’t mention AI or agents even once. Maybe now it’s time for a bit of buzzwords, right? So how does it connect? How are you guys an agent company, and how does it work?

Dave Aronchick: Absolutely. You can’t do it without agents. So, in two big ways. One, that pipeline that I talked about, where it’s just filtering and things like that, we can do full, rich data or AI out there as well. So, for example, I was just at a national security conference this past weekend where they’re looking to get much more intelligence over video at the edge, particularly when it’s disconnected, right? We all think about these video streams. Your standard camera is going to do about 2.4 gigabytes an hour of video, right? Right now, that’s the default, which will overwhelm any bandwidth. What you want to do is apply a little bit of AI up front and analyze this, and say, okay, I’m only going to send data if there’s a human being in it, or something like that. There’s a bunch of different things you can do there. And so we can help you run that pipeline right next to where your data is, and that will help you move less data. Additionally, one thing that I’m really passionate about is lineage and provenance. One big thing that keeps happening is people are like, “Well, why are my agents coming up with such bad results? What’s going on?” And more often than not, it’s because you’re lacking the context for the agent to make good decisions. For example, let’s say you have a store, and it’s got four point-of-sale devices in it. And point-of-sale device number three has shown zero transactions in the past hour. Okay, what’s wrong? Is the thing shut down? Is there a worker on it who hasn’t been doing any checkouts? Maybe there’s, whatever, a storm outside, a hurricane, and they shut the entire store down. Maybe someone spilled a Diet Coke all over that one, and they took it to the back room so they could go clean it. There’s just many, many, many reasons that it may show zero transactions, but you don’t know if you don’t have the context for that data as it flowed through. And if you don’t know, if a human being can’t tell, then an AI certainly can’t tell. So one thing I strongly recommend, whether you use our product, which makes it very easy to do this, or you use another product, or you write it yourself: add more context to your data as you move it from wherever it’s being generated, which could be anywhere from a machine in the same VPC all the way to your far, far, far edge. Add more context as you move this through in order to make your AI downstream smarter.
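
Editor's sketch (not part of the conversation): one way to read the recommendation above is that every hop a record passes through should attach what it knows and leave a lineage entry behind. The snippet below illustrates that pattern in plain Python; the `enrich` helper, the stage names, and every field name are hypothetical and are not taken from any product mentioned in the episode.

```python
from datetime import datetime, timezone


def enrich(record, context, stage, now=None):
    """Return a copy of `record` with extra context merged in and a lineage entry appended."""
    now = now or datetime.now(timezone.utc)
    enriched = dict(record)  # shallow copy, so the caller's record is left untouched
    enriched["context"] = {**record.get("context", {}), **context}
    enriched["lineage"] = [
        *record.get("lineage", []),
        {"stage": stage, "at": now.isoformat()},
    ]
    return enriched


# A bare reading: zero transactions, with no way to tell why.
raw = {"store": "store-17", "device": "pos-3", "transactions_last_hour": 0}

# The collector sitting next to the device adds what it knows locally.
at_store = enrich(
    raw,
    {"device_status": "removed_for_cleaning", "store_open": True},
    stage="store-gateway",
)

# A later hop adds more context without discarding the earlier entries.
at_region = enrich(at_store, {"weather_alert": None}, stage="regional-aggregator")

print(at_region["context"])
print([hop["stage"] for hop in at_region["lineage"]])
```

With the context attached, a downstream agent (or a person) can tell "device pulled for cleaning" apart from "store closed by a storm" instead of guessing from the bare zero.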

Itiel Shwartz: Makes complete sense. Maybe we’ll end with a prediction from you. Where are we heading? You can talk about data, you can talk about compute, you can talk about agents. Where is the world heading?

Dave Aronchick: Boy, if I knew that I’d be investing and betting a lot. It’s a crazy, crazy time. Obviously I’m talking my own book, but I think data will continue to be enormously valuable, and there’s no amount of value that is too high for how useful our ability to collect well-structured, intelligent data will be. Maybe I’ll use it to train AI tomorrow, maybe I’ll use it in five years, maybe I’ll use it in 20 years. But I do think we need to make the investments now in order to get the big payoffs later. But after that, I think the big thing is: how do we think about a world in which you have a phenomenal amount of leverage? You are one person with two eyes and two ears, and you can listen to something and you can go do something, and your brain is probably coming up with 100 ideas a minute, right? And that’s fine. Great. Go do that, use those ideas and go to work on them. But you’re bound by the amount of time you have and the amount of diligence you can apply to any one problem and so on. What does it look like when every one of those 100 ideas that you come up with, based on what you see and what you hear, can now be acted on? Maybe not all the way, maybe not launch a new company. But what does it look like when you can really pursue 20 things at once, and how do they work? When everyone can produce 20 things at once, what happens? And then what happens when you get to even further autonomy, right? Again, it’s not like I think that this thing is going to run away from humans. I just don’t think that’s going to be the case. But I think it will be guided by humans in more and more leveraged ways. And so, to use the coding example, in 1970 (I was not alive in 1970, to be clear) I had to go write assembler in order to get anything done. And then I got to go write in C, and that was not much better. And then Microsoft and Apple come out with Windows libraries that allow me to create an image and things like that. These are all points of leverage where, sure, the stuff at the bottom layer isn’t as efficient as it would have been. Probably more efficient than I would have built, by the way, even then, but probably not as efficient as if an expert sat down and thought about every one. But it allowed me to draw a box on a screen and show that to someone and have someone interact much more quickly than ever before. And I think we’re just continuing to explore and expand that particular space. And that’s where I see things going: just increasing, increasing, increasing leverage. We will need to be diligent to make sure it doesn’t run away from us, not in a negative sense like the paperclip scenario or whatever, but runs away from us in the sense of, well, I didn’t correctly articulate what I wanted it to do, and as a result I was unhappy with the results. Cuz that’s what I think could still be a problem.

Itiel Shwartz: Okay, with that, I think we’ll end this episode. David, thanks a lot. It was a pleasure, and talking with someone who was in Kubernetes in such early days is also interesting. And I think what you guys are building makes total sense, as someone who has played quite a lot with data and ETL. So thanks a lot, David. I really appreciate it.

Dave Aronchick: Thank you so much for having me on.

[Music] Kubernetes for Humans.

This is an AI-generated transcript of the conversation.

About the Guest

Dave Aronchick

Co-founder & CEO, Expanso

Dave Aronchick is co-founder and CEO of Expanso, the company behind the open source distributed compute project Bacalhau. He was the first non-founding product manager for Kubernetes at Google, where he also led Google Kubernetes Engine (GKE) and co-founded Kubeflow. He went on to lead open source machine learning strategy at Microsoft Azure in the office of the CTO under Mark Russinovich, and worked on compute-over-data at Protocol Labs before launching Expanso. Earlier in his career he held roles at Microsoft, Amazon, and Chef, and was a multi-time startup founder and CEO.

Resources Mentioned

Bacalhau (compute over data)

Google Kubernetes Engine (GKE)