#043 – Gaming on K8s: Stateful Servers, Low Latency, and an Incredible Infra Journey with Siddharth Dhulipalla (Hathora.dev)

Siddharth Dhulipalla

CEO and Co-founder, Hathora.dev

Episode Overview

In this episode of Kubernetes for Humans, host Itiel Shwartz talks with Siddharth Dhulipalla, CEO and co-founder of Hathora.dev, about running real-time multiplayer game servers on Kubernetes. Sid walks through Hathora's journey from a software framework for stateful apps to a pure infrastructure offering, the brutal latency budget of a 60Hz dedicated game server, and why EKS and GKE were non-starters once bare metal entered the picture. The conversation digs into Hathora's hybrid architecture built on Talos and Sidero's Omni, the work required to hit sub-3-second pod startup with multi-gigabyte container images, and where Kubernetes' networking and scheduling models still don't quite fit gaming workloads.

In this episode we discuss:

From Databricks multi-cloud and Palantir's pre-container cloud team to founding Hathora in the gaming space
Why stateful, in-memory game servers don't fit the standard load-balancer-plus-database web architecture
Hathora's pivot from an opinionated software framework (Hathora Builder) to a pure stateful hosting platform
Running hybrid Kubernetes on bare metal and cloud VMs with Talos and Sidero Omni instead of EKS or GKE
Where Kubernetes still falls short for gaming: hostPort conflicts, pod startup latency, and large-image container pulls

Key Takeaways

1. Real-time game servers reconcile player input around 60 times per second, leaving roughly 16ms per frame. There is no time to write to a database in the hot path, so state lives in memory on a single server.

2. Cloud-managed Kubernetes (EKS, GKE) was a non-starter for Hathora because they needed to enroll their own bare metal nodes alongside cloud VMs to control egress costs; Talos plus Sidero Omni gave them a workable hybrid control plane.

3. Hitting a P95 pod startup under 3 seconds with 3–20GB container images required heavy optimization. Pulling images in the hot path during matchmaking is essentially a non-starter, so peer-to-peer image distribution (Kubefledged) becomes critical.

4. Kubernetes networking is awkward for game servers: hostPort forces port assignment in the pod spec and prevents two pods that request the same host port from being scheduled on the same host, and ingress/NodePort don't cleanly map to dynamic per-match port exposure.

5. Hathora's bottoms-up GTM started with Unity hobbyists, moved to sub-10-person studios, and then to venture-backed studios with $10–100M budgets where founders want to buy infra rather than rebuild a central platform team.

Full Transcript

Itiel Shwartz: Hello everyone and welcome to another episode of the Kubernetes for Humans podcast. Today with me on the show we have Sid. Sid, happy to have you.

Siddharth Dhulipalla: Hey Itiel. Great to be here.

Itiel Shwartz: Yeah. So Sid, maybe tell us a bit about yourself. Who are you, what do you do, and what’s your favorite Kubernetes object?

Siddharth Dhulipalla: Interesting set of questions. So by way of introduction, my name is Sid. I am one of the founders and the CEO at Hathora. We provide game infrastructure to studios that are building games. Primarily our bread and butter is hosting the dedicated server component, which is the thing you connect to once the match has been found if you’re playing Call of Duty or Halo. This is the server that’s keeping all the state that everyone is constantly modifying in sync. We started the company about three years ago. We’re a big Kubernetes shop at this point. We tried a few alternatives before settling on Kubernetes, but happy to chat more about that down the line. Prior to this, I was at Databricks, where I was working on multicloud standardization, since the platform had to operate in Azure, AWS, and GCP at a pretty large scale. And prior to Databricks, I was at Palantir, where I led the cloud team. That kind of preceded Kubernetes. We were mostly on Puppet and Ansible at that time, and mostly not even in containers. So I have a good bit of experience even in the pre-container world.

Itiel Shwartz: I hope you know you still have your stocks with you.

Siddharth Dhulipalla: Yeah. Don’t don’t remind me because uh there’s a three-year expiry and it’s been more than three years since I left so I had to sell.

Itiel Shwartz: Okay. With that, maybe share a bit about Databricks if you can, a bit about your journey, because this is where you really started with Kubernetes, and multicloud is always challenging, right? Always a lot of problems. So if you can, share a bit about what you did there and then what led you to start a company in the gaming industry. It’s also quite...

Siddharth Dhulipalla: Yeah, so at Databricks, by the time I joined, the company had already started to use Kubernetes for compute. At that point they were primarily in AWS. They had signed a deal with Azure a couple of years prior, and the Azure partnership was really starting to take off. Teams were annoyed that they had to ship software, test manually, and make sure it worked in both AWS and Azure, but it was just two clouds and they were okay with that. But in 2020, a couple of months after I joined, they signed a partnership with GCP as well. And now all of a sudden teams were like, are you kidding me? I have to test every feature and every update I’m shipping not only in AWS and Azure but in GCP too. So internally the platform team recognized that this isn’t sustainable. We actually need a centralized team to maintain a consistent interface across all three clouds.
I think everyone knew that was very ambitious. Even in the broader industry everyone would like to be entirely cloud agnostic, but the reality is that everything from IAM permissions to even simple stuff like how container registry permissions are handled is different across the various clouds. So it really became a challenge of: let’s not promise the world, but let’s focus on the common pieces that we can start to abstract away. And so for compute orchestration, since we were already leaning pretty heavily on Kubernetes, we started leaning a little more heavily on Kubernetes. We were always using Terraform as well to set up and manage resources, but we started using Kubernetes more. For example, there was a serverless offering that Databricks had just shipped. Once we started shipping that, there were some concerns around security, because all of a sudden it was a multi-tenant architecture: Kata Containers versus Firecracker versus just native containerd pods, and then looking into how performance is affected across all three options. So there was a lot of in-depth stuff that happened at Databricks, and the team there was just excellent. And yeah, I left Databricks mostly because my co-founder was chipping away at me for like six months. We’ve been really good friends since college and he was like, “Hey, I’m working on this thing in the gaming space. I could use your help trying to build the infra layer around it.” And then eventually he caught me on a good day and I was like, “Okay, let’s do it.”

Itiel Shwartz: Mhm. So you left Databricks, and your partner was already in the gaming industry. To be honest, I never worked in the gaming industry, right? I gamed quite a lot, but I never really did, you know, hardcore gaming. I know there are a lot of challenges in general. So maybe share a bit about what led you, or your partner, to start this idea: what was the background, what was the promise, and a bit about the journey that you had along the way.

Siddharth Dhulipalla: Yeah, I think the core of it comes down to this: my partner started trying to build stateful applications, right? Applications that required in-memory transactions, where you couldn’t wait for a database to acknowledge your write. And the more he started thinking about how these real-time apps work, the architecture for it was pretty poor, especially when you think about how you scale them. Typically people try to shoehorn the existing architecture of a load balancer, web apps, and then a database onto those kinds of apps as well. But if you have two different clients connected to the same so-called room that need access to the same state, but they’re load balanced to different servers, the two servers need to stay in sync. So then you introduce a Redis or some sort of message bus in between, and my partner looked at this and was just like, this is just obnoxious. In fact, look at Uber and Figma. Uber was super well known in the industry for using Kafka topics extensively, but then a couple of years ago they posted a whole series of blogs on how they moved away from that and into more of these stateful web servers, where if a rider and a driver were paired, they were physically connected to the same server, so that there wasn’t this huge complex architecture that needed to be dealt with. So his passion really came from how do we make stateful application management easier at scale. And as we started looking into it, the hosting piece ended up being pretty challenging as well, because you look at AWS with ALBs or ELBs, or any of the go-to advice that’s given, and it’s the same throw-a-load-balancer-in-front advice. But for gaming specifically, what we discovered is that it’s the extreme end of this stateful problem that I’m describing.
So the dedicated server that I’m describing here is basically receiving input 60 times a second from all the connected players. Maybe this is a 3v3 match or a 6v6, or in Fortnite it’s actually like 64 players at the beginning. All of those players are sending updates based on, hey, I moved my joystick this way or I pressed this button. And then it’s the server that’s actually reconciling all those actions into the actual state of the world. And then it needs to re-emit the current state of the world, or the relevant pieces of the world, to all the connected players as well. So if you imagine 60 frames per second, that gives you about 16 milliseconds to finish the computation. You have no time to write to a database and load that back up when the physics simulation itself takes up that whole envelope. So it’s a very interesting technical problem. And then the way these are deployed, containers weren’t really being used, and the way the underlying hosts were being managed was pretty archaic as well. So we took a look at what other options were available in the industry, and based on what I’d seen was possible with Kubernetes and containers, I was like, we can do better here.
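
To make the 60-times-a-second budget concrete, here is a minimal Python sketch of a fixed-rate tick over in-memory state. It is an illustration only and not Hathora's code: the `MatchState` shape, the `step` and `fits_budget` helpers, and the example timings in the usage note are all assumptions introduced here.

```python
from dataclasses import dataclass, field

TICK_RATE_HZ = 60
TICK_BUDGET_MS = 1000 / TICK_RATE_HZ  # about 16.67 ms per tick


@dataclass
class MatchState:
    """All match state lives in this process's memory (hypothetical shape)."""
    tick: int = 0
    positions: dict = field(default_factory=dict)  # player id -> (x, y)


def step(state, inputs):
    """Apply one tick of player inputs; inputs maps player id -> (dx, dy)."""
    for player, (dx, dy) in inputs.items():
        x, y = state.positions.get(player, (0.0, 0.0))
        state.positions[player] = (x + dx, y + dy)
    state.tick += 1
    # The returned snapshot is what would be sent back to connected players.
    return dict(state.positions)


def fits_budget(work_ms, budget_ms=TICK_BUDGET_MS):
    """True if the work planned for one tick fits inside the tick budget."""
    return work_ms <= budget_ms
```

As a usage example under assumed numbers: if the simulation takes 12 ms of a tick, `fits_budget(12.0)` holds, but adding a hypothetical 10 ms database round trip gives `fits_budget(22.0)`, which fails. That is the reason the state stays in memory on one server.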

Itiel Shwartz: So, first of all, super interesting, and a great explanation of the problem. But listening to what you have to say, it feels like there are two challenges here. There is the technical challenge: let’s say you already built everything, you need to make sure it’s up, it’s running, the latency is super low, you are not missing any session, whatever. A super hardcore infra problem for a low-latency, high-throughput application. On the other side, what you are saying is that there’s also an architecture problem, in a way, of too much complexity. And what you’re saying is that you are trying to manage both of them, right? You’re saying, let’s solve both the “it’s so hard to host” and the “it’s so hard to build” in a unified solution. Is that correct?

Siddharth Dhulipalla: So that’s actually what our ambition was at the very start of the company. In fact, it was more: let’s solve the architecture problem, and then eventually we’ll monetize that by providing the infra solution. Over time we actually built that. It’s called Hathora Builder, and it’s a software framework that does codegen and makes it really easy for you to stay synchronized across multiple connected clients. Think Socket.IO, but better. And then we started building the backend component, the infrastructure component, to it. What we discovered was that the barrier to adoption for the software framework was significantly higher than for the infrastructure piece. So over time we kept stripping down the software piece to make it more and more generic, so that more people would be willing to migrate their apps over, to the point where we were like, look, there’s nothing of value left here. Let’s just provide the hosting offering as a standalone thing. We still have the Builder as an open source project, but usage for the Builder component is almost non-existent, while usage for our infrastructure is exploding.

Itiel Shwartz: No, that’s a very interesting way, and it does make sense, because I feel there are two very hard, different problems that you need to solve in order to operate it in a good way.

Siddharth Dhulipalla: So at the beginning we really thought about this as the Spark to Databricks, and we were kind of building both, where this Builder component is this new paradigm like Spark was, and then Databricks monetized actually running Spark at scale, and our infrastructure business was going to be that. But over time we discovered that there’s actually a tremendous interest in just stateful infrastructure hosting, even without providing an opinionated software layer.

Itiel Shwartz: Okay. So, again, very, very interesting, but share with me: let’s say that you built it and it works, right? Now it’s up to you to get, I don’t know, game companies to start using you, right? And those guys, at least in big companies, are I guess quite strong. They have big budgets, right? They have a very strong infra team. Share with me why someone would choose you, or how you started, who were the first customers, and a bit about the company’s growth and changes.

Siddharth Dhulipalla: Yeah, for sure. I think the one good thing is that we started with a very bottoms-up approach. We didn’t instantly say, hey, we have to win over Electronic Arts or Bungie or any of these larger names. We were working with essentially hobbyists at the very beginning. The gaming industry has tons of hobbyists who spin up a Unity client, want to build a multiplayer game, and then just want to play with their friends. So we were like, “All right, cool. How could we get the platform working to the point where a complete newbie can get up and running within a matter of minutes?” That was maybe a three-month effort for the platform to get to that spot. And then we started putting out a lot of feelers to smaller studios of fewer than 10 people, and there was another three-to-six-month effort there. At that point we also raised our seed round, and along with our seed round we brought on a lot of investors who were super well connected in the gaming industry, specifically with venture-backed studios. So then we started talking to a lot of the founders at the venture-backed studio level, and we really learned a lot about their challenges. I’m talking about game budgets between $10 and $100 million. Unfortunately, the last year has not been great to those kinds of games that have launched. But what has been great for us is that these studios are building something from the ground up, and they’re usually ex-Riot or ex-Blizzard, and they’re used to having a central platform team that they really don’t want to rethink and solve from scratch. And that’s where us coming in has been very helpful.
And then the other really big reason why adoption has been great for us is that if you think about what goes into building a game, it is one of the most interdisciplinary efforts in the world. You have your artists, your 3D modelers, your motion capture people, your music folks, and your story writers, and engineering is one component of it. But even in engineering you have your Unity or Unreal or front-end engineers who are handling ray tracing and all the physics in the front end, and then you have your network engineers, and then you have your persistence back end, right? What level am I on, what skins have I unlocked, that layer. So when you think about the breadth of what’s required to ship a game, it’s almost impossible for some of these newer studios to also build the depth required to make their infra piece great. And so by and large a lot of these companies were interested in buying instead of building in-house.

Itiel Shwartz: Okay. Okay. Everything makes sense. Share with me a bit about the challenging side, because so far everything sounds quite perfect, right? And as a co-founder myself, sometimes life is not perfect, right? It can’t be the first time that you hear that. So what are the challenging parts? Is it the technology, is it getting the customers, all of the above? You tell me.

Siddharth Dhulipalla: Yeah, so actually, from a technology standpoint, both me and my co-founder studied computer science at Carnegie Mellon and have a very technical background, and we were able to assemble a super stellar infra team. So we feel super confident that if technology is ever the bottleneck, which it hasn’t been so far, since the tech has always been a little ahead of the business, we can solve whatever problems are required. And I’m happy to share some of the problems we had to solve along the way. But by and large, at this point the bigger challenge for us is customer acquisition, as it is for most startups, in the sense that the gaming industry is going through a slight contraction. Forecasts when we started were that gaming is exploding, 10 to 20% year-over-year growth for the next 5 to 10 years. But the reality is that since 2023 the industry has been contracting at a rate of about 3%. So that’s made it really hard for some of these newer entrants to break in and maintain their foothold. And given that we have a usage-based pricing model, if the games themselves don’t do well, we’re not doing well, essentially. There are ways that we’re mitigating that as well: starting to expand upmarket faster than we had originally planned, and we’re in really good conversations with the massive studios of the world.
But we’re also exploring other areas. We never explicitly started the company saying this is going to be a gaming company and a gaming company only. Stateful workloads are what we’re interested in. And we’re also starting to think about what additional stateful workloads exist out there, like the Figma-style interactive application, or CI/CD workloads.

Itiel Shwartz: They have a couple of very interesting blog posts on how hard it was for them to implement all of these features that look trivial. They have very, very interesting blog posts on the technical challenges that you describe, right? How do I make sure that when someone moves something, everyone can see it, and how do I publish? I read it back then and I was like, yeah, I never thought about Figma as such a hard product to build, right? And it...

Siddharth Dhulipalla: Yep. Yep. And Figma took a very interesting approach, which broke convention. The convention around the time Figma was getting started was that if you have state that multiple people are modifying at the same time and that needs to be synchronized, you either use operational transforms or you use CRDTs. And Figma was like, actually, maybe let’s not do that. The reason OT and CRDTs were helpful is that if you had a load balancer and both users were connected to different web servers, OT and CRDTs helped in resolving the deltas. But Figma basically said no: each Figma document is going to be pulled into memory in one server somewhere in the world, and all editors or viewers of that document are going to be connected to that same physical process. And so that eliminated a lot of this, oh, what if someone else connected to a different server updated the database before my write went in? So Figma’s architecture is very similar to how game servers handle it. It’s just interesting to see more and more of these real-time apps moving towards the gaming architecture rather than vice versa.
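
The idea of pinning every participant of one room (a match or a document) to the same process can be sketched as a small directory that assigns a room to a server once and then always returns that server. This is a hedged illustration in Python, not how Figma or Hathora implement it; `RoomDirectory`, `server_for`, and the least-loaded assignment rule are assumptions made for the example.

```python
class RoomDirectory:
    """Maps each room (match or document) to the one server holding its state."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = list(servers)
        self._room_to_server = {}
        self._load = {s: 0 for s in self._servers}

    def server_for(self, room_id):
        """Return the room's server, assigning the least-loaded one on first use."""
        if room_id not in self._room_to_server:
            server = min(self._servers, key=lambda s: self._load[s])
            self._room_to_server[room_id] = server
            self._load[server] += 1
        return self._room_to_server[room_id]
```

Because every client asking for the same `room_id` gets the same server, there is a single in-memory copy of the state and no cross-server reconciliation step, in contrast to a plain load balancer that may spread those clients across servers.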

Itiel Shwartz: Huh, that’s cool. Very, very interesting. And again, as a developer, it always looked a bit trivial, right? But then when you think about it, it is super complex. So maybe share a bit about where Kubernetes fits in the picture. Are you guys managing it on cloud? Do you have on-prem? You are, I guess, very latency aware, very network aware, right? That’s life, right? So maybe share a bit about the technical implementation and maybe the challenges or problems that you guys encountered.

Siddharth Dhulipalla: Yeah, 100%. So let’s separate how we handle our control plane, our API server and our management layer, from our data plane, the customer compute that we manage, right? At its core, what Hathora does is lease bare metal to our customers. And then if they’re starting to run out of bare metal capacity, we automatically start launching cloud VMs and enrolling them into their cluster as well. So by default they’re saving a huge amount of money by running on bare metal, and there’s a ton of network egress happening for these games as well, so running exclusively on the cloud becomes very expensive. What this means is we can’t actually use EKS or GKE or any of the cloud-managed offerings, because we can’t enroll our own bare metal nodes from outside of the cloud. So we went through this whole challenge of, all right, how do we do Kubernetes, or what’s our compute scheduling layer, as we started to introduce bare metal into our offering. We looked into Nomad, we looked into potentially using something like Fly.io, which has cheaper rates, and we looked into building our own orchestrator and scheduler as well. But ultimately we felt Kubernetes was good enough for what we needed at that time. And then what we discovered was that Talos was a great packaged option here. We started talking to the Sidero Labs team, and they introduced us to their kind of managed Kubernetes offering. It’s not fully managed, but... are you familiar with Omni?

Itiel Shwartz: No.

Siddharth Dhulipalla: Okay. So, Omni is a product that Sidero Labs has been working on which basically allows you to manage various Talos nodes. The way it works right now is that we have Talos as the base image that gets installed both on our bare metal servers and as our machine images for VMs.
So when these boot, they have a join token that lets them know they need to enroll into this Omni cluster. They speak to Omni, and Omni is aware of all of our compute nodes across the world. Then on the Omni side, we can choose to elect each node that has come up as either a control plane node or a worker node. So Omni makes it very easy for us to manage the end-to-end life cycle of the underlying Kubernetes nodes that we’re bringing online. From there, there’s a huge amount of work that went in on our end on how we schedule new pods in under 3 seconds, where our P95 is under 3 seconds including container pulls, right? And these containers are massive. We’re talking 3 to 10 gigs, and some customers are even above 20 gigs. So doing a container pull after the pod is scheduled is almost a non-starter, because that means in the hot path the customer’s matchmaker says, hey, I found these six players ready to play a match, give me a game server, and if we’re sitting there for two or three minutes pulling a container, the players are stuck waiting for the game server to start. So yeah, happy to go into optimizations around start times, optimizations around container registry pulls, wherever you want to double-click.
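
The "two or three minutes" figure is consistent with simple transfer arithmetic. The Python sketch below computes a lower bound on pull time from image size and link speed; the link speeds in the usage note are assumptions for illustration, and the bound ignores registry throttling, decompression, and layer unpacking, which only add time.

```python
STARTUP_BUDGET_S = 3.0  # the P95 pod startup target mentioned in the episode


def pull_seconds(image_gb, link_gbps):
    """Lower bound on transfer time for an image of image_gb gigabytes."""
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    return image_gb * 8 / link_gbps  # gigabytes -> gigabits, then divide by Gbit/s


def pull_fits_budget(image_gb, link_gbps, budget_s=STARTUP_BUDGET_S):
    """True only if the raw transfer alone would fit in the startup budget."""
    return pull_seconds(image_gb, link_gbps) <= budget_s
```

For example, a 20 GB image over an assumed 1 Gbit/s link needs at least `pull_seconds(20, 1)` = 160 seconds, and even a 10 GB image over an assumed 10 Gbit/s link needs 8 seconds, so neither fits a 3-second start. The image has to be on the node, or very close to it, before the matchmaker asks for a server.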

Itiel Shwartz: And we don’t have a lot of time, so I will ask: you talked about where you are currently, and as you guys are using Talos, maybe share a bit about where you think the industry is going from a technical perspective. A year from now, two years from now, what technologies do you believe will make your life simpler? Is there anything that is up and coming? I don’t know the space well enough, so I’ll be happy to learn.

Siddharth Dhulipalla: Yeah, I think some of the open challenges that we still have with Kubernetes, which I don’t really know if they are being solved: one is that I think the networking model is still not a perfect fit for us. Between hostPort, NodePort, and ingress, we basically had to resort to hostPort because we just need to expose a port.

Itiel Shwartz: Yeah.

Siddharth Dhulipalla: But then for hostPort, you have to specify the host port in the PodSpec, and if there’s ever a conflict, two pod specs with the same host port cannot be scheduled on the same machine. That’s a huge problem for us, and we don’t have a way around it right now. So we are thinking about maybe switching to ingress and having an Envoy container that handles that routing, but it’s not fun. Ideally what I would want is to say, hey, this is my internal port, but map me to any external port that’s available on the machine. That would be amazing. Then another thing that we’re struggling with is just the...
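
The behavior Sid wishes for, "this is my internal port, map me to any free external port", can be sketched as a per-node allocator. This is a hypothetical Python illustration of the idea and not a Kubernetes API or a Hathora component; the class name, the default port range, and the method names are all assumptions.

```python
class NodePortAllocator:
    """Tracks which host ports are in use on a single node (hypothetical helper)."""

    def __init__(self, low=30000, high=30009):
        self._free = set(range(low, high + 1))
        self._by_pod = {}

    def assign(self, pod_name):
        """Give the pod any free host port; raise if the node has none left."""
        if pod_name in self._by_pod:
            return self._by_pod[pod_name]
        if not self._free:
            raise RuntimeError("no free host ports on this node")
        port = min(self._free)
        self._free.remove(port)
        self._by_pod[pod_name] = port
        return port

    def release(self, pod_name):
        """Return the pod's host port to the free pool when the match ends."""
        port = self._by_pod.pop(pod_name, None)
        if port is not None:
            self._free.add(port)
```

With an allocator like this, two game servers that both listen on the same internal port could share a machine, each reachable on a different external port. With a hostPort fixed in the pod spec, the second pod cannot be placed on that machine.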

Itiel Shwartz: Are you using the default network CNI, or Cilium, or something like that?

Siddharth Dhulipalla: We’re using Calico. But yeah, the other thing that we’re struggling with is that even after we’ve optimized, let’s say the image is cached on the machine already, it still takes between 1 and 3 seconds for a pod to get scheduled and started. Ideally, we would like to get that down to 100 milliseconds, right? That goal we’ve kind of given up on if we continue to use Kubernetes as the scheduler, because the kubelet polls and there’s that delay there. Technically there’s now an event-driven kubelet that’s starting to come up as well, which we’re excited to try when it’s a little more mature, but there are still a few too many things in there. Kubernetes, I don’t think, was designed for this. So that’s one piece, and then the other piece is around container pulls. We run these very large clusters, and if another node on the LAN already has a container image, we would like to pull it from there. So there’s a project called Kube-Fledged that basically does peer-to-peer pulls, where each node advertises which container images it has available, and Kube-Fledged can try to pull from those before it goes out to the open internet and pulls from your registry. But the support for it is still a little iffy, and it took us a lot of effort to get it working. And then there are some things around observability and monitoring that I feel could get better with Kubernetes as well. I’m really hoping, you know, there are some RFCs out right now that address some of these, but I haven’t kept up either.
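
The peer-first pull described above boils down to a lookup: check whether any node in the cluster already advertises the image, and fall back to the registry only if none does. The Python sketch below illustrates that decision only. It does not reflect Kube-Fledged's actual API or behavior, and the function name, the `peers` mapping, and the `"registry"` placeholder are assumptions.

```python
def choose_image_source(image, peers, registry="registry"):
    """Pick where to pull an image from.

    peers maps a node name to the set of image references cached on that node.
    Returns the first peer (in name order) that has the image, else the registry.
    """
    for node in sorted(peers):
        if image in peers[node]:
            return node
    return registry
```

For instance, with `peers = {"node-a": {"game:v1"}, "node-b": {"game:v2"}}`, a request for `"game:v1"` is served from `node-a` over the local network, while `"game:v3"` falls through to the registry.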

Itiel Shwartz: Okay. No, it’s a great answer, and I think your use case is indeed, on one hand, quite unique, right? It’s not what I think Kubernetes 1.0 had in mind when it was released, but it is part of this bigger movement that I see of Kubernetes as the new cloud: how can I run everything that I could have run using a cloud or VMs on top of Kubernetes? I don’t care if it’s low latency, high resiliency, high-throughput network, whatever. Kubernetes, just give me the relevant platform to support it. So I think you’re really taking Kubernetes to the edge in different areas. But I do hope that the platform itself will keep on growing to support that, even your requirements. It’s quite beautiful that you are able to use Kubernetes, right, for very low latency, very critical, and very network-intensive workloads. So it’s nice that it got you so far, but yeah, I feel that as an industry we have a lot to do.

Siddharth Dhulipalla: Yeah, I’m very grateful for that, right? I don’t take that for granted. I think our requirements are already pretty crazy, and I’m very happy that Kubernetes is functional enough to the point where our business is successful. But it’s one of those things where, 10 years from now, if our business continues to scale at the rate that it’s scaling, it probably makes sense for us to start investing in our own scheduler at that point, instead of relying on whatever quirks Kubernetes has and waiting for the community to actually make fixes.

Itiel Shwartz: Okay. Any last words? Do you want to publish something, say something to our listeners?

Siddharth Dhulipalla: Yeah. Well, we uh are going to be at the Game Developers Conference uh in San Francisco next month. So, if any of the listeners are going to be around, I would love to talk shop with them.

Itiel Shwartz: Okay, cool. And, as you heard during the episode, that’s usually AI, gaming, or anyone else that has a very complex stateful application, running on top of Kubernetes or not running on top of Kubernetes, and wants someone to take the burden and let a team of experts manage it for them, which sounds quite amazing. Okay, thank you very much, Sid. A very interesting and enlightening episode.

Siddharth Dhulipalla: Awesome. Thanks for having me.

[Music] Kubernetes for Humans.

This is an AI generated transcript of the conversation

About the Guest

Siddharth Dhulipalla

CEO and Co-founder, Hathora.dev

Siddharth (Sid) Dhulipalla is the CEO and co-founder of Hathora.dev, which provides game infrastructure to studios — primarily hosting the dedicated server component that keeps real-time multiplayer state in sync. A Carnegie Mellon CS grad, Sid previously worked on multi-cloud standardization at Databricks across AWS, Azure, and GCP, and led the cloud team at Palantir in the pre-container era of Puppet and Ansible. He started Hathora roughly three years ago to tackle stateful, low-latency workloads at scale, and the company now runs a large hybrid Kubernetes footprint spanning bare metal and cloud VMs.

Resources Mentioned

Kubefledged (peer-to-peer image pulls)