#004 – Kubernetes For Humans Podcast with Andy McMahon (NatWest Group)

Itiel Shwartz: Hello everyone, and welcome back to another episode of the Kubernetes for Humans podcast. With me today, we have a special guest, Andy McMahon, who is Scottish but lives in Ireland—or is it the other way around? Andy, please introduce yourself.

Andy McMahon: I have an Irish name, but I’m Scottish. I live in Scotland, so no Irish connection other than the name. I’m Andy McMahon, head of MLOps at NatWest Group, one of the largest financial services organizations in the United Kingdom. We serve over 19 million customers, ranging from individual accounts to huge organizations. We handle a significant amount of traffic and transactions—about a third of all British pound transactions globally come through our systems. My job as head of MLOps is to help operationalize the incredible data science capabilities we have. We take all of that transactional data, customer understanding, and economic insights and turn them into solutions that benefit our stakeholders and customers. My role involves taking these promising ideas and turning them into productionized solutions that drive value.

Itiel Shwartz: That sounds super interesting. I have a lot of follow-up questions, but first, could you give us a brief history of how you ended up in MLOps? Did you start as an engineer, in operations, or as a data scientist?

Andy McMahon: Sure. I started with a PhD in Physics at Imperial College London, working on computational modeling of solar cell materials, quantum mechanics, and other abstract topics. I was always interested in translating that abstract knowledge into practical value, which led me to work on things like solar cell materials. This naturally got me thinking about wider career paths. The program I was in was very good at advising that becoming an academic is a difficult road, and even if you want to become a professor, you might not make it—so consider going into industry.

Around that time, there was a lot of buzz about data science being the sexiest job of the 21st century, so I looked into it more. I realized it was a great place to apply my way of thinking about complex problems, my Python skills, and my knowledge of data analysis and statistics to real-world problems, generating real value. So I transitioned into data science.

I started out in a small startup working in oil and gas and logistics optimization, then moved into the energy space and distributed energy systems. Eventually, that led me to NatWest as an ML engineering lead, and later head of MLOps. Throughout that journey, I became more interested in the engineering aspects rather than just focusing on the coolest algorithms or the most fun analyses. I picked up a lot of software engineering, DevOps practices, and platform knowledge, and became more focused on how to take a model built by a data scientist or AI engineer and translate it into a pipeline that can be run, an application a customer can interact with, or something that can scale to hundreds or thousands of users.

Over time, I found myself more drawn to ML engineering rather than purely data science, which eventually led to my current role, where I focus on strategy and broader questions about MLOps and its future at the bank.

Itiel Shwartz: That’s quite a journey. I’ve met a lot of MLOps professionals, and most come from a data engineering background. They were engineers who transitioned into MLOps, which is essentially data engineering with a bit more focus on models. The fact that you came from the other way around, with a PhD in physics, seems quite rare. Is that common in the banking industry, or is it just rare in general?

Andy McMahon: I think you’re right that it’s a bit rarer. There are a few reasons for that. If you’re a data scientist, you might just enjoy building the coolest models and diving deep into algorithms. When you start getting exposed to the engineering side, it might feel like it takes away from that activity, which is understandable for some people. On the other hand, if you come from a data engineering or ML engineering background, it’s all engineering, so the transition might be less of a cultural shift.

There’s also sometimes a bit of fear around the technologies and practices you have to pick up. That’s part of why I wrote my book, Machine Learning Engineering with Python. I wanted to help break down those barriers and make it less intimidating for engineers to learn about data science and for data scientists to learn about engineering.

More and more, the data science community is having to pick up engineering skills—they just have to become more like ML engineers. This doesn’t diminish the value of what they do; it’s just that taking things into production is now absolutely critical. Organizations can no longer get away with building lots of Jupyter notebooks that look cool but don’t get deployed. There’s still a place for that, but to get a return on investment, these have to become products and solutions that really drive value at the coalface of the business.

I think the transition I’ve been on will become more and more common. A few people in my team have similar backgrounds—data scientists who transitioned into more ML engineering roles—but you’re right, in general, it’s more common to go from a software or data engineering background. That said, we need everyone to be doing this. Even data scientists who are purely focused on data science need to have an understanding and appreciation of engineering. They don’t need to build everything themselves, but they do need to speak the language, and that’s a really important journey for the industry over the next few years.

Itiel Shwartz: Absolutely. Having that knowledge is priceless. In one of my previous jobs, I worked on MLOps—I don’t think it was even called MLOps back then—and the data scientist I worked with had a PhD in physics. We were working on fraud detection algorithms, and he had zero idea how his model would be transformed into something that could be used in production. He was super smart, but I saw him struggling with it. It was so hard to move his model into production, so I ended up telling him, “I’m going to write you a framework. All you need to do is follow these guidelines and write within this framework, and I promise you everything will work in production.” After that, the velocity of writing and adding models to production went from once every few months to once a week or so. It all came down to the framework we built.

He spent all his time thinking about detecting fraud—what are the numbers, what vectors to use, what model to choose. He didn’t think about how Kubernetes clusters interacted with his model, and he didn’t need to—someone else did. It’s hard to be really good at both, so let’s talk a bit about Kubernetes. How did you first encounter Kubernetes? Was it in your current job or a previous one? And what’s your take on it, especially in the context of machine learning?

Andy McMahon: First, I think the story you just told is so typical. It really resonates with me. A big part of being successful in this space is helping incredibly smart people who can build models understand that if they can work out how a neural network works, they can work out how to deploy a model. It’s not super difficult; it’s just a different skill.

As for Kubernetes, in the organization I’m currently with, NatWest, and in most of the other organizations I’ve worked in, Kubernetes is one of the workhorses of traditional application development. There are a lot of people working in web development and other applications in the organization who use Kubernetes. They have strong capabilities and good Kubernetes engineers who understand the whole process, from building something, to containerizing it in Docker, to deploying it in Kubernetes and managing those clusters.

In the ML space, though, it’s really not being done. This is part of the journey I’m on with Kubernetes—trying to fly the flag for it as part of our full spectrum of MLOps capabilities. There are pockets of people experimenting with it in the bank or in other data science outfits that I know, but very few ML engineers or data scientists are comfortably using Kubernetes. That’s something I want to change, at least for the cases where it makes the most sense.

What’s important is that if you’re in a large organization, or even in other organizations where you’re already delivering value through whatever route to production you have, you don’t necessarily come in and say, “I don’t like the tool we’re using; we’re now using Kubernetes.” That’s obviously a bad idea. At NatWest, over the past couple of years, we’ve made a big transition in our MLOps capability to using the SageMaker ecosystem on AWS. It provides a lot of easy-to-use capabilities that help data scientists get their models into production, running, and orchestrated. But I don’t think it solves everything, and I’m interested in those 10% of use cases where you need more control, more scale, a deeper understanding, and finer tuning of what you want to happen. I want to have that capability available for people.

This is a statement not just about my organization but more broadly in the data science space. That’s where Kubernetes and its ecosystem, including things like Kubeflow, come in. I think that’s starting to enable people to think about those edge cases where you need dynamic scaling, sharp resource allocation, or something different that isn’t covered by traditional platform-as-a-service ML offerings.

The journey I’ve been on is about how to help sell that vision to people. In the second edition of my book, which is about to come out in the next few weeks, I’ve included a large section with practical examples using Kubeflow. I go through and show that this isn’t terrifying; it’s not scary. It’s just another pipelining tool, another way to look at the question of having a model, a process, and deploying it. If you’ve played with tools like ZenML, Airflow, or others, Kubeflow is just a variant of that with a Kubernetes flavor. I see it as a gateway for people to get more into Kubernetes so that if they want to build a more low-level web application that includes ML but also has other elements, they’re not terrified to take that leap.

Itiel Shwartz: I think that’s a great approach. A lot of people talk about ML in Kubernetes and tools like Kubeflow, and it’s gaining a lot of popularity. But at the end of the day, most companies I meet are still wary of it. It’s not their go-to tool or platform. Why do you think that is? Kubeflow is great if you’re using Kubernetes, so why isn’t it more popular? What’s holding people back?

Andy McMahon: That’s a great question. From diving deep into it, going through it with my eyes as an ML engineer, and using it, I think there’s still a learning curve that’s steeper than with other tools. Sometimes, that’s just due to the language used. Kubeflow naturally leverages a lot of terminology from the Kubernetes world (pods, for example), and that doesn’t always resonate with what ML engineers are used to. There are also more considerations around containerization as a fundamental concept. While data scientists are comfortable with containerization, the ideas of container replicas being managed and resources being dynamically allocated are newer to them.

To your earlier point about the scientist who likes to focus on the model, the emphasis in something like Kubeflow is still not on the model; it’s on the surrounding infrastructure and the pipelines, which it should be. But that does require some translation.

Once you start using it and building with it, the barrier to entry for something like Kubeflow is much lower than with vanilla Kubernetes. I think that will encourage more people to use it.

Itiel Shwartz: Let’s say I’m a DevOps engineer or an SRE in a company, and I’m thinking about using Kubeflow. What makes Kubeflow such a good fit when working with Kubernetes? Why should I use it? Give me your pitch.

Andy McMahon: The key thing about Kubeflow is the core components that come with it, which have been designed with data scientists and ML engineers in mind. You have the classic central dashboard for managing workloads, but you also get Kubeflow Notebooks. These allow you to leverage the underlying Kubernetes infrastructure while still using a notebook, like Jupyter locally or Colab. This makes it an easier pill to swallow because you can still play around with notebooks while starting to use the Kubeflow API.

Then, there are the training operators, which provide nice API wrappers for scaling up models while still focusing on writing your code. You write functions that don’t look too different from what you’re used to writing, using the right decorators and API components to run things in a scalable fashion, managed by Kubeflow. This abstraction is key, and these operators are easy and intuitive to understand. They work with TensorFlow, PyTorch, XGBoost, and there are ways to make them work with libraries like scikit-learn.

The final piece that’s super important is Kubeflow Pipelines. This isn’t unique to Kubeflow—there are tons of tools that do this—but combined with everything else, it’s very powerful. I’ve said this in talks and on other podcasts: the concept of a pipeline is the key concept for people transitioning from scientists to engineers in the ML space. A pipeline becomes an entity you host, run, orchestrate, and manage. It’s something you start building software engineering practices around, and the Kubeflow Pipelines API makes it easy to build that up.

If you’re used to tools like ZenML or Airflow, the concept of individual steps and how they become part of a wider whole is intuitive. That coupled with everything else means you can play around in your notebooks, use the operators to build scalable versions of your ML code, and wrap it all in a pipeline within this one self-contained ecosystem.

The final point is that it’s open-source and based on Kubernetes, so it’s completely platform-independent. That’s appealing to larger organizations, in particular, who might be finding themselves locked into specific vendors like AWS, Google Cloud, or Azure. You might get to a point where you want to be multi-cloud, run things on-premise, or change what you’re doing tomorrow. Running in a Kubeflow or Kubernetes world makes those decisions easier.

Itiel Shwartz: I really think Kubernetes is the new cloud in a lot of ways. You get so deep into Kubernetes that you don’t even care what’s running behind the scenes anymore. It’s very interesting. But with any good abstraction layer, something usually gets lost. What’s the ugly side of Kubeflow? What’s the downside, or why shouldn’t people use it tomorrow?

Andy McMahon: That’s a fair question. Related to the points we’ve been discussing about adoption, there still isn’t a critical mass of understanding. If you use many other tools in the ML or data science community, there’s a huge user base, countless examples online, courses, documentation—it’s all been ironed out. If you search for something like getting started with scikit-learn, you’ll find millions of articles. But if you look for good examples of Kubeflow in production, there’s far less out there. That means building up practical examples and experience can be more of a trial-and-error process, which some organizations might not like.

Kubeflow is getting there, and they’re making progress, but it still hasn’t reached that critical mass. That can make it harder to find resources, and organizations might not be able to point to many successful cases and say, “Look, that’s the go-to example.” This can impact adoption, at least for a while.

Another issue is that Kubernetes can be really hard to debug. You get these errors that, if you’re not an expert or haven’t been doing it for a while, can be tough to understand. That can become a time sink.

And then, as more people use Kubeflow in production, questions around security, networking, and connections to other systems will come up—just like with Kubernetes itself. This is the flip side of the benefits we mentioned earlier. It’s open-source and platform-independent, but you often have to think about networking, security, and other connections a bit more.

So, those aren’t reasons to avoid it entirely, but reasons to recognize that it’s not all a beautiful sunset hill. There are always challenges you have to work through.

Itiel Shwartz: That makes total sense. Regarding your point about the difficulties of understanding Kubernetes when something fails, I actually built a whole company around the idea of empowering people who aren’t necessarily Kubernetes experts by giving them the tools and platform to solve those problems. So, I’m with you on that.

You work in a very big organization, in a fairly senior position. Could you share a bit about what it’s like to push new technologies in a company of that size? There are so many people and opinions, and if you ask two engineers, they’ll probably suggest different tools and methodologies. So, let’s say I love Kubeflow and want to get others in the organization to use it—what’s your strategy? Is it more top-down, where you just enforce it, or do you try to get buy-in from people?

Andy McMahon: In an organization as large as NatWest, with around 60,000 people across the UK, Poland, and India, there’s a combination of approaches for all technologies. Some tools are going to be an important part of our stack across everything—engineering, web applications, ML, and so on—like our core security infrastructure, for example.

When it comes to data science and ML, my team works closely with our Data Innovation function and other partners across data analytics. We have a lot of innovative teams constantly pushing the envelope, and our Chief Data and Analytics Officer, Zach Anderson, encourages us to do that. If we see something good, we bring it to the community.

We have lots of communities of practice. I run the MLOps Center of Excellence, but we also have a Data Science Community of Practice and a Data Engineering Community of Practice. People often bring things to those communities—500 data scientists and engineers might get on a call, and someone might say, “I was playing around with this new tool, and it worked really well inside the company. Maybe we should consider it.” Sometimes it’s about building interest and critical mass, which then becomes a case for investment in resources, energy, time, and money.

Another part of my job is to scout the horizon, look at what’s coming, and think about these questions. Sometimes that leads to experimentation because we see where the future is heading and decide to explore certain things.

What I will say is that we’re pretty good at not following hype for hype’s sake. We like to prove out value and bootstrap. If we don’t see value from the initial experiments and don’t know how to fix that, it’s probably a sign that it’s not going to work. If the first few people using Kubeflow are good engineers and data scientists, and they try it but there’s a big list of issues we don’t know how to fix, it’s probably a bad sign. That doesn’t mean it’s dead in the water, but it does mean we need to think about the broader ecosystem or what else might need to change.

Finally, when we do commit to a technology or path, we make sure it’s a community-wide initiative. For example, when we transitioned to AWS, our wider team did a lot of work on creating educational paths, internal tutorials, documentation, and examples. We made sure everyone in data and analytics was involved and that people were excited to be the first use cases. It’s always a badge of honor to be the first.

Top-down mandates generally don’t work unless it’s something core like security infrastructure or a decision to ban a tool for ethical reasons. Otherwise, people will resist. But if they’re along for the journey and excited about it, that changes everything. They’ll adopt it on their own and work through the issues themselves, which is the kind of organization you want to create.

Itiel Shwartz: That makes a lot of sense. I think the term everyone on this podcast has used so far is “critical mass”—how do you gain the trust of enough people in the organization to make it the obvious choice? Once you reach that tipping point, people don’t even think about using something else.

To finish up, I’d love to hear your predictions about the future of MLOps, Kubernetes, or the combination of the two. What’s going to happen in the upcoming years?

Andy McMahon: I’ll give you two predictions that are somewhat related. First, in the broader MLOps space, I think large language models (LLMs) and generative AI are going to be huge. There will be a massive need for upskilling in these new capabilities—understanding LLMs, fine-tuning them with your data to give your version of the model an edge. Because if everyone is using the same model, like LLaMA 2 or GPT-4, you need to fine-tune it to stand out.

This will drive a need for large-scale, scalable infrastructure and understanding. That’s where I see Kubernetes and tools like Kubeflow playing a bigger role, especially for large-scale data processing or model serving. So, on the MLOps side, it’s all about LLMs and LLMOps, as it’s being termed. On the back of that, we’ll see a greater need for understanding large-scale infrastructure and fine-tuning. Everything will become more compute-heavy, and tools like Kubernetes, Ray, and Spark will be used more frequently to build and scale these large models.

The future is bright if you’re interested in scale. This is the time for ML engineers to shine—over the next few years, scale will be king. That’s my prediction. It’s vague enough that I won’t lose any bets, but I think it’s on point.

Itiel Shwartz: We’ll have to bring you back in two years to see how things have panned out.

Udi Hofesh: I just wanted to ask something I’ve been waiting to ask ever since you mentioned your degree in physics. There’s a famous line by Richard Feynman where he said, “If I could explain it to you, I wouldn’t understand it myself.” When you’re trying to enable Kubernetes for data engineers or debugging Kubernetes, do you feel the same way—like it just works, and you can’t explain why?

Andy McMahon: I think I’m wise enough now to know that I really don’t know how anything works, and that definitely applies to Kubernetes as well. What I can say is that I probably understand enough now to sell the benefits and use it effectively. But we’ll always be learning. Anyone who says they know all the answers is lying. It’s good to be humble—I’m definitely humbled by tools like Kubernetes and others. It’s a great question, Udi. It’s important to remain humble and realize you don’t know everything.

Udi Hofesh: That’s a good message for our listeners as well. I also wanted to mention that you’ve had one successful delivery recently, and another one coming up—one is a baby, the other is a book. Do you want to speak about that?

Andy McMahon: Yes, a successful delivery of my second baby in production: that’s the birth of my second son, Alfie, at the end of May. And in the next few weeks, we’re releasing the second edition of my book, Machine Learning Engineering with Python, published by Packt. This is my third child, so to speak, and it probably occupies me way too much. It’s a really big improvement on the first edition, with 150 new pages, making it 400 pages long. There’s a lot of new content about Kubeflow, Ray, ZenML, large language models, and generative AI. It’s been brought right up to date, and I’m really excited about it. The book is for bridging that gap between the science and engineering worlds, breaking down barriers, and helping data scientists and software engineers understand each other’s language to build really cool applications.

Itiel Shwartz: Where can people get it?

Andy McMahon: They can get it on Amazon—Amazon.com, .co.uk, or wherever you shop. It’s available for pre-order now and will be out in early September. You can also get it on Packt’s website and probably anywhere else that sells books. I’d say go to Amazon, and we’ll try to include a link in the show notes.

Itiel Shwartz: I had a great time, Andy. I wish you the best of luck with the future of MLOps. It’s really the bleeding edge—Kubernetes is new, MLOps is new, and doing things in that space is really living on the edge. But you make it sound like you have things under control. I wish you the best of luck, and I was really happy to have you on the show.

Andy McMahon: Thanks, Itiel. It’s been great chatting, and best of luck with Komodor and everything you’re doing. I love all the stuff you’re putting out—the podcast is great, and the content on your website is great too. Best of luck as well.

Itiel Shwartz: Thank you very much. Goodbye.

Andy McMahon: Cheers, thanks.

[music]

Andrew P. McMahon is a data scientist and machine learning engineer with several years of experience leading teams that deliver value using cutting-edge technology. He specializes in helping organizations take their initial machine learning and data proof-of-concept solutions through to production. He led machine learning development in companies working across logistics optimization, distributed energy systems, and now in financial services.

He also has a PhD in theoretical condensed matter physics from Imperial College London and has been a part-time science consultant for the Discovery Channel. 

The second edition of his book, “Machine Learning Engineering with Python”, is out now!

Itiel Shwartz is CTO and co-founder of Komodor, a company building the next-gen Kubernetes management platform for Engineers.

He has worked at eBay, Forter, and Rookout as the first developer.

A backend and infrastructure developer turned ‘DevOps’, he is an avid public speaker who loves talking about infrastructure, Kubernetes, Python observability, and the evolution of R&D culture. He is also the host of the Kubernetes for Humans Podcast.

Please note: This transcript was generated using automatic transcription software. While we strive for accuracy, there may be slight discrepancies between the text and the audio. For the most precise understanding, we recommend listening to the podcast episode.