Kubeflow: Architecture, Tutorial, and Best Practices

What Is Kubeflow? 

Kubeflow is an open-source platform that simplifies the deployment and management of machine learning (ML) workflows on Kubernetes. It provides a framework to manage ML tasks, such as experimentation, training, and deployment, all within a Kubernetes-based infrastructure. Originally developed by Google, Kubeflow extends Kubernetes to support ML workloads by providing components for the various stages of the ML lifecycle.

The platform aims to make ML model operations more manageable by leveraging Kubernetes for scaling and orchestration. With Kubeflow, developers can craft end-to-end ML workflows, taking advantage of Kubernetes’ scalability, resilience, and distributed nature. 

This is part of a series of articles about Kubernetes tools

Key Components of Kubeflow 

Kubeflow’s architecture comprises several components, each addressing different aspects of the ML pipeline.

Kubeflow Pipelines

Kubeflow Pipelines is a feature for constructing and managing complex ML workflows on Kubernetes. It allows users to create automated, portable, and scalable pipelines, using Argo Workflows under the hood for task orchestration. Pipelines provide a UI for tracking experiments, jobs, and outputs, improving reproducibility and collaboration among teams.

With its SDK, Kubeflow Pipelines supports the definition of ML workflows using Python, simplifying pipeline creation and maintenance. This enables users to quickly iterate on models and leverage Kubernetes’ ability to handle diverse workloads.

Kubeflow Notebooks

Kubeflow Notebooks provide an interactive environment to develop and fine-tune ML models. These notebooks run in Kubernetes pods, offering scalability and resource optimization for data scientists. By utilizing Jupyter Notebooks, Kubeflow enables collaborative experimentation and quick prototyping for ML projects.

The integration with Kubernetes means that resources can be dynamically allocated or reallocated, optimizing for the workload’s demands. Kubeflow Notebooks can also be integrated with other Kubeflow components, fostering a continuum from experimentation to deployment for ML workflows.

Kubeflow Central Dashboard

The Kubeflow Central Dashboard is the command center for managing and deploying ML workflows across Kubeflow’s ecosystem. It aggregates all major components, allowing users to configure deployments, monitor resource usage, and simplify operations. Its interface provides visibility into all running experiments and tasks, enabling efficient oversight.

Katib

Katib is Kubeflow’s component dedicated to hyperparameter tuning, an essential step in optimizing ML models. It enables automated experiments to test different configurations, leveraging Kubernetes to manage and scale computational resources dynamically. Katib integrates with various ML frameworks.

Katib supports multiple search algorithms, such as grid search, random search, and Bayesian optimization, catering to diverse optimization needs. This ensures that models can achieve desired performance metrics efficiently.
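For illustration, here is a rough sketch of launching a tuning experiment with the Katib Python SDK (the kubeflow-katib package). The objective function, parameter names, and trial budget are hypothetical, and exact method signatures may differ between SDK versions, so treat this as a sketch rather than a definitive recipe:

import kubeflow.katib as katib

# Hypothetical objective: Katib runs this function once per trial and
# reads the printed metric to compare configurations.
def objective(parameters):
    result = 4 * int(parameters['a']) - float(parameters['b']) ** 2
    print(f'result={result}')

# Search space for two hypothetical hyperparameters.
parameters = {
    'a': katib.search.int(min=10, max=20),
    'b': katib.search.double(min=0.1, max=0.2),
}

katib_client = katib.KatibClient()
katib_client.tune(
    name='tune-experiment',          # placeholder experiment name
    objective=objective,
    parameters=parameters,
    objective_metric_name='result',  # metric printed by the objective
    max_trial_count=12,              # total trials Katib will schedule
)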

Kubeflow Training Operator

The Kubeflow Training Operator manages distributed training jobs across Kubernetes clusters. It supports various ML frameworks, enabling users to leverage multiple GPUs and nodes efficiently. By enabling distributed training, it significantly reduces model training times, scaling workloads in dynamic environments.

This component capitalizes on Kubernetes’ orchestration capabilities, utilizing cluster resources to support large-scale training operations. It abstracts complex setup requirements, allowing data scientists to focus on model refinement rather than infrastructure management.
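As a rough illustration, the snippet below sketches how a distributed job might be submitted through the Training Operator's Python SDK (the kubeflow-training package). The training function, job name, and resource values are placeholders, and parameter names vary between SDK versions, so consult the Training Operator documentation before relying on this:

from kubeflow.training import TrainingClient

def train_func():
    # Placeholder training loop; a real job would build the model, wrap it
    # for distributed execution, and run the training epochs here.
    print('training one worker of the distributed job')

# Submit the function as a distributed job; the operator creates one pod per
# worker and injects the environment needed for distributed training.
TrainingClient().create_job(
    name='example-distributed-job',  # placeholder job name
    train_func=train_func,
    num_workers=3,                   # worker replicas to run in parallel
    resources_per_worker={'cpu': '2', 'memory': '4Gi'},
)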

KServe (Previously KFServing)

KServe, formerly known as KFServing, focuses on serving ML models at scale. This component simplifies model deployment, enabling resilient inference services. It supports multiple frameworks and autoscaling, adapting to varying demands while ensuring high availability and low latency.

KServe’s design allows for different serving runtimes, alongside integration with Knative for serverless, event-driven autoscaling. This ensures that inference workloads can scale up or down with usage patterns.
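To make this concrete, the sketch below uses the KServe Python SDK (the kserve package) to declare a simple InferenceService for a scikit-learn model. The service name, namespace, and storage URI are placeholders borrowed from the KServe examples, so adjust them to your environment:

from kubernetes import client
from kserve import (KServeClient, constants, V1beta1InferenceService,
                    V1beta1InferenceServiceSpec, V1beta1PredictorSpec,
                    V1beta1SKLearnSpec)

# Declare an InferenceService that serves a scikit-learn model from object
# storage; KServe provisions the serving runtime and handles autoscaling.
isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + '/v1beta1',
    kind=constants.KSERVE_KIND,
    metadata=client.V1ObjectMeta(name='sklearn-iris', namespace='kserve-test'),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            sklearn=V1beta1SKLearnSpec(
                storage_uri='gs://kfserving-examples/models/sklearn/1.0/model'
            )
        )
    ),
)

KServeClient().create(isvc)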

Related content: Read our guide to Kubernetes service

Kubeflow Tutorial: Create Your First Pipeline 

This section demonstrates how to create and execute a basic Kubeflow pipeline using the Kubeflow Pipelines SDK. Instructions are adapted from the Kubeflow documentation.

Step 1: Install the Kubeflow Pipelines SDK

To begin, install the Kubeflow Pipelines SDK, which provides the tools needed to define, compile, and submit pipelines. Run the following command:

pip install kfp

Step 2: Define a Pipeline Component

In Kubeflow, a component is a reusable step in a pipeline. Users define components using Python functions decorated with @dsl.component. Here’s an example:

from kfp import dsl

@dsl.component
def say_hello(name: str) -> str:
    hello_text = f'Hello, {name}!'
    print(hello_text)
    return hello_text

Some important details about the code:

  • The function say_hello takes a string input (name) and prints a greeting message.
  • The @dsl.component decorator converts this function into a Kubeflow component, allowing it to be used as a pipeline step.

Step 3: Create a Pipeline

A pipeline is a sequence of components that perform tasks. Pipelines are defined using the @dsl.pipeline decorator:

@dsl.pipeline
def hello_pipeline(recipient: str) -> str:
    hello_task = say_hello(name=recipient)
    return hello_task.output

Some important details about the code:

  • The hello_pipeline function orchestrates the say_hello component.
  • The function passes an input (recipient) to the component and returns its output.

Step 4: Compile the Pipeline

Before submitting a pipeline, developers must compile it into a YAML file using the Kubeflow Pipelines SDK Compiler:

from kfp import compiler
compiler.Compiler().compile(hello_pipeline, 'pipeline.yaml')

Some important details about the code:

  • The compile method converts the pipeline definition into a portable YAML file (pipeline.yaml).
  • This YAML file contains the pipeline’s structure and configuration.

Step 5: Submit the Pipeline for Execution

To execute the pipeline, use the Kubeflow Pipelines SDK Client. First, ensure there is a Kubeflow Pipelines backend deployed and note its endpoint. Then, use the following code:

from kfp.client import Client

client = Client(host='<EXAMPLE-KFP-ENDPOINT>')
run = client.create_run_from_pipeline_package(
    'pipeline.yaml',
    arguments={
        'recipient': 'World',
    },
)

Some important details about the code:

  • Replace <EXAMPLE-KFP-ENDPOINT> with the URL of the Kubeflow Pipelines deployment.
  • The create_run_from_pipeline_package method submits the pipeline YAML file along with input arguments (recipient='World' in this example).

Step 6: Monitor Execution

Once the pipeline is submitted, the client outputs a link to the Kubeflow UI, where users can monitor the execution. The pipeline will run a single task (say_hello), which prints and returns the message “Hello, World!”.
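Execution can also be tracked programmatically. As a brief, hedged example, the same client object from Step 5 can block until the run finishes; the 600-second timeout below is an arbitrary value:

# Wait for the submitted run to finish (or the timeout, in seconds, to expire).
client.wait_for_run_completion(run.run_id, timeout=600)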


Tips from the expert

Itiel Shwartz

Co-Founder & CTO

Itiel is the CTO and co-founder of Komodor. He’s a big believer in dev empowerment and moving fast, and has worked at eBay, Forter, and Rookout (as the founding engineer). Itiel is a backend and infra developer turned “DevOps”, and an avid public speaker who loves talking about topics such as cloud infrastructure, Kubernetes, Python, observability, and R&D culture.

In my experience, here are tips that can help you optimize Kubeflow usage for managing machine learning workflows effectively:

Use custom Docker images for pipelines:

Create and maintain custom Docker images tailored to the pipeline components. These images should include specific libraries and dependencies needed for the workflows. This ensures reproducibility and simplifies debugging in shared environments.
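For instance, with the Kubeflow Pipelines SDK a component can be pinned to a purpose-built image via the base_image argument; the image reference and package pin below are placeholders:

from kfp import dsl

# Pin the step to a custom image so every run uses the same libraries and
# dependencies; the image reference is a placeholder.
@dsl.component(base_image='registry.example.com/ml-team/trainer:1.2.0',
               packages_to_install=['pandas==2.2.2'])
def load_data(path: str) -> int:
    import pandas as pd
    df = pd.read_csv(path)
    return len(df)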

Integrate with distributed storage solutions:

Pair Kubeflow with distributed storage solutions like Google Cloud Storage for efficient data handling. These solutions provide scalable and shared storage, essential for managing large datasets in ML pipelines.

Adopt multi-tenancy for resource isolation:

Use namespaces and role-based access control (RBAC) within Kubernetes to create isolated environments for different teams or projects. This ensures secure and resource-efficient operations, especially in large, collaborative setups.

Enable automated retraining with event triggers:

Set up event-driven triggers using tools like Argo Workflows or Kubernetes-native events to automate model retraining when datasets are updated or model performance degrades. This keeps models relevant without manual intervention.

Monitor resource usage with Prometheus and Grafana:

Integrate Prometheus with Kubeflow to track resource consumption, and use Grafana for visual dashboards. Monitoring these metrics helps optimize resource allocation and detect bottlenecks in pipelines.

Kubeflow vs. MLflow 

Kubeflow and MLflow are two popular platforms for simplifying machine learning workflows, but they cater to different needs and approaches. Here is a comparison of their core features, use cases, and architecture:

Primary focus

  • Kubeflow: Focuses on leveraging Kubernetes to manage and scale ML workflows. It is suitable for complex, large-scale deployments.
  • MLflow: Focuses on experiment tracking, reproducibility, and model deployment. It is framework-agnostic and simpler to use in environments without Kubernetes.

Architecture

  • Kubeflow: Built on Kubernetes, it provides components for every stage of the ML lifecycle, including training, hyperparameter tuning, pipeline orchestration, and serving. Its tight integration with Kubernetes makes it a natural choice for cloud-native applications.
  • MLflow: Offers a lightweight, centralized platform with four key modules—Tracking, Projects, Models, and Model Registry. It does not require Kubernetes and can be deployed on various environments, from local machines to cloud servers.

Ease of use

  • Kubeflow: Offers extensive capabilities but has a steeper learning curve due to its reliance on Kubernetes concepts. It is suitable for teams already familiar with containerized applications and Kubernetes orchestration.
  • MLflow: Simpler to set up and use, making it more accessible for teams or individuals without expertise in Kubernetes. It requires minimal infrastructure knowledge for basic functionality.

Experimentation and tracking

  • Kubeflow: Provides tools like Katib for hyperparameter tuning and integrates with Kubeflow Pipelines for tracking experiments. However, its tracking capabilities are less comprehensive than MLflow’s.
  • MLflow: Excels in experiment tracking with a user-friendly interface. It allows detailed logging of parameters, metrics, and artifacts, and provides version control for models and experiments.

Model deployment

  • Kubeflow: Uses components like KServe for scalable, distributed model serving. It integrates with Kubernetes’ scaling and resource management features.
  • MLflow: Simplifies model deployment using its Models module, which supports multiple serving platforms, such as Docker and cloud-based APIs.

Target audience

  • Kubeflow: Best suited for organizations working on large-scale, distributed ML workflows requiring Kubernetes’ orchestration capabilities.
  • MLflow: Suitable for smaller teams or projects prioritizing simplicity, quick setup, and framework-agnostic tracking and deployment.

Related content: Read our guide to Argo workflows

Best Practices for Using Kubeflow 

Getting the most out of Kubeflow requires following certain best practices to ensure smooth integration and high performance.

1. Ensure Component Composability

Designing components within the Kubeflow pipeline to be composable ensures flexibility and reusability. By adhering to modular design principles, development teams can build pipelines where each component serves distinct functions. This composability enables rapid changes without overhauling entire workflows.

Component composability ensures that as individual tasks evolve, they can integrate within the existing pipeline, preserving system integrity. This approach minimizes disruptions and maximizes the utility of developed components.
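As an illustration, the sketch below uses the same Kubeflow Pipelines SDK shown in the tutorial above to chain two small, single-purpose components; either step can be reused in another pipeline or replaced without touching the rest of the workflow. The component names and logic are hypothetical:

from kfp import dsl

@dsl.component
def clean_text(text: str) -> str:
    # Single-purpose step: normalize the input.
    return text.strip().lower()

@dsl.component
def count_words(text: str) -> int:
    # Single-purpose step: compute a simple statistic.
    return len(text.split())

@dsl.pipeline
def text_stats_pipeline(raw_text: str) -> int:
    cleaned = clean_text(text=raw_text)
    counted = count_words(text=cleaned.output)
    return counted.output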

2. Parameterize Input and Output Paths

Parameterizing input and output paths within Kubeflow components is key for improving workflow flexibility. This practice decouples code from data specifics, allowing for easier updates and modifications without altering underlying scripts. It simplifies the passage of data through each pipeline stage, accommodating different datasets effortlessly.

By using parameters for paths, the pipelines cater to different environments and datasets. This adaptability reduces the friction associated with managing distinct ML tasks.
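In the Kubeflow Pipelines SDK this is typically expressed with artifact inputs and outputs, whose concrete paths are injected by the backend at runtime rather than hard-coded in the component. A minimal sketch (the preprocessing logic is hypothetical):

from kfp import dsl
from kfp.dsl import Dataset, Input, Output

@dsl.component
def preprocess(raw_data: Input[Dataset], clean_data: Output[Dataset]):
    # raw_data.path and clean_data.path are resolved by the pipeline backend,
    # so the same component works with any dataset or storage layout.
    with open(raw_data.path) as src, open(clean_data.path, 'w') as dst:
        dst.write(src.read().lower())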

3. Version Control Models and Code

Effective version control of models and code within Kubeflow is imperative for traceability and reproducibility. Implement source control strategies, utilizing Git or similar systems to track changes and manage model versions. This practice ensures rollback capability in response to unexpected performance shifts or errors.

Incorporating version control enables collaboration by providing transparency into model evolution and code adjustments. Teams benefit from shared history, which aids in consistent development even if members work remotely or across different regions.

4. Design for Horizontal Scalability

Designing Kubeflow pipelines with horizontal scalability in mind ensures that components can handle increasing workloads. By leveraging Kubernetes’ scaling capabilities, components can be deployed across multiple nodes, drawing on more resources as demand grows and maintaining performance efficiency.

Strategic design for scalability improves workflow reliability, preventing bottlenecks that could hinder model training or inference. This foresight enables adaptation to varying workload requirements, ensuring consistent application performance irrespective of scale.
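One way to express this with the Kubeflow Pipelines SDK is to fan work out with dsl.ParallelFor and set per-task resource limits, letting Kubernetes schedule the parallel tasks across nodes. A minimal sketch with hypothetical shard names and resource values:

from kfp import dsl

@dsl.component
def process_shard(shard: str) -> str:
    # Placeholder work on one shard of the dataset.
    return f'processed {shard}'

@dsl.pipeline
def sharded_pipeline():
    # Fan out one task per shard; Kubernetes schedules them across nodes.
    with dsl.ParallelFor(items=['shard-1', 'shard-2', 'shard-3']) as shard:
        process_shard(shard=shard).set_cpu_limit('2').set_memory_limit('4G')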

5. Ensure Environment Consistency with Containers

Using containers in Kubeflow ensures consistent environments across ML pipeline stages, aiding in dependency management. Containers encapsulate software, dependencies, and configurations, fostering reproducibility and reducing environmental discrepancies that could affect performance or accuracy.

Containers enable transitions between development, testing, and production environments, preserving system integrity. This consistency minimizes unexpected behavior, ensuring reliability in ML model execution.
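In Kubeflow Pipelines, a step can also be defined as a pure container component, so the exact image, command, and arguments used in development are the ones that run in production. A brief sketch with a placeholder image and script:

from kfp import dsl

# A fully containerized step: everything it needs is baked into the image,
# so development, testing, and production runs share an identical environment.
@dsl.container_component
def evaluate_model(model_dir: str):
    return dsl.ContainerSpec(
        image='registry.example.com/ml-team/evaluator:0.3.1',  # placeholder
        command=['python', 'evaluate.py'],
        args=['--model-dir', model_dir],
    )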

Managing ML Workflows on Kubernetes with Komodor

Komodor is the Continuous Kubernetes Reliability Platform, designed to democratize K8s expertise across the organization and enable engineering teams to leverage its full value.

Komodor’s platform empowers developers to confidently monitor and troubleshoot their workloads while allowing cluster operators to enforce standardization and optimize performance. 

By leveraging Komodor, companies of all sizes significantly improve reliability, productivity, and velocity. Or, to put it simply – Komodor helps you spend less time and resources on managing Kubernetes, and more time on innovating at scale.

If you are interested in checking out Komodor, use this link to sign up for a Free Trial.