Leveraging Argo Workflows for MLOps

Nir Ben Atar, DevOps Team Lead

11 min read December 21st, 2023

As the demand for AI-based solutions continues to rise, there’s a growing need to build machine learning pipelines quickly without sacrificing quality or reliability. However, since data scientists, software engineers, and operations engineers use specialized tools specific to their fields, synchronizing their workflows to create optimized ML pipelines is challenging.

Enter machine learning operations (MLOps), a set of practices, tools, and cultural principles inspired by DevOps. Under the MLOps model, collaboration between data scientists, DevOps engineers, and IT is encouraged so that ML pipelines can be efficiently built at scale. Unsurprisingly, this requires adopting some DevOps concepts such as automation and CI/CD pipelines, artifact storage, versioning and reproducibility, model testing, monitoring of operational metrics, event-driven workflows, modularity and reusability of components, and more.

In this article, we’ll take a closer look at MLOps, including best practices and real-life success stories. While MLOps bridges the gap between data science and operations, it does not tell you which tools to use or which platform is best. This particular article explores the main concepts of MLOps using Argo Workflows.

Understanding Argo Workflows

Let’s start by addressing why you should consider Argo Workflows for MLOps from a macro perspective.

The ideal tool should satisfy the needs of all the stakeholders in the MLOps process. For DevOps teams, a significant concern when building ML pipelines is the considerable resource demand. They’ll be worried about how efficiently the tool can scale compute and storage resources and how it can distribute the load.

On the other hand, data scientists are primarily concerned about whether a new ML pipeline platform supports their preferred tools and enhances productivity. This relates to popular Python frameworks like TensorFlow and PyTorch for machine learning, data analysis and manipulation tools like pandas, or interactive development environments such as Jupyter Notebook—in other words, the scientific Python ecosystem.

To that end, Argo Workflows checks all the right boxes.

Argo Workflows is a cloud-native workflow engine that enables MLOps teams to run complex, data-intensive jobs on top of Kubernetes. This gives you all the inherent advantages of Kubernetes, such as multicloud capability, great scalability, and resilience, as well as excellent resource management. Additionally, containers running on the Kubernetes cluster can use all kinds of specialized data science tools and frameworks, no matter what language they’re written in.

What’s more, Argo Workflows brings additional benefits to MLOps teams that this article will go over next.

MLOps with Argo Workflows

Beyond the advantages of being cloud-native, the Argo Workflows architecture also offers great flexibility. Basically, Argo Workflows consists of two main components:

The Workflow Controller is a Kubernetes controller that watches Workflow custom resources and reconciles them, creating the pods that run each workflow step
The Argo Server consists of a Kubernetes deployment and a service that is responsible for serving both the Argo API and Argo UI

Here’s an example of such components deployed in the Kubernetes cluster:

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/argo-server   ClusterIP   10.43.128.186   <none>        2746/TCP   2d23h

NAME                                  READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/workflow-controller   1/1     1            1           2d23h
deployment.apps/argo-server           1/1     1            1           2d23h

This design allows engineers to manage workflows either through the Argo CLI or graphically through the Argo UI. The latter is especially useful for team members who need to visualize task sequences or dependencies between them using a directed acyclic graph (DAG) or who simply prefer to work in a user-friendly graphical environment.

The Argo UI

Furthermore, since workflows are defined as Kubernetes custom resources (through CRDs) and managed by the Workflow Controller, Argo Workflows integrates seamlessly with other tools in the Kubernetes ecosystem, such as Prometheus for monitoring or MinIO for artifact management, thus creating a robust MLOps environment.

On top of all this, Argo Workflows allows MLOps teams to orchestrate ML pipelines that leverage the declarative approach of Kubernetes. Let’s review some examples of how MLOps teams can get the most out of this approach.

Workflow-Driven ML Pipelines

Defining end-to-end ML workflows in Argo is as simple as writing YAML manifests that outline the steps, dependencies, and resources required for each process phase, namely data preprocessing, model training, and deployment.

Here’s a simplified example of an ML preprocessing workflow in Argo:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: ml-pipeline
spec:
  entrypoint: preprocessing
  templates:
  - name: preprocessing
    container:
      image: my-data-preprocessing-image:latest
      command: ["python", "preprocess.py"]
    outputs:
      artifacts:
      - name: processed-data
        path: /data/processed
...

The manifest above uses the Workflow resource to define a preprocessing pipeline. The entrypoint field names the template that runs first, in this case preprocessing. That template runs the my-data-preprocessing-image:latest container image, which executes the Python code that processes the raw data. The workflow also declares an output artifact, processed-data, that stores the processed data so that additional steps within the same workflow, or other workflows, can access it. Furthermore, because this stage is defined as a template, an important feature of Argo Workflows, it can be reused as many times as necessary.

Additional tasks (such as model training, evaluation, and deployment) can follow a similar structure, including multiple steps and dependencies on other tasks or workflows as required.
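As a rough illustration of how a second stage could be chained onto the preprocessing template, the sketch below wires a training task to the processed-data artifact through a DAG. The training image, the train.py script, and the task names are hypothetical, and the example assumes an artifact repository is already configured for the cluster.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: pipeline
  templates:
  - name: pipeline
    dag:
      tasks:
      - name: preprocess
        template: preprocessing
      - name: train
        # Runs only after the preprocess task has completed successfully.
        depends: "preprocess"
        template: training
        arguments:
          artifacts:
          # Pass the preprocessing output to the training template.
          - name: processed-data
            from: "{{tasks.preprocess.outputs.artifacts.processed-data}}"
  - name: preprocessing
    container:
      image: my-data-preprocessing-image:latest
      command: ["python", "preprocess.py"]
    outputs:
      artifacts:
      - name: processed-data
        path: /data/processed
  - name: training
    inputs:
      artifacts:
      # The artifact is unpacked at this path before the container starts.
      - name: processed-data
        path: /data/processed
    container:
      # Hypothetical image and script, shown for illustration only.
      image: my-model-training-image:latest
      command: ["python", "train.py"]
```

Evaluation and deployment stages could be added the same way, each as its own template with its own depends expression.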

Versioning and Reproducibility

Argo Workflows facilitates versioning and reproducibility because workflows and workflow templates are declarative manifests that can be kept under version control like any other code. This allows you to track and roll back changes, fostering consistency and accountability. Each workflow iteration can be stored and recalled (for example, in Git, or in the workflow archive for completed runs), ensuring the reproducibility of results. This is critical in machine learning, where model performance and reliability are paramount. This traceability supports collaborative environments by providing a clear and understandable history of alterations, which in turn helps teams stay synchronized and deliver high-quality AI-based solutions.

The ability to save multiple versions of different workflow templates also encourages experimentation with more complex workflows that involve dependencies and the parallel execution of tasks.
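One possible way to keep a reusable, versioned step is to store it as a WorkflowTemplate and reference it from individual workflows. The sketch below assumes hypothetical names (ml-preprocessing, the pipeline-version label, the dataset parameter) and pins the image tag rather than using latest, which tends to make reruns easier to reproduce.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: WorkflowTemplate
metadata:
  name: ml-preprocessing
  labels:
    # Hypothetical label used to mark the template revision.
    pipeline-version: "v1"
spec:
  templates:
  - name: preprocessing
    inputs:
      parameters:
      - name: dataset
    container:
      # Pinned tag (hypothetical) instead of "latest".
      image: my-data-preprocessing-image:1.0.0
      command: ["python", "preprocess.py", "{{inputs.parameters.dataset}}"]
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: ml-pipeline-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: preprocess
        # Reuse the template stored in the cluster instead of redefining it.
        templateRef:
          name: ml-preprocessing
          template: preprocessing
        arguments:
          parameters:
          - name: dataset
            value: raw/2023-12
```

Keeping both manifests in a Git repository gives each change to the template a reviewable history.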

Dependency Management and Parallel Execution

Two powerful features of Argo Workflows are its ability to handle enhanced dependency logic and its ability to set synchronization limits for parallel execution of workflows.

On the one hand, parallel execution of nondependent tasks allows for improved performance and efficiency. Here is a code sample:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallel-
  labels:
    workflows.argoproj.io/test: "true"
  annotations:
    workflows.argoproj.io/description: |
      This workflow demonstrates running parallel containers within a single pod.
    workflows.argoproj.io/version: ">= 3.1.0"
spec:
  entrypoint: main
  templates:
    - name: main
      containerSet:
        containers:
          - name: a
            image: argoproj/argosay:v2
          - name: b
            image: argoproj/argosay:v2

As you can see, this workflow runs two containers, a and b, in parallel within a single pod.
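Parallel execution can also be capped. The following is a minimal sketch of a synchronization limit based on a semaphore stored in a ConfigMap; the ConfigMap name, key, and workflow names are hypothetical. Field names in this area have changed between Argo Workflows releases (newer versions also accept a list form), so treat this as an assumption to check against the documentation for your version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical ConfigMap holding the concurrency limit.
  name: ml-semaphores
data:
  # At most two workflows that reference this key may run at once.
  training: "2"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: limited-training-
spec:
  entrypoint: main
  synchronization:
    semaphore:
      configMapKeyRef:
        name: ml-semaphores
        key: training
  templates:
  - name: main
    container:
      image: argoproj/argosay:v2
```

Workflows that cannot acquire the semaphore wait until a running workflow releases it.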

On the other hand, Argo Workflows allows you to specify dependencies in DAG templates using the depends field, which provides the ability to define dependent tasks, their statuses, as well as any complex Boolean logic. Here is an example of DAG targets:

...
      tasks:
      - name: A
        template: echo
        arguments:
          parameters: [{name: message, value: A}]
      - name: B
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: B}]
      - name: C
        depends: "A"
        template: echo
        arguments:
          parameters: [{name: message, value: C}]
      - name: D
        depends: "B && C"
        template: echo
        arguments:
          parameters: [{name: message, value: D}]
      - name: E
        depends: "C"
        template: echo
        arguments:
          parameters: [{name: message, value: E}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
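The depends field can also refer to task statuses. The sketch below is an illustrative example with hypothetical task names (train, evaluate, notify-failure): evaluation runs only when training succeeds, while a notification task runs only when training fails or errors.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: depends-status-
spec:
  entrypoint: main
  templates:
  - name: main
    dag:
      tasks:
      - name: train
        template: echo
        arguments:
          parameters: [{name: message, value: train}]
      - name: evaluate
        # Runs only if the train task succeeded.
        depends: "train.Succeeded"
        template: echo
        arguments:
          parameters: [{name: message, value: evaluate}]
      - name: notify-failure
        # Runs only if the train task failed or errored.
        depends: "train.Failed || train.Errored"
        template: echo
        arguments:
          parameters: [{name: message, value: training failed}]

  - name: echo
    inputs:
      parameters:
      - name: message
    container:
      image: alpine:3.7
      command: [echo, "{{inputs.parameters.message}}"]
```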

Error Handling and Fault Tolerance

Argo Workflows provides robust error handling mechanisms, including automated retries and exit handlers for managing partial failures. These mechanisms allow you to define the number of retries in case of a task failure, and in the case of a partial failure, they allow the workflow to continue executing the remaining tasks if necessary. This ensures the resiliency of ML pipelines orchestrated with Argo Workflows.

Here is a sample retry strategy that allows up to ten retries after a failure:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
  - name: retry-container
    retryStrategy:
      limit: "10"
    container:
      image: python:alpine3.6
      command: ["python", -c]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
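Retries can be combined with a backoff and an exit handler. The sketch below is one possible arrangement, not a recommended configuration: the template names, retry limit, and backoff values are illustrative assumptions, and the train container fails on purpose so that the handler has something to report.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-with-exit-handler-
spec:
  entrypoint: train
  # The report template runs when the workflow ends, whether it succeeded or failed.
  onExit: report
  templates:
  - name: train
    retryStrategy:
      limit: "3"
      # Retry on both failures and errors.
      retryPolicy: "Always"
      backoff:
        duration: "30s"     # wait before the first retry
        factor: "2"         # double the wait after each retry
        maxDuration: "10m"  # stop retrying after this much time
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # Always fails, for demonstration purposes.
      args: ["import sys; sys.exit(1)"]
  - name: report
    container:
      image: alpine:3.7
      command: [echo, "workflow finished with status {{workflow.status}}"]
```

In a real pipeline, the exit handler could send a notification or clean up temporary resources instead of printing the status.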

Advanced Features of Argo Workflows for MLOps

The points discussed so far cover the basic features expected of a workflow engine. However, MLOps teams often require more advanced features such as reproducibility, robust observability mechanisms, and resource management. These features make Argo Workflows an ideal tool in a variety of MLOps scenarios.

Artifact Management

Argo Workflows enables efficient artifact management thanks to its ability to generate and consume artifacts in specific workflow steps. For example, MLOps teams can create an output artifact on the fly in one step and use it as input for later steps. Moreover, Argo’s built-in integration with popular storage systems like S3, GCS, MinIO, and Azure Blob Storage makes it easy to set up artifact repositories that best fit your MLOps team’s needs.

These artifacts can be logs, files, ML models, or other data stored in the specified storage system. This encourages traceability, reduces the risk of data loss, and helps maintain a single source of truth for ML pipelines.

Other artifact management capabilities worth mentioning include data compression in tarballs to save resources, advanced artifact garbage collection so that unnecessary artifacts are automatically removed, parameterization of S3 keys, and the ability to create special service accounts or IAM annotations for S3 buckets.
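Several of these capabilities can be seen together in a short sketch. It assumes a default S3-compatible artifact repository is already configured, that the Argo Workflows version in use supports artifact garbage collection, and that the image, script, and key layout are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: model-artifact-
spec:
  entrypoint: train
  # Remove this workflow's artifacts when the workflow itself is deleted.
  artifactGC:
    strategy: OnWorkflowDeletion
  templates:
  - name: train
    container:
      # Hypothetical image and script.
      image: my-model-training-image:latest
      command: ["python", "train.py"]
    outputs:
      artifacts:
      - name: model
        path: /tmp/model.pkl
        # Store the file as-is instead of compressing it into a tarball.
        archive:
          none: {}
        # Parameterized key inside the configured bucket.
        s3:
          key: "models/{{workflow.name}}/model.pkl"
```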

Tracking and Visualizing Workflow Results and Metrics

As already mentioned, Argo Workflows has an intuitive and convenient UI. This graphical interface is useful for many tasks.

For example, in addition to deploying workflows via the Argo CLI or kubectl apply, you can use the UI to submit new workflows:

Sample workflow in the Argo UI

You can also get an overview of how many workflows are running, how many succeeded, how many failed, and how many had errors:

Workflows list in the Argo UI

The Argo UI can also track and visualize workflow results and metrics, which is one of its strongest advantages:

Visualizing an error in the Argo UI

The screenshot above shows a workflow that failed because it required artifact storage that was not properly configured. While the UI may seem simplistic for intricate ML tasks, Argo Workflows also integrates with advanced observability tools, as you’ll see in the next section.

Workflow Observability and Monitoring

In addition to monitoring workflow progress and status through the Argo UI, Argo offers Prometheus Metrics for the Workflow Controller and custom metrics that report on the status of one or more particular workflows. The Argo CLI lets you view pod or workflow logs.
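Custom metrics are declared in the workflow manifest itself. The sketch below shows one workflow-level gauge and one template-level counter; the metric names, labels, image, and script are hypothetical, and it assumes Prometheus is already scraping the Workflow Controller's metrics endpoint.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: training-metrics-
spec:
  entrypoint: train
  metrics:
    prometheus:
    # Workflow-level gauge: how long the whole workflow took.
    - name: training_workflow_duration_seconds
      help: "Duration of the training workflow"
      labels:
      - key: pipeline
        value: ml-pipeline
      gauge:
        value: "{{workflow.duration}}"
  templates:
  - name: train
    metrics:
      prometheus:
      # Template-level counter: one increment per execution, labeled by result.
      - name: training_step_result_total
        help: "Count of training step executions by result status"
        labels:
        - key: status
          value: "{{status}}"
        counter:
          value: "1"
    container:
      # Hypothetical image and script.
      image: my-model-training-image:latest
      command: ["python", "train.py"]
```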

Argo can also expand its monitoring and observability capabilities by integrating with tools like Grafana or Komodor. For example, integrating with Komodor would provide a powerful dashboard to monitor any pipeline running in the cluster, data tracking for all changes and Kubernetes events, and real-time monitoring of all Kubernetes resources, which is invaluable for optimizing ML tasks and troubleshooting the root cause of issues.

Workflow Scheduling and Resource Management

Natively, Argo Workflows can run ML workflows on a preset schedule using the CronWorkflow specification. You can think of this feature as Argo’s version of Kubernetes CronJob, with the difference that it can be used to trigger particular workflows when it is most convenient. This feature can be crucial in use cases like the one explained in the success story of BlackRock’s data science platform. BlackRock uses CronWorkflow to give investors access to the financial data created by its algorithms only at certain times, like at the end of a Wall Street trading day. That way, it can save on costs since running ML pipelines consumes a significant amount of resources.
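A CronWorkflow wraps an ordinary workflow spec in a schedule. The sketch below is illustrative only: the schedule, time zone, names, and image are assumptions rather than details of any real deployment, and newer Argo Workflows releases may prefer a list-style schedules field over the single schedule field used here.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: end-of-day-training
spec:
  # Illustrative schedule: 17:00 on weekdays, New York time.
  schedule: "0 17 * * 1-5"
  timezone: "America/New_York"
  # Skip a new run if the previous one is still in progress.
  concurrencyPolicy: "Forbid"
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  workflowSpec:
    entrypoint: train
    templates:
    - name: train
      container:
        # Hypothetical image and script.
        image: my-model-training-image:latest
        command: ["python", "train.py"]
```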

Speaking of resources, it’s worth mentioning that Argo Workflows automatically provides an estimate of the amount of resources each workflow uses. This feature, combined with step-level memoization, exit handlers, and lifecycle hooks, is useful for managing resources efficiently. To achieve a higher degree of cost and resource optimization, your MLOps team can integrate the Komodor dashboard, which would allow it to right-size resources, establish advanced optimization strategies, and more.
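As a sketch of how step-level memoization and explicit resource requests might look together, the example below caches the outputs of a preprocessing step for 24 hours, keyed by a dataset parameter. The cache ConfigMap name, parameter, image, and resource figures are all hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: memoized-preprocessing-
spec:
  entrypoint: preprocessing
  arguments:
    parameters:
    - name: dataset
      value: "2023-12"
  templates:
  - name: preprocessing
    inputs:
      parameters:
      - name: dataset
    # Reuse the recorded outputs if this key was computed in the last 24 hours.
    memoize:
      key: "preprocess-{{inputs.parameters.dataset}}"
      maxAge: "24h"
      cache:
        configMap:
          name: preprocessing-cache
    container:
      # Hypothetical image and script.
      image: my-data-preprocessing-image:latest
      command: ["python", "preprocess.py", "{{inputs.parameters.dataset}}"]
      # Explicit requests and limits let Kubernetes schedule the pod efficiently.
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
    outputs:
      parameters:
      - name: row-count
        valueFrom:
          path: /tmp/row-count.txt
```

When the cache holds a valid entry for the key, the step is skipped and its recorded outputs are returned, so no pod is started for it.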

Event-Driven Workflows and Triggers

By this point, you’ve learned about the benefits of running Argo Workflows on top of Kubernetes and the added value of easily integrating with other tools in the Kubernetes ecosystem. However, Argo Workflows is also part of the Argo Project ecosystem, which includes other Kubernetes-native tools such as Argo Events, Argo CD, and Argo Rollouts.

While an in-depth discussion of these tools is beyond the scope of this article, it’s worth highlighting that they can take the functionality of Argo Workflows to the next level. For instance, Argo Events is a complete event-driven workflow automation framework that allows MLOps teams to trigger any Kubernetes object, including Argo Workflows, based on events from over twenty event sources.
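To give a feel for what an event-driven trigger could look like, here is a minimal sketch of an Argo Events Sensor that submits a workflow when an event arrives. It assumes an EventSource named data-upload that emits a dataset-ready event already exists, and that the operate-workflow-sa service account is allowed to create workflows; all of these names, along with the image and script, are hypothetical.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Sensor
metadata:
  name: retrain-on-new-data
spec:
  template:
    # Hypothetical service account with permission to create Workflows.
    serviceAccountName: operate-workflow-sa
  dependencies:
  # Hypothetical event source and event name.
  - name: new-dataset
    eventSourceName: data-upload
    eventName: dataset-ready
  triggers:
  - template:
      name: submit-training-workflow
      k8s:
        operation: create
        source:
          resource:
            apiVersion: argoproj.io/v1alpha1
            kind: Workflow
            metadata:
              generateName: retrain-
            spec:
              entrypoint: train
              templates:
              - name: train
                container:
                  # Hypothetical image and script.
                  image: my-model-training-image:latest
                  command: ["python", "train.py"]
```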

Event-based workflows play an integral role in data-driven modeling. For example, a video from the ArgoCon virtual summit illustrates how ML event-based workflows can be used to create insights that allow banks to approve or reject credit applications from individuals based on their income and previous history.

Integrating Argo Workflows into the MLOps Lifecycle

Now that you’ve had a brief tour of Argo Workflows’ advanced capabilities, let’s explore how these features, along with others, can be seamlessly incorporated into the MLOps lifecycle to establish efficient ML pipelines.

When it comes to MLOps, continuous integration and continuous deployment (CI/CD) are pivotal for efficient model training, testing, and deployment. This is because CI/CD enables automation, thus speeding up the overall MLOps lifecycle.

By leveraging Argo Workflows and Argo CD, MLOps teams can define declarative and version-controlled ML pipelines that automatically trigger model training upon detecting code changes, run validation tests, and deploy the models in a production environment. This process ensures consistency and reduces the potential for human error. For an overview of the potential benefits of this approach, it’s worth checking out Major League Baseball’s success story in implementing Argo CD for its ML pipelines.
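One way to keep pipeline definitions declarative and version-controlled is to let Argo CD sync them from a Git repository. The sketch below assumes a hypothetical repository URL and path, and that Argo CD runs in the argocd namespace while workflows live in the argo namespace.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-pipelines
  namespace: argocd
spec:
  project: default
  source:
    # Hypothetical repository holding WorkflowTemplate and CronWorkflow manifests.
    repoURL: https://example.com/your-org/ml-pipelines.git
    targetRevision: main
    path: workflows
  destination:
    server: https://kubernetes.default.svc
    namespace: argo
  syncPolicy:
    automated:
      # Delete resources that were removed from Git.
      prune: true
      # Revert manual changes made directly in the cluster.
      selfHeal: true
```

With this in place, a merged change to a manifest in the repository is applied to the cluster without a manual kubectl step.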

As mentioned, Argo Workflows also effectively handles model versioning and management. It enables MLOps teams to track model changes, compare performance across versions, and roll back if necessary. Furthermore, when combined with Argo Rollouts, MLOps teams gain powerful capabilities such as blue-green, canary, and progressive delivery features, which allow them to experiment with new features with ease.
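As an illustration of progressive delivery for a model-serving deployment, the sketch below defines an Argo Rollouts canary strategy. The application name, image, port, and step values are hypothetical and would need tuning for a real service.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: model-server
spec:
  replicas: 4
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        # Hypothetical image serving the new model version.
        image: my-model-server-image:2.0.0
        ports:
        - containerPort: 8080
  strategy:
    canary:
      steps:
      # Send 20% of traffic to the new version, then wait ten minutes.
      - setWeight: 20
      - pause: {duration: 10m}
      # Increase to 50%, then wait for manual promotion.
      - setWeight: 50
      - pause: {}
```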

Argo Workflows also provides significant advantages for collaboration, eliminating data silos, and promoting teamwork. By defining workflow templates, teams can collaborate, save time, and align their efforts towards a common goal, all from a convenient UI. Additionally, Argo Workflows supports role-based access control, ensuring that only authorized members can access specific workflows. This feature enhances security and fosters a controlled development environment, enabling teams to work seamlessly while maintaining the integrity of their workflows.

Overall, Argo integration into the MLOps lifecycle streamlines the process of model training, testing, deployment, versioning, and management while fostering a collaborative and secure workflow development environment.

Best Practices for Leveraging Argo Workflows in MLOps

There is no doubt that Argo Workflows is a great fit for MLOps teams. Nevertheless, to ensure your team gets the most out of this amazing tool, it’s a good idea to follow best practices for leveraging Argo in MLOps:

Design modular and reusable workflows through templates
Implement workflow testing and validation using both Argo CD and Argo Workflows versioning capabilities
Implement a robust monitoring, tracking, and alerting system for workflow execution using tools like Grafana or Komodor
Encourage collaboration among the members of the MLOps team

In addition to these best practices, it’s worth noting the importance of properly documenting the workflows used for ML pipelines. This simple practice can save time and effort, as it makes it easier for team members to understand the objectives to be achieved with each workflow.

Throughout this article, you’ve seen some strong real-world best practice examples and success stories that demonstrate how Argo Workflows can help improve ML pipelines. However, there are many other examples, use cases, and experiences that can provide valuable insights that can help you decide if Argo Workflows is the ultimate solution for your MLOps team. For instance, you might be interested in how to build complex R forecast applications using Argo Workflows or how to implement ML as code using Kubeflow and Argo CD. Perhaps you want to learn more about cloud-native distributed ML pipelines at scale. For a full list of resources and use cases related to machine learning with Argo Workflows, check out the documentation.

Conclusion

All in all, Argo Workflows presents a compelling value proposition for enhancing MLOps capabilities. Its robust artifact management system offers traceability, mitigates data loss, and optimizes performance, thus providing a streamlined development environment. Integrating Argo Workflows into the MLOps lifecycle accelerates the automation of critical processes, including model training, testing, and deployment, while ensuring controlled development and collaboration.

As the MLOps field continues to evolve, it will likely lead to the rise of more sophisticated workflow automation tools with an increased focus on security, scalability, and cross-functional collaboration. Organizations should consider Argo Workflows as part of their MLOps strategy, leveraging its robustness and versatility to drive innovation.

About Komodor

Komodor reduces the cost and complexity of managing large-scale Kubernetes environments by automating day-to-day operations as well as health and cost optimization. The Komodor Platform proactively identifies risks that can impact application availability, reliability, and performance, while providing AI-assisted root-cause analysis, troubleshooting, and automated remediation playbooks. Fortune 500 companies in a wide range of industries, including financial services, retail, and more, rely on Komodor to empower developers, reduce TicketOps, and harness the full power of Kubernetes to accelerate their business. The company has received $67M in funding from Accel, Felicis, NFX Capital, OldSlip Group, Pitango First, Tiger Global, and Vine Ventures. For more information, visit the Komodor website, join the Komodor Kommunity, and follow us on LinkedIn and X.

To request a demo, visit the Contact Sales page.

