Running managed Ray on Azure Kubernetes Service
Thursday, November 13, 2025, 10:00, by InfoWorld
The move to building and training AI models at scale has had interesting second-order effects; one of the most important is improving how we run and manage massively distributed computing applications. AI training and inferencing both require huge distributed applications to build and refine the models that are at the heart of modern machine learning systems.
As Brendan Burns, Microsoft corporate vice president for Azure OSS and Cloud Native, notes in a recent Azure blog post, “Scaling from a laptop experiment to a production-grade workload still feels like reinventing the wheel.” Understanding how to break down and orchestrate these workloads takes time and requires significant work in configuring and deploying the resulting systems, even when building on top of existing platforms like Kubernetes. Design decisions made in the early days of cloud-native development focused on orchestrating large amounts of data rather than managing significant amounts of compute, both CPU and GPU. Those tools make it hard to orchestrate these new workloads, which need modern batch computing techniques.

Burns’ blog post announced a new Azure partnership intended to resolve these issues: working with Anyscale to offer a managed version of Ray, Anyscale’s open source, Python-based distributed computing tool, on Azure Kubernetes Service (AKS). You get the tools you need to build and run AI workloads without the work of managing the necessary infrastructure.

What is Ray?

Ray is a set of tools for building large-scale Python distributed applications, with a focus on AI. You can take existing code and quickly make it run on a distributed platform without changing how it works. It provides its own scheduling services for CPU and GPU operations and has a set of native libraries that help train and run AI models, working with familiar tools including PyTorch.
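To give a sense of how little Ray asks of existing code, here is a minimal sketch of its core task API. The function and values are illustrative, not taken from the article; decorating a plain Python function with @ray.remote turns each call into a task that Ray schedules across the cluster.

```python
import ray

# Connects to an existing cluster if one is configured;
# otherwise starts a local Ray instance for development.
ray.init()

# The decorator turns an ordinary function into a distributed task.
@ray.remote
def square(x: int) -> int:
    return x * x

# Each .remote() call returns a future immediately; Ray schedules
# the work across available CPUs. ray.get() gathers the results.
futures = [square.remote(i) for i in range(1000)]
print(sum(ray.get(futures)))
```

The same pattern applies whether the cluster is your laptop or a fleet of AKS nodes; the code doesn’t change, only where Ray schedules it.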
Anyscale’s enterprise managed edition of Ray adds an enhanced runtime that speeds up creating and running clusters and improves how resources are allocated for both development and production. This all sits on top of AKS, using its provisioning and scaling features to deliver the necessary distributed infrastructure. Thanks to Ray’s roots as an open source project founded by Anyscale’s team, the company can provide direction and draw on Ray’s own ecosystem to extend the platform.

The combination makes sense. Microsoft has been using AKS to support both its own and third-party AI development and operations, and as a result it has developed tools for working with GPU resources alongside the more familiar CPU options, both with its own KAITO (Kubernetes AI Toolchain Operator) and with Ray. The new service is currently in private preview, with added support options for organizations using Anyscale’s commercially licensed managed Ray, and Microsoft’s notes on using the open source version on Azure show how it envisions users working with the combined platform.

Using Ray on Azure

As with any other open source Kubernetes project running on AKS, Microsoft doesn’t provide support for Ray itself and redirects users to the Ray project. However, the build Microsoft offers has been compiled and tested by the AKS team, with signed binaries and containers. It’s used in conjunction with another open source tool, KubeRay, which provides a Kubernetes operator for Ray, letting you use familiar declarative techniques to configure and manage your installation.

You’re not limited to using Ray for AI; any large-scale Python distributed application can take advantage of its core libraries, which help parallelize your code as well as build and deploy a cluster. Ray also provides a set of libraries for AI development, each focused on a specific part of the model life cycle. For example, if you’re training a model in PyTorch, you’ll need to install Ray’s Train library alongside PyTorch.

Ray Train provides functions that prepare both your model and your data for data-parallel operations; these are used inside a training function that Ray’s TorchTrainer workers run, ensuring available GPU resources are used to speed up training. Other Ray tools let you quickly tune model hyperparameters, with functions that manage searches in just a few lines of code. Once a model has been trained and tuned, Ray supports running it in a scalable environment.
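To make that concrete, here is a minimal Ray Train sketch. The synthetic dataset, toy linear model, and worker count are illustrative stand-ins, while prepare_model, prepare_data_loader, TorchTrainer, and ScalingConfig are Ray’s own APIs.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model, prepare_data_loader

def train_loop_per_worker(config):
    # Synthetic data stands in for a real dataset.
    data = TensorDataset(torch.randn(1024, 8), torch.randn(1024, 1))
    # prepare_data_loader shards batches across the workers.
    loader = prepare_data_loader(DataLoader(data, batch_size=64))

    # prepare_model wraps the model for distributed, device-aware training.
    model = prepare_model(nn.Linear(8, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for _ in range(config["epochs"]):
        for features, labels in loader:
            optimizer.zero_grad()
            loss_fn(model(features), labels).backward()
            optimizer.step()

# Each worker runs the loop on its own shard of the data;
# set use_gpu=True to schedule the workers onto GPU nodes.
trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"epochs": 2},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
trainer.fit()
```

Scaling the same loop out to more nodes is a matter of raising num_workers; the training function itself doesn’t change.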
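Hyperparameter search follows the same pattern. A minimal Ray Tune sketch, where the objective function is a stand-in for a real training run and the learning-rate search space is an arbitrary example:

```python
from ray import tune

def objective(config):
    # Stand-in for a real training run: score each candidate
    # learning rate and return the metric Tune should optimize.
    score = (config["lr"] - 0.01) ** 2
    return {"score": score}

tuner = tune.Tuner(
    objective,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)
```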
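The scalable serving side is handled by Ray Serve. A minimal sketch, with a constant weight standing in for a trained model:

```python
from starlette.requests import Request
from ray import serve

@serve.deployment(num_replicas=2)
class Predictor:
    def __init__(self):
        # A trained model would be loaded here; a constant stands in.
        self.weight = 2.0

    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"prediction": self.weight * payload["x"]}

# serve.run deploys the replicas onto the Ray cluster and exposes
# them over HTTP (http://localhost:8000/ by default).
serve.run(Predictor.bind())
```

Each replica is an independent copy of the deployment, so Ray can spread inference load across the cluster.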
Start working with Ray on AKS

Getting started with Ray on AKS is simple enough: Microsoft provides a set of samples that automate deploying a basic Ray implementation from a single shell script running in the Azure CLI’s Bash environment. It’s quick, but it’s a good idea to walk through the process manually to understand how KubeRay works. You’ll need some prerequisites: the Azure CLI with the AKS Preview extension, Helm, and a Terraform (or OpenTofu) client.

Along with enabling KubeRay, take your existing Ray-based PyTorch code and use it to build a Docker container that can be deployed across your AKS nodes. When your job runs, the container will be loaded by KubeRay and deployed to worker nodes ready for training.

If you’re deploying by hand, start with a KubeRay job description. This is a YAML file that describes all the resources needed to train your model: the number of pods, along with their CPUs, GPUs, and memory; the container that hosts the PyTorch model; and the number of Ray workers that will run the job. You can start relatively small, with as few as eight virtual CPUs, adding more as job complexity increases or as you need the job to run more quickly.
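The exact fields depend on the KubeRay version you deploy. As a minimal sketch of such a RayJob resource, here it is built as a Python dict and written out as the YAML you would hand to kubectl; the container image, entrypoint, and resource sizes are illustrative assumptions, not values from the article.

```python
import yaml  # pip install pyyaml

# The image, entrypoint, and sizes below are illustrative assumptions.
container = {
    "name": "ray-node",
    "image": "myregistry.azurecr.io/ray-pytorch:latest",
    "resources": {"limits": {"cpu": "4", "memory": "16Gi"}},
}

ray_job = {
    "apiVersion": "ray.io/v1",
    "kind": "RayJob",
    "metadata": {"name": "pytorch-training"},
    "spec": {
        # The command KubeRay runs once the cluster is up.
        "entrypoint": "python train.py",
        "rayClusterSpec": {
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": {"spec": {"containers": [container]}},
            },
            "workerGroupSpecs": [{
                "groupName": "workers",
                "replicas": 2,
                "rayStartParams": {},
                "template": {"spec": {"containers": [container]}},
            }],
        },
    },
}

# Write the manifest for `kubectl apply -f rayjob.yaml`.
with open("rayjob.yaml", "w") as f:
    yaml.safe_dump(ray_job, f, sort_keys=False)
```

Applying the manifest stands up the Ray cluster, runs the entrypoint, and, depending on configuration, tears the cluster down when the job completes.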
You can track training using Ray’s log files and then evaluate results using its dashboard; this requires configuring access and setting up a suitable Kubernetes ingress controller to expose the dashboard. If you’re tuning an existing model, you can use a similar architecture, with Azure storage holding training and test data. Microsoft recommends blob storage, as it offers a good balance of performance and cost.

A platform for open source AI applications

PyTorch is one of the most popular tools for AI model development and tuning. With KubeRay and Ray on AKS, you can quickly work with models at scale, using code running on your laptop to train and tune models in the cloud. You can also take off-the-shelf, open source models from sites like Hugging Face and customize them for your specific use cases. This means you don’t have to invest in expensive GPUs or large data centers; instead, you can treat Azure and Ray as a batch-processing environment that runs only when you need it, keeping costs down and letting you quickly deploy custom models in your own network.

There’s a lot more to modern AI than chatbots, and by supporting Ray, AKS becomes a place to train and tune computer vision and other models, using image data stored in Azure blobs or time-series operational data in Fabric, Azure’s big-data service. Once trained, those models can be downloaded and used in your own applications. For example, you can use NPUs designed for computer vision to run custom-trained models that find flaws in products or spot safety violations and trigger warnings. Similar models working with log file data could spot fraud or request preemptive equipment maintenance.

By training and tuning on your own data and your own infrastructure, you get the model you need for a specific task that might otherwise be too expensive to implement. AKS and Ray provide an on-demand, cloud-native training environment, so you’re not only able to get that model into production quickly but also to keep it updated as you identify new source data that can make it more accurate or tuning parameters that will make it more responsive. You can concentrate on building applications and let Microsoft manage your platform, ensuring you have an up-to-date, secure Kubernetes and Ray environment ready for your code and your users.

https://www.infoworld.com/article/4088853/running-managed-ray-on-azure-kubernetes-service.html








