Marketplace

Ray Cluster

Updated April 25, 2024

Ray is an open-source distributed computing framework for deploying and orchestrating scalable environments for a variety of large-scale AI workloads. Ray Cluster provides robust infrastructure for training complex machine learning models and running reinforcement learning algorithms at scale. By leveraging Kubernetes orchestration capabilities, Ray Cluster simplifies deployment, allowing users to efficiently allocate resources and manage workloads across clusters. With support for distributed execution and parallelism, Ray Cluster optimizes resource utilization and accelerates model training, enabling faster iteration and experimentation in AI research and development.
You can deploy KubeRay, the Kubernetes operator officially supported by Ray, in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.

Warning

Before installing Ray Cluster, you must install NVIDIA® GPU Operator on the cluster. For details, see the deployment instructions below.

Deployment instructions

Before installing this product:

  1. Create a Kubernetes cluster and a node group with GPUs in it. The product supports the following VM platforms with GPUs:

    • NVIDIA® H100 NVLink with Intel Sapphire Rapids (Types A, B, C)
    • NVIDIA® V100 NVLink with Intel Cascade Lake
    • NVIDIA® V100 PCIe with Intel Broadwell

    Note

    It is strongly recommended that each node has at least 4 vCPUs and 8 GB of RAM.

  2. Install kubectl and configure it to work with the created cluster.
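Before proceeding, you can verify that the GPU Operator is running and that the cluster nodes expose GPU resources. A minimal check, assuming the GPU Operator was installed into the gpu-operator namespace (adjust the namespace if yours differs):

```shell
# Confirm that the NVIDIA GPU Operator pods are up
kubectl -n gpu-operator get pods

# Confirm that at least one node advertises the nvidia.com/gpu resource
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```

If the GPUS column shows `<none>` for every node, the GPU Operator has not yet finished provisioning the nodes.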

To install the product:

  1. Click the button in this card to go to the cluster selection form.

  2. Select your cluster and click Continue.

  3. Configure the application:

    • Namespace: Select a namespace or create one.

    • Application name: Enter an application name.

    • Head pod vCPUs: Enter the number of vCPUs that the head pod will use on its node. Default value: 4.

    • Head pod RAM: Enter the amount of RAM that the head pod will use on its node. Default value: 8Gi.

      Note

      It is strongly recommended that you keep the default values for the head pod so that it occupies an entire node. For more details, see the Ray documentation.

    • GPU worker platform: Select the same VM platform that you selected when creating the node group with GPUs.

    • Max. number of GPU workers: Enter the maximum number of worker pods with GPUs. Each worker will use one GPU.

    • Disable non-GPU workers: If this option is selected, only worker pods with GPUs will be created, and the following settings for non-GPU workers will be ignored.

    • Max. number of non-GPU workers: Enter the maximum number of worker pods without GPUs. Default value: 3.

    • Non-GPU worker vCPUs: Enter the number of vCPUs that each worker pod without GPUs will use on its node. Default value: 16.

    • Non-GPU worker RAM: Enter the amount of RAM that each worker pod without GPUs will use on its node. Default value: 30Gi.

      Note

      It is strongly recommended that you keep the default values for worker pods without GPUs so that each one occupies an entire node. For more details, see the Ray documentation.

    • Ray Docker image: Enter the URL of a custom Ray Docker image for the head and worker pods. The image must carry version 2.9.3 of Ray. By default, the official rayproject/ray-ml:2.9.3-gpu image hosted in the Nebius AI container registry is used. For more details about Ray Docker images, see the Ray documentation.

  4. Click Install.

  5. Wait for the application to change its status to Deployed.

  6. To check that the Ray cluster is working, access the Ray dashboard:

    1. Set up port forwarding:

      kubectl -n <namespace> port-forward \
        services/<application_name>-kuberay-head-svc 8265:8265
      
    2. Go to http://localhost:8265/ in your web browser.
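With the port forward active, you can also exercise the cluster through the Ray Jobs API, which listens on the same port as the dashboard. A quick smoke test, assuming the Ray CLI (version 2.9.3, matching the cluster image) is installed on your local machine:

```shell
# Submit a trivial job through the forwarded dashboard port
ray job submit --address http://localhost:8265 -- \
  python -c "import ray; ray.init(); print(ray.cluster_resources())"
```

The job's output lists the CPU and GPU resources the cluster currently exposes, which confirms that the head pod and worker pods are connected.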

Billing type
Free
Type
Kubernetes® Application
Category
Training
LLM apps framework
Publisher
Nebius
Use cases
  • Reinforcement learning research and development.
  • Distributed model training for deep learning applications.
  • High-performance computing for scientific simulations and data analysis.
  • Large-scale data processing and analytics.
  • Experimentation with parallel algorithms and distributed systems.
  • Development and deployment of AI-powered applications in production environments.
Technical support

Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.

Product composition

Helm chart (version 1.1.0):
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/ray-cluster/chart/ray-cluster

Docker images:
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/ray-cluster/rayproject/ray-ml (version 2.9.3-gpu)
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/ray-cluster/kuberay/operator (version v1.1.0)
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/ray-cluster/redis (version 7.2.4-debian-12-r9)
Terms
By using this product you agree to the Nebius AI Marketplace Terms of Service and the terms and conditions of the following software: Apache 2.0