Updated May 16, 2024

Horovod is an open-source distributed training framework optimized for cloud-based machine learning workloads. Its primary goal is to make it easy to take a single-GPU training script and scale it to train across many GPUs in parallel. With Horovod, you can use the same infrastructure to train models in TensorFlow, PyTorch, MXNet and other supported frameworks, which makes switching between them straightforward.

You can deploy Horovod in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.

Deployment instructions

Before installing this product:

  1. Create a Kubernetes cluster and a node group in it.

  2. Install kubectl and configure it to work with the created cluster.

  3. Install GPU drivers in the cluster if required.

  4. (optional) If you want Horovod to use custom SSH keys when connecting to workers:

    1. Create an SSH key pair.

    2. Encode the pair into Base64, for example:

      cat id_ed25519 | base64 -w 0
      cat id_ed25519.pub | base64 -w 0
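The optional key preparation above can be sketched as a few small shell helpers. This is a minimal sketch, not the product's own tooling: the function names, the `horovod` key comment, and the empty passphrase are assumptions for illustration, and `base64 -w 0` assumes GNU coreutils.

```shell
# Generate an Ed25519 key pair with an empty passphrase in directory $1.
# The "horovod" comment and the empty passphrase are illustrative assumptions.
make_keypair() {
  ssh-keygen -q -t ed25519 -N "" -C "horovod" -f "$1/id_ed25519"
}

# Print file $1 as a single unwrapped Base64 line (-w 0 disables line wrapping).
encode_key() {
  base64 -w 0 "$1"
}

# Succeed only if the Base64 text $2 decodes back to the exact bytes of file $1.
verify_encoding() {
  printf '%s' "$2" | base64 -d | cmp -s - "$1"
}

# Usage:
#   dir=$(mktemp -d)
#   make_keypair "$dir"
#   private_b64=$(encode_key "$dir/id_ed25519")
#   public_b64=$(encode_key "$dir/id_ed25519.pub")
#   verify_encoding "$dir/id_ed25519" "$private_b64" && echo "private key OK"
```

Checking the round trip before pasting the values into the form catches wrapped or truncated output early.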
      

To install the product:

  1. On the cluster page in the management console, go to the Marketplace tab, select the product, and click Install.

  2. Configure the application:

    • Namespace: Select a namespace or create a new one.

    • Application name: Enter an application name.

    • Private Ssh key, Public Ssh key: Keep the default SSH key pair that Horovod will use when connecting to workers, or paste a custom Base64-encoded key pair.

    • Horovod type: Select a Horovod configuration (for details, see Horovod Docker images on GitHub).

    • Number of worker replicas: Select the number of Horovod workers.

    • Driver argument: Enter the command that Horovod will distribute among the workers for execution. By default, Horovod runs the TensorFlow MNIST example:

      mpiexec -n 1 --hostfile /horovod/generated/hostfile \
        --mca orte_keep_fqdn_hostnames t --display-map \
        --allow-run-as-root --tag-output --timestamp-output \
        sh -c 'LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs \
          python /horovod/examples/tensorflow2/tensorflow2_mnist.py'
      
  3. Click Install.

  4. Wait for the application to change its status to Deployed.

  5. To verify that Horovod is working, check that its pods are running:

    kubectl get pods -n <namespace> | grep 'horovod'
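When there are many workers, it can help to summarize the pod list rather than read it line by line. The helper below is a minimal sketch: the function name is hypothetical, and it assumes the default `kubectl get pods --no-headers` column layout, in which STATUS is the third column.

```shell
# Read `kubectl get pods --no-headers` output on stdin and print
# "<running> <total>", where <running> counts rows whose STATUS is "Running".
pods_summary() {
  awk '{ total++ } $3 == "Running" { running++ } END { print running + 0, total + 0 }'
}

# Usage (replace <namespace> with the namespace chosen during installation):
#   kubectl get pods -n <namespace> --no-headers | grep 'horovod' | pods_summary
```

The two numbers should match and be greater than zero. A result of `0 0` means no pods matched at all, which usually points to a wrong namespace.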
    
Billing type: Free
Type: Kubernetes® Application
Category: Training
Publisher: Nebius
Use cases
  • Distributed training of complex neural networks.

  • Deep learning research, experimentation with various architectures, larger models, datasets, and hyperparameter tuning.

  • Deploying scalable deep learning pipelines in production environments.

Technical support

Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.

Product composition

Helm chart
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/chart/horovod, version 1.0.3

Docker images
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-cpu1708600471914575211710200936215581301561189130356, version 0.28.1

  • cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod1708600471914575211710200936215581301561189130356, version 0.28.1

  • cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-ray1708600471914575211710200936215581301561189130356, version 0.28.1

  • cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-nvtabular1708600471914575211710200936215581301561189130356, version 0.28.1

Terms
By using this product, you agree to the Nebius AI Marketplace Terms of Service and to the terms of the following software license: Apache 2.0.