Horovod is an open-source distributed training framework optimized for cloud-based machine learning workloads. The primary motivation for Horovod is to make it easy to take a single-GPU training script and scale it to train across many GPUs in parallel. With Horovod, you can use the same infrastructure to train models with any framework, making it easy to switch between TensorFlow, PyTorch, MXNet and other frameworks.
You can deploy Horovod in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.
Before installing this product:
- Create a Kubernetes cluster and a node group in it.
- Install kubectl and configure it to work with the created cluster.
- Install GPU drivers in the cluster if required.
- (Optional) If you want Horovod to use custom SSH keys when connecting to workers:
  - Encode the pair into Base64, for example:
    cat id_ed25519 | base64 -w 0
    cat id_ed25519.pub | base64 -w 0
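If you do not have a key pair yet, the optional step above can be sketched as the following shell snippet. This is a minimal sketch, not a procedure prescribed by the product: the `KEY_DIR` variable, the `horovod` key comment, the empty passphrase, and the `encode_key` helper are assumptions made for illustration. Only the `base64 -w 0` encoding comes from the instructions above.

```shell
#!/bin/sh
# Sketch: create an Ed25519 key pair and print both halves as single-line Base64.
# KEY_DIR, the key comment, and the empty passphrase are illustrative assumptions.
KEY_DIR="${KEY_DIR:-$(mktemp -d)}"
KEY_FILE="$KEY_DIR/id_ed25519"

# Encode a file as one unwrapped Base64 line (GNU coreutils: -w 0 disables wrapping).
encode_key() {
    base64 -w 0 "$1"
}

# Generate the pair only if it does not exist yet and ssh-keygen is available.
if [ ! -f "$KEY_FILE" ] && command -v ssh-keygen >/dev/null 2>&1; then
    ssh-keygen -q -t ed25519 -N "" -C "horovod" -f "$KEY_FILE"
fi

# Print the values to paste into the "Private Ssh key" and "Public Ssh key" fields.
if [ -f "$KEY_FILE" ] && [ -f "$KEY_FILE.pub" ]; then
    printf 'Private key (Base64): %s\n' "$(encode_key "$KEY_FILE")"
    printf 'Public key (Base64):  %s\n' "$(encode_key "$KEY_FILE.pub")"
fi
```

The snippet prints the encoded private key to the terminal, so run it only where the output is not logged or shared.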
To install the product:
- On the cluster page in the management console, go to the Marketplace tab, select the product, and click Install.
- Configure the application:
  - Namespace: Select a namespace or create a new one.
  - Application name: Enter an application name.
  - Private Ssh key, Public Ssh key: Keep the default SSH key pair that Horovod will use when connecting to workers, or paste a custom Base64-encoded key pair.
  - Horovod type: Select a Horovod configuration (for details, see Horovod Docker images on GitHub):
    - horovodCpu: Horovod built for CPU training.
    - horovodRay: Horovod built for GPU training from Ray, a framework for scaling AI applications. For the integration details, see the Horovod documentation. This type requires that the GPU drivers are installed.
    - horovod: Horovod built for GPU training. This type requires that the GPU drivers are installed.
  - Number of worker replicas: Select the number of Horovod workers.
  - Driver argument: Enter the command that Horovod will distribute between workers to execute. By default, Horovod will run the TensorFlow MNIST example:
    mpiexec -n 1 --hostfile /horovod/generated/hostfile \
      --mca orte_keep_fqdn_hostnames t --display-map \
      --allow-run-as-root --tag-output --timestamp-output \
      sh -c 'LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs \
      python /horovod/examples/tensorflow2/tensorflow2_mnist.py'
- Click Install.
- Wait for the application to change its status to Deployed.
- To check that Horovod is working, check that its pods are running:
  kubectl get pods -n <namespace> | grep 'horovod'
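The check above can also be wrapped in a small helper that summarizes pod status instead of listing every pod. This is a sketch under stated assumptions: the `horovod_pod_summary` function name is hypothetical, and it assumes the default `kubectl get pods` column order (NAME, READY, STATUS, and so on) and that the pod names contain `horovod`, as in the `grep` filter above.

```shell
# Print "<total> <not_running>" for pods whose name contains "horovod".
# Assumes the default column order of `kubectl get pods`: NAME READY STATUS ...
horovod_pod_summary() {
    namespace="$1"
    kubectl get pods -n "$namespace" --no-headers \
        | awk '$1 ~ /horovod/ { total++; if ($3 != "Running") bad++ }
               END { print total + 0, bad + 0 }'
}
```

Call it as `horovod_pod_summary <namespace>`. The first number is how many Horovod pods were found and the second is how many of them are not in the `Running` state. A pod that has already finished its work is reported by Kubernetes as `Completed`, so this sketch counts it as not running.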
- Distributed training of complex neural networks.
- Deep learning research, experimentation with various architectures, larger models, datasets, and hyperparameter tuning.
- Deploying scalable deep learning pipelines in production environments.
Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.
Helm chart | Version | Pull-command | Documentation |
---|---|---|---|
cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/chart/horovod | 1.0.3 | | Open |
Docker image | Version | Pull-command |
---|---|---|
cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-cpu1708600471914575211710200936215581301561189130356 | 0.28.1 | |
cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod1708600471914575211710200936215581301561189130356 | 0.28.1 | |
cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-ray1708600471914575211710200936215581301561189130356 | 0.28.1 | |
cr.nemax.nebius.cloud/yc-marketplace/nebius/horovod/horovod-nvtabular1708600471914575211710200936215581301561189130356 | 0.28.1 | |