NVIDIA Triton™ Inference Server

Updated April 4, 2024

NVIDIA Triton Inference Server enables teams to deploy any AI model from multiple deep learning and machine learning frameworks, including TensorRT, TensorFlow, PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton supports inference across cloud, data center, edge, and embedded devices on NVIDIA GPUs, x86 and Arm CPUs, or AWS Inferentia. It delivers optimized performance for many query types, including real-time, batched, ensemble, and audio/video streaming.

You can deploy Triton in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product.

Deployment instructions
  1. Create a model repository and configure access to it:

    a. Create an Object Storage bucket.
    b. Upload your models to the bucket, laid out as a Triton model repository (see the repository layout sketch after these instructions).
    c. Create a service account and assign it the storage.editor role.
    d. Create a static access key for the service account.
    e. Encode the key contents and ID into Base64, for example:

    echo -n '<key_ID>' | base64
    echo -n '<key>' | base64
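    # The Base64 output of these commands goes into the NEBIUS_KEY_ID and
    # NEBIUS_ACCESS_KEY fields at the configuration step below.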
    
  2. If you haven’t done it yet, create a Kubernetes cluster and a node group in it.

  3. Install kubectl and configure it to work with the created cluster.

  4. Click the button in this card to go to the cluster selection form.

  5. Select the folder and the cluster, and click Continue.

  6. Configure the application:

    • Model repository path: Path to the model repository bucket created earlier, in the following format: s3://https://storage.ai.nebius.cloud:443/<bucket-name>.
    • Service Type: Kubernetes service type that should be used to expose NVIDIA Triton Inference Server: ClusterIP, LoadBalancer, or NodePort.
    • NEBIUS_KEY_ID: ID of the static access key created earlier, encoded into Base64.
    • NEBIUS_ACCESS_KEY: Contents of the static access key created earlier, encoded into Base64.
  7. Click Install.

  8. Wait for the application to change its status to Deployed.

  9. To check that Triton is working, run the following command:

    curl -v <cluster_public_IP_address>:8000/v2/health/ready
    

    If Triton is ready, you will receive an HTTP 200 response. A sketch of a complete inference request follows these instructions.
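Triton expects the bucket contents to follow its model repository layout: one directory per model, containing a config.pbtxt file and numbered version subdirectories with the model file. Below is a minimal sketch of uploading such a layout with the AWS CLI against the Object Storage endpoint mentioned above; the use of the AWS CLI and the <bucket-name> and <model_name> placeholders are assumptions for illustration only.

    # Expected Triton model repository layout (names are placeholders):
    #   <model_name>/
    #     config.pbtxt
    #     1/
    #       model.onnx   # or model.plan, model.pt, ... depending on the backend
    #
    # Upload the model directory with the AWS CLI, assumed to be configured
    # with the static access key created in step 1:
    aws s3 cp --recursive ./<model_name> s3://<bucket-name>/<model_name> \
        --endpoint-url=https://storage.ai.nebius.cloud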

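Once the readiness check passes, you can send a test inference request over Triton's HTTP/REST (KServe v2) API. The sketch below is illustrative only: the external address depends on the selected Service Type, and the model name, input name, datatype, shape, and data are placeholders that must match your model's config.pbtxt.

    # Find the external address of the Triton service (LoadBalancer service type):
    kubectl get services

    # Send a minimal inference request; <model_name>, INPUT0, the datatype,
    # shape, and data must match your model configuration:
    curl -X POST <cluster_public_IP_address>:8000/v2/models/<model_name>/infer \
        -H 'Content-Type: application/json' \
        -d '{"inputs": [{"name": "INPUT0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]}'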
Billing type
Free
Type
Kubernetes® Application
Category
Inference
Publisher
Nebius
Use cases
  • Model deployment and execution of inference requests.
  • Benchmarking, troubleshooting and improving model performance.
  • Exploring advanced strategies to optimize model configuration.
Technical support

Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.

Product composition

Helm chart
  Version: 1.0.0
  Pull command: cr.nemax.nebius.cloud/yc-marketplace/nebius/triton/chart/triton-inference-server

Docker image
  Version: 23.11-py3
  Pull command: cr.nemax.nebius.cloud/yc-marketplace/nebius/triton/tritonserver1712241806744362822650461635909216151629665717382
Terms
By using this product you agree to the Nebius AI Marketplace Terms of Service and the terms and conditions of the following software: BSD