Updated June 20, 2024

vLLM is a fast and easy-to-use library for LLM inference and serving. Its PagedAttention algorithm manages attention keys and values efficiently, giving vLLM state-of-the-art serving throughput. The library is also flexible: it provides seamless integration with popular Hugging Face models, support for various decoding algorithms, an OpenAI-compatible API server, and more. With vLLM, you can deploy and scale LLM applications seamlessly, leveraging the flexibility and scalability of Kubernetes to meet the demands of modern AI workloads.

You can deploy vLLM (v0.4.2) in your Nebius AI Managed Service for Kubernetes clusters using this Marketplace product. It includes a Gradio-based user interface (v0.0.1) for vLLM; installing it is optional.

Warning

Before installing vLLM, you must install the NVIDIA® GPU Operator on the cluster. For details, see the deployment instructions below.

Deployment instructions

Before installing this product:

  1. Create a Kubernetes cluster and a node group with GPUs in it. The product supports the following VM platforms with GPUs:

    • NVIDIA® H100 NVLink with Intel Sapphire Rapids (Types A, B, C)
    • NVIDIA® V100 NVLink with Intel Cascade Lake
    • NVIDIA® V100 PCIe with Intel Broadwell
  2. Install kubectl and configure it to work with the created cluster.

  3. Install the NVIDIA® GPU Operator on the cluster; a minimal installation sketch follows this list.
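
The NVIDIA® GPU Operator is typically installed with Helm. The following is a minimal sketch based on the operator's public Helm chart; the release name, namespace, and chart values are assumptions here, so check the Nebius and NVIDIA documentation for the exact procedure:

  # Add the NVIDIA Helm repository and refresh the local chart index
  helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
  helm repo update

  # Install the GPU Operator into its own namespace
  helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator --create-namespace

  # Verify that the cluster nodes now report allocatable GPUs
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'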

To install the product:

  1. Click the button in this card to go to the cluster selection form.

  2. Select your cluster and click Continue.

  3. Configure the application:

    • Namespace: Select a namespace or create one.

    • Application name: Enter an application name.

    • GPU Platform: Select the same VM platform that you selected when creating the node group with GPUs.

    • Number of vLLM replicas: Enter the number of worker pods with GPUs. Each worker will use one GPU.

    • Model for vLLM engine: Enter the Hugging Face name of the model that you want to use. Make sure that you have access to the model. Default: h2oai/h2o-danube2-1.8b-chat (see the Hugging Face card).

    • Hugging Face access token: Paste your Hugging Face access token.

    • Enable Hugging Face cache: If this option is selected, model files downloaded from Hugging Face are cached to a persistent volume.

    • Hugging Face cache size (Gi): Enter the size, in gibibytes, of the persistent volume for Hugging Face caching. If caching is disabled, this parameter is ignored. Default: 80.

    • Enable Gradio UI: If this option is selected, you can use Gradio as the user interface for vLLM.

    • Gradio username: Create a username to access Gradio.

    • Gradio password: Create a password to access Gradio.

      Note

      To enable authentication in Gradio, set both a username and a password. If they are not set, Gradio is available without authentication.

  4. Click Install.

  5. Wait for the application to change its status to Deployed.

  6. To check that vLLM is working, test the OpenAI-compatible Chat Completions API served by vLLM (a further sanity check is sketched after these steps):

    1. Set up port forwarding:

      kubectl -n <namespace> port-forward \
        services/<application_name> 8000:8000
      
    2. Send a request to the API (the example uses the default h2oai/h2o-danube2-1.8b-chat model; replace the model name if you deployed a different model):

      curl -s http://localhost:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
          "model": "h2oai/h2o-danube2-1.8b-chat",
          "messages": [
            {"role": "user", "content": "Hello"}
          ],
          "temperature": 0.7
        }'
      
  7. If you enabled Gradio, to check that it is working, access it:

    1. Set up port forwarding:

      kubectl -n <namespace> port-forward \
        services/<application_name>-gradio-ui 7860:7860
      
    2. Go to http://localhost:7860/ in your web browser. If you set credentials when installing the product, use them to log in to the UI.
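
While the port forwarding from step 6 is active, you can also confirm which model the engine loaded. The sketch below uses the standard OpenAI-compatible model-listing endpoint and a streaming chat request, both of which vLLM's API server supports; adjust the model name to match your deployment:

  # List the models registered with the vLLM engine
  curl -s http://localhost:8000/v1/models

  # The Chat Completions endpoint can also stream tokens as they are generated
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "h2oai/h2o-danube2-1.8b-chat",
      "messages": [{"role": "user", "content": "Hello"}],
      "stream": true
    }'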

Billing type
Free
Type
Kubernetes® Application
Category
Inference
LLM apps framework
Publisher
Nebius
Use cases
  • Natural language understanding (NLU) applications requiring efficient inference with large language models.

  • Sentiment analysis and text classification tasks in various industries such as social media monitoring, customer feedback analysis, and content moderation.

  • Language translation services requiring high-throughput and low-latency inference for real-time translation.

  • Question answering systems for knowledge bases, customer support, and virtual assistants.

  • Recommendation systems for personalized content delivery in e-commerce, streaming platforms, and social networks.

  • Chatbots and conversational AI applications for interactive user experiences in customer service, healthcare, and education.

  • Text summarization and information retrieval for content curation, search engines, and document management systems.

  • Named entity recognition (NER) and entity linking for information extraction and knowledge graph construction in data analytics and research.

Technical support

Nebius AI does not provide technical support for the product. If you have any issues, please refer to the developer’s information resources.

Product composition
Helm chart:
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/vllm/chart/vllm (version 1.0.0)
Docker images:
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/vllm/vllm-openai:1715783148629718682479328181494055691782918773250 (version v0.4.2)
  • cr.nemax.nebius.cloud/yc-marketplace/nebius/vllm/gradio-ui:1715783148629718682479328181494055691782918773250 (version v0.0.1)
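
If you want to inspect an image locally, it can be pulled as a standard tagged image. This is a sketch that assumes the numeric identifier above is the image tag and that you are authenticated to cr.nemax.nebius.cloud if the registry requires it:

  # Pull the vLLM OpenAI-compatible server image by its marketplace tag
  docker pull cr.nemax.nebius.cloud/yc-marketplace/nebius/vllm/vllm-openai:1715783148629718682479328181494055691782918773250
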
Terms
By using this product, you agree to the Nebius AI Marketplace Terms of Service and the terms and conditions of the following software: Apache 2.0 (vLLM), Apache 2.0 (Gradio).