vLLM from 0 to 1: A Complete Deployment Tutorial (with FastAPI Integration)
This tutorial is aimed at newcomers to AI inference deployment. Starting from zero, it walks through installing vLLM, preparing model weights, launching the server, and wrapping the API with FastAPI. Every step is explained in detail; follow along and you will have a working service.
Contents
- [1. What is vLLM?](#what)
- [2. Environment setup](#env)
- [3. Installing vLLM](#install)
- [4. Downloading / storing model weights](#weights)
- [5. Launching the vLLM server](#launch)
- [6. Verifying the service](#verify)
- [7. Wrapping the API with FastAPI](#fastapi)
- [8. End-to-end usage examples](#example)
- [9. Troubleshooting FAQ](#faq)
1. What is vLLM? {#what}
vLLM is a high-performance LLM inference and serving framework open-sourced by UC Berkeley, and one of the most widely used LLM deployment options today. Its core features:
- PagedAttention: a novel KV Cache management mechanism that pushes VRAM utilization close to 100% and greatly improves concurrent throughput
- Continuous Batching: requests are dynamically batched together, keeping GPU utilization high
- OpenAI-compatible API: the /v1/chat/completions and /v1/completions endpoints work as soon as the server is up
- Broad model support: hundreds of models, including LLaMA, Qwen, Mistral, DeepSeek, Gemma, and Falcon
- Quantization support: GPTQ, AWQ, SqueezeLLM, and other formats work out of the box
User request
↓
FastAPI (your business layer)
↓
vLLM Server (inference engine + PagedAttention)
↓
GPU (model weights)
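To build intuition for PagedAttention, here is a deliberately simplified sketch (plain Python, not vLLM's actual implementation): each sequence's KV cache lives in fixed-size blocks, and a per-sequence block table maps logical token positions to physical blocks, so memory is allocated on demand instead of being reserved up front.

```python
BLOCK_SIZE = 16  # token slots per physical block (vLLM's real default is similar)

class PagedKVCache:
    """Toy block allocator: maps (sequence, token position) -> physical slot."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))  # pool of physical blocks
        self.block_tables = {}                      # seq_id -> [physical block ids]
        self.seq_lens = {}                          # seq_id -> token count

    def append_token(self, seq_id: str) -> tuple[int, int]:
        """Place one new token; allocate a fresh block only on a block boundary."""
        pos = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                   # first token, or current block full
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = pos + 1
        return table[pos // BLOCK_SIZE], pos % BLOCK_SIZE  # physical (block, offset)

    def free(self, seq_id: str):
        """Return a finished sequence's blocks to the pool for other requests."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=4)
for _ in range(20):                  # 20 tokens -> ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.block_tables["req-1"]))  # 2
cache.free("req-1")
print(len(cache.free_blocks))            # 4
```

Because freed blocks are immediately reusable by other sequences, almost no VRAM sits idle; this is the key difference from pre-allocating one max-length contiguous buffer per request.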
2. Environment setup {#env}
Hardware requirements
| Item | Minimum | Recommended |
|---|---|---|
| GPU | NVIDIA with 16 GB VRAM (e.g. RTX 4090, A100) | A100 80GB / H100 |
| VRAM | depends on model size | the more the better |
| CUDA | 11.8+ | 12.1+ |
| OS | Ubuntu 20.04+ | Ubuntu 22.04 |
| Python | 3.9+ | 3.10 / 3.11 |
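As a rough rule of thumb, the weights alone need about `num_params × bytes_per_dtype` of VRAM (fp16/bf16 = 2 bytes per parameter), plus headroom for the KV cache and activations. A back-of-the-envelope helper (an illustration, not an official vLLM tool):

```python
def weight_vram_gb(num_params_b: float, bytes_per_param: float = 2) -> float:
    """Approximate VRAM needed for the model weights alone, in GB.

    num_params_b: parameter count in billions (7 for a 7B model).
    bytes_per_param: 2 for fp16/bf16, 1 for 8-bit, 0.5 for 4-bit quantization.
    """
    return num_params_b * 1e9 * bytes_per_param / 1024**3

print(f"7B  fp16 ≈ {weight_vram_gb(7):.1f} GB")      # ≈ 13.0 GB
print(f"72B bf16 ≈ {weight_vram_gb(72):.1f} GB")     # ≈ 134.1 GB
print(f"7B  4bit ≈ {weight_vram_gb(7, 0.5):.1f} GB") # ≈ 3.3 GB
```

This explains why a 7B model in fp16 fits on a 24 GB card with room for KV cache, while a 72B model needs multiple 80 GB GPUs.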
Check the CUDA version
nvidia-smi
nvcc --version
Expected output looks like:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx Driver Version: 535.xx CUDA Version: 12.2 |
+-----------------------------------------------------------------------------+
| GPU Name Persistence-M | ... |
| 0 A100-SXM4 Off | ... 80GB |
+-----------------------------------------------------------------------------+
Create a virtual environment (strongly recommended)
# with conda (recommended)
conda create -n vllm python=3.10 -y
conda activate vllm
# or with venv
python3 -m venv ~/.venv/vllm
source ~/.venv/vllm/bin/activate
3. Installing vLLM {#install}
Option 1: pip (simplest, recommended)
# basic install
pip install vllm
# pin a CUDA build (CUDA 12.1 shown here)
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
Note: vLLM bundles PyTorch as a dependency, so the first install can take several minutes; be patient.
Option 2: from source (for the latest features)
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .  # editable / development install
Option 3: Docker (least hassle)
# pull the official image (bundles CUDA 12.1)
docker pull vllm/vllm-openai:latest
# run the container
docker run --runtime nvidia --gpus all \
-v /data/models:/models \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model /models/Qwen2.5-7B-Instruct \
--served-model-name qwen7b
Verify the installation
python -c "import vllm; print(vllm.__version__)"
If a version number such as 0.6.3 is printed, the install succeeded.
4. Downloading / storing model weights {#weights}
vLLM loads models in Hugging Face format, so download the weights to local disk first.
Suggested directory layout
/data/models/
├── Qwen2.5-7B-Instruct/
│ ├── config.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── special_tokens_map.json
│ ├── generation_config.json
│ ├── model.safetensors # single file for small models
│ └── model-00001-of-00004.safetensors # shards for large models
├── Llama-3.1-8B-Instruct/
│ └── ...
└── DeepSeek-R1-Distill-Qwen-7B/
└── ...
Download from Hugging Face
pip install huggingface_hub
# download a model (Qwen2.5-7B-Instruct as the example)
huggingface-cli download \
Qwen/Qwen2.5-7B-Instruct \
--local-dir /data/models/Qwen2.5-7B-Instruct \
--local-dir-use-symlinks False
Tip: downloads from mainland China can be slow; point at a mirror to speed them up:
export HF_ENDPOINT=https://hf-mirror.com
Download from ModelScope (recommended in mainland China)
pip install modelscope
python3 - << 'EOF'
from modelscope import snapshot_download
snapshot_download(
'Qwen/Qwen2.5-7B-Instruct',
cache_dir='/data/models'
)
EOF
Verify the weights are complete
ls -lh /data/models/Qwen2.5-7B-Instruct/
# confirm config.json, tokenizer.json, and the .safetensors files are present
python3 -c "
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained('/data/models/Qwen2.5-7B-Instruct')
print('model architecture:', cfg.model_type)
print('hidden size:', cfg.hidden_size)
"
5. Launching the vLLM server {#launch}
Basic launch command
python -m vllm.entrypoints.openai.api_server \
--model /data/models/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000
Or use the shorter vllm serve command (vLLM >= 0.4.1):
vllm serve /data/models/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000
Common parameters
| Parameter | Description | Example |
|---|---|---|
| `--model` | model path (required) | `/data/models/Qwen2.5-7B` |
| `--host` | bind address | `0.0.0.0` |
| `--port` | port, default 8000 | `8000` |
| `--tensor-parallel-size` | tensor parallelism degree (multi-GPU) | `--tensor-parallel-size 4` |
| `--pipeline-parallel-size` | pipeline parallelism degree | `--pipeline-parallel-size 2` |
| `--gpu-memory-utilization` | fraction of GPU memory to use (0~1) | `--gpu-memory-utilization 0.85` |
| `--max-model-len` | maximum context length | `--max-model-len 8192` |
| `--served-model-name` | model name exposed via the API | `--served-model-name qwen` |
| `--dtype` | precision | `--dtype float16` or `bfloat16` |
| `--quantization` | quantization format | `--quantization awq` |
| `--max-num-seqs` | maximum concurrent sequences | `--max-num-seqs 256` |
| `--trust-remote-code` | allow the model's custom code to run | (flag, no value) |
Multi-GPU launch example (4 GPUs)
vllm serve /data/models/Qwen2.5-72B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.85 \
--dtype bfloat16 \
--max-model-len 32768
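With tensor parallelism the weights are sharded across GPUs, so each GPU holds roughly total weight size divided by the tensor-parallel degree, plus headroom for KV cache and activations. A quick sanity check (illustrative arithmetic only):

```python
def per_gpu_weight_gb(num_params_b: float, tp_size: int,
                      bytes_per_param: float = 2) -> float:
    """Approximate weight memory per GPU under tensor parallelism, in GB."""
    return num_params_b * 1e9 * bytes_per_param / tp_size / 1024**3

# 72B in bf16 across 4 GPUs: ~33.5 GB of weights per GPU,
# comfortably within an 80 GB card with room left for the KV cache.
print(f"{per_gpu_weight_gb(72, 4):.1f} GB per GPU")
```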
Loading a quantized model (AWQ)
vllm serve /data/models/Qwen2.5-7B-Instruct-AWQ \
--host 0.0.0.0 \
--port 8000 \
--quantization awq \
--dtype float16
Running in the background (recommended for production)
nohup vllm serve /data/models/Qwen2.5-7B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.85 \
> /var/log/vllm.log 2>&1 &
echo "vLLM PID: $!"
# follow the log
tail -f /var/log/vllm.log
Signs of a successful start
The service is ready once the log shows:
INFO: Started server process [xxxxx]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
along with:
INFO 12-01 ... vllm_engine.py: ... Model loaded in X.XXs
6. Verifying the service {#verify}
Method 1: curl the Chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, introduce yourself"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'
A healthy response looks like:
{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "Qwen2.5-7B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! I am Qwen..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 45,
    "total_tokens": 57
  }
}
Method 2: list the models
curl http://localhost:8000/v1/models
Response:
{
  "object": "list",
  "data": [{
    "id": "Qwen2.5-7B-Instruct",
    "object": "model",
    "created": 1701234567,
    "owned_by": "vllm"
  }]
}
Method 3: the Python openai SDK
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM does not check the API key by default
)
response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100
)
print(response.choices[0].message.content)
7. Wrapping the API with FastAPI {#fastapi}
In production it pays to put a FastAPI layer in front of vLLM, for:
- API key authentication
- parameter validation and default management
- rate limiting, logging, monitoring
- routing across multiple models
Project layout
vllm-api/
├── main.py              # FastAPI entry point
├── config.py            # configuration (model name, vLLM address, ...)
├── models.py            # request/response schemas
├── requirements.txt
└── .env
requirements.txt
fastapi
uvicorn[standard]
httpx
python-dotenv
pydantic
config.py
import os
from dotenv import load_dotenv

load_dotenv()

VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://localhost:8000")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "Qwen2.5-7B-Instruct")
API_KEY = os.getenv("API_KEY", "")  # leave empty to disable auth
MAX_TOKENS_DEFAULT = int(os.getenv("MAX_TOKENS_DEFAULT", "512"))
models.py
from pydantic import BaseModel, Field
from typing import List, Optional

class Message(BaseModel):
    role: str  # "system" | "user" | "assistant"
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    model: Optional[str] = None
    max_tokens: Optional[int] = Field(default=512, ge=1, le=8192)
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
    top_p: Optional[float] = Field(default=1.0, ge=0.0, le=1.0)
    stream: Optional[bool] = False
    stop: Optional[List[str]] = None

class ChatResponse(BaseModel):
    content: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    finish_reason: str
main.py (full version)
from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx

from config import VLLM_BASE_URL, DEFAULT_MODEL, API_KEY, MAX_TOKENS_DEFAULT
from models import ChatRequest, ChatResponse

app = FastAPI(title="vLLM API Gateway", version="1.0.0")

# CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# ── Optional auth ─────────────────────────────────────
def verify_api_key(authorization: str = Header(default="")):
    if not API_KEY:
        return
    if authorization != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API Key")

# ── Health check ──────────────────────────────────────
@app.get("/health")
async def health():
    """Check whether vLLM is reachable"""
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(f"{VLLM_BASE_URL}/health")
        return {"status": "ok", "vllm": resp.status_code == 200}
    except Exception as e:
        return {"status": "error", "detail": str(e)}

# ── List available models ─────────────────────────────
@app.get("/models")
async def list_models():
    """Pass through vLLM's model list"""
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"{VLLM_BASE_URL}/v1/models")
    return resp.json()

# ── Non-streaming chat ────────────────────────────────
@app.post("/chat", response_model=ChatResponse,
          dependencies=[Depends(verify_api_key)])
async def chat(req: ChatRequest):
    """
    Non-streaming chat endpoint
    POST /chat
    {
      "messages": [{"role": "user", "content": "Hello"}],
      "max_tokens": 512,
      "temperature": 0.7
    }
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],  # use m.dict() on pydantic v1
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature,
        "top_p": req.top_p,
        "stream": False,
    }
    if req.stop:
        payload["stop"] = req.stop
    async with httpx.AsyncClient(timeout=120) as client:
        try:
            resp = await client.post(
                f"{VLLM_BASE_URL}/v1/chat/completions",
                json=payload
            )
            resp.raise_for_status()
        except httpx.HTTPStatusError as e:
            raise HTTPException(
                status_code=e.response.status_code,
                detail=f"vLLM error: {e.response.text}"
            )
        except httpx.RequestError as e:
            raise HTTPException(status_code=502, detail=f"vLLM unreachable: {str(e)}")
    data = resp.json()
    choice = data["choices"][0]
    usage = data.get("usage", {})
    return ChatResponse(
        content=choice["message"]["content"],
        model=data.get("model", DEFAULT_MODEL),
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
        finish_reason=choice.get("finish_reason", "stop"),
    )

# ── Streaming chat ────────────────────────────────────
@app.post("/chat/stream", dependencies=[Depends(verify_api_key)])
async def chat_stream(req: ChatRequest):
    """
    Streaming chat endpoint (SSE)
    POST /chat/stream
    Consume it on the frontend via EventSource or a fetch stream
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],  # use m.dict() on pydantic v1
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature,
        "top_p": req.top_p,
        "stream": True,
    }
    if req.stop:
        payload["stop"] = req.stop

    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                "POST",
                f"{VLLM_BASE_URL}/v1/chat/completions",
                json=payload
            ) as resp:
                async for line in resp.aiter_lines():
                    if line.startswith("data: "):
                        yield f"{line}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")

# ── Completions endpoint ──────────────────────────────
@app.post("/completions", dependencies=[Depends(verify_api_key)])
async def completions(
    prompt: str,
    model: str = None,
    max_tokens: int = 512,
    temperature: float = 0.7,
):
    """
    Text completion endpoint (non-chat use cases)
    POST /completions?prompt=xxx
    """
    payload = {
        "model": model or DEFAULT_MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            f"{VLLM_BASE_URL}/v1/completions",
            json=payload
        )
        resp.raise_for_status()
    data = resp.json()
    return {
        "text": data["choices"][0]["text"],
        "usage": data.get("usage", {}),
    }

if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8001, reload=True)
.env example
VLLM_BASE_URL=http://localhost:8000
DEFAULT_MODEL=Qwen2.5-7B-Instruct
API_KEY=your_secret_key_here   # leave empty to disable auth
MAX_TOKENS_DEFAULT=512
8. End-to-end usage examples {#example}
Start the FastAPI service
cd vllm-api
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8001 --reload
curl examples
# non-streaming chat
curl http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Write a quicksort in Python"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'
# streaming chat
curl http://localhost:8001/chat/stream \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'
# list models
curl http://localhost:8001/models
# health check
curl http://localhost:8001/health
Python examples
import asyncio
import json

import httpx
import requests

BASE_URL = "http://localhost:8001"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_secret_key_here"
}

# ── Non-streaming ─────────────────────────────────────
def chat(messages, max_tokens=512, temperature=0.7):
    resp = requests.post(
        f"{BASE_URL}/chat",
        headers=HEADERS,
        json={
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        timeout=120
    )
    resp.raise_for_status()
    return resp.json()["content"]

# usage
result = chat([
    {"role": "system", "content": "You are a coding assistant"},
    {"role": "user", "content": "Explain how PagedAttention works"},
])
print(result)

# ── Streaming (async) ─────────────────────────────────
async def chat_stream(messages):
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST",
            f"{BASE_URL}/chat/stream",
            headers=HEADERS,
            json={"messages": messages, "stream": True}
        ) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: ") and "[DONE]" not in line:
                    try:
                        data = json.loads(line[6:])
                        delta = data["choices"][0]["delta"].get("content", "")
                        if delta:
                            print(delta, end="", flush=True)
                    except Exception:
                        pass
    print()  # trailing newline

asyncio.run(chat_stream([
    {"role": "user", "content": "Explain Transformers in simple terms"}
]))
FastAPI auto-generated docs
Once the service is up, visit:
- Swagger UI: http://localhost:8001/docs
- ReDoc: http://localhost:8001/redoc
9. Troubleshooting FAQ {#faq}
Q: torch version conflict during installation
ERROR: pip's dependency resolver does not currently consider all the packages
Fix: uninstall the old torch first, then install vLLM:
pip uninstall torch torchvision torchaudio -y
pip install vllm
Q: CUDA out of memory at startup
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix:
# option 1: lower the memory utilization target
vllm serve /data/models/xxx \
  --gpu-memory-utilization 0.7   # default is 0.9; turn it down
# option 2: cap the context length
vllm serve /data/models/xxx \
  --max-model-len 4096
# option 3: use a quantized variant
vllm serve /data/models/xxx-AWQ \
  --quantization awq
Q: Error: The model's max seq len (xxx) is larger than the maximum number of tokens that can be stored in KV cache
Fix: there is not enough VRAM to support the current max_model_len; lower it:
vllm serve /data/models/xxx \
  --max-model-len 8192   # tune to your VRAM
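The KV cache footprint behind this error can be estimated from the model config: per token it is roughly `2 (K and V) × num_layers × num_kv_heads × head_dim × bytes_per_dtype`. A small illustrative helper (the example numbers below assume Qwen2.5-7B-style config values: 28 layers, 4 KV heads with GQA, head_dim 128, fp16):

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, dtype_bytes: int = 2) -> float:
    """Approximate KV-cache size for ONE sequence of seq_len tokens, in GB.

    Per token, each layer stores K and V: 2 * num_kv_heads * head_dim values.
    """
    total_bytes = 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes
    return total_bytes / 1024**3

# assumed Qwen2.5-7B-style config (check your model's config.json)
print(f"{kv_cache_gb(28, 4, 128, 32768):.2f} GB per 32k-token sequence")
```

Whatever VRAM is left after loading the weights must hold this cache for every concurrent sequence, which is why lowering --max-model-len (or --max-num-seqs) resolves the error.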
Q: NCCL error with multi-GPU launch
# pin the network interface used for communication
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1
# or disable peer-to-peer transfers
export NCCL_P2P_DISABLE=1
Q: 404 caused by a model-name mismatch
By default vLLM uses the last path component of --model as the model name. Override it with --served-model-name:
vllm serve /data/models/Qwen2.5-7B-Instruct \
  --served-model-name qwen7b
Then pass "model": "qwen7b" in requests.
Q: Request timeouts
Loading the model for the first time takes several minutes; raise the timeout on the FastAPI side:
async with httpx.AsyncClient(timeout=300) as client:
...
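For transient failures (the server still warming up, brief overloads), a simple client-side retry with exponential backoff also helps. A generic sketch (not part of vLLM or the gateway above):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.5):
    """Call fn(); on exception retry with exponentially growing delays."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise                      # out of retries: re-raise the last error
            time.sleep(base_delay * 2 ** i)

# usage: flaky_call is a stand-in for e.g. a requests.post to the gateway
calls = {"n": 0}
def flaky_call():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("not ready yet")
    return "ok"

print(with_retries(flaky_call, base_delay=0.1))  # "ok" on the third attempt
```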
Q: How do I enable API key verification?
vLLM supports an --api-key flag natively:
vllm serve /data/models/xxx \
  --api-key my_secret_key
Every request must then carry Authorization: Bearer my_secret_key.
Q: VRAM usage is too high; how do I optimize?
# 1. enable prefix caching (requests sharing a prefix reuse the KV cache)
vllm serve /data/models/xxx \
  --enable-prefix-caching
# 2. cap concurrency
vllm serve /data/models/xxx \
  --max-num-seqs 64
# 3. use bfloat16 (numerically more stable than float16)
vllm serve /data/models/xxx \
  --dtype bfloat16
Summary
At this point you have completed:
| Step | What was covered |
|---|---|
| ✅ Install vLLM | pip / Docker installation |
| ✅ Prepare weights | download from HF / ModelScope, organized layout |
| ✅ Launch the server | single-GPU / multi-GPU / quantized commands, background run |
| ✅ Verify the API | curl / openai SDK tests |
| ✅ FastAPI wrapper | non-streaming + streaming + completions endpoints, with auth |
| ✅ Usage examples | complete curl / Python examples |
Questions welcome in the comments!