vLLM from Zero to One: A Complete Deployment Tutorial (with FastAPI Integration)

This tutorial is aimed at newcomers to AI inference deployment. Starting from scratch, it walks through installing vLLM, setting up model weights, launching the server, and wrapping it all behind a FastAPI interface. Every step is spelled out; follow along and it will run.


Table of Contents

  1. What Is vLLM?
  2. Environment Setup
  3. Install vLLM
  4. Download / Store Model Weights
  5. Start the vLLM Inference Server
  6. Verify the Service
  7. Wrap It with FastAPI
  8. End-to-End Examples
  9. Troubleshooting

1. What Is vLLM? {#what}

vLLM is a high-performance LLM inference and serving framework open-sourced by UC Berkeley, and one of the most widely used LLM deployment options today. Key features:

  • PagedAttention: a KV-cache management scheme inspired by OS paging that pushes memory utilization close to 100% and greatly increases concurrent throughput
  • Continuous batching: requests are dynamically batched together, keeping GPU utilization high
  • OpenAI-compatible API: once started, the server directly exposes the /v1/chat/completions and /v1/completions endpoints
  • Broad model support: hundreds of models, including LLaMA, Qwen, Mistral, DeepSeek, Gemma, and Falcon
  • Quantization support: GPTQ, AWQ, SqueezeLLM, and other formats work out of the box
The request path looks like this:

User request
   ↓
FastAPI (your business layer)
   ↓
vLLM server (inference engine + PagedAttention)
   ↓
GPU (model weights)
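To make PagedAttention concrete, here is a toy sketch of the bookkeeping idea (illustrative only, not vLLM's implementation; the class and method names are invented for this example): the KV cache is carved into fixed-size blocks, each sequence holds a block table, and blocks are allocated on demand and returned as soon as the sequence finishes.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size is also 16)

class PagedKVAllocator:
    """Toy bookkeeping in the spirit of PagedAttention (not vLLM's real code).
    Each sequence owns a block table: a list of physical block ids drawn
    on demand from a shared free pool."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # pool of physical blocks
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append(self, seq_id: str, n_tokens: int) -> None:
        """Reserve just enough blocks for n_tokens more tokens of seq_id."""
        self.tables.setdefault(seq_id, [])
        self.lengths[seq_id] = self.lengths.get(seq_id, 0) + n_tokens
        needed = -(-self.lengths[seq_id] // BLOCK_SIZE)  # ceil division
        while len(self.tables[seq_id]) < needed:
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables[seq_id].append(self.free.pop())

    def release(self, seq_id: str) -> None:
        """Sequence finished: its blocks go straight back to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

# A 20-token prompt occupies only 2 blocks (32 token slots), not a
# worst-case max_model_len reservation — that is where the memory win comes from.
alloc = PagedKVAllocator(num_blocks=8)
alloc.append("req-1", 20)
print(len(alloc.tables["req-1"]))  # 2
```

Real PagedAttention additionally shares blocks across sequences and computes attention over the scattered blocks on the GPU; the allocation pattern above is the part that explains the near-100% memory utilization.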

2. Environment Setup {#env}

Hardware requirements

Item     Minimum                                      Recommended
GPU      NVIDIA, ≥16 GB VRAM (e.g. RTX 4090, A100)    A100 80GB / H100
VRAM     depends on the model size                    the more the better
CUDA     11.8+                                        12.1+
OS       Ubuntu 20.04+                                Ubuntu 22.04
Python   3.9+                                         3.10 / 3.11

Check the CUDA version

nvidia-smi
nvcc --version

Sample output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx     Driver Version: 535.xx    CUDA Version: 12.2         |
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M | ...                                        |
|  0   A100-SXM4   Off           | ...   80GB                                |
+-----------------------------------------------------------------------------+

Create a virtual environment (strongly recommended)

# Using conda (recommended)
conda create -n vllm python=3.10 -y
conda activate vllm

# Or using venv
python3 -m venv ~/.venv/vllm
source ~/.venv/vllm/bin/activate

3. Install vLLM {#install}

Option 1: pip install (simplest, recommended)

# Basic install
pip install vllm

# Pin the CUDA build (CUDA 12.1 shown here)
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121

Note: vLLM pulls in PyTorch as a dependency, so the first install can take a few minutes; be patient.

Option 2: install from source (latest features)

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # editable (development) install

Option 3: Docker (least hassle)

# Pull the official image (ships with CUDA 12.1)
docker pull vllm/vllm-openai:latest

# Run the container
docker run --runtime nvidia --gpus all \
    -v /data/models:/models \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model /models/Qwen2.5-7B-Instruct \
    --served-model-name qwen7b

Verify the installation

python -c "import vllm; print(vllm.__version__)"

If a version number is printed (e.g. 0.6.3), the install succeeded.


4. Download / Store Model Weights {#weights}

vLLM consumes models in the Hugging Face format, so download the weights to local disk first.

Suggested directory layout

/data/models/
├── Qwen2.5-7B-Instruct/
│   ├── config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── generation_config.json
│   ├── model.safetensors              # single file (small models)
│   └── model-00001-of-00004.safetensors  # shards (large models)
├── Llama-3.1-8B-Instruct/
│   └── ...
└── DeepSeek-R1-Distill-Qwen-7B/
    └── ...

Download from Hugging Face

pip install huggingface_hub

# Download a model (Qwen2.5-7B-Instruct as an example)
huggingface-cli download \
    Qwen/Qwen2.5-7B-Instruct \
    --local-dir /data/models/Qwen2.5-7B-Instruct \
    --local-dir-use-symlinks False

Tip: downloads from Hugging Face are slow in mainland China; point at a mirror to speed them up:

export HF_ENDPOINT=https://hf-mirror.com

Download from ModelScope (recommended in mainland China)

pip install modelscope

python3 - << 'EOF'
from modelscope import snapshot_download
snapshot_download(
    'Qwen/Qwen2.5-7B-Instruct',
    cache_dir='/data/models'
)
EOF

Verify the weights

ls -lh /data/models/Qwen2.5-7B-Instruct/
# Confirm config.json, tokenizer.json, and the .safetensors files are present
python3 -c "
from transformers import AutoConfig
cfg = AutoConfig.from_pretrained('/data/models/Qwen2.5-7B-Instruct')
print('model type:', cfg.model_type)
print('hidden size:', cfg.hidden_size)
"
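Interrupted downloads most often lose a shard. For sharded checkpoints the index file enumerates every shard, so a short script can cross-check the directory (a sketch assuming the standard model.safetensors.index.json layout used by Hugging Face checkpoints; the helper name is my own):

```python
import json
import os

def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames referenced by the index but absent on disk."""
    index_path = os.path.join(model_dir, "model.safetensors.index.json")
    if not os.path.exists(index_path):
        return []  # single-file checkpoint: nothing to cross-check
    with open(index_path) as f:
        index = json.load(f)
    # weight_map maps each tensor name to the shard file that holds it
    shards = set(index["weight_map"].values())
    return sorted(s for s in shards
                  if not os.path.exists(os.path.join(model_dir, s)))

# Usage:
# gaps = missing_shards("/data/models/Qwen2.5-7B-Instruct")
# print("missing shards:", gaps or "none")
```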

5. Start the vLLM Inference Server {#launch}

Basic launch command

python -m vllm.entrypoints.openai.api_server \
    --model /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Or use the more concise vllm serve command (vLLM >= 0.4.1):

vllm serve /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000

Common flags

Flag                        Meaning                                  Example
--model                     model path (required)                    /data/models/Qwen2.5-7B
--host                      bind address                             0.0.0.0
--port                      bind port (default 8000)                 8000
--tensor-parallel-size      tensor parallelism degree (multi-GPU)    --tensor-parallel-size 4
--pipeline-parallel-size    pipeline parallelism degree              --pipeline-parallel-size 2
--gpu-memory-utilization    fraction of GPU memory to use (0-1)      --gpu-memory-utilization 0.85
--max-model-len             maximum context length                   --max-model-len 8192
--served-model-name         model name exposed by the API            --served-model-name qwen
--dtype                     precision                                --dtype float16 / bfloat16
--quantization              quantization format                      --quantization awq
--max-num-seqs              maximum concurrent sequences             --max-num-seqs 256
--trust-remote-code         allow the model's custom code to run     (boolean flag, no value)

Multi-GPU launch example (4 GPUs)

vllm serve /data/models/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --dtype bfloat16 \
    --max-model-len 32768

Loading a quantized model (AWQ)

vllm serve /data/models/Qwen2.5-7B-Instruct-AWQ \
    --host 0.0.0.0 \
    --port 8000 \
    --quantization awq \
    --dtype float16

Run in the background (recommended for production)

nohup vllm serve /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization 0.85 \
    > /var/log/vllm.log 2>&1 &

echo "vLLM PID: $!"

# Tail the log in real time
tail -f /var/log/vllm.log

Signs of a successful start

The service is ready once the log shows:

INFO:     Started server process [xxxxx]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

along with a model-loading line such as:

INFO 12-01 ... Model loaded in X.XXs
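Because loading can take minutes, scripts that depend on the server should poll for readiness rather than sleep a fixed time. A minimal poller with an injectable probe (the /health endpoint is the one vLLM's OpenAI server exposes; the helper names here are illustrative):

```python
import time
import urllib.request

def wait_for_ready(probe, timeout: float = 300, interval: float = 2) -> bool:
    """Call probe() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass  # server not up yet; keep polling
        time.sleep(interval)
    return False

def http_probe(url: str = "http://localhost:8000/health") -> bool:
    """Probe vLLM's /health endpoint; True once it answers 200."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.status == 200

# Usage:
# if wait_for_ready(http_probe):
#     print("vLLM is ready")
```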

6. Verify the Service {#verify}

Option 1: curl the chat endpoint

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

A healthy response looks like:

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "Qwen2.5-7B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! I am Qwen..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 45,
    "total_tokens": 57
  }
}

Option 2: list the models

curl http://localhost:8000/v1/models

Response:

{
  "object": "list",
  "data": [{
    "id": "Qwen2.5-7B-Instruct",
    "object": "model",
    "created": 1701234567,
    "owned_by": "vllm"
  }]
}

Option 3: test with the Python openai SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"   # vLLM does not check the API key by default
)

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100
)

print(response.choices[0].message.content)

7. Wrap It with FastAPI {#fastapi}

In production it is worth putting a FastAPI layer in front of vLLM, to handle:

  • API key authentication
  • parameter validation and default values
  • rate limiting, logging, and monitoring
  • routing across multiple models

Project layout

vllm-api/
├── main.py          # FastAPI entry point
├── config.py        # configuration (model name, vLLM address, ...)
├── models.py        # request/response schemas
├── requirements.txt
└── .env

requirements.txt

fastapi
uvicorn[standard]
httpx
python-dotenv
pydantic

config.py

import os
from dotenv import load_dotenv

load_dotenv()

VLLM_BASE_URL = os.getenv("VLLM_BASE_URL", "http://localhost:8000")
DEFAULT_MODEL  = os.getenv("DEFAULT_MODEL", "Qwen2.5-7B-Instruct")
API_KEY        = os.getenv("API_KEY", "")   # empty string disables auth
MAX_TOKENS_DEFAULT = int(os.getenv("MAX_TOKENS_DEFAULT", "512"))

models.py

from pydantic import BaseModel, Field
from typing import List, Optional

class Message(BaseModel):
    role: str       # "system" | "user" | "assistant"
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    model: Optional[str] = None
    max_tokens: Optional[int] = Field(default=512, ge=1, le=8192)
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
    top_p: Optional[float] = Field(default=1.0, ge=0.0, le=1.0)
    stream: Optional[bool] = False
    stop: Optional[List[str]] = None

class ChatResponse(BaseModel):
    content: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int
    finish_reason: str

main.py (full version)

from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx

from config import VLLM_BASE_URL, DEFAULT_MODEL, API_KEY, MAX_TOKENS_DEFAULT
from models import ChatRequest, ChatResponse

app = FastAPI(title="vLLM API Gateway", version="1.0.0")

# CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# ── Optional auth ─────────────────────────────────────
def verify_api_key(authorization: str = Header(default="")):
    if not API_KEY:
        return
    if authorization != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API Key")

# ── Health check ──────────────────────────────────────
@app.get("/health")
async def health():
    """Check that vLLM is reachable."""
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(f"{VLLM_BASE_URL}/health")
            return {"status": "ok", "vllm": resp.status_code == 200}
    except Exception as e:
        return {"status": "error", "detail": str(e)}

# ── List available models ─────────────────────────────
@app.get("/models")
async def list_models():
    """Pass through vLLM's model list."""
    async with httpx.AsyncClient(timeout=10) as client:
        resp = await client.get(f"{VLLM_BASE_URL}/v1/models")
        return resp.json()

# ── Non-streaming chat ────────────────────────────────
@app.post("/chat", response_model=ChatResponse,
          dependencies=[Depends(verify_api_key)])
async def chat(req: ChatRequest):
    """
    Non-streaming chat endpoint
    POST /chat
    {
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 512,
        "temperature": 0.7
    }
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature,
        "top_p": req.top_p,
        "stream": False,
    }
    if req.stop:
        payload["stop"] = req.stop

    async with httpx.AsyncClient(timeout=120) as client:
        try:
            resp = await client.post(
                f"{VLLM_BASE_URL}/v1/chat/completions",
                json=payload
            )
            resp.raise_for_status()
        except httpx.HTTPStatusError as e:
            raise HTTPException(
                status_code=e.response.status_code,
                detail=f"vLLM error: {e.response.text}"
            )
        except httpx.RequestError as e:
            raise HTTPException(status_code=502, detail=f"vLLM unreachable: {e}")

    data = resp.json()
    choice = data["choices"][0]
    usage  = data.get("usage", {})

    return ChatResponse(
        content=choice["message"]["content"],
        model=data.get("model", DEFAULT_MODEL),
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
        finish_reason=choice.get("finish_reason", "stop"),
    )

# ── Streaming chat ────────────────────────────────────
@app.post("/chat/stream", dependencies=[Depends(verify_api_key)])
async def chat_stream(req: ChatRequest):
    """
    Streaming chat endpoint (SSE)
    POST /chat/stream
    Consume from the frontend via EventSource or a fetch stream
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature,
        "top_p": req.top_p,
        "stream": True,
    }
    if req.stop:
        payload["stop"] = req.stop

    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                "POST",
                f"{VLLM_BASE_URL}/v1/chat/completions",
                json=payload
            ) as resp:
                async for line in resp.aiter_lines():
                    if line.startswith("data: "):
                        yield f"{line}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")

# ── Completions ───────────────────────────────────────
@app.post("/completions", dependencies=[Depends(verify_api_key)])
async def completions(
    prompt: str,
    model: str | None = None,
    max_tokens: int = 512,
    temperature: float = 0.7,
):
    """
    Text completion endpoint (non-chat use cases)
    POST /completions?prompt=xxx
    """
    payload = {
        "model": model or DEFAULT_MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            f"{VLLM_BASE_URL}/v1/completions",
            json=payload
        )
        resp.raise_for_status()

    data = resp.json()
    return {
        "text": data["choices"][0]["text"],
        "usage": data.get("usage", {}),
    }


if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8001, reload=True)

.env example

VLLM_BASE_URL=http://localhost:8000
DEFAULT_MODEL=Qwen2.5-7B-Instruct
API_KEY=your_secret_key_here     # leave empty to disable auth
MAX_TOKENS_DEFAULT=512

8. End-to-End Examples {#example}

Start the FastAPI service

cd vllm-api
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8001 --reload

curl examples

# Non-streaming chat
curl http://localhost:8001/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Write quicksort in Python"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

# Streaming chat
curl http://localhost:8001/chat/stream \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [{"role": "user", "content": "Tell me a joke"}],
    "stream": true
  }'

# List the models
curl http://localhost:8001/models

# Health check
curl http://localhost:8001/health

Python examples

import requests
import httpx
import asyncio

BASE_URL = "http://localhost:8001"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_secret_key_here"
}

# ── Non-streaming ─────────────────────────────────────
def chat(messages, max_tokens=512, temperature=0.7):
    resp = requests.post(
        f"{BASE_URL}/chat",
        headers=HEADERS,
        json={
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
        },
        timeout=120
    )
    resp.raise_for_status()
    return resp.json()["content"]

# Usage
result = chat([
    {"role": "system", "content": "You are a coding assistant"},
    {"role": "user",   "content": "Explain how PagedAttention works"},
])
print(result)


# ── Streaming (async) ─────────────────────────────────
async def chat_stream(messages):
    import json  # parse each SSE data payload
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST",
            f"{BASE_URL}/chat/stream",
            headers=HEADERS,
            json={"messages": messages, "stream": True}
        ) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: ") and "[DONE]" not in line:
                    try:
                        data = json.loads(line[6:])
                        delta = data["choices"][0]["delta"].get("content", "")
                        if delta:
                            print(delta, end="", flush=True)
                    except (json.JSONDecodeError, KeyError, IndexError):
                        pass
            print()  # trailing newline

asyncio.run(chat_stream([
    {"role": "user", "content": "Explain Transformers in simple terms"}
]))

FastAPI auto-generated docs

Once the service is running, open:

  • Swagger UI: http://localhost:8001/docs
  • ReDoc: http://localhost:8001/redoc

9. Troubleshooting {#faq}

Q: torch version conflict during installation

ERROR: pip's dependency resolver does not currently consider all the packages

Fix: uninstall the old torch first, then reinstall vLLM:

pip uninstall torch torchvision torchaudio -y
pip install vllm

Q: CUDA out of memory at startup

torch.cuda.OutOfMemoryError: CUDA out of memory

Fix:

# Option 1: lower the GPU memory utilization
vllm serve /data/models/xxx \
    --gpu-memory-utilization 0.7   # default is 0.9; reduce it

# Option 2: cap the context length
vllm serve /data/models/xxx \
    --max-model-len 4096

# Option 3: use a quantized variant
vllm serve /data/models/xxx-AWQ \
    --quantization awq

Q: error: The model's max seq len (xxx) is larger than the maximum number of tokens that can be stored in KV cache

Fix: there is not enough VRAM for the current max_model_len; reduce it:

vllm serve /data/models/xxx \
    --max-model-len 8192   # adjust to your VRAM

Q: NCCL error on multi-GPU startup

# Pin the network interface NCCL should use
export NCCL_SOCKET_IFNAME=eth0

# Disable the InfiniBand transport
export NCCL_IB_DISABLE=1

# Or disable GPU peer-to-peer transfers
export NCCL_P2P_DISABLE=1

Q: model name mismatch causes 404

By default vLLM uses the last path component of --model as the model name. Override it with --served-model-name:

vllm serve /data/models/Qwen2.5-7B-Instruct \
    --served-model-name qwen7b

Then pass "model": "qwen7b" in requests.


Q: requests time out

The first model load can take several minutes; raise the timeout on the FastAPI side:

async with httpx.AsyncClient(timeout=300) as client:
    ...

Q: how do I enable API key authentication?

vLLM itself supports an --api-key flag:

vllm serve /data/models/xxx \
    --api-key my_secret_key

Every request must then carry Authorization: Bearer my_secret_key


Q: VRAM usage is too high; how do I optimize?

# 1. Enable prefix caching (requests sharing a prefix reuse KV cache)
vllm serve /data/models/xxx \
    --enable-prefix-caching

# 2. Cap concurrency
vllm serve /data/models/xxx \
    --max-num-seqs 64

# 3. Use bfloat16 (more numerically stable than float16)
vllm serve /data/models/xxx \
    --dtype bfloat16
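To reason about these knobs, it helps to estimate the KV-cache footprint: per token, the cache stores one K and one V vector of size num_kv_heads × head_dim per layer. A rough calculator (the GQA numbers in the example are assumptions for illustration; read the real values from the model's config.json):

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: K and V (factor 2), one vector of
    num_kv_heads * head_dim elements per layer, dtype_bytes per element."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def max_cache_tokens(vram_bytes_for_cache: int, per_token: int) -> int:
    """How many tokens fit in the memory budget left for KV cache."""
    return vram_bytes_for_cache // per_token

# Example with assumed GQA dimensions (num_kv_heads < num attention heads):
per_tok = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
print(per_tok)  # 57344 bytes, i.e. 56 KB per token
print(max_cache_tokens(20 * 1024**3, per_tok))  # tokens in a 20 GiB budget
```

This is why --max-model-len, --max-num-seqs, and --gpu-memory-utilization trade off against each other: together they bound how many cached tokens must fit at once.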

Summary

At this point you have completed:

Step                  What was done
✅ Install vLLM       pip / Docker installation
✅ Prepare weights    downloaded from HF / ModelScope, stored in a tidy layout
✅ Start the server   single-GPU / multi-GPU / quantized launch, background run
✅ Verify the API     curl / openai SDK tests
✅ FastAPI gateway    non-streaming + streaming + completions endpoints, with auth
✅ Call examples      complete curl / Python examples

Questions are welcome in the comments!


References: vLLM official documentation | GitHub
