SGLang from 0 to 1: A Complete Deployment Tutorial (with FastAPI Integration)

This tutorial is aimed at newcomers to AI inference deployment. It walks you through the whole workflow from scratch: installing SGLang, setting up model weights, launching the inference server, and wrapping the API with FastAPI. Every step is explained in detail; follow along and it will run.


Table of Contents

  1. What is SGLang?
  2. Environment setup
  3. Installing SGLang
  4. Downloading / storing model weights
  5. Launching the SGLang inference server
  6. Verifying the service
  7. Wrapping the API with FastAPI
  8. Complete invocation examples
  9. Troubleshooting common problems

1. What is SGLang? {#what}

SGLang (Structured Generation Language) is a high-performance LLM inference framework open-sourced by academic researchers. Its core features:

  • RadixAttention: automatic KV cache reuse; requests sharing a prompt prefix significantly reduce GPU memory usage
  • High throughput: compared with vLLM, it often shows a 1.5x–3x throughput advantage under high concurrency (workload dependent)
  • OpenAI API compatible: the /v1/chat/completions endpoint works out of the box
  • Supports mainstream models: LLaMA, Qwen, Mistral, DeepSeek, Gemma, and more
The overall request flow:

User request
   ↓
FastAPI (your business layer)
   ↓
SGLang Server (inference engine)
   ↓
GPU (model weights)
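The prefix reuse behind RadixAttention can be illustrated with a toy sketch. This is a conceptual illustration only, not SGLang's actual data structure (which is a radix tree over the KV cache):

```python
from typing import List

def shared_prefix_len(a: List[int], b: List[int]) -> int:
    """Length of the common token prefix between two requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Two chat requests that share the same system-prompt tokens:
req1 = [101, 7, 7, 9, 42, 55]   # [system prompt ...] + question A
req2 = [101, 7, 7, 9, 88, 13]   # [system prompt ...] + question B

shared = shared_prefix_len(req1, req2)
# The KV-cache entries for the first `shared` tokens can be computed once
# and reused by both requests instead of being recomputed.
print(f"{shared} of {len(req2)} tokens can reuse cached KV entries")
```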

2. Environment setup {#env}

Hardware requirements

| Item   | Minimum                                 | Recommended        |
| ------ | --------------------------------------- | ------------------ |
| GPU    | NVIDIA, ≥16 GB VRAM (e.g., RTX 4090)    | A100 80GB / H100   |
| VRAM   | depends on model size                   | the more the better|
| CUDA   | 11.8+                                   | 12.1+              |
| OS     | Ubuntu 20.04+                           | Ubuntu 22.04       |
| Python | 3.9+                                    | 3.10 / 3.11        |

Check the CUDA version

nvidia-smi
nvcc --version

Output like the following means the CUDA environment is in order:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.xx     Driver Version: 535.xx    CUDA Version: 12.2         |
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M | ...                                        |
|  0   A100-SXM4   Off           | ...   80GB                                |
+-----------------------------------------------------------------------------+
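If you want to check the CUDA version from a script rather than by eye, the header line of `nvidia-smi` can be parsed. A sketch, assuming the output format shown above:

```python
import re
from typing import Optional

def cuda_version_from_smi(text: str) -> Optional[str]:
    """Extract the 'CUDA Version' field from nvidia-smi header output."""
    m = re.search(r"CUDA Version:\s*([\d.]+)", text)
    return m.group(1) if m else None

# Usage (on a machine with an NVIDIA driver installed):
# import subprocess
# out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
# print(cuda_version_from_smi(out))
```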

Create a virtual environment (strongly recommended)

# Using conda (recommended)
conda create -n sglang python=3.10 -y
conda activate sglang

# Or using venv
python3 -m venv ~/.venv/sglang
source ~/.venv/sglang/bin/activate

3. Installing SGLang {#install}

Option 1: pip install (simplest)

# Install SGLang (runtime and language components)
pip install "sglang[all]"

# To target a specific CUDA version (CUDA 12.1 as an example)
pip install "sglang[all]" --extra-index-url https://download.pytorch.org/whl/cu121

Option 2: install from source (latest features)

# Clone the repository
git clone https://github.com/sgl-project/sglang.git
cd sglang

# Install dependencies (run from the repo root)
pip install -e "python[all]"

Verify the installation

python -c "import sglang; print(sglang.__version__)"

If a version number such as 0.3.6 is printed, the installation succeeded.


4. Downloading / storing model weights {#weights}

SGLang works with models in Hugging Face format; you need to download the weights locally first.

Suggested directory layout

/data/models/
├── Qwen2.5-7B-Instruct/
│   ├── config.json
│   ├── tokenizer.json
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── model.safetensors          # single-file weights for small models
│   └── model-00001-of-00004.safetensors  # sharded weights for large models
├── Llama-3.1-8B-Instruct/
│   └── ...
└── DeepSeek-R1-Distill-Qwen-7B/
    └── ...

Download from Hugging Face

# Install huggingface_hub
pip install huggingface_hub

# Download a model (Qwen2.5-7B-Instruct as an example)
huggingface-cli download \
    Qwen/Qwen2.5-7B-Instruct \
    --local-dir /data/models/Qwen2.5-7B-Instruct \
    --local-dir-use-symlinks False

Tip: downloading from Hugging Face can be slow in mainland China; you can set a mirror:

export HF_ENDPOINT=https://hf-mirror.com

Download from ModelScope (recommended in mainland China)

pip install modelscope

python3 - << 'EOF'
from modelscope import snapshot_download
snapshot_download(
    'Qwen/Qwen2.5-7B-Instruct',
    cache_dir='/data/models'
)
EOF

Verify the weights are complete

ls -lh /data/models/Qwen2.5-7B-Instruct/
# Confirm config.json, tokenizer.json, and the .safetensors files are present
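The manual `ls` check can also be scripted. A minimal sketch, assuming the layout shown earlier (the required-file list here is an illustration, not an official SGLang validation):

```python
from pathlib import Path
from typing import List

REQUIRED = ["config.json", "tokenizer_config.json"]

def missing_model_files(model_dir: str) -> List[str]:
    """Return required files that are absent, plus a marker if no weights exist."""
    d = Path(model_dir)
    missing = [f for f in REQUIRED if not (d / f).exists()]
    if not list(d.glob("*.safetensors")) and not list(d.glob("*.bin")):
        missing.append("<no weight files: *.safetensors / *.bin>")
    return missing

# Usage:
# problems = missing_model_files("/data/models/Qwen2.5-7B-Instruct")
# if problems:
#     print("Incomplete download:", problems)
```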

5. Launching the SGLang inference server {#launch}

Basic launch command

python -m sglang.launch_server \
    --model-path /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 30000

Common flags

| Flag                  | Description                                   | Example                       |
| --------------------- | --------------------------------------------- | ----------------------------- |
| --model-path          | path to the model weights (required)          | /data/models/Qwen2.5-7B       |
| --host                | listen address                                | 0.0.0.0 (all interfaces)      |
| --port                | listen port (default 30000)                   | 30000                         |
| --tp                  | tensor-parallel degree (multi-GPU)            | --tp 2                        |
| --dp                  | data-parallel degree                          | --dp 2                        |
| --mem-fraction-static | static GPU memory fraction (0–1)              | --mem-fraction-static 0.85    |
| --max-total-tokens    | maximum total tokens (affects concurrency)    | --max-total-tokens 4096       |
| --served-model-name   | model name exposed by the API                 | --served-model-name qwen      |
| --dtype               | precision                                     | --dtype float16 / --dtype bfloat16 |
| --disable-cuda-graph  | disable CUDA Graph (for debugging)            |                               |
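If you script your deployments, the flags in the table can be assembled programmatically. A hypothetical helper (the flag names come from the table above; confirm them against `python -m sglang.launch_server --help` for your version):

```python
from typing import List

def build_launch_cmd(model_path: str, **flags) -> List[str]:
    """Assemble a launch_server command line; underscores become dashes."""
    cmd = ["python", "-m", "sglang.launch_server", "--model-path", model_path]
    for name, value in flags.items():
        flag = "--" + name.replace("_", "-")
        if value is True:               # boolean switch, e.g. disable_cuda_graph
            cmd.append(flag)
        else:
            cmd += [flag, str(value)]
    return cmd

cmd = build_launch_cmd("/data/models/Qwen2.5-7B-Instruct", tp=2, dtype="bfloat16")
# Pass `cmd` to subprocess.Popen(cmd) to start the server from a script.
```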

Multi-GPU launch example (4 GPUs)

python -m sglang.launch_server \
    --model-path /data/models/Qwen2.5-72B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tp 4 \
    --mem-fraction-static 0.85 \
    --dtype bfloat16

Running in the background (recommended for production)

# Using nohup
nohup python -m sglang.launch_server \
    --model-path /data/models/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    > /var/log/sglang.log 2>&1 &

echo "SGLang PID: $!"

# Tail the log
tail -f /var/log/sglang.log

Signs of a successful start

When the log contains the following lines, the server is ready:

INFO:     Started server process [xxxxx]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
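Loading a large model can take minutes before these lines appear. Instead of watching the log, you can poll until the server answers. A sketch with a generic wait loop (the `/health` probe mirrors the health check used later in this tutorial; adjust host and port to your deployment):

```python
import time
import urllib.request

def wait_until_ready(probe, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Call `probe()` until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if probe():
                return True
        except Exception:
            pass                      # server not accepting connections yet
        time.sleep(interval_s)
    return False

def sglang_health() -> bool:
    with urllib.request.urlopen("http://localhost:30000/health", timeout=3) as r:
        return r.status == 200

# wait_until_ready(sglang_health)   # blocks until the server answers
```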

6. Verifying the service {#verify}

Option 1: curl

# Test the chat completions endpoint
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "max_tokens": 200
  }'

A normal response looks like:

{
  "id": "chatcmpl-xxx",
  "object": "chat.completion",
  "model": "Qwen2.5-7B-Instruct",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "Hello! I'm Qwen..."
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 45,
    "total_tokens": 57
  }
}
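When consuming a response shaped like the JSON above, the fields you usually need can be pulled out with a small helper (field names follow the OpenAI-compatible schema shown):

```python
from typing import Tuple

def extract_reply(data: dict) -> Tuple[str, int]:
    """Return (assistant_text, total_tokens) from a chat.completion response."""
    text = data["choices"][0]["message"]["content"]
    total = data.get("usage", {}).get("total_tokens", 0)
    return text, total

resp = {
    "choices": [{"message": {"role": "assistant", "content": "Hello!"}}],
    "usage": {"prompt_tokens": 12, "completion_tokens": 45, "total_tokens": 57},
}
text, total = extract_reply(resp)   # -> ("Hello!", 57)
```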

Option 2: Python

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="dummy"   # SGLang does not validate the API key; any value works
)

response = client.chat.completions.create(
    model="Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=100
)

print(response.choices[0].message.content)

7. Wrapping the API with FastAPI {#fastapi}

Calling the SGLang server directly works, but in production you usually put a FastAPI layer in front of it, for:

  • authentication (API-key checks)
  • request validation and default-value management
  • rate limiting, logging, and monitoring
  • multi-model routing
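Of these four, rate limiting is the one the gateway code in this tutorial does not implement. As a sketch of the idea, here is a minimal token-bucket limiter; in production you would more likely reach for a library or a reverse proxy:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# bucket = TokenBucket(rate=5, capacity=10)
# In an endpoint: if not bucket.allow(): return a 429 response.
```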

Project structure

sglang-api/
├── main.py          # FastAPI entry point
├── config.py        # configuration (model name, SGLang address, etc.)
├── models.py        # request/response data models
├── client.py        # SGLang call wrapper
├── requirements.txt
└── .env

requirements.txt

fastapi
uvicorn[standard]
httpx
python-dotenv
pydantic

config.py

import os
from dotenv import load_dotenv

load_dotenv()

SGLANG_BASE_URL = os.getenv("SGLANG_BASE_URL", "http://localhost:30000")
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "Qwen2.5-7B-Instruct")
API_KEY = os.getenv("API_KEY", "")          # leave empty to disable auth
MAX_TOKENS_DEFAULT = int(os.getenv("MAX_TOKENS_DEFAULT", "512"))

models.py

from pydantic import BaseModel, Field
from typing import List, Optional

class Message(BaseModel):
    role: str                    # "system" | "user" | "assistant"
    content: str

class ChatRequest(BaseModel):
    messages: List[Message]
    model: Optional[str] = None
    max_tokens: Optional[int] = Field(default=512, ge=1, le=8192)
    temperature: Optional[float] = Field(default=0.7, ge=0, le=2.0)
    stream: Optional[bool] = False

class ChatResponse(BaseModel):
    content: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

client.py

import httpx
from config import SGLANG_BASE_URL, DEFAULT_MODEL

async def chat_completion(
    messages: list,
    model: str = None,
    max_tokens: int = 512,
    temperature: float = 0.7,
    stream: bool = False,
):
    """Call the SGLang chat completions endpoint."""
    url = f"{SGLANG_BASE_URL}/v1/chat/completions"

    payload = {
        "model": model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in messages],  # pydantic v2 (use .dict() on v1)
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": stream,
    }

    async with httpx.AsyncClient(timeout=120) as client:
        if stream:
            # Streaming response
            async with client.stream("POST", url, json=payload) as resp:
                resp.raise_for_status()
                async for line in resp.aiter_lines():
                    if line.startswith("data: ") and line != "data: [DONE]":
                        yield line[6:]  # strip the "data: " prefix
        else:
            resp = await client.post(url, json=payload)
            resp.raise_for_status()
            yield resp.json()

main.py(完整版)

from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.responses import StreamingResponse
from fastapi.middleware.cors import CORSMiddleware
import httpx
import json

from config import SGLANG_BASE_URL, DEFAULT_MODEL, API_KEY, MAX_TOKENS_DEFAULT
from models import ChatRequest, ChatResponse

app = FastAPI(title="SGLang API Gateway", version="1.0.0")

# CORS (enable as needed)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# ── Optional auth ─────────────────────────────────────
def verify_api_key(authorization: str = Header(default="")):
    if not API_KEY:
        return  # no key configured; skip the check
    if authorization != f"Bearer {API_KEY}":
        raise HTTPException(status_code=401, detail="Invalid API Key")

# ── Health check ──────────────────────────────────────
@app.get("/health")
async def health():
    """Check that SGLang is reachable."""
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(f"{SGLANG_BASE_URL}/health")
            return {"status": "ok", "sglang": resp.status_code == 200}
    except Exception as e:
        return {"status": "error", "detail": str(e)}

# ── Non-streaming chat ────────────────────────────────
@app.post("/chat", response_model=ChatResponse, dependencies=[Depends(verify_api_key)])
async def chat(req: ChatRequest):
    """
    Non-streaming chat endpoint.
    POST /chat
    {
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 512
    }
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],  # pydantic v2
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature if req.temperature is not None else 0.7,
        "stream": False,
    }

    async with httpx.AsyncClient(timeout=120) as client:
        try:
            resp = await client.post(
                f"{SGLANG_BASE_URL}/v1/chat/completions",
                json=payload
            )
            resp.raise_for_status()
        except httpx.HTTPStatusError as e:
            raise HTTPException(status_code=e.response.status_code, detail=e.response.text)
        except httpx.RequestError as e:
            raise HTTPException(status_code=502, detail=f"SGLang unreachable: {str(e)}")

    data = resp.json()
    choice = data["choices"][0]
    usage = data.get("usage", {})

    return ChatResponse(
        content=choice["message"]["content"],
        model=data.get("model", DEFAULT_MODEL),
        prompt_tokens=usage.get("prompt_tokens", 0),
        completion_tokens=usage.get("completion_tokens", 0),
        total_tokens=usage.get("total_tokens", 0),
    )

# ── Streaming chat ────────────────────────────────────
@app.post("/chat/stream", dependencies=[Depends(verify_api_key)])
async def chat_stream(req: ChatRequest):
    """
    Streaming chat endpoint (SSE).
    POST /chat/stream
    """
    payload = {
        "model": req.model or DEFAULT_MODEL,
        "messages": [m.model_dump() for m in req.messages],  # pydantic v2
        "max_tokens": req.max_tokens or MAX_TOKENS_DEFAULT,
        "temperature": req.temperature if req.temperature is not None else 0.7,
        "stream": True,
    }

    async def event_generator():
        async with httpx.AsyncClient(timeout=120) as client:
            async with client.stream(
                "POST",
                f"{SGLANG_BASE_URL}/v1/chat/completions",
                json=payload
            ) as resp:
                async for line in resp.aiter_lines():
                    if line.startswith("data: "):
                        yield f"{line}\n\n"

    return StreamingResponse(event_generator(), media_type="text/event-stream")


if __name__ == "__main__":
    import uvicorn
    uvicorn.run("main:app", host="0.0.0.0", port=8000, reload=True)

.env example

SGLANG_BASE_URL=http://localhost:30000
DEFAULT_MODEL=Qwen2.5-7B-Instruct
API_KEY=your_secret_key_here     # leave empty to disable auth
MAX_TOKENS_DEFAULT=512

8. Complete invocation examples {#example}

Start the FastAPI service

cd sglang-api
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000 --reload

curl examples

# Non-streaming
curl http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant"},
      {"role": "user", "content": "Write a bubble sort in Python"}
    ],
    "max_tokens": 512,
    "temperature": 0.7
  }'

# Streaming
curl http://localhost:8000/chat/stream \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_secret_key_here" \
  -d '{
    "messages": [{"role": "user", "content": "Tell me a story"}],
    "stream": true
  }'

Python examples

import requests

BASE_URL = "http://localhost:8000"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": "Bearer your_secret_key_here"
}

# Non-streaming
def chat(messages, max_tokens=512, temperature=0.7):
    resp = requests.post(
        f"{BASE_URL}/chat",
        headers=HEADERS,
        json={
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature,
        }
    )
    resp.raise_for_status()
    return resp.json()["content"]

# Usage
result = chat([
    {"role": "system", "content": "You are a coding assistant"},
    {"role": "user", "content": "Explain what a Transformer is"},
])
print(result)


# Streaming
import asyncio
import json

import httpx

async def chat_stream(messages):
    async with httpx.AsyncClient(timeout=120) as client:
        async with client.stream(
            "POST",
            f"{BASE_URL}/chat/stream",
            headers=HEADERS,
            json={"messages": messages, "stream": True}
        ) as resp:
            async for line in resp.aiter_lines():
                if line.startswith("data: ") and "[DONE]" not in line:
                    data = json.loads(line[6:])
                    delta = data["choices"][0]["delta"].get("content", "")
                    if delta:
                        print(delta, end="", flush=True)

# asyncio.run(chat_stream([{"role": "user", "content": "Tell me a story"}]))
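The `data: ` handling in the loop above can be factored into a small parser that is easy to unit-test. A sketch matching the SSE framing used by OpenAI-compatible servers:

```python
import json
from typing import Optional

def parse_sse_line(line: str) -> Optional[dict]:
    """Decode the JSON payload of an SSE data line; None for non-data lines."""
    if not line.startswith("data: "):
        return None                    # blank keep-alive, comment, or other field
    body = line[len("data: "):]
    if body.strip() == "[DONE]":
        return None                    # end-of-stream sentinel
    return json.loads(body)

def delta_text(chunk: dict) -> str:
    """Extract the incremental text from a chat.completion.chunk object."""
    return chunk["choices"][0]["delta"].get("content", "")
```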

FastAPI auto-generated docs

Once the service is running, visit:

  • Interactive docs: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

9. Troubleshooting common problems {#faq}

Q: CUDA out of memory at startup

torch.cuda.OutOfMemoryError: CUDA out of memory

Fix: reduce the static memory fraction

python -m sglang.launch_server \
    --model-path /data/models/xxx \
    --mem-fraction-static 0.7   # default is ~0.9; lower it

Or switch to a quantized model (GPTQ/AWQ).
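A rough rule of thumb for whether a model fits at all: the weights alone take roughly parameter count × bytes per parameter, and the KV cache needs headroom on top. This sketch covers weights only and is an approximation, not how SGLang actually allocates memory:

```python
def weight_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB (2 bytes/param for fp16/bf16, 1 for int8)."""
    return n_params_billion * 1e9 * bytes_per_param / 1024**3

# A 7B model in bf16 needs roughly 13 GB for the weights alone, which is
# why a 16 GB card gets tight once the KV cache and activations are added.
print(f"{weight_gb(7):.1f} GB")   # -> 13.0 GB
```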


Q: curl gets no response after startup

Confirm the port is listening:

ss -tlnp | grep 30000

Check the firewall:

sudo ufw allow 30000

Q: NCCL error on multi-GPU startup

# Try pinning the network interface
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_DISABLE=1

Q: 404 caused by a model-name mismatch

By default SGLang uses the last path component of --model-path as the model name. If your path is /data/models/Qwen2.5-7B-Instruct, the model name is Qwen2.5-7B-Instruct.

You can override it with --served-model-name:

python -m sglang.launch_server \
    --model-path /data/models/Qwen2.5-7B-Instruct \
    --served-model-name qwen7b

Then pass "model": "qwen7b" in your requests.


Q: FastAPI calls time out

SGLang can take several minutes to load a model the first time. httpx defaults to a 5-second timeout (requests has no default timeout at all), so set a generous one explicitly:

async with httpx.AsyncClient(timeout=300) as client:
    ...

Summary

You have now completed:

| Step                   | Details                                          |
| ---------------------- | ------------------------------------------------ |
| ✅ Install SGLang      | pip or source install                            |
| ✅ Prepare weights     | downloaded from HF / ModelScope, organized layout|
| ✅ Launch the server   | single- and multi-GPU commands, background run   |
| ✅ Verify the API      | curl / openai SDK tests                          |
| ✅ FastAPI wrapper     | non-streaming + streaming endpoints, with auth   |
| ✅ Invocation examples | complete curl / Python examples                  |

Questions are welcome in the comments!


References: SGLang official documentation | GitHub
