LLM deployment options (on-premises + cloud + hybrid + edge)

A尘埃 · Published 2025-12-31 17:27:16

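On-premises deployment

Use cases: enterprises with strict data-privacy requirements (e.g. finance, healthcare), workloads that need low-latency responses (e.g. real-time chat), or teams with fixed compute resources (their own GPU cluster)

① Hardware: at least one A100 (40 GB VRAM) or an equivalent GPU (e.g. H800; an RTX 4090 with 24 GB requires quantization); CPU ≥ 16 cores and RAM ≥ 64 GB (for loading the model and caching)
Software: CUDA 11.7+ and cuDNN 8.5+; Python 3.9+; PyTorch 2.0+ (built with CUDA support)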

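② Obtain and preprocess the model

Download the model weights from the Hugging Face Hub (access must be requested first, and git-lfs is needed to fetch the weight files): git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf

Optional optimization: use a quantization tool (e.g. bitsandbytes) to load the model in 4-bit or 8-bit precision (INT4/INT8) instead of FP32, which reduces VRAM usage (a 7B model in INT4 needs only about 6 GB of VRAM)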

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# load the weights in 4-bit and run the computation in bfloat16
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", quantization_config=bnb_config)
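The VRAM figures quoted for quantization can be sanity-checked with simple arithmetic. The sketch below only counts the bytes needed for the weights themselves (1 GB taken as 1e9 bytes); the helper name weight_memory_gb is made up for illustration. A 7B model at 4 bits comes out at 3.5 GB of weights, so a figure of about 6 GB in practice would be explained by the KV cache, activations, and framework overhead on top.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Compare the same 7B-parameter model at different precisions.
    for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"7B model, {label}: {weight_memory_gb(7e9, bits):.1f} GB")
```

This is a lower bound only: real usage depends on context length, batch size, and the runtime.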

③ Set up the inference service

Use a high-performance inference framework (e.g. vLLM or Text Generation Inference (TGI)) instead of native Transformers to increase throughput

pip install vllm
python -m vllm.entrypoints.api_server --model meta-llama/Llama-2-7b-chat-hf --quantization awq  # pass --quantization awq only if the checkpoint is already AWQ-quantized

Expose a REST API: wrap the vLLM interface with FastAPI so that it accepts JSON input

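from fastapi import FastAPI
from pydantic import BaseModel
from vllm import LLM, SamplingParams

app = FastAPI()
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")  # loaded once at startup

class GenerateRequest(BaseModel):
    prompt: str  # JSON body: {"prompt": "..."}

@app.post("/generate")
def generate(req: GenerateRequest):
    sampling_params = SamplingParams(max_tokens=200)  # cap the generated length
    outputs = llm.generate([req.prompt], sampling_params)
    return {"text": outputs[0].outputs[0].text}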

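Cloud deployment

Use cases: enterprises that need elastic scaling (e.g. traffic spikes during promotions), lack local compute, or want to reduce operations costs (e.g. small and mid-sized companies)

Example: deploying GPT-2 Large on AWS SageMaker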

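① Choose a cloud instance

GPU type: g5.xlarge (NVIDIA A10G, 24 GB VRAM) or p3.2xlarge (V100, 16 GB VRAM), adjusted to the model size (e.g. g5.2xlarge for a 13B model; note that g5.2xlarge has the same single 24 GB A10G as g5.xlarge with more vCPUs and RAM, so a 13B model still needs quantization to fit in VRAM)
Storage: keep the model weights in S3 (to avoid repeated downloads) and configure an IAM role that grants SageMaker access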

② Package and upload the model

Write the inference script (inference.py): define model_fn (loads the model) and predict_fn (handles requests)

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def model_fn(model_dir):  # load the model and tokenizer from the model directory
    tokenizer = GPT2Tokenizer.from_pretrained(model_dir)
    model = GPT2LMHeadModel.from_pretrained(model_dir)
    return {"model": model, "tokenizer": tokenizer}

def predict_fn(data, model_dict):  # handle one request
    input_text = data["inputs"]
    inputs = model_dict["tokenizer"](input_text, return_tensors="pt")
    with torch.no_grad():  # gradients are not needed for generation
        outputs = model_dict["model"].generate(**inputs, max_length=100)
    return {"generated_text": model_dict["tokenizer"].decode(outputs[0])}

Package as a .tar.gz file: tar czvf model.tar.gz inference.py model_weights/

Upload to S3: aws s3 cp model.tar.gz s3://my-bucket/models/gpt2-large/

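③ Create the SageMaker endpoint

Console steps: open SageMaker → Endpoints → Create endpoint → choose the "Custom" container → specify the S3 path and the inference script

Autoscaling: configure the minimum and maximum instance counts (e.g. 2 to 10) and scale out based on QPS (e.g. add instances when QPS > 50)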
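The autoscaling rule can also be set programmatically through Application Auto Scaling. The following is a minimal sketch, assuming the threshold of 50 QPS is meant per instance, that the endpoint variant is named "AllTraffic", and that the endpoint name and policy name are placeholders. The predefined SageMaker metric counts invocations per instance per minute, so 50 QPS is converted to 3000. The request dictionaries are built by a pure function so they can be inspected without AWS access; apply_autoscaling is the only part that talks to AWS.

```python
def qps_to_invocations_per_minute(qps: float) -> float:
    """The SageMaker predefined metric is per minute, not per second."""
    return qps * 60


def build_autoscaling_requests(endpoint_name, variant_name="AllTraffic",
                               min_instances=2, max_instances=10,
                               target_qps_per_instance=50.0):
    """Build the two request payloads: scalable target and target-tracking policy."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = {**common, "MinCapacity": min_instances, "MaxCapacity": max_instances}
    policy = {
        **common,
        "PolicyName": f"{endpoint_name}-qps-target-tracking",  # hypothetical name
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": qps_to_invocations_per_minute(target_qps_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
        },
    }
    return target, policy


def apply_autoscaling(endpoint_name):
    """Send both requests to AWS (needs credentials and an existing endpoint)."""
    import boto3  # imported lazily so the builder above runs without boto3

    client = boto3.client("application-autoscaling")
    target, policy = build_autoscaling_requests(endpoint_name)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Calling apply_autoscaling("gpt2-endpoint") would register the 2 to 10 instance range and attach the policy; target tracking both scales out and scales back in around the target value.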

④ Integrate with enterprise systems and call the API

Send requests through the AWS SDK (e.g. boto3)

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
response = runtime.invoke_endpoint(
    EndpointName="gpt2-endpoint",
    ContentType="application/json",
    Body=json.dumps({"inputs": "Hello, how are you?"}),  # request body as a JSON string
)
print(response["Body"].read().decode())  # the body is a stream: read it, then decode

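Hybrid deployment

Use cases: enterprises that need both on-premises privacy and cloud elasticity (e.g. a bank's intelligent customer service: simple questions are handled locally, complex ones are sent to the cloud)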
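One way to picture the hybrid setup is a small router in front of the two backends. The sketch below is illustrative only: the keyword lists, the word-count threshold, and the names Route and choose_backend are all hypothetical, and a real system would more likely use a classifier than keyword matching. The rule order encodes the privacy requirement: anything that looks sensitive stays local even when it is complex.

```python
from dataclasses import dataclass

# Hypothetical heuristics; tune or replace them for a real deployment.
SENSITIVE_KEYWORDS = ("account number", "id number", "password")
COMPLEX_KEYWORDS = ("compare", "analyze", "step by step")
MAX_SIMPLE_WORDS = 20


@dataclass
class Route:
    backend: str  # "local" or "cloud"
    reason: str


def choose_backend(question: str) -> Route:
    text = question.lower()
    # Privacy first: sensitive content never leaves the local deployment.
    if any(keyword in text for keyword in SENSITIVE_KEYWORDS):
        return Route("local", "contains sensitive data")
    # Long or analytical questions go to the larger cloud model.
    if len(text.split()) > MAX_SIMPLE_WORDS or any(k in text for k in COMPLEX_KEYWORDS):
        return Route("cloud", "complex question")
    return Route("local", "simple question")


if __name__ == "__main__":
    for q in ("What are your opening hours?",
              "Compare the fees of these two loan products",
              "Please analyze the history of my account number 1234"):
        print(q, "->", choose_backend(q))
```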

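Lightweight edge deployment

Use cases: end devices (e.g. phones, industrial sensors) that must run inference offline, or workloads with extremely tight latency requirements (e.g. driver assistance in autonomous vehicles)

① Model compression

Distillation: use BERT-base as the teacher model to train the lightweight student model DistilBERT (40% fewer parameters, 60% faster)

Quantization: convert FP32 weights to INT8 (using TensorRT's calibration tooling), typically with an accuracy loss under 2%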
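To make the INT8 step concrete, here is a minimal sketch of symmetric per-tensor quantization with a max-abs scale, written in plain Python. It is a simplification for illustration and not TensorRT's calibration procedure, which chooses ranges from representative input data; the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map [-max|w|, +max|w|] onto [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate FP values from the integers."""
    return [v * scale for v in quantized]


weights = [0.82, -1.27, 0.03, 0.5, -0.64]  # made-up sample weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", q)
print("scale:", scale, "max round-trip error:", max_error)
```

The round-trip error per weight is bounded by half a quantization step (scale / 2), which is why weights with a few large outliers quantize poorly under a single per-tensor scale.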

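② Convert to an edge-friendly format

Export the model to ONNX so that it can run on ONNX Runtime or TensorRT: torch.onnx.export(model, dummy_input, "distilbert.onnx")

Mobile optimization: convert to the .tflite format with the TensorFlow Lite Converter and enable NNAPI acceleration (Android) or Core ML (iOS)

③ Integrate into the device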

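// loadModelFile and preprocess are app-defined helpers, not part of the TensorFlow Lite API
Interpreter tflite = new Interpreter(loadModelFile(context, "distilbert.tflite"));
float[][] input = preprocess(text); // convert the text to word vectors
float[][] output = new float[1][2]; // negative/positive sentiment probabilities
tflite.run(input, output);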

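In real projects the three tools described below can be combined: for example, use Ollama to test model quality locally, vLLM to serve the production API, and SGLang for subtasks that need structured output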

Ollama

An open-source framework for running large models locally

Steps:

① Install

macOS/Linux：

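curl -fsSL https://ollama.com/install.sh | sh  # installs Ollama and starts the service (default port 11434)

Windows: download the installer from the Ollama website and double-click it to run

② Pull and run a model

Ollama can pull models from its official library directly by name (the model must already be available in Ollama format):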

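ollama pull llama3:8b  # pull LLaMA-3-8B (4-bit quantized by default, about 5 GB of VRAM)
ollama run llama3:8b    # start an interactive chat (CLI mode)

Custom models: if you have local model weights (e.g. a fine-tuned LLaMA-2), import them through a Modelfile: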

# path to the local weights
FROM ./local_llama2_weights
# sampling temperature
PARAMETER temperature 0.7
# system prompt
SYSTEM "You are a helpful assistant."

Then build and run it

ollama create my-llama2 -f Modelfile  # create the custom model
ollama run my-llama2                  # run the custom model

③ Call via the API (programmatic access)

Once started, Ollama listens on http://localhost:11434 by default and can be called through the REST API or from Python:

Generate text with the REST API

# "stream": false returns one complete JSON response (true streams partial results)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "prompt": "Explain quantum computing in simple terms",
  "stream": false
}'

Python (calling the REST API with requests)

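import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Write a poem about autumn",
        "stream": False,  # the endpoint streams by default; ask for a single JSON object
    },
)

print(response.json()["response"])  # print the generated text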
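When streaming is enabled, /api/generate returns newline-delimited JSON: one object per line, each carrying a "response" fragment, with "done": true on the last one. The sketch below shows one way to reassemble the text. The helper names collect_stream and generate_streaming are made up for illustration, and the sample lines are hand-written to mimic the format rather than captured from a real server.

```python
import json


def collect_stream(lines):
    """Concatenate the "response" fragments of an NDJSON stream until "done" is true."""
    parts = []
    for raw in lines:
        if not raw:
            continue  # skip blank keep-alive lines
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


def generate_streaming(prompt, model="llama3:8b",
                       url="http://localhost:11434/api/generate"):
    """Call a running Ollama server with streaming enabled and return the full text."""
    import requests  # imported lazily so collect_stream works without it

    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as resp:
        resp.raise_for_status()
        return collect_stream(resp.iter_lines())


# Hand-written sample lines in the streaming format (not real server output).
sample = [
    b'{"model": "llama3:8b", "response": "Hello", "done": false}',
    b'{"model": "llama3:8b", "response": ", world", "done": false}',
    b'{"model": "llama3:8b", "response": "", "done": true}',
]
print(collect_stream(sample))
```

In an interactive application the fragments would be printed as they arrive instead of being joined at the end.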

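vLLM

A high-performance inference framework whose core innovation is the PagedAttention mechanism (analogous to how an operating system pages memory), which addresses GPU memory fragmentation and throughput bottlenecks in large-model inference.

PagedAttention manages the KV cache in fixed-size blocks, which sharply reduces wasted GPU memory; the vLLM authors report 2-4× higher throughput than earlier serving systems, and throughput on the same hardware is far higher than with Hugging Face Transformers
Supports continuous batching and streaming output, which suits high-concurrency scenarios (e.g. API services)
Built-in distributed inference (multi-GPU/multi-node), quantization support (AWQ, GPTQ), and monitoring metrics (Prometheus)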
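The paging analogy can be illustrated with a toy allocator. The sketch below is not vLLM's implementation; it only shows the bookkeeping idea: GPU memory is split into fixed-size blocks, each sequence keeps a block table that maps its tokens to physical blocks, blocks are taken on demand, and they return to the free pool as soon as a sequence finishes. The class name PagedKVCache and the sizes used are hypothetical.

```python
class PagedKVCache:
    """Toy block-table allocator for KV-cache slots (bookkeeping only, no tensors)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> int:
        """Reserve a slot for one more token and return its physical block id."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return table[-1]

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks of size 4
for _ in range(3):
    cache.append_token("b")  # 3 tokens need 1 block
print(cache.block_tables, "free:", len(cache.free_blocks))
```

Because a sequence wastes at most one partially filled block, memory does not have to be reserved up front for the maximum possible length, which is what lets more requests share the same GPU.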

Example: deploying LLaMA-3-8B and starting an API service

① Install vLLM

Requires Python 3.8+ and CUDA 11.7+; installing with pip is recommended:

pip install vllm  # base install (supports mainstream models)
# AWQ-quantized checkpoints are loaded by passing --quantization awq at launch

② Start the inference service (OpenAI-compatible API)

Start the API server from the command line, specifying the model (a local path or a Hugging Face Hub name):

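# --model: Hugging Face Hub name or local path
# --tensor-parallel-size: number of GPUs to shard the model across (1 = single GPU, 2 for two cards)
# --host 0.0.0.0: accept external connections; --port: service port
# Optional: add --quantization awq for an AWQ-quantized checkpoint (lowers VRAM usage)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000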

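# Key parameters
# --max-num-batched-tokens: maximum number of tokens per batch (default 4096; raise it to increase throughput)
# --gpu-memory-utilization: fraction of GPU memory vLLM may use (default 0.9; lower it to avoid OOM)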
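A rough calculation shows what the memory-utilization setting buys. The sketch below estimates how many tokens of KV cache fit once the weights are loaded. All numbers are assumptions for illustration: a 24 GB GPU, about 16 GB of FP16 weights, and a LLaMA-3-8B-style layout of 32 layers, 8 KV heads, and head dimension 128. Activations and runtime overhead are ignored, so the real figure would be lower.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def tokens_that_fit(gpu_gb: float, utilization: float, weights_gb: float,
                    bytes_per_token: int) -> int:
    """Tokens of KV cache that fit in the memory left after loading the weights."""
    budget_bytes = (gpu_gb * utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


if __name__ == "__main__":
    per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print("KV bytes per token:", per_token)
    for utilization in (0.8, 0.9, 0.95):
        tokens = tokens_that_fit(24, utilization, 16, per_token)
        print(f"utilization {utilization}: about {tokens} cached tokens")
```

The token budget is shared by all requests in flight, so a higher utilization value translates directly into more concurrent or longer sequences.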

③ Call the API (OpenAI-style interface)

vLLM is compatible with the OpenAI API format, so it can be called directly with the openai library

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")  # no real key is needed

# Text completion
response = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Write Python code that implements quicksort",
    max_tokens=500,
    stream=False,  # non-streaming (stream=True enables streaming output)
)

print(response.choices[0].text)

# Chat (ChatCompletion)
chat_response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "Explain how PagedAttention works"}],
)
print(chat_response.choices[0].message.content)

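SGLang

A large-model serving framework developed by LMSYS that focuses on optimizing structured output (e.g. JSON, code, tables) and multimodal inference, with support for low-latency streaming output and tool calling (function calling)

Structured generation: "grammar constraints" (e.g. a JSON Schema) force the model to emit the specified format and avoid format errors (unconstrained models often produce invalid JSON)
Multimodal support: native image and audio input (e.g. the LLaVA multimodal model), with mixed image-text inference
Efficient serving: provides an OpenAI-compatible API with batched requests, streaming output, and tool calling, plus built-in monitoring and autoscaling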

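Example: structured JSON output

① Install SGLang

pip install sglang  # base install (supports LLaMA, Mistral, LLaVA, and other models)

② Start the server (for structured output)

Start the server from the command line and load the model; the structured-output schema is supplied with each request: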

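# --model-path: model name or local path; --port: service port
# --mem-fraction-static: fraction of GPU memory reserved statically (lower it to avoid OOM)
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --mem-fraction-static 0.8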

③ Call the API (forcing JSON output)

SGLang accepts a response_format parameter that specifies the output format (e.g. a JSON Schema)

import json

import requests

url = "http://localhost:30000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

data = {
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [
        {"role": "user", "content": "Extract the name, age, and occupation from the following text: 'Zhang San, 28, is a software engineer who graduated from Tsinghua University'"}
    ],
    "response_format": {  # OpenAI-style JSON Schema constraint
        "type": "json_schema",
        "json_schema": {
            "name": "person",  # label for this schema
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "job": {"type": "string"}
                },
                "required": ["name", "age", "job"]
            }
        }
    },
    "stream": False
}

response = requests.post(url, headers=headers, data=json.dumps(data))
result = response.json()["choices"][0]["message"]["content"]
print(json.loads(result))  # e.g. {"name": "Zhang San", "age": 28, "job": "software engineer"}
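Even with constrained decoding on the server, a cheap client-side check protects downstream code from surprises such as a truncated response. The sketch below is a minimal type check for the name/age/job structure used here, not a full JSON Schema validator; the names EXPECTED_FIELDS and parse_person are made up for illustration.

```python
import json

# Field name -> expected Python type, mirroring the schema in the request.
EXPECTED_FIELDS = {"name": str, "age": int, "job": str}


def parse_person(raw: str) -> dict:
    """Parse the model output and verify required fields and their types."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        value = data[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"field {field!r} should be {expected_type.__name__}")
    return data


if __name__ == "__main__":
    person = parse_person('{"name": "Zhang San", "age": 28, "job": "software engineer"}')
    print(person["name"], person["age"], person["job"])
```

For larger schemas a dedicated validation library would be a better fit than hand-written checks.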