Chapter 10: A Map of Tools and Frameworks

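Overview of the LLMOps technology stack. This chapter surveys the LLMOps technology stack, organized as a five-layer architecture:
Infrastructure layer: Kubernetes/Prometheus provide the underlying support
Data layer: vector databases such as Milvus/Qdrant handle retrieval
Training layer: Transformers/PEFT/DeepSpeed enable efficient fine-tuning
Inference layer: engines such as vLLM/TensorRT-LLM optimize inference performance
Application layer: frameworks such as LangChain/LlamaIndex …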

Pozicaiman · Published 2025-11-19 20:57:33

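1. The LLMOps Stack at a Glance

1.1 Stack Layers

2. Application Orchestration Frameworks

2.1 The LangChain Ecosystem

Core components

Component	Function	Use case
Models	LLM wrappers	Calling different models through a unified interface
Prompts	Prompt templates	Structured prompt management
Chains	Chained calls	Orchestrating multi-step tasks
Agents	Agents	Dynamic tool invocation
Memory	Memory management	Storing conversation history
Retrievers	Retrievers	Document retrieval interface
VectorStores	Vector storage	Vector database integration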
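
To make the roles in the table concrete, here is a minimal standard-library sketch of how a prompt template, a model, and memory can be composed into a chain. It is not LangChain's API: the class names `PromptTemplate`, `Memory`, and `Chain`, and the fake `echo_model`, are hypothetical stand-ins chosen only to mirror the component names above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PromptTemplate:
    """Hypothetical stand-in for the Prompts component."""
    template: str

    def format(self, **values: str) -> str:
        return self.template.format(**values)


@dataclass
class Memory:
    """Hypothetical stand-in for the Memory component: stores past turns."""
    turns: List[str] = field(default_factory=list)

    def add(self, user: str, reply: str) -> None:
        self.turns.append(f"User: {user}\nAssistant: {reply}")

    def render(self) -> str:
        return "\n".join(self.turns)


@dataclass
class Chain:
    """Hypothetical stand-in for Chains: prompt -> model -> memory update."""
    prompt: PromptTemplate
    model: Callable[[str], str]
    memory: Memory

    def run(self, question: str) -> str:
        text = self.prompt.format(history=self.memory.render(), question=question)
        reply = self.model(text)
        self.memory.add(question, reply)
        return reply


def echo_model(prompt: str) -> str:
    """Fake model for illustration: reports the prompt length instead of calling an LLM."""
    return f"(prompt had {len(prompt)} characters)"


chain = Chain(
    prompt=PromptTemplate("{history}\nUser: {question}\nAssistant:"),
    model=echo_model,
    memory=Memory(),
)
first = chain.run("What is LLMOps?")
second = chain.run("And which layer does vLLM belong to?")
print(first)
print(second)
```

The second prompt is longer than the first because the stored history is prepended, which is the job the Memory row describes. A real framework would add retrieval, tool calls, and streaming on top of the same composition idea.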

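2.2 LlamaIndex

Feature comparison

Feature	LangChain	LlamaIndex
Positioning	General-purpose application framework	Focused on data indexing
Data loading	Weaker	Strong (100+ sources)
Index structures	Basic	Rich and varied
Agent support	Strong	Basic
Best suited for	Complex applications	RAG systems

2.3 Haystack

3. Inference and Serving

3.1 Inference Engine Comparison

3.2 vLLM in Detail

Usage example

# Start the vLLM server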
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-13b-chat-hf \
    --tensor-parallel-size 2 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9

# Call the API
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "prompt": "Once upon a time",
        "max_tokens": 100,
        "temperature": 0.7
    }'
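
The same request can be issued from Python. The sketch below uses only the standard library and assumes the server from the command above is listening on `localhost:8000`; the helper name `build_completion_request` is hypothetical, and the response shape in the commented-out part assumes the OpenAI-style completions format that the endpoint is meant to be compatible with.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 100,
                             temperature: float = 0.7) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_completion_request(
    "http://localhost:8000",
    "meta-llama/Llama-2-13b-chat-hf",
    "Once upon a time",
)
print(req.full_url)

# Sending the request requires a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```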

3.3 TensorRT-LLM

4. Training and Fine-Tuning

4.1 Training Frameworks

4.2 Using PEFT

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16
)

# LoRA configuration
peft_config = LoraConfig(
    r=16,  # low-rank dimension
    lora_alpha=32,  # scaling factor
    target_modules=["q_proj", "v_proj"],  # layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply PEFT
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()
# With r=16 on q_proj and v_proj this should print roughly:
# trainable params: 8,388,608 || all params: 6,746,804,224 || trainable%: 0.1243
# (r=8 would give 4,194,304 trainable params out of 6,742,609,920, about 0.062%)
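
The trainable-parameter count can be checked by hand. Each adapted projection of shape d_in x d_out gains two low-rank matrices with r * (d_in + d_out) parameters in total. The sketch below assumes the Llama-2-7B layout of 32 decoder layers with 4096 x 4096 `q_proj` and `v_proj` matrices; the function name `lora_trainable_params` is hypothetical.

```python
from typing import List, Tuple


def lora_trainable_params(num_layers: int,
                          target_shapes: List[Tuple[int, int]],
                          r: int) -> int:
    """Count LoRA parameters: r * (d_in + d_out) per adapted projection, per layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in target_shapes)
    return num_layers * per_layer


# Assumed Llama-2-7B shapes: q_proj and v_proj are both 4096 x 4096, 32 layers
shapes = [(4096, 4096), (4096, 4096)]
params_r16 = lora_trainable_params(32, shapes, r=16)
params_r8 = lora_trainable_params(32, shapes, r=8)
print(f"r=16: {params_r16:,}")  # r=16: 8,388,608
print(f"r=8:  {params_r8:,}")   # r=8:  4,194,304
```

Doubling r doubles the adapter size, so under these assumptions the rank is the main knob for trading adapter capacity against memory.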

4.3 DeepSpeed Integration

{
  "fp16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9
  },
  "gradient_accumulation_steps": 4,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": 1
}
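
The batch-size fields in this config are linked: DeepSpeed expects the global `train_batch_size` to equal `train_micro_batch_size_per_gpu` times `gradient_accumulation_steps` times the number of GPUs, and the `"auto"` values are intended to be filled in by the Hugging Face Trainer integration. A small sketch of that arithmetic, with a hypothetical helper name and assumed GPU counts:

```python
def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Global batch size implied by the per-GPU micro-batch settings."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus


# Values taken from the config above; the GPU counts are assumptions
for gpus in (1, 4, 8):
    size = effective_batch_size(micro_batch_per_gpu=1, grad_accum_steps=4, num_gpus=gpus)
    print(f"{gpus} GPU(s) -> train_batch_size = {size}")
```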

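5. Vector Databases

5.1 Vector Database Comparison

Database	Type	Performance	Scale	Characteristics
FAISS	Library	Fast	Medium	Open-sourced by Meta, in-memory
Milvus	Distributed	Fast	Large	Cloud-native, highly available
Qdrant	Single-node/cluster	Fast	Medium to large	Written in Rust, high performance
Weaviate	Distributed	Medium	Large	GraphQL, multimodal
Pinecone	Cloud service	Fast	Large	Fully managed, easy to use
Chroma	Embedded	Medium	Small	Lightweight, easy to integrate

5.2 Milvus Architecture

5.3 Qdrant Features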

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

# Connect to Qdrant
client = QdrantClient(url="http://localhost:6333")

# Create a collection
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(
        size=768,  # vector dimension
        distance=Distance.COSINE  # distance metric
    )
)

# Insert vectors
client.upsert(
    collection_name="documents",
    points=[
        PointStruct(
            id=1,
            vector=[0.1, 0.2, ...],  # 768 dimensions (remaining values elided)
            payload={"text": "document content", "source": "doc1.pdf"}
        )
    ]
)

# Search, restricted to points whose payload matches the filter
results = client.search(
    collection_name="documents",
    query_vector=[0.1, 0.3, ...],  # 768 dimensions (remaining values elided)
    limit=5,
    query_filter=Filter(
        must=[
            FieldCondition(key="source", match=MatchValue(value="doc1.pdf"))
        ]
    )
)
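
Conceptually, a filtered vector search keeps only the points whose payload matches, scores them against the query, and returns the top results. The brute-force sketch below shows that logic with the standard library only. It is not how Qdrant works internally (a real engine uses an approximate index rather than scanning every point), and the 3-dimensional vectors and the `search` helper are illustrative assumptions.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, List[float], Dict[str, str]]  # (id, vector, payload)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(points: List[Point], query: List[float], limit: int,
           source: Optional[str] = None) -> List[Tuple[int, float]]:
    """Filter by payload["source"] if given, then rank by cosine similarity."""
    candidates = [p for p in points if source is None or p[2].get("source") == source]
    scored = [(pid, cosine(vec, query)) for pid, vec, _ in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:limit]


points: List[Point] = [
    (1, [1.0, 0.0, 0.0], {"source": "doc1.pdf"}),
    (2, [0.0, 1.0, 0.0], {"source": "doc2.pdf"}),
    (3, [0.9, 0.1, 0.0], {"source": "doc1.pdf"}),
]
hits = search(points, [1.0, 0.1, 0.0], limit=2, source="doc1.pdf")
print([pid for pid, _ in hits])  # [3, 1]
```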

6. Monitoring and Observability

6.1 Monitoring Tool Stack

6.2 LangSmith

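import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# Import paths assume LangChain 0.1 to 0.3; older releases used langchain.chat_models
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent

# Tracing is automatic once the environment variables above are set
llm = ChatOpenAI(model="gpt-4")
# `tools` and `prompt` are assumed to be defined elsewhere (a tool list and a ReAct prompt)
agent = create_react_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)

# Every call is reported to LangSmith
result = executor.invoke({"input": "Check the weather"})

# The LangSmith dashboard shows:
# - The full call chain
# - Token usage per step
# - Latency analysis
# - Error tracking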
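
To illustrate what a trace of this kind records, here is a minimal standard-library sketch that captures the step name, latency, and any error for each call in a pipeline. It is not LangSmith's API and does not track tokens: the `traced` decorator, the in-memory `TRACE` list, and the two toy steps are all hypothetical.

```python
import time
from functools import wraps
from typing import Any, Callable, Dict, List

TRACE: List[Dict[str, Any]] = []  # hypothetical in-memory trace store


def traced(name: str) -> Callable:
    """Record latency and errors for each call of the wrapped function."""
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            record: Dict[str, Any] = {"step": name, "error": None}
            try:
                return fn(*args, **kwargs)
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                # Runs on success and on failure, so every call leaves a record
                record["latency_s"] = time.perf_counter() - start
                TRACE.append(record)
        return wrapper
    return decorator


@traced("retrieve")
def retrieve(query: str) -> List[str]:
    return [f"doc about {query}"]


@traced("generate")
def generate(query: str, docs: List[str]) -> str:
    if not docs:
        raise ValueError("no context")
    return f"answer to {query!r} using {len(docs)} doc(s)"


answer = generate("weather", retrieve("weather"))
for record in TRACE:
    print(record["step"], record["error"], f"{record['latency_s']:.6f}s")
```

A hosted tracing service adds persistence, nesting of spans, and token accounting, but the per-step record is the same basic unit.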

6.3 Phoenix (Arize)

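import phoenix as px
from phoenix.otel import register
# Assumes a recent Phoenix release, where the LangChain instrumentor ships in the
# openinference-instrumentation-langchain package
from openinference.instrumentation.langchain import LangChainInstrumentor

# Launch the Phoenix server
session = px.launch_app()

# Automatic instrumentation
tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# Run the LLM application
# Collected automatically: token usage, latency, context, outputs, and so on

# Open http://localhost:6006 to view:
# - Request traces
# - Hallucination detection
# - Embedding visualization
# - Performance analysis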

7. Experiment Management

7.1 MLflow

Usage example

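import mlflow
from transformers import Trainer, TrainingArguments

# `model`, `tokenizer`, `training_args` (a TrainingArguments instance),
# `train_dataset`, and `eval_dataset` are assumed to be defined earlier

# Start MLflow tracking
mlflow.set_experiment("llama-2-finetuning")

with mlflow.start_run():
    # Log parameters
    mlflow.log_params({
        "model": "llama-2-7b",
        "learning_rate": 2e-5,
        "epochs": 3,
        "lora_r": 16
    })
    
    # Train
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset
    )
    train_output = trainer.train()
    eval_results = trainer.evaluate()  # keys are prefixed with "eval_"
    
    # Log metrics (an accuracy metric would additionally need a compute_metrics function)
    mlflow.log_metrics({
        "train_loss": train_output.training_loss,
        "eval_loss": eval_results["eval_loss"]
    })
    
    # Save the model
    mlflow.transformers.log_model(
        transformers_model={"model": model, "tokenizer": tokenizer},
        artifact_path="model"
    )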

7.2 Weights & Biases

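import wandb
from transformers import Trainer, TrainingArguments

# Initialize
wandb.init(
    project="llm-finetuning",
    config={
        "model": "llama-2-7b",
        "learning_rate": 2e-5,
        "epochs": 3
    }
)

# Automatic logging (`model` and `train_dataset` are assumed to be defined earlier)
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./outputs",  # assumed output path
        report_to="wandb",  # report automatically
        logging_steps=10
    ),
    train_dataset=train_dataset
)
trainer.train()

# The dashboard shows:
# - Training curves
# - System resources
# - Model comparison
# - Hyperparameter sweeps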

8. Data Processing Tools

8.1 Data Engineering

8.2 Label Studio

<!-- Label Studio labeling configuration (Label Studio uses XML-style tags) -->
<View>
  <Text name="text" value="$text"/>

  <TextArea name="response" toName="text"
            placeholder="Enter the answer the model should generate"
            required="true"/>

  <Choices name="quality" toName="text" choice="multiple">
    <Choice value="Accurate" alias="accurate"/>
    <Choice value="Relevant" alias="relevant"/>
    <Choice value="Complete" alias="complete"/>
  </Choices>
</View>

9. Deployment and Infrastructure

9.1 Container Orchestration

9.2 Ray Serve

from ray import serve
from transformers import pipeline

@serve.deployment(
    num_replicas=2,
    ray_actor_options={"num_gpus": 1}
)
class LLMDeployment:
    def __init__(self):
        self.model = pipeline(
            "text-generation",
            model="gpt2",
            device=0
        )
    
    def __call__(self, request):
        prompt = request.query_params["prompt"]
        return self.model(prompt, max_length=100)[0]

# Deploy
serve.run(LLMDeployment.bind())

# Call the endpoint
import requests
resp = requests.get(
    "http://localhost:8000/",
    params={"prompt": "Once upon a time"}
)
print(resp.json())

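10. Tool Selection Recommendations

10.1 Selection by Scenario

10.2 Recommended Stacks

Rapid prototyping

LLM: OpenAI API
Framework: LangChain
Vector store: Chroma (embedded)
Monitoring: LangSmith

Production

LLM: self-hosted open-source model
Inference: vLLM/TensorRT-LLM
Framework: LlamaIndex + in-house code
Vector store: Milvus cluster
Orchestration: Kubernetes + Helm
Monitoring: Prometheus + Grafana + Phoenix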
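
The two recommended stacks can also be kept as data, for example to drive a project template or a checklist. This is only a sketch: the `STACKS` dictionary mirrors the lists above, while the key names and the `recommend` helper are hypothetical.

```python
from typing import Dict

# Mirrors the two recommended stacks; key names are illustrative assumptions
STACKS: Dict[str, Dict[str, str]] = {
    "prototype": {
        "llm": "OpenAI API",
        "framework": "LangChain",
        "vector_store": "Chroma (embedded)",
        "monitoring": "LangSmith",
    },
    "production": {
        "llm": "self-hosted open-source model",
        "inference": "vLLM/TensorRT-LLM",
        "framework": "LlamaIndex + in-house code",
        "vector_store": "Milvus cluster",
        "orchestration": "Kubernetes + Helm",
        "monitoring": "Prometheus + Grafana + Phoenix",
    },
}


def recommend(stage: str) -> Dict[str, str]:
    """Return the recommended stack for a stage, or raise ValueError if unknown."""
    try:
        return STACKS[stage]
    except KeyError:
        raise ValueError(f"unknown stage {stage!r}; expected one of {sorted(STACKS)}")


for layer, tool in recommend("prototype").items():
    print(f"{layer}: {tool}")
```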

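11. Summary: Principles for Tool Selection

Key takeaways:

No silver bullet: choose the tool that best fits the scenario
Ecosystem first: prefer frameworks with a rich ecosystem
Performance trade-offs: favor ease of use while prototyping and performance in production
Observability: plan monitoring and tracing from the start
Continuous evolution: the stack will change as the project grows

This chapter provided a panoramic map of LLMOps tools and a guide to choosing among them.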
