Chapter 8: DevOps/SRE Practices
Abstract: This chapter covers DevOps/SRE practice for AI systems in four main areas: 1) end-to-end LLMOps CI/CD pipeline design, showing the full path from code commit to production release; 2) a code quality management system, including the code-checking pipeline and a layered testing strategy; 3) containerization and image management, with a focus on multi-stage builds and optimization techniques; 4) Kubernetes deployment, illustrated with resource configuration for GPU nodes. These practices are presented systematically through Mermaid flowcharts and code examples.
1. The End-to-End LLMOps CI/CD Pipeline
1.1 The Complete Pipeline
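As a concrete sketch of such a pipeline, here assuming GitHub Actions as the CI system (the job layout, the registry `registry.example.com`, and the secret names are illustrative placeholders, not a prescribed setup):

```yaml
# .github/workflows/ci-cd.yaml -- minimal sketch: test, then build and push
name: llm-service-ci-cd
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: ruff check .            # lint
      - run: pytest -m "not slow"    # fast unit tests only; slow suite runs nightly
  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: registry.example.com/vllm:${{ github.sha }}
```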
1.2 GitOps Workflow
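In the GitOps model, deployment manifests live in Git and a controller continuously reconciles the cluster toward them; CI only pushes images and bumps the manifest, never touches the cluster directly. A minimal sketch assuming Argo CD (the repository URL, paths, and namespaces are hypothetical):

```yaml
# Argo CD Application: the cluster follows k8s/vllm in the deploy repo
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: vllm-inference
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/platform/llm-deploy.git
    targetRevision: main
    path: k8s/vllm
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-serving
  syncPolicy:
    automated:
      prune: true      # remove resources that were deleted from Git
      selfHeal: true   # revert manual drift back to the Git state
```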
2. Code Quality Management
2.1 Code-Checking Pipeline
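A common way to enforce the check chain both locally and in CI is pre-commit; a sketch of a typical Python configuration (the pinned revisions are illustrative and should point at real release tags):

```yaml
# .pre-commit-config.yaml -- lint, format, type-check, hygiene
repos:
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4
    hooks:
      - id: ruff           # linting
      - id: ruff-format    # formatting
  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.10.0
    hooks:
      - id: mypy           # static type checking
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0
    hooks:
      - id: check-yaml
      - id: end-of-file-fixer
```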
2.2 Testing Strategy
Pytest example
```python
# tests/test_rag.py
import pytest

from app.rag import RAGSystem


@pytest.fixture
def rag_system():
    """RAG system instance for testing."""
    return RAGSystem(
        model_name="test-model",
        vector_db="memory",  # use an in-memory vector store
    )


def test_retrieval(rag_system):
    """Retrieval returns the requested number of relevant documents."""
    query = "What is a Transformer?"
    docs = rag_system.retrieve(query, top_k=3)
    assert len(docs) == 3
    assert all(doc.score > 0.5 for doc in docs)


def test_generation(rag_system):
    """Generation produces a non-empty, on-topic answer."""
    query = "Explain the attention mechanism"
    answer = rag_system.generate(query)
    assert len(answer) > 0
    assert "attention" in answer.lower()


@pytest.mark.slow  # register the "slow" marker (e.g. in pyproject.toml)
def test_end_to_end(rag_system):
    """End-to-end test."""
    result = rag_system.query("How do LLMs work?")
    assert result.answer is not None
    assert len(result.sources) > 0
    assert result.latency < 5.0  # must return within 5 seconds
```
3. Containerization and Image Management
3.1 Multi-Stage Builds
```dockerfile
# Multi-stage Dockerfile
# Stage 1: build environment
FROM python:3.11-slim AS builder
WORKDIR /build

# Install build dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    git \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --user --no-cache-dir -r requirements.txt

# Stage 2: runtime environment
FROM python:3.11-slim

# Install runtime dependencies only
RUN apt-get update && apt-get install -y \
    libgomp1 \
    && rm -rf /var/lib/apt/lists/*

# Run as a non-root user; create it before copying so ownership is correct
RUN useradd -m appuser

# Copy Python packages from the builder into the app user's home
# (copying into /root/.local would be unreadable after dropping privileges)
COPY --from=builder --chown=appuser:appuser /root/.local /home/appuser/.local
ENV PATH=/home/appuser/.local/bin:$PATH

WORKDIR /app

# Copy application code
COPY --chown=appuser:appuser app/ ./app/
COPY --chown=appuser:appuser models/ ./models/

USER appuser

EXPOSE 8000
CMD ["python", "-m", "app.main"]
```
3.2 Image Optimization
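Beyond multi-stage builds, the usual levers are slim base images, a strict .dockerignore, and build-cache reuse across CI runs. As a sketch, the docker/build-push-action step used in the pipeline above can persist layer cache between runs through its cache options (the tag is the same placeholder as before):

```yaml
# Layer-cache reuse in CI
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: registry.example.com/vllm:${{ github.sha }}
    cache-from: type=gha           # restore layers cached by earlier runs
    cache-to: type=gha,mode=max    # cache all intermediate layers
```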
3.3 Image Version Management
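A common convention is to give every image both an immutable identifier (the git SHA) and a human-readable semver tag, so rollbacks can always target an exact build. A sketch using docker/metadata-action inside the CI workflow above (image name is the same placeholder):

```yaml
# Generate tags, then feed them to the build step
- uses: docker/metadata-action@v5
  id: meta
  with:
    images: registry.example.com/vllm
    tags: |
      type=semver,pattern={{version}}                   # e.g. v0.2.7 on release tags
      type=sha,prefix=git-                              # immutable git-sha tag
      type=raw,value=latest,enable={{is_default_branch}}
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ${{ steps.meta.outputs.tags }}
```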
4. Kubernetes Deployment
4.1 Resource Configuration
```yaml
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  labels:
    app: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      # Pin to GPU nodes
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      # Tolerate the GPU taint
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: vllm
          image: registry.example.com/vllm:v0.2.7
          # Resource requests and limits
          resources:
            requests:
              memory: "32Gi"
              cpu: "8"
              nvidia.com/gpu: "1"
            limits:
              memory: "64Gi"
              cpu: "16"
              nvidia.com/gpu: "1"
          # Environment variables
          env:
            - name: MODEL_NAME
              value: "meta-llama/Llama-2-13b-chat-hf"
            - name: TENSOR_PARALLEL_SIZE
              value: "1"
            - name: MAX_MODEL_LEN
              value: "4096"
          # Volumes
          volumeMounts:
            - name: model-storage
              mountPath: /models
              readOnly: true
            - name: cache
              mountPath: /root/.cache
          # Health checks
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 300
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 60
            periodSeconds: 10
          # Startup probe (model loading is slow)
          startupProbe:
            httpGet:
              path: /health
              port: 8000
            failureThreshold: 30
            periodSeconds: 10
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
        - name: cache
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - name: http   # named so the ServiceMonitor in 6.2 can reference it
      port: 8000
      targetPort: 8000
  type: ClusterIP
```
4.2 Configuration Management
```yaml
# ConfigMap: application configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
data:
  config.yaml: |
    model:
      name: "llama-2-13b-chat"
      tensor_parallel_size: 1
      max_model_len: 4096
    serving:
      host: "0.0.0.0"
      port: 8000
      max_num_seqs: 256
    logging:
      level: "INFO"
      format: "json"
---
# Secret: sensitive values (placeholders shown; in practice inject these
# from a secret manager rather than committing them to Git)
apiVersion: v1
kind: Secret
metadata:
  name: vllm-secrets
type: Opaque
stringData:
  huggingface-token: "hf_xxxxxxxxxxxxx"
  api-key: "sk-xxxxxxxxxxxxx"
```
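To make these available to the Pods, the Deployment above would mount the ConfigMap as a file and expose the Secret through environment variables. A sketch of the relevant container fields (the mount path is illustrative; HUGGING_FACE_HUB_TOKEN is the variable the huggingface_hub library reads):

```yaml
# Sketch: wiring the ConfigMap and Secret into the vllm container
containers:
  - name: vllm
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: vllm-secrets
            key: huggingface-token
    volumeMounts:
      - name: config
        mountPath: /app/config   # config.yaml appears under this path
        readOnly: true
volumes:
  - name: config
    configMap:
      name: vllm-config
```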
4.3 Helm Chart Structure
```
llm-inference/
├── Chart.yaml              # chart metadata
├── values.yaml             # default configuration
├── values-prod.yaml        # production overrides
├── templates/
│   ├── deployment.yaml
│   ├── service.yaml
│   ├── hpa.yaml
│   ├── pdb.yaml            # PodDisruptionBudget
│   ├── servicemonitor.yaml # Prometheus monitoring
│   └── _helpers.tpl        # helper templates
└── charts/                 # dependent charts
    └── milvus/
```
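The values files carry the knobs that differ per environment, so values-prod.yaml only overrides what changes in production. A sketch of what values.yaml might expose for this chart (the keys are illustrative, not a fixed schema):

```yaml
# values.yaml -- illustrative defaults for the llm-inference chart
replicaCount: 2
image:
  repository: registry.example.com/vllm
  tag: v0.2.7
model:
  name: meta-llama/Llama-2-13b-chat-hf
  maxModelLen: 4096
resources:
  requests:
    nvidia.com/gpu: 1
autoscaling:
  enabled: false
  minReplicas: 2
  maxReplicas: 10
```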
5. Release Strategies
5.1 Blue-Green Deployment
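With plain Kubernetes objects, blue-green can be implemented as two parallel Deployments behind one Service whose selector is flipped atomically. A minimal sketch (the version labels and names are illustrative):

```yaml
# Two Deployments (vllm-blue / vllm-green) carry a version label;
# flipping the Service selector switches 100% of traffic at once.
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
    version: blue   # change to "green" to cut over; change back to roll back
  ports:
    - port: 8000
      targetPort: 8000
```

The cutover is then a single selector patch, e.g. `kubectl patch service vllm-service -p '{"spec":{"selector":{"app":"vllm","version":"green"}}}'`, and rollback is the reverse patch.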
5.2 Canary Releases
```yaml
# Argo Rollouts canary
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: vllm-rollout
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10          # 10% of traffic to the new version
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 10m}
        - setWeight: 75
        - pause: {duration: 10m}
      # Automatic promotion or rollback driven by analysis
      analysis:
        templates:
          - templateName: success-rate
        args:
          - name: service-name
            value: vllm-service
```
5.3 Traffic Management
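Weighted traffic splitting is typically delegated to a service mesh or gateway rather than to raw Services. A sketch assuming Istio (the subset names would be defined in a matching DestinationRule, which is omitted here):

```yaml
# Istio VirtualService: send 90% of traffic to stable, 10% to canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: vllm-vs
spec:
  hosts:
    - vllm-service
  http:
    - route:
        - destination:
            host: vllm-service
            subset: stable
          weight: 90
        - destination:
            host: vllm-service
            subset: canary
          weight: 10
```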
6. Monitoring and Alerting
6.1 Monitoring Architecture
6.2 Key Metrics
```yaml
# ServiceMonitor: Prometheus scrape configuration
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: http      # the named Service port from 4.1; /metrics is served on the API port
      interval: 30s
      path: /metrics

# Exposed metrics:
# vllm_request_total{status="success"}            - total requests
# vllm_request_duration_seconds{quantile="0.95"}  - P95 latency
# vllm_token_throughput                           - token throughput
# vllm_gpu_utilization                            - GPU utilization
# vllm_kv_cache_usage_ratio                       - KV cache usage
# vllm_queue_length                               - queue length
```
6.3 Alerting Rules
```yaml
# PrometheusRule: alerting rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
    - name: vllm
      interval: 30s
      rules:
        # High error rate
        - alert: HighErrorRate
          expr: |
            rate(vllm_request_total{status="error"}[5m]) /
            rate(vllm_request_total[5m]) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Error rate above 5%"
            description: "Error rate over the last 5 minutes: {{ $value | humanizePercentage }}"
        # High latency
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95,
              rate(vllm_request_duration_seconds_bucket[5m])
            ) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "P95 latency above 3 seconds"
        # GPU memory pressure
        - alert: GPUMemoryHigh
          expr: |
            vllm_gpu_memory_usage / vllm_gpu_memory_total > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "GPU memory usage above 90%"
        # Service unavailable
        - alert: ServiceDown
          expr: up{job="vllm"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "vLLM service is down"
```
7. Log Management
7.1 Logging Conventions
```python
# Structured logging
import structlog

logger = structlog.get_logger()

# Request log
logger.info(
    "inference_request",
    request_id="req-123456",
    user_id="user-001",
    model="llama-2-13b",
    prompt_tokens=150,
    max_tokens=500,
    temperature=0.7,
)

# Inference log
logger.info(
    "inference_complete",
    request_id="req-123456",
    generation_tokens=320,
    duration_ms=1250,
    tokens_per_second=256,
)

# Error log (exc_info=True attaches the active exception's traceback)
logger.error(
    "inference_failed",
    request_id="req-123456",
    error="CUDA out of memory",
    gpu_memory_used="78GB",
    exc_info=True,
)
```
7.2 Log Collection
```yaml
# Promtail configuration: log collection
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape Pods that carry the logging label
      - source_labels: [__meta_kubernetes_pod_label_logging]
        action: keep
        regex: enabled
      # Extract Pod metadata
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```
8. Incident Response
8.1 Severity Levels
8.2 Response Process
8.3 Runbooks
## Common Incident Runbook

### 1. GPU OOM (out of GPU memory)

**Symptoms**: inference requests fail; logs show "CUDA out of memory"

**Mitigation**:

```bash
# 1. Immediately reduce the maximum batch size
kubectl set env deployment/vllm MAX_NUM_SEQS=64
# 2. Restart the Pods to release GPU memory
kubectl rollout restart deployment/vllm
```

**Root-cause investigation**:
- Check for abnormally large requests (excessive max_tokens)
- Check whether the KV cache configuration is reasonable
- Check for GPU memory leaks

### 2. High latency (P95 > 3s)

**Symptoms**: slow responses, request queue backing up

**Mitigation**:

```bash
# 1. Scale out
kubectl scale deployment/vllm --replicas=6
# 2. Rate-limit for protection: lower the QPS limit at the gateway
```

**Root-cause investigation**:
- Check whether GPU utilization is saturated
- Check for network I/O bottlenecks
- Check for slow queries

### 3. Incorrect model output

**Symptoms**: abnormal responses, increased hallucination

**Mitigation**:

```bash
# Roll back to the last stable version
kubectl rollout undo deployment/vllm
```

**Root-cause investigation**:
- Diff the new version against the old one
- Check whether the model files are corrupted
- Check whether sampling parameters (e.g. temperature) are off
9. SRE Metrics
9.1 SLI/SLO/SLA
```mermaid
graph TB
    A[SRE metric system] --> B[SLI<br/>Service Level Indicators]
    A --> C[SLO<br/>Service Level Objectives]
    A --> D[SLA<br/>Service Level Agreements]
    B --> B1[Availability<br/>request success rate]
    B --> B2[Latency<br/>P95 response time]
    B --> B3[Throughput<br/>QPS]
    C --> C1[Availability ≥ 99.9%]
    C --> C2[P95 latency ≤ 2s]
    C --> C3[Error rate ≤ 0.1%]
    D --> D1[Compensation terms]
    D --> D2[99.9% - none]
    D --> D3[99% - 10% credit]
    D --> D4[95% - 50% credit]
```
9.2 Error Budget
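A 99.9% availability SLO leaves an error budget of 0.1%, i.e. about 43.2 minutes of full downtime (or the equivalent in failed requests) per 30-day window: 30 × 24 × 60 × 0.001 ≈ 43.2. A sketch of Prometheus recording rules that track budget burn using the request metrics from 6.2 (the rule names are illustrative):

```yaml
# Recording rules: error-budget burn rate for a 99.9% availability SLO
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-error-budget
spec:
  groups:
    - name: slo
      rules:
        # Fraction of failed requests over the last hour
        - record: vllm:error_ratio:1h
          expr: |
            sum(rate(vllm_request_total{status="error"}[1h])) /
            sum(rate(vllm_request_total[1h]))
        # Burn rate: 1 means the budget lasts exactly the 30-day window;
        # a sustained rate of 14.4 exhausts it in 2 days and should page
        - record: vllm:error_budget_burn_rate:1h
          expr: vllm:error_ratio:1h / 0.001
```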
10. Summary: DevOps Best Practices
Key takeaways:
- Automate everything: full automation from build to deployment
- Observability: metrics, logs, and traces as one integrated whole
- Progressive delivery: canary releases, staged rollouts, A/B testing
- Fast recovery: health checks and automatic rollback
- Incident readiness: runbooks, drills, and postmortems
This chapter provided a complete DevOps/SRE practice guide for LLMOps, from development through operations.