02 - LLM Deployment: Installing Large Models and Scheduling Containers with Kubernetes + vLLM
1. Kubernetes Basics and vLLM Integration Overview
1.1 Why Deploy vLLM on Kubernetes
Kubernetes provides enterprise-grade container orchestration, which is particularly well suited to the following vLLM deployment scenarios:
- Elastic scaling: automatically adjust the number of vLLM instances based on load
- High availability: automatic failure recovery and load balancing
- Resource management: fine-grained GPU resource allocation and scheduling
- Multi-tenant isolation: resource isolation between different models or users
- Version management: seamless model version upgrades and rollbacks
1.2 Kubernetes and vLLM Architecture
┌─────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                    │
│  ┌─────────────┐  ┌───────────────┐  ┌───────────────┐  │
│  │ Master Node │  │ Worker Node 1 │  │ Worker Node 2 │  │
│  │             │  │               │  │               │  │
│  │ API Server  │  │  vLLM Pod 1   │  │  vLLM Pod 2   │  │
│  │ Scheduler   │  │   (GPU 0,1)   │  │   (GPU 2,3)   │  │
│  │ Controller  │  │               │  │               │  │
│  └─────────────┘  └───────────────┘  └───────────────┘  │
└─────────────────────────────────────────────────────────┘
2. Environment Preparation and Dependency Installation
2.1 Kubernetes Cluster Requirements
Hardware requirements
- Master node: at least 2 CPUs, 4 GB RAM
- Worker nodes: at least 4 CPUs, 16 GB RAM, 1-2 NVIDIA GPUs
- Network: 10 GbE between nodes recommended
Software requirements
- Kubernetes 1.24+
- NVIDIA GPU Operator
- Container runtime (containerd or Docker)
- kubectl command-line tool
2.2 Installing the NVIDIA GPU Operator
# Add the NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update
# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true
2.3 Verifying GPU Resources
# List GPU nodes (assumes nodes carry a gpu=true label;
# the GPU Operator also labels GPU nodes with nvidia.com/gpu.present=true)
kubectl get nodes -l gpu=true
# Inspect GPU resources on a node
kubectl describe node <worker-node-name> | grep -i gpu
# Verify the NVIDIA device plugin pods
kubectl get pods -n gpu-operator
3. Building the vLLM Container Image
3.1 Base Dockerfile
FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
# Install system dependencies
RUN apt-get update && apt-get install -y \
    python3.9 \
    python3.9-dev \
    python3-pip \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*
# Create symlinks
RUN ln -s /usr/bin/python3.9 /usr/bin/python && \
    ln -s /usr/bin/pip3 /usr/bin/pip
# Upgrade pip
RUN pip install --upgrade pip
# Install vLLM (it pins a compatible torch version as a dependency)
RUN pip install vllm==0.2.5
# Create the application directory
WORKDIR /app
# Copy the startup script
COPY start_vllm.sh /app/
RUN chmod +x /app/start_vllm.sh
# Expose the API port
EXPOSE 8000
# Startup command
CMD ["/app/start_vllm.sh"]
3.2 Startup Script (start_vllm.sh)
#!/bin/bash
# Model path (HuggingFace model ID or local path)
MODEL_PATH=${MODEL_PATH:-"meta-llama/Llama-2-7b-chat-hf"}
# Fraction of GPU memory vLLM may use
GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.9}
# Tensor parallel size (number of GPUs per replica)
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}
# Batching limits (overridable via environment variables)
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-8192}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
# Start the vLLM OpenAI-compatible API server
# (this entrypoint also serves the /health and /metrics endpoints
# used by the probes and monitoring configuration below)
python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
    --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS} \
    --max-num-seqs ${MAX_NUM_SEQS}
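As a rough illustration of what `--gpu-memory-utilization` controls (the fraction of each GPU's memory vLLM may claim for model weights plus KV cache), here is a back-of-the-envelope sketch; the 80 GB card and 7B-parameter fp16 model are illustrative assumptions, not values from this deployment:

```python
# Rough GPU memory budget behind --gpu-memory-utilization (illustrative numbers).
gpu_memory_gb = 80          # e.g. an 80 GB accelerator (assumption)
utilization = 0.9           # matches GPU_MEMORY_UTILIZATION above
weights_gb = 7e9 * 2 / 1e9  # 7B parameters in fp16 ~= 14 GB (assumption)

budget_gb = gpu_memory_gb * utilization  # memory vLLM is allowed to claim
kv_cache_gb = budget_gb - weights_gb     # roughly what remains for the KV cache

print(f"budget={budget_gb:.0f} GB, kv_cache~={kv_cache_gb:.0f} GB")
```

A higher utilization leaves more room for the KV cache (and thus longer/more concurrent sequences), but too high a value risks CUDA out-of-memory errors from other processes on the same GPU.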
3.3 Building and Pushing the Image
# Build the image
docker build -t your-registry/vllm-server:latest .
# Push it to the image registry
docker push your-registry/vllm-server:latest
4. Kubernetes Resource Configuration
4.1 Creating the Namespace
# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm
4.2 ConfigMap Configuration
# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: vllm
data:
  MODEL_PATH: "meta-llama/Llama-2-7b-chat-hf"
  GPU_MEMORY_UTILIZATION: "0.85"
  TENSOR_PARALLEL_SIZE: "1"
  MAX_NUM_BATCHED_TOKENS: "8192"
  MAX_NUM_SEQS: "256"
4.3 Secret Configuration (for model access)
# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-secret
  namespace: vllm
type: Opaque
data:
  # echo -n "your-huggingface-token" | base64
  HF_TOKEN: eW91ci1odWdnaW5nZmFjZS10b2tlbg==
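The `data` values of a Secret must be base64-encoded, as the inline `echo -n ... | base64` comment shows. The same encoding in Python (the token string is the placeholder from that comment, not a real token):

```python
import base64

# Base64-encode the placeholder token exactly as `echo -n ... | base64` would.
token = "your-huggingface-token"
encoded = base64.b64encode(token.encode()).decode()
print(encoded)  # -> eW91ci1odWdnaW5nZmFjZS10b2tlbg==
```

Note that base64 is an encoding, not encryption; anyone with read access to the Secret can recover the token, so restrict access via RBAC.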
4.4 PVC Configuration (for model caching)
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: vllm
spec:
  accessModes:
    # ReadWriteOnce binds the volume to a single node; with multiple
    # replicas spread across nodes, use a storage class that supports
    # ReadWriteMany instead
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
4.5 Deployment Configuration
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        gpu: "true"
      containers:
      - name: vllm-container
        image: your-registry/vllm-server:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: MODEL_PATH
        - name: GPU_MEMORY_UTILIZATION
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: GPU_MEMORY_UTILIZATION
        - name: TENSOR_PARALLEL_SIZE
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: TENSOR_PARALLEL_SIZE
        - name: MAX_NUM_BATCHED_TOKENS
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: MAX_NUM_BATCHED_TOKENS
        - name: MAX_NUM_SEQS
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: MAX_NUM_SEQS
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: vllm-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
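Model download and loading can take several minutes, so the probe timings matter: if the liveness probe fires too early, Kubernetes restarts pods that were merely still loading. With the Kubernetes default `failureThreshold` of 3 (not set explicitly above), the rough grace period before a restart can be sketched as:

```python
# Approximate time before a liveness-probe restart, assuming the
# Kubernetes default failureThreshold of 3 (not set explicitly above).
initial_delay_s = 120   # livenessProbe.initialDelaySeconds
period_s = 30           # livenessProbe.periodSeconds
failure_threshold = 3   # Kubernetes default

restart_after_s = initial_delay_s + failure_threshold * period_s
print(restart_after_s)  # -> 210 seconds of grace before a restart
```

If your model takes longer than this to load, raise `initialDelaySeconds` (or use a startupProbe) rather than letting the pod restart-loop.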
4.6 Service Configuration
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
  labels:
    app: vllm
spec:
  selector:
    app: vllm
  ports:
  - name: http   # named so the ServiceMonitor below can reference it
    port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
4.7 Ingress Configuration
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
spec:
  rules:
  - host: vllm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 80
5. Deployment and Management
5.1 Deploying the vLLM Service
# Create the namespace
kubectl apply -f namespace.yaml
# Apply the configuration
kubectl apply -f configmap.yaml
kubectl apply -f secret.yaml
kubectl apply -f pvc.yaml
# Deploy the application
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml
# Check deployment status
kubectl get pods -n vllm
kubectl get services -n vllm
5.2 Scaling Operations
Manual scaling
# Scale up to 3 replicas
kubectl scale deployment vllm-deployment --replicas=3 -n vllm
# Scale down to 1 replica
kubectl scale deployment vllm-deployment --replicas=1 -n vllm
Autoscaling configuration
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
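The HPA's scaling decision follows the formula documented by Kubernetes: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A quick sketch with illustrative numbers against the 70% CPU target above:

```python
import math

def desired_replicas(current_replicas: int, current_value: float, target_value: float) -> int:
    """Kubernetes HPA scaling formula: ceil(current * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * (current_value / target_value))

# 2 replicas averaging 90% CPU against the 70% target -> scale up to 3
print(desired_replicas(2, 90, 70))  # -> 3
# 3 replicas averaging 35% CPU against the 70% target -> scale down to 2
print(desired_replicas(3, 35, 70))  # -> 2
```

With multiple metrics configured (CPU and memory here), the HPA computes a desired count per metric and uses the largest, so a memory-heavy workload will not be scaled down just because CPU is idle.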
5.3 Rolling Updates
# Update the image version
kubectl set image deployment/vllm-deployment \
  vllm-container=your-registry/vllm-server:v2.0 -n vllm
# Watch the rollout status
kubectl rollout status deployment/vllm-deployment -n vllm
# Roll back to the previous version
kubectl rollout undo deployment/vllm-deployment -n vllm
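By default, a Deployment's RollingUpdate strategy allows maxSurge=25% and maxUnavailable=25%, with the surge value rounded up and the unavailable value rounded down. For the 2-replica Deployment above that works out to:

```python
import math

# Default RollingUpdate bounds for a 2-replica Deployment.
replicas = 2
max_surge = math.ceil(replicas * 0.25)         # rounds up   -> 1 extra pod allowed
max_unavailable = math.floor(replicas * 0.25)  # rounds down -> 0 pods may be down

print(max_surge, max_unavailable)  # -> 1 0
```

In other words, the update brings up one new pod, waits for its readiness probe to pass, and only then terminates an old one, so capacity never drops below 2 during the rollout.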
6. Monitoring and Logging
6.1 Prometheus Monitoring Configuration
# servicemonitor.yaml (requires the Prometheus Operator; the "http"
# port name must match a named port on the Service)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
  namespace: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
6.2 Log Collection
# fluentd-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: vllm
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*vllm*.log
      pos_file /var/log/fluentd-vllm.log.pos
      tag kubernetes.*
      format json
    </source>
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name vllm-logs
    </match>
6.3 Health Check Endpoint
# An optional, richer health endpoint that runs alongside vLLM on port 8001
# (the vLLM OpenAI-compatible server already serves a basic /health on port 8000,
# which is what the probes above target)
from flask import Flask, jsonify
import psutil

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Health check endpoint reporting GPU and system resource status."""
    try:
        # Check GPU status
        import torch
        gpu_available = torch.cuda.is_available()
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**3 if gpu_available else 0
        # Check system resources
        cpu_percent = psutil.cpu_percent()
        memory_percent = psutil.virtual_memory().percent
        return jsonify({
            "status": "healthy",
            "gpu_available": gpu_available,
            "gpu_memory_used_gb": gpu_memory_used,
            "cpu_percent": cpu_percent,
            "memory_percent": memory_percent
        })
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8001)
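A caller can interpret this endpoint's JSON with the standard library alone. A minimal sketch that classifies a response of the shape returned above; the sample payloads are illustrative, not captured from a live server:

```python
import json

def is_healthy(payload: str, max_mem_percent: float = 95.0) -> bool:
    """Interpret the /health JSON: healthy status and memory below a threshold."""
    data = json.loads(payload)
    return data.get("status") == "healthy" and data.get("memory_percent", 100.0) < max_mem_percent

# Illustrative payloads matching the response shape above (not from a live server).
ok = '{"status": "healthy", "gpu_available": true, "gpu_memory_used_gb": 12.4, "cpu_percent": 35.0, "memory_percent": 48.2}'
bad = '{"status": "unhealthy", "error": "CUDA out of memory"}'
print(is_healthy(ok), is_healthy(bad))  # -> True False
```

This kind of interpretation belongs in monitoring or alerting code; Kubernetes probes themselves only look at the HTTP status code, not the body.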
7. Troubleshooting
7.1 Common Issues
Insufficient GPU resources
# Check GPU resource allocation for a pod
kubectl describe pod <pod-name> -n vllm
# Check GPU node resources
kubectl top nodes
kubectl describe node <node-name>
Model loading failures
# View pod logs
kubectl logs <pod-name> -n vllm
# Open a shell in the container for debugging
kubectl exec -it <pod-name> -n vllm -- /bin/bash
Network connectivity issues
# Start a throwaway test pod
kubectl run test-pod --image=busybox --rm -it -- /bin/sh
# Inside the test pod, query the service
wget -qO- http://vllm-service.vllm.svc.cluster.local/health
7.2 Performance Tuning
Resource limit optimization
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "24Gi"  # adjust to the model size
    cpu: "6"
  limits:
    nvidia.com/gpu: 1
    memory: "48Gi"  # headroom to avoid OOM kills
    cpu: "12"
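A common rule of thumb for sizing the memory request: fp16 weights take roughly 2 bytes per parameter, and the host memory should cover the weights plus loading and runtime overhead. An illustrative calculation, where the 7B parameter count and the 1.5x overhead factor are assumptions, not measured values:

```python
# Rough host-memory sizing for a 7B-parameter fp16 model (illustrative).
params = 7e9
bytes_per_param = 2     # fp16
overhead_factor = 1.5   # loading/runtime overhead (assumption)

weights_gb = params * bytes_per_param / 1e9          # ~14 GB of weights
suggested_request_gb = weights_gb * overhead_factor  # ~21 GB memory request

print(f"weights~={weights_gb:.0f} GB, request~={suggested_request_gb:.0f} GB")
```

Treat the result as a starting point and refine it from observed usage (`kubectl top pods`) under real load.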
Scheduling strategy optimization
# Use node affinity to pin pods to specific GPU types
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values: ["A100", "V100"]
8. Best Practices
8.1 Security Configuration
- Image security: use minimal base images
- Network policies: restrict pod-to-pod communication
- RBAC: follow the principle of least privilege
- Secret management: keep sensitive values in Kubernetes Secrets
8.2 Cost Optimization
- Resource quotas: set sensible resource requests and limits
- Node pools: use dedicated GPU node pools
- Autoscaling: adjust instance counts dynamically with load
- Spot instances: run non-critical workloads on spot instances
8.3 Operational Recommendations
- Version control: manage Kubernetes manifests in Git
- Progressive delivery: use canary release strategies
- Backup strategy: back up important configuration and data regularly
- Documentation: keep deployment documentation up to date
With the configuration and practices above, you can deploy and operate a vLLM large-language-model service on Kubernetes with high availability and horizontal scalability.