02 - LLM Deployment: Installing Large Models and Scheduling Containers with Kubernetes + vLLM

1. Kubernetes Fundamentals and vLLM Integration Overview

1.1 Why Deploy vLLM on Kubernetes

Kubernetes provides enterprise-grade container orchestration and is particularly well suited to the following vLLM deployment scenarios:

  • Elastic scaling: automatically adjust the number of vLLM instances based on load
  • High availability: automatic failure recovery and load balancing
  • Resource management: fine-grained GPU resource allocation and scheduling
  • Multi-tenant isolation: resource isolation between different models or users
  • Version management: seamless model version upgrades and rollbacks

1.2 Kubernetes and vLLM Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ Master Node │  │Worker Node 1│  │Worker Node 2│          │
│  │             │  │             │  │             │          │
│  │ API Server  │  │ vLLM Pod 1  │  │ vLLM Pod 2  │          │
│  │ Scheduler   │  │ (GPU 0,1)   │  │ (GPU 2,3)   │          │
│  │ Controller  │  │             │  │             │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
└─────────────────────────────────────────────────────────────┘

2. Environment Preparation and Dependency Installation

2.1 Kubernetes Cluster Requirements

Hardware requirements
  • Master node: at least 2 CPUs, 4GB RAM
  • Worker nodes: at least 4 CPUs, 16GB RAM, 1-2 NVIDIA GPUs
  • Network: 10 Gigabit networking between nodes is recommended
Software requirements
  • Kubernetes 1.24+
  • NVIDIA GPU Operator
  • Container runtime (containerd or Docker)
  • kubectl command-line tool

2.2 Installing the NVIDIA GPU Operator

# Add the NVIDIA Helm repository
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

# Install the GPU Operator
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true

2.3 Verifying GPU Resources

# List GPU nodes (assumes worker nodes are labeled gpu=true; the GPU Operator
# also labels them with nvidia.com/gpu.present=true)
kubectl get nodes -l gpu=true

# Inspect GPU resources on a node
kubectl describe node <worker-node-name> | grep -i gpu

# Verify the NVIDIA device plugin pods
kubectl get pods -n gpu-operator
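
As a final end-to-end check, a throwaway Pod that requests one GPU and runs nvidia-smi confirms that scheduling and the device plugin work together. The file name and CUDA base-image tag below are illustrative:

```yaml
# gpu-smoke-test.yaml -- one-shot Pod that requests a single GPU
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-smi
    image: nvidia/cuda:11.8.0-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
```

Apply it with `kubectl apply -f gpu-smoke-test.yaml`; `kubectl logs gpu-smoke-test` should then show the familiar nvidia-smi table.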

3. Building the vLLM Container Image

3.1 Base Dockerfile

FROM nvidia/cuda:11.8.0-devel-ubuntu20.04

# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1

# Install system dependencies
# (python3.9-distutils is required for pip to work under Python 3.9)
RUN apt-get update && apt-get install -y \
    python3.9 \
    python3.9-dev \
    python3.9-distutils \
    python3-pip \
    git \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Create symlinks so "python" points at Python 3.9
RUN ln -s /usr/bin/python3.9 /usr/bin/python && \
    ln -s /usr/bin/pip3 /usr/bin/pip

# Upgrade pip under Python 3.9
RUN python -m pip install --upgrade pip

# Install vLLM (it pulls in a compatible torch version on its own;
# pinning a mismatched torch alongside it causes dependency conflicts)
RUN python -m pip install vllm==0.2.5

# Create the application directory
WORKDIR /app

# Copy the startup script
COPY start_vllm.sh /app/
RUN chmod +x /app/start_vllm.sh

# Expose the API port
EXPOSE 8000

# Startup command
CMD ["/app/start_vllm.sh"]

3.2 Startup Script (start_vllm.sh)

#!/bin/bash

# Model path (Hugging Face repo ID or a local directory)
MODEL_PATH=${MODEL_PATH:-"meta-llama/Llama-2-7b-chat-hf"}

# GPU memory utilization fraction
GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-0.9}

# Tensor parallel size
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}

# Start the OpenAI-compatible vLLM API server
# (this entrypoint serves /v1/completions, /v1/chat/completions and /health;
# the Kubernetes probes below rely on /health)
python -m vllm.entrypoints.openai.api_server \
    --model ${MODEL_PATH} \
    --host 0.0.0.0 \
    --port 8000 \
    --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION} \
    --tensor-parallel-size ${TENSOR_PARALLEL_SIZE} \
    --max-num-batched-tokens 8192 \
    --max-num-seqs 256

3.3 Building and Pushing the Image

# Build the image
docker build -t your-registry/vllm-server:latest .

# Push to the image registry
docker push your-registry/vllm-server:latest

4. Kubernetes Resource Configuration

4.1 Creating the Namespace

# namespace.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: vllm

4.2 ConfigMap Configuration

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: vllm
data:
  MODEL_PATH: "meta-llama/Llama-2-7b-chat-hf"
  GPU_MEMORY_UTILIZATION: "0.85"
  TENSOR_PARALLEL_SIZE: "1"
  MAX_NUM_BATCHED_TOKENS: "8192"
  MAX_NUM_SEQS: "256"

4.3 Secret Configuration (for Model Access)

# secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-secret
  namespace: vllm
type: Opaque
data:
  # echo -n "your-huggingface-token" | base64
  HF_TOKEN: eW91ci1odWdnaW5nZmFjZS10b2tlbg==
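
Secret data values must be base64-encoded; the value above can be generated or verified with Python's standard library:

```python
import base64

# Kubernetes Secret "data" fields hold base64-encoded bytes.
token = "your-huggingface-token"  # the placeholder value from the example above
encoded = base64.b64encode(token.encode()).decode()
print(encoded)  # eW91ci1odWdnaW5nZmFjZS10b2tlbg==

# Decoding recovers the original token:
assert base64.b64decode(encoded).decode() == token
```

Alternatively, `kubectl create secret generic huggingface-secret --from-literal=HF_TOKEN=<token> -n vllm` performs the encoding for you.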

4.4 PVC Configuration (for Model Caching)

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vllm-model-cache
  namespace: vllm
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

4.5 Deployment Configuration

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
  namespace: vllm
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      nodeSelector:
        gpu: "true"
      containers:
      - name: vllm-container
        image: your-registry/vllm-server:latest
        imagePullPolicy: Always
        ports:
        - containerPort: 8000
        env:
        - name: MODEL_PATH
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: MODEL_PATH
        - name: GPU_MEMORY_UTILIZATION
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: GPU_MEMORY_UTILIZATION
        - name: TENSOR_PARALLEL_SIZE
          valueFrom:
            configMapKeyRef:
              name: vllm-config
              key: TENSOR_PARALLEL_SIZE
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: HF_TOKEN
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: 1
            memory: "32Gi"
            cpu: "8"
        volumeMounts:
        - name: model-cache
          mountPath: /root/.cache/huggingface
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
      volumes:
      - name: model-cache
        persistentVolumeClaim:
          claimName: vllm-model-cache
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule

4.6 Service Configuration

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
  namespace: vllm
spec:
  selector:
    app: vllm
  ports:
  - name: http
    port: 80
    targetPort: 8000
    protocol: TCP
  type: ClusterIP

4.7 Ingress Configuration

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: vllm
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/proxy-body-size: "100m"
spec:
  rules:
  - host: vllm.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: vllm-service
            port:
              number: 80
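
With the Ingress in place, clients reach the service over plain HTTP. Assuming the container runs vLLM's OpenAI-compatible server, here is a minimal Python sketch for building a /v1/completions request and parsing the response; the host vllm.example.com comes from the Ingress rule above, and if you run the simpler /generate demo server instead, the payload shapes differ:

```python
import json
from urllib import request

# Host taken from the Ingress rule above; adjust for your environment.
API_URL = "http://vllm.example.com/v1/completions"

def build_completion_request(model: str, prompt: str,
                             max_tokens: int = 64,
                             temperature: float = 0.7) -> bytes:
    """Serialize an OpenAI-style completion request body."""
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }).encode()

def extract_completion_text(response_body: str) -> str:
    """Pull the generated text out of an OpenAI-style completion response."""
    return json.loads(response_body)["choices"][0]["text"]

def complete(prompt: str,
             model: str = "meta-llama/Llama-2-7b-chat-hf") -> str:
    """POST a completion request through the Ingress and return the text."""
    req = request.Request(API_URL,
                          data=build_completion_request(model, prompt),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return extract_completion_text(resp.read().decode())
```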

5. Deployment and Management

5.1 Deploying the vLLM Service

# Create the namespace
kubectl apply -f namespace.yaml

# Apply configuration
kubectl apply -f configmap.yaml
kubectl apply -f secret.yaml
kubectl apply -f pvc.yaml

# Deploy the application
kubectl apply -f deployment.yaml
kubectl apply -f service.yaml
kubectl apply -f ingress.yaml

# Check deployment status
kubectl get pods -n vllm
kubectl get services -n vllm
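
If you prefer a single command over applying each file by hand, the same manifests can be grouped with a kustomization.yaml (file names taken from the sections above):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - configmap.yaml
  - secret.yaml
  - pvc.yaml
  - deployment.yaml
  - service.yaml
  - ingress.yaml
```

Then `kubectl apply -k .` creates everything in one pass.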

5.2 Scaling Operations

Manual scaling
# Scale up to 3 replicas
kubectl scale deployment vllm-deployment --replicas=3 -n vllm

# Scale down to 1 replica
kubectl scale deployment vllm-deployment --replicas=1 -n vllm
Autoscaling configuration
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: vllm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-deployment
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80

5.3 Rolling Updates

# Update the image version
kubectl set image deployment/vllm-deployment \
  vllm-container=your-registry/vllm-server:v2.0 -n vllm

# Check rollout status
kubectl rollout status deployment/vllm-deployment -n vllm

# Roll back to the previous revision
kubectl rollout undo deployment/vllm-deployment -n vllm
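
Rolling updates and node drains can briefly take replicas offline. A PodDisruptionBudget keeps at least one vLLM Pod serving through voluntary disruptions; a sketch matching the labels used above:

```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: vllm
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
```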

6. Monitoring and Logging

6.1 Prometheus Monitoring Configuration

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
  namespace: vllm
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

6.2 Log Collection

# fluentd-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: vllm
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*vllm*.log
      pos_file /var/log/fluentd-vllm.log.pos
      tag kubernetes.*
      format json
    </source>
    
    <match kubernetes.**>
      @type elasticsearch
      host elasticsearch.logging.svc.cluster.local
      port 9200
      index_name vllm-logs
    </match>

6.3 Health Check Endpoint

# Optional standalone health-check endpoint (a sidecar-style sketch;
# note it listens on port 8001, separate from the vLLM server on 8000)
from flask import Flask, jsonify
import psutil

app = Flask(__name__)

@app.route('/health')
def health_check():
    """Health check endpoint"""
    try:
        # Check GPU status
        import torch
        gpu_available = torch.cuda.is_available()
        gpu_memory_used = torch.cuda.memory_allocated() / 1024**3 if gpu_available else 0

        # Check system resources
        cpu_percent = psutil.cpu_percent()
        memory_percent = psutil.virtual_memory().percent

        return jsonify({
            "status": "healthy",
            "gpu_available": gpu_available,
            "gpu_memory_used_gb": gpu_memory_used,
            "cpu_percent": cpu_percent,
            "memory_percent": memory_percent
        })
    except Exception as e:
        return jsonify({"status": "unhealthy", "error": str(e)}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8001)
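
On the consumer side, the JSON returned by this endpoint can be evaluated with a small helper; the thresholds below are illustrative defaults, not values prescribed by vLLM:

```python
def is_healthy(payload: dict,
               max_cpu_percent: float = 95.0,
               max_memory_percent: float = 95.0) -> bool:
    """Decide whether a /health payload (shape as returned above) looks healthy."""
    if payload.get("status") != "healthy":
        return False
    if not payload.get("gpu_available", False):
        return False  # inference is unusable without a GPU
    if payload.get("cpu_percent", 0.0) > max_cpu_percent:
        return False
    if payload.get("memory_percent", 0.0) > max_memory_percent:
        return False
    return True
```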

7. Troubleshooting

7.1 Common Issues

Insufficient GPU resources
# Check GPU resource allocation
kubectl describe pod <pod-name> -n vllm

# Check GPU node resources
kubectl top nodes
kubectl describe node <node-name>
Model loading failures
# View Pod logs
kubectl logs <pod-name> -n vllm

# Open a shell in the container for debugging
kubectl exec -it <pod-name> -n vllm -- /bin/bash
Network connectivity issues
# Launch a throwaway test pod
kubectl run test-pod --image=busybox --rm -it -- /bin/sh

# From inside the test pod
wget -qO- http://vllm-service.vllm.svc.cluster.local/health

7.2 Performance Tuning

Resource limit tuning
resources:
  requests:
    nvidia.com/gpu: 1
    memory: "24Gi"  # adjust based on model size
    cpu: "6"
  limits:
    nvidia.com/gpu: 1
    memory: "48Gi"  # leave headroom to avoid OOM kills
    cpu: "12"
Scheduling policy tuning
# Use node affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu-type
            operator: In
            values: ["A100", "V100"]

8. Best Practices

8.1 Security Configuration

  1. Image security: use minimal base images
  2. Network policies: restrict pod-to-pod communication
  3. RBAC: follow the principle of least privilege
  4. Secret management: keep sensitive information in Kubernetes Secrets

8.2 Cost Optimization

  1. Resource quotas: set reasonable resource requests and limits
  2. Node pools: use dedicated GPU node pools
  3. Autoscaling: adjust instance counts dynamically based on load
  4. Spot instances: run non-critical services on spot instances
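
When sizing resource requests for a given model, a useful first approximation is weights ≈ parameter count × bytes per parameter, plus headroom for the KV cache and activations. A rough estimator sketch (the 30% headroom fraction is a coarse assumption, not a vLLM constant):

```python
def estimate_gpu_memory_gb(num_params: float,
                           bytes_per_param: int = 2,
                           kv_cache_fraction: float = 0.3) -> float:
    """Rough GPU memory estimate: model weights plus KV-cache/activation headroom.

    bytes_per_param: 2 for fp16/bf16, 1 for int8, 4 for fp32.
    kv_cache_fraction: extra headroom margin (coarse guess, workload-dependent).
    """
    weights_gb = num_params * bytes_per_param / 1e9
    return round(weights_gb * (1 + kv_cache_fraction), 1)

# A 7B-parameter model in fp16: ~14 GB of weights, ~18 GB with headroom.
print(estimate_gpu_memory_gb(7e9))  # 18.2
```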

8.3 Operational Recommendations

  1. Version control: manage Kubernetes configuration in Git
  2. Progressive delivery: use canary release strategies
  3. Backup strategy: back up important configuration and data regularly
  4. Documentation: keep deployment documentation up to date

With the configuration and practices above, you can deploy and manage a vLLM large language model service on Kubernetes and achieve highly available, scalable model inference.
