Deploying Large Models on Kubernetes: From Image Build to Elastic Autoscaling (with Code)
With the rapid advance of large language models (LLMs), serving these powerful models in production places new demands on the entire IT infrastructure. Kubernetes (K8s), the de facto standard for container orchestration, has become an ideal platform for deploying large models thanks to its powerful automated management, resource scheduling, and elastic scaling capabilities.
This article walks you through deploying a large model on Kubernetes step by step, from image build and model-service deployment to the core topic of elastic scaling, with key code examples along the way. We cover how to build a Docker image containing the model inference service, how to deploy it with a Kubernetes Deployment and Service, and how to implement automatic scaling based on GPU utilization with the Horizontal Pod Autoscaler (HPA).
1. Why Deploy Large Models on Kubernetes?
Before diving into the technical details, let's review what Kubernetes brings to large-model deployment:
Resource management and scheduling: K8s intelligently matches compute requirements (CPU, GPU, memory) to node resources, ensuring the model runs on a node with sufficient capacity.
High availability: Through ReplicaSets and the self-healing of Pods, K8s automatically restarts the model service when a Pod crashes, while a Service provides a stable access endpoint.
Elastic scaling: The number of model replicas can grow or shrink automatically based on actual load (request volume, GPU utilization), absorbing traffic spikes while protecting quality of service and controlling cost.
Version management and rolling updates: Model versions can be rolled out smoothly and rolled back if needed, reducing the risk of service iteration.
Mature ecosystem: K8s has a rich set of tools and components, such as Istio (traffic management), Prometheus/Grafana (monitoring), and the NVIDIA Device Plugin (GPU support), providing end-to-end support for large-model deployment.
2. Deployment Workflow Overview
A typical large-model deployment on K8s involves:
Model inference service development: Wrap your model so it can serve an HTTP API (e.g., with FastAPI, vLLM, or Hugging Face TGI).
Docker image build: Package the inference service and its dependencies into a Docker image.
GPU support: Make sure the NVIDIA Device Plugin is installed in the cluster so Pods can request and receive GPU resources.
Kubernetes resource definitions: Create YAML for a Deployment (manages Pods), a Service (access endpoint), a Namespace (environment isolation), and so on.
Autoscaling configuration: Define the scaling policy with the Horizontal Pod Autoscaler (HPA).
Deployment and monitoring: Apply the resources to the cluster and track performance with monitoring tools (e.g., Prometheus + Grafana).
3. Step 1: Developing the Model Inference Service and Its API
To deploy a model on Kubernetes, you first need an inference service that runs inside a container and exposes an API. As an example, we wrap a custom large model behind an OpenAI-compatible API, using FastAPI as the web framework and assuming a CustomModelService class handles the actual inference.
(Tip: If you already have an OpenAI-compatible inference server such as vLLM or TGI, skip this part and go straight to the image build.)
app/main.py (model service code)
<PYTHON>
# app/main.py
from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field, StrictFloat, StrictInt
import time
import uuid
import json
import random
import asyncio # For async model simulation
from typing import List, Optional, Dict, Any
# --- Pydantic Models (Simplified from previous example) ---
class Message(BaseModel):
role: str
content: str
class ChatCompletionRequest(BaseModel):
model: str
messages: List[Message]
temperature: Optional[StrictFloat] = 1.0
max_tokens: Optional[StrictInt] = 150
stream: Optional[bool] = False
stop: Optional[List[str]] = None
class ChatCompletionMessage(BaseModel):
role: str
content: Optional[str] = None
class Choice(BaseModel):
index: int
message: ChatCompletionMessage
finish_reason: str
class Usage(BaseModel):
prompt_tokens: int
completion_tokens: int
total_tokens: int
class ChatCompletionResponse(BaseModel):
id: str = Field(default_factory=lambda: "chatcmpl-" + str(uuid.uuid4().hex[:10]))
object: str = "chat.completion"
created: int = Field(default_factory=lambda: int(time.time()))
model: str
choices: List[Choice]
usage: Usage
class StreamChoice(BaseModel):
index: int
delta: Dict[str, str] # Dynamically includes role or content
finish_reason: Optional[str] = None
class StreamResponse(BaseModel):
id: str
object: str = "chat.completion.chunk"
created: int
model: str
choices: List[StreamChoice]
# --- Custom Model Service Simulation ---
class CustomModelService:
def __init__(self, model_name="your-custom-model"):
self.model_name = model_name
print(f"Initializing custom model service for: {self.model_name}")
self.tokenizer_len = 10000 # Simulate tokenizer mapping
def _map_messages_to_input(self, messages: List[Message]):
prompt = ""
for msg in messages:
prompt += f"{msg.role.upper()}: {msg.content}\n"
prompt += "ASSISTANT:"
return prompt
def _count_tokens(self, text: str) -> int:
# Replace with actual tokenizer logic
return len(text)
async def infer(self, request_data: ChatCompletionRequest):
model_input = self._map_messages_to_input(request_data.messages)
temperature = request_data.temperature
max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
stop_sequences = request_data.stop
print(f"Model '{self.model_name}' processing input (length: {len(model_input)})...")
# Simulate generating text - this part should contain your actual model inference
# For LLMs, this often involves async operations or GPU computations.
await asyncio.sleep(random.uniform(0.5, 2.0)) # Simulate GPU computation time
generated_text = "This is a simulated response from your custom model. " * random.randint(5, 20)
truncated_text = generated_text[:max_tokens * 5] # Rough limit by simulated words
response_tokens = self._count_tokens(truncated_text)
finish_reason = "stop" if response_tokens < max_tokens else "length"
response_model_message = ChatCompletionMessage(role="assistant", content=truncated_text)
return ChatCompletionResponse(
id=f"chatcmpl-{uuid.uuid4().hex[:10]}",
model=self.model_name,
choices=[Choice(index=0, message=response_model_message, finish_reason=finish_reason)],
usage=Usage(prompt_tokens=self._count_tokens(model_input), completion_tokens=response_tokens, total_tokens=self._count_tokens(model_input) + response_tokens)
)
async def stream_infer(self, request_data: ChatCompletionRequest):
model_input = self._map_messages_to_input(request_data.messages)
temperature = request_data.temperature
max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
stop_sequences = request_data.stop
print(f"Model '{self.model_name}' streaming input (length: {len(model_input)})...")
generated_content = ""
response_tokens = 0
finish_reason = "stop"
unique_id = f"chatcmpl-{uuid.uuid4().hex[:10]}"
# Simulate word-by-word streaming
simulated_words = ["This", "is", "a", "simulated", "streaming", "response", "from", "your", "custom", "model."]
# First chunk with role
first_chunk_delta = {"role": "assistant"}
yield StreamResponse(
id=unique_id,
created=int(time.time()),
model=self.model_name,
choices=[StreamChoice(index=0, delta=first_chunk_delta, finish_reason=None)]
)
for i, word in enumerate(simulated_words):
if response_tokens >= max_tokens:
finish_reason = "length"
break
if stop_sequences and any(word.lower() in seq.lower() for seq in stop_sequences if seq):
finish_reason = "stop"
break
chunk_content = word + " "
generated_content += chunk_content
response_tokens += 1
yield StreamResponse(
id=unique_id,
created=int(time.time()),
model=self.model_name,
choices=[StreamChoice(index=0, delta={"content": chunk_content}, finish_reason=None)]
)
await asyncio.sleep(random.uniform(0.05, 0.2)) # Simulate token generation delay
# Final chunk with finish_reason
yield StreamResponse(
id=unique_id,
created=int(time.time()),
model=self.model_name,
choices=[StreamChoice(index=0, delta={}, finish_reason=finish_reason)]
)
# --- FastAPI Application Setup ---
app = FastAPI(title="Custom LLM API (OpenAI Compatible)")
CUSTOM_MODEL_SERVICE = CustomModelService(model_name="my-awesome-custom-llm")
async def format_sse(chunk: StreamResponse):
return f"data: {json.dumps(chunk.dict())}\n\n"
@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
if request.model != CUSTOM_MODEL_SERVICE.model_name:
raise HTTPException(status_code=400, detail=f"Model '{request.model}' not supported.")
if request.stream:
async def stream_generator():
async for chunk_obj in CUSTOM_MODEL_SERVICE.stream_infer(request):
yield await format_sse(chunk_obj)
return StreamingResponse(stream_generator(), media_type="text/event-stream")
else:
try:
response = await CUSTOM_MODEL_SERVICE.infer(request)
return response
except Exception as e:
print(f"Error during inference: {e}")
raise HTTPException(status_code=500, detail=f"Inference error: {e}")
@app.get("/v1/models")
async def list_models():
return {"object": "list", "data": [{"id": CUSTOM_MODEL_SERVICE.model_name, "object": "model", "created": int(time.time()), "owned_by": "user"}]}
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
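Before containerizing the service, it is worth a quick local smoke test. The sketch below uses the official openai Python SDK (v1+) pointed at the local server; the base_url, the placeholder API key, and the model name my-awesome-custom-llm are assumptions that match the example above.
<PYTHON>
# smoke_test.py -- minimal local check of the OpenAI-compatible endpoint.
# Assumes the FastAPI app above is running locally, e.g. `python app/main.py`.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

# Non-streaming request
resp = client.chat.completions.create(
    model="my-awesome-custom-llm",
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)

# Streaming request
stream = client.chat.completions.create(
    model="my-awesome-custom-llm",
    messages=[{"role": "user", "content": "Stream a short reply."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()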
4. Step 2: Building the Docker Image
Package the service code, the model files (unless you load them dynamically), and all dependencies into a Docker image.
app/requirements.txt
<TEXT>
fastapi
uvicorn
pydantic>=1.10
prometheus_client # If you are adding monitoring metrics
torch # or tensorflow, depending on your model
transformers
# numpy, etc.
Dockerfile
<DOCKERFILE>
# Use a base image with Python and GPU support (e.g., NVIDIA CUDA Toolkit)
# You can find specific CUDA base images on NVIDIA's NGC catalog or Docker Hub.
# For example, if your model needs PyTorch compiled with CUDA 11.8:
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Or use a more complete Python image if you prefer, then install CUDA toolkit separately
# FROM python:3.10-slim-bullseye
# Set environment variables
ENV PYTHONUNBUFFERED=1 \
PORT=8000 \
MODEL_NAME="my-awesome-custom-llm"
# Create app directory
WORKDIR /app
# Install dependencies
COPY app/requirements.txt requirements.txt
# If using pip with a specific Python version from a base image:
RUN apt-get update && apt-get install -y --no-install-recommends \
python3-pip \
&& rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY app/ /app/
# If your model files are large, consider:
# 1. Storing them separately and loading dynamically from a shared volume.
# 2. Using a base image that already contains the model.
# For simplicity here, we assume model files are copied or loaded via code directly.
# If model files are large and need to be included, add:
# COPY models/ /app/models/
# Expose the port the app runs on
EXPOSE 8000
# Command to run the application
# Use gunicorn for production with multiple workers if needed (though for GPU models, often 1 worker)
# Use uvicorn for development and simpler deployments
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Build the image:
From the project root, run:
<BASH>
docker build -t your-dockerhub-username/my-llm-api:latest .
docker push your-dockerhub-username/my-llm-api:latest
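Before pushing, you can optionally verify that the image starts and serves requests. A hedged sketch using the Docker SDK for Python (pip install docker); it assumes the NVIDIA Container Toolkit is installed locally so the container can see a GPU, and reuses the image tag from the commands above.
<PYTHON>
# verify_image.py -- run the freshly built image locally and probe /v1/models.
import time

import docker
import requests

client = docker.from_env()
container = client.containers.run(
    "your-dockerhub-username/my-llm-api:latest",
    detach=True,
    ports={"8000/tcp": 8000},
    # Expose all local GPUs to the container (requires nvidia-container-toolkit)
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
)
try:
    time.sleep(20)  # give the model service time to start
    r = requests.get("http://localhost:8000/v1/models", timeout=10)
    print(r.status_code, r.json())
finally:
    container.stop()
    container.remove()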
5. Step 3: Preparing Kubernetes for GPUs
For K8s Pods to use GPUs, the cluster nodes need the NVIDIA driver installed, and the NVIDIA Device Plugin must be deployed in the cluster (it runs as a DaemonSet on the GPU nodes).
Installing the NVIDIA Device Plugin: You can usually take the Kubernetes Device Plugin YAML from NVIDIA's GitHub repository and deploy it with kubectl apply -f <nvidia-device-plugin.yaml>.
<BASH>
# Example: Apply the device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
Once it is running, the GPU nodes should advertise the nvidia.com/gpu resource through the K8s API.
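To double-check, you can list the allocatable GPUs per node. A minimal sketch using the official kubernetes Python client (pip install kubernetes), assuming your kubeconfig points at the cluster:
<PYTHON>
# check_gpus.py -- list allocatable nvidia.com/gpu per node.
from kubernetes import client, config

config.load_kube_config()  # inside a Pod, use config.load_incluster_config() instead
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: nvidia.com/gpu = {gpus}")
The equivalent kubectl check is kubectl describe node <node-name> and looking for nvidia.com/gpu under Allocatable.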
6. Step 4: Kubernetes Deployment and Service
k8s/deployment.yaml
<YAML>
apiVersion: apps/v1
kind: Deployment
metadata:
name: llm-api-deployment
namespace: default # Or your specific namespace
labels:
app: llm-api
spec:
replicas: 1 # Start with 1 replica
selector:
matchLabels:
app: llm-api
template:
metadata:
labels:
app: llm-api
spec:
containers:
- name: llm-api-container
image: your-dockerhub-username/my-llm-api:latest # Replace with your image
ports:
- containerPort: 8000
# Specify GPU resource requests
resources:
limits:
nvidia.com/gpu: 1 # Request 1 GPU per pod
requests:
nvidia.com/gpu: 1 # Recommend 1 GPU per pod
# Add readiness and liveness probes for better resilience
readinessProbe:
httpGet:
path: /v1/models # Use a health check endpoint (e.g., /models or a custom /health)
port: 8000
initialDelaySeconds: 30 # Give time for model to load
periodSeconds: 10
livenessProbe:
httpGet:
path: /v1/models
port: 8000
initialDelaySeconds: 60 # Give more time for model to load before considering it dead
periodSeconds: 30
env:
- name: MODEL_NAME
value: "my-awesome-custom-llm" # Match your model name
k8s/service.yaml
<YAML>
apiVersion: v1
kind: Service
metadata:
name: llm-api-service
namespace: default # Or your specific namespace
spec:
selector:
app: llm-api # Matches the labels in the Deployment's Pod template
ports:
- protocol: TCP
port: 80 # The port the Service is exposed on within the cluster
targetPort: 8000 # The port your container is listening on
type: ClusterIP # Change to LoadBalancer if you need external access directly from outside the cluster
Deploy to K8s:
<BASH>
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
After applying the manifests, check the status with kubectl get pods, kubectl get svc, and kubectl logs <pod-name>.
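The same check can be scripted, which is convenient in CI/CD pipelines. A sketch with the kubernetes Python client that polls the Deployment until all replicas are available (the Deployment name and namespace match the manifests above):
<PYTHON>
# wait_ready.py -- poll until the llm-api Deployment reports all replicas available.
import time

from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

def wait_for_deployment(name="llm-api-deployment", namespace="default", timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        dep = apps.read_namespaced_deployment(name, namespace)
        desired = dep.spec.replicas or 0
        available = dep.status.available_replicas or 0
        print(f"{name}: {available}/{desired} replicas available")
        if desired > 0 and available == desired:
            return True
        time.sleep(10)  # model loading can take a while, so poll slowly
    return False

if __name__ == "__main__":
    print("Ready" if wait_for_deployment() else "Timed out waiting for rollout")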
7. Step 5: Elastic Scaling with the Horizontal Pod Autoscaler (HPA)
The core of elastic scaling is the HPA, which automatically adjusts the number of Pod replicas based on CPU utilization, memory utilization, or custom metrics. For GPU-based LLM inference, GPU utilization is a more direct and effective scaling signal.
Note: Out of the box, the native HPA only scales on cpu and memory metrics from the metrics-server. To scale on GPU utilization you need:
Prometheus + exporters: Deploy Prometheus and configure kube-state-metrics, node-exporter, and a GPU exporter (such as dcgm-exporter) to collect GPU utilization, GPU memory usage, and related metrics.
A metrics adapter: Install KEDA (Kubernetes Event-driven Autoscaling) or Prometheus Adapter so that Prometheus metrics are exposed through the Kubernetes custom/external metrics APIs (custom.metrics.k8s.io / external.metrics.k8s.io).
HPA configuration: Create an HPA (or KEDA ScaledObject) that references those metrics.
Here we use KEDA, which supports a rich set of trigger sources, Prometheus among them.
a. Installing KEDA
<BASH>
# Install the KEDA operator using Helm (recommended)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace
# Or apply the release manifests directly (check the KEDA documentation for the latest version)
# kubectl apply -f https://github.com/kedacore/keda/releases/download/v<version>/keda-<version>.yaml
b. Prometheus Setup (brief; deploy and configure it yourself)
You will need:
Prometheus server: Deploy Prometheus itself.
Node Exporter: Node-level CPU/memory metrics.
DCGM Exporter (or a similar GPU exporter): GPU metrics such as DCGM_FI_DEV_GPU_UTIL.
kube-state-metrics: Metrics for Pods, Deployments, and other K8s objects.
Prometheus Operator (optional but recommended): Simplifies deploying and managing Prometheus.
Prometheus must be configured with scrape jobs that collect data from the exporters above.
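Before pointing KEDA at a PromQL query, it helps to confirm that the GPU metric actually exists in Prometheus and carries the labels you expect. A small sketch against the Prometheus HTTP API; the server address and the DCGM_FI_DEV_GPU_UTIL metric name are assumptions that depend on your exporter and setup.
<PYTHON>
# check_prom_metric.py -- sanity-check the GPU utilization metric in Prometheus.
import requests

# Assumed address; locally you can use a port-forward, e.g.
#   kubectl -n monitoring port-forward svc/prometheus-service 9090:9090
PROM_URL = "http://localhost:9090"
QUERY = 'avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llm-api-deployment-.*"})'

resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
data = resp.json()

if data["status"] == "success" and data["data"]["result"]:
    # An instant vector: each entry has a label set and a [timestamp, value] pair
    for series in data["data"]["result"]:
        print(series["metric"], "->", series["value"][1], "% average GPU utilization")
else:
    print("Query returned no data -- check the metric name, labels, and scrape config")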
c. KEDA ScaledObject: Scaling on Prometheus GPU Metrics
k8s/hpa-gpu-utilization.yaml
<YAML>
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-api-gpu-hpa
  namespace: default # Your namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: llm-api-deployment # Name of your Deployment
  # Minimum and maximum number of replicas
  minReplicaCount: 1
  maxReplicaCount: 5
  # Triggers define the scaling conditions
  triggers:
    - type: prometheus
      metadata:
        # Prometheus server address (adjust to your setup, e.g. the in-cluster Service name)
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local:9090
        # PromQL query returning the average GPU utilization (0-100) of the Pods
        # managed by our Deployment.
        # IMPORTANT: adapt the metric name and labels to your GPU exporter and
        # Prometheus setup. dcgm-exporter exposes the gauge DCGM_FI_DEV_GPU_UTIL
        # (percent) with a 'pod' label when Kubernetes mode is enabled; other
        # exporters use different names (e.g. gpu_utilization).
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llm-api-deployment-.*"})
        # Scale out when the query result exceeds this value (average GPU utilization > 60%)
        threshold: "60"
        # Only needed by older KEDA versions; newer releases generate the
        # external metric name automatically.
        # metricName: gpu-utilization
Deploy the HPA (ScaledObject):
<BASH>
kubectl apply -f k8s/hpa-gpu-utilization.yaml
Understanding the ScaledObject configuration:
scaleTargetRef: The resource KEDA scales (here, llm-api-deployment).
minReplicaCount / maxReplicaCount: Lower and upper bounds on the replica count.
triggers: The KEDA scaling triggers.
type: prometheus: Use Prometheus as the data source.
metadata.serverAddress: The in-cluster address of the Prometheus server.
metadata.query: A PromQL query that returns a single value representing the average GPU utilization of the target Pods. KEDA compares this value with threshold to decide whether to scale.
metadata.threshold: Scale out when the query result exceeds this value.
metadata.metricName: In older KEDA versions, the name under which the metric is exposed to the HPA through the external metrics API (external.metrics.k8s.io); newer versions generate a unique name automatically, which is why the field is commented out above.
Verifying the HPA:
Once the ScaledObject is applied, KEDA creates a HorizontalPodAutoscaler for your Deployment (typically named keda-hpa-<scaledobject-name>). kubectl get hpa will show it, but the metric value may read <unknown> until KEDA and Prometheus are up and returning data.
More reliable checks (a load-generation sketch follows this list):
Check the KEDA operator logs: kubectl logs -n keda deploy/keda-operator -c keda-operator
Check the Prometheus Adapter logs (if you deployed it separately).
Simulate high load: Send a large number of requests to the model service (for example with locust or a simple curl/Python script, increasing batch size or concurrency) and watch GPU utilization climb.
Watch the HPA: kubectl describe hpa keda-hpa-llm-api-gpu-hpa shows the HPA adjusting the replica count based on the metric value.
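A crude but effective way to generate load is to fire concurrent chat requests. The sketch below uses plain threads with the requests library; the URL assumes a kubectl port-forward of the Service (kubectl port-forward svc/llm-api-service 8080:80), so swap in your LoadBalancer or Ingress address as needed.
<PYTHON>
# load_test.py -- simple concurrent load generator for the chat endpoint.
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

URL = "http://localhost:8080/v1/chat/completions"
PAYLOAD = {
    "model": "my-awesome-custom-llm",
    "messages": [{"role": "user", "content": "Write a short paragraph about Kubernetes."}],
    "max_tokens": 128,
}

def one_request(_: int) -> int:
    r = requests.post(URL, json=PAYLOAD, timeout=120)
    return r.status_code

with ThreadPoolExecutor(max_workers=32) as pool:
    futures = [pool.submit(one_request, i) for i in range(500)]
    ok = sum(1 for f in as_completed(futures) if f.result() == 200)

print(f"{ok}/500 requests succeeded")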
8. Production Considerations
Model storage: Do not bake large model files directly into the Docker image; the image becomes huge and updates and deployments slow to a crawl. Better options (see the download sketch after this list):
Shared persistent volume (PersistentVolume): Store the model on NFS, Ceph, or another CSI-backed volume and mount it into the Pods from the Deployment.
Object storage: Keep the model files in S3, GCS, or similar object storage; when the Pod starts, an init container or sidecar downloads the model to a local directory before the service starts.
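If you take the object-storage route, the init container can be a small download script. A hedged sketch with boto3; the bucket name, key prefix, and target directory are placeholders for your environment, and the target path is assumed to be an emptyDir or PVC shared with the main container.
<PYTHON>
# download_model.py -- init-container sketch: pull model files from S3 into a shared volume.
import os

import boto3

BUCKET = os.environ.get("MODEL_BUCKET", "my-model-bucket")                 # placeholder
PREFIX = os.environ.get("MODEL_PREFIX", "models/my-awesome-custom-llm/")   # placeholder
TARGET = os.environ.get("MODEL_DIR", "/models")                            # shared volume mount

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        rel_path = key[len(PREFIX):]
        if not rel_path or rel_path.endswith("/"):  # skip "directory" placeholder keys
            continue
        local_path = os.path.join(TARGET, rel_path)
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        print(f"Downloading s3://{BUCKET}/{key} -> {local_path}")
        s3.download_file(BUCKET, key, local_path)

print("Model download complete")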
Monitoring and alerting: Beyond GPU utilization, you should also track (an instrumentation sketch follows this list):
Inference latency: Especially P95 and P99.
Throughput: RPS (requests per second).
GPU memory usage: e.g., the DCGM_FI_DEV_FB_USED metric from the DCGM exporter.
CPU/memory usage: Make sure CPU and memory are not becoming the bottleneck either.
Pod health: Liveness/readiness probe results.
Model error rate: Record exceptions raised during inference.
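Request count, error count, and latency can be exported straight from the service with the prometheus_client package already listed in requirements.txt. A minimal standalone sketch is shown below; in practice you would add the same pieces to app/main.py, and the metric names here are illustrative, not a standard.
<PYTHON>
# metrics_demo.py -- sketch: expose Prometheus metrics from a FastAPI service.
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()

REQUESTS = Counter("llm_requests_total", "Chat completion requests received")
ERRORS = Counter("llm_request_errors_total", "Chat completion requests that failed")
LATENCY = Histogram("llm_request_latency_seconds", "End-to-end request latency in seconds")

# Prometheus scrapes this endpoint (add a scrape job or PodMonitor for the pods)
app.mount("/metrics", make_asgi_app())

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    # Only measure the inference endpoint, not /metrics itself
    if not request.url.path.startswith("/v1/chat/completions"):
        return await call_next(request)
    REQUESTS.inc()
    start = time.perf_counter()
    try:
        response = await call_next(request)
    except Exception:
        ERRORS.inc()
        raise
    LATENCY.observe(time.perf_counter() - start)
    if response.status_code >= 500:
        ERRORS.inc()
    return response

@app.post("/v1/chat/completions")
async def fake_completion():
    # Stand-in for the real inference handler from app/main.py
    return {"ok": True}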
Resource requests and limits:
GPU: nvidia.com/gpu: 1 is the baseline request. If the model can serve multiple requests in parallel (batching) while each container still uses a single GPU, tune the HPA policy rather than simply raising the nvidia.com/gpu request.
CPU/memory: Set reasonable CPU and memory requests/limits for the model service based on what you observe at runtime.
Model version management: Use the Deployment rolling-update strategy to bring new model versions online smoothly, with rollback support.
GPU scheduling:
Node taints/tolerations and node selectors/affinity: Use these K8s features if model Pods should only be scheduled onto specific GPU nodes.
GPU partitioning (MIG/MPS): For finer-grained GPU sharing, consider NVIDIA MIG (Multi-Instance GPU), which splits one physical GPU into several isolated instances, or MPS (Multi-Process Service) for concurrent process sharing, so that multiple Pods can share a physical GPU; both add configuration complexity.
9. Conclusion
With Kubernetes we can build a robust, elastic, and highly available serving stack for large-model inference. Every step matters: building the Docker image, deploying the K8s Deployment and Service, and scaling intelligently on GPU utilization. Mastering these techniques helps you turn cutting-edge model capabilities into real production value.
In practice, always adapt and tune the configurations above to your specific model, hardware, and business requirements. Continuous monitoring, alerting, and iteration are the keys to keeping a large-model service running reliably.