As the capabilities of large language models (LLMs) advance rapidly, serving these powerful models in production places new demands on the entire IT infrastructure. Kubernetes (K8s), the de facto standard for container orchestration, has become an ideal platform for deploying large models thanks to its strong automation, resource scheduling, and elastic scaling capabilities.

This article walks you through deploying a large model on Kubernetes step by step, from building the image and deploying the model service to the core topic of autoscaling, with key code examples along the way. We cover how to build a Docker image containing the model inference service, how to deploy it with a Kubernetes Deployment and Service, and how to use the Horizontal Pod Autoscaler (HPA) to scale automatically based on GPU utilization.

1. Why Deploy Large Models on Kubernetes?

Before diving into the technical details, let's look at what Kubernetes brings to large-model deployment:

Resource management and scheduling: K8s intelligently matches compute requirements (CPU, GPU, memory) to node resources, ensuring the model runs on nodes with sufficient capacity.

High availability: Through ReplicaSets and the self-healing behavior of Pods, K8s automatically restarts the model service when a Pod crashes, while a Service provides a stable access endpoint.

Elastic scaling: The number of model replicas can be increased or decreased automatically based on actual load (e.g. request volume, GPU utilization), handling traffic fluctuations efficiently while protecting service quality and controlling cost.

Version management and rolling updates: Model versions can be updated smoothly and rolled back to earlier versions, reducing the risk of service iteration.

Mature ecosystem: K8s has a rich ecosystem of tools and components, such as Istio (traffic management), Prometheus/Grafana (monitoring), and the NVIDIA Device Plugin (GPU support), providing end-to-end support for large-model deployment.

2. Deployment Workflow Overview

A typical large-model deployment workflow on K8s includes:

Model inference service development: Wrap your large model so it can serve requests over an HTTP API (e.g. with FastAPI, vLLM, or Hugging Face TGI).

Docker image build: Package the model inference service and its dependencies into a Docker image.

GPU support configuration: Make sure the NVIDIA Device Plugin is installed in the K8s cluster so Pods can request and receive GPU resources.

Kubernetes resource definitions: Create the YAML files for the Deployment (manages Pods), Service (access endpoint), Namespace (environment isolation), and so on.

Autoscaling configuration: Use the Horizontal Pod Autoscaler (HPA) to define the scaling policy.

Deployment and monitoring: Apply the resources to the K8s cluster and track performance with monitoring tools (e.g. Prometheus + Grafana).

3. Step 1: Developing the Model Inference Service and API Wrapper

To deploy a model on Kubernetes, you first need an inference service that runs inside a container and exposes an API. As an example, we wrap a custom large model behind an OpenAI-compatible API, using FastAPI as the web framework and assuming a CustomModelService class that performs the actual model inference.

(Tip: if you already have an inference server with an OpenAI-compatible API (such as vLLM or TGI), you can skip this part and focus on the image build.)
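For reference, if you take the vLLM route instead, starting its OpenAI-compatible server can be as simple as the sketch below (the model name, port, and entrypoint are illustrative and vary between vLLM versions; check the vLLM documentation for your release):

<BASH>
# Hypothetical example: serve a Hugging Face model behind vLLM's OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 0.0.0.0 --port 8000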

app/main.py (model service code)

<PYTHON>
# app/main.py

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field, StrictFloat, StrictInt
import time
import uuid
import json
import random
import asyncio  # For async model simulation
from typing import List, Optional, Dict, Any

# --- Pydantic Models (Simplified from previous example) ---
class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: Optional[StrictFloat] = 1.0
    max_tokens: Optional[StrictInt] = 150
    stream: Optional[bool] = False
    stop: Optional[List[str]] = None

class ChatCompletionMessage(BaseModel):
    role: str
    content: Optional[str] = None

class Choice(BaseModel):
    index: int
    message: ChatCompletionMessage
    finish_reason: str

class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class ChatCompletionResponse(BaseModel):
    id: str = Field(default_factory=lambda: "chatcmpl-" + str(uuid.uuid4().hex[:10]))
    object: str = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[Choice]
    usage: Usage

class StreamChoice(BaseModel):
    index: int
    delta: Dict[str, str]  # Dynamically includes role or content
    finish_reason: Optional[str] = None

class StreamResponse(BaseModel):
    id: str
    object: str = "chat.completion.chunk"
    created: int
    model: str
    choices: List[StreamChoice]

# --- Custom Model Service Simulation ---
class CustomModelService:
    def __init__(self, model_name="your-custom-model"):
        self.model_name = model_name
        print(f"Initializing custom model service for: {self.model_name}")
        self.tokenizer_len = 10000  # Simulate tokenizer mapping

    def _map_messages_to_input(self, messages: List[Message]):
        prompt = ""
        for msg in messages:
            prompt += f"{msg.role.upper()}: {msg.content}\n"
        prompt += "ASSISTANT:"
        return prompt

    def _count_tokens(self, text: str) -> int:
        # Replace with actual tokenizer logic
        return len(text)

    async def infer(self, request_data: ChatCompletionRequest):
        model_input = self._map_messages_to_input(request_data.messages)
        temperature = request_data.temperature
        max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
        stop_sequences = request_data.stop

        print(f"Model '{self.model_name}' processing input (length: {len(model_input)})...")

        # Simulate generating text - this part should contain your actual model inference.
        # For LLMs, this often involves async operations or GPU computations.
        await asyncio.sleep(random.uniform(0.5, 2.0))  # Simulate GPU computation time

        generated_text = "This is a simulated response from your custom model. " * random.randint(5, 20)
        truncated_text = generated_text[:max_tokens * 5]  # Rough limit by simulated words
        response_tokens = self._count_tokens(truncated_text)
        finish_reason = "stop" if response_tokens < max_tokens else "length"

        response_model_message = ChatCompletionMessage(role="assistant", content=truncated_text)
        return ChatCompletionResponse(
            id=f"chatcmpl-{uuid.uuid4().hex[:10]}",
            model=self.model_name,
            choices=[Choice(index=0, message=response_model_message, finish_reason=finish_reason)],
            usage=Usage(
                prompt_tokens=self._count_tokens(model_input),
                completion_tokens=response_tokens,
                total_tokens=self._count_tokens(model_input) + response_tokens,
            ),
        )

    async def stream_infer(self, request_data: ChatCompletionRequest):
        model_input = self._map_messages_to_input(request_data.messages)
        temperature = request_data.temperature
        max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
        stop_sequences = request_data.stop

        print(f"Model '{self.model_name}' streaming input (length: {len(model_input)})...")

        generated_content = ""
        response_tokens = 0
        finish_reason = "stop"
        unique_id = f"chatcmpl-{uuid.uuid4().hex[:10]}"

        # Simulate word-by-word streaming
        simulated_words = ["This", "is", "a", "simulated", "streaming", "response", "from", "your", "custom", "model."]

        # First chunk with role
        first_chunk_delta = {"role": "assistant"}
        yield StreamResponse(
            id=unique_id,
            created=int(time.time()),
            model=self.model_name,
            choices=[StreamChoice(index=0, delta=first_chunk_delta, finish_reason=None)],
        )

        for i, word in enumerate(simulated_words):
            if response_tokens >= max_tokens:
                finish_reason = "length"
                break
            if stop_sequences and any(word.lower() in seq.lower() for seq in stop_sequences if seq):
                finish_reason = "stop"
                break

            chunk_content = word + " "
            generated_content += chunk_content
            response_tokens += 1
            yield StreamResponse(
                id=unique_id,
                created=int(time.time()),
                model=self.model_name,
                choices=[StreamChoice(index=0, delta={"content": chunk_content}, finish_reason=None)],
            )
            await asyncio.sleep(random.uniform(0.05, 0.2))  # Simulate token generation delay

        # Final chunk with finish_reason
        yield StreamResponse(
            id=unique_id,
            created=int(time.time()),
            model=self.model_name,
            choices=[StreamChoice(index=0, delta={}, finish_reason=finish_reason)],
        )

# --- FastAPI Application Setup ---
app = FastAPI(title="Custom LLM API (OpenAI Compatible)")
CUSTOM_MODEL_SERVICE = CustomModelService(model_name="my-awesome-custom-llm")

async def format_sse(chunk: StreamResponse):
    return f"data: {json.dumps(chunk.dict())}\n\n"

@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
    if request.model != CUSTOM_MODEL_SERVICE.model_name:
        raise HTTPException(status_code=400, detail=f"Model '{request.model}' not supported.")

    if request.stream:
        async def stream_generator():
            async for chunk_obj in CUSTOM_MODEL_SERVICE.stream_infer(request):
                yield await format_sse(chunk_obj)
        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        try:
            response = await CUSTOM_MODEL_SERVICE.infer(request)
            return response
        except Exception as e:
            print(f"Error during inference: {e}")
            raise HTTPException(status_code=500, detail=f"Inference error: {e}")

@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": CUSTOM_MODEL_SERVICE.model_name, "object": "model", "created": int(time.time()), "owned_by": "user"}]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

4. Step 2: Building the Docker Image

We need to package the service code, the model files (if they are not loaded dynamically), and all dependencies into a Docker image.

app/requirements.txt

<TEXT>
fastapi
uvicorn
pydantic>=1.10
prometheus_client # If you are adding monitoring metrics
torch # or tensorflow, depending on your model
transformers
# numpy, etc.

Dockerfile

<DOCKERFILE>
# Use a base image with Python and GPU support (e.g., NVIDIA CUDA Toolkit)
# You can find specific CUDA base images on NVIDIA's NGC catalog or Docker Hub.
# For example, if your model needs PyTorch compiled with CUDA 11.8:
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Or use a more complete Python image if you prefer, then install the CUDA toolkit separately
# FROM python:3.10-slim-bullseye

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PORT=8000 \
    MODEL_NAME="my-awesome-custom-llm"

# Create app directory
WORKDIR /app

# Install dependencies
COPY app/requirements.txt requirements.txt
# The CUDA base image does not ship pip, so install it first:
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY app/ /app/

# If your model files are large, consider:
# 1. Storing them separately and loading dynamically from a shared volume.
# 2. Using a base image that already contains the model.
# For simplicity here, we assume model files are copied or loaded via code directly.
# If model files are large and need to be included, add:
# COPY models/ /app/models/

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
# Use gunicorn for production with multiple workers if needed (though for GPU models, often 1 worker)
# Use uvicorn for development and simpler deployments
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build the image:

From the project root, run:

<BASH>
docker build -t your-dockerhub-username/my-llm-api:latest .
docker push your-dockerhub-username/my-llm-api:latest
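Before deploying to the cluster, a quick local smoke test can save a round trip. A minimal sketch, assuming a local NVIDIA Container Toolkit setup and the endpoints defined above:

<BASH>
# Run the image locally with GPU access
docker run --rm --gpus all -p 8000:8000 your-dockerhub-username/my-llm-api:latest

# In another terminal, exercise the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Hello"}]}'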

5. Step 3: Preparing GPUs in Kubernetes

For a K8s Pod to use GPUs, the cluster nodes must have the NVIDIA driver installed, and the NVIDIA Device Plugin must be deployed in the cluster (it runs as a DaemonSet on the GPU nodes).

Installing the NVIDIA Device Plugin: you can usually fetch the Kubernetes Device Plugin YAML from NVIDIA's GitHub repository and deploy it with kubectl apply -f <nvidia-device-plugin.yaml>.

<BASH>
# Example: Apply the device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Once deployed, the nvidia.com/gpu resource should be registered on the GPU nodes and visible through the K8s API.
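To confirm the plugin is working, check that GPU nodes advertise the resource (the node name below is a placeholder):

<BASH>
# The nvidia.com/gpu resource should appear under Capacity/Allocatable on GPU nodes
kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu

# Or dump a node's allocatable resources as JSON
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable}'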

6. Step 4: Kubernetes Deployment and Service

k8s/deployment.yaml

<YAML>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api-deployment
  namespace: default # Or your specific namespace
  labels:
    app: llm-api
spec:
  replicas: 1 # Start with 1 replica
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-api-container
          image: your-dockerhub-username/my-llm-api:latest # Replace with your image
          ports:
            - containerPort: 8000
          # Specify GPU resource requests
          resources:
            limits:
              nvidia.com/gpu: 1 # Request 1 GPU per pod
            requests:
              nvidia.com/gpu: 1 # Recommend 1 GPU per pod
          # Add readiness and liveness probes for better resilience
          readinessProbe:
            httpGet:
              path: /v1/models # Use a health check endpoint (e.g., /models or a custom /health)
              port: 8000
            initialDelaySeconds: 30 # Give time for model to load
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 60 # Give more time for model to load before considering it dead
            periodSeconds: 30
          env:
            - name: MODEL_NAME
              value: "my-awesome-custom-llm" # Match your model name

k8s/service.yaml

<YAML>
apiVersion: v1
kind: Service
metadata:
  name: llm-api-service
  namespace: default # Or your specific namespace
spec:
  selector:
    app: llm-api # Matches the labels in the Deployment's Pod template
  ports:
    - protocol: TCP
      port: 80 # The port the Service is exposed on within the cluster
      targetPort: 8000 # The port your container is listening on
  type: ClusterIP # Change to LoadBalancer if you need external access directly from outside the cluster

Deploy to K8s:

<BASH>
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

After deploying, you can check the status with kubectl get pods, kubectl get svc, and kubectl logs <pod-name>.
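To hit the service from your workstation without exposing it externally, a quick check via port-forwarding might look like this:

<BASH>
# Forward the Service's port 80 to localhost:8080
kubectl port-forward svc/llm-api-service 8080:80

# In another terminal, send a test request
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Hello"}]}'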

7. Step 5: Autoscaling (Horizontal Pod Autoscaler, HPA)

The core of autoscaling is the HPA, which automatically adjusts the number of Pod replicas based on CPU utilization, memory utilization, or custom metrics. For GPU-based large-model inference, GPU utilization is a more direct and effective scaling signal.

Note: out of the box, the native K8s HPA only understands cpu and memory metrics. To scale on GPU utilization you need:

Prometheus + exporters: deploy Prometheus and configure kube-state-metrics, node-exporter, and a GPU exporter (such as dcgm-exporter) to collect GPU utilization, GPU memory usage, and related metrics.

A metrics adapter: install KEDA (Kubernetes Event-driven Autoscaling) or the Prometheus Adapter so that Prometheus metrics are exposed to the HPA through the Kubernetes custom/external metrics APIs (e.g. custom.metrics.k8s.io).

HPA configuration: create the HPA (or KEDA ScaledObject) resource that references these metrics.

Here we use KEDA as the example; it provides a rich set of trigger sources, Prometheus among them.

a. Installing KEDA

<BASH>
# Install the KEDA operator using Helm (recommended)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Or apply the YAML directly (check the KEDA documentation for the latest version)
# kubectl apply -f https://github.com/kedacore/keda/releases/download/<version>/keda-<version>.yaml

b. Prometheus setup (outline only; deploy and configure it yourself)

You need:

Prometheus server: deploy Prometheus.

Node exporter: node-level CPU/memory metrics.

DCGM exporter (or a similar GPU exporter): GPU metrics (e.g. DCGM_FI_DEV_GPU_UTIL).

kube-state-metrics: exposes metrics for Pods, Deployments, and other K8s objects.

Prometheus Operator (optional but recommended): simplifies deploying and managing Prometheus.

Prometheus must be configured with scrape jobs that collect data from these exporters.
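One possible installation path, shown only as a sketch (the chart names and repository URLs below are the publicly documented ones, but verify them and pin versions for your environment):

<BASH>
# kube-prometheus-stack bundles Prometheus, the Operator, node-exporter, and kube-state-metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

# dcgm-exporter exposes per-GPU metrics such as DCGM_FI_DEV_GPU_UTIL
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring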

c. KEDA ScaledObject: autoscaling on Prometheus GPU metrics

k8s/hpa-gpu-utilization.yaml

<YAML>
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-api-gpu-hpa
  namespace: default # Your namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: llm-api-deployment # Name of your Deployment
  # Minimum and maximum number of replicas
  minReplicaCount: 1
  maxReplicaCount: 5
  # Triggers define the scaling conditions
  triggers:
    - type: prometheus
      metadata:
        # Prometheus server address (adjust if needed, e.g., via Kubernetes service name)
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local:9090
        # Name under which KEDA exposes the query result to the generated HPA
        metricName: gpu-utilization
        # Scale up when the query result goes above 60 (%)
        threshold: "60"
        # The PromQL query should return the average GPU utilization for the pods
        # managed by our Deployment.
        # IMPORTANT: adapt this query to your specific setup and GPU exporter.
        # The exact metric name (e.g., DCGM_FI_DEV_GPU_UTIL) and labels (e.g., 'pod')
        # depend on your exporter configuration and Prometheus setup; inspect your
        # Prometheus targets and metrics. For example, with nvidia-smi-exporter the
        # metric might be 'gpu_utilization'.
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llm-api-deployment-.*"})

Apply the ScaledObject:

<BASH>
kubectl apply -f k8s/hpa-gpu-utilization.yaml

Explanation of the scaling configuration:

scaleTargetRef: the resource to be scaled (here, llm-api-deployment).

minReplicaCount / maxReplicaCount: the lower and upper bounds on the replica count.

triggers: KEDA's scaling triggers.

type: prometheus: use Prometheus as the data source.

metadata.serverAddress: the in-cluster address of the Prometheus server.

metadata.query: a PromQL query that returns a single value representing the target's average GPU utilization. KEDA compares this value against threshold to decide whether to scale.

metadata.threshold: when the query result exceeds this value, a scale-up is triggered.

metadata.metricName: the name under which KEDA exposes the query result to the generated HPA (via the Kubernetes external metrics API), so it appears as the metric the HPA tracks.

Verifying the autoscaler:

Once the ScaledObject is applied, KEDA creates a HorizontalPodAutoscaler for your Deployment (typically named keda-hpa-<scaledobject-name>, so not identical to the ScaledObject name). You can view it with kubectl get hpa, but the metric value will show as <unknown> until KEDA and the Prometheus pipeline are working.

More reliable checks:

Check the KEDA operator logs: kubectl logs -n keda deploy/keda-operator -c keda-operator

Check the Prometheus Adapter logs (if you deployed it separately).

Generate load: send a large number of requests to the model service (e.g. with locust or a simple curl script, increasing batch size or concurrency, as sketched below) and watch whether GPU utilization rises.

Watch the autoscaler: kubectl describe hpa llm-api-gpu-hpa (or the HPA name derived from the ScaledObject, e.g. keda-hpa-llm-api-gpu-hpa); you should see it adjust the replica count based on the metric value.
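A minimal load-generation sketch (assuming the port-forward from earlier is still running on localhost:8080; for serious testing use a dedicated tool such as locust or hey):

<BASH>
# Fire 20 concurrent request loops against the chat endpoint
for i in $(seq 1 20); do
  (
    while true; do
      curl -s -o /dev/null http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Load test"}]}'
    done
  ) &
done

# Meanwhile, watch the HPA status and the replica count
watch kubectl get hpa,pods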

8. Production Considerations

Model storage: for large model files, baking them directly into the Docker image is not recommended; the image becomes huge and updates and deployments slow down. Better options (an init-container sketch follows this list):

Shared persistent volume (PersistentVolume): store the model files on NFS, Ceph, or another CSI-backed volume and mount it into the Pods from the Deployment.

Object storage: keep the model files in S3, GCS, or similar object storage; at Pod startup, an init container or sidecar downloads the model to a local directory before the service starts.
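A sketch of the init-container approach, assuming the model lives under an S3 path and using the amazon/aws-cli image (bucket path, credential handling, and volume choice are placeholders):

<YAML>
# Fragment of the Deployment's Pod template
spec:
  volumes:
    - name: model-store
      emptyDir: {} # Or a PersistentVolumeClaim to reuse the download across restarts
  initContainers:
    - name: fetch-model
      image: amazon/aws-cli:latest
      command: ["aws", "s3", "sync", "s3://my-model-bucket/my-awesome-custom-llm/", "/models/"]
      volumeMounts:
        - name: model-store
          mountPath: /models
  containers:
    - name: llm-api-container
      image: your-dockerhub-username/my-llm-api:latest
      volumeMounts:
        - name: model-store
          mountPath: /app/models # The service loads model weights from here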

Monitoring and alerting: beyond GPU utilization, you should also monitor the following (a small instrumentation sketch follows this list):

Inference latency: especially P95 and P99.

Throughput: RPS (requests per second).

GPU memory usage: metrics such as DCGM_FI_DEV_FB_USED from the DCGM exporter.

CPU/memory utilization: make sure CPU and memory do not become bottlenecks either.

Pod health: readiness/liveness probe results.

Model error rate: log and count exceptions raised during inference.
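Since requirements.txt already lists prometheus_client, one way to expose latency and error metrics from the FastAPI service is sketched below; the metric names and the /metrics mount are illustrative additions, not part of the service code above:

<PYTHON>
# app/metrics.py (illustrative sketch)
import time
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

INFERENCE_LATENCY = Histogram("llm_inference_latency_seconds", "Latency of /v1/chat/completions")
INFERENCE_ERRORS = Counter("llm_inference_errors_total", "Exceptions raised during inference")

def instrument(app: FastAPI) -> None:
    # Expose Prometheus metrics on /metrics for the Prometheus scrape job
    app.mount("/metrics", make_asgi_app())

    @app.middleware("http")
    async def track_latency(request: Request, call_next):
        if request.url.path != "/v1/chat/completions":
            return await call_next(request)
        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            INFERENCE_ERRORS.inc()
            raise
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
        return response

Calling instrument(app) from app/main.py would then register the middleware and the /metrics endpoint.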

Resource requests and limits:

GPU: nvidia.com/gpu: 1 is the basic request. If the model can serve multiple requests in parallel (batching) while each container still uses a single GPU, tune the HPA policy rather than simply raising the nvidia.com/gpu request.

CPU/memory: set reasonable CPU and memory requests/limits for the model service based on what you observe at runtime.

Model version management: use the Deployment's rolling update strategy to roll out new model versions smoothly, with rollback support; an example strategy follows.
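A possible rolling update configuration for a GPU-backed Deployment (the numbers are illustrative; note that maxSurge: 1 temporarily needs one spare GPU in the cluster during a rollout):

<YAML>
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Start one new pod before stopping an old one (requires a spare GPU)
      maxUnavailable: 0  # Never drop below the desired replica count during the rollout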

GPU resource scheduling:

Node taints/tolerations and node selectors/affinity: use these K8s features if you want model Pods scheduled only onto specific GPU-equipped nodes; see the snippet after this list.

GPU partitioning (MIG/MPS): for finer-grained GPU sharing, consider NVIDIA MIG (Multi-Instance GPU), which splits one physical GPU into multiple isolated instances for different Pods, or MPS (Multi-Process Service) for concurrent sharing by multiple processes. Both add configuration complexity.
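A sketch of pinning Pods to GPU nodes, assuming the nodes carry a gpu=true label and an nvidia.com/gpu taint (both the label and the taint are examples; use whatever your cluster defines):

<YAML>
# Fragment of the Deployment's Pod template spec
spec:
  nodeSelector:
    gpu: "true" # Example label applied to GPU nodes
  tolerations:
    - key: "nvidia.com/gpu" # Example taint used to reserve GPU nodes
      operator: "Exists"
      effect: "NoSchedule"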

9. Conclusion

With Kubernetes we can build a robust, elastic, and highly available serving stack for large-model inference. From building the Docker image, to deploying the Kubernetes Deployment and Service, to intelligent autoscaling on GPU utilization, every step matters. Mastering these techniques lets you turn advanced large-model capabilities into real production capacity more effectively.

In practice, always adapt and tune the configurations above to your specific model, hardware, and business requirements. Continuous monitoring, alerting, and iteration are what keep a large-model service running reliably.
