As the capabilities of large language models (LLMs) advance rapidly, serving these powerful models in production places new demands on the entire IT infrastructure. Kubernetes (K8s), the de facto standard for container orchestration, has become an ideal platform for deploying large models thanks to its strong automation, resource scheduling, and elastic scaling capabilities.

This article walks you through deploying a large model on Kubernetes step by step, from building the image and deploying the model service to the core topic of autoscaling, with key code examples along the way. We cover how to build a Docker image containing the model inference service, how to deploy it with a Kubernetes Deployment and Service, and how to use the Horizontal Pod Autoscaler (HPA) to scale automatically based on GPU utilization.

1. Why Deploy Large Models on Kubernetes?

Before diving into the technical details, let's look at what Kubernetes brings to large-model deployment:

Resource management and scheduling: K8s intelligently matches compute requirements (CPU, GPU, memory) to node resources, ensuring the model runs on nodes with sufficient capacity.

High availability: Through ReplicaSets and the self-healing behavior of Pods, K8s automatically restarts the model service when a Pod crashes, while a Service provides a stable access endpoint.

Elastic scaling: The number of model replicas can be increased or decreased automatically based on actual load (e.g. request volume, GPU utilization), handling traffic fluctuations efficiently while protecting service quality and controlling cost.

Version management and rolling updates: Model versions can be updated smoothly and rolled back to earlier versions, reducing the risk of service iteration.

Mature ecosystem: K8s has a rich ecosystem of tools and components, such as Istio (traffic management), Prometheus/Grafana (monitoring), and the NVIDIA Device Plugin (GPU support), providing end-to-end support for large-model deployment.

2. Deployment Workflow Overview

A typical large-model deployment workflow on K8s includes:

Model inference service development: Wrap your large model so it can serve requests over an HTTP API (e.g. with FastAPI, vLLM, or Hugging Face TGI).

Docker image build: Package the model inference service and its dependencies into a Docker image.

GPU support configuration: Make sure the NVIDIA Device Plugin is installed in the K8s cluster so Pods can request and receive GPU resources.

Kubernetes resource definitions: Create the YAML files for the Deployment (manages Pods), Service (access endpoint), Namespace (environment isolation), and so on.

Autoscaling configuration: Use the Horizontal Pod Autoscaler (HPA) to define the scaling policy.

Deployment and monitoring: Apply the resources to the K8s cluster and track performance with monitoring tools (e.g. Prometheus + Grafana).

3. Step 1: Developing the Model Inference Service and API Wrapper

To deploy a model on Kubernetes, you first need an inference service that runs inside a container and exposes an API. As an example, we wrap a custom large model behind an OpenAI-compatible API, using FastAPI as the web framework and assuming a CustomModelService class that performs the actual model inference.

(Tip: if you already have an inference server with an OpenAI-compatible API (such as vLLM or TGI), you can skip this part and focus on the image build.)
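For reference, if you take the vLLM route instead, starting its OpenAI-compatible server can be as simple as the sketch below (the model name, port, and entrypoint are illustrative and vary between vLLM versions; check the vLLM documentation for your release):

<BASH>
# Hypothetical example: serve a Hugging Face model behind vLLM's OpenAI-compatible API
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --host 0.0.0.0 --port 8000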

app/main.py (model service code)

<PYTHON>
# app/main.py

from fastapi import FastAPI, HTTPException, Request
from fastapi.responses import StreamingResponse
from pydantic import BaseModel, Field, StrictFloat, StrictInt
import time
import uuid
import json
import random
import asyncio  # For async model simulation
from typing import List, Optional, Dict, Any

# --- Pydantic Models (Simplified from previous example) ---
class Message(BaseModel):
    role: str
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[Message]
    temperature: Optional[StrictFloat] = 1.0
    max_tokens: Optional[StrictInt] = 150
    stream: Optional[bool] = False
    stop: Optional[List[str]] = None

class ChatCompletionMessage(BaseModel):
    role: str
    content: Optional[str] = None

class Choice(BaseModel):
    index: int
    message: ChatCompletionMessage
    finish_reason: str

class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class ChatCompletionResponse(BaseModel):
    id: str = Field(default_factory=lambda: "chatcmpl-" + str(uuid.uuid4().hex[:10]))
    object: str = "chat.completion"
    created: int = Field(default_factory=lambda: int(time.time()))
    model: str
    choices: List[Choice]
    usage: Usage

class StreamChoice(BaseModel):
    index: int
    delta: Dict[str, str]  # Dynamically includes role or content
    finish_reason: Optional[str] = None

class StreamResponse(BaseModel):
    id: str
    object: str = "chat.completion.chunk"
    created: int
    model: str
    choices: List[StreamChoice]

# --- Custom Model Service Simulation ---
class CustomModelService:
    def __init__(self, model_name="your-custom-model"):
        self.model_name = model_name
        print(f"Initializing custom model service for: {self.model_name}")
        self.tokenizer_len = 10000  # Simulate tokenizer mapping

    def _map_messages_to_input(self, messages: List[Message]):
        prompt = ""
        for msg in messages:
            prompt += f"{msg.role.upper()}: {msg.content}\n"
        prompt += "ASSISTANT:"
        return prompt

    def _count_tokens(self, text: str) -> int:
        # Replace with actual tokenizer logic
        return len(text)

    async def infer(self, request_data: ChatCompletionRequest):
        model_input = self._map_messages_to_input(request_data.messages)
        temperature = request_data.temperature
        max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
        stop_sequences = request_data.stop

        print(f"Model '{self.model_name}' processing input (length: {len(model_input)})...")

        # Simulate generating text - this part should contain your actual model inference.
        # For LLMs, this often involves async operations or GPU computations.
        await asyncio.sleep(random.uniform(0.5, 2.0))  # Simulate GPU computation time

        generated_text = "This is a simulated response from your custom model. " * random.randint(5, 20)
        truncated_text = generated_text[:max_tokens * 5]  # Rough limit by simulated words
        response_tokens = self._count_tokens(truncated_text)
        finish_reason = "stop" if response_tokens < max_tokens else "length"

        response_model_message = ChatCompletionMessage(role="assistant", content=truncated_text)
        return ChatCompletionResponse(
            id=f"chatcmpl-{uuid.uuid4().hex[:10]}",
            model=self.model_name,
            choices=[Choice(index=0, message=response_model_message, finish_reason=finish_reason)],
            usage=Usage(
                prompt_tokens=self._count_tokens(model_input),
                completion_tokens=response_tokens,
                total_tokens=self._count_tokens(model_input) + response_tokens,
            ),
        )

    async def stream_infer(self, request_data: ChatCompletionRequest):
        model_input = self._map_messages_to_input(request_data.messages)
        temperature = request_data.temperature
        max_tokens = request_data.max_tokens if request_data.max_tokens is not None else 150
        stop_sequences = request_data.stop

        print(f"Model '{self.model_name}' streaming input (length: {len(model_input)})...")

        generated_content = ""
        response_tokens = 0
        finish_reason = "stop"
        unique_id = f"chatcmpl-{uuid.uuid4().hex[:10]}"

        # Simulate word-by-word streaming
        simulated_words = ["This", "is", "a", "simulated", "streaming", "response", "from", "your", "custom", "model."]

        # First chunk with role
        first_chunk_delta = {"role": "assistant"}
        yield StreamResponse(
            id=unique_id,
            created=int(time.time()),
            model=self.model_name,
            choices=[StreamChoice(index=0, delta=first_chunk_delta, finish_reason=None)],
        )

        for i, word in enumerate(simulated_words):
            if response_tokens >= max_tokens:
                finish_reason = "length"
                break
            if stop_sequences and any(word.lower() in seq.lower() for seq in stop_sequences if seq):
                finish_reason = "stop"
                break

            chunk_content = word + " "
            generated_content += chunk_content
            response_tokens += 1
            yield StreamResponse(
                id=unique_id,
                created=int(time.time()),
                model=self.model_name,
                choices=[StreamChoice(index=0, delta={"content": chunk_content}, finish_reason=None)],
            )
            await asyncio.sleep(random.uniform(0.05, 0.2))  # Simulate token generation delay

        # Final chunk with finish_reason
        yield StreamResponse(
            id=unique_id,
            created=int(time.time()),
            model=self.model_name,
            choices=[StreamChoice(index=0, delta={}, finish_reason=finish_reason)],
        )

# --- FastAPI Application Setup ---
app = FastAPI(title="Custom LLM API (OpenAI Compatible)")
CUSTOM_MODEL_SERVICE = CustomModelService(model_name="my-awesome-custom-llm")

async def format_sse(chunk: StreamResponse):
    return f"data: {json.dumps(chunk.dict())}\n\n"

@app.post("/v1/chat/completions")
async def create_chat_completion(request: ChatCompletionRequest):
    if request.model != CUSTOM_MODEL_SERVICE.model_name:
        raise HTTPException(status_code=400, detail=f"Model '{request.model}' not supported.")

    if request.stream:
        async def stream_generator():
            async for chunk_obj in CUSTOM_MODEL_SERVICE.stream_infer(request):
                yield await format_sse(chunk_obj)
        return StreamingResponse(stream_generator(), media_type="text/event-stream")
    else:
        try:
            response = await CUSTOM_MODEL_SERVICE.infer(request)
            return response
        except Exception as e:
            print(f"Error during inference: {e}")
            raise HTTPException(status_code=500, detail=f"Inference error: {e}")

@app.get("/v1/models")
async def list_models():
    return {"object": "list", "data": [{"id": CUSTOM_MODEL_SERVICE.model_name, "object": "model", "created": int(time.time()), "owned_by": "user"}]}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

4. Step 2: Building the Docker Image

We need to package the service code, the model files (if they are not loaded dynamically), and all dependencies into a Docker image.

app/requirements.txt

<TEXT>
fastapi
uvicorn
pydantic>=1.10
prometheus_client # If you are adding monitoring metrics
torch # or tensorflow, depending on your model
transformers
# numpy, etc.

Dockerfile

<DOCKERFILE>
# Use a base image with Python and GPU support (e.g., NVIDIA CUDA Toolkit)
# You can find specific CUDA base images on NVIDIA's NGC catalog or Docker Hub.
# For example, if your model needs PyTorch compiled with CUDA 11.8:
FROM nvidia/cuda:11.8.0-base-ubuntu22.04
# Or use a more complete Python image if you prefer, then install the CUDA toolkit separately
# FROM python:3.10-slim-bullseye

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
    PORT=8000 \
    MODEL_NAME="my-awesome-custom-llm"

# Create app directory
WORKDIR /app

# Install dependencies
COPY app/requirements.txt requirements.txt
# The CUDA base image does not ship pip, so install it first:
RUN apt-get update && apt-get install -y --no-install-recommends \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY app/ /app/

# If your model files are large, consider:
# 1. Storing them separately and loading dynamically from a shared volume.
# 2. Using a base image that already contains the model.
# For simplicity here, we assume model files are copied or loaded via code directly.
# If model files are large and need to be included, add:
# COPY models/ /app/models/

# Expose the port the app runs on
EXPOSE 8000

# Command to run the application
# Use gunicorn for production with multiple workers if needed (though for GPU models, often 1 worker)
# Use uvicorn for development and simpler deployments
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Build the image:

From the project root, run:

<BASH>
docker build -t your-dockerhub-username/my-llm-api:latest .
docker push your-dockerhub-username/my-llm-api:latest
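Before deploying to the cluster, a quick local smoke test can save a round trip. A minimal sketch, assuming a local NVIDIA Container Toolkit setup and the endpoints defined above:

<BASH>
# Run the image locally with GPU access
docker run --rm --gpus all -p 8000:8000 your-dockerhub-username/my-llm-api:latest

# In another terminal, exercise the OpenAI-compatible endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Hello"}]}'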

5. Step 3: Preparing GPUs in Kubernetes

For a K8s Pod to use GPUs, the cluster nodes must have the NVIDIA driver installed, and the NVIDIA Device Plugin must be deployed in the cluster (it runs as a DaemonSet on the GPU nodes).

Installing the NVIDIA Device Plugin: you can usually fetch the Kubernetes Device Plugin YAML from NVIDIA's GitHub repository and deploy it with kubectl apply -f <nvidia-device-plugin.yaml>.

<BASH>
# Example: Apply the device plugin DaemonSet
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

Once deployed, the nvidia.com/gpu resource should be registered on the GPU nodes and visible through the K8s API.
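To confirm the plugin is working, check that GPU nodes advertise the resource (the node name below is a placeholder):

<BASH>
# The nvidia.com/gpu resource should appear under Capacity/Allocatable on GPU nodes
kubectl describe node <gpu-node-name> | grep -i nvidia.com/gpu

# Or dump a node's allocatable resources as JSON
kubectl get node <gpu-node-name> -o jsonpath='{.status.allocatable}'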

6. Step 4: Kubernetes Deployment and Service

k8s/deployment.yaml

<YAML>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-api-deployment
  namespace: default # Or your specific namespace
  labels:
    app: llm-api
spec:
  replicas: 1 # Start with 1 replica
  selector:
    matchLabels:
      app: llm-api
  template:
    metadata:
      labels:
        app: llm-api
    spec:
      containers:
        - name: llm-api-container
          image: your-dockerhub-username/my-llm-api:latest # Replace with your image
          ports:
            - containerPort: 8000
          # Specify GPU resource requests
          resources:
            limits:
              nvidia.com/gpu: 1 # Request 1 GPU per pod
            requests:
              nvidia.com/gpu: 1 # Recommend 1 GPU per pod
          # Add readiness and liveness probes for better resilience
          readinessProbe:
            httpGet:
              path: /v1/models # Use a health check endpoint (e.g., /models or a custom /health)
              port: 8000
            initialDelaySeconds: 30 # Give time for model to load
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 60 # Give more time for model to load before considering it dead
            periodSeconds: 30
          env:
            - name: MODEL_NAME
              value: "my-awesome-custom-llm" # Match your model name

k8s/service.yaml

<YAML>
apiVersion: v1
kind: Service
metadata:
  name: llm-api-service
  namespace: default # Or your specific namespace
spec:
  selector:
    app: llm-api # Matches the labels in the Deployment's Pod template
  ports:
    - protocol: TCP
      port: 80 # The port the Service is exposed on within the cluster
      targetPort: 8000 # The port your container is listening on
  type: ClusterIP # Change to LoadBalancer if you need external access directly from outside the cluster

Deploy to K8s:

<BASH>
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml

After deploying, you can check the status with kubectl get pods, kubectl get svc, and kubectl logs <pod-name>.
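To hit the service from your workstation without exposing it externally, a quick check via port-forwarding might look like this:

<BASH>
# Forward the Service's port 80 to localhost:8080
kubectl port-forward svc/llm-api-service 8080:80

# In another terminal, send a test request
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Hello"}]}'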

7. Step 5: Autoscaling (Horizontal Pod Autoscaler, HPA)

The core of autoscaling is the HPA, which automatically adjusts the number of Pod replicas based on CPU utilization, memory utilization, or custom metrics. For GPU-based large-model inference, GPU utilization is a more direct and effective scaling signal.

Note: out of the box, the native K8s HPA only understands cpu and memory metrics. To scale on GPU utilization you need:

Prometheus + exporters: deploy Prometheus and configure kube-state-metrics, node-exporter, and a GPU exporter (such as dcgm-exporter) to collect GPU utilization, GPU memory usage, and related metrics.

A metrics adapter: install KEDA (Kubernetes Event-driven Autoscaling) or the Prometheus Adapter so that Prometheus metrics are exposed to the HPA through the Kubernetes custom/external metrics APIs (e.g. custom.metrics.k8s.io).

HPA configuration: create the HPA (or KEDA ScaledObject) resource that references these metrics.

Here we use KEDA as the example; it provides a rich set of trigger sources, Prometheus among them.

a. Installing KEDA

<BASH>
# Install the KEDA operator using Helm (recommended)
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda --namespace keda --create-namespace

# Or apply the YAML directly (check the KEDA documentation for the latest version)
# kubectl apply -f https://github.com/kedacore/keda/releases/download/<version>/keda-<version>.yaml

b. Prometheus setup (outline only; deploy and configure it yourself)

You need:

Prometheus server: deploy Prometheus.

Node exporter: node-level CPU/memory metrics.

DCGM exporter (or a similar GPU exporter): GPU metrics (e.g. DCGM_FI_DEV_GPU_UTIL).

kube-state-metrics: exposes metrics for Pods, Deployments, and other K8s objects.

Prometheus Operator (optional but recommended): simplifies deploying and managing Prometheus.

Prometheus must be configured with scrape jobs that collect data from these exporters.
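One possible installation path, shown only as a sketch (the chart names and repository URLs below are the publicly documented ones, but verify them and pin versions for your environment):

<BASH>
# kube-prometheus-stack bundles Prometheus, the Operator, node-exporter, and kube-state-metrics
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace

# dcgm-exporter exposes per-GPU metrics such as DCGM_FI_DEV_GPU_UTIL
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter --namespace monitoring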

c. KEDA ScaledObject: autoscaling on Prometheus GPU metrics

k8s/hpa-gpu-utilization.yaml

<YAML>
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-api-gpu-hpa
  namespace: default # Your namespace
spec:
  scaleTargetRef:
    kind: Deployment
    name: llm-api-deployment # Name of your Deployment
  # Minimum and maximum number of replicas
  minReplicaCount: 1
  maxReplicaCount: 5
  # Triggers define the scaling conditions
  triggers:
    - type: prometheus
      metadata:
        # Prometheus server address (adjust if needed, e.g., via Kubernetes service name)
        serverAddress: http://prometheus-service.monitoring.svc.cluster.local:9090
        # Name under which KEDA exposes the query result to the generated HPA
        metricName: gpu-utilization
        # Scale up when the query result goes above 60 (%)
        threshold: "60"
        # The PromQL query should return the average GPU utilization for the pods
        # managed by our Deployment.
        # IMPORTANT: adapt this query to your specific setup and GPU exporter.
        # The exact metric name (e.g., DCGM_FI_DEV_GPU_UTIL) and labels (e.g., 'pod')
        # depend on your exporter configuration and Prometheus setup; inspect your
        # Prometheus targets and metrics. For example, with nvidia-smi-exporter the
        # metric might be 'gpu_utilization'.
        query: |
          avg(DCGM_FI_DEV_GPU_UTIL{pod=~"llm-api-deployment-.*"})

Apply the ScaledObject:

<BASH>
kubectl apply -f k8s/hpa-gpu-utilization.yaml

Explanation of the scaling configuration:

scaleTargetRef: the resource to be scaled (here, llm-api-deployment).

minReplicaCount / maxReplicaCount: the lower and upper bounds on the replica count.

triggers: KEDA's scaling triggers.

type: prometheus: use Prometheus as the data source.

metadata.serverAddress: the in-cluster address of the Prometheus server.

metadata.query: a PromQL query that returns a single value representing the target's average GPU utilization. KEDA compares this value against threshold to decide whether to scale.

metadata.threshold: when the query result exceeds this value, a scale-up is triggered.

metadata.metricName: the name under which KEDA exposes the query result to the generated HPA (via the Kubernetes external metrics API), so it appears as the metric the HPA tracks.

Verifying the autoscaler:

Once the ScaledObject is applied, KEDA creates a HorizontalPodAutoscaler for your Deployment (typically named keda-hpa-<scaledobject-name>, so not identical to the ScaledObject name). You can view it with kubectl get hpa, but the metric value will show as <unknown> until KEDA and the Prometheus pipeline are working.

More reliable checks:

Check the KEDA operator logs: kubectl logs -n keda deploy/keda-operator -c keda-operator

Check the Prometheus Adapter logs (if you deployed it separately).

Generate load: send a large number of requests to the model service (e.g. with locust or a simple curl script, increasing batch size or concurrency, as sketched below) and watch whether GPU utilization rises.

Watch the autoscaler: kubectl describe hpa llm-api-gpu-hpa (or the HPA name derived from the ScaledObject, e.g. keda-hpa-llm-api-gpu-hpa); you should see it adjust the replica count based on the metric value.
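A minimal load-generation sketch (assuming the port-forward from earlier is still running on localhost:8080; for serious testing use a dedicated tool such as locust or hey):

<BASH>
# Fire 20 concurrent request loops against the chat endpoint
for i in $(seq 1 20); do
  (
    while true; do
      curl -s -o /dev/null http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "my-awesome-custom-llm", "messages": [{"role": "user", "content": "Load test"}]}'
    done
  ) &
done

# Meanwhile, watch the HPA status and the replica count
watch kubectl get hpa,pods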

8. Production Considerations

Model storage: for large model files, baking them directly into the Docker image is not recommended; the image becomes huge and updates and deployments slow down. Better options (an init-container sketch follows this list):

Shared persistent volume (PersistentVolume): store the model files on NFS, Ceph, or another CSI-backed volume and mount it into the Pods from the Deployment.

Object storage: keep the model files in S3, GCS, or similar object storage; at Pod startup, an init container or sidecar downloads the model to a local directory before the service starts.
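A sketch of the init-container approach, assuming the model lives under an S3 path and using the amazon/aws-cli image (bucket path, credential handling, and volume choice are placeholders):

<YAML>
# Fragment of the Deployment's Pod template
spec:
  volumes:
    - name: model-store
      emptyDir: {} # Or a PersistentVolumeClaim to reuse the download across restarts
  initContainers:
    - name: fetch-model
      image: amazon/aws-cli:latest
      command: ["aws", "s3", "sync", "s3://my-model-bucket/my-awesome-custom-llm/", "/models/"]
      volumeMounts:
        - name: model-store
          mountPath: /models
  containers:
    - name: llm-api-container
      image: your-dockerhub-username/my-llm-api:latest
      volumeMounts:
        - name: model-store
          mountPath: /app/models # The service loads model weights from here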

Monitoring and alerting: beyond GPU utilization, you should also monitor the following (a small instrumentation sketch follows this list):

Inference latency: especially P95 and P99.

Throughput: RPS (requests per second).

GPU memory usage: metrics such as DCGM_FI_DEV_FB_USED from the DCGM exporter.

CPU/memory utilization: make sure CPU and memory do not become bottlenecks either.

Pod health: readiness/liveness probe results.

Model error rate: log and count exceptions raised during inference.
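Since requirements.txt already lists prometheus_client, one way to expose latency and error metrics from the FastAPI service is sketched below; the metric names and the /metrics mount are illustrative additions, not part of the service code above:

<PYTHON>
# app/metrics.py (illustrative sketch)
import time
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

INFERENCE_LATENCY = Histogram("llm_inference_latency_seconds", "Latency of /v1/chat/completions")
INFERENCE_ERRORS = Counter("llm_inference_errors_total", "Exceptions raised during inference")

def instrument(app: FastAPI) -> None:
    # Expose Prometheus metrics on /metrics for the Prometheus scrape job
    app.mount("/metrics", make_asgi_app())

    @app.middleware("http")
    async def track_latency(request: Request, call_next):
        if request.url.path != "/v1/chat/completions":
            return await call_next(request)
        start = time.perf_counter()
        try:
            response = await call_next(request)
        except Exception:
            INFERENCE_ERRORS.inc()
            raise
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
        return response

Calling instrument(app) from app/main.py would then register the middleware and the /metrics endpoint.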

Resource requests and limits:

GPU: nvidia.com/gpu: 1 is the basic request. If the model can serve multiple requests in parallel (batching) while each container still uses a single GPU, tune the HPA policy rather than simply raising the nvidia.com/gpu request.

CPU/memory: set reasonable CPU and memory requests/limits for the model service based on what you observe at runtime.

Model version management: use the Deployment's rolling update strategy to roll out new model versions smoothly, with rollback support; an example strategy follows.
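A possible rolling update configuration for a GPU-backed Deployment (the numbers are illustrative; note that maxSurge: 1 temporarily needs one spare GPU in the cluster during a rollout):

<YAML>
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Start one new pod before stopping an old one (requires a spare GPU)
      maxUnavailable: 0  # Never drop below the desired replica count during the rollout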

GPU resource scheduling:

Node taints/tolerations and node selectors/affinity: use these K8s features if you want model Pods scheduled only onto specific GPU-equipped nodes; see the snippet after this list.

GPU partitioning (MIG/MPS): for finer-grained GPU sharing, consider NVIDIA MIG (Multi-Instance GPU), which splits one physical GPU into multiple isolated instances for different Pods, or MPS (Multi-Process Service) for concurrent sharing by multiple processes. Both add configuration complexity.
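A sketch of pinning Pods to GPU nodes, assuming the nodes carry a gpu=true label and an nvidia.com/gpu taint (both the label and the taint are examples; use whatever your cluster defines):

<YAML>
# Fragment of the Deployment's Pod template spec
spec:
  nodeSelector:
    gpu: "true" # Example label applied to GPU nodes
  tolerations:
    - key: "nvidia.com/gpu" # Example taint used to reserve GPU nodes
      operator: "Exists"
      effect: "NoSchedule"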

9. Conclusion

With Kubernetes we can build a robust, elastic, and highly available serving stack for large-model inference. From building the Docker image, to deploying the Kubernetes Deployment and Service, to intelligent autoscaling on GPU utilization, every step matters. Mastering these techniques lets you turn advanced large-model capabilities into real production capacity more effectively.

In practice, always adapt and tune the configurations above to your specific model, hardware, and business requirements. Continuous monitoring, alerting, and iteration are what keep a large-model service running reliably.
