guidellm: LLM Performance Benchmarking Tool
Ref
https://github.com/vllm-project/guidellm
This is a fairly professional LLM performance benchmarking tool; its engineering implementation is elegant and it goes quite deep technically.
Other benchmarking tools for reference:
sglang/vLLM performance benchmarking: the bench_serving tool
Installation
pip install guidellm --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple
Install from source (git clone)
pip uninstall -y guidellm
git clone https://github.com/vllm-project/guidellm.git
cd guidellm/
pip install -e ./
Docker images
https://github.com/vllm-project/guidellm/pkgs/container/guidellm
Usage examples
export GUIDELLM__OPENAI__API_KEY="xxxxx"
guidellm benchmark \
--target "http://x.x.x.x" \
--model deepseek-v3.1 \
--processor /path/DeepSeek-V3.1-Terminus \
--request-type "chat_completions" \
--backend-args '{"validate_backend": false}' \
--data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=100" \
--rate-type concurrent \
--rate 100 \
--max-requests 128
# --data-sampler "random" \
guidellm benchmark \
--target "http://localhost:30000" \
--model DeepSeek-V3.1-Terminus \
--processor model/DeepSeek-V3.1-Terminus \
--processor-args '{"trust_remote_code": true}' \
--data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=100" \
--rate-type poisson \
--rate 1 \
--max-requests 1536
guidellm benchmark \
--target "http://localhost:30000" \
--model DeepSeek-V3.1-Terminus \
--processor model/DeepSeek-V3.1-Terminus \
--processor-args '{"trust_remote_code": true}' \
--data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=50" \
--rate-type concurrent \
--rate 2560 \
--max-requests 3200
# --random-seed 1
Sample output
ℹ Request Latency Statistics (Completed Requests)
| Benchmark Strategy | Request Latency (s): Mean / Mdn / p99 | TTFT (ms): Mean / Mdn / p99 | ITL (ms): Mean / Mdn / p99 | TPOT (ms): Mean / Mdn / p99 |
|--------------------|---------------------------------------|-----------------------------|----------------------------|-----------------------------|
| poisson            | 11.2 / 11.4 / 12.9                    | 2136.1 / 2256.4 / 2326.5    | xxx / xxx / xxx            | xxx / xxx / xxx             |
ℹ Server Throughput Statistics
| Benchmark Strategy | Requests/s: Mdn / Mean | Request Concurrency: Mdn / Mean | Input Tokens/s: Mdn / Mean | Output Tokens/s: Mdn / Mean | Total Tokens/s: Mdn / Mean |
|--------------------|------------------------|---------------------------------|----------------------------|-----------------------------|----------------------------|
| poisson            | 0.2 / 0.8              | 12.0 / 9.0                      | 5435.5 / 25898.4           | 418.1 / 546.6               | 421.6 / 2687.4             |
Note when comparing results with evalscope: guidellm's ITL (which excludes the first token) corresponds to evalscope's TPOT, but guidellm's TPOT (total generation time including TTFT, divided by all output tokens) is not the same as evalscope's ITL.
As of 0.4.0, the console reports ITL/TTFT as median and p95 by default, and no configuration option for this was found. To report mean, median, and p99 instead, you can patch the code yourself in
src\guidellm\benchmark\outputs\console.py
by changing the default arguments of add_stats and _get_stat_type_name_val:
def add_stats(
    self,
    xxx
    types: Sequence[StatTypesAlias] = ("mean", "median", "p99"),
):
    xxx

@classmethod
def _get_stat_type_name_val(
    cls, stat_type: StatTypesAlias, stats: DistributionSummary | None
) -> tuple[str, float | None]:
    if stat_type == "mean":
        return "Mean", stats.mean if stats else None
    elif stat_type == "median":
        return "Mdn", stats.median if stats else None
    elif stat_type == "p95":
        return "p95", stats.percentiles.p95 if stats else None
    elif stat_type == "p99":
        return "p99", stats.percentiles.p99 if stats else None
    else:
        raise ValueError(f"Unsupported stat type: {stat_type}")
as well as the arguments passed to add_stats by callers such as print_server_throughput_table.
Parameter configuration
How parameters are set
Via command-line options and environment variables, for example (see docs\guides\configuration.md):
export GUIDELLM__OPENAI__API_KEY="your-api-key"
GUIDELLM__REQUEST_TIMEOUT
and so on.
Key parameters
https://github.com/vllm-project/guidellm/blob/main/README.md
The benchmark mode (--rate-type) is usually either the concurrent mode or the poisson mode; then set --rate accordingly:
--rate:
    "Benchmark rate(s) to test. Meaning depends on profile: "
    "sweep=number of benchmarks, concurrent=concurrent requests, "
    "async/constant/poisson=requests per second."
In poisson mode, besides rate, it appears max_concurrency can also be set via an environment variable.
--request-type selects the request type (and therefore the endpoint), e.g. /v1/completions vs /v1/chat/completions:
GenerativeRequestType = Literal[
    "text_completions",
    "chat_completions",
    "audio_transcriptions",
    "audio_translations",
]
To benchmark the text completions endpoint (/v1/completions) instead of the default chat completions endpoint (/v1/chat/completions), use the --request-type text_completions CLI option.
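For intuition, here is a minimal sketch of the payload shapes behind the two request types, assuming a standard OpenAI-compatible server (illustrative only, not guidellm's actual request builder):

# Hypothetical payloads for the two request types (OpenAI-compatible API).
text_completions_payload = {  # POSTed to /v1/completions
    "model": "DeepSeek-V3.1-Terminus",
    "prompt": "Hello, world",
    "max_tokens": 512,
    "stream": True,
}
chat_completions_payload = {  # POSTed to /v1/chat/completions (the default)
    "model": "DeepSeek-V3.1-Terminus",
    "messages": [{"role": "user", "content": "Hello, world"}],
    "max_tokens": 512,
    "stream": True,
}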
Code walkthrough
Argument definitions and entry point
src\guidellm\__main__.py
calls benchmark_generative_text(), defined in src\guidellm\benchmark\entrypoints.py:
async def benchmark_generative_text(
    args: BenchmarkGenerativeTextArgs,
    progress: GenerativeConsoleBenchmarkerProgress | None = None,
    console: Console | None = None,
    **constraints: dict[str, ConstraintInitializer | Any],
) -> tuple[GenerativeBenchmarksReport, dict[str, Any]]:
backend, model = await resolve_backend()
Creates the benchmark backend; the current default is OpenAIHTTPBackend, registered under the name "openai_http".
model is the model id under test, e.g. DeepSeek.
processor = await resolve_processor(processor=args.processor, model=model, console=console)
args.processor is the tokenizer path.
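The processor is the tokenizer used locally (for example to build synthetic prompts of the requested token lengths); conceptually it is loaded roughly like this, a sketch using Hugging Face transformers rather than guidellm's exact code, with --processor-args forwarded as keyword arguments:

from transformers import AutoTokenizer

# --processor is a local tokenizer path; --processor-args such as
# {"trust_remote_code": true} become keyword arguments here.
tokenizer = AutoTokenizer.from_pretrained(
    "model/DeepSeek-V3.1-Terminus",
    trust_remote_code=True,
)
print(len(tokenizer("hello world").input_ids))  # token count used for prompt sizing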
request_loader = await resolve_request_loader(
    data=args.data,
    model=model,
    data_args=args.data_args,
    data_samples=args.data_samples,
    processor=processor,
    processor_args=args.processor_args,
    ...
)
profile = await resolve_profile(
    profile=args.profile,
    rate=args.rate,
    random_seed=args.random_seed,
    constraints=constraints,
    max_seconds=args.max_seconds,
    max_requests=args.max_requests,
    ...
)
benchmarker = Benchmarker()
The core benchmark loop:
async for benchmark in benchmarker.run(
    benchmark_class=args.benchmark_cls,
    requests=request_loader,
    backend=backend,
    profile=profile,
    environment=NonDistributedEnvironment(),
    data=args.data,
    progress=progress,
    sample_requests=args.sample_requests,
    warmup=args.warmup,
    cooldown=args.cooldown,
    prefer_response_metrics=args.prefer_response_metrics,
):
    if benchmark:
        report.benchmarks.append(benchmark)
Metrics collection
GenerativeBenchmark.compile()
GenerativeMetrics.compile()
time_per_output_token_ms=StatusDistributionSummary.from_values(
    value_types=request_types,
    values=[req.time_per_output_token_ms or 0.0 for req in requests],
),
inter_token_latency_ms=StatusDistributionSummary.from_values(
    value_types=request_types,
    values=[req.inter_token_latency_ms or 0.0 for req in requests],
),
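StatusDistributionSummary.from_values aggregates one value per request into a distribution; below is a rough stand-in for the statistics such a summary exposes (not the real class, and the actual percentile method may differ):

import statistics

def summarize(values: list[float]) -> dict[str, float]:
    # mean/median/p99 over per-request metric values, e.g. inter_token_latency_ms
    ordered = sorted(values)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))  # nearest-rank p99
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }

itl_ms = [38.2, 41.0, 39.5, 44.8, 120.3]  # hypothetical per-request ITLs in ms
print(summarize(itl_ms))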
Computation logic for TTFT/ITL/TPOT, etc.
class GenerativeRequestStats(StandardBaseDict):
    def request_latency(self) -> float | None:
        """
        End-to-end request processing latency in seconds.
        :return: Duration from request start to completion, or None if unavailable.
        """
        return self.info.timings.request_end - self.info.timings.request_start

    def time_to_first_token_ms(self) -> float | None:
        """
        Time to first token generation in milliseconds.
        :return: Latency from request start to first token, or None if unavailable.
        """
        return 1000 * (
            self.info.timings.first_iteration - self.info.timings.request_start
        )

    def time_per_output_token_ms(self) -> float | None:
        """
        Average time per output token in milliseconds.
        Includes time for first token and all subsequent tokens.
        :return: Average milliseconds per output token, or None if unavailable.
        """
        return (
            1000
            * (self.info.timings.last_iteration - self.info.timings.request_start)
            / self.output_metrics.total_tokens
        )

    def inter_token_latency_ms(self) -> float | None:
        """
        Average inter-token latency in milliseconds.
        Measures time between token generations, excluding first token.
        :return: Average milliseconds between tokens, or None if unavailable.
        """
        return (
            1000
            * (self.info.timings.last_iteration - self.info.timings.first_iteration)
            / (self.output_metrics.total_tokens - 1)
        )

    @computed_field  # type: ignore[misc]
    @property
    def tokens_per_second(self) -> float | None:
        """
        Overall token throughput including prompt and output tokens.
        :return: Total tokens per second, or None if unavailable.
        """
        if not (latency := self.request_latency) or self.total_tokens is None:
            return None
        return self.total_tokens / latency

    @computed_field  # type: ignore[misc]
    @property
    def output_tokens_per_second(self) -> float | None:
        """
        Output token generation throughput.
        :return: Output tokens per second, or None if unavailable.
        """
        if not (latency := self.request_latency) or self.output_tokens is None:
            return None
        return self.output_tokens / latency

    @computed_field  # type: ignore[misc]
    @property
    def output_tokens_per_iteration(self) -> float | None:
        """
        Average output tokens generated per iteration.
        :return: Output tokens per iteration, or None if unavailable.
        """
        return self.output_tokens / self.info.timings.iterations
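Plugging hypothetical timings into the formulas above makes the TPOT/ITL distinction concrete: TPOT counts the first token (and thus the TTFT wait), while ITL only measures the gaps between tokens.

# Hypothetical timings (seconds) for one request producing 5 output tokens.
request_start, first_iteration, last_iteration = 0.0, 0.5, 1.5
output_tokens = 5

ttft_ms = 1000 * (first_iteration - request_start)                        # 500.0
tpot_ms = 1000 * (last_iteration - request_start) / output_tokens         # 300.0, includes TTFT
itl_ms = 1000 * (last_iteration - first_iteration) / (output_tokens - 1)  # 250.0, excludes first token
print(ttft_ms, tpot_ms, itl_ms)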
Report generation and printing
output_format_results = {}
for key, output in output_formats.items():
    output_result = await output.finalize(report)
    output_format_results[key] = output_result
# print to console
@GenerativeBenchmarkerOutput.register("console")
class GenerativeBenchmarkerConsole(GenerativeBenchmarkerOutput):
    async def finalize(self, report: GenerativeBenchmarksReport) -> str:
        """
        Print the complete benchmark report to the console.
        :param report: The completed benchmark report.
        :return:
        """
        self._print_benchmarks_metadata(report.benchmarks)
        self._print_benchmarks_info(report.benchmarks)
        self._print_benchmarks_stats(report.benchmarks)
Benchmarker.run()
strategies_generator = profile.strategies_generator()
strategy, constraints = next(strategies_generator)
scheduler: Scheduler[RequestT, ResponseT] = Scheduler()
while strategy is not None:
    async for (
        response, request, request_info, scheduler_state,
    ) in scheduler.run(
        requests=requests, backend=backend, strategy=strategy,
        startup_duration=warmup if warmup and warmup >= 1 else 0.0,
        env=environment, **constraints or {},
    ):
        try:
            benchmark_class.update_estimate(
                args, estimated_state, response, request, request_info, scheduler_state,
            )
    # advance to the next strategy produced by the profile
    strategy, constraints = next(strategies_generator)
Multiple benchmarks (one per strategy) are created by iterating the while loop.
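The control flow is a plain generator/consumer pattern; here is a stripped-down sketch with simplified names (not the real Profile/Scheduler classes):

def strategies_generator():
    # Yield one (strategy, constraints) pair per benchmark to run;
    # e.g. a concurrent profile with --rate 10,20 would yield two strategies.
    for rate in (10, 20):
        yield f"concurrent@{rate}", {"max_requests": 128}
    yield None, None  # tells the consumer loop to stop

gen = strategies_generator()
strategy, constraints = next(gen)
while strategy is not None:
    print("running benchmark:", strategy, constraints)
    # ... scheduler.run(...) would drive the requests for this strategy ...
    strategy, constraints = next(gen)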
Scheduler
Settings for the different benchmark modes (poisson, concurrent, etc.):
Profile
SynchronousProfile: "synchronous"
ConcurrentProfile: "concurrent"
ThroughputProfile: "throughput"
AsyncProfile: ["async", "constant", "poisson"]
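For the poisson rate type, requests are dispatched with exponentially distributed inter-arrival gaps so that the long-run arrival rate matches --rate; a minimal sketch of that idea (not guidellm's actual scheduler):

import random

def poisson_send_times(rate_per_s: float, n_requests: int, seed: int = 1) -> list[float]:
    # Offsets in seconds at which to dispatch requests; exponential gaps
    # with mean 1/rate_per_s give a Poisson arrival process.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(rate_per_s)
        times.append(t)
    return times

print(poisson_send_times(rate_per_s=1.0, n_requests=5))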
Backend - OpenAIHTTPBackend
process_startup
validate
process_shutdown
available_models
resolve
response_handler = self._resolve_response_handler(
    request_type=request.request_type
)
src\guidellm\backends\response_handlers.py
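_resolve_response_handler is essentially a lookup keyed by request type; a stripped-down sketch of that registry/dispatch pattern (illustrative, not the real guidellm registry):

from typing import Callable

RESPONSE_HANDLERS: dict[str, Callable[[dict], str]] = {}

def register_handler(request_type: str):
    def decorator(fn):
        RESPONSE_HANDLERS[request_type] = fn
        return fn
    return decorator

@register_handler("chat_completions")
def handle_chat(payload: dict) -> str:
    # Pull the generated text out of a /v1/chat/completions response body.
    return payload["choices"][0]["message"]["content"]

@register_handler("text_completions")
def handle_text(payload: dict) -> str:
    # Pull the generated text out of a /v1/completions response body.
    return payload["choices"][0]["text"]

handler = RESPONSE_HANDLERS["chat_completions"]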