Ref

https://github.com/vllm-project/guidellm

This is a fairly professional LLM performance benchmarking tool; its engineering implementation is elegant and its technical depth is substantial.

Other benchmarking tools for reference:

sglang/vLLM performance benchmarking: the bench_serving tool

Installation

pip install guidellm --upgrade -i https://pypi.tuna.tsinghua.edu.cn/simple

Install from source via git clone:

pip uninstall -y guidellm
git clone https://github.com/vllm-project/guidellm.git
cd guidellm/
pip install -e ./

Docker images

https://github.com/vllm-project/guidellm/pkgs/container/guidellm

Usage examples

export GUIDELLM__OPENAI__API_KEY="xxxxx"

guidellm benchmark \
  --target "http://x.x.x.x" \
  --model deepseek-v3.1 \
  --processor /path/DeepSeek-V3.1-Terminus \
  --request-type "chat_completions" \
  --backend-args '{"validate_backend": false}' \
  --data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=100" \
  --rate-type concurrent \
  --rate 100 \
  --max-requests 128

# optional: --data-sampler "random"

guidellm benchmark \
  --target "http://localhost:30000" \
  --model DeepSeek-V3.1-Terminus \
  --processor model/DeepSeek-V3.1-Terminus \
  --processor-args '{"trust_remote_code": true}' \
  --data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=100" \
  --rate-type poisson \
  --rate 1 \
  --max-requests 1536

guidellm benchmark \
  --target "http://localhost:30000" \
  --model DeepSeek-V3.1-Terminus \
  --processor model/DeepSeek-V3.1-Terminus \
  --processor-args '{"trust_remote_code": true}' \
  --data "prompt_tokens=2048,output_tokens=512,prompt_tokens_stdev=200,output_tokens_stdev=50" \
  --rate-type concurrent \
  --rate 2560 \
  --max-requests 3200

# --random-seed 1
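
The --data string in the commands above describes a synthetic dataset: prompt and output token counts are sampled around the given means with the given standard deviations. A minimal sketch of the idea (my assumption about the sampling, not guidellm's actual code):

import random

def sample_length(mean: float, stdev: float, minimum: int = 1) -> int:
    # Draw a token count from a normal distribution and clamp it to a sane minimum.
    return max(minimum, round(random.gauss(mean, stdev)))

prompt_tokens = sample_length(2048, 200)
output_tokens = sample_length(512, 100)

--random-seed (commented above) would then make these draws reproducible across runs.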

Sample results

ℹ Request Latency Statistics (Completed Requests)
|===========|======|======|======|========|========|========|======|======|======|======|======|======|
| Benchmark | Request Latency  ||| TTFT                   ||| ITL              ||| TPOT             |||
| Strategy  | Sec              ||| ms                     ||| ms               ||| ms               |||
|           | Mean | Mdn  | p99  | Mean   | Mdn    | p99    | Mean | Mdn  | p99  | Mean | Mdn  | p99  |
|-----------|------|------|------|--------|--------|--------|------|------|------|------|------|------|
| poisson   | 11.2 | 11.4 | 12.9 | 2136.1 | 2256.4 | 2326.5 | xxx  | xxx  | xxx  | xxx  | xxx  | xxx  |
|===========|======|======|======|========|========|========|======|======|======|======|======|======|


ℹ Server Throughput Statistics
|===========|=====|======|=======|======|========|=========|========|=======|=======|========|
| Benchmark | Requests               |||| Input Tokens    || Output Tokens || Total Tokens  ||
| Strategy  | Per Sec   || Concurrency || Per Sec         || Per Sec       || Per Sec       ||
|           | Mdn | Mean | Mdn   | Mean | Mdn    | Mean    | Mdn    | Mean  | Mdn   | Mean   |
|-----------|-----|------|-------|------|--------|---------|--------|-------|-------|--------|
| poisson   | 0.2 | 0.8  | 12.0  | 9.0  | 5435.5 | 25898.4 | 418.1  | 546.6 | 421.6 | 2687.4 |
|===========|=====|======|=======|======|========|=========|========|=======|=======|========|

Note when comparing numbers with evalscope (another benchmarking tool popular in China): guidellm's ITL equals evalscope's TPOT, but guidellm's TPOT does not equal evalscope's ITL.
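
This follows from the formulas in GenerativeRequestStats (quoted in full further below). A small worked sketch, assuming a request that starts at t=0 s, yields its first token at t=2 s, finishes at t=10 s, and produces 100 output tokens:

request_start, first_iter, last_iter, out_tokens = 0.0, 2.0, 10.0, 100

ttft = 1000 * (first_iter - request_start)                 # 2000 ms
tpot = 1000 * (last_iter - request_start) / out_tokens     # 100 ms, first token included
itl = 1000 * (last_iter - first_iter) / (out_tokens - 1)   # ~80.8 ms, first token excluded

evalscope's TPOT is defined like guidellm's ITL (average gap between tokens after the first), which is why the two line up in one direction but not the other.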

As of the current 0.4.0 release, ITL/TTFT are reported as median and p95 by default, and there does not appear to be a configuration option for this.

To report mean, mdn, and p99 instead, you can modify the code yourself:

src/guidellm/benchmark/outputs/console.py

The relevant defaults are in add_stats and _get_stat_type_name_val:

    def add_stats(
        self,
xxx
        types: Sequence[StatTypesAlias] = ("mean", "median", "p99"),
    ):
        xxx
    @classmethod
    def _get_stat_type_name_val(
        cls, stat_type: StatTypesAlias, stats: DistributionSummary | None
    ) -> tuple[str, float | None]:
        if stat_type == "mean":
            return "Mean", stats.mean if stats else None
        elif stat_type == "median":
            return "Mdn", stats.median if stats else None
        elif stat_type == "p95":
            return "p95", stats.percentiles.p95 if stats else None
        elif stat_type == "p99":
            return "p99", stats.percentiles.p99 if stats else None
        else:
            raise ValueError(f"Unsupported stat type: {stat_type}")

You also need to adjust the arguments that callers such as print_server_throughput_table pass to add_stats, along the lines of the sketch below.
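
For example, a call-site change might look roughly like this (a hypothetical excerpt; the actual call sites in console.py may differ):

    # hypothetical call site inside print_server_throughput_table:
    table.add_stats(
        benchmark.metrics.requests_per_second,  # names here are illustrative
        types=("mean", "median", "p99"),        # was ("median", "p95") by default
    )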

Parameter settings

How to set parameters

Parameters can be set via command-line arguments or environment variables.

For example (see docs/guides/configuration.md):

export GUIDELLM__OPENAI__API_KEY="your-api-key"

GUIDELLM__REQUEST_TIMEOUT

and so on.
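
The double-underscore names follow pydantic-settings style nested configuration: GUIDELLM__OPENAI__API_KEY maps onto an openai.api_key field. A minimal sketch of the mechanism, assuming pydantic-settings (field names and defaults here are illustrative):

from pydantic import BaseModel
from pydantic_settings import BaseSettings, SettingsConfigDict

class OpenAISettings(BaseModel):
    api_key: str | None = None

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="GUIDELLM__", env_nested_delimiter="__")
    openai: OpenAISettings = OpenAISettings()
    request_timeout: float = 300.0  # illustrative default

# GUIDELLM__OPENAI__API_KEY and GUIDELLM__REQUEST_TIMEOUT override these at load time.
settings = Settings()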

Important parameters

https://github.com/vllm-project/guidellm/blob/main/README.md

For the benchmarking method, --rate-type is usually set to concurrent mode or poisson mode.

Then set the --rate parameter accordingly. From the CLI help:

--rate: "Benchmark rate(s) to test. Meaning depends on profile: sweep=number of benchmarks, concurrent=concurrent requests, async/constant/poisson=requests per second."

In poisson mode, besides --rate, it appears that max_concurrency can also be set via an environment variable.

--request-type selects the benchmark type and endpoint, e.g. /v1/completions vs /v1/chat/completions:

GenerativeRequestType = Literal[
    "text_completions",
    "chat_completions",
    "audio_transcriptions",
    "audio_translations",
]

To benchmark the text completions endpoint (/v1/completions) instead of the default chat completions endpoint (/v1/chat/completions), you need to use the --request-type text_completions CLI option.
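
The request types map onto the standard OpenAI-compatible paths; roughly (mapping assumed from the endpoint names, not copied from the source):

ENDPOINTS = {
    "text_completions": "/v1/completions",
    "chat_completions": "/v1/chat/completions",
    "audio_transcriptions": "/v1/audio/transcriptions",
    "audio_translations": "/v1/audio/translations",
}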

Code logic

Argument definitions and run entrypoint

src/guidellm/__main__.py

which calls benchmark_generative_text(), defined in src/guidellm/benchmark/entrypoints.py:

async def benchmark_generative_text(
    args: BenchmarkGenerativeTextArgs,
    progress: GenerativeConsoleBenchmarkerProgress | None = None,
    console: Console | None = None,
    **constraints: dict[str, ConstraintInitializer | Any],
) -> tuple[GenerativeBenchmarksReport, dict[str, Any]]:

backend, model = await resolve_backend()

This creates the benchmark backend; the current default is the OpenAIHTTPBackend registered under the name "openai_http".

model is the id of the model under test, e.g. DeepSeek.

processor = await resolve_processor(processor=args.processor, model=model, console=console)

args.processor: "Tokenizer path"

request_loader = await resolve_request_loader(
        data=args.data,
        model=model,
        data_args=args.data_args,
        data_samples=args.data_samples,
        processor=processor,
        processor_args=args.processor_args,
        ...
)

profile = await resolve_profile(
        profile=args.profile,
        rate=args.rate,
        random_seed=args.random_seed,
        constraints=constraints,
        max_seconds=args.max_seconds,
        max_requests=args.max_requests,
        ...
)

benchmarker = Benchmarker()

The core benchmark call:

    async for benchmark in benchmarker.run(
        benchmark_class=args.benchmark_cls,
        requests=request_loader,
        backend=backend,
        profile=profile,
        environment=NonDistributedEnvironment(),
        data=args.data,
        progress=progress,
        sample_requests=args.sample_requests,
        warmup=args.warmup,
        cooldown=args.cooldown,
        prefer_response_metrics=args.prefer_response_metrics,
    ):
        if benchmark:
            report.benchmarks.append(benchmark)

Data collection

GenerativeBenchmark.compile()

GenerativeMetrics.compile()

            time_per_output_token_ms=StatusDistributionSummary.from_values(
                value_types=request_types,
                values=[req.time_per_output_token_ms or 0.0 for req in requests],
            ),
            inter_token_latency_ms=StatusDistributionSummary.from_values(
                value_types=request_types,
                values=[req.inter_token_latency_ms or 0.0 for req in requests],
            ),
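
Conceptually, each StatusDistributionSummary reduces the per-request values to summary statistics. A sketch of that reduction with the standard library (not guidellm's actual implementation):

import statistics

def summarize(values: list[float]) -> dict[str, float]:
    ordered = sorted(values)
    # nearest-rank style p99; guidellm's percentile method may differ
    p99 = ordered[min(len(ordered) - 1, round(0.99 * (len(ordered) - 1)))]
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": p99,
    }

print(summarize([80.5, 81.2, 79.9, 120.3]))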

Computation logic for TTFT/ITL/TPOT, etc.

class GenerativeRequestStats(StandardBaseDict):
    def request_latency(self) -> float | None:
        """
        End-to-end request processing latency in seconds.
        :return: Duration from request start to completion, or None if unavailable.
        """
        return self.info.timings.request_end - self.info.timings.request_start

    def time_to_first_token_ms(self) -> float | None:
        """
        Time to first token generation in milliseconds.
        :return: Latency from request start to first token, or None if unavailable.
        """
        return 1000 * (
            self.info.timings.first_iteration - self.info.timings.request_start
        )

    def time_per_output_token_ms(self) -> float | None:
        """
        Average time per output token in milliseconds.
        Includes time for first token and all subsequent tokens.
        :return: Average milliseconds per output token, or None if unavailable.
        """
        return (
            1000
            * (self.info.timings.last_iteration - self.info.timings.request_start)
            / self.output_metrics.total_tokens
        )

    def inter_token_latency_ms(self) -> float | None:
        """
        Average inter-token latency in milliseconds.
        Measures time between token generations, excluding first token.
        :return: Average milliseconds between tokens, or None if unavailable.
        """
        return (
            1000
            * (self.info.timings.last_iteration - self.info.timings.first_iteration)
            / (self.output_metrics.total_tokens - 1)
        )

    @computed_field  # type: ignore[misc]
    @property
    def tokens_per_second(self) -> float | None:
        """
        Overall token throughput including prompt and output tokens.

        :return: Total tokens per second, or None if unavailable.
        """
        if not (latency := self.request_latency) or self.total_tokens is None:
            return None

        return self.total_tokens / latency

    @computed_field  # type: ignore[misc]
    @property
    def output_tokens_per_second(self) -> float | None:
        """
        Output token generation throughput.
        :return: Output tokens per second, or None if unavailable.
        """
        if not (latency := self.request_latency) or self.output_tokens is None:
            return None
        return self.output_tokens / latency

    @computed_field  # type: ignore[misc]
    @property
    def output_tokens_per_iteration(self) -> float | None:
        """
        Average output tokens generated per iteration.
        :return: Output tokens per iteration, or None if unavailable.
        """
        return self.output_tokens / self.info.timings.iterations


Result generation and printing

    output_format_results = {}
    for key, output in output_formats.items():
        output_result = await output.finalize(report)
        output_format_results[key] = output_result

    # print to console
@GenerativeBenchmarkerOutput.register("console")
class GenerativeBenchmarkerConsole(GenerativeBenchmarkerOutput):
    async def finalize(self, report: GenerativeBenchmarksReport) -> str:
        """
        Print the complete benchmark report to the console.

        :param report: The completed benchmark report.
        :return:
        """
        self._print_benchmarks_metadata(report.benchmarks)
        self._print_benchmarks_info(report.benchmarks)
        self._print_benchmarks_stats(report.benchmarks)

Benchmarker.run()

strategies_generator = profile.strategies_generator()
strategy, constraints = next(strategies_generator)

scheduler: Scheduler[RequestT, ResponseT] = Scheduler()
while strategy is not None:
    async for (
        response, request, request_info, scheduler_state,
      ) in scheduler.run(
        requests=requests, backend=backend, strategy=strategy,
        startup_duration=warmup if warmup and warmup >= 1 else 0.0,
        env=environment, **constraints or {},
    ):
        try:
            benchmark_class.update_estimate(
                args, estimated_state, response, request, request_info, scheduler_state,
            )

    # fetch the next strategy from the same generator (loop ends when strategy is None)
    strategy, constraints = next(strategies_generator)

The while loop runs one benchmark per strategy produced by the profile.

Scheduler

Settings for the different benchmark profiles: poisson, concurrent, etc.

Profile

SynchronousProfile: "synchronous"

ConcurrentProfile: "concurrent"

ThroughputProfile: "throughput"

AsyncProfile: ["async", "constant", "poisson"]
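
For the poisson profile, --rate is requests per second, which corresponds to exponentially distributed gaps between request start times. A conceptual sketch of such a schedule (not the actual Scheduler code):

import random

def poisson_schedule(rate_rps: float, n_requests: int, seed: int = 1) -> list[float]:
    # Start offsets (seconds) for a Poisson arrival process with the given rate.
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n_requests):
        t += rng.expovariate(rate_rps)  # mean gap = 1 / rate_rps
        times.append(t)
    return times

print(poisson_schedule(rate_rps=1.0, n_requests=5))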

Backend - OpenAIHTTPBackend

process_startup
validate
process_shutdown

available_models

resolve

response_handler = self._resolve_response_handler(
    request_type=request.request_type
)

src/guidellm/backends/response_handlers.py
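
Each request type resolves to a handler that builds the request payload and parses the (streaming) response. A conceptual registry sketch (structure assumed; names are not from the actual module):

RESPONSE_HANDLERS: dict[str, type] = {}

def register_handler(request_type: str):
    def decorator(cls):
        RESPONSE_HANDLERS[request_type] = cls
        return cls
    return decorator

@register_handler("chat_completions")
class ChatCompletionsHandler:
    def extract_text(self, chunk: dict) -> str:
        # OpenAI-style streaming chunk: choices[0].delta.content
        return chunk["choices"][0].get("delta", {}).get("content") or ""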
