
Understanding LLM Compressor

Optimizing a large language model (LLM) requires balancing three key factors: model size, inference speed, and accuracy. Improving any one of them can hurt the others. For example, improving a model's accuracy usually requires more parameters, which makes the model larger and can slow down inference. Managing these trade-offs is a central challenge when working with LLMs.
LLM Compressor is a model compression tool from the vLLM project. It provides a straightforward way to compress a model using a variety of optimization techniques. LLM Compressor supports the following compression methods:

  • Quantization: converts the model's weights and activations to a lower-bit format (such as int8) to reduce memory usage; a minimal sketch of this idea follows this list.
  • Sparsity: sets a portion of the model's weights to zero (usually in a fixed pattern) to enable more efficient computation.
  • Compression: shrinks the size of the saved model files while trying to preserve performance.
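The sketch below illustrates the idea behind the quantization bullet using plain PyTorch (symmetric, per-tensor int8). It is a toy example of the concept only, not the code path LLM Compressor itself uses.

import torch

def quantize_int8(weight: torch.Tensor):
    # Symmetric per-tensor quantization: map the float range onto [-127, 127].
    scale = weight.abs().max() / 127.0
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original float weights.
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)        # a hypothetical Linear layer weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print((w - w_hat).abs().mean())    # the small error introduced by int8 storage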
LLM Compressor supports the following formats:
  • Activation Quantization: W8A8 (int8 and fp8)
  • Mixed Precision: W4A16, W8A16, NVFP4 (W4A4 and W4A16 support)
  • 2:4 Semi-structured and Unstructured Sparsity

LLM Compressor supports the following algorithms:

  • Simple PTQ
  • GPTQ
  • AWQ
  • SmoothQuant
  • SparseGPT

Prepare the Environment

  1. Refer to 《红帽 AI 推理服务 (vLLM) - 入门篇》 ("Red Hat AI Inference Server (vLLM) - Getting Started") and complete the section "Method 1: Install and run on RHEL".
  2. Create the following requirements.txt file.
(myenv) [root@rhaiis ~] # cat << EOF > requirements.txt
--extra-index-url https://download.pytorch.org/whl/cu128
datasets==4.0.0
llmcompressor==0.6.0
transformers==4.52.0
vllm==0.9.2
lm-eval[vllm]==0.4.9
guidellm==0.2.1
EOF
  3. Install the Python packages.
(myenv) [root@rhaiis ~] # uv pip install --index-strategy unsafe-best-match -r requirements.txt
  4. Run the ibm-granite/granite-3.3-8b-instruct model.
(myenv) [root@rhaiis ~] # vllm serve --max-model-len=16384 ibm-granite/granite-3.3-8b-instruct
  5. In a new terminal window, run the following commands to confirm that the granite-3.3-8b-instruct model can be accessed (a Python client sketch follows the curl commands).
$ curl http://localhost:8000/v1/models | jq .
$ curl -H 'Content-Type: application/json' http://localhost:8000/v1/chat/completions -d '{"model": "ibm-granite/granite-3.3-8b-instruct", "messages": [{"role": "system", "content": "You are a helpful assistant named Steve."},{"role": "user", "content": "Hey! My name is James. What is yours?"}]}' | jq .
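If you prefer Python over curl, the same check can be made through the OpenAI-compatible API that vLLM exposes. This is a minimal sketch; it assumes the openai Python package is installed (it is not in the requirements.txt above), and the API key value is arbitrary because the server runs without authentication.

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API on port 8000 by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="ibm-granite/granite-3.3-8b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant named Steve."},
        {"role": "user", "content": "Hey! My name is James. What is yours?"},
    ],
)
print(resp.choices[0].message.content)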
  6. Stop the granite-3.3-8b-instruct model.

Compress the Model

  1. Clone the llm-compressor-demo repo.
(myenv) [root@rhaiis ~] # git clone https://github.com/rh-aiservices-bu/llm-compressor-demo && cd llm-compressor-demo
  2. Inspect the quantize.py file; it uses a W8A8 compression recipe that combines SmoothQuant with GPTQ (a sketch of how such a recipe is applied follows the snippet).
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, ignore=ignore, mappings=mappings),
    GPTQModifier(
        targets=["Linear"],
        ignore=["lm_head"],
        scheme="W8A8",
        dampening_frac=0.1,
    )
]
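For context, a recipe like this is normally passed to LLM Compressor's oneshot() entry point together with a calibration dataset. The sketch below shows the general shape of that call; the calibration dataset, sample counts, and output path here are illustrative assumptions that may differ from what the repo's quantize.py actually does, and import paths can vary between llmcompressor versions.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

MODEL_ID = "ibm-granite/granite-3.3-8b-instruct"

recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets=["Linear"], ignore=["lm_head"], scheme="W8A8", dampening_frac=0.1),
]

# One-shot post-training compression: calibrate on a small dataset, then save the result.
oneshot(
    model=MODEL_ID,
    dataset="open_platypus",      # hypothetical calibration dataset; quantize.py uses its own
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="quantized/granite-3.3-8b-instruct-quantized.w8a8",
)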
  3. Run the Python script to compress the granite-3.3-8b-instruct model.
(myenv) [root@rhaiis llm-compressor-demo] # python ./quantize.py
tokenizer_config.json: 9.93kB [00:00, 24.4MB/s]
vocab.json: 777kB [00:00, 95.1MB/s]
merges.txt: 442kB [00:00, 168MB/s]
tokenizer.json: 3.48MB [00:00, 272MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 207/207 [00:00<00:00, 2.91MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 801/801 [00:00<00:00, 2.41MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 790/790 [00:00<00:00, 10.3MB/s]
model.safetensors.index.json: 29.8kB [00:00, 161MB/s]
model-00004-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.41G/1.41G [00:04<00:00, 286MB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:10<00:00, 467MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:10<00:00, 465MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:12<00:00, 404MB/s]
Fetching 4 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:12<00:00,  3.11s/it]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 75.74it/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 1.97MB/s]
README.md: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 832/832 [00:00<00:00, 13.8MB/s]
calibration.json.gz: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.58M/4.58M [00:00<00:00, 5.56MB/s]
Generating train split: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:00<00:00, 94998.21 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:00<00:00, 17654.88 examples/s]
Map: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:01<00:00, 1584.03 examples/s]
2025-08-23T01:10:22.544545+0000 | reset | INFO - Compression lifecycle reset
2025-08-23T01:10:22.547051+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-08-23T01:10:23.976977+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-08-23T01:10:23.977138+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `SmoothQuantModifier`
Preparing cache: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:01<00:00, 1963.44it/s]
(1/41): Calibrating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:14<00:00, 208.48it/s]
2025-08-23T01:10:40.739517+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.input_layernorm
2025-08-23T01:10:40.763487+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.post_attention_layernorm
2025-08-23T01:10:40.766679+0000 | _apply_smoothing | INFO - Smoothing with model.layers.0.mlp.up_proj
(1/41): Propagating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:16<00:00, 183.31it/s]
......
(41/41): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:19<00:00, 160.77it/s]
(41/41): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:19<00:00, 160.68it/s]
2025-08-23T01:31:11.338557+0000 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `GPTQModifier`
Preparing cache: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:01<00:00, 2054.81it/s]
(1/41): Calibrating: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [02:14<00:00, 22.79it/s]
2025-08-23T01:33:27.907117+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.q_proj using 3072 samples
2025-08-23T01:33:29.495967+0000 | compress | METRIC - time 1.59s
2025-08-23T01:33:29.496346+0000 | compress | METRIC - error 5197.82
2025-08-23T01:33:29.496733+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:29.496819+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:29.496884+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:29.496943+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:29.497038+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2025-08-23T01:33:29.497207+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.k_proj using 3072 samples
2025-08-23T01:33:30.971860+0000 | compress | METRIC - time 1.47s
2025-08-23T01:33:30.972069+0000 | compress | METRIC - error 1546.07
2025-08-23T01:33:30.972264+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:30.972336+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:30.972397+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:30.972475+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:30.972565+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2025-08-23T01:33:30.972722+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.v_proj using 3072 samples
2025-08-23T01:33:32.461760+0000 | compress | METRIC - time 1.49s
2025-08-23T01:33:32.462179+0000 | compress | METRIC - error 154.27
2025-08-23T01:33:32.462378+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:32.462479+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:32.462546+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:32.462606+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:32.462692+0000 | compress | METRIC - Compressed module size: 8.39168 MB
2025-08-23T01:33:32.462846+0000 | compress_modules | INFO - Quantizing model.layers.0.self_attn.o_proj using 3072 samples
2025-08-23T01:33:33.969965+0000 | compress | METRIC - time 1.51s
2025-08-23T01:33:33.970412+0000 | compress | METRIC - error 21.29
2025-08-23T01:33:33.970657+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:33.970730+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:33.970790+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:33.970846+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:33.970936+0000 | compress | METRIC - Compressed module size: 33.56672 MB
2025-08-23T01:33:33.971101+0000 | compress_modules | INFO - Quantizing model.layers.0.mlp.gate_proj using 3072 samples
2025-08-23T01:33:35.566760+0000 | compress | METRIC - time 1.60s
2025-08-23T01:33:35.567274+0000 | compress | METRIC - error 181.27
2025-08-23T01:33:35.567517+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:35.567597+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:35.567660+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:35.567718+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:35.567805+0000 | compress | METRIC - Compressed module size: 104.896 MB
2025-08-23T01:33:35.567969+0000 | compress_modules | INFO - Quantizing model.layers.0.mlp.up_proj using 3072 samples
2025-08-23T01:33:37.161538+0000 | compress | METRIC - time 1.59s
2025-08-23T01:33:37.162005+0000 | compress | METRIC - error 3.17
2025-08-23T01:33:37.162229+0000 | compress | METRIC - GPU 0 | usage: 16.77% | total memory: 24 GB
2025-08-23T01:33:37.162302+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:37.162363+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:37.162421+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:37.162534+0000 | compress | METRIC - Compressed module size: 104.896 MB
2025-08-23T01:33:37.162703+0000 | compress_modules | INFO - Quantizing model.layers.0.mlp.down_proj using 3072 samples
2025-08-23T01:33:42.765267+0000 | compress | METRIC - time 5.60s
2025-08-23T01:33:42.766561+0000 | compress | METRIC - error 33.16
2025-08-23T01:33:42.766785+0000 | compress | METRIC - GPU 0 | usage: 19.49% | total memory: 24 GB
2025-08-23T01:33:42.766860+0000 | compress | METRIC - GPU 1 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:42.766925+0000 | compress | METRIC - GPU 2 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:42.766982+0000 | compress | METRIC - GPU 3 | usage: 2.06% | total memory: 24 GB
2025-08-23T01:33:42.767072+0000 | compress | METRIC - Compressed module size: 104.869888 MB
(1/41): Propagating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:15<00:00, 202.80it/s]
......
(41/41): Calibrating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:19<00:00, 160.27it/s]
(41/41): Propagating: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3072/3072 [00:19<00:00, 161.39it/s]
2025-08-23T03:21:24.708039+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-08-23T03:21:24.755842+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide`output_dir` as input arg.Ex. `oneshot(..., output_dir=...)`
2025-08-23T03:21:24.756053+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 527it [00:02, 183.33it/s]
  4. Check the disk space used by the locally cached models--ibm-granite--granite-3.3-8b-instruct directory and by the compressed granite-3.3-8b-instruct-quantized.w8a8 model (a quick smoke test of the compressed checkpoint follows the output).
(myenv) [root@rhaiis llm-compressor-demo]# du -sh huggingface/hub/models--ibm-granite--granite-3.3-8b-instruct/
16G     huggingface/hub/models--ibm-granite--granite-3.3-8b-instruct/
(myenv) [root@rhaiis llm-compressor-demo]# du -sh quantized/granite-3.3-8b-instruct-quantized.w8a8
8.2G    quantized/granite-3.3-8b-instruct-quantized.w8a8
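Before running a full benchmark, the compressed checkpoint can be smoke-tested with vLLM's offline Python API. This is a minimal sketch; it assumes it is run from the llm-compressor-demo directory so that the relative path resolves, and the prompt and generation settings are arbitrary.

from vllm import LLM, SamplingParams

# Load the locally saved W8A8 checkpoint produced by quantize.py.
llm = LLM(model="quantized/granite-3.3-8b-instruct-quantized.w8a8", max_model_len=4096)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What is 2 + 2? Answer briefly."], params)
print(outputs[0].outputs[0].text)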

Comparing Model Accuracy

The following uses GSM8K to compare model accuracy. GSM8K is a dataset of more than 8,000 grade-school math word problems with answers, designed to test a model's ability to solve basic math problems that require multi-step reasoning. The snippet below takes a quick look at the dataset.
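As a quick look at what the benchmark contains, this sketch loads GSM8K with the datasets library (already listed in requirements.txt) and prints one question/answer pair; the split sizes match the 7473 train / 1319 test counts that appear in the lm_eval log further below. The dataset id "gsm8k" is assumed to still resolve on the Hugging Face Hub.

from datasets import load_dataset

# GSM8K's "main" configuration contains the standard train/test splits.
gsm8k = load_dataset("gsm8k", "main")
print(len(gsm8k["train"]), len(gsm8k["test"]))   # 7473 1319

sample = gsm8k["test"][0]
print(sample["question"])   # a grade-school word problem
print(sample["answer"])     # step-by-step reasoning ending in "#### <final answer>"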

Install the Test Environment

  1. Clone lm-evaluation-harness and install the test environment.
(myenv) [root@rhaiis ~]# git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness && cd lm-evaluation-harness
(myenv) [root@rhaiis lm-evaluation-harness]# pip install -e .
(myenv) [root@rhaiis lm-evaluation-harness]# pip install lm_eval[vllm]

Accuracy Test of the Original Model

  1. Test the original ibm-granite/granite-3.3-8b-instruct model; the strict-match accuracy score is 0.6687.
(myenv) [root@rhaiis lm-evaluation-harness]# lm_eval --model vllm --model_args pretrained=ibm-granite/granite-3.3-8b-instruct,max_model_len=16384 --tasks gsm8k --num_fewshot 5 --batch_size auto
INFO 08-23 06:48:36 [__init__.py:241] Automatically detected platform cuda.
2025-08-23:06:48:40 INFO     [__main__:446] Selected Tasks: ['gsm8k']
2025-08-23:06:48:40 WARNING  [evaluator:172] pretrained=pretrained=ibm-granite/granite-3.3-8b-instruct,max_model_len=16384 appears to be an instruct or chat variant but chat template is
        not applied. Recommend setting `apply_chat_template` (optionally `fewshot_as_multiturn`).
2025-08-23:06:48:40 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-08-23:06:48:40 INFO     [evaluator:240] Initializing vllm model, with arguments: {'pretrained': 'ibm-granite/granite-3.3-8b-instruct', 'max_model_len': 16384}
INFO 08-23 06:48:40 [utils.py:326] non-default args: {'model': 'ibm-granite/granite-3.3-8b-instruct', 'seed': 1234, 'max_model_len': 16384, 'disable_log_stats': True}
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 790/790 [00:00<00:00, 9.63MB/s]
INFO 08-23 06:48:47 [__init__.py:711] Resolved architecture: GraniteForCausalLM
INFO 08-23 06:48:47 [__init__.py:1750] Using max model len 16384
INFO 08-23 06:48:47 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
tokenizer_config.json: 9.93kB [00:00, 26.9MB/s]
vocab.json: 777kB [00:00, 82.3MB/s]
merges.txt: 442kB [00:00, 167MB/s]
tokenizer.json: 3.48MB [00:00, 279MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 207/207 [00:00<00:00, 1.34MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 801/801 [00:00<00:00, 4.14MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 2.06MB/s]
(EngineCore_0 pid=21471) INFO 08-23 06:48:48 [core.py:636] Waiting for init message from front-end.
(EngineCore_0 pid=21471) INFO 08-23 06:48:48 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config: model='ibm-granite/granite-3.3-8b-instruct', speculative_config=None, tokenizer='ibm-granite/granite-3.3-8b-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=1234, served_model_name=ibm-granite/granite-3.3-8b-instruct, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_0 pid=21471) INFO 08-23 06:48:50 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_0 pid=21471) WARNING 08-23 06:48:50 [topk_topp_sampler.py:61] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_0 pid=21471) INFO 08-23 06:48:50 [gpu_model_runner.py:1953] Starting to load model ibm-granite/granite-3.3-8b-instruct...
(EngineCore_0 pid=21471) INFO 08-23 06:48:51 [gpu_model_runner.py:1985] Loading model from scratch...
(EngineCore_0 pid=21471) INFO 08-23 06:48:51 [cuda.py:328] Using Flash Attention backend on V1 engine.
(EngineCore_0 pid=21471) INFO 08-23 06:48:51 [weight_utils.py:296] Using model weights format ['*.safetensors']
model-00004-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.41G/1.41G [00:04<00:00, 286MB/s]
model-00001-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:10<00:00, 481MB/s]
model-00002-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:10<00:00, 473MB/s]
model-00003-of-00004.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:10<00:00, 467MB/s]
(EngineCore_0 pid=21471) INFO 08-23 06:49:02 [weight_utils.py:312] Time spent downloading weights for ibm-granite/granite-3.3-8b-instruct: 10.772625 seconds████           | 4.56G/4.97G [00:10<00:00, 635MB/s]
model.safetensors.index.json: 29.8kB [00:00, 98.4MB/s]█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏   | 4.83G/4.97G [00:10<00:00, 868MB/s]
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  6.97it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:00,  2.55it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  2.00it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  1.82it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:01<00:00,  2.04it/s]
(EngineCore_0 pid=21471)
(EngineCore_0 pid=21471) INFO 08-23 06:49:04 [default_loader.py:262] Loading weights took 2.09 seconds
(EngineCore_0 pid=21471) INFO 08-23 06:49:04 [gpu_model_runner.py:2007] Model loading took 15.2512 GiB and 13.174672 seconds
(EngineCore_0 pid=21471) INFO 08-23 06:49:12 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9b9cca0c04/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_0 pid=21471) INFO 08-23 06:49:12 [backends.py:559] Dynamo bytecode transform time: 7.96 s
(EngineCore_0 pid=21471) [rank0]:W0823 06:49:14.258000 21471 torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
(EngineCore_0 pid=21471) INFO 08-23 06:49:15 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_0 pid=21471) INFO 08-23 06:49:41 [backends.py:215] Compiling a graph for dynamic shape takes 27.29 s
(EngineCore_0 pid=21471) INFO 08-23 06:50:04 [monitor.py:34] torch.compile takes 35.25 s in total
(EngineCore_0 pid=21471) INFO 08-23 06:50:05 [gpu_worker.py:276] Available KV cache memory: 3.67 GiB
(EngineCore_0 pid=21471) INFO 08-23 06:50:05 [kv_cache_utils.py:849] GPU KV cache size: 24,048 tokens
(EngineCore_0 pid=21471) INFO 08-23 06:50:05 [kv_cache_utils.py:853] Maximum concurrency for 16,384 tokens per request: 1.47x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:08<00:00,  7.81it/s]
(EngineCore_0 pid=21471) INFO 08-23 06:50:14 [gpu_model_runner.py:2708] Graph capturing finished in 9 secs, took 0.57 GiB
(EngineCore_0 pid=21471) INFO 08-23 06:50:14 [core.py:214] init engine (profile, create kv cache, warmup model) took 69.99 seconds
INFO 08-23 06:50:15 [llm.py:298] Supported_tasks: ['generate']
README.md: 7.94kB [00:00, 34.0MB/s]
main/train-00000-of-00001.parquet: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.31M/2.31M [00:00<00:00, 10.9MB/s]
main/test-00000-of-00001.parquet: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419k/419k [00:00<00:00, 4.48MB/s]
Generating train split: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7473/7473 [00:00<00:00, 545854.09 examples/s]
Generating test split: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 382592.46 examples/s]
2025-08-23:06:50:17 INFO     [evaluator:305] gsm8k: Using gen_kwargs: {'until': ['Question:', '</s>', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0}
2025-08-23:06:50:17 WARNING  [evaluator:324] Overwriting default num_fewshot of gsm8k from 5 to 5
2025-08-23:06:50:17 INFO     [api.task:434] Building contexts for gsm8k on rank 0...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:04<00:00, 326.65it/s]
2025-08-23:06:50:21 INFO     [evaluator:574] Running generate_until requests
Adding requests: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:00<00:00, 3263.68it/s]
Processed prompts: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [21:04<00:00,  1.04it/s, est. speed input: 1148.31 toks/s, output: 141.30 toks/s]
Running generate_until requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [21:05<00:00,  1.04it/s]
2025-08-23:07:11:30 INFO     [loggers.evaluation_tracker:280] Output path not provided, skipping saving results aggregated
vllm (pretrained=ibm-granite/granite-3.3-8b-instruct,max_model_len=16384), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7157|±  |0.0124|
|     |       |strict-match    |     5|exact_match|↑  |0.6687|±  |0.0130|

Accuracy Test of the Compressed Model

  1. Test the accuracy of the compressed model; the strict-match accuracy score is 0.6801.
(myenv) [root@rhaiis lm-evaluation-harness]# lm_eval --model vllm --model_args pretrained=../llm-compressor-demo/quantized/granite-3.3-8b-instruct-quantized.w8a8,max_model_len=16384 --tasks gsm8k --num_fewshot 5 --batch_size auto
......
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7301|±  |0.0122|
|     |       |strict-match    |     5|exact_match|↑  |0.6801|±  |0.0128|

Accuracy Comparison Before and After Model Optimization

The Filter column in the results above corresponds to two answer-matching strategies (a toy illustration follows this list):

  • flexible-extract: uses a lenient pattern to pull the numeric answer out of the model's response (effectively taking the last number it generates) and compares it with the reference answer, tolerating differences in formatting.
  • strict-match: only credits answers given in the benchmark's expected format (for GSM8K, the final "#### <number>" line), so a correct value presented in a different format is counted as wrong.
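The toy example below mimics the difference between the two strategies. The regular expressions are simplified stand-ins, not the exact filters that lm-evaluation-harness ships.

import re

reference = "18"
model_output = "She makes 9 + 9 = 18 dollars every day. The answer is 18."

# flexible-extract style: take the last number found anywhere in the output.
numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
flexible_ok = bool(numbers) and numbers[-1] == reference

# strict-match style: only accept an answer in the expected "#### <number>" format.
strict = re.search(r"#### (-?[\d.,]+)", model_output)
strict_ok = strict is not None and strict.group(1) == reference

print(flexible_ok, strict_ok)   # True False: correct answer, but not in the strict format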

From the results above, the strict-match accuracy before and after optimization is 0.6687 and 0.6801 respectively, so the compressed model scores about 101.7% of the original (0.6801 / 0.6687 ≈ 1.017). Given that each score carries a Stderr of roughly ±0.013, the compressed model's accuracy is essentially unchanged.

References

https://developers.redhat.com/articles/2025/08/18/optimizing-generative-ai-models-quantization
https://github.com/vllm-project/llm-compressor
https://github.com/EleutherAI/lm-evaluation-harness
https://github.com/odh-labs/rhoai-roadshow-v2/blob/main/docs/4-rhaiis/notebooks/2-optimize-models.ipynb
https://developers.redhat.com/articles/2025/05/09/llm-compressor-optimize-llms-low-latency-deployments
https://blog.csdn.net/M00Rue_/article/details/148063263
https://www.53ai.com/news/qianyanjishu/2276.html
