AWQ Summary

GPTQ quantization largely preserves model accuracy (typically <5% accuracy loss).

Under 4-bit quantization, VRAM usage is roughly 25%-30% of the original FP16 model.

Inference performance: on an A100 with the FLASH_ATTN backend, per-request latency drops from 0.438s to 0.38s.
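The 25%-30% figure can be sanity-checked with back-of-envelope arithmetic on weight storage alone. The parameter count below is an assumption (a Qwen3-4B-class model); the gap between the pure-weight ratio of 25% and the observed ~30% comes from quantization scales/zero-points, layers left in higher precision, and runtime overhead:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """GiB needed to store the weights alone (ignores KV cache and activations)."""
    return n_params * bits_per_weight / 8 / 1024**3

n_params = 4e9  # assumed parameter count, e.g. a 4B model
fp16 = weight_memory_gb(n_params, 16)  # ~7.45 GiB
int4 = weight_memory_gb(n_params, 4)   # ~1.86 GiB
print(f"FP16: {fp16:.2f} GiB, INT4: {int4:.2f} GiB, ratio: {int4 / fp16:.0%}")
```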

Quantization data preparation

The quantization calibration dataset is a JSON list of the form [{"text": "..."}].

Map the training data into calibration data:

def quantize_desc_data_text():
    import json

    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("original_model_path")

    # Training data is DPO-style: each record has "messages" plus a "chosen" reply.
    with open('train_data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)

    new_data_list = []
    maxlen = 0
    for each in data:
        messages = each["messages"]
        messages.append(each["chosen"])
        # Render the full conversation into a single text sample via the chat template.
        new_data_list.append({
            "text": tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=False,
                enable_thinking=False,
            )
        })
        # Track the longest sample so export_quantization_maxlen can be set accordingly.
        sample: dict[str, torch.Tensor] = tokenizer(new_data_list[-1]["text"], return_tensors="pt")
        if sample["input_ids"].size(1) > maxlen:
            maxlen = sample["input_ids"].size(1)

    # Write the calibration dataset referenced by the export config below.
    with open('calibration_data_quantize.json', 'w', encoding='utf-8') as f:
        json.dump(new_data_list, f, ensure_ascii=False, indent=2)
    print(f"max token len: {maxlen}")

quantize_desc_data_text()

Quantization configuration (LLaMA-Factory export YAML):

model_name_or_path: original_model_path
template: qwen3

export_dir: quantized_model_output_path_awq
export_quantization_bit: 4
quantization_method: gptq
export_quantization_maxlen: 1000
export_quantization_dataset: calibration_data_quantize.json
export_size: 2
export_device: cuda
export_legacy_format: false
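Saving the YAML above as, e.g., quantize_config.yaml (a hypothetical filename), the export is typically launched with the LLaMA-Factory CLI:

```shell
llamafactory-cli export quantize_config.yaml
```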

Deploying the quantized model with vLLM (serving the GPTQ-quantized model)

#!/bin/bash
#export VLLM_ATTENTION_BACKEND=XFORMERS
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
source /opt/conda/etc/profile.d/conda.sh
conda activate /opt/conda/envs/vllm085
Model_path="/llm/models/general_knowledge_agent_router/general_knowledge_agent_202250820_v21_01_gptq"
#Model_path="/llm/models/Qwen3-4B-Instruct-2507"

CUDA_VISIBLE_DEVICES=0 nohup  python -m vllm.entrypoints.openai.api_server \
  --model ${Model_path} \
  --served-model-name 'qwen3_4b' \
  --host 0.0.0.0 \
  --port 9005 \
  --max-model-len 4000 \
  --trust-remote-code \
  --device cuda \
  --tensor-parallel-size 1 \
  --swap-space 0 \
  --quantization gptq \
  --dtype float16 \
  --gpu-memory-utilization 0.7 \
  --max-num-seqs 1  > eval_qwen3_quant.log 2>&1 &
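Once the server is up, it can be smoke-tested through vLLM's OpenAI-compatible endpoint. The sketch below uses only the standard library; the host, port, and served model name are taken from the launch script above:

```python
import json
import urllib.request

def build_chat_request(model: str, user_msg: str) -> dict:
    """Build an OpenAI-compatible chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": 256,
        "temperature": 0.0,
    }

def post_chat(url: str, payload: dict) -> dict:
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the server above to be running):
# reply = post_chat("http://localhost:9005/v1/chat/completions",
#                   build_chat_request("qwen3_4b", "hello"))
# print(reply["choices"][0]["message"]["content"])
```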
