Large model quantization methods and AutoGPTQ quantization of Qwen3
Based on the llama_factory training environment, install the quantization dependencies. After preprocessing, each msg should look like [{"role":"user","content":""},{"role":"assistant","content":""}], i.e. something the chat_template can parse, with the reply answer included.
data_path = "path to the model training data (mind the format)"
quant_path = "path to save the quantized model"
model_path =
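For illustration, one preprocessed sample might look like the following (the message contents here are hypothetical, only the role/content shape matters):

```python
# Hypothetical preprocessed sample: the user turn plus the chosen assistant
# reply, in the role/content shape that chat templates accept.
msg = [
    {"role": "user", "content": "What does GPTQ quantization do?"},
    {"role": "assistant", "content": "It compresses model weights to low-bit integers after training."},
]
```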
AWQ summary
GPTQ quantization: accuracy is largely preserved (typically <5% accuracy loss).
Under 4-bit quantization, GPU memory usage is roughly 25%-30% of the original FP16 model.
Inference performance: on an A100 with the FLASH_ATTN backend, per-request latency drops from 0.438 s to 0.38 s.
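The 25%-30% figure follows from simple arithmetic; a sketch, where the parameter count and the per-weight overhead for quantization scales are illustrative assumptions:

```python
# Weight-only 4-bit vs. FP16 memory, ignoring KV cache and activations.
params = 4e9                      # e.g. a 4B-parameter model (illustrative)
fp16_bytes = params * 2           # 2 bytes per FP16 weight
int4_bytes = params * 0.5         # 0.5 bytes per 4-bit weight
overhead = params * 0.0625        # scales/zero-points, ~0.5 bit per weight
ratio = (int4_bytes + overhead) / fp16_bytes
print(f"weights shrink to {ratio:.1%} of FP16")  # weights shrink to 28.1% of FP16
```

The extra few percent on top of the raw 4/16 = 25% comes from the per-group scales and zero-points that GPTQ stores alongside the packed weights.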
Preparing the quantization data
The quantization calibration dataset has the format [{"text":"sdgfd"}], i.e. a list of objects each with a single text field.
Map the training data into calibration data:
def quantize_desc_data_text():
    import json
    import torch
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("original_model_path")
    with open('train_data.json', 'r', encoding='utf-8') as f:
        data = json.load(f)
    new_data_list = []
    maxlen = 0
    for each in data:
        # Append the chosen reply so each calibration sample includes the answer
        messages = each["messages"]
        messages.append(each["chosen"])
        new_data_list.append({
            "text": tokenizer.apply_chat_template(
                messages, tokenize=False,
                add_generation_prompt=False, enable_thinking=False)
        })
        # Track the longest sample in tokens, to set export_quantization_maxlen
        sample: dict[str, torch.Tensor] = tokenizer(new_data_list[-1]["text"], return_tensors="pt")
        if sample["input_ids"].size(1) > maxlen:
            maxlen = sample["input_ids"].size(1)
    # Write the calibration dataset (referenced by export_quantization_dataset)
    with open('calibration_data_quantize.json', 'w', encoding='utf-8') as f:
        json.dump(new_data_list, f, ensure_ascii=False, indent=2)
    print(f"max token len: {maxlen}")

quantize_desc_data_text()
Quantization configuration (a llama_factory export YAML):
model_name_or_path: original_model_path
template: qwen3
export_dir: quantized_model_save_path_awq
export_quantization_bit: 4
quantization_method: gptq
export_quantization_maxlen: 1000
export_quantization_dataset: calibration_data_quantize.json
export_size: 2
export_device: cuda
export_legacy_format: false
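A sketch of saving the export config above to a YAML file so it can be run with LLaMA-Factory's CLI; the keys and values mirror the config, while the output filename here is our own choice:

```python
from pathlib import Path

# Export config as a dict; booleans must render as lowercase YAML scalars.
cfg = {
    "model_name_or_path": "original_model_path",
    "template": "qwen3",
    "export_dir": "quantized_model_save_path_gptq",
    "export_quantization_bit": 4,
    "quantization_method": "gptq",
    "export_quantization_maxlen": 1000,
    "export_quantization_dataset": "calibration_data_quantize.json",
    "export_size": 2,
    "export_device": "cuda",
    "export_legacy_format": False,
}
lines = []
for k, v in cfg.items():
    # YAML scalars: lowercase booleans, bare numbers and strings
    lines.append(f"{k}: {str(v).lower() if isinstance(v, bool) else v}")
Path("quantize_gptq.yaml").write_text("\n".join(lines) + "\n", encoding="utf-8")
# Then run the quantized export with LLaMA-Factory's entry point:
#   llamafactory-cli export quantize_gptq.yaml
```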
Deploying the quantized model with vLLM (launching after GPTQ quantization)
#!/bin/bash
#export VLLM_ATTENTION_BACKEND=XFORMERS
export VLLM_ATTENTION_BACKEND=FLASH_ATTN
source /opt/conda/etc/profile.d/conda.sh
conda activate /opt/conda/envs/vllm085
Model_path="/llm/models/general_knowledge_agent_router/general_knowledge_agent_202250820_v21_01_gptq"
#Model_path="/llm/models/Qwen3-4B-Instruct-2507"
CUDA_VISIBLE_DEVICES=0 nohup python -m vllm.entrypoints.openai.api_server \
--model ${Model_path} \
--served-model-name 'qwen3_4b' \
--host 0.0.0.0 \
--port 9005 \
--max-model-len 4000 \
--trust-remote-code \
--device cuda \
--tensor-parallel-size 1 \
--swap-space 0 \
--quantization gptq \
--dtype float16 \
--gpu-memory-utilization 0.7 \
--max-num-seqs 1 > eval_qwen3_quant.log 2>&1 &
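Once the server is up, it exposes the OpenAI-compatible API on the port configured above; a minimal request sketch, with the model name and endpoint taken from the launch script and the actual call left commented out since it needs the running server:

```python
import json
from urllib import request

# Chat request body for the vLLM OpenAI-compatible endpoint; the served model
# name and port come from the launch script above.
payload = {
    "model": "qwen3_4b",
    "messages": [{"role": "user", "content": "hello"}],
    "max_tokens": 64,
}
req = request.Request(
    "http://127.0.0.1:9005/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# resp = request.urlopen(req)  # uncomment with the server running
# print(json.load(resp)["choices"][0]["message"]["content"])
```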