Building a local LLM assistant with the Qwen3-0.6B model from Hugging Face (single-turn and multi-turn versions), with a comparison of single-sample and batched inference
This article shows how to build a dialogue system with the Qwen3-0.6B model from Hugging Face. It covers: 1) calling the model for a single turn, including loading the model, building the prompt with the chat template, tokenizing, and decoding the result; 2) a QwenChatbot class that supports multi-turn conversation by keeping a dialogue history and generating responses from it; 3) batched inference, which raises throughput by processing batch_size samples at a time, including building the batched messages, applying the chat template, generating, and parsing the results. In particular, batched inference requires initializing the tokenizer with padding_side='left'.
1. Calling the Qwen3-0.6B model from Hugging Face
Step 1: Load the model
In this step we set the model's name as it appears on the Hugging Face Hub. When the code below runs, the weights are downloaded automatically into the local .cache directory (see the sketch after the code block if you want to control the download location).
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-0.6B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
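By default the weights land in ~/.cache/huggingface. A minimal sketch, assuming you prefer an explicit download directory (cache_dir is a standard argument of from_pretrained; the path below is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
cache_dir = "./models"  # hypothetical local directory; adjust to taste

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    cache_dir=cache_dir,
)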
Step 2: Build the prompt, convert it with the chat model's template, and tokenize it
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
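If you want to see what the chat template produced before generation, printing the intermediate results is a handy (purely optional) sanity check:

print(text)                          # the prompt wrapped in Qwen's chat markup
print(model_inputs.input_ids.shape)  # torch.Size([1, prompt_length])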
Step 3: Feed the inputs into the model to obtain the generated tokens
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# keep only the newly generated tokens, dropping the echoed prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
Step 4: Decode the generated tokens into readable text
# decode everything that was generated; the sketch below shows how to split off the thinking content
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
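Because enable_thinking=True was passed above, the generated tokens begin with a <think>...</think> reasoning block. A minimal sketch, following the parsing in the official Qwen3 model card, which locates the </think> token (id 151668) and splits the reasoning content from the final answer:

try:
    # rindex of 151668 (</think>); everything before it is the thinking content
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> token found, e.g. when thinking mode is disabled

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)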
2. Building a multi-turn chatbot
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []  # accumulated list of {"role", "content"} messages

    def generate_response(self, user_input):
        # append the new user turn to the running history before applying the template
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # generate, then keep only the tokens produced after the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response
# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
3. Batched (batch_size) inference
The following translation function illustrates how to call the model with a batch_size. The full code is shown first:
Note: when initializing the tokenizer you need to add the padding_side='left' argument, because decoder-only models must be padded on the left for batched generation:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')
prompt = "你是一个精通医疗领域的翻译专家:"  # "You are a translation expert specializing in the medical domain:"
from tqdm import tqdm

def translate_texts(texts, tokenizer, model, device, max_length=256, batch_size=32):
    results = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
        batch = texts[i:i + batch_size]
        # one chat-message list per sample, i.e. a 2-D (nested) list overall
        messages_list = [
            [{"role": "user", "content": prompt + text}] for text in batch
        ]
        text_batch = [tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
        ) for messages in messages_list]
        # pad/truncate so every sample in the batch has the same length
        model_inputs = tokenizer(text_batch, return_tensors="pt",
                                 padding=True,
                                 truncation=True,
                                 max_length=max_length).to(model.device)
        # conduct text completion
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=32768
        )
        # parse each sample in the batch separately
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):
            # keep only the newly generated part, dropping the (padded) prompt
            new_tokens = output_ids[len(input_ids):].tolist()
            content = tokenizer.decode(new_tokens, skip_special_tokens=True).strip("\n")
            results.append(content)
    return results
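To get the single-sample vs. batched comparison mentioned in the title, one option is to time the same list of sentences with batch_size=1 and with a larger batch. A rough sketch, assuming the hypothetical example sentences below; actual timings depend on your hardware, though on a GPU the batched call is typically several times faster:

import time

texts = ["The patient presented with acute chest pain.",
         "Administer 5 mg of the drug twice daily."] * 16  # 32 hypothetical sentences

start = time.perf_counter()
single_results = translate_texts(texts, tokenizer, model, model.device, batch_size=1)
print(f"single-sample: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
batch_results = translate_texts(texts, tokenizer, model, model.device, batch_size=32)
print(f"batch_size=32: {time.perf_counter() - start:.1f} s")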
The code is broken down step by step below.
Step 1: Build the messages in batch form
for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
    batch = texts[i:i + batch_size]
    messages_list = [
        [{"role": "user", "content": prompt + text}] for text in batch
    ]
Since a single-sample call already passes [{"role": "user", "content": prompt + text}] as a list, every sample in the batch must itself be a list, so messages_list is a two-dimensional (nested) list, as the sketch after the snippet below shows.
Writing it as a flat list of dicts, as follows, raises an error:
messages_list = [
    {"role": "user", "content": prompt + text} for text in batch
]
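For contrast, a minimal illustration of the correct nested structure, assuming a batch of two hypothetical sentences:

batch = ["Aspirin reduces fever.", "The dose was adjusted."]
messages_list = [
    [{"role": "user", "content": prompt + "Aspirin reduces fever."}],
    [{"role": "user", "content": prompt + "The dose was adjusted."}],
]  # one inner list (one single-turn conversation) per sample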
Step 2: Apply the chat template to the whole batch
text_batch = [tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
) for messages in messages_list]
Step 3: Tokenize the batch and generate the answer tokens with the model
Because the samples within a batch have different lengths, the tokenizer() call here needs the padding=True and truncation=True arguments so the texts are padded and truncated to a common length.
model_inputs = tokenizer(text_batch, return_tensors="pt",
                         padding=True,
                         truncation=True,
                         max_length=max_length).to(model.device)
# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
Step 4: Extract each input's corresponding output from the batched generation results
# parse each sample in the batch separately
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):
    # keep only the newly generated part
    new_tokens = output_ids[len(input_ids):].tolist()
    content = tokenizer.decode(new_tokens, skip_special_tokens=True).strip("\n")
A note on the line new_tokens = output_ids[len(input_ids):].tolist(): the model's raw output is "input prompt + newly generated content", so we need the index where the input ends; only the tokens after that index are the answer we want. Because the tokenizer was initialized with padding_side='left', the padding sits in front of the prompt, every row of model_inputs.input_ids has the same length, and the slice cleanly removes both the padding and the prompt, as the worked example below shows.
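A small worked example of why this slice works with left padding, using hypothetical token ids where 0 is the pad token:

# two prompts of different lengths, left-padded to a common length of 4
input_ids_batch = [[0, 0, 11, 12],        # pads + short prompt
                   [21, 22, 23, 24]]      # longer prompt, no padding needed
# generate() returns the padded prompt followed by the new tokens
# (the shorter answer is padded at the end; skip_special_tokens drops that pad when decoding)
output_ids_batch = [[0, 0, 11, 12, 55, 56, 0],
                    [21, 22, 23, 24, 57, 58, 59]]

for input_ids, output_ids in zip(input_ids_batch, output_ids_batch):
    new_tokens = output_ids[len(input_ids):]  # [55, 56, 0] and [57, 58, 59]
    print(new_tokens)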