Building a local LLM assistant with the Qwen3-0.6B model from Hugging Face (single-turn and multi-turn versions), with a comparison of single-sample and batched inference
This article shows how to build a dialogue system with the Qwen3-0.6B model from Hugging Face. It covers: 1) calling the model for a single turn, including loading the model, building the prompt with the chat template, tokenizing, and decoding the result; 2) a QwenChatbot class that supports multi-turn conversation by keeping a dialogue history and generating responses from it; 3) batched inference, which raises throughput by processing batch_size samples at a time, including building the batched messages, applying the chat template, generating, and parsing the results. In particular, batched inference requires initializing the tokenizer with padding_side='left'.
1. Calling the Qwen3-0.6B model from Hugging Face
Step 1: Load the model
In this step we set the model's name as it appears on the Hugging Face Hub. When the code below runs, the weights are downloaded automatically into the local .cache directory (see the sketch after the code block if you want to control the download location).
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-0.6B"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
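By default the weights land in ~/.cache/huggingface. A minimal sketch, assuming you prefer an explicit download directory (cache_dir is a standard argument of from_pretrained; the path below is only an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-0.6B"
cache_dir = "./models"  # hypothetical local directory; adjust to taste

tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    cache_dir=cache_dir,
)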
Step 2: Build the prompt, convert it with the chat model's template, and tokenize it
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
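If you want to see what the chat template produced before generation, printing the intermediate results is a handy (purely optional) sanity check:

print(text)                          # the prompt wrapped in Qwen's chat markup
print(model_inputs.input_ids.shape)  # torch.Size([1, prompt_length])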
Step 3: Feed the inputs into the model to obtain the generated tokens
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
# keep only the newly generated tokens, dropping the echoed prompt
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
Step 4: Decode the generated tokens into readable text
# decode everything that was generated; the sketch below shows how to split off the thinking content
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
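Because enable_thinking=True was passed above, the generated tokens begin with a <think>...</think> reasoning block. A minimal sketch, following the parsing in the official Qwen3 model card, which locates the </think> token (id 151668) and splits the reasoning content from the final answer:

try:
    # rindex of 151668 (</think>); everything before it is the thinking content
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> token found, e.g. when thinking mode is disabled

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)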
2. Building a multi-turn chatbot
from transformers import AutoModelForCausalLM, AutoTokenizer

class QwenChatbot:
    def __init__(self, model_name="Qwen/Qwen3-0.6B"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(model_name)
        self.history = []  # accumulated list of {"role", "content"} messages

    def generate_response(self, user_input):
        # append the new user turn to the running history before applying the template
        messages = self.history + [{"role": "user", "content": user_input}]

        text = self.tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True
        )

        inputs = self.tokenizer(text, return_tensors="pt")
        # generate, then keep only the tokens produced after the prompt
        response_ids = self.model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()
        response = self.tokenizer.decode(response_ids, skip_special_tokens=True)

        # Update history
        self.history.append({"role": "user", "content": user_input})
        self.history.append({"role": "assistant", "content": response})

        return response
# Example Usage
if __name__ == "__main__":
    chatbot = QwenChatbot()

    # First input (without /think or /no_think tags, thinking mode is enabled by default)
    user_input_1 = "How many r's in strawberries?"
    print(f"User: {user_input_1}")
    response_1 = chatbot.generate_response(user_input_1)
    print(f"Bot: {response_1}")
    print("----------------------")

    # Second input with /no_think
    user_input_2 = "Then, how many r's in blueberries? /no_think"
    print(f"User: {user_input_2}")
    response_2 = chatbot.generate_response(user_input_2)
    print(f"Bot: {response_2}")
    print("----------------------")

    # Third input with /think
    user_input_3 = "Really? /think"
    print(f"User: {user_input_3}")
    response_3 = chatbot.generate_response(user_input_3)
    print(f"Bot: {response_3}")
3. Batched (batch_size) inference
The following translation function illustrates how to call the model with a batch_size. The full code is shown first:
Note: when initializing the tokenizer you need to add the padding_side='left' argument, because decoder-only models must be padded on the left for batched generation:
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side='left')
prompt = "你是一个精通医疗领域的翻译专家:"  # "You are a translation expert specializing in the medical domain:"
from tqdm import tqdm

def translate_texts(texts, tokenizer, model, device, max_length=256, batch_size=32):
    results = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
        batch = texts[i:i + batch_size]
        # one chat-message list per sample, i.e. a 2-D (nested) list overall
        messages_list = [
            [{"role": "user", "content": prompt + text}] for text in batch
        ]
        text_batch = [tokenizer.apply_chat_template(
            messages,
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
        ) for messages in messages_list]
        # pad/truncate so every sample in the batch has the same length
        model_inputs = tokenizer(text_batch, return_tensors="pt",
                                 padding=True,
                                 truncation=True,
                                 max_length=max_length).to(model.device)
        # conduct text completion
        generated_ids = model.generate(
            **model_inputs,
            max_new_tokens=32768
        )
        # parse each sample in the batch separately
        for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):
            # keep only the newly generated part, dropping the (padded) prompt
            new_tokens = output_ids[len(input_ids):].tolist()
            content = tokenizer.decode(new_tokens, skip_special_tokens=True).strip("\n")
            results.append(content)
    return results
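To get the single-sample vs. batched comparison mentioned in the title, one option is to time the same list of sentences with batch_size=1 and with a larger batch. A rough sketch, assuming the hypothetical example sentences below; actual timings depend on your hardware, though on a GPU the batched call is typically several times faster:

import time

texts = ["The patient presented with acute chest pain.",
         "Administer 5 mg of the drug twice daily."] * 16  # 32 hypothetical sentences

start = time.perf_counter()
single_results = translate_texts(texts, tokenizer, model, model.device, batch_size=1)
print(f"single-sample: {time.perf_counter() - start:.1f} s")

start = time.perf_counter()
batch_results = translate_texts(texts, tokenizer, model, model.device, batch_size=32)
print(f"batch_size=32: {time.perf_counter() - start:.1f} s")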
The code is broken down step by step below.
Step 1: Build the messages in batch form
for i in tqdm(range(0, len(texts), batch_size), desc="Translating"):
    batch = texts[i:i + batch_size]
    messages_list = [
        [{"role": "user", "content": prompt + text}] for text in batch
    ]
Since a single-sample call already passes [{"role": "user", "content": prompt + text}] as a list, every sample in the batch must itself be a list, so messages_list is a two-dimensional (nested) list, as the sketch after the snippet below shows.
Writing it as a flat list of dicts, as follows, raises an error:
messages_list = [
    {"role": "user", "content": prompt + text} for text in batch
]
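For contrast, a minimal illustration of the correct nested structure, assuming a batch of two hypothetical sentences:

batch = ["Aspirin reduces fever.", "The dose was adjusted."]
messages_list = [
    [{"role": "user", "content": prompt + "Aspirin reduces fever."}],
    [{"role": "user", "content": prompt + "The dose was adjusted."}],
]  # one inner list (one single-turn conversation) per sample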
Step 2: Apply the chat template to the whole batch
text_batch = [tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False  # Switches between thinking and non-thinking modes. Default is True.
) for messages in messages_list]
Step 3: Tokenize the batch and generate the answer tokens with the model
Because the samples within a batch have different lengths, the tokenizer() call here needs the padding=True and truncation=True arguments so the texts are padded and truncated to a common length.
model_inputs = tokenizer(text_batch, return_tensors="pt",
                         padding=True,
                         truncation=True,
                         max_length=max_length).to(model.device)
# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
Step 4: Extract each input's corresponding output from the batched generation results
# parse each sample in the batch separately
for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids):
    # keep only the newly generated part
    new_tokens = output_ids[len(input_ids):].tolist()
    content = tokenizer.decode(new_tokens, skip_special_tokens=True).strip("\n")
A note on the line new_tokens = output_ids[len(input_ids):].tolist(): the model's raw output is "input prompt + newly generated content", so we need the index where the input ends; only the tokens after that index are the answer we want. Because the tokenizer was initialized with padding_side='left', the padding sits in front of the prompt, every row of model_inputs.input_ids has the same length, and the slice cleanly removes both the padding and the prompt, as the worked example below shows.
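A small worked example of why this slice works with left padding, using hypothetical token ids where 0 is the pad token:

# two prompts of different lengths, left-padded to a common length of 4
input_ids_batch = [[0, 0, 11, 12],        # pads + short prompt
                   [21, 22, 23, 24]]      # longer prompt, no padding needed
# generate() returns the padded prompt followed by the new tokens
# (the shorter answer is padded at the end; skip_special_tokens drops that pad when decoding)
output_ids_batch = [[0, 0, 11, 12, 55, 56, 0],
                    [21, 22, 23, 24, 57, 58, 59]]

for input_ids, output_ids in zip(input_ids_batch, output_ids_batch):
    new_tokens = output_ids[len(input_ids):]  # [55, 56, 0] and [57, 58, 59]
    print(new_tokens)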