7. Fine-Tuning GPT-2 to Follow Human Instructions

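This article walks through the full instruction fine-tuning workflow for a large language model in three key steps: 1) data preparation, converting instruction-response pairs to the Alpaca format and splitting the dataset; 2) model configuration and fine-tuning, building a custom dataset class and collate function and loading a pretrained GPT-2 model for fine-tuning; 3) evaluation, generating responses on the test set and scoring them automatically with a Llama 3 model to assess conversational performance. It shows the concrete implementation of data processing, batch construction, and model training, and proposes a combined evaluation strategy that draws on multiple-choice benchmarks, human evaluation, and automated evaluation.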

卖微积分的男孩 · Published 2025-11-17 14:16:53

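Previously we fine-tuned a large language model for a specific classification task. Now we fine-tune a large language model to follow human instructions, which is one of the main techniques behind LLMs used for chatbots, personal assistants, and other conversational tasks. Instruction fine-tuning likewise has three stages: preparing the dataset, configuring and fine-tuning the model, and evaluating model performance.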

1. Preparing the dataset

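An instruction fine-tuning dataset consists of instruction-response pairs. Once the data is obtained, it still has to be formatted into a prompt style suited to large language models, such as Alpaca or Phi-3. Here we use the Alpaca style.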

First, download the dataset. Each entry is a dictionary with three keys: "instruction", "input", and "output".

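import json
import os
import urllib.request

# Download the dataset (skipped if the file already exists), then load it as JSON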
def download_and_load_file(file_path, url):
    if not os.path.exists(file_path):
        with urllib.request.urlopen(url) as response:
            text_data = response.read().decode("utf-8")
        with open(file_path, "w", encoding="utf-8") as file:
            file.write(text_data)

    with open(file_path, "r", encoding="utf-8") as file:
        data = json.load(file)

    return data

if __name__ == "__main__":
    # Download and load the dataset
    file_path = "instruction-data.json"
    url = ("https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/"
          "main/ch07/01_main-chapter-code/instruction-data.json")

    data = download_and_load_file(file_path, url)
    print("data length:", len(data))
    print("example of data:\n", data[50])
    """
    data length: 1100
    example of data:
    {'instruction': 'Identify the correct spelling of the following word.', 
    'input': 'Ocassion', 'output': "The correct spelling is 'Occasion.'"}
    """

Next, convert each entry to the Alpaca prompt style. When the value of the 'input' key is empty, the "### Input:" section is omitted.

# Alpaca-style prompt
def formate_input(entry):
    instruction_text = (
        f"Below is an instruction that describes a task. "
        f"Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The "### Input:" section is added only when the entry has a non-empty input
    input_text = (
        f"\n\n### Input:\n{entry['input']}" if entry['input'] else ""
    )

    return instruction_text + input_text

alpaca_data = formate_input(data[50])
response = f"\n\n### Response:\n{data[50]['output']}"
print(alpaca_data + response)

"""
Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
Identify the correct spelling of the following word.

### Input:
Ocassion

### Response:
The correct spelling is 'Occasion.'
"""

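The example above has a non-empty input. As a complement, the following minimal sketch shows the empty-input case. It re-declares the formatter under the hypothetical name `format_alpaca_prompt` so the snippet is self-contained, and the sample entry is made up for illustration rather than taken from the dataset.

```python
def format_alpaca_prompt(entry):
    """Build an Alpaca-style prompt; hypothetical stand-in for formate_input."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # An empty string is falsy, so the Input section is skipped entirely
    if entry["input"]:
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt


# Made-up entry with an empty 'input' field (illustrative only)
entry_without_input = {
    "instruction": "Name a synonym for 'happy'.",
    "input": "",
    "output": "A synonym for 'happy' is 'joyful'.",
}

prompt_without_input = format_alpaca_prompt(entry_without_input)
print(prompt_without_input)
```

The printed prompt ends right after the instruction text, with no "### Input:" header.
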
Split the dataset: 85% for training, 10% for testing, and the remaining 5% for validation.

def split_dataset(data, train_rate, test_rate):
    length = len(data)
    train_portion = int(length * train_rate)
    test_portion = int(length * test_rate)
    val_portion = length - train_portion - test_portion

    train_data = data[:train_portion]
    test_data = data[train_portion: train_portion + test_portion]
    val_data = data[train_portion + test_portion:]

    print("train length:", len(train_data))
    print("test length:", len(test_data))
    print("val length:", len(val_data))
    return train_data, test_data, val_data
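
For the 1,100-entry dataset loaded earlier, the resulting split sizes can be checked with plain arithmetic. This is only a sketch: `split_sizes` is a hypothetical helper introduced for the illustration, mirroring the integer truncation used in split_dataset.

```python
def split_sizes(length, train_rate, test_rate):
    """Return (train, test, val) sizes using the same truncation as split_dataset."""
    train_portion = int(length * train_rate)
    test_portion = int(length * test_rate)
    # Whatever is left after truncation goes to the validation set
    val_portion = length - train_portion - test_portion
    return train_portion, test_portion, val_portion


sizes = split_sizes(1100, 0.85, 0.1)
print(sizes)  # (train, test, val)
```

Because the validation size is computed as the remainder, the three parts always add up to the full dataset length, whatever the rounding.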

2. Building the dataset

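Previously, training batches were created automatically by PyTorch's DataLoader class, which used the default collate function to combine a list of samples into a batch. Batching for instruction fine-tuning, however, needs a custom collate function that handles the specific requirements and format of the instruction dataset; this function is then plugged into the DataLoader.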

First, write the InstructionDataset class, which applies the formate_input function and pre-tokenizes every entry in the dataset.

import torch
from torch.utils.data import Dataset, DataLoader

class InstructionDataset(Dataset):
    def __init__(self, data, tokenizer):
        self.data = data
        self.encoded_texts = []
        for entry in data:
            instruction_plus_input = formate_input(entry)
            response_text = f"\n\n### Response:\n{entry['output']}"
            full_text = instruction_plus_input + response_text
            self.encoded_texts.append(tokenizer.encode(full_text))

    def __getitem__(self, item):
        return self.encoded_texts[item]

    def __len__(self):
        return len(self.data)

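To speed up training, several examples are collected into one batch, which requires all inputs in the batch to have the same length. We use <|endoftext|> as the padding token and simply append its token ID, 50256, to the end of each token ID sequence. The custom collate function custom_collate_draft implements this padding while still allowing different batches to have different sequence lengths.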

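Note that besides the input sequences, the target token ID sequences also need padding. The targets represent what the model is expected to generate and are used to compute the loss during training; as before, they are the inputs shifted left by one position. In addition, every padding token in the targets is assigned the placeholder value -100, which excludes padding from the loss so that only meaningful data influences learning. (Classification fine-tuning did not need this, because the model was trained only on the output at the last token.) The first end-of-text token, 50256, is kept in the targets, however: it marks the end of a response, and keeping it helps the model learn when to generate an end-of-text token in response to an instruction.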

# Custom collate function: pads every example in a batch to the same length, while different batches may have different lengths
def custom_collate_draft(batch, pad_token_id=50256, ignore_index=-100, allowed_max_length=None, device="cpu"):
    # Length of the longest sequence in the batch (+1 for the extra padding token added below)
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []

    for item in batch:
        new_item = item.copy()
        new_item += [pad_token_id]
        padded = new_item + [pad_token_id] * (batch_max_length - len(new_item))
        # Inputs: drop the extra padding token appended above
        inputs = torch.tensor(padded[:-1])
        # Targets: the token IDs shifted left by one position
        targets = torch.tensor(padded[1:])
        # Replace every padding token in the targets except the first with ignore_index
        mask = targets == pad_token_id
        indices = torch.nonzero(mask).squeeze()
        if indices.numel() > 1:
            targets[indices[1:]] = ignore_index

        # Optional: truncate to the maximum sequence length
        if allowed_max_length is not None:
            inputs = inputs[:allowed_max_length]
            targets = targets[:allowed_max_length]

        inputs_lst.append(inputs)
        targets_lst.append(targets)

    inputs_tensor = torch.stack(inputs_lst).to(device)
    targets_tensor = torch.stack(targets_lst).to(device)

    return inputs_tensor, targets_tensor
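
To see what the padding and masking steps produce, the following list-only sketch repeats the same logic without tensors. The helper name `pad_and_mask` and the toy token IDs are assumptions made for illustration; it is not a replacement for the collate function above, and it leaves out truncation and device handling.

```python
PAD_TOKEN_ID = 50256   # <|endoftext|>, as in the text above
IGNORE_INDEX = -100


def pad_and_mask(batch):
    """List-only sketch of the collate logic: pad, shift by one, mask extra padding."""
    batch_max_length = max(len(item) + 1 for item in batch)
    inputs_lst, targets_lst = [], []
    for item in batch:
        # Append one end-of-text token, then pad up to the batch maximum
        padded = item + [PAD_TOKEN_ID]
        padded += [PAD_TOKEN_ID] * (batch_max_length - len(padded))
        inputs = padded[:-1]
        targets = padded[1:]
        # Keep the first padding token in the targets, mask all later ones
        pad_positions = [i for i, tok in enumerate(targets) if tok == PAD_TOKEN_ID]
        for i in pad_positions[1:]:
            targets[i] = IGNORE_INDEX
        inputs_lst.append(inputs)
        targets_lst.append(targets)
    return inputs_lst, targets_lst


# Toy token IDs, chosen only for illustration
toy_batch = [[0, 1, 2, 3, 4], [5, 6], [7, 8, 9]]
toy_inputs, toy_targets = pad_and_mask(toy_batch)
print(toy_inputs)
print(toy_targets)
```

In the targets, each row keeps exactly one 50256 as the end-of-response marker, and every later padding position becomes -100.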

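The placeholder is set to -100 because PyTorch's cross-entropy loss function ignores targets equal to -100 by default (ignore_index=-100). Besides masking padding tokens, a further common practice is to also mask the target tokens that belong to the instruction. The loss is then computed only on the response tokens, so the model focuses on generating accurate responses rather than memorizing instructions, which may help reduce overfitting. The custom_collate_draft function above masks only the padding tokens.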
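
The effect of the -100 placeholder can be illustrated without PyTorch. The sketch below is a hand-written, stdlib-only imitation of a mean cross-entropy that skips ignored targets; `mean_cross_entropy` and the toy logits are made up for illustration, and the sketch assumes the average is taken over the non-ignored positions only.

```python
import math

IGNORE_INDEX = -100


def mean_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Average negative log-likelihood over positions whose target is not ignored."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_sum_exp = math.log(sum(math.exp(x) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


# Two real positions plus one padded position (toy numbers)
logits = [[-1.0, 1.0], [-0.5, 1.5], [3.0, -2.0]]
loss_two_positions = mean_cross_entropy(logits[:2], [0, 1])
loss_with_masked_third = mean_cross_entropy(logits, [0, 1, IGNORE_INDEX])
loss_with_real_third = mean_cross_entropy(logits, [0, 1, 1])

print(loss_two_positions, loss_with_masked_third, loss_with_real_third)
```

The first two values coincide, because the masked third position is skipped, while giving the third position a real target changes the result.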

Finally, initialize the data loaders on the datasets we just built.

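import tiktoken
from functools import partial

# Moving the data to the target device inside the collate function keeps the transfer out of the training loop, which avoids stalling the GPU (when device is a GPU)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# allowed_max_length=1024 truncates to the maximum context length supported by GPT-2
custom_collate_draft = partial(custom_collate_draft, device=device, allowed_max_length=1024)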
    
num_workers = 0
batch_size = 4
torch.manual_seed(123)
tokenizer = tiktoken.get_encoding("gpt2")
    
train_dataset = InstructionDataset(train_data, tokenizer)
val_dataset = InstructionDataset(val_data, tokenizer)
test_dataset = InstructionDataset(test_data, tokenizer)
    
train_loader = DataLoader(
        train_dataset,
        batch_size=batch_size,
        collate_fn=custom_collate_draft,
        shuffle=True,
        drop_last=True,
        num_workers=num_workers
)
val_loader = DataLoader(
        val_dataset,
        batch_size=batch_size,
        collate_fn=custom_collate_draft,
        shuffle=False,
        drop_last=False,
        num_workers=num_workers
)
test_loader = DataLoader(
        test_dataset,
        batch_size=batch_size,
        collate_fn=custom_collate_draft,
        shuffle=False,
        drop_last=False,
        num_workers=num_workers
)

3. Loading the pretrained LLM

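Loading the pretrained model follows the same steps as in Parts 5 and 6; we reuse the load_weight_into_gpt function from Part 5 to load the pretrained weights. This time we use the medium-sized GPT-2 (355M parameters), because instruction fine-tuning tends to give better results with a larger-capacity model.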

# Load the pretrained model
from gpt2 import GPTModel
from gpt_download import download_and_load_gpt2
from PreTrain import load_weight_into_gpt

BASE_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "drop_rate": 0.0,
    "qkv_bias": True
}
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "num_layers": 12, "num_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "num_layers": 24, "num_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "num_layers": 36, "num_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "num_layers": 48, "num_heads": 25}
}
CHOOSE_MODEL = "gpt2-medium (355M)"
BASE_CONFIG.update(model_configs[CHOOSE_MODEL])
model_size = CHOOSE_MODEL.split(" ")[-1].lstrip("(").rstrip(")")

model = GPTModel(BASE_CONFIG)
settings, params = download_and_load_gpt2(model_size, "gpt2")
load_weight_into_gpt(model, params)
model.to(device)

4. Fine-tuning the model

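The pretrained weights alone do not perform well on instruction-following tasks, so the model needs further fine-tuning. For training we can directly reuse the train_model_simple training function from Part 5, along with the calc_loss_loader function for computing the loss.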

import time

# Fine-tune the model
epochs = 3
start_time = time.time()
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=5e-4,
    weight_decay=0.1
)
train_model_simple(model, train_loader, val_loader, optimizer, device, epochs, eval_freq=100,
                   eval_iter=50, start_context=formate_input(val_data[0]),  tokenizer=tokenizer)

end_time = time.time()
print(f"Training time: {(end_time - start_time) / 60:.2f} minutes")

# Save the model weights
torch.save({
    "model_state_dict": model.state_dict(),  # dictionary of all learnable parameters (weights and biases)
    "optimizer_state_dict": optimizer.state_dict(),  # optimizer state dictionary
},
    "InstructionFine-tuning.pth"
)

5. Evaluating the model

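For text classification, accuracy was simply the proportion of messages correctly labeled as spam or not spam. In practice, evaluating an instruction-fine-tuned large language model calls for several approaches: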

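Short-answer and multiple-choice benchmarks, such as "Measuring Massive Multitask Language Understanding" (MMLU), which mainly test a model's general knowledge.
Human preference comparisons against other large language models, such as the LMSYS Chatbot Arena.
Automated conversational benchmarks in which another large language model (such as GPT-4) evaluates the responses, such as AlpacaEval.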

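In practice, all three approaches (multiple-choice question answering, human evaluation, and automated metrics of conversational performance) are usually considered together. Here, though, the focus is on conversational performance rather than only the model's ability to answer multiple-choice questions. We therefore use an approach similar to the automated conversational benchmarks, with another large language model evaluating the responses automatically.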

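First, generate the model's responses on the test set and save them to a JSON file, so that they can easily be loaded later and scored by another large model.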

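# Generate responses on the test set and write them to a JSON file
from tqdm import tqdm

def generate_response_from_test_data(test_data, model, tokenizer):
    # Reuse the text generation function and the token ID/text conversion functions from Part 5
    from PreTrain import generate, text_to_token_ids, token_ids_to_text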

    for i, entry in tqdm(enumerate(test_data), total=len(test_data)):
        input_text = formate_input(entry)

        token_ids = generate(model,
                             idx=text_to_token_ids(input_text, tokenizer),
                             max_new_tokens=256,
                             context_size=BASE_CONFIG["context_length"],
                             eos_id=50256)
        generated_text = token_ids_to_text(token_ids, tokenizer)

        response_text = generated_text[len(input_text):].replace("### Response:", "").strip()
        test_data[i]["model_response"] = response_text

    with open("instruction-data-with-response.json", "w") as file:
        json.dump(test_data, file, indent=4)  # indent for readability
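
The response-extraction step in the function above is plain string slicing, which the following sketch isolates. The helper name `extract_response` and the prompt and completion strings are made up for illustration; the sketch also assumes the decoded text starts with the exact prompt, which is what the slicing relies on.

```python
def extract_response(generated_text, input_text):
    """Strip the prompt prefix and the '### Response:' marker from generated text."""
    return generated_text[len(input_text):].replace("### Response:", "").strip()


# Made-up prompt and completion standing in for real model output
input_text = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
    "\n\n### Instruction:\nName a synonym for 'happy'."
)
generated_text = input_text + "\n\n### Response:\nA synonym for 'happy' is 'joyful'."

response_text = extract_response(generated_text, input_text)
print(response_text)
```

Only the text after the marker remains, with surrounding whitespace removed.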

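We use Meta's 8-billion-parameter Llama 3 model as the evaluator; it can be run locally through the Ollama application. After installing Ollama, start the service either by launching the application or by running ollama serve on the command line, then run ollama run llama3, which downloads the model (on first use) and opens an interactive session. The model can also be queried from Python through the REST API.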

# Query the locally served Llama model
def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"seed": 123, "temperature": 0, "num_ctx": 2048}
    }
    # Convert the dictionary to JSON and encode it as bytes
    payload = json.dumps(data).encode("utf-8")
    # Create a POST request
    request = urllib.request.Request(url, data=payload, method="POST")
    request.add_header("Content-Type", "application/json")
    # Send the request and collect the model's reply
    response_data = ""
    with urllib.request.urlopen(request) as response:
        while True:
            line = response.readline().decode("utf-8")
            if not line:
                break
            response_json = json.loads(line)
            response_data += response_json["message"]["content"]

    return response_data
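
The read loop in query_model treats the response body as a stream of lines, each holding one JSON object. The sketch below runs the same loop against an in-memory stream so that no server is needed. The helper name `collect_streamed_content` and the chunk contents are made up, and the chunk shape is an assumption modeled on what the loop above expects rather than a captured server response.

```python
import io
import json


def collect_streamed_content(stream):
    """Concatenate message content from a stream of newline-delimited JSON objects."""
    response_data = ""
    while True:
        line = stream.readline().decode("utf-8")
        if not line:  # an empty read means the stream is exhausted
            break
        response_json = json.loads(line)
        response_data += response_json["message"]["content"]
    return response_data


# Simulated response body (made-up chunks; no server is contacted)
fake_chunks = [
    {"message": {"role": "assistant", "content": "8"}, "done": False},
    {"message": {"role": "assistant", "content": "5"}, "done": False},
    {"message": {"role": "assistant", "content": ""}, "done": True},
]
fake_stream = io.BytesIO(
    "".join(json.dumps(chunk) + "\n" for chunk in fake_chunks).encode("utf-8")
)

streamed_text = collect_streamed_content(fake_stream)
print(streamed_text)
```

The pieces of content are joined in arrival order, so a reply split across several chunks is reassembled into one string.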

Finally, score the test-set predictions we just saved.

# Evaluate the fine-tuned LLM
def generate_model_scores(json_data, json_key, model="llama3"):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = f"Given the input '{formate_input(entry)}' and correct output '{entry['output']}', " \
                 f"score the model response '{entry[json_key]}' " \
                 f"on a scale from 0 to 100, where 100 is the best score. Respond with the integer number only."
        score = query_model(prompt, model)
        try:
            # Convert the reply to an integer; non-numeric replies are skipped
            scores.append(int(score))
        except ValueError:
            print(f"Could not convert score: {score}")
            continue

    # Return only after every entry has been scored
    return scores
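
Once the scores are collected, a natural next step is to average them. The sketch below is one possible way to do that while tolerating replies that do not follow the "integer only" instruction. The helper `parse_score`, the first-integer heuristic, and the sample replies are all assumptions for illustration, not part of the scoring function above.

```python
import re


def parse_score(raw_reply):
    """Return the first integer in 0..100 found in the reply, or None if there is none."""
    match = re.search(r"\d+", raw_reply)
    if match is None:
        return None
    value = int(match.group())
    return value if 0 <= value <= 100 else None


# Made-up evaluator replies, including one that ignores the requested format
raw_replies = ["85", " 70\n", "I would rate this 60.", "N/A"]
parsed = [parse_score(reply) for reply in raw_replies]
valid_scores = [score for score in parsed if score is not None]
average_score = sum(valid_scores) / len(valid_scores)

print(parsed)
print(f"Average score: {average_score:.2f}")
```

Taking the first integer is a crude heuristic: a reply such as "3 out of 100 points deducted" would be misread, so spot-checking the raw replies is still advisable.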