Natural Language Processing

Text classification

Text classification is a common NLP task that assigns a label or class to text. Some of the largest companies run text classification in production for a wide range of practical applications. One of the most popular forms of text classification is sentiment analysis, which assigns a label like positive, negative, or neutral to a sequence of text.
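Before any finetuning, you can get a feel for what sentiment analysis does with an off-the-shelf pipeline. This is a minimal sketch: the default checkpoint it downloads and the exact score depend on your Transformers version.

from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pretrained sentiment model
classifier("I love sci-fi and am willing to put up with a lot.")
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]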
Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
Load IMDb dataset

Start by loading the IMDb dataset from the Datasets library:

from datasets import load_dataset
imdb = load_dataset("imdb")
imdb["test"][0]

Then take a look at an example:

{
  "label": 0,
  "text": "I love sci-fi and am willing to put up with a lot. Sci-fi movies/TV are usually underfunded, under-appreciated and misunderstood. I tried to like this, I really did, but it is to good TV sci-fi as Babylon 5 is to Star Trek (the original). Silly prosthetics, cheap cardboard sets, stilted dialogues, CG that doesn't match the background, and painfully one-dimensional characters cannot be overcome with a 'sci-fi' setting. (I'm sure there are those of you out there who think Babylon 5 is good sci-fi TV. It's not. It's clichéd and uninspiring.) While US viewers might like emotion and character development, sci-fi is a genre that does not take itself seriously (cf. Star Trek). It may treat important issues, yet not as a serious philosophy. It's really difficult to care about the characters here as they are not simply foolish, just missing a spark of life. Their actions and reactions are wooden and predictable, often painful to watch. The makers of Earth KNOW it's rubbish as they have to always say \"Gene Roddenberry's Earth...\" otherwise people would not continue watching. Roddenberry's ashes must be turning in their orbit as this dull, cheap, poorly edited (watching it without advert breaks really brings this home) trudging Trabant of a show lumbers into space. Spoiler. So, kill off a main character. And then bring him back as another actor. Jeeez! Dallas all over again.",
}

There are two fields in this dataset:

  • text: the movie review text.
  • label: a value of 0 for a negative review or 1 for a positive review.
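If you want to double-check how these integers map to class names, you can inspect the dataset's ClassLabel feature (a quick check; for IMDb the names should be ['neg', 'pos']):

imdb["train"].features["label"].names
# ['neg', 'pos']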
Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the text field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize text and truncate sequences to be no longer than DistilBERT's maximum input length:

def preprocess_function(examples):
   return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use the Datasets [~datasets.Dataset.map] function. You can speed up map by setting batched=True to process multiple elements of the dataset at once:

tokenized_imdb = imdb.map(preprocess_function, batched=True)

Now create a batch of examples using [DataCollatorWithPadding]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

# framework = "pt"
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# framework = "tf"
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the accuracy metric (see the Evaluate quick tour to learn more about how to load and compute a metric):

import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [~evaluate.EvaluationModule.compute] to calculate the accuracy:

import numpy as np

def compute_metrics(eval_pred):
     predictions, labels = eval_pred
     predictions = np.argmax(predictions, axis=1)
     return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

You're ready to start training your model now! Load DistilBERT with [AutoModelForSequenceClassification] along with the number of expected labels and the label mappings:

from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

At this point, only three steps remain:

  1. Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You'll push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
  2. Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  3. Call [Trainer.train] to finetune your model.
training_args = TrainingArguments(
    output_dir="my_awesome_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_imdb["train"],
    eval_dataset=tokenized_imdb["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

[Trainer] applies dynamic padding by default when you pass the tokenizer to it, so you don't need to specify a data collator explicitly in this example.
Once training is completed, share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer
import tensorflow as tf

batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

Then you can load DistilBERT with [TFAutoModelForSequenceClassification] along with the number of expected labels and the label mappings:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id)

Convert your datasets to the tf.data.Dataset format with [~transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
    tokenized_imdb["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_imdb["test"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are to compute the accuracy from the predictions, and to provide a way to push your model to the Hub. Both are done with Keras callbacks.
Pass your compute_metrics function to [transformers.KerasMetricCallback]:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_model",
    tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for text classification, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!
Grab some text you'd like to run inference on:

text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for sentiment analysis with your model, and pass your text to it:

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
# [{'label': 'POSITIVE', 'score': 0.9994940757751465}]

You can also manually replicate the results of the pipeline if you'd like:
Tokenize the text and return PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

predicted_class_id = logits.argmax().item()
model.config.id2label[predicted_class_id]
# 'POSITIVE'

Tokenize the text and return TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the logits:

from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
# 'POSITIVE'
Token classification

Token classification assigns a label to individual tokens in a sentence. One of the most common token classification tasks is named entity recognition (NER). NER attempts to find a label for each entity in a sentence, such as a person, location, or organization.
Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate seqeval
Load WNUT 17 dataset

Start by loading the WNUT 17 dataset from the Datasets library:

from datasets import load_dataset
wnut = load_dataset("wnut_17")
wnut["train"][0]

The output is:

{'id': '0',
 'ner_tags': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 8, 8, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['@paulwalk', 'It', "'s", 'the', 'view', 'from', 'where', 'I', "'m", 'living', 'for', 'two', 'weeks', '.', 'Empire', 'State', 'Building', '=', 'ESB', '.', 'Pretty', 'bad', 'storm', 'here', 'last', 'evening', '.']
}

Each number in ner_tags represents an entity. Convert the numbers to their label names to find out what the entities are:

label_list = wnut["train"].features[f"ner_tags"].feature.names
label_list
[
  "O",
  "B-corporation",
  "I-corporation",
  "B-creative-work",
  "I-creative-work",
  "B-group",
  "I-group",
  "B-location",
  "I-location",
  "B-person",
  "I-person",
  "B-product",
  "I-product",
]

The letter that prefixes each ner_tags label indicates the token position of the entity:

  • B- indicates the beginning of an entity.
  • I- indicates a token is contained inside the same entity (for example, the State token is part of an entity like Empire State Building).
  • O indicates the token doesn't correspond to any entity.
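To make the scheme concrete, you can decode the ner_tags of the first training example back into names using the label_list loaded above. The Empire State Building span illustrates the B-/I- prefixes:

tags = [label_list[i] for i in wnut["train"][0]["ner_tags"]]
list(zip(wnut["train"][0]["tokens"], tags))[14:17]
# [('Empire', 'B-location'), ('State', 'I-location'), ('Building', 'I-location')]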
Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the tokens field:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

As you saw in the example tokens field above, it looks like the input has already been tokenized. But the input actually hasn't been tokenized yet, and you'll need to set is_split_into_words=True to tokenize the words into subwords. For example:

example = wnut["train"][0]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
tokens
#['[CLS]', '@', 'paul', '##walk', 'it', "'", 's', 'the', 'view', 'from', 'where', 'i', "'", 'm', 'living', 'for', 'two', 'weeks', '.', 'empire', 'state', 'building', '=', 'es', '##b', '.', 'pretty', 'bad', 'storm', 'here', 'last', 'evening', '.', '[SEP]']

However, this adds some special tokens CLS and SEP, and the subword tokenization creates a mismatch between the input and labels. A single word corresponding to a single label may now be split into two subwords. You'll need to realign the tokens and labels by:

  1. Mapping all tokens to their corresponding word with the word_ids method.
  2. Assigning the label -100 to the special tokens CLS and SEP so they're ignored by the PyTorch loss function (see CrossEntropyLoss).
  3. Only labeling the first token of a given word, and assigning -100 to other subtokens from the same word.

Here is how you can create a function to realign the tokens and labels, and truncate sequences to be no longer than DistilBERT's maximum input length:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)  # Map tokens to their respective word.
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:  # Set the special tokens to -100.
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:  # Only label the first token of a given word.
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

To apply the preprocessing function over the entire dataset, use the Datasets [~datasets.Dataset.map] function. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_wnut = wnut.map(tokenize_and_align_labels, batched=True)

Now create a batch of examples using [DataCollatorForTokenClassification]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

# framework = "pt"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# framework = "tf"
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, return_tensors="tf")
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the seqeval framework (see the Evaluate quick tour to learn more about how to load and compute a metric). Seqeval actually produces several scores: precision, recall, F1, and accuracy.

import evaluate
seqeval = evaluate.load("seqeval")

Get the NER labels first, and then create a function that passes your true predictions and true labels to [evaluate.EvaluationModule.compute] to calculate the scores:

import numpy as np

labels = [label_list[i] for i in example[f"ner_tags"]]

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    
    true_predictions = [
       [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
       for prediction, label in zip(predictions, labels)
    ]
    
    true_labels = [
       [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
       for prediction, label in zip(predictions, labels)
    ]
    
    results = seqeval.compute(predictions=true_predictions, references=true_labels)
    
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

Before you start training your model, create a map of the expected ids to their labels with id2label and label2id:

id2label = {
    0: "O",
    1: "B-corporation",
    2: "I-corporation",
    3: "B-creative-work",
    4: "I-creative-work",
    5: "B-group",
    6: "I-group",
    7: "B-location",
    8: "I-location",
    9: "B-person",
    10: "I-person",
    11: "B-product",
    12: "I-product",
}

label2id = {
    "O": 0,
    "B-corporation": 1,
    "I-corporation": 2,
    "B-creative-work": 3,
    "I-creative-work": 4,
    "B-group": 5,
    "I-group": 6,
    "B-location": 7,
    "I-location": 8,
    "B-person": 9,
    "I-person": 10,
    "B-product": 11,
    "I-product": 12,
}

Now you're ready to start training your model! Load DistilBERT with [AutoModelForTokenClassification] along with the number of expected labels and the label mappings:

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=13,
    id2label=id2label,
    label2id=label2id)

At this point, only three steps remain:

  • Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You can push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the seqeval scores and save the training checkpoint.
  • Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  • Call [Trainer.train] to finetune your model.
training_args = TrainingArguments(
    output_dir="my_awesome_wnut_model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_wnut["train"],
    eval_dataset=tokenized_wnut["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Once training is completed, share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_train_epochs = 3
num_train_steps = (len(tokenized_wnut["train"]) // batch_size) * num_train_epochs
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0,
)

Then you can load DistilBERT with [TFAutoModelForTokenClassification] along with the number of expected labels and the label mappings:

from transformers import TFAutoModelForTokenClassification

model = TFAutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=13,
    id2label=id2label,
    label2id=label2id
)

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
    tokenized_wnut["train"],
    shuffle=True,
    batch_size=16,
    collate_fn=data_collator,
)
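The Keras callbacks below also reference a tf_validation_set. WNUT 17 ships a validation split, so you can prepare it with the same pattern (a sketch mirroring the block above):

tf_validation_set = model.prepare_tf_dataset(
    tokenized_wnut["validation"],
    shuffle=False,
    batch_size=16,
    collate_fn=data_collator,
)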

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are to compute the seqeval scores from the predictions, and to provide a way to upload your model to the Hub. Both are done by using Keras callbacks.
Pass your compute_metrics function to [transformers.KerasMetricCallback]:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback
push_to_hub_callback = PushToHubCallback(
    output_dir="my_awesome_wnut_model",
    tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Great, now that you've finetuned a model, you can use it for inference! Grab some text you'd like to run inference on:

text = "The Golden State Warriors are an American professional basketball team based in San Francisco."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for NER with your model, and pass your text to it:

from transformers import pipeline

classifier = pipeline("ner", model="stevhliu/my_awesome_wnut_model")
classifier(text)

The result is:

[
{'entity': 'B-location',
'score': 0.42658573,
'index': 2,
'word': 'golden',
'start': 4,
'end': 10},
{'entity': 'I-location',
'score': 0.35856336,
'index': 3,
'word': 'state',
'start': 11,
'end': 16},
{'entity': 'B-group',
'score': 0.3064001,
'index': 4,
 'word': 'warriors',
'start': 17,
'end': 25},
{'entity': 'B-location',
'score': 0.65523505,
'index': 13,
'word': 'san',
'start': 80,
'end': 83},
{'entity': 'B-location',
'score': 0.4668663,
'index': 14,
'word': 'francisco',
'start': 84,
'end': 93}
]

You can also manually replicate the results of the pipeline if you'd like:
Tokenize the text and return PyTorch tensors:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="pt")

Pass your inputs to the model and return the logits:

import torch
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
with torch.no_grad():
    logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

predictions = torch.argmax(logits, dim=2)
predicted_token_class = [model.config.id2label[t.item()] for t in predictions[0]]
predicted_token_class

The result is:

['O',
'O',
'B-location',
'I-location',
'B-group',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'O',
'B-location',
'B-location',
'O',
'O']

Tokenize the text and return TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_wnut_model")
inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the logits:

from transformers import TFAutoModelForTokenClassification
model = TFAutoModelForTokenClassification.from_pretrained("stevhliu/my_awesome_wnut_model")
logits = model(**inputs).logits

Get the class with the highest probability, and use the model's id2label mapping to convert it to a text label:

predicted_token_class_ids = tf.math.argmax(logits, axis=-1)
predicted_token_class = [model.config.id2label[t] for t in predicted_token_class_ids[0].numpy().tolist()]
predicted_token_class

The result is:

 ['O',
 'O',
 'B-location',
 'I-location',
 'B-group',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'O',
 'B-location',
 'B-location',
 'O',
 'O']
Question answering

Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri, or Google what the weather is, then you've used a question answering model before. There are two common types of question answering tasks (see the quick example after this list):

  • Extractive: extract the answer from the given context.
  • Abstractive: generate an answer from the context that correctly answers the question.
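To get a quick feel for extractive question answering before finetuning anything, you can try the default question-answering pipeline. This is a minimal sketch; the checkpoint it downloads and the exact score depend on your Transformers version:

from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default extractive QA model
qa(question="Where do I live?", context="My name is Tim and I live in Sweden.")
# e.g. {'answer': 'Sweden', 'score': 0.9..., 'start': 29, 'end': 35}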

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

Split the dataset's train split into a train and test set with the [datasets.Dataset.train_test_split] method:

squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

squad["train"][0]
# {'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
#  'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
#  'id': '5733be284776f41900661182',
#  'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
#  'title': 'University_of_Notre_Dame'
# }

There are several important fields here:

  • answers: the starting location of the answer token and the answer text.
  • context: background information from which the model needs to extract the answer.
  • question: the question a model should answer.
Preprocess

The next step is to load a DistilBERT tokenizer to process the question and context fields:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

There are a few preprocessing steps particular to question answering tasks you should be aware of:

  1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
  2. Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
  3. With the mapping in hand, you can find the start and end tokens of the answer. Use the [tokenizers.Encoding.sequence_ids] method to find which part of the offset corresponds to the question and which corresponds to the context.

Here is how you can create a function to truncate and map the start and end tokens of the answer to the context:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs

To apply the preprocessing function over the entire dataset, use the Datasets [datasets.Dataset.map] function. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once. Remove any columns you don't need:

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)

Now create a batch of examples using [DefaultDataCollator]. Unlike other data collators in Transformers, the [DefaultDataCollator] does not apply any additional preprocessing such as padding.

# framework = "pt"
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

# framework = "tf"
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator(return_tensors="tf")
Train
PyTorch:

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilBERT with [AutoModelForQuestionAnswering]:

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

At this point, only three steps remain:

  • Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You can push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  • Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, and data collator.
  • Call [Trainer.train] to finetune your model.
training_args = TrainingArguments(
     output_dir="my_awesome_qa_model",
     evaluation_strategy="epoch",
     learning_rate=2e-5,
     per_device_train_batch_size=16,
     per_device_eval_batch_size=16,
     num_train_epochs=3,
     weight_decay=0.01,
     push_to_hub=True,
 )

trainer = Trainer(
     model=model,
     args=training_args,
     train_dataset=tokenized_squad["train"],
     eval_dataset=tokenized_squad["test"],
     tokenizer=tokenizer,
     data_collator=data_collator,
)

trainer.train()

Once training is completed, share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()
TensorFlow:

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_epochs = 2
total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
optimizer, schedule = create_optimizer(
     init_lr=2e-5,
     num_warmup_steps=0,
     num_train_steps=total_train_steps,
)

Then load DistilBERT with [TFAutoModelForQuestionAnswering]:

from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
     tokenized_squad["train"],
     shuffle=True,
     batch_size=16,
     collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
     tokenized_squad["test"],
     shuffle=False,
     batch_size=16,
     collate_fn=data_collator,
)

Configure the model for training with compile:

import tensorflow as tf

model.compile(optimizer=optimizer)

The last thing to set up before you start training is to provide a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
     output_dir="my_awesome_qa_model",
     tokenizer=tokenizer,
)

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Great, now that you've finetuned a model, you can use it for inference! Come up with a question and some context you'd like the model to predict:
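For example, the following question and context mirror the example the expected pipeline output below was produced from (substitute your own):

question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."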

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for question answering with your model, and pass your text to it:

from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)
# {'score': 0.2058267742395401,
#  'start': 10,
#  'end': 95,
#  'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}

You can also manually replicate the results of the pipeline if you'd like:

PyTorch:

Tokenize the text and return PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the logits:

import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
TensorFlow:

Tokenize the text and return TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="tf")

Pass your inputs to the model and return the logits:

from transformers import TFAutoModelForQuestionAnswering

model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
outputs = model(**inputs)

Get the highest probability from the model output for the start and end positions:

answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])

Decode the predicted tokens to get the answer:

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
Causal language modeling

There are two types of language modeling, causal and masked. This guide illustrates causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications like choosing your own text adventure, or for an intelligent coding assistant like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.
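Concretely, the labels for a causal LM are the input ids themselves; the model shifts them internally so that position t is trained to predict token t+1. A minimal sketch of the training loss (using distilgpt2 purely for illustration):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

enc = tokenizer("Sci-fi movies are usually underfunded", return_tensors="pt")
# Passing labels=input_ids makes the model compute the shifted next-token cross-entropy internally.
outputs = model(**enc, labels=enc["input_ids"])
print(outputs.loss)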

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the [datasets.Dataset.train_test_split] method:

eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

eli5["train"][0]

The result is:

{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task), because the next word is the label.

Preprocess

The next step is to load a DistilGPT2 tokenizer to process the text subfield:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the [datasets.Dataset.flatten] method:

eli5 = eli5.flatten()
eli5["train"][0]

The output is:

{
 'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []
}

Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

def preprocess_function(examples):
    return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the Datasets [datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

tokenized_eli5 = eli5.map(
     preprocess_function,
     batched=True,
     num_proc=4,
     remove_columns=eli5["train"].column_names,
)

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; you could add padding instead of dropping it,
    # and customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

Apply the group_texts function over the entire dataset:

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using [DataCollatorForLanguageModeling]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Use the end-of-sequence token as the padding token and set mlm=False. This will use the inputs as labels shifted to the right by one element:

# framework = "pt"
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

For TensorFlow, do the same but return TensorFlow tensors:

# framework = "tf"
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
Train

You're ready to start training your model now! Load DistilGPT2 with [AutoModelForCausalLM]:

from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")

At this point, only three steps remain:

  • Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You can push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
  • Pass the training arguments to [Trainer] along with the model, datasets, and data collator.
  • Call [Trainer.train] to finetune your model.
training_args = TrainingArguments(
     output_dir="my_awesome_eli5_clm-model",
     evaluation_strategy="epoch",
     learning_rate=2e-5,
     weight_decay=0.01,
     push_to_hub=True,
)

trainer = Trainer(
     model=model,
     args=training_args,
     train_dataset=lm_dataset["train"],
     eval_dataset=lm_dataset["test"],
     data_collator=data_collator,
)

trainer.train()

Once training is completed, use the [~transformers.Trainer.evaluate] method to evaluate your model and get its perplexity:

import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

The output is:

Perplexity: 49.61

Then share your model to the Hub with the [~transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilGPT2 with [TFAutoModelForCausalLM]:

from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
     lm_dataset["train"],
     shuffle=True,
     batch_size=16,
     collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
     lm_dataset["test"],
     shuffle=False,
     batch_size=16,
     collate_fn=data_collator,
)

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

Before you start training, set up a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
     output_dir="my_awesome_eli5_clm-model",
     tokenizer=tokenizer,
)

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Great, now that you've finetuned a model, you can use it for inference! Come up with a prompt you'd like to generate text from:

prompt = "Somatic hypermutation allows the immune system to"

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for text generation with your model, and pass your text to it:

from transformers import pipeline

generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
generator(prompt)

The output is:

[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]

Tokenize the text and return the input_ids as PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="pt").input_ids

Use the [transformers.generation_utils.GenerationMixin.generate] method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the text generation strategies page.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

tokenizer.batch_decode(outputs, skip_special_tokens=True)

The output is:

["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]

Tokenize the text and return the input_ids as TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
inputs = tokenizer(prompt, return_tensors="tf").input_ids

Use the [transformers.generation_tf_utils.TFGenerationMixin.generate] method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the text generation strategies page.

from transformers import TFAutoModelForCausalLM

model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token ids back into text:

tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
Masked language modeling

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.
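To see the idea in action before finetuning, you can query an off-the-shelf masked language model (a quick sketch; the exact predictions depend on the checkpoint):

from transformers import pipeline

fill = pipeline("fill-mask", model="distilroberta-base")
fill("The capital of France is <mask>.", top_k=1)
# e.g. [{'token_str': ' Paris', 'score': 0.9..., ...}]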
Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the Datasets library. This will give you a chance to experiment and make sure everything works before spending more time training on the full dataset.

from datasets import load_dataset

eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the [datasets.Dataset.train_test_split] method:

eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

eli5["train"][0]

The output is:

{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is you don't need labels (also known as an unsupervised task), because the next word is the label.

Preprocess

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the [flatten] method:

eli5 = eli5.flatten()
eli5["train"][0]

The output is:

{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column as indicated by the answers prefix, and the text field is a list now. Instead of tokenizing each sentence separately, convert the list to a string so you can jointly tokenize them.

Here is a first preprocessing function to join the list of strings for each example and tokenize the result:

def preprocess_function(examples):
     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the [datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

tokenized_eli5 = eli5.map(
     preprocess_function,
     batched=True,
     num_proc=4,
     remove_columns=eli5["train"].column_names,
)

This dataset contains the token sequences, but some of these are longer than the maximum input length for the model.

You can now use a second preprocessing function to:

  • concatenate all the sequences
  • split the concatenated sequences into shorter chunks defined by block_size, which should be both shorter than the maximum input length and short enough for your GPU RAM
block_size = 128

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; you could add padding instead of dropping it
    # (if the model supports it), and customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

Apply the group_texts function over the entire dataset:

lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

Now create a batch of examples using [DataCollatorForLanguageModeling]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Use the end-of-sequence token as the padding token and specify mlm_probability to randomly mask tokens each time you iterate over the data:

# framework = "pt"
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

For TensorFlow, do the same but return TensorFlow tensors:

# framework = "tf"
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
Train

If you aren't familiar with finetuning a model with the [Trainer], take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilRoBERTa with [AutoModelForMaskedLM]:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

Only three steps remain:

  • Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model.
  • Pass the training arguments to [Trainer] along with the model, datasets, and data collator.
  • Call [Trainer.train] to finetune your model.
training_args = TrainingArguments(
     output_dir="my_awesome_eli5_mlm_model",
     evaluation_strategy="epoch",
     learning_rate=2e-5,
     num_train_epochs=3,
     weight_decay=0.01,
     push_to_hub=True,
)

trainer = Trainer(
     model=model,
     args=training_args,
     train_dataset=lm_dataset["train"],
     eval_dataset=lm_dataset["test"],
     data_collator=data_collator,
)

trainer.train()

Once training is completed, use the [transformers.Trainer.evaluate] method to evaluate your model and get its perplexity:

import math

eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilRoBERTa with [TFAutoModelForMaskedLM]:

from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
     lm_dataset["train"],
     shuffle=True,
     batch_size=16,
     collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
     lm_dataset["test"],
     shuffle=False,
     batch_size=16,
     collate_fn=data_collator,
)

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

Before you start training, set up a way to push your model to the Hub. This can be done by specifying where to push your model and tokenizer in the [~transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
     output_dir="my_awesome_eli5_mlm_model",
     tokenizer=tokenizer,
)

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
Inference

Great, now that you've finetuned a model, you can use it for inference! Come up with some text you'd like the model to fill in the blank with, and use the special <mask> token to indicate the blank:

text = "The Milky Way is a <mask> galaxy."

The simplest way to try out your finetuned model for inference is to use it in a pipeline. Instantiate a pipeline for fill-mask with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:

from transformers import pipeline

mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model")
mask_filler(text, top_k=3)

The result is:

[{'score': 0.5150994658470154,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.07087188959121704,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06434620916843414,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'}]

Tokenize the text and return the input_ids as PyTorch tensors. You'll also need to specify the position of the <mask> token:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the logits of the masked token:

from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked-token candidates with the highest probability and print them out:

top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

for token in top_3_tokens:
     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The output is:

The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.

Tokenize the text and return the input_ids as TensorFlow tensors. You'll also need to specify the position of the <mask> token:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
inputs = tokenizer(text, return_tensors="tf")
mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

Pass your inputs to the model and return the logits of the masked token:

from transformers import TFAutoModelForMaskedLM

model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
logits = model(**inputs).logits
mask_token_logits = logits[0, mask_token_index, :]

Then return the three masked-token candidates with the highest probability and print them out:

top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

for token in top_3_tokens:
     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))

The output is:

The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
Translation

Translation converts a sequence of text from one language to another. It is one of several tasks you can formulate as a sequence-to-sequence problem, a powerful framework for returning some output from an input, such as translation or summarization. Translation systems are commonly used for translation between different language texts, but they can also be used for speech, or for some combination in between like text-to-speech or speech-to-text.
Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate sacrebleu
Load OPUS Books dataset

Start by loading the English-French subset of the OPUS Books dataset from the Datasets library:

from datasets import load_dataset

books = load_dataset("opus_books", "en-fr")

Split the dataset into a train and test set with the [datasets.Dataset.train_test_split] method:

books = books["train"].train_test_split(test_size=0.2)

Then take a look at an example:

books["train"][0]

The result is:

{'id': '90560',
 'translation': {'en': 'But this lofty plateau measured only a few fathoms, and soon we reentered Our Element.',
  'fr': 'Mais ce plateau élevé ne mesurait que quelques toises, et bientôt nous fûmes rentrés dans notre élément.'}}

There is one field here:

  • translation: an English and French translation of the text.
Preprocess

The next step is to load a T5 tokenizer to process the English-French language pairs:

from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

  • Prefix the input with a prompt so T5 knows this is a translation task. Some models capable of multiple NLP tasks require prompting for specific tasks.
  • Set the text_target parameter to the French translations so the tokenizer processes the targets as the target language; otherwise the targets would be tokenized as if they were English.
  • Truncate sequences to be no longer than the maximum length set by the max_length parameter.
source_lang = "en"
target_lang = "fr"
prefix = "translate English to French: "

def preprocess_function(examples):
    inputs = [prefix + example[source_lang] for example in examples["translation"]]
    targets = [example[target_lang] for example in examples["translation"]]
    model_inputs = tokenizer(inputs, text_target=targets, max_length=128, truncation=True)
    return model_inputs

preprocess_function 函数应用到整个数据集,可以使用 Datasets [datasets.Dataset.map] 方法。为了让 map函数执行加速,从而一次处理数据集中的多个元素,可以通过设置 batched=True 来实现:

tokenized_books = books.map(preprocess_function, batched=True)

Now create a batch of examples using DataCollatorForSeq2Seq. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

# framework = "pt"
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

# framework = "tf"
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the SacreBLEU metric (see the Evaluate quick tour to learn more about how to load and compute a metric):

import evaluate

metric = evaluate.load("sacrebleu")

Then create a function that passes your predictions and labels to [evaluate.EvaluationModule.compute] to calculate the SacreBLEU score:

import numpy as np

def postprocess_text(preds, labels):
     preds = [pred.strip() for pred in preds]
     labels = [[label.strip()] for label in labels]

     return preds, labels


def compute_metrics(eval_preds):
     preds, labels = eval_preds
     if isinstance(preds, tuple):
         preds = preds[0]
     decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

     decoded_preds, decoded_labels = postprocess_text(decoded_preds, decoded_labels)

     result = metric.compute(predictions=decoded_preds, references=decoded_labels)
     result = {"bleu": result["score"]}

     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in preds]
     result["gen_len"] = np.mean(prediction_lens)
     result = {k: round(v, 4) for k, v in result.items()}
     return result

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train
PyTorch:
You're ready to start training your model now! Load T5 with AutoModelForSeq2SeqLM:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
  
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

At this point, only three steps remain:

  • Define your training hyperparameters in [Seq2SeqTrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. You can push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the SacreBLEU metric and save the training checkpoint.
  • Pass the training arguments to [Seq2SeqTrainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  • Call [Trainer.train] to finetune your model.
training_args = Seq2SeqTrainingArguments(
     output_dir="my_awesome_opus_books_model",
     evaluation_strategy="epoch",
     learning_rate=2e-5,
     per_device_train_batch_size=16,
     per_device_eval_batch_size=16,
     weight_decay=0.01,
     save_total_limit=3,
     num_train_epochs=2,
     predict_with_generate=True,
     fp16=True,
     push_to_hub=True,
)

trainer = Seq2SeqTrainer(
     model=model,
     args=training_args,
     train_dataset=tokenized_books["train"],
     eval_dataset=tokenized_books["test"],
     tokenizer=tokenizer,
     data_collator=data_collator,
     compute_metrics=compute_metrics,
)

trainer.train()

Once training is completed, share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()
TensorFlow:
To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
from transformers import AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load T5 with [TFAutoModelForSeq2SeqLM]:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
     tokenized_books["train"],
     shuffle=True,
     batch_size=16,
     collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
     tokenized_books["test"],
     shuffle=False,
     batch_size=16,
     collate_fn=data_collator,
)

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument!

The last two things to set up before you start training are to compute the SacreBLEU metric from the predictions, and to provide a way to push your model to the Hub. Both are done by using Keras callbacks.

Pass your compute_metrics function to [transformers.KerasMetricCallback]:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

Specify where to push your model and tokenizer in [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
     output_dir="my_awesome_opus_books_model",
     tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to finetune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Great, now that you've finetuned a model, you can use it for inference! Come up with some text you'd like to translate to another language. For T5, you need to prefix your input depending on the task you're working on. For translation from English to French, you should prefix your input as shown below:

text = "translate English to French: Legumes share resources with nitrogen-fixing bacteria."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for translation with your model, and pass your text to it:

from transformers import pipeline

translator = pipeline("translation", model="my_awesome_opus_books_model")
translator(text)

The result is:

[{'translation_text': 'Legumes partagent des ressources avec des bactéries azotantes.'}]

You can also manually replicate the results of the pipeline if you'd like:

PyTorch:
Tokenize the text and return the input_ids as PyTorch tensors:
from transformers import AutoTokenizer
  
tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [transformers.generation_utils.GenerationMixin.generate] method to create the translation.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

Decode the generated token ids back into text:

tokenizer.decode(outputs[0], skip_special_tokens=True)
TensorFlow:
Tokenize the text and return the input_ids as TensorFlow tensors:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_opus_books_model")
inputs = tokenizer(text, return_tensors="tf").input_ids

Use the [transformers.generation_tf_utils.TFGenerationMixin.generate] method to create the translation:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("my_awesome_opus_books_model")
outputs = model.generate(inputs, max_new_tokens=40, do_sample=True, top_k=30, top_p=0.95)

Decode the generated token ids back into text:

tokenizer.decode(outputs[0], skip_special_tokens=True)
Summarization

Summarization creates a shorter version of a document or article that captures all the important information. Like translation, it is another example of a task that can be formulated as a sequence-to-sequence task. Summarization can be:

  • Extractive: extract the most relevant information from a document.
  • Abstractive: generate new text that captures the most relevant information.
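
As a toy illustration (the strings below are invented for this example, not drawn from any dataset):

document = (
    "Solar panels convert sunlight into electricity. "
    "They have become much cheaper over the past decade. "
    "Many households now install them on their roofs."
)
extractive_summary = "Solar panels convert sunlight into electricity."   # sentence copied verbatim from the document
abstractive_summary = "Cheap solar power is becoming common in homes."   # newly written text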

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate rouge_score
Load BillSum dataset

Start by loading the smaller California state bill subset of the BillSum dataset from the Datasets library:

from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

Split the dataset into a train and test set with the [datasets.Dataset.train_test_split] method:

billsum = billsum.train_test_split(test_size=0.2)

Then take a look at an example:

billsum["train"][0]

The result is:

{
  'summary': 'Existing law authorizes state agencies to enter into contracts for the acquisition of goods or services upon approval by the Department of General Services. Existing law sets forth various requirements and prohibitions for those contracts, including, but not limited to, a prohibition on entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between spouses and domestic partners or same-sex and different-sex couples in the provision of benefits. Existing law provides that a contract entered into in violation of those requirements and prohibitions is void and authorizes the state or any person acting on behalf of the state to bring a civil action seeking a determination that a contract is in violation and therefore void. Under existing law, a willful violation of those requirements and prohibitions is a misdemeanor.\nThis bill would also prohibit a state agency from entering into contracts for the acquisition of goods or services of $100,000 or more with a contractor that discriminates between employees on the basis of gender identity in the provision of benefits, as specified. By expanding the scope of a crime, this bill would impose a state-mandated local program.\nThe California Constitution requires the state to reimburse local agencies and school districts for certain costs mandated by the state. Statutory provisions establish procedures for making that reimbursement.\nThis bill would provide that no reimbursement is required by this act for a specified reason.',
   'text': 'The people of the State of California do enact as follows:\n\n\nSECTION 1.\nSection 10295.35 is added to the Public Contract Code, to read:\n10295.35.\n(a) (1) Notwithstanding any other law, a state agency shall not enter into any contract for the acquisition of goods or services in the amount of one hundred thousand dollars ($100,000) or more with a contractor that, in the provision of benefits, discriminates between employees on the basis of an employee’s or dependent’s actual or perceived gender identity, including, but not limited to, the employee’s or dependent’s identification as transgender.\n(2) For purposes of this section, “contract” includes contracts with a cumulative amount of one hundred thousand dollars ($100,000) or more per contractor in each fiscal year.\n(3) For purposes of this section, an employee health plan is discriminatory if the plan is not consistent with Section 1365.5 of the Health and Safety Code and Section 10140 of the Insurance Code.\n(4) The requirements of this section shall apply only to those portions of a contractor’s operations that occur under any of the following conditions:\n(A) Within the state.\n(B) On real property outside the state if the property is owned by the state or if the state has a right to occupy the property, and if the contractor’s presence at that location is connected to a contract with the state.\n(C) Elsewhere in the United States where work related to a state contract is being performed.\n(b) Contractors shall treat as confidential, to the maximum extent allowed by law or by the requirement of the contractor’s insurance provider, any request by an employee or applicant for employment benefits or any documentation of eligibility for benefits submitted by an employee or applicant for employment.\n(c) After taking all reasonable measures to find a contractor that complies with this section, as determined by the state agency, the requirements of this section may be waived under any of the following circumstances:\n(1) There is only one prospective contractor willing to enter into a specific contract with the state agency.\n(2) The contract is necessary to respond to an emergency, as determined by the state agency, that endangers the public health, welfare, or safety, or the contract is necessary for the provision of essential services, and no entity that complies with the requirements of this section capable of responding to the emergency is immediately available.\n(3) The requirements of this section violate, or are inconsistent with, the terms or conditions of a grant, subvention, or agreement, if the agency has made a good faith attempt to change the terms or conditions of any grant, subvention, or agreement to authorize application of this section.\n(4) The contractor is providing wholesale or bulk water, power, or natural gas, the conveyance or transmission of the same, or ancillary services, as required for ensuring reliable services in accordance with good utility practice, if the purchase of the same cannot practically be accomplished through the standard competitive bidding procedures and the contractor is not providing direct retail services to end users.\n(d) (1) A contractor shall not be deemed to discriminate in the provision of benefits if the contractor, in providing the benefits, pays the actual costs incurred in obtaining the benefit.\n(2) If a contractor is unable to provide a certain benefit, despite taking reasonable measures to do so, the contractor shall not be deemed to discriminate in the 
provision of benefits.\n(e) (1) Every contract subject to this chapter shall contain a statement by which the contractor certifies that the contractor is in compliance with this section.\n(2) The department or other contracting agency shall enforce this section pursuant to its existing enforcement powers.\n(3) (A) If a contractor falsely certifies that it is in compliance with this section, the contract with that contractor shall be subject to Article 9 (commencing with Section 10420), unless, within a time period specified by the department or other contracting agency, the contractor provides to the department or agency proof that it has complied, or is in the process of complying, with this section.\n(B) The application of the remedies or penalties contained in Article 9 (commencing with Section 10420) to a contract subject to this chapter shall not preclude the application of any existing remedies otherwise available to the department or other contracting agency under its existing enforcement powers.\n(f) Nothing in this section is intended to regulate the contracting practices of any local jurisdiction.\n(g) This section shall be construed so as not to conflict with applicable federal laws, rules, or regulations. In the event that a court or agency of competent jurisdiction holds that federal law, rule, or regulation invalidates any clause, sentence, paragraph, or section of this code or the application thereof to any person or circumstances, it is the intent of the state that the court or agency sever that clause, sentence, paragraph, or section so that the remainder of this section shall remain in effect.\nSEC. 2.\nSection 10295.35 of the Public Contract Code shall not be construed to create any new enforcement authority or responsibility in the Department of General Services or any other contracting agency.\nSEC. 3.\nNo reimbursement is required by this act pursuant to Section 6 of Article XIII\u2009B of the California Constitution because the only costs that may be incurred by a local agency or school district will be incurred because this act creates a new crime or infraction, eliminates a crime or infraction, or changes the penalty for a crime or infraction, within the meaning of Section 17556 of the Government Code, or changes the definition of a crime within the meaning of Section 6 of Article XIII\u2009B of the California Constitution.',
 'title': 'An act to add Section 10295.35 to the Public Contract Code, relating to public contracts.'}

where:

  • text: the input text of the bill.
  • summary: a summary distilled from the input text, which is the target.

Preprocess

The next step is to load a T5 tokenizer to process text and summary:

from transformers import AutoTokenizer

checkpoint = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The preprocessing function you want to create needs to:

  • Prefix the input with a prompt so T5 knows this is a summarization task. Some models capable of multiple NLP tasks require prompting for specific tasks.

  • Use the keyword text_target argument when tokenizing labels.

  • Truncate sequences to be no longer than the maximum length set by the max_length parameter.

prefix = "summarize: "

def preprocess_function(examples):
     inputs = [prefix + doc for doc in examples["text"]]
     model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

     labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

     model_inputs["labels"] = labels["input_ids"]
     return model_inputs
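
A quick smoke test on a tiny batch can confirm the output shape (illustrative; slicing a Dataset returns a dict of lists, which matches the batched format used by map):

sample = billsum["train"][:2]
out = preprocess_function(sample)
print(len(out["input_ids"]), len(out["labels"]))  # 2 2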

To apply the preprocessing function over the entire dataset, use the Datasets [datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_billsum = billsum.map(preprocess_function, batched=True)

Now create a batch of examples using [DataCollatorForSeq2Seq]. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

# framework = "pt"
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint)

# framework = "tf"
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=checkpoint, return_tensors="tf")
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the ROUGE metric:

import evaluate

rouge = evaluate.load("rouge")

Then create a function that passes your predictions and labels to [evaluate.EvaluationModule.compute] to calculate the ROUGE metric:

import numpy as np

def compute_metrics(eval_pred):
     predictions, labels = eval_pred
     decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
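     # -100 marks padded label positions (ignored by the loss); swap them back to the pad token id so they can be decoded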
     labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
     decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

     result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

     prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
     result["gen_len"] = np.mean(prediction_lens)

     return {k: round(v, 4) for k, v in result.items()}

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

You're ready to start training your model now! Load T5 with [AutoModelForSeq2SeqLM]:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

At this point, only three steps remain:

  • Define your training hyperparameters in [Seq2SeqTrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. Push the model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the ROUGE metric and save the training checkpoint.
  • Pass the training arguments to [Seq2SeqTrainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  • Call [Trainer.train] to fine-tune your model.
training_args = Seq2SeqTrainingArguments(
     output_dir="my_awesome_billsum_model",
     evaluation_strategy="epoch",
     learning_rate=2e-5,
     per_device_train_batch_size=16,
     per_device_eval_batch_size=16,
     weight_decay=0.01,
     save_total_limit=3,
     num_train_epochs=4,
     predict_with_generate=True,
     fp16=True,
     push_to_hub=True,
)

trainer = Seq2SeqTrainer(
     model=model,
     args=training_args,
     train_dataset=tokenized_billsum["train"],
     eval_dataset=tokenized_billsum["test"],
     tokenizer=tokenizer,
     data_collator=data_collator,
     compute_metrics=compute_metrics,
)

trainer.train()

Once training is completed, share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

To fine-tune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer, AdamWeightDecay

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load T5 with [TFAutoModelForSeq2SeqLM]:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(checkpoint)

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

tf_train_set = model.prepare_tf_dataset(
     tokenized_billsum["train"],
     shuffle=True,
     batch_size=16,
     collate_fn=data_collator,
)

tf_test_set = model.prepare_tf_dataset(
     tokenized_billsum["test"],
     shuffle=False,
     batch_size=16,
     collate_fn=data_collator,
)

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

import tensorflow as tf

model.compile(optimizer=optimizer)  # No loss argument needed!

The last two things to set up before you start training are to compute the ROUGE score from the predictions, and to provide a way to push your model to the Hub. Both are done with Keras callbacks.

Pass your compute_metrics function to [transformers.KerasMetricCallback]:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_test_set)

Specify where to push your model and tokenizer in [transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
     output_dir="my_awesome_billsum_model",
     tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Now that you've fine-tuned a model, you can use it for inference! Come up with some text you'd like to summarize. For T5, you need to prefix your input depending on the task you're working on. For summarization, prefix your input as shown below:

text = "summarize: The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country. It'll lower the deficit and ask the ultra-wealthy and corporations to pay their fair share. And no one making under $400,000 per year will pay a penny more in taxes."

The simplest way to try out your fine-tuned model for inference is to use it in a [pipeline]. Instantiate a pipeline for summarization with your model, and pass your text to it:

from transformers import pipeline

summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model")
summarizer(text)

The result is:

[{"summary_text": "The Inflation Reduction Act lowers prescription drug costs, health care costs, and energy costs. It's the most aggressive action on tackling the climate crisis in American history, which will lift up American workers and create good-paying, union jobs across the country."}]

You can also manually replicate the results of the pipeline if you'd like. Tokenize the text and return the input_ids as PyTorch tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="pt").input_ids

Use the [transformers.generation_utils.GenerationMixin.generate] method to create the summary:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

tokenizer.decode(outputs[0], skip_special_tokens=True)
"the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share."

Tokenize the text and return the input_ids as TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_billsum_model")
inputs = tokenizer(text, return_tensors="tf").input_ids

Use the [transformers.generation_tf_utils.TFGenerationMixin.generate] method to create the summary:

from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained("stevhliu/my_awesome_billsum_model")
outputs = model.generate(inputs, max_new_tokens=100, do_sample=False)

Decode the generated token ids back into text:

tokenizer.decode(outputs[0], skip_special_tokens=True)
"the inflation reduction act lowers prescription drug costs, health care costs, and energy costs. it's the most aggressive action on tackling the climate crisis in american history. it will ask the ultra-wealthy and corporations to pay their fair share."
Multiple choice

A multiple choice task is similar to question answering, except that several candidate answers are provided along with a context, and the model is trained to select the correct answer.
Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate
Load SWAG dataset

Start by loading the regular configuration of the SWAG dataset from the Datasets library:

from datasets import load_dataset

swag = load_dataset("swag", "regular")

Then take a look at an example:

swag["train"][0]

The result is:

{'ending0': 'passes by walking down the street playing their instruments.',
 'ending1': 'has heard approaching them.',
 'ending2': "arrives and they're outside dancing and asleep.",
 'ending3': 'turns the lead singer watches the performance.',
 'fold-ind': '3416',
 'gold-source': 'gold',
 'label': 0,
 'sent1': 'Members of the procession walk down the street holding small horn brass instruments.',
 'sent2': 'A drum line',
 'startphrase': 'Members of the procession walk down the street holding small horn brass instruments. A drum line',
 'video-id': 'anetv_jkn6uvmqwh4'}

While it looks like there are a lot of fields here, it's actually pretty straightforward:

  • sent1 and sent2: these fields show how a sentence starts, and if you put the two together, you get the startphrase field.
  • ending0, ending1, ending2, ending3: suggest possible ways the sentence can end, but only one of them is correct.
  • label: identifies the correct sentence ending (see the quick check below).
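
These relationships are easy to verify on the example above (an illustrative check; it assumes the example loaded earlier):

example = swag["train"][0]
assert example["sent1"] + " " + example["sent2"] == example["startphrase"]
print(example[f"ending{example['label']}"])  # the gold ending, here ending0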
Preprocess

Next, load a BERT tokenizer to process the sentence starts and the four possible endings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

The preprocessing function you want to create needs to:

  1. Make four copies of the sent1 field and combine each of them with sent2 to recreate how a sentence starts.
  2. Combine sent2 with each of the four possible sentence endings.
  3. Flatten these two lists so you can tokenize them, and then unflatten them afterward so each example has corresponding input_ids, attention_mask, and labels fields.
ending_names = ["ending0", "ending1", "ending2", "ending3"]

def preprocess_function(examples):
     first_sentences = [[context] * 4 for context in examples["sent1"]]
     question_headers = examples["sent2"]
     second_sentences = [
         [f"{header} {examples[end][i]}" for end in ending_names] for i, header in enumerate(question_headers)
     ]

     first_sentences = sum(first_sentences, [])
     second_sentences = sum(second_sentences, [])

     tokenized_examples = tokenizer(first_sentences, second_sentences, truncation=True)
     return {k: [v[i : i + 4] for i in range(0, len(v), 4)] for k, v in tokenized_examples.items()}
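
The dictionary comprehension on the last line regroups the flat tokenized lists back into chunks of four, one chunk per example. A toy illustration of the same unflattening pattern:

flat = ["a0", "a1", "a2", "a3", "b0", "b1", "b2", "b3"]  # two examples x four endings each
unflat = [flat[i : i + 4] for i in range(0, len(flat), 4)]
print(unflat)  # [['a0', 'a1', 'a2', 'a3'], ['b0', 'b1', 'b2', 'b3']]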

Apply the preprocessing function over the entire dataset with the [datasets.Dataset.map] method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_swag = swag.map(preprocess_function, batched=True)

Transformers doesn't have a data collator for multiple choice, so you'll need to adapt [DataCollatorWithPadding] to create a batch of examples. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

DataCollatorForMultipleChoice flattens all the model inputs, applies padding, and then unflattens the results:

"""
# framework = "pt"
"""
from dataclasses import dataclass
from transformers.tokenization_utils_base 
import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union >>> import torch

@dataclass 
class DataCollatorForMultipleChoice: 
 # Data collator that will dynamically pad the inputs for multiple choice received.

 tokenizer: PreTrainedTokenizerBase 
 padding: Union[bool, str, PaddingStrategy] = True 
 max_length: Optional[int] = None 
 pad_to_multiple_of: Optional[int] = None

 def call(self, features): 
 label_name = "label" if "label" in features[0].keys() else "labels" 
 labels = [feature.pop(label_name) for feature in features] 
 batch_size = len(features) 
 num_choices = len(features[0]["input_ids"]) 
 flattened_features = [ 
 [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features 
 ] 
 flattened_features = sum(flattened_features, [])

 batch = self.tokenizer.pad( 
 flattened_features, 
 padding=self.padding, 
 max_length=self.max_length, 
 pad_to_multiple_of=self.pad_to_multiple_of, 
 return_tensors="pt", 
 )

 batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()} 
 batch["labels"] = torch.tensor(labels, dtype=torch.int64) 
 return batch
 
"""
# framework = "tf"
"""
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import tensorflow as tf

@dataclass
class DataCollatorForMultipleChoice:
     """
     Data collator that will dynamically pad the inputs for multiple choice received.
     """

     tokenizer: PreTrainedTokenizerBase
     padding: Union[bool, str, PaddingStrategy] = True
     max_length: Optional[int] = None
     pad_to_multiple_of: Optional[int] = None

     def __call__(self, features):
         label_name = "label" if "label" in features[0].keys() else "labels"
         labels = [feature.pop(label_name) for feature in features]
         batch_size = len(features)
         num_choices = len(features[0]["input_ids"])
         flattened_features = [
             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
         ]
         flattened_features = sum(flattened_features, [])

         batch = self.tokenizer.pad(
             flattened_features,
             padding=self.padding,
             max_length=self.max_length,
             pad_to_multiple_of=self.pad_to_multiple_of,
             return_tensors="tf",
         )

         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
         return batch
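
Either version of the collator can be smoke-tested on a couple of processed examples (an illustrative sketch; it assumes tokenized_swag and tokenizer from above):

features = [
    {k: tokenized_swag["train"][i][k] for k in ("input_ids", "attention_mask", "label")}
    for i in range(2)
]
batch = DataCollatorForMultipleChoice(tokenizer=tokenizer)(features)
print(batch["input_ids"].shape)  # (batch_size, num_choices, padded_len) = (2, 4, ...)
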
Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the Evaluate library. For this task, load the accuracy metric:

import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to [evaluate.EvaluationModule.compute] to calculate the accuracy:

import numpy as np


def compute_metrics(eval_pred):
     predictions, labels = eval_pred
     predictions = np.argmax(predictions, axis=1)
     return accuracy.compute(predictions=predictions, references=labels)

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

You're ready to start training your model now! Load BERT with [AutoModelForMultipleChoice]:

from transformers import AutoModelForMultipleChoice, TrainingArguments, Trainer

model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

At this point, only three steps remain:

  • Define your training hyperparameters in [TrainingArguments]. The only required parameter is output_dir, which specifies where to save your model. Push the model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model). At the end of each epoch, the [Trainer] will evaluate the accuracy and save the training checkpoint.
  • Pass the training arguments to [Trainer] along with the model, dataset, tokenizer, data collator, and compute_metrics function.
  • Call [Trainer.train] to fine-tune your model.
training_args = TrainingArguments(
     output_dir="my_awesome_swag_model",
     evaluation_strategy="epoch",
     save_strategy="epoch",
     load_best_model_at_end=True,
     learning_rate=5e-5,
     per_device_train_batch_size=16,
     per_device_eval_batch_size=16,
     num_train_epochs=3,
     weight_decay=0.01,
     push_to_hub=True,
)

trainer = Trainer(
     model=model,
     args=training_args,
     train_dataset=tokenized_swag["train"],
     eval_dataset=tokenized_swag["validation"],
     tokenizer=tokenizer,
     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
     compute_metrics=compute_metrics,
 )

trainer.train()

Once training is completed, share your model to the Hub with the [transformers.Trainer.push_to_hub] method so everyone can use your model:

trainer.push_to_hub()

If you aren't familiar with fine-tuning a model with Keras, take a look at the basic Keras tutorial first.

To fine-tune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

from transformers import create_optimizer

batch_size = 16
num_train_epochs = 2
total_train_steps = (len(tokenized_swag["train"]) // batch_size) * num_train_epochs
optimizer, schedule = create_optimizer(init_lr=5e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
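
As a rough worked example (the exact SWAG train size depends on the dataset version): with about 73,500 training examples, total_train_steps = (73500 // 16) * 2 = 9,186.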

Then you can load BERT with [TFAutoModelForMultipleChoice]:

from transformers import TFAutoModelForMultipleChoice

model = TFAutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

Convert your datasets to the tf.data.Dataset format with [transformers.TFPreTrainedModel.prepare_tf_dataset]:

data_collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
tf_train_set = model.prepare_tf_dataset(
     tokenized_swag["train"],
     shuffle=True,
     batch_size=batch_size,
     collate_fn=data_collator,
 )

tf_validation_set = model.prepare_tf_dataset(
     tokenized_swag["validation"],
     shuffle=False,
     batch_size=batch_size,
     collate_fn=data_collator,
 )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

model.compile(optimizer=optimizer)  # No loss argument needed!

The last two things to set up before you start training are to compute the accuracy from the predictions, and to provide a way to push your model to the Hub. Both are done with Keras callbacks.

Pass your compute_metrics function to [transformers.KerasMetricCallback]:

from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

Specify where to push your model and tokenizer in the [~transformers.PushToHubCallback]:

from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
     output_dir="my_awesome_model",
     tokenizer=tokenizer,
)

Then bundle your callbacks together:

callbacks = [metric_callback, push_to_hub_callback]

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callbacks to fine-tune the model:

model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=2, callbacks=callbacks)

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

Inference

Now that you've fine-tuned a model, you can use it for inference!

Come up with some text and two candidate answers:

prompt = "France has a bread law, Le Décret Pain, with strict rules on what is allowed in a traditional baguette."
candidate1 = "The law does not apply to croissants and brioche."
candidate2 = "The law applies to baguettes."

Tokenize each prompt and candidate answer pair and return PyTorch tensors. You should also create some labels:

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="pt", padding=True)
labels = torch.tensor(0).unsqueeze(0)

Pass the inputs and labels to the model and return the logits:

from transformers import AutoModelForMultipleChoice

model = AutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
outputs = model(**{k: v.unsqueeze(0) for k, v in inputs.items()}, labels=labels)
logits = outputs.logits

Get the class with the highest probability:

predicted_class = logits.argmax().item()
predicted_class

The result is:

0
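
Since predicted_class indexes into the candidate list in the order you tokenized it, you can map it back to the answer text (illustrative):

candidates = [candidate1, candidate2]
print(candidates[predicted_class])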

Tokenize each prompt and candidate answer pair and return TensorFlow tensors:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_swag_model")
inputs = tokenizer([[prompt, candidate1], [prompt, candidate2]], return_tensors="tf", padding=True)

Pass the inputs to the model and return the logits:

import tensorflow as tf
from transformers import TFAutoModelForMultipleChoice

model = TFAutoModelForMultipleChoice.from_pretrained("my_awesome_swag_model")
inputs = {k: tf.expand_dims(v, 0) for k, v in inputs.items()}
outputs = model(inputs)
logits = outputs.logits

Get the class with the highest probability:

predicted_class = int(tf.math.argmax(logits, axis=-1)[0])
predicted_class

The result is:

0
