Hugging Face Transformers Library Manual (3): A Guide to Fine-Tuning Pretrained Models (API: Trainer)
Fine-tuning models with the Hugging Face Transformers library
Fine-tuning a model is, at its core, still training a model; you simply continue training from a pretrained checkpoint.
There are several frameworks you can train with:
- 🤗 Transformers Trainer: in my experience this API feels somewhat narrow and less flexible than native PyTorch. It is workable, though: training loops involve many hyperparameters that are hard to organize cleanly, so native PyTorch code tends to get messy too, and I have yet to see a training-code pattern I am fully happy with.
- TensorFlow + Keras
- native PyTorch
A minimal example
This guide only covers the 🤗 Transformers library. The rough steps are listed below, along with a minimal example (example source):
- Prepare the dataset
# 1. Download a ready-made dataset
from datasets import load_dataset
dataset = load_dataset("yelp_review_full")
>>> dataset["train"][100]
{'label': 0,
'text': 'My expectations for McDonalds are t rarely high. But ...'}
# 2. Define a preprocessor (for text data: tokenize + pad + truncate)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
def tokenize_function(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
# Calling datasets.map() applies the preprocessing function to the whole dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# 3. Optionally, sample a smaller subset for faster training
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))
- Train with transformers.Trainer
The Trainer API supports a wide range of training options and features such as logging, gradient accumulation, and mixed precision.
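For reference, those features map onto TrainingArguments options roughly as follows (a hedged aside, not part of the original example; the parameter names are standard TrainingArguments fields, the values are arbitrary):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="test_trainer",
    logging_steps=50,                # log training metrics every 50 update steps
    gradient_accumulation_steps=4,   # accumulate gradients over 4 mini-batches per update
    fp16=True,                       # mixed-precision training (requires a CUDA GPU)
)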
# 1. Get a model (with the all-purpose from_pretrained())
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)
# 2. Set the training arguments (note: AutoConfig holds model hyperparameters; here we need training hyperparameters, i.e. a TrainingArguments object)
from transformers import TrainingArguments
training_args = TrainingArguments(output_dir="test_trainer")
# 3. Define an evaluation function, so you can see which epoch performs best during training
import numpy as np
import evaluate  # Hugging Face also provides the handy evaluate package
# accuracy is enough for this classification task
metric = evaluate.load("accuracy")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
# 4. Instantiate a Trainer and start training
from transformers import Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=small_train_dataset,
eval_dataset=small_eval_dataset,
compute_metrics=compute_metrics,
)
trainer.train()
From the example above, the key ingredients become clear:
- dataset
- model
- args
- metrics
Keeping metrics and args separate makes sense. In native PyTorch training code, the piece that usually sits closest to the metrics is the optimizer, but the optimizer belongs in args because it is part of training, whereas metrics are not about training at all; they are about evaluation.
- For the dataset: the 🤗 Datasets library is introduced below;
- For the model: write your own class that inherits from PreTrainedModel. Since this is a hand-holding tutorial, how to write your own Model is also covered below;
- For the args: you need the TrainingArguments class, also part of the 🤗 Transformers library, introduced below;
- For the metrics: use the 🤗 Evaluate library, also introduced below.
🤗 Datasets
Dataset
The base class Dataset implements a Dataset backed by an Apache Arrow table.
Feature
The Features format is simple: dict[column_name, column_type]. It is a dictionary of column name and column type pairs.
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
>>> dataset.features
{'idx': Value(dtype='int32', id=None),
'label': ClassLabel(num_classes=2, names=['not_equivalent', 'equivalent'], names_file=None, id=None),
'sentence1': Value(dtype='string', id=None),
'sentence2': Value(dtype='string', id=None),
}
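As a quick illustration of working with these feature objects (a hedged example reusing the MRPC dataset loaded above; ClassLabel.names and ClassLabel.int2str are standard 🤗 Datasets APIs):
>>> dataset.features["label"].names
['not_equivalent', 'equivalent']
>>> dataset.features["label"].int2str(0)
'not_equivalent'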
quick start
- Use load_dataset() to get a Dataset object
from datasets import load_dataset, Image
dataset = load_dataset("beans", split="train")
- Set up data augmentation with any library you like; torchvision is used here (which is what most experienced PyTorch users reach for anyway).
Images and text are handled differently:
from torchvision.transforms import Compose, ColorJitter, ToTensor
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# For images
jitter = Compose(
[ColorJitter(brightness=0.5, hue=0.5), ToTensor()]
)
# For text
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
- Write a function that applies your transform to the dataset and generates the model input pixel_values:
# For images
def transforms(examples):
examples["pixel_values"] = [jitter(image.convert("RGB")) for image in examples["image"]]
return examples
# For text
def encode(examples):
return tokenizer(examples["sentence1"], examples["sentence2"], truncation=True, padding="max_length")
- For image augmentation, call train_dataset.with_transform() and pass in your function; for text tokenization, call train_dataset.map() and pass in your function.
# For images
dataset = dataset.with_transform(transforms)
# For text
dataset = dataset.map(encode, batched=True)
>>> dataset[0]
{'sentence1': 'Amrozi accused his brother , whom he called " the witness " , of deliberately distorting his evidence .',
'sentence2': 'Referring to him as only " the witness " , Amrozi accused his brother of deliberately distorting his evidence .',
'label': 1,
'idx': 0,
'input_ids': array([ 101, 7277, 2180, 5303, 4806, 1117, 1711, 117, 2292, 1119, ...]),
'token_type_ids': array([0, 0, 0, 0, 0, ..., 0, 1, 1, ...]),
'attention_mask': array([1, 1, 1, 1, 1, 1, 1, ...])}
- Set the dataset format according to the machine learning framework you're using (for PyTorch this means passing the dataset to a DataLoader).
import torch
from torch.utils.data import DataLoader
def collate_fn(examples):
images = []
labels = []
for example in examples:
images.append((example["pixel_values"]))
labels.append(example["labels"])
pixel_values = torch.stack(images)
labels = torch.tensor(labels)
return {"pixel_values": pixel_values, "labels": labels}
dataloader = DataLoader(dataset, collate_fn=collate_fn, batch_size=4)
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataloader = DataLoader(dataset, batch_size=32)
Inspect a dataset's info before deciding whether to download it
Use load_dataset_builder() to inspect the info, and load_dataset() to actually load (download) it.
from datasets import load_dataset_builder
>>> ds_builder = load_dataset_builder("rotten_tomatoes")
>>> # Inspect dataset description
>>> ds_builder.info.description
Movie Review Dataset. This is a dataset of containing 5,331 positive and 5,331 negative processed sentences from Rotten Tomatoes movie reviews. This data was first used in Bo Pang and Lillian Lee, ``Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales.'', Proceedings of the ACL, 2005.
>>> # Inspect dataset features
>>> ds_builder.info.features
{'label': ClassLabel(num_classes=2, names=['neg', 'pos'], id=None),
'text': Value(dtype='string', id=None)}
Once the info above confirms it is the dataset you want, download (load) it with load_dataset():
from datasets import load_dataset
dataset = load_dataset("rotten_tomatoes", split="train")
A dataset may be divided into train/test/validation splits; use get_dataset_split_names() to see which splits exist:
from datasets import get_dataset_split_names
>>> get_dataset_split_names("rotten_tomatoes")
['train', 'validation', 'test']
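If you only want part of a split, load_dataset() also supports split slicing (a small illustrative example, not from the original article):
from datasets import load_dataset
# first 10% of the train split
train_10pct = load_dataset("rotten_tomatoes", split="train[:10%]")
# validation and test concatenated into one split
val_plus_test = load_dataset("rotten_tomatoes", split="validation+test")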
Create an image dataset
Two options:
- Create an image dataset with ImageFolder and some metadata: a no-code, painless way to build a dataset.
- Create an image dataset by writing a loading script: more work, but more freedom; you decide how the dataset is defined and downloaded.
ImageFolder
Organize your image files in the following structure:
folder/train/dog/golden_retriever.png
folder/train/cat/maine_coon.png
folder/test/dog/german_shepherd.png
folder/test/cat/bengal.png
Then call load_dataset() with the "imagefolder" builder, pass the folder path, and you get a dataset:
from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
Going beyond classification: captioning / object detection
If you want to attach additional information to the dataset, such as text captions or bounding boxes, put it in a metadata.csv file. The dataset can then be used not only for classification but also for text captioning or object detection. The layout becomes:
folder/train/metadata.csv
folder/train/0001.png
folder/train/0002.png
folder/train/0003.png
It even works if the images are packed in zip archives:
folder/metadata.csv
folder/train.zip
folder/test.zip
folder/valid.zip
The csv file can look like this:
file_name,additional_feature
0001.png,This is a first value of a text feature you added to your images
0002.png,This is a second value of a text feature you added to your images
0003.png,This is a third value of a text feature you added to your images
A jsonl file works instead of csv as well:
{"file_name": "0001.png", "additional_feature": "This is a first value of a text feature you added to your images"}
{"file_name": "0002.png", "additional_feature": "This is a second value of a text feature you added to your images"}
{"file_name": "0003.png", "additional_feature": "This is a third value of a text feature you added to your images"}
Important
The column name here is the key you will later use to retrieve the value. For example, for a captioning task, replace additional_feature with text:
file_name,text
0001.png,This is a golden retriever playing with a ball
0002.png,A german shepherd
0003.png,One chihuahua
from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder")
>>> dataset[0]["text"] # if captions were provided, they can be read like this
"This is a golden retriever playing with a ball"
For an object detection task, replace additional_feature with objects:
(Note: the file names here must include the train/test subfolder prefix, otherwise you will get an error, which is painful to debug.)
{"file_name": "train/0001.png", "objects": {"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}}
{"file_name": "train/0002.png", "objects": {"bbox": [[810.0, 100.0, 57.0, 28.0]], "categories": [1]}}
{"file_name": "train/0003.png", "objects": {"bbox": [[160.0, 31.0, 248.0, 616.0], [741.0, 68.0, 202.0, 401.0]], "categories": [2, 2]}}
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/folder", split="train")
>>> dataset[0]["objects"]
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
Writing a dataset script
Organize your dataset files like this:
my_dataset/
├── README.md
├── my_dataset.py
└── data/ # optional, may contain your images or TAR archives
Then use your dataset like this:
from datasets import load_dataset
dataset = load_dataset("path/to/my_dataset")
Template example: new dataset template
Create a dataset builder class.
Inherit from datasets.GeneratorBasedBuilder and implement 3 methods:
- _info(self) stores information about your dataset like its description, license, and features.
- _split_generators(self, dl_manager) downloads the dataset and defines its splits.
- _generate_examples(self, images, metadata_path) generates the images and labels for each split.
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
# concrete implementations are given below
def _info(self):
def _split_generators(self, dl_manager):
def _generate_examples(self, images, metadata_path):
Create dataset configurations.
You may need multiple dataset configurations (e.g. your dataset consists of several subsets).
Inherit from datasets.BuilderConfig and add 2 attributes:
- data_url
- metadata_urls
class Food101Config(datasets.BuilderConfig):
"""Builder Config for Food-101"""
def __init__(self, data_url, metadata_urls, **kwargs):
"""BuilderConfig for Food-101.
Args:
data_url: `string`, url to download the zip file from.
metadata_urls: dictionary with keys 'train' and 'validation' containing the archive metadata URLs
**kwargs: keyword arguments forwarded to super.
"""
super(Food101Config, self).__init__(version=datasets.Version("1.0.0"), **kwargs)
self.data_url = data_url
self.metadata_urls = metadata_urls
Suppose your dataset (Food-101) consists of two subsets (breakfast and dinner):
class Food101(datasets.GeneratorBasedBuilder):
"""Food-101 Images dataset"""
BUILDER_CONFIGS = [
Food101Config(
name="breakfast",
description="Food types commonly eaten during breakfast.",
data_url="https://link-to-breakfast-foods.zip",
metadata_urls={
"train": "https://link-to-breakfast-foods-train.txt",
"validation": "https://link-to-breakfast-foods-validation.txt"
},
),
Food101Config(
name="dinner",
description="Food types commonly eaten during dinner.",
data_url="https://link-to-dinner-foods.zip",
metadata_urls={
"train": "https://link-to-dinner-foods-train.txt",
"validation": "https://link-to-dinner-foods-validation.txt"
},
),
]
Then it can be used like this:
from datasets import load_dataset
ds = load_dataset("food101", "breakfast", split="train")
Add dataset metadata.
This is the information a user gets when inspecting the dataset's info (it is actually a DatasetInfo object):
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("food101")
ds_builder.info
The information to provide includes:
- description provides a concise description of the dataset.
- features specify the dataset column types. Since you’re creating an image loading script, you’ll need to include the Image feature.
- supervised_keys specify the input feature and label.
- homepage provides a link to the dataset homepage.
- citation is a BibTeX citation of the dataset.
- license states the dataset’s license.
Example:
def _info(self):
return datasets.DatasetInfo(
description=_DESCRIPTION,
features=datasets.Features(
{
"image": datasets.Image(),
"label": datasets.ClassLabel(names=_NAMES),
}
),
supervised_keys=("image", "label"),
homepage=_HOMEPAGE,
citation=_CITATION,
license=_LICENSE,
task_templates=[ImageClassification(image_column="image", label_column="label")],
)
Download and define the dataset splits.
def _split_generators(self, dl_manager):
archive_path = dl_manager.download(_BASE_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
datasets.SplitGenerator(
name=datasets.Split.TRAIN,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["train"],
},
),
datasets.SplitGenerator(
name=datasets.Split.VALIDATION,
gen_kwargs={
"images": dl_manager.iter_archive(archive_path),
"metadata_path": split_metadata_paths["test"],
},
),
]
Generate the dataset.
_generate_examples accepts the images and metadata_path from the previous method as arguments.
def _generate_examples(self, images, metadata_path):
"""Generate images and labels for splits."""
with open(metadata_path, encoding="utf-8") as f:
files_to_keep = set(f.read().split("\n"))
for file_path, file_obj in images:
if file_path.startswith(_IMAGES_DIR):
if file_path[len(_IMAGES_DIR) : -len(".jpg")] in files_to_keep:
label = file_path.split("/")[2]
yield file_path, {
"image": {"path": file_path, "bytes": file_obj.read()},
"label": label,
}
Create a dataset loading script
Not needed for now; here is the link to study when you do need it: Create a dataset loading script
Adding preprocessing
Depending on your dataset modality, you'll need to:
- Tokenize a text dataset.
- Resample an audio dataset.
- Apply transforms to an image dataset.
data augmentations
Get a feature extractor with AutoFeatureExtractor.from_pretrained.
How to apply it
from transformers import AutoFeatureExtractor
from datasets import load_dataset, Image
from torchvision.transforms import RandomRotation
# 1. Get the dataset object and the preprocessing handler
feature_extractor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
dataset = load_dataset("beans", split="train")
# 2. Write a function that passes the data through the torchvision augmentation object and returns the result
rotate = RandomRotation(degrees=(0, 90))
def transforms(examples):
# by convention the image input to the model is keyed pixel_values, so rename it accordingly
examples["pixel_values"] = [rotate(image.convert("RGB")) for image in examples["image"]]
return examples
# 3. Call train_dataset.set_transform and pass in your function
dataset.set_transform(transforms)
dataset[0]["pixel_values"]
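Note that the feature extractor loaded above is not actually used in this snippet. A hedged variant that also runs it (so resizing and normalization match what the ViT checkpoint expects) could look like this; the function name transforms_with_extractor is made up for illustration:
def transforms_with_extractor(examples):
    images = [rotate(image.convert("RGB")) for image in examples["image"]]
    # the feature extractor resizes/normalizes and returns pixel_values tensors
    examples["pixel_values"] = feature_extractor(images, return_tensors="pt")["pixel_values"]
    return examples

dataset.set_transform(transforms_with_extractor)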
tokenizer
Get a tokenizer with AutoTokenizer.from_pretrained. Passing text to the tokenizer returns a dictionary with 3 keys:
- input_ids: the numbers representing the tokens in the text.
- token_type_ids: indicates which sequence a token belongs to if there is more than one sequence.
- attention_mask: indicates whether a token should be masked or not.
tokenizer(dataset[0]["text"])
{'input_ids': [101, 1103, 2067, 1110, 17348, ...],
'token_type_ids': [0, 0, 0, 0, 0, 0, 0, ...],
'attention_mask': [1, 1, 1, 1, 1, 1, ...]}
How to apply it
from transformers import AutoTokenizer
from datasets import load_dataset
# 1. Get the tokenizer and the dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("rotten_tomatoes", split="train")
# 2. Write a function that passes the data through the tokenizer and returns the result
def tokenization(example):
return tokenizer(example["text"])
# 3. Call train_dataset.map and pass in your function
dataset = dataset.map(tokenization, batched=True)
# 4. Call train_dataset.set_format to set the output format (and select columns)
dataset.set_format(type="torch", columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
dataset.format['type']
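A common alternative to padding everything to max_length inside the tokenize function is dynamic per-batch padding with DataCollatorWithPadding (a hedged sketch reusing the tokenizer and dataset from above):
from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# pads each batch to the length of its longest sequence
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
dataloader = DataLoader(dataset, batch_size=32, collate_fn=data_collator)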
Building the Model
- Write your own sublayers (encoder / decoder, etc.); inheriting from nn.Module is enough.
The key point here is that every sublayer takes config as its __init__ argument.
class MyEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.l = nn.Linear(3, config.hidden_size)
def forward(self, x):
return self.l(x)
class MyDecoder(nn.Module):
...
- Use a Model class to chain the sublayers together; this Model inherits from PreTrainedModel.
The key points here: (a) set config_class properly; (b) note that the __init__ argument is an instance of the corresponding Config class.
class MyModel(PreTrainedModel):
config_class = MyConfig  # this MyConfig inherits from PretrainedConfig
def __init__(self, config: MyConfig):
# note: the argument of __init__ is an instance of the corresponding Config class
super().__init__(config)
self.encoder = MyEncoder(config)
self.decoder = MyDecoder(config)
def forward(self, x):
return self.decoder(self.encoder(x))
- Register your classes with the AutoModel factory (if you do not plan to obtain the model via AutoModel.from_pretrained() but will instead build it with MyModel(myConfig), this step can be skipped).
AutoConfig.register("my-model", MyConfig)
AutoModel.register(MyConfig, MyModel)
- Finally, write MyConfig, which inherits from PretrainedConfig; every hyperparameter related to the model architecture lives here. The most important thing is to set the class attribute model_type = 'my-model'.
class MyConfig(PretrainedConfig):
# set the class-level attribute shared by all instances
model_type = "my-model"
def __init__(self, hidden_size=768, **kwargs):
super().__init__(**kwargs)
self.hidden_size = hidden_size
- Get a model instance
# with the registration above, a saved checkpoint of MyModel can be loaded through AutoModel
model = AutoModel.from_pretrained('path/to/my-model-checkpoint')
# or
myConfig = MyConfig(hidden_size=256)
model = MyModel(myConfig)
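Since MyModel inherits from PreTrainedModel, it also gets the standard serialization methods. A hedged sketch of saving and reloading (the checkpoint path is made up for illustration):
myConfig = MyConfig(hidden_size=256)
model = MyModel(myConfig)
model.save_pretrained("./my-model-checkpoint")       # writes config.json + the weights
model = MyModel.from_pretrained("./my-model-checkpoint")
# with the AutoConfig/AutoModel registration above, this should work too:
# model = AutoModel.from_pretrained("./my-model-checkpoint")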
TrainingArguments
It has 97 constructor parameters, which is rather intimidating.
In practice, you usually pass only the first one:
output_dir (str)— The output directory where the model predictions and checkpoints will be written.
Passing everything to the constructor at once is hard to manage; instead, you can set groups of parameters with the following set_* methods, which is also clearer:
set_dataloader
All dataloader-related parameters are set through this method:
- train_batch_size: int = 8
- eval_batch_size: int = 8
- drop_last: bool = False
- num_workers: int = 0
- pin_memory: bool = True
- auto_find_batch_size: bool = False
- Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA Out-of-Memory errors. Requires accelerate to be installed (pip install accelerate)
- ignore_data_skip: bool = False
- When resuming training, whether or not to skip the epochs and batches to get the data loading at the same stage as in the previous training.
- sampler_seed: typing.Optional[int] = None
- Random seed to be used with data samplers. If not set, random generators for data sampling will use the same seed as self.seed.
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_dataloader(train_batch_size=16, eval_batch_size=64)
>>> args.per_device_train_batch_size
16
set_evaluate
All parameters for evaluation during training are set through this method:
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'no'
- If this is set to anything other than “no”, the do_eval attribute is automatically set to True.
- “no”: No evaluation is done during training.
- “steps”: Evaluation is done (and logged) every steps.
- “epoch”: Evaluation is done at the end of each epoch.
- steps: int = 500
- Number of update steps between two evaluations if strategy=“steps”.
- batch_size: int = 8
- The batch size per device (GPU/TPU core/CPU…) used for evaluation.
- accumulation_steps: typing.Optional[int] = None
- delay: typing.Optional[float] = None
- loss_only: bool = False
- Ignores all outputs except the loss.
- jit_mode: bool = False
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_evaluate(strategy="steps", steps=100)
>>> args.eval_steps
100
set_logging
All logging-related parameters are set through this method:
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps'
- “no”: No logging is done during training.
- “epoch”: Logging is done at the end of each epoch.
- “steps”: Logging is done every logging_steps.
- steps: int = 500
- Number of update steps between two logs if strategy=“steps”.
- report_to: typing.Union[str, typing.List[str]] = 'none'
- The list of integrations to report the results and logs to. Supported platforms are “azure_ml”, “comet_ml”, “mlflow”, “neptune”, “tensorboard”, “clearml” and “wandb”. Use “all” to report to all integrations installed, “none” for no integrations.
- level: str = 'passive'
- Logger log level to use on the main process. Possible choices are the log levels as strings: “passive”, “debug”, “info”, “warning”, “error” and “critical”.
- first_step: bool = False
- nan_inf_filter: bool = False
- on_each_node: bool = False
- replica_level: str = 'passive'
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_logging(strategy="steps", steps=100)
>>> args.logging_steps
100
set_lr_scheduler
- name: typing.Union[str, transformers.trainer_utils.SchedulerType] = 'linear'
- “linear”
- “cosine”
- “cosine_with_restarts”
- “polynomial”
- “constant”
- “constant_with_warmup”
- “inverse_sqrt”
- num_epochs: float = 3.0
- max_steps: int = -1
- warmup_ratio: float = 0
- warmup_steps: int = 0
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_lr_scheduler(name="cosine", warmup_ratio=0.05)
>>> args.warmup_ratio
0.05
set_optimizer
All hyperparameters of the optimizer used for training are set through this method:
- name: typing.Union[str, transformers.training_args.OptimizerNames] = 'adamw_hf'
- “adamw_hf”
- “adamw_torch”
- “adamw_torch_fused”
- “adamw_apex_fused”
- “adamw_anyprecision”
- “adafactor”
- learning_rate: float = 5e-05
- The initial learning rate.
- weight_decay: float = 0
- The weight decay to apply (if not zero) to all layers except all bias and LayerNorm weights.
- beta1: float = 0.9
- The beta1 hyperparameter for the adam optimizer or its variants.
- beta2: float = 0.999
- epsilon: float = 1e-08
- args: typing.Optional[str] = None
- Optional arguments that are supplied to AnyPrecisionAdamW (only useful when optim=“adamw_anyprecision”).
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_optimizer(name="adamw_torch", beta1=0.8)
>>> args.optim
'adamw_torch'
set_save
- strategy: typing.Union[str, transformers.trainer_utils.IntervalStrategy] = 'steps'
- “no”: No save is done during training.
- “epoch”: Save is done at the end of each epoch.
- “steps”: Save is done every save_steps.
- steps: int = 500
- total_limit: typing.Optional[int] = None
- on_each_node: bool = False
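set_save follows the same pattern as the other set_* methods; for example (a hedged sketch mirroring the snippets above):
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_save(strategy="steps", steps=100)
>>> args.save_steps
100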
set_training
Sets the basic training parameters; it is less fine-grained than the methods above and only covers the essentials.
- learning_rate: float = 5e-05
- batch_size: int = 8
- weight_decay: float = 0
- num_epochs: float = 3
- max_steps: int = -1
- If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs. In case of using a finite iterable dataset the training may stop before reaching the set number of steps when all data is exhausted.
- gradient_accumulation_steps: int = 1
- Number of update steps to accumulate the gradients for, before performing a backward/update pass.
- seed: int = 42
- gradient_checkpointing: bool = False
from transformers import TrainingArguments
args = TrainingArguments("working_dir")
args = args.set_training(learning_rate=1e-4, batch_size=32)
>>> args.learning_rate
1e-4
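Since each set_* method returns the modified TrainingArguments instance, the calls can also be chained (a hedged sketch, not from the original article):
from transformers import TrainingArguments

args = (
    TrainingArguments("working_dir")
    .set_training(learning_rate=1e-4, batch_size=32, num_epochs=5)
    .set_optimizer(name="adamw_torch", weight_decay=0.01)
    .set_lr_scheduler(name="cosine", warmup_ratio=0.05)
    .set_evaluate(strategy="steps", steps=100)
)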
🤗 Evaluate
Any metric, comparison, or measurement is loaded with the evaluate.load function
import evaluate
accuracy = evaluate.load("accuracy")
# explicitly pass the type
word_length = evaluate.load("word_length", module_type="measurement")
Listing the available modules
With list_evaluation_modules() you can check what modules are available on the hub.
>>> evaluate.list_evaluation_modules(
... module_type="comparison",
... include_community=False,
... with_details=True)
[{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
{'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]
How to compute with a metric
The simplest way is to call metric.compute() directly:
>>> accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
{'accuracy': 0.5}
Another scenario: accumulate one prediction at a time and compute the overall score at the end; call metric.add() for each example, then metric.compute():
>>> for ref, pred in zip([0,1,0,1], [1,0,0,1]):
>>> accuracy.add(references=ref, predictions=pred)
>>> accuracy.compute()
{'accuracy': 0.5}
Yet another scenario: accumulate a batch of results at a time; call metric.add_batch() first, then metric.compute():
>>> for refs, preds in zip([[0,1],[0,1]], [[1,0],[0,1]]):
>>> accuracy.add_batch(references=refs, predictions=preds)
>>> accuracy.compute()
{'accuracy': 0.5}
Applied to your own code:
for model_inputs, gold_standards in evaluation_dataset:
predictions = model(model_inputs)
metric.add_batch(references=gold_standards, predictions=predictions)
metric.compute()
How to compute several metrics at once
Use evaluate.combine(); the returned object can be used like an ordinary metric:
clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
>>> clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1])
{
'accuracy': 0.667,
'f1': 0.667,
'precision': 1.0,
'recall': 0.5
}
How to save evaluation results
Call evaluate.save() (no more saving results by hand, and no more confusion about which result belongs to which run):
result = accuracy.compute(references=[0,1,0,1], predictions=[1,0,0,1])
hyperparams = {"model": "bert-base-uncased"}
evaluate.save("./results/"experiment="run 42", **result, **hyperparams)
The saved file looks like this:
{
"experiment": "run 42",
"accuracy": 0.5,
"model": "bert-base-uncased",
"_timestamp": "2022-05-30T22:09:11.959469",
"_git_commit_hash": "123456789abcdefghijkl",
"_evaluate_version": "0.1.0",
"_python_version": "3.9.12 (main, Mar 26 2022, 15:51:15) \n[Clang 13.1.6 (clang-1316.0.21.2)]",
"_interpreter_path": "/Users/leandro/git/evaluate/env/bin/python"
}
How to write your own metric
Inherit from evaluate.Metric or evaluate.EvaluationModule.
Create a separate .py file and implement:
- _compute(self, references, predictions)
- Must be overridden; this is the method that actually runs when metric.compute() is called.
- _download_and_prepare(self, dl_manager)
- Some metrics need extra resources that have to be downloaded; handle that here.
- _info(self)
- Must also be overridden. It returns an EvaluationModuleInfo object, on which you define the following attributes:
- EvaluationModuleInfo.description provides a brief description about your evaluation module.
- EvaluationModuleInfo.citation contains a BibTex citation for the evaluation module.
- EvaluationModuleInfo.inputs_description describes the expected inputs and outputs. It may also provide an example usage of the evaluation module.
- EvaluationModuleInfo.features defines the name and type of the predictions and references. This has to be either a single datasets.Features object or a list of datasets.Features objects if multiple input types are allowed. (See the Features discussion in the Dataset section above.)
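Below is a minimal sketch of such a module (an assumption-laden illustration: the class name MyExactMatch, the file name, and the exact-match logic are all made up, not an existing module):
# my_metric/my_metric.py -- a minimal custom metric sketch
import datasets
import evaluate

class MyExactMatch(evaluate.Metric):
    def _info(self):
        return evaluate.MetricInfo(
            description="Fraction of predictions that exactly match the reference.",
            citation="",
            inputs_description="predictions: list of int, references: list of int",
            features=datasets.Features(
                {
                    "predictions": datasets.Value("int64"),
                    "references": datasets.Value("int64"),
                }
            ),
        )

    def _compute(self, predictions, references):
        # called by metric.compute() after all add()/add_batch() calls
        correct = sum(int(p == r) for p, r in zip(predictions, references))
        return {"exact_match": correct / len(references)}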
Usage:
Pass the path of your new .py file to evaluate.load():
# documentation for the first parameter of load()
"""
path (``str``):
path to the evaluation processing script with the evaluation builder. Can be either:
- a local path to processing script or the directory containing the script (if the script has the same name as the directory), e.g. ``'./metrics/rouge'`` or ``'./metrics/rouge/rouge.py'``
- a evaluation module identifier on the HuggingFace evaluate repo e.g. ``'rouge'`` or ``'bleu'`` that are in either ``'metrics/'``, ``'comparisons/'``, or ``'measurements/'`` depending on the provided ``module_type``.
"""
🤗 Trainer
Instantiate a Trainer object, then call its train() method to start training.
Parameters you can pass when instantiating it:
- model (PreTrainedModel or torch.nn.Module, optional)
- The model to train, evaluate or use for predictions.
- args (TrainingArguments, optional)
- The arguments to tweak for training. Will default to a basic instance of TrainingArguments with the output_dir set to a directory named tmp_trainer in the current directory if not provided.
- data_collator (DataCollator, optional)
- The function to use to form a batch from a list of elements of train_dataset or eval_dataset.
- train_dataset (torch.utils.data.Dataset or torch.utils.data.IterableDataset, optional)
- The dataset to use for training. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
- eval_dataset (Union[torch.utils.data.Dataset, Dict[str, torch.utils.data.Dataset])], optional)
- The dataset to use for evaluation. If it is a Dataset, columns not accepted by the model.forward() method are automatically removed.
- tokenizer (PreTrainedTokenizerBase, optional)
- The tokenizer used to preprocess the data.
- model_init (Callable[[], PreTrainedModel], optional)
- A function that instantiates the model to be used.
- compute_metrics (Callable[[EvalPrediction], Dict], optional)
- The function that will be used to compute metrics at evaluation. Must take an EvalPrediction and return a dictionary mapping metric names to metric values.
- callbacks (List of TrainerCallback, optional)
- A list of callbacks to customize the training loop.
- optimizers (Tuple[torch.optim.Optimizer, torch.optim.lr_scheduler.LambdaLR], optional)
- A tuple containing the optimizer and the scheduler to use. Will default to an instance of AdamW on your model and a scheduler given by get_linear_schedule_with_warmup() controlled by args.
- preprocess_logits_for_metrics (Callable[[torch.Tensor, torch.Tensor], torch.Tensor], optional)
- A function that preprocesses the logits right before caching them at each evaluation step. Must take two tensors, the logits and the labels, and return the logits once processed as desired. The modifications made by this function will be reflected in the predictions received by compute_metrics.
To start training, call trainer.train().
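After training, the same Trainer instance can also run evaluation and save the fine-tuned model (a hedged sketch; trainer.evaluate() and trainer.save_model() are standard Trainer methods, the output path is made up):
metrics = trainer.evaluate()              # runs eval_dataset through the model and compute_metrics
print(metrics)
trainer.save_model("test_trainer/final")  # saves the model so it can be reloaded with from_pretrained()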