Transformers 4.37 中文文档（五）

原文：huggingface.co/docs/transformers目标检测原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/object_detection目标检测是计算机视觉任务，用于检测图像中的实例（如人类、建筑物或汽车）。目标检测模型接收图像作为输入，并输出检测到的对象的边界框的坐标和相关标签。一幅图像可以包含多个对象，每个对象

布客飞龙

1092人浏览 · 2024-06-23 12:09:00

布客飞龙 · 2024-06-23 12:09:00 发布

原文：huggingface.co/docs/transformers

目标检测

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/object_detection

目标检测是计算机视觉任务，用于检测图像中的实例（如人类、建筑物或汽车）。目标检测模型接收图像作为输入，并输出检测到的对象的边界框的坐标和相关标签。一幅图像可以包含多个对象，每个对象都有自己的边界框和标签（例如，它可以有一辆汽车和一座建筑物），每个对象可以出现在图像的不同部分（例如，图像可以有几辆汽车）。这个任务通常用于自动驾驶，用于检测行人、道路标志和交通灯等。其他应用包括在图像中计数对象、图像搜索等。

在本指南中，您将学习如何：

对DETR进行微调，这是一个将卷积主干与编码器-解码器 Transformer 结合的模型，在CPPE-5数据集上进行训练。
使用您微调的模型进行推断。

本教程中所示的任务由以下模型架构支持：

条件 DETR, 可变 DETR, DETA, DETR, 表格 Transformer, YOLOS

在开始之前，请确保已安装所有必要的库：

pip install -q datasets transformers evaluate timm albumentations

您将使用🤗数据集从 Hugging Face Hub 加载数据集，🤗转换器来训练您的模型，并使用albumentations来增强数据。目前需要使用timm来加载 DETR 模型的卷积主干。

我们鼓励您与社区分享您的模型。登录到您的 Hugging Face 帐户并将其上传到 Hub。在提示时，输入您的令牌以登录：

>>> from huggingface_hub import notebook_login

>>> notebook_login()

加载 CPPE-5 数据集

CPPE-5 数据集包含带有注释的图像，用于识别 COVID-19 大流行背景下的医疗个人防护装备（PPE）。

首先加载数据集：

>>> from datasets import load_dataset

>>> cppe5 = load_dataset("cppe-5")
>>> cppe5
DatasetDict({
    train: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['image_id', 'image', 'width', 'height', 'objects'],
        num_rows: 29
    })
})

您将看到这个数据集已经带有一个包含 1000 张图像的训练集和一个包含 29 张图像的测试集。

熟悉数据，探索示例的外观。

>>> cppe5["train"][0]
{'image_id': 15,
 'image': <PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=943x663 at 0x7F9EC9E77C10>,
 'width': 943,
 'height': 663,
 'objects': {'id': [114, 115, 116, 117],
  'area': [3796, 1596, 152768, 81002],
  'bbox': [[302.0, 109.0, 73.0, 52.0],
   [810.0, 100.0, 57.0, 28.0],
   [160.0, 31.0, 248.0, 616.0],
   [741.0, 68.0, 202.0, 401.0]],
  'category': [4, 4, 0, 0]}}

数据集中的示例具有以下字段：

image_id：示例图像 id
image：包含图像的PIL.Image.Image对象
width：图像的宽度
height：图像的高度
objects：包含图像中对象的边界框元数据的字典：
- id：注释 id
- area：边界框的面积
- bbox：对象的边界框（以COCO 格式）
- category：对象的类别，可能的值包括防护服（0）、面罩（1）、手套（2）、护目镜（3）和口罩（4）

您可能会注意到bbox字段遵循 COCO 格式，这是 DETR 模型期望的格式。然而，objects内部字段的分组与 DETR 所需的注释格式不同。在使用此数据进行训练之前，您需要应用一些预处理转换。

为了更好地理解数据，可视化数据集中的一个示例。

>>> import numpy as np
>>> import os
>>> from PIL import Image, ImageDraw

>>> image = cppe5["train"][0]["image"]
>>> annotations = cppe5["train"][0]["objects"]
>>> draw = ImageDraw.Draw(image)

>>> categories = cppe5["train"].features["objects"].feature["category"].names

>>> id2label = {index: x for index, x in enumerate(categories, start=0)}
>>> label2id = {v: k for k, v in id2label.items()}

>>> for i in range(len(annotations["id"])):
...     box = annotations["bbox"][i]
...     class_idx = annotations["category"][i]
...     x, y, w, h = tuple(box)
...     # Check if coordinates are normalized or not
...     if max(box) > 1.0:
...         # Coordinates are un-normalized, no need to re-scale them
...         x1, y1 = int(x), int(y)
...         x2, y2 = int(x + w), int(y + h)
...     else:
...         # Coordinates are normalized, re-scale them
...         x1 = int(x * width)
...         y1 = int(y * height)
...         x2 = int((x + w) * width)
...         y2 = int((y + h) * height)
...     draw.rectangle((x, y, x + w, y + h), outline="red", width=1)
...     draw.text((x, y), id2label[class_idx], fill="white")

>>> image

外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传

要可视化带有关联标签的边界框，您可以从数据集的元数据中获取标签，特别是category字段。您还需要创建映射标签 id 到标签类别（id2label）以及反向映射（label2id）的字典。在设置模型时，您可以稍后使用它们。包括这些映射将使您的模型在 Hugging Face Hub 上共享时可以被其他人重复使用。请注意，上述代码中绘制边界框的部分假定它是以XYWH（x，y 坐标和框的宽度和高度）格式。对于其他格式如(x1，y1，x2，y2)可能无法正常工作。

作为熟悉数据的最后一步，探索可能存在的问题。目标检测数据集的一个常见问题是边界框“拉伸”到图像边缘之外。这种“失控”的边界框可能会在训练过程中引发错误，应在此阶段加以解决。在这个数据集中有一些示例存在这个问题。为了简化本指南中的操作，我们将这些图像从数据中删除。

>>> remove_idx = [590, 821, 822, 875, 876, 878, 879]
>>> keep = [i for i in range(len(cppe5["train"])) if i not in remove_idx]
>>> cppe5["train"] = cppe5["train"].select(keep)

预处理数据

要微调模型，您必须预处理您计划使用的数据，以精确匹配预训练模型使用的方法。AutoImageProcessor 负责处理图像数据以创建pixel_values，pixel_mask和labels，供 DETR 模型训练。图像处理器具有一些属性，您无需担心：

image_mean = [0.485, 0.456, 0.406 ]
image_std = [0.229, 0.224, 0.225]

这些是用于在模型预训练期间对图像进行归一化的均值和标准差。在进行推理或微调预训练图像模型时，这些值至关重要。

从要微调的模型相同的检查点实例化图像处理器。

>>> from transformers import AutoImageProcessor

>>> checkpoint = "facebook/detr-resnet-50"
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)

在将图像传递给image_processor之前，对数据集应用两个预处理转换：

增强图像
重新格式化注释以满足 DETR 的期望

首先，为了确保模型不会在训练数据上过拟合，您可以使用任何数据增强库进行图像增强。这里我们使用Albumentations…此库确保转换影响图像并相应更新边界框。🤗数据集库文档有一个详细的关于如何为目标检测增强图像的指南，它使用相同的数据集作为示例。在这里应用相同的方法，将每个图像调整为(480, 480)，水平翻转并增加亮度：

>>> import albumentations
>>> import numpy as np
>>> import torch

>>> transform = albumentations.Compose(
...     [
...         albumentations.Resize(480, 480),
...         albumentations.HorizontalFlip(p=1.0),
...         albumentations.RandomBrightnessContrast(p=1.0),
...     ],
...     bbox_params=albumentations.BboxParams(format="coco", label_fields=["category"]),
... )

image_processor期望注释采用以下格式：{'image_id': int, 'annotations': List[Dict]}，其中每个字典是一个 COCO 对象注释。让我们添加一个函数来为单个示例重新格式化注释：

>>> def formatted_anns(image_id, category, area, bbox):
...     annotations = []
...     for i in range(0, len(category)):
...         new_ann = {
...             "image_id": image_id,
...             "category_id": category[i],
...             "isCrowd": 0,
...             "area": area[i],
...             "bbox": list(bbox[i]),
...         }
...         annotations.append(new_ann)

...     return annotations

现在，您可以将图像和注释转换组合在一起，用于一批示例：

>>> # transforming a batch
>>> def transform_aug_ann(examples):
...     image_ids = examples["image_id"]
...     images, bboxes, area, categories = [], [], [], []
...     for image, objects in zip(examples["image"], examples["objects"]):
...         image = np.array(image.convert("RGB"))[:, :, ::-1]
...         out = transform(image=image, bboxes=objects["bbox"], category=objects["category"])

...         area.append(objects["area"])
...         images.append(out["image"])
...         bboxes.append(out["bboxes"])
...         categories.append(out["category"])

...     targets = [
...         {"image_id": id_, "annotations": formatted_anns(id_, cat_, ar_, box_)}
...         for id_, cat_, ar_, box_ in zip(image_ids, categories, area, bboxes)
...     ]

...     return image_processor(images=images, annotations=targets, return_tensors="pt")

使用🤗数据集的with_transform方法将此预处理函数应用于整个数据集。此方法在加载数据集元素时动态应用转换。

此时，您可以检查数据集经过转换后的示例是什么样子。您应该看到一个带有pixel_values的张量，一个带有pixel_mask的张量和labels。

>>> cppe5["train"] = cppe5["train"].with_transform(transform_aug_ann)
>>> cppe5["train"][15]
{'pixel_values': tensor([[[ 0.9132,  0.9132,  0.9132,  ..., -1.9809, -1.9809, -1.9809],
          [ 0.9132,  0.9132,  0.9132,  ..., -1.9809, -1.9809, -1.9809],
          [ 0.9132,  0.9132,  0.9132,  ..., -1.9638, -1.9638, -1.9638],
          ...,
          [-1.5699, -1.5699, -1.5699,  ..., -1.9980, -1.9980, -1.9980],
          [-1.5528, -1.5528, -1.5528,  ..., -1.9980, -1.9809, -1.9809],
          [-1.5528, -1.5528, -1.5528,  ..., -1.9980, -1.9809, -1.9809]],

         [[ 1.3081,  1.3081,  1.3081,  ..., -1.8431, -1.8431, -1.8431],
          [ 1.3081,  1.3081,  1.3081,  ..., -1.8431, -1.8431, -1.8431],
          [ 1.3081,  1.3081,  1.3081,  ..., -1.8256, -1.8256, -1.8256],
          ...,
          [-1.3179, -1.3179, -1.3179,  ..., -1.8606, -1.8606, -1.8606],
          [-1.3004, -1.3004, -1.3004,  ..., -1.8606, -1.8431, -1.8431],
          [-1.3004, -1.3004, -1.3004,  ..., -1.8606, -1.8431, -1.8431]],

         [[ 1.4200,  1.4200,  1.4200,  ..., -1.6476, -1.6476, -1.6476],
          [ 1.4200,  1.4200,  1.4200,  ..., -1.6476, -1.6476, -1.6476],
          [ 1.4200,  1.4200,  1.4200,  ..., -1.6302, -1.6302, -1.6302],
          ...,
          [-1.0201, -1.0201, -1.0201,  ..., -1.5604, -1.5604, -1.5604],
          [-1.0027, -1.0027, -1.0027,  ..., -1.5604, -1.5430, -1.5430],
          [-1.0027, -1.0027, -1.0027,  ..., -1.5604, -1.5430, -1.5430]]]),
 'pixel_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         ...,
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]]),
 'labels': {'size': tensor([800, 800]), 'image_id': tensor([756]), 'class_labels': tensor([4]), 'boxes': tensor([[0.7340, 0.6986, 0.3414, 0.5944]]), 'area': tensor([519544.4375]), 'iscrowd': tensor([0]), 'orig_size': tensor([480, 480])}}

您已成功增强了单个图像并准备好它们的注释。然而，预处理还没有完成。在最后一步中，创建一个自定义的collate_fn来将图像批量处理在一起。将图像（现在是pixel_values）填充到批次中最大的图像，并创建一个相应的pixel_mask来指示哪些像素是真实的（1），哪些是填充的（0）。

>>> def collate_fn(batch):
...     pixel_values = [item["pixel_values"] for item in batch]
...     encoding = image_processor.pad(pixel_values, return_tensors="pt")
...     labels = [item["labels"] for item in batch]
...     batch = {}
...     batch["pixel_values"] = encoding["pixel_values"]
...     batch["pixel_mask"] = encoding["pixel_mask"]
...     batch["labels"] = labels
...     return batch

训练 DETR 模型

在前几节中，您已经完成了大部分繁重的工作，现在您已经准备好训练您的模型了！即使在调整大小后，此数据集中的图像仍然相当大。这意味着微调此模型将需要至少一个 GPU。

训练包括以下步骤：

使用与预处理中相同的检查点加载模型 AutoModelForObjectDetection。
在 TrainingArguments 中定义您的训练超参数。
将训练参数传递给 Trainer，以及模型、数据集、图像处理器和数据整理器。
调用 train()来微调您的模型。

在从用于预处理的相同检查点加载模型时，请记住传递您从数据集元数据中创建的label2id和id2label映射。此外，我们指定ignore_mismatched_sizes=True以用新的替换现有的分类头。

>>> from transformers import AutoModelForObjectDetection

>>> model = AutoModelForObjectDetection.from_pretrained(
...     checkpoint,
...     id2label=id2label,
...     label2id=label2id,
...     ignore_mismatched_sizes=True,
... )

在 TrainingArguments 中使用output_dir指定保存模型的位置，然后根据需要配置超参数。重要的是不要删除未使用的列，因为这将删除图像列。没有图像列，您无法创建pixel_values。因此，将remove_unused_columns设置为False。如果希望通过将其推送到 Hub 来共享您的模型，请将push_to_hub设置为True（您必须登录到 Hugging Face 才能上传您的模型）。

>>> from transformers import TrainingArguments

>>> training_args = TrainingArguments(
...     output_dir="detr-resnet-50_finetuned_cppe5",
...     per_device_train_batch_size=8,
...     num_train_epochs=10,
...     fp16=True,
...     save_steps=200,
...     logging_steps=50,
...     learning_rate=1e-5,
...     weight_decay=1e-4,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

最后，将所有内容汇总，并调用 train()：

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=collate_fn,
...     train_dataset=cppe5["train"],
...     tokenizer=image_processor,
... )

>>> trainer.train()

如果在training_args中将push_to_hub设置为True，则训练检查点将被推送到 Hugging Face Hub。在训练完成后，通过调用 push_to_hub()方法将最终模型也推送到 Hub。

>>> trainer.push_to_hub()

评估

目标检测模型通常使用一组COCO 风格指标进行评估。您可以使用现有的指标实现之一，但在这里，您将使用来自torchvision的指标来评估推送到 Hub 的最终模型。

要使用torchvision评估器，您需要准备一个真实的 COCO 数据集。构建 COCO 数据集的 API 要求数据以特定格式存储，因此您需要首先将图像和注释保存到磁盘上。就像您为训练准备数据时一样，来自cppe5["test"]的注释需要进行格式化。但是，图像应保持原样。

评估步骤需要一些工作，但可以分为三个主要步骤。首先，准备cppe5["test"]集：格式化注释并将数据保存到磁盘上。

>>> import json

>>> # format annotations the same as for training, no need for data augmentation
>>> def val_formatted_anns(image_id, objects):
...     annotations = []
...     for i in range(0, len(objects["id"])):
...         new_ann = {
...             "id": objects["id"][i],
...             "category_id": objects["category"][i],
...             "iscrowd": 0,
...             "image_id": image_id,
...             "area": objects["area"][i],
...             "bbox": objects["bbox"][i],
...         }
...         annotations.append(new_ann)

...     return annotations

>>> # Save images and annotations into the files torchvision.datasets.CocoDetection expects
>>> def save_cppe5_annotation_file_images(cppe5):
...     output_json = {}
...     path_output_cppe5 = f"{os.getcwd()}/cppe5/"

...     if not os.path.exists(path_output_cppe5):
...         os.makedirs(path_output_cppe5)

...     path_anno = os.path.join(path_output_cppe5, "cppe5_ann.json")
...     categories_json = [{"supercategory": "none", "id": id, "name": id2label[id]} for id in id2label]
...     output_json["images"] = []
...     output_json["annotations"] = []
...     for example in cppe5:
...         ann = val_formatted_anns(example["image_id"], example["objects"])
...         output_json["images"].append(
...             {
...                 "id": example["image_id"],
...                 "width": example["image"].width,
...                 "height": example["image"].height,
...                 "file_name": f"{example['image_id']}.png",
...             }
...         )
...         output_json["annotations"].extend(ann)
...     output_json["categories"] = categories_json

...     with open(path_anno, "w") as file:
...         json.dump(output_json, file, ensure_ascii=False, indent=4)

...     for im, img_id in zip(cppe5["image"], cppe5["image_id"]):
...         path_img = os.path.join(path_output_cppe5, f"{img_id}.png")
...         im.save(path_img)

...     return path_output_cppe5, path_anno

接下来，准备一个可以与cocoevaluator一起使用的CocoDetection类的实例。

>>> import torchvision

>>> class CocoDetection(torchvision.datasets.CocoDetection):
...     def __init__(self, img_folder, image_processor, ann_file):
...         super().__init__(img_folder, ann_file)
...         self.image_processor = image_processor

...     def __getitem__(self, idx):
...         # read in PIL image and target in COCO format
...         img, target = super(CocoDetection, self).__getitem__(idx)

...         # preprocess image and target: converting target to DETR format,
...         # resizing + normalization of both image and target)
...         image_id = self.ids[idx]
...         target = {"image_id": image_id, "annotations": target}
...         encoding = self.image_processor(images=img, annotations=target, return_tensors="pt")
...         pixel_values = encoding["pixel_values"].squeeze()  # remove batch dimension
...         target = encoding["labels"][0]  # remove batch dimension

...         return {"pixel_values": pixel_values, "labels": target}

>>> im_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")

>>> path_output_cppe5, path_anno = save_cppe5_annotation_file_images(cppe5["test"])
>>> test_ds_coco_format = CocoDetection(path_output_cppe5, im_processor, path_anno)

最后，加载指标并运行评估。

>>> import evaluate
>>> from tqdm import tqdm

>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")
>>> module = evaluate.load("ybelkada/cocoevaluate", coco=test_ds_coco_format.coco)
>>> val_dataloader = torch.utils.data.DataLoader(
...     test_ds_coco_format, batch_size=8, shuffle=False, num_workers=4, collate_fn=collate_fn
... )

>>> with torch.no_grad():
...     for idx, batch in enumerate(tqdm(val_dataloader)):
...         pixel_values = batch["pixel_values"]
...         pixel_mask = batch["pixel_mask"]

...         labels = [
...             {k: v for k, v in t.items()} for t in batch["labels"]
...         ]  # these are in DETR format, resized + normalized

...         # forward pass
...         outputs = model(pixel_values=pixel_values, pixel_mask=pixel_mask)

...         orig_target_sizes = torch.stack([target["orig_size"] for target in labels], dim=0)
...         results = im_processor.post_process(outputs, orig_target_sizes)  # convert outputs of model to Pascal VOC format (xmin, ymin, xmax, ymax)

...         module.add(prediction=results, reference=labels)
...         del batch

>>> results = module.compute()
>>> print(results)
Accumulating evaluation results...
DONE (t=0.08s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.352
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.681
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.292
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.208
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.429
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.274
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.484
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.501
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.323
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.590

通过调整 TrainingArguments 中的超参数，这些结果可以进一步改善。试一试吧！

推断

现在您已经微调了一个 DETR 模型，对其进行了评估，并将其上传到 Hugging Face Hub，您可以将其用于推断。尝试使用您微调的模型进行推断的最简单方法是在 Pipeline 中使用它。使用您的模型实例化一个用于目标检测的流水线，并将图像传递给它：

>>> from transformers import pipeline
>>> import requests

>>> url = "https://i.imgur.com/2lnWoly.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> obj_detector = pipeline("object-detection", model="devonho/detr-resnet-50_finetuned_cppe5")
>>> obj_detector(image)

如果您愿意，也可以手动复制流水线的结果：

>>> image_processor = AutoImageProcessor.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")
>>> model = AutoModelForObjectDetection.from_pretrained("devonho/detr-resnet-50_finetuned_cppe5")

>>> with torch.no_grad():
...     inputs = image_processor(images=image, return_tensors="pt")
...     outputs = model(**inputs)
...     target_sizes = torch.tensor([image.size[::-1]])
...     results = image_processor.post_process_object_detection(outputs, threshold=0.5, target_sizes=target_sizes)[0]

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     print(
...         f"Detected {model.config.id2label[label.item()]} with confidence "
...         f"{round(score.item(), 3)} at location {box}"
...     )
Detected Coverall with confidence 0.566 at location [1215.32, 147.38, 4401.81, 3227.08]
Detected Mask with confidence 0.584 at location [2449.06, 823.19, 3256.43, 1413.9]

让我们绘制结果：

>>> draw = ImageDraw.Draw(image)

>>> for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
...     box = [round(i, 2) for i in box.tolist()]
...     x, y, x2, y2 = tuple(box)
...     draw.rectangle((x, y, x2, y2), outline="red", width=1)
...     draw.text((x, y), model.config.id2label[label.item()], fill="white")

>>> image

在一张新图片上的目标检测结果

零样本目标检测

原文链接：huggingface.co/docs/transformers/v4.37.2/en/tasks/zero_shot_object_detection

传统上，用于目标检测的模型需要标记的图像数据集进行训练，并且仅限于检测训练数据集中的类别集。

零样本目标检测由使用不同方法的 OWL-ViT 模型支持。OWL-ViT 是一个开放词汇的目标检测器。这意味着它可以基于自由文本查询在图像中检测对象，而无需在标记的数据集上对模型进行微调。

OWL-ViT 利用多模态表示执行开放词汇检测。它将 CLIP 与轻量级对象分类和定位头结合起来。通过将自由文本查询嵌入到 CLIP 的文本编码器中，并将其用作对象分类和定位头的输入，实现了开放词汇检测。关联图像及其相应的文本描述，ViT 将图像块作为输入进行处理。OWL-ViT 的作者首先从头开始训练 CLIP，然后使用二部匹配损失在标准目标检测数据集上端到端地微调 OWL-ViT。

通过这种方法，模型可以基于文本描述检测对象，而无需事先在标记的数据集上进行训练。

在本指南中，您将学习如何使用 OWL-ViT：

基于文本提示检测对象
用于批量目标检测
用于图像引导的目标检测

在开始之前，请确保已安装所有必要的库：

pip install -q transformers

零样本目标检测管道

尝试使用 OWL-ViT 进行推理的最简单方法是在 Hugging Face Hub 上的管道()中使用它。从Hugging Face Hub 上的检查点实例化一个零样本目标检测管道：

>>> from transformers import pipeline

>>> checkpoint = "google/owlvit-base-patch32"
>>> detector = pipeline(model=checkpoint, task="zero-shot-object-detection")

接下来，选择一个您想要检测对象的图像。这里我们将使用宇航员 Eileen Collins 的图像，该图像是NASA Great Images 数据集的一部分。

>>> import skimage
>>> import numpy as np
>>> from PIL import Image

>>> image = skimage.data.astronaut()
>>> image = Image.fromarray(np.uint8(image)).convert("RGB")

>>> image

宇航员 Eileen Collins

将图像和要查找的候选对象标签传递给管道。这里我们直接传递图像；其他合适的选项包括图像的本地路径或图像 url。我们还传递了所有要查询图像的项目的文本描述。

>>> predictions = detector(
...     image,
...     candidate_labels=["human face", "rocket", "nasa badge", "star-spangled banner"],
... )
>>> predictions
[{'score': 0.3571370542049408,
  'label': 'human face',
  'box': {'xmin': 180, 'ymin': 71, 'xmax': 271, 'ymax': 178}},
 {'score': 0.28099656105041504,
  'label': 'nasa badge',
  'box': {'xmin': 129, 'ymin': 348, 'xmax': 206, 'ymax': 427}},
 {'score': 0.2110239565372467,
  'label': 'rocket',
  'box': {'xmin': 350, 'ymin': -1, 'xmax': 468, 'ymax': 288}},
 {'score': 0.13790413737297058,
  'label': 'star-spangled banner',
  'box': {'xmin': 1, 'ymin': 1, 'xmax': 105, 'ymax': 509}},
 {'score': 0.11950037628412247,
  'label': 'nasa badge',
  'box': {'xmin': 277, 'ymin': 338, 'xmax': 327, 'ymax': 380}},
 {'score': 0.10649408400058746,
  'label': 'rocket',
  'box': {'xmin': 358, 'ymin': 64, 'xmax': 424, 'ymax': 280}}]

让我们可视化预测：

>>> from PIL import ImageDraw

>>> draw = ImageDraw.Draw(image)

>>> for prediction in predictions:
...     box = prediction["box"]
...     label = prediction["label"]
...     score = prediction["score"]

...     xmin, ymin, xmax, ymax = box.values()
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{label}: {round(score,2)}", fill="white")

>>> image

NASA 图像上的可视化预测

手动文本提示的零样本目标检测

现在您已经看到如何使用零样本目标检测管道，让我们手动复制相同的结果。

从Hugging Face Hub 上的检查点加载模型和相关处理器。这里我们将使用与之前相同的检查点：

>>> from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

>>> model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

让我们选择不同的图像来改变一下。

>>> import requests

>>> url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
>>> im = Image.open(requests.get(url, stream=True).raw)
>>> im

海滩照片

使用处理器为模型准备输入。处理器结合了一个图像处理器，通过调整大小和归一化来为模型准备图像，以及一个 CLIPTokenizer，负责处理文本输入。

>>> text_queries = ["hat", "book", "sunglasses", "camera"]
>>> inputs = processor(text=text_queries, images=im, return_tensors="pt")

通过模型传递输入，后处理并可视化结果。由于图像处理器在将图像馈送到模型之前调整了图像的大小，因此您需要使用 post_process_object_detection()方法，以确保预测的边界框相对于原始图像具有正确的坐标：

>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     target_sizes = torch.tensor([im.size[::-1]])
...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

>>> draw = ImageDraw.Draw(im)

>>> scores = results["scores"].tolist()
>>> labels = results["labels"].tolist()
>>> boxes = results["boxes"].tolist()

>>> for box, score, label in zip(boxes, scores, labels):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

>>> im

带有检测对象的海滩照片

批处理

您可以传递多组图像和文本查询以在多个图像中搜索不同（或相同）的对象。让我们一起使用宇航员图像和海滩图像。对于批处理，您应该将文本查询作为嵌套列表传递给处理器，并将图像作为 PIL 图像、PyTorch 张量或 NumPy 数组的列表。

>>> images = [image, im]
>>> text_queries = [
...     ["human face", "rocket", "nasa badge", "star-spangled banner"],
...     ["hat", "book", "sunglasses", "camera"],
... ]
>>> inputs = processor(text=text_queries, images=images, return_tensors="pt")

以前，用于后处理的是将单个图像的大小作为张量传递，但您也可以传递一个元组，或者在有多个图像的情况下，传递一个元组列表。让我们为这两个示例创建预测，并可视化第二个示例（image_idx = 1）。

>>> with torch.no_grad():
...     outputs = model(**inputs)
...     target_sizes = [x.size[::-1] for x in images]
...     results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)

>>> image_idx = 1
>>> draw = ImageDraw.Draw(images[image_idx])

>>> scores = results[image_idx]["scores"].tolist()
>>> labels = results[image_idx]["labels"].tolist()
>>> boxes = results[image_idx]["boxes"].tolist()

>>> for box, score, label in zip(boxes, scores, labels):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
...     draw.text((xmin, ymin), f"{text_queries[image_idx][label]}: {round(score,2)}", fill="white")

>>> images[image_idx]

带有检测到的对象的海滩照片

图像引导的对象检测

除了使用文本查询进行零样本对象检测外，OWL-ViT 还提供了图像引导的对象检测。这意味着您可以使用图像查询在目标图像中找到相似的对象。与文本查询不同，只允许一个示例图像。

让我们以一张沙发上有两只猫的图像作为目标图像，以一张单猫图像作为查询：

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image_target = Image.open(requests.get(url, stream=True).raw)

>>> query_url = "http://images.cocodataset.org/val2017/000000524280.jpg"
>>> query_image = Image.open(requests.get(query_url, stream=True).raw)

让我们快速查看这些图像：

>>> import matplotlib.pyplot as plt

>>> fig, ax = plt.subplots(1, 2)
>>> ax[0].imshow(image_target)
>>> ax[1].imshow(query_image)

在预处理步骤中，现在需要使用query_images而不是文本查询：

>>> inputs = processor(images=image_target, query_images=query_image, return_tensors="pt")

对于预测，不要将输入传递给模型，而是将它们传递给 image_guided_detection()。除了现在没有标签之外，绘制预测与以前相同。

>>> with torch.no_grad():
...     outputs = model.image_guided_detection(**inputs)
...     target_sizes = torch.tensor([image_target.size[::-1]])
...     results = processor.post_process_image_guided_detection(outputs=outputs, target_sizes=target_sizes)[0]

>>> draw = ImageDraw.Draw(image_target)

>>> scores = results["scores"].tolist()
>>> boxes = results["boxes"].tolist()

>>> for box, score, label in zip(boxes, scores, labels):
...     xmin, ymin, xmax, ymax = box
...     draw.rectangle((xmin, ymin, xmax, ymax), outline="white", width=4)

>>> image_target

带有边界框的猫

零样本图像分类

原文链接：huggingface.co/docs/transformers/v4.37.2/en/tasks/zero_shot_image_classification

零样本图像分类是一个任务，涉及使用未明确训练包含来自这些特定类别的标记示例的数据的模型将图像分类为不同的类别。

传统上，图像分类需要在特定一组带标签的图像上训练模型，该模型学习将某些图像特征“映射”到标签。当需要将这样的模型用于引入新标签集的分类任务时，需要进行微调以“重新校准”模型。

相比之下，零样本或开放词汇图像分类模型通常是多模态模型，已经在大量图像和相关描述的数据集上进行了训练。这些模型学习了对齐的视觉-语言表示，可用于许多下游任务，包括零样本图像分类。

这是一种更灵活的图像分类方法，允许模型推广到新的和未见过的类别，而无需额外的训练数据，并且使用户能够使用目标对象的自由形式文本描述查询图像。

在本指南中，您将学习如何：

创建一个零样本图像分类管道
手动运行零样本图像分类推理

在开始之前，请确保已安装所有必要的库：

pip install -q transformers

零样本图像分类管道

尝试使用支持零样本图像分类的模型进行推理的最简单方法是使用相应的 pipeline()。从Hugging Face Hub 上的检查点实例化一个管道：

>>> from transformers import pipeline

>>> checkpoint = "openai/clip-vit-large-patch14"
>>> detector = pipeline(model=checkpoint, task="zero-shot-image-classification")

接下来，选择一个您想要分类的图像。

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/g8oS8-82DxI/download?ixid=MnwxMjA3fDB8MXx0b3BpY3x8SnBnNktpZGwtSGt8fHx8fDJ8fDE2NzgxMDYwODc&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

猫头鹰的照片

将图像和候选对象标签传递给管道。在这里，我们直接传递图像；其他合适的选项包括图像的本地路径或图像 url。候选标签可以像这个例子中那样简单，也可以更具描述性。

>>> predictions = detector(image, candidate_labels=["fox", "bear", "seagull", "owl"])
>>> predictions
[{'score': 0.9996670484542847, 'label': 'owl'},
 {'score': 0.000199399160919711, 'label': 'seagull'},
 {'score': 7.392891711788252e-05, 'label': 'fox'},
 {'score': 5.96074532950297e-05, 'label': 'bear'}]

手动进行零样本图像分类

现在您已经看到如何使用零样本图像分类管道，让我们看看如何手动运行零样本图像分类。

首先从Hugging Face Hub 上的检查点加载模型和相关处理器。在这里，我们将使用与之前相同的检查点：

>>> from transformers import AutoProcessor, AutoModelForZeroShotImageClassification

>>> model = AutoModelForZeroShotImageClassification.from_pretrained(checkpoint)
>>> processor = AutoProcessor.from_pretrained(checkpoint)

让我们换一张不同的图片。

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/xBRQfR2bqNI/download?ixid=MnwxMjA3fDB8MXxhbGx8fHx8fHx8fHwxNjc4Mzg4ODEx&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image

汽车的照片

使用处理器为模型准备输入。处理器结合了一个图像处理器，通过调整大小和归一化来为模型准备图像，以及一个标记器，负责处理文本输入。

>>> candidate_labels = ["tree", "car", "bike", "cat"]
>>> inputs = processor(images=image, text=candidate_labels, return_tensors="pt", padding=True)

通过模型传递输入，并对结果进行后处理：

>>> import torch

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits = outputs.logits_per_image[0]
>>> probs = logits.softmax(dim=-1).numpy()
>>> scores = probs.tolist()

>>> result = [
...     {"score": score, "label": candidate_label}
...     for score, candidate_label in sorted(zip(probs, candidate_labels), key=lambda x: -x[0])
... ]

>>> result
[{'score': 0.998572, 'label': 'car'},
 {'score': 0.0010570387, 'label': 'bike'},
 {'score': 0.0003393686, 'label': 'tree'},
 {'score': 3.1572064e-05, 'label': 'cat'}]

单目深度估计

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/monocular_depth_estimation

单目深度估计是一个涉及从单个图像预测场景深度信息的计算机视觉任务。换句话说，它是从单个摄像机视角估计场景中物体的距离的过程。

单目深度估计具有各种应用，包括 3D 重建，增强现实，自动驾驶和机器人技术。这是一个具有挑战性的任务，因为它要求模型理解场景中物体之间以及相应深度信息之间的复杂关系，这些关系可能受到光照条件、遮挡和纹理等因素的影响。

本教程中展示的任务由以下模型架构支持：

DPT, GLPN

在本指南中，您将学习如何：

创建深度估计管道
手动运行深度估计推断

在开始之前，请确保已安装所有必要的库：

pip install -q transformers

深度估计管道

尝试使用支持深度估计的模型进行推断的最简单方法是使用相应的 pipeline()。从Hugging Face Hub 上的检查点实例化一个管道：

>>> from transformers import pipeline

>>> checkpoint = "vinvino02/glpn-nyu"
>>> depth_estimator = pipeline("depth-estimation", model=checkpoint)

接下来，选择要分析的图像：

>>> from PIL import Image
>>> import requests

>>> url = "https://unsplash.com/photos/HwBAsSbPBDU/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MzR8fGNhciUyMGluJTIwdGhlJTIwc3RyZWV0fGVufDB8MHx8fDE2Nzg5MDEwODg&force=true&w=640"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> image

繁忙街道的照片

将图像传递给管道。

>>> predictions = depth_estimator(image)

管道返回一个带有两个条目的字典。第一个条目名为predicted_depth，是一个张量，其值为每个像素的以米为单位的深度。第二个条目depth是一个 PIL 图像，可视化深度估计结果。

让我们看一下可视化结果：

>>> predictions["depth"]

深度估计可视化

手动进行深度估计推断

现在您已经看到如何使用深度估计管道，让我们看看如何手动复制相同的结果。

从Hugging Face Hub 上的检查点加载模型和相关处理器开始。这里我们将使用与之前相同的检查点：

>>> from transformers import AutoImageProcessor, AutoModelForDepthEstimation

>>> checkpoint = "vinvino02/glpn-nyu"

>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint)
>>> model = AutoModelForDepthEstimation.from_pretrained(checkpoint)

使用image_processor准备模型的图像输入，该处理器将处理必要的图像转换，如调整大小和归一化：

>>> pixel_values = image_processor(image, return_tensors="pt").pixel_values

通过模型传递准备好的输入：

>>> import torch

>>> with torch.no_grad():
...     outputs = model(pixel_values)
...     predicted_depth = outputs.predicted_depth

可视化结果：

>>> import numpy as np

>>> # interpolate to original size
>>> prediction = torch.nn.functional.interpolate(
...     predicted_depth.unsqueeze(1),
...     size=image.size[::-1],
...     mode="bicubic",
...     align_corners=False,
... ).squeeze()
>>> output = prediction.numpy()

>>> formatted = (output * 255 / np.max(output)).astype("uint8")
>>> depth = Image.fromarray(formatted)
>>> depth

深度估计可视化

图像到图像任务指南

原文链接：huggingface.co/docs/transformers/v4.37.2/en/tasks/image_to_image

图像到图像任务是一个应用程序接收图像并输出另一幅图像的任务。这包括各种子任务，包括图像增强（超分辨率、低光增强、去雨等）、图像修补等。

本指南将向您展示如何：

使用图像到图像管道进行超分辨率任务，
运行相同任务的图像到图像模型，而不使用管道。

请注意，截至本指南发布时，图像到图像管道仅支持超分辨率任务。

让我们开始安装必要的库。

pip install transformers

现在我们可以使用Swin2SR 模型初始化管道。然后，通过调用图像来推断管道。目前，此管道仅支持Swin2SR 模型。

from transformers import pipeline

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pipe = pipeline(task="image-to-image", model="caidas/swin2SR-lightweight-x2-64", device=device)

现在，让我们加载一张图像。

from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/cat.jpg"
image = Image.open(requests.get(url, stream=True).raw)

print(image.size)

# (532, 432)

一只猫的照片

现在我们可以使用管道进行推断。我们将得到一张猫图像的放大版本。

upscaled = pipe(image)
print(upscaled.size)

# (1072, 880)

如果您希望自己进行推断而不使用管道，可以使用 transformers 的Swin2SRForImageSuperResolution和Swin2SRImageProcessor类。我们将使用相同的模型检查点。让我们初始化模型和处理器。

from transformers import Swin2SRForImageSuperResolution, Swin2SRImageProcessor 

model = Swin2SRForImageSuperResolution.from_pretrained("caidas/swin2SR-lightweight-x2-64").to(device)
processor = Swin2SRImageProcessor("caidas/swin2SR-lightweight-x2-64")

pipeline抽象了我们必须自己完成的预处理和后处理步骤，因此让我们对图像进行预处理。我们将图像传递给处理器，然后将像素值移动到 GPU。

pixel_values = processor(image, return_tensors="pt").pixel_values
print(pixel_values.shape)

pixel_values = pixel_values.to(device)

现在我们可以通过将像素值传递给模型来推断图像。

import torch

with torch.no_grad():
  outputs = model(pixel_values)

输出是一个类型为ImageSuperResolutionOutput的对象，看起来像下面这样👇

(loss=None, reconstruction=tensor([[[[0.8270, 0.8269, 0.8275,  ..., 0.7463, 0.7446, 0.7453],
          [0.8287, 0.8278, 0.8283,  ..., 0.7451, 0.7448, 0.7457],
          [0.8280, 0.8273, 0.8269,  ..., 0.7447, 0.7446, 0.7452],
          ...,
          [0.5923, 0.5933, 0.5924,  ..., 0.0697, 0.0695, 0.0706],
          [0.5926, 0.5932, 0.5926,  ..., 0.0673, 0.0687, 0.0705],
          [0.5927, 0.5914, 0.5922,  ..., 0.0664, 0.0694, 0.0718]]]],
       device='cuda:0'), hidden_states=None, attentions=None)

我们需要获取reconstruction并对其进行后处理以进行可视化。让我们看看它是什么样子的。

outputs.reconstruction.data.shape
# torch.Size([1, 3, 880, 1072])

我们需要挤压输出并去掉轴 0，裁剪值，然后将其转换为 numpy 浮点数。然后我们将排列轴以获得形状[1072, 880]，最后将输出带回范围[0, 255]。

import numpy as np

# squeeze, take to CPU and clip the values
output = outputs.reconstruction.data.squeeze().cpu().clamp_(0, 1).numpy()
# rearrange the axes
output = np.moveaxis(output, source=0, destination=-1)
# bring values back to pixel values range
output = (output * 255.0).round().astype(np.uint8)
Image.fromarray(output)

一只猫的放大照片

计算机视觉知识蒸馏

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/knowledge_distillation_for_image_classification

知识蒸馏是一种技术，用于将知识从一个更大、更复杂的模型（教师）转移到一个更小、更简单的模型（学生）。为了从一个模型中提取知识到另一个模型，我们采用一个在特定任务上（本例中为图像分类）训练过的预训练教师模型，并随机初始化一个学生模型用于图像分类训练。接下来，我们训练学生模型以最小化其输出与教师输出之间的差异，从而使其模仿行为。这最初是由 Hinton 等人在神经网络中提取知识中首次引入的。在这个指南中，我们将进行特定任务的知识蒸馏。我们将使用 beans 数据集。

这个指南演示了如何使用 🤗 Transformers 的 Trainer API 将一个 fine-tuned ViT 模型（教师模型）蒸馏到一个 MobileNet（学生模型）。

让我们安装进行蒸馏和评估过程所需的库。

pip install transformers datasets accelerate tensorboard evaluate --upgrade

在这个例子中，我们使用 merve/beans-vit-224 模型作为教师模型。这是一个基于 google/vit-base-patch16-224-in21k 在 beans 数据集上微调的图像分类模型。我们将将这个模型蒸馏到一个随机初始化的 MobileNetV2。

我们现在将加载数据集。

from datasets import load_dataset

dataset = load_dataset("beans")

我们可以从任一模型中使用图像处理器，因为在这种情况下它们返回相同分辨率的相同输出。我们将使用 dataset 的 map() 方法将预处理应用于数据集的每个拆分。

from transformers import AutoImageProcessor
teacher_processor = AutoImageProcessor.from_pretrained("merve/beans-vit-224")

def process(examples):
    processed_inputs = teacher_processor(examples["image"])
    return processed_inputs

processed_datasets = dataset.map(process, batched=True)

基本上，我们希望学生模型（随机初始化的 MobileNet）模仿教师模型（微调的视觉变换器）。为了实现这一点，我们首先从教师和学生中获取 logits 输出。然后，我们将它们中的每一个除以控制每个软目标重要性的参数 temperature。一个称为 lambda 的参数权衡了蒸馏损失的重要性。在这个例子中，我们将使用 temperature=5 和 lambda=0.5。我们将使用 Kullback-Leibler 散度损失来计算学生和教师之间的差异。给定两个数据 P 和 Q，KL 散度解释了我们需要多少额外信息来用 Q 表示 P。如果两者相同，它们的 KL 散度为零，因为不需要其他信息来解释 P。因此，在知识蒸馏的背景下，KL 散度是有用的。

from transformers import TrainingArguments, Trainer
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageDistilTrainer(Trainer):
    def __init__(self, teacher_model=None, student_model=None, temperature=None, lambda_param=None,  *args, **kwargs):
        super().__init__(model=student_model, *args, **kwargs)
        self.teacher = teacher_model
        self.student = student_model
        self.loss_function = nn.KLDivLoss(reduction="batchmean")
        device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.teacher.to(device)
        self.teacher.eval()
        self.temperature = temperature
        self.lambda_param = lambda_param

    def compute_loss(self, student, inputs, return_outputs=False):
        student_output = self.student(**inputs)

        with torch.no_grad():
          teacher_output = self.teacher(**inputs)

        # Compute soft targets for teacher and student
        soft_teacher = F.softmax(teacher_output.logits / self.temperature, dim=-1)
        soft_student = F.log_softmax(student_output.logits / self.temperature, dim=-1)

        # Compute the loss
        distillation_loss = self.loss_function(soft_student, soft_teacher) * (self.temperature ** 2)

        # Compute the true label loss
        student_target_loss = student_output.loss

        # Calculate final loss
        loss = (1. - self.lambda_param) * student_target_loss + self.lambda_param * distillation_loss
        return (loss, student_output) if return_outputs else loss

我们现在将登录到 Hugging Face Hub，这样我们就可以通过 Trainer 将我们的模型推送到 Hugging Face Hub。

from huggingface_hub import notebook_login

notebook_login()

让我们设置 TrainingArguments、教师模型和学生模型。

from transformers import AutoModelForImageClassification, MobileNetV2Config, MobileNetV2ForImageClassification

training_args = TrainingArguments(
    output_dir="my-awesome-model",
    num_train_epochs=30,
    fp16=True,
    logging_dir=f"{repo_name}/logs",
    logging_strategy="epoch",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    report_to="tensorboard",
    push_to_hub=True,
    hub_strategy="every_save",
    hub_model_id=repo_name,
    )

num_labels = len(processed_datasets["train"].features["labels"].names)

# initialize models
teacher_model = AutoModelForImageClassification.from_pretrained(
    "merve/beans-vit-224",
    num_labels=num_labels,
    ignore_mismatched_sizes=True
)

# training MobileNetV2 from scratch
student_config = MobileNetV2Config()
student_config.num_labels = num_labels
student_model = MobileNetV2ForImageClassification(student_config)

我们可以使用 compute_metrics 函数在测试集上评估我们的模型。这个函数将在训练过程中用于计算我们模型的 准确率 和 f1。

import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    acc = accuracy.compute(references=labels, predictions=np.argmax(predictions, axis=1))
    return {"accuracy": acc["accuracy"]}

让我们使用我们定义的训练参数初始化 Trainer。我们还将初始化我们的数据收集器。

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
trainer = ImageDistilTrainer(
    student_model=student_model,
    teacher_model=teacher_model,
    training_args=training_args,
    train_dataset=processed_datasets["train"],
    eval_dataset=processed_datasets["validation"],
    data_collator=data_collator,
    tokenizer=teacher_processor,
    compute_metrics=compute_metrics,
    temperature=5,
    lambda_param=0.5
)

我们现在可以训练我们的模型。

trainer.train()

我们可以在测试集上评估模型。

trainer.evaluate(processed_datasets["test"])

在测试集上，我们的模型达到了 72％的准确率。为了对蒸馏效率进行合理性检查，我们还使用相同的超参数从头开始在豆类数据集上训练 MobileNet，并观察到测试集上的 63％准确率。我们邀请读者尝试不同的预训练教师模型、学生架构、蒸馏参数，并报告他们的发现。蒸馏模型的训练日志和检查点可以在此存储库中找到，从头开始训练的 MobileNetV2 可以在此存储库中找到。

多模态

图像字幕

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/image_captioning

图像字幕是预测给定图像的字幕的任务。它的常见实际应用包括帮助视觉障碍人士，帮助他们在不同情况下导航。因此，图像字幕通过向人们描述图像来帮助提高人们对内容的可访问性。

本指南将向您展示如何：

微调图像字幕模型。
用于推理的微调模型。

在开始之前，请确保您已安装所有必要的库：

pip install transformers datasets evaluate -q
pip install jiwer -q

我们鼓励您登录您的 Hugging Face 账户，这样您就可以上传并与社区分享您的模型。在提示时，输入您的令牌以登录：

from huggingface_hub import notebook_login

notebook_login()

加载 Pokemon BLIP 字幕数据集

使用🤗数据集库加载一个由{图像-标题}对组成的数据集。要在 PyTorch 中创建自己的图像字幕数据集，您可以参考此笔记本。

from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions")
ds

DatasetDict({
    train: Dataset({
        features: ['image', 'text'],
        num_rows: 833
    })
})

数据集有两个特征，图像和文本。

许多图像字幕数据集包含每个图像的多个字幕。在这种情况下，一个常见的策略是在训练过程中在可用的字幕中随机抽取一个字幕。

使用[~datasets.Dataset.train_test_split]方法将数据集的训练集拆分为训练集和测试集：

ds = ds["train"].train_test_split(test_size=0.1)
train_ds = ds["train"]
test_ds = ds["test"]

让我们从训练集中可视化几个样本。

from textwrap import wrap
import matplotlib.pyplot as plt
import numpy as np

def plot_images(images, captions):
    plt.figure(figsize=(20, 20))
    for i in range(len(images)):
        ax = plt.subplot(1, len(images), i + 1)
        caption = captions[i]
        caption = "\n".join(wrap(caption, 12))
        plt.title(caption)
        plt.imshow(images[i])
        plt.axis("off")

sample_images_to_visualize = [np.array(train_ds[i]["image"]) for i in range(5)]
sample_captions = [train_ds[i]["text"] for i in range(5)]
plot_images(sample_images_to_visualize, sample_captions)

样本训练图像

预处理数据集

由于数据集具有两种模态（图像和文本），预处理流水线将预处理图像和标题。

为此，加载与您即将微调的模型相关联的处理器类。

from transformers import AutoProcessor

checkpoint = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(checkpoint)

处理器将在内部预处理图像（包括调整大小和像素缩放）并对标题进行标记。

def transforms(example_batch):
    images = [x for x in example_batch["image"]]
    captions = [x for x in example_batch["text"]]
    inputs = processor(images=images, text=captions, padding="max_length")
    inputs.update({"labels": inputs["input_ids"]})
    return inputs

train_ds.set_transform(transforms)
test_ds.set_transform(transforms)

有了准备好的数据集，您现在可以为微调设置模型。

加载基础模型

将“microsoft/git-base”加载到AutoModelForCausalLM对象中。

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(checkpoint)

评估

图像字幕模型通常使用Rouge Score或Word Error Rate进行评估。在本指南中，您将使用 Word Error Rate (WER)。

我们使用🤗评估库来做到这一点。有关 WER 的潜在限制和其他注意事项，请参考此指南。

from evaluate import load
import torch

wer = load("wer")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predicted = logits.argmax(-1)
    decoded_labels = processor.batch_decode(labels, skip_special_tokens=True)
    decoded_predictions = processor.batch_decode(predicted, skip_special_tokens=True)
    wer_score = wer.compute(predictions=decoded_predictions, references=decoded_labels)
    return {"wer_score": wer_score}

训练！

现在，您已经准备好开始微调模型了。您将使用🤗Trainer 来进行此操作。

首先，使用 TrainingArguments 定义训练参数。

from transformers import TrainingArguments, Trainer

model_name = checkpoint.split("/")[1]

training_args = TrainingArguments(
    output_dir=f"{model_name}-pokemon",
    learning_rate=5e-5,
    num_train_epochs=50,
    fp16=True,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=2,
    save_total_limit=3,
    evaluation_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    logging_steps=50,
    remove_unused_columns=False,
    push_to_hub=True,
    label_names=["labels"],
    load_best_model_at_end=True,
)

然后将它们与数据集和模型一起传递给🤗 Trainer。

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

要开始训练，只需在 Trainer 对象上调用 train()。

trainer.train()

您应该看到随着训练的进行，训练损失平稳下降。

一旦训练完成，使用 push_to_hub()方法将您的模型共享到 Hub，以便每个人都可以使用您的模型：

trainer.push_to_hub()

推理

从test_ds中取一个样本图像来测试模型。

from PIL import Image
import requests

url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/pokemon.png"
image = Image.open(requests.get(url, stream=True).raw)
image

测试图片为模型准备图像。

device = "cuda" if torch.cuda.is_available() else "cpu"

inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs.pixel_values

调用generate并解码预测。

generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
generated_caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_caption)

a drawing of a pink and blue pokemon

看起来微调的模型生成了一个相当不错的字幕！

文档问答

原始文本：huggingface.co/docs/transformers/v4.37.2/en/tasks/document_question_answering

文档问答，也称为文档视觉问答，是一个涉及提供关于文档图像的问题的答案的任务。支持此任务的模型的输入通常是图像和问题的组合，输出是用自然语言表达的答案。这些模型利用多种模态，包括文本、单词的位置（边界框）和图像本身。

本指南说明了如何：

在 DocVQA 数据集上对 LayoutLMv2 进行微调。
使用您微调的模型进行推断。

本教程中演示的任务由以下模型架构支持：

LayoutLM, LayoutLMv2, LayoutLMv3

LayoutLMv2 通过在标记的最终隐藏状态之上添加一个问题-回答头来解决文档问答任务，以预测答案的开始和结束标记的位置。换句话说，这个问题被视为抽取式问答：在给定上下文的情况下，提取哪个信息片段回答问题。上下文来自 OCR 引擎的输出，这里是 Google 的 Tesseract。

在开始之前，请确保您已安装所有必要的库。LayoutLMv2 依赖于 detectron2、torchvision 和 tesseract。

pip install -q transformers datasets

pip install 'git+https://github.com/facebookresearch/detectron2.git'
pip install torchvision

sudo apt install tesseract-ocr
pip install -q pytesseract

安装完所有依赖项后，请重新启动您的运行时。

我们鼓励您与社区分享您的模型。登录到您的 Hugging Face 账户，将其上传到 🤗 Hub。在提示时，输入您的令牌以登录：

>>> from huggingface_hub import notebook_login

>>> notebook_login()

让我们定义一些全局变量。

>>> model_checkpoint = "microsoft/layoutlmv2-base-uncased"
>>> batch_size = 4

加载数据

在本指南中，我们使用了一个小样本的预处理 DocVQA，您可以在 🤗 Hub 上找到。如果您想使用完整的 DocVQA 数据集，您可以在 DocVQA 主页上注册并下载。如果您这样做了，要继续本指南，请查看如何将文件加载到 🤗 数据集。

>>> from datasets import load_dataset

>>> dataset = load_dataset("nielsr/docvqa_1200_examples")
>>> dataset
DatasetDict({
    train: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 1000
    })
    test: Dataset({
        features: ['id', 'image', 'query', 'answers', 'words', 'bounding_boxes', 'answer'],
        num_rows: 200
    })
})

如您所见，数据集已经分为训练集和测试集。查看一个随机示例，以熟悉特征。

>>> dataset["train"].features

这里是各个字段代表的含义：

id：示例的 id
image：包含文档图像的 PIL.Image.Image 对象
query：问题字符串 - 自然语言提出的问题，可以是多种语言
answers：人类注释者提供的正确答案列表
words 和 bounding_boxes：OCR 的结果，我们这里不会使用
answer：由另一个模型匹配的答案，我们这里不会使用

让我们只保留英文问题，并且删除包含另一个模型预测的 answer 特征。我们还将从注释者提供的答案集中取第一个答案。或者，您可以随机抽样。

>>> updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
>>> updated_dataset = updated_dataset.map(
...     lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
... )

请注意，本指南中使用的 LayoutLMv2 检查点已经训练了 max_position_embeddings = 512（您可以在检查点的 config.json 文件中找到此信息）。我们可以截断示例，但为了避免答案可能在大型文档的末尾并最终被截断的情况，这里我们将删除几个示例，其中嵌入可能会超过 512。如果您的数据集中大多数文档很长，您可以实现一个滑动窗口策略 - 详细信息请查看此笔记本。

>>> updated_dataset = updated_dataset.filter(lambda x: len(x["words"]) + len(x["question"].split()) < 512)

此时让我们还从数据集中删除 OCR 特征。这些是 OCR 的结果，用于微调不同的模型。如果我们想要使用它们，它们仍然需要一些处理，因为它们不符合我们在本指南中使用的模型的输入要求。相反，我们可以在原始数据上同时使用 LayoutLMv2Processor 进行 OCR 和标记化。这样我们将得到与模型预期输入匹配的输入。如果您想手动处理图像，请查看LayoutLMv2模型文档以了解模型期望的输入格式。

>>> updated_dataset = updated_dataset.remove_columns("words")
>>> updated_dataset = updated_dataset.remove_columns("bounding_boxes")

最后，如果我们不查看一个图像示例，数据探索就不会完成。

>>> updated_dataset["train"][11]["image"]

DocVQA 图像示例

预处理数据

文档问答任务是一个多模态任务，您需要确保每种模态的输入都按照模型的期望进行预处理。让我们从加载 LayoutLMv2Processor 开始，它内部结合了一个可以处理图像数据的图像处理器和一个可以编码文本数据的标记器。

>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained(model_checkpoint)

预处理文档图像

首先，让我们通过处理器中的image_processor来为模型准备文档图像。默认情况下，图像处理器将图像调整大小为 224x224，确保它们具有正确的颜色通道顺序，应用 OCR 与 tesseract 获取单词和归一化边界框。在本教程中，所有这些默认设置正是我们所需要的。编写一个函数，将默认图像处理应用于一批图像，并返回 OCR 的结果。

>>> image_processor = processor.image_processor

>>> def get_ocr_words_and_boxes(examples):
...     images = [image.convert("RGB") for image in examples["image"]]
...     encoded_inputs = image_processor(images)

...     examples["image"] = encoded_inputs.pixel_values
...     examples["words"] = encoded_inputs.words
...     examples["boxes"] = encoded_inputs.boxes

...     return examples

为了以快速的方式将此预处理应用于整个数据集，请使用map。

>>> dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)

预处理文本数据

一旦我们对图像应用了 OCR，我们需要对数据集的文本部分进行编码，以准备模型使用。这涉及将我们在上一步中获得的单词和框转换为标记级别的input_ids、attention_mask、token_type_ids和bbox。对于文本预处理，我们将需要处理器中的tokenizer。

>>> tokenizer = processor.tokenizer

除了上述预处理之外，我们还需要为模型添加标签。对于🤗 Transformers 中的xxxForQuestionAnswering模型，标签包括start_positions和end_positions，指示答案的起始和结束的标记在哪里。

让我们从这里开始。定义一个辅助函数，可以在较大的列表（单词列表）中找到一个子列表（答案拆分为单词）。

此函数将接受两个列表作为输入，words_list和answer_list。然后，它将遍历words_list，检查words_list中当前单词（words_list[i]）是否等于answer_list的第一个单词（answer_list[0])，以及从当前单词开始且与answer_list相同长度的words_list子列表是否等于answer_list。如果这个条件为真，表示找到了匹配，函数将记录匹配及其起始索引（idx）和结束索引（idx + len(answer_list) - 1）。如果找到了多个匹配，函数将仅返回第一个。如果没有找到匹配，函数将返回（None，0 和 0）。

>>> def subfinder(words_list, answer_list):
...     matches = []
...     start_indices = []
...     end_indices = []
...     for idx, i in enumerate(range(len(words_list))):
...         if words_list[i] == answer_list[0] and words_list[i : i + len(answer_list)] == answer_list:
...             matches.append(answer_list)
...             start_indices.append(idx)
...             end_indices.append(idx + len(answer_list) - 1)
...     if matches:
...         return matches[0], start_indices[0], end_indices[0]
...     else:
...         return None, 0, 0

为了说明此函数如何找到答案的位置，让我们在一个示例上使用它：

>>> example = dataset_with_ocr["train"][1]
>>> words = [word.lower() for word in example["words"]]
>>> match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split())
>>> print("Question: ", example["question"])
>>> print("Words:", words)
>>> print("Answer: ", example["answer"])
>>> print("start_index", word_idx_start)
>>> print("end_index", word_idx_end)
Question:  Who is in  cc in this letter?
Words: ['wie', 'baw', 'brown', '&', 'williamson', 'tobacco', 'corporation', 'research', '&', 'development', 'internal', 'correspondence', 'to:', 'r.', 'h.', 'honeycutt', 'ce:', 't.f.', 'riehl', 'from:', '.', 'c.j.', 'cook', 'date:', 'may', '8,', '1995', 'subject:', 'review', 'of', 'existing', 'brainstorming', 'ideas/483', 'the', 'major', 'function', 'of', 'the', 'product', 'innovation', 'graup', 'is', 'to', 'develop', 'marketable', 'nove!', 'products', 'that', 'would', 'be', 'profitable', 'to', 'manufacture', 'and', 'sell.', 'novel', 'is', 'defined', 'as:', 'of', 'a', 'new', 'kind,', 'or', 'different', 'from', 'anything', 'seen', 'or', 'known', 'before.', 'innovation', 'is', 'defined', 'as:', 'something', 'new', 'or', 'different', 'introduced;', 'act', 'of', 'innovating;', 'introduction', 'of', 'new', 'things', 'or', 'methods.', 'the', 'products', 'may', 'incorporate', 'the', 'latest', 'technologies,', 'materials', 'and', 'know-how', 'available', 'to', 'give', 'then', 'a', 'unique', 'taste', 'or', 'look.', 'the', 'first', 'task', 'of', 'the', 'product', 'innovation', 'group', 'was', 'to', 'assemble,', 'review', 'and', 'categorize', 'a', 'list', 'of', 'existing', 'brainstorming', 'ideas.', 'ideas', 'were', 'grouped', 'into', 'two', 'major', 'categories', 'labeled', 'appearance', 'and', 'taste/aroma.', 'these', 'categories', 'are', 'used', 'for', 'novel', 'products', 'that', 'may', 'differ', 'from', 'a', 'visual', 'and/or', 'taste/aroma', 'point', 'of', 'view', 'compared', 'to', 'canventional', 'cigarettes.', 'other', 'categories', 'include', 'a', 'combination', 'of', 'the', 'above,', 'filters,', 'packaging', 'and', 'brand', 'extensions.', 'appearance', 'this', 'category', 'is', 'used', 'for', 'novel', 'cigarette', 'constructions', 'that', 'yield', 'visually', 'different', 'products', 'with', 'minimal', 'changes', 'in', 'smoke', 'chemistry', 'two', 'cigarettes', 'in', 'cne.', 'emulti-plug', 'te', 'build', 'yaur', 'awn', 'cigarette.', 'eswitchable', 'menthol', 'or', 'non', 'menthol', 'cigarette.', '*cigarettes', 'with', 'interspaced', 'perforations', 'to', 'enable', 'smoker', 'to', 'separate', 'unburned', 'section', 'for', 'future', 'smoking.', '«short', 'cigarette,', 'tobacco', 'section', '30', 'mm.', '«extremely', 'fast', 'buming', 'cigarette.', '«novel', 'cigarette', 'constructions', 'that', 'permit', 'a', 'significant', 'reduction', 'iretobacco', 'weight', 'while', 'maintaining', 'smoking', 'mechanics', 'and', 'visual', 'characteristics.', 'higher', 'basis', 'weight', 'paper:', 'potential', 'reduction', 'in', 'tobacco', 'weight.', '«more', 'rigid', 'tobacco', 'column;', 'stiffing', 'agent', 'for', 'tobacco;', 'e.g.', 'starch', '*colored', 'tow', 'and', 'cigarette', 'papers;', 'seasonal', 'promotions,', 'e.g.', 'pastel', 'colored', 'cigarettes', 'for', 'easter', 'or', 'in', 'an', 'ebony', 'and', 'ivory', 'brand', 'containing', 'a', 'mixture', 'of', 'all', 'black', '(black', 'paper', 'and', 'tow)', 'and', 'ail', 'white', 'cigarettes.', '499150498']
Answer:  T.F. Riehl
start_index 17
end_index 18

然而，一旦示例被编码，它们将看起来像这样：

>>> encoding = tokenizer(example["question"], example["words"], example["boxes"])
>>> tokenizer.decode(encoding["input_ids"])
[CLS] who is in cc in this letter? [SEP] wie baw brown & williamson tobacco corporation research & development ...

我们需要找到编码输入中答案的位置。

token_type_ids告诉我们哪些标记属于问题，哪些属于文档的单词。
tokenizer.cls_token_id将帮助找到输入开头的特殊标记。
word_ids将帮助将原始words中找到的答案与完全编码输入中的相同答案进行匹配，并确定编码输入中答案的起始/结束位置。

有了这个想法，让我们创建一个函数来对数据集中的一批示例进行编码：

>>> def encode_dataset(examples, max_length=512):
...     questions = examples["question"]
...     words = examples["words"]
...     boxes = examples["boxes"]
...     answers = examples["answer"]

...     # encode the batch of examples and initialize the start_positions and end_positions
...     encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True)
...     start_positions = []
...     end_positions = []

...     # loop through the examples in the batch
...     for i in range(len(questions)):
...         cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id)

...         # find the position of the answer in example's words
...         words_example = [word.lower() for word in words[i]]
...         answer = answers[i]
...         match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())

...         if match:
...             # if match is found, use `token_type_ids` to find where words start in the encoding
...             token_type_ids = encoding["token_type_ids"][i]
...             token_start_index = 0
...             while token_type_ids[token_start_index] != 1:
...                 token_start_index += 1

...             token_end_index = len(encoding["input_ids"][i]) - 1
...             while token_type_ids[token_end_index] != 1:
...                 token_end_index -= 1

...             word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1]
...             start_position = cls_index
...             end_position = cls_index

...             # loop over word_ids and increase `token_start_index` until it matches the answer position in words
...             # once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding
...             for id in word_ids:
...                 if id == word_idx_start:
...                     start_position = token_start_index
...                 else:
...                     token_start_index += 1

...             # similarly loop over `word_ids` starting from the end to find the `end_position` of the answer
...             for id in word_ids[::-1]:
...                 if id == word_idx_end:
...                     end_position = token_end_index
...                 else:
...                     token_end_index -= 1

...             start_positions.append(start_position)
...             end_positions.append(end_position)

...         else:
...             start_positions.append(cls_index)
...             end_positions.append(cls_index)

...     encoding["image"] = examples["image"]
...     encoding["start_positions"] = start_positions
...     encoding["end_positions"] = end_positions

...     return encoding

现在我们有了这个预处理函数，我们可以对整个数据集进行编码：

>>> encoded_train_dataset = dataset_with_ocr["train"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names
... )
>>> encoded_test_dataset = dataset_with_ocr["test"].map(
...     encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names
... )

让我们看看编码数据集的特征是什么样子的：

>>> encoded_train_dataset.features
{'image': Sequence(feature=Sequence(feature=Sequence(feature=Value(dtype='uint8', id=None), length=-1, id=None), length=-1, id=None), length=-1, id=None),
 'input_ids': Sequence(feature=Value(dtype='int32', id=None), length=-1, id=None),
 'token_type_ids': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'attention_mask': Sequence(feature=Value(dtype='int8', id=None), length=-1, id=None),
 'bbox': Sequence(feature=Sequence(feature=Value(dtype='int64', id=None), length=-1, id=None), length=-1, id=None),
 'start_positions': Value(dtype='int64', id=None),
 'end_positions': Value(dtype='int64', id=None)}

评估

文档问题回答的评估需要大量的后处理。为了避免占用太多时间，本指南跳过了评估步骤。Trainer 在训练过程中仍会计算评估损失，因此您不会完全不了解模型的性能。提取式问答通常使用 F1/完全匹配进行评估。如果您想自己实现，请查看 Hugging Face 课程的问答章节获取灵感。

训练

恭喜！您已成功完成本指南中最困难的部分，现在您已经准备好训练自己的模型。训练包括以下步骤：

使用与预处理相同的检查点加载 AutoModelForDocumentQuestionAnswering 模型。
在 TrainingArguments 中定义您的训练超参数。
定义一个将示例批处理在一起的函数，这里 DefaultDataCollator 将做得很好
将训练参数传递给 Trainer，以及模型、数据集和数据收集器。
调用 train()来微调您的模型。

>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint)

在 TrainingArguments 中使用output_dir指定保存模型的位置，并根据需要配置超参数。如果希望与社区分享模型，请将push_to_hub设置为True（您必须登录 Hugging Face 才能上传模型）。在这种情况下，output_dir也将是将推送模型检查点的存储库的名称。

>>> from transformers import TrainingArguments

>>> # REPLACE THIS WITH YOUR REPO ID
>>> repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa"

>>> training_args = TrainingArguments(
...     output_dir=repo_id,
...     per_device_train_batch_size=4,
...     num_train_epochs=20,
...     save_steps=200,
...     logging_steps=50,
...     evaluation_strategy="steps",
...     learning_rate=5e-5,
...     save_total_limit=2,
...     remove_unused_columns=False,
...     push_to_hub=True,
... )

定义一个简单的数据收集器来将示例批处理在一起。

>>> from transformers import DefaultDataCollator

>>> data_collator = DefaultDataCollator()

最后，将所有内容汇总，并调用 train()：

>>> from transformers import Trainer

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     data_collator=data_collator,
...     train_dataset=encoded_train_dataset,
...     eval_dataset=encoded_test_dataset,
...     tokenizer=processor,
... )

>>> trainer.train()

要将最终模型添加到🤗 Hub，创建一个模型卡并调用push_to_hub：

>>> trainer.create_model_card()
>>> trainer.push_to_hub()

推理

现在您已经微调了一个 LayoutLMv2 模型，并将其上传到🤗 Hub，您可以用它进行推理。尝试使用微调模型进行推理的最简单方法是在 Pipeline 中使用它。

让我们举个例子：

>>> example = dataset["test"][2]
>>> question = example["query"]["en"]
>>> image = example["image"]
>>> print(question)
>>> print(example["answers"])
'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?'
['TRRF Vice President', 'lee a. waller']

接下来，使用您的模型为文档问题回答实例化一个流水线，并将图像+问题组合传递给它。

>>> from transformers import pipeline

>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> qa_pipeline(image, question)
[{'score': 0.9949808120727539,
  'answer': 'Lee A. Waller',
  'start': 55,
  'end': 57}]

如果愿意，也可以手动复制流水线的结果：

将一张图片和一个问题，使用模型的处理器为其准备好。
将结果或预处理通过模型前向传递。
模型返回start_logits和end_logits，指示答案起始处和答案结束处的标记。两者的形状都是(batch_size, sequence_length)。
对start_logits和end_logits的最后一个维度进行 argmax 操作，以获取预测的start_idx和end_idx。
使用分词器解码答案。

>>> import torch
>>> from transformers import AutoProcessor
>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")

>>> with torch.no_grad():
...     encoding = processor(image.convert("RGB"), question, return_tensors="pt")
...     outputs = model(**encoding)
...     start_logits = outputs.start_logits
...     end_logits = outputs.end_logits
...     predicted_start_idx = start_logits.argmax(-1).item()
...     predicted_end_idx = end_logits.argmax(-1).item()

>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1])
'lee a. waller'

ataset,
… tokenizer=processor,
… )

trainer.train()


要将最终模型添加到🤗 Hub，创建一个模型卡并调用`push_to_hub`：

```py
>>> trainer.create_model_card()
>>> trainer.push_to_hub()

推理

现在您已经微调了一个 LayoutLMv2 模型，并将其上传到🤗 Hub，您可以用它进行推理。尝试使用微调模型进行推理的最简单方法是在 Pipeline 中使用它。

让我们举个例子：

>>> example = dataset["test"][2]
>>> question = example["query"]["en"]
>>> image = example["image"]
>>> print(question)
>>> print(example["answers"])
'Who is ‘presiding’ TRRF GENERAL SESSION (PART 1)?'
['TRRF Vice President', 'lee a. waller']

接下来，使用您的模型为文档问题回答实例化一个流水线，并将图像+问题组合传递给它。

>>> from transformers import pipeline

>>> qa_pipeline = pipeline("document-question-answering", model="MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> qa_pipeline(image, question)
[{'score': 0.9949808120727539,
  'answer': 'Lee A. Waller',
  'start': 55,
  'end': 57}]

如果愿意，也可以手动复制流水线的结果：

将一张图片和一个问题，使用模型的处理器为其准备好。
将结果或预处理通过模型前向传递。
模型返回start_logits和end_logits，指示答案起始处和答案结束处的标记。两者的形状都是(batch_size, sequence_length)。
对start_logits和end_logits的最后一个维度进行 argmax 操作，以获取预测的start_idx和end_idx。
使用分词器解码答案。

>>> import torch
>>> from transformers import AutoProcessor
>>> from transformers import AutoModelForDocumentQuestionAnswering

>>> processor = AutoProcessor.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")
>>> model = AutoModelForDocumentQuestionAnswering.from_pretrained("MariaK/layoutlmv2-base-uncased_finetuned_docvqa")

>>> with torch.no_grad():
...     encoding = processor(image.convert("RGB"), question, return_tensors="pt")
...     outputs = model(**encoding)
...     start_logits = outputs.start_logits
...     end_logits = outputs.end_logits
...     predicted_start_idx = start_logits.argmax(-1).item()
...     predicted_end_idx = end_logits.argmax(-1).item()

>>> processor.tokenizer.decode(encoding.input_ids.squeeze()[predicted_start_idx : predicted_end_idx + 1])
'lee a. waller'