数据块教程

在所有应用中使用数据块

在本教程中，我们将学习如何在各种任务中使用数据块 API 以及如何调试数据块。数据块 API 的名称来源于它的设计方式：构建 DataLoaders 对象所需的所有信息（输入类型、目标类型、如何标记、如何分割等）都被封装在一个块中，并且你可以混合搭配这些块。

从头构建 `DataBlock`

本教程的其余部分将提供许多示例，但我们首先从头构建一个 DataBlock 来解决我们在视觉教程中看到的猫狗大战问题。首先，我们导入视觉所需的一切。

from fastai.data.all import *
from fastai.vision.all import *

第一步是下载并解压我们的数据（如果尚未完成）并获取其位置。

path = untar_data(URLs.PETS)

正如我们所见，所有文件名都在“images”文件夹中。get_image_files 函数有助于获取子文件夹中的所有图像。

fnames = get_image_files(path/"images")

让我们从一个空的 DataBlock 开始。

dblock = DataBlock()

本质上，DataBlock 只是一个关于如何组装数据的蓝图。除非你向它传递源数据，否则它不会做任何事情。你可以选择使用 DataBlock.datasets 或 DataBlock.dataloaders 方法将源数据转换为 Datasets 或 DataLoaders。由于我们尚未准备好用于批处理的数据，因此这里的 dataloaders 方法会失败，但我们可以看看它如何在 Datasets 中进行转换。在这里，我们传递数据源，即所有文件名。

dsets = dblock.datasets(fnames)
dsets.train[0]

(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/Birman_82.jpg'),
 Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/Birman_82.jpg'))

默认情况下，数据块 API 假定我们有一个输入和一个目标，这就是为什么我们看到文件名重复出现两次。

我们可以做的第一件事是使用一个 get_items 函数来实际组装数据块中的项目。

dblock = DataBlock(get_items = get_image_files)

区别在于，然后你将包含图像的文件夹作为源传递，而不是所有文件名。

dsets = dblock.datasets(path/"images")
dsets.train[0]

(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/english_cocker_spaniel_76.jpg'),
 Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/english_cocker_spaniel_76.jpg'))

我们的输入已准备好作为图像处理（因为图像可以从文件名构建），但我们的目标尚未准备好。由于我们面临猫狗大战问题，我们需要将文件名转换为“cat”或“dog”（或 True 或 False）。让我们为此构建一个函数。

def label_func(fname):
    return "cat" if fname.name[0].isupper() else "dog"

然后，我们可以通过将其作为 get_y 传递来告诉我们的数据块使用它来标记我们的目标。

dblock = DataBlock(get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]

(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_77.jpg'),
 'dog')

现在我们的输入和目标已准备好，我们可以指定类型来告诉数据块 API 我们的输入是图像，目标是类别。在数据块 API 中，类型由块表示，这里我们使用 ImageBlock 和 CategoryBlock。

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]

(PILImage mode=RGB size=361x500, TensorCategory(1))

我们可以看到 DataBlock 如何自动添加打开图像所需的转换，或者它如何将名称“dog”更改为索引 1（使用特殊张量类型 TensorCategory(1)）。为此，它创建了一个从类别到索引的映射，称为“vocab”，我们可以通过这种方式访问。

dsets.vocab

['cat', 'dog']

请注意，你可以混合搭配任何块作为输入和目标，这就是为什么 API 被命名为数据块 API。你也可以拥有两个以上的块（如果你有多个输入和/或目标），你只需要将 n_inp 传递给 DataBlock 来告诉库有多少输入（其余的是目标），并向 get_x 和/或 get_y 传递函数列表（以解释如何处理每个项目以使其类型就绪）。请参阅下面的目标检测示例。

下一步是控制如何创建验证集。我们通过向 DataBlock 传递一个 splitter 来实现。例如，这里是如何进行随机分割。

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter())

dsets = dblock.datasets(path/"images")
dsets.train[0]

(PILImage mode=RGB size=320x480, TensorCategory(0))

最后一步是指定项目转换和批量转换（与我们在 ImageDataLoaders 工厂方法中做的方式相同）。

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

通过该调整大小操作，我们现在能够将项目批量处理，最终可以调用 dataloaders 将我们的 DataBlock 转换为 DataLoaders 对象。

dls = dblock.dataloaders(path/"images")
dls.show_batch()

我们通常一次性构建数据块的方式是回答一系列问题：

你的输入/目标的类型是什么？这里是图像和类别。
你的数据在哪里？这里在子文件夹的文件名中。
是否需要对输入应用某些操作？这里不需要。
是否需要对目标应用某些操作？这里是 label_func 函数。
如何分割数据？这里是随机分割。
是否需要对已形成的项目应用某些操作？这里是调整大小。
是否需要对已形成的批量应用某些操作？这里不需要。

这给我们带来了这样的设计：

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

对于两个回答为“否”的问题，如果答案不同，我们将传递的相应参数是 get_x 和 batch_tfms。

图像分类

让我们从图像分类问题的示例开始。图像分类问题有两种：单标签问题（每张图像有一个给定标签）或多标签问题（每张图像可以有多个或没有标签）。我们将在这里介绍这两种类型。

from fastai.vision.all import *

MNIST（单标签）

MNIST 是一个包含 0 到 9 手写数字的数据集。通过回答以下问题，我们可以非常容易地将其加载到数据块 API 中：

我们的输入和目标的类型是什么？黑白图像和标签。
数据在哪里？在子文件夹中。
我们如何知道样本是属于训练集还是验证集？通过查看祖父文件夹。
我们如何知道图像的标签？通过查看父文件夹。

在 API 层面，这些答案翻译如下：

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock), 
                  get_items=get_image_files, 
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)

我们的类型变成了块：一个用于图像（使用黑白 PILImageBW 类），一个用于类别。通过 get_image_files 函数在子文件夹中搜索所有图像文件名。训练/验证集的分割通过 GrandparentSplitter 实现。获取目标（通常称为 y）的函数是 parent_label。

要了解 fastai 库为读取、标记或分割提供哪些对象，请查阅 data.transforms 模块。

数据块本身只是一个蓝图。它不做任何事情，也不检查错误。你必须向它提供数据源才能实际收集数据。这通过 .dataloaders 方法完成。

dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY))
dls.show_batch(max_n=9, figsize=(4,4))

如果上一步出了问题，或者如果你只是好奇幕后发生了什么，请使用 summary 方法。它会详细地一步一步执行，你会看到过程在哪里失败了。

mnist.summary(untar_data(URLs.MNIST_TINY))

Setting-up type transforms pipelines
Collecting items from /home/jhoward/.fastai/data/mnist_tiny
Found 2856 items
2 datasets of sizes 1418,1398
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}

Building one sample
  Pipeline: PILBase.create
    starting from
      /home/jhoward/.fastai/data/mnist_tiny/mnist_tiny/train/7/7420.png
    applying PILBase.create gives
      PILImageBW mode=L size=28x28
  Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
    starting from
      /home/jhoward/.fastai/data/mnist_tiny/mnist_tiny/train/7/7420.png
    applying parent_label gives
      7
    applying Categorize -- {'vocab': None, 'sort': True, 'add_na': False} gives
      TensorCategory(1)

Final sample: (PILImageBW mode=L size=28x28, TensorCategory(1))


Collecting items from /home/jhoward/.fastai/data/mnist_tiny
Found 2856 items
2 datasets of sizes 1418,1398
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}

Building one batch
Applying item_tfms to the first sample:
  Pipeline: ToTensor
    starting from
      (PILImageBW mode=L size=28x28, TensorCategory(1))
    applying ToTensor gives
      (TensorImageBW of size 1x28x28, TensorCategory(1))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

Applying batch_tfms to the batch built
  Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
    starting from
      (TensorImageBW of size 4x1x28x28, TensorCategory([1, 1, 1, 1], device='cuda:0'))
    applying IntToFloatTensor -- {'div': 255.0, 'div_mask': 1} gives
      (TensorImageBW of size 4x1x28x28, TensorCategory([1, 1, 1, 1], device='cuda:0'))

让我们看另一个例子！

Pets（单标签）

The Oxford IIIT Pets dataset 是一个包含狗和猫图片的数据集，有 37 个不同的品种。与 MNIST 相比，一个微小但非常重要的区别是这里的图像尺寸不完全相同。在 MNIST 中，它们都是 28x28 像素，但在这里它们具有不同的纵横比或尺寸。因此，我们需要添加一些操作来使它们都具有相同的大小，以便能够将它们组合成一个批量。我们还将看到如何添加数据增强。

所以让我们回顾之前的问题并添加两个新问题：

我们的输入和目标的类型是什么？图像和标签。
数据在哪里？在子文件夹中。
我们如何知道样本是属于训练集还是验证集？我们将进行随机分割。
我们如何知道图像的标签？通过查看父文件夹。
是否需要对给定样本应用函数？是的，我们需要将所有内容调整到给定的大小。
是否需要对创建后的批量应用函数？是的，我们需要数据增强。

pets = DataBlock(blocks=(ImageBlock, CategoryBlock), 
                 get_items=get_image_files, 
                 splitter=RandomSplitter(),
                 get_y=Pipeline([attrgetter("name"), RegexLabeller(pat = r'^(.*)_\d+.jpg$')]),
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms())

就像 MNIST 一样，我们可以看到这些问题的答案如何直接转换为 API。我们的类型变成了块：一个用于图像，一个用于类别。通过 get_image_files 函数在子文件夹中搜索所有图像文件名。训练/验证集的分割通过 RandomSplitter 实现。获取目标（通常称为 y）的函数是两个转换的组合：我们获取 Path 文件名的名称属性，然后应用正则表达式获取类别。为了将这两个转换组合成一个，我们使用 Pipeline。

最后，我们在项目级别应用调整大小，并在批量级别应用 aug_transforms()。

dls = pets.dataloaders(untar_data(URLs.PETS)/"images")
dls.show_batch(max_n=9)

现在让我们看看如何将相同的 API 用于多标签问题。

Pascal（多标签）

The Pascal dataset 最初是一个目标检测数据集（我们需要预测图片中的物体在哪里）。但它包含许多带有各种物体的图片，因此它为多标签问题提供了一个很好的示例。让我们下载它并查看数据。

pascal_source = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(pascal_source/"train.csv")

df.head()

	fname	labels	is_valid
0	000005.jpg	chair	True
1	000007.jpg	car	True
2	000009.jpg	horse person	True
3	000012.jpg	car	False
4	000016.jpg	bicycle	True

所以看起来我们有一列是文件名，一列是标签（用空格分隔），还有一列告诉我们文件名是否应该放在验证集中。

有多种方法可以将这些数据放入 DataBlock 中，让我们逐一介绍，但首先，让我们回答我们惯常的问题：

我们的输入和目标的类型是什么？图像和多个标签。
数据在哪里？在一个数据框中。
我们如何知道样本是属于训练集还是验证集？数据框中的一列。
我们如何获取图像？通过查看 fname 列。
我们如何知道图像的标签？通过查看 labels 列。
是否需要对给定样本应用函数？是的，我们需要将所有内容调整到给定的大小。
是否需要对创建后的批量应用函数？是的，我们需要数据增强。

请注意，与之前相比多了一个问题：这里我们无需使用 get_items 函数，因为我们已经将所有数据放在一个地方。但我们需要对原始数据框进行一些处理以获取输入，读取第一列并在文件名之前添加正确的文件夹。这就是我们作为 get_x 传递的内容。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=ColReader(0, pref=pascal_source/"train"),
                   get_y=ColReader(1, label_delim=' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

同样，我们可以看到问题的答案如何直接转换为 API。我们的类型变成了块：一个用于图像，一个用于多类别。分割通过 ColSplitter 实现（默认为名为 is_valid 的列）。获取输入（通常称为 x）的函数是针对第一列带有前缀的 ColReader，获取目标（通常称为 y）的函数是针对第二列带有空格分隔符的 ColReader。我们在项目级别应用调整大小，并在批量级别应用 aug_transforms()。

dls = pascal.dataloaders(df)
dls.show_batch()

另一种方法是直接使用函数作为 get_x 和 get_y 的值。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=lambda x:pascal_source/"train"/f'{x[0]}',
                   get_y=lambda x:x[1].split(' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()

或者，我们可以使用列名作为属性（因为数据框的行是 pandas series）。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=lambda o:f'{pascal_source}/train/'+o.fname,
                   get_y=lambda o:o.labels.split(),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()

最有效的方法（为了避免迭代数据框的行，这可能需要很长时间）是使用 from_columns 方法。它将使用 get_items 将列转换为 numpy 数组。缺点是，由于提取相关列后丢失了数据框，我们不能再使用 ColSplitter。这里我们在手动从数据框中提取验证集索引后使用了 IndexSplitter。

def _pascal_items(x): return (
    f'{pascal_source}/train/'+x.fname, x.labels.str.split())
valid_idx = df[df['is_valid']].index.values

pascal = DataBlock.from_columns(blocks=(ImageBlock, MultiCategoryBlock),
                   get_items=_pascal_items,
                   splitter=IndexSplitter(valid_idx),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()

图像定位

有各种属于图像定位类别的问题：图像分割（预测图像每个像素类别的任务）、坐标预测（预测图像上的一个或多个关键点）和目标检测（在物体周围绘制框以进行检测）。

让我们看一个这些问题的示例，以及如何在每种情况下使用数据块 API。

分割

我们将使用 CamVid 数据集的一个小子集作为示例。

path = untar_data(URLs.CAMVID_TINY)

让我们回顾我们惯常的问题：

我们的输入和目标的类型是什么？图像和分割掩码。
数据在哪里？在子文件夹中。
我们如何知道样本是属于训练集还是验证集？我们将进行随机分割。
我们如何知道图像的标签？通过查看“labels”文件夹中的相应文件。
是否需要对创建后的批量应用函数？是的，我们需要数据增强。

camvid = DataBlock(blocks=(ImageBlock, MaskBlock(codes = np.loadtxt(path/'codes.txt', dtype=str))),
    get_items=get_image_files,
    splitter=RandomSplitter(),
    get_y=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    batch_tfms=aug_transforms())

The MaskBlock 是通过 codes 生成的，它提供了掩码像素值与其对应物体（如汽车、道路、行人等）之间的对应关系。其余部分现在应该看起来很熟悉了。

dls = camvid.dataloaders(path/"images")
dls.show_batch()

/home/jhoward/mambaforge/lib/python3.9/site-packages/torch/_tensor.py:1142: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  ret = func(*args, **kwargs)

点

对于这个例子，我们将使用 BiWi kinect head pose dataset 的一个小子集。它包含人物照片，任务是预测他们的头部中心在哪里。我们已经将这个小数据集保存为一个字典，将文件名映射到中心坐标。

biwi_source = untar_data(URLs.BIWI_SAMPLE)
fn2ctr = load_pickle(biwi_source/'centers.pkl')

然后我们可以回顾我们惯常的问题：

我们的输入和目标的类型是什么？图像和点。
数据在哪里？在子文件夹中。
我们如何知道样本是属于训练集还是验证集？我们将进行随机分割。
我们如何知道图像的标签？通过使用 fn2ctr 字典。
是否需要对创建后的批量应用函数？是的，我们需要数据增强。

biwi = DataBlock(blocks=(ImageBlock, PointBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=lambda o:fn2ctr[o.name].flip(0),
                 batch_tfms=aug_transforms())

然后我们可以使用它来创建一个 DataLoaders。

dls = biwi.dataloaders(biwi_source)
dls.show_batch(max_n=9)

边界框

对于这个任务，我们将使用 COCO 数据集的一个小子集。它包含带有日常物体的图片，目标是通过在物体周围绘制一个框来预测物体的位置。

fastai 库带有一个名为 get_annotations 的函数，它将解释 train.json 的内容，并给我们一个从文件名到（边界框，标签）的字典。

coco_source = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco_source/'train.json')
img2bbox = dict(zip(images, lbl_bbox))

然后我们可以回顾我们惯常的问题：

我们的输入和目标的类型是什么？图像和边界框。
数据在哪里？在子文件夹中。
我们如何知道样本是属于训练集还是验证集？我们将进行随机分割。
我们如何知道图像的标签？通过使用 img2bbox 字典。
是否需要对给定样本应用函数？是的，我们需要将所有内容调整到给定的大小。
是否需要对创建后的批量应用函数？是的，我们需要数据增强。

coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]], 
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms(),
                 n_inp=1)

请注意，我们提供了三种类型，因为我们有两个目标：边界框和标签。这就是为什么我们在最后传递 n_inp=1，以告诉库输入在哪里结束以及目标在哪里开始。

这也是为什么我们向 get_y 传递一个列表的原因：由于我们有两个目标，我们必须告诉库如何标记每个目标（如果你不想对其中一个做任何事情，可以使用 noop）。

dls = coco.dataloaders(coco_source)
dls.show_batch(max_n=9)

文本

我们将展示两个示例：语言模型和文本分类。请注意，使用数据块 API，你可以将前面的多标签示例应用于输入为文本的问题。

from fastai.text.all import *

语言模型

我们将使用一个由 IMDb 电影评论组成的数据集。像往常一样，我们可以用一行代码通过 untar_data 下载它。

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head()

	label	text	is_valid
0	negative	Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!	False
1	positive	This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script. But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...	False
2	negative	Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy. Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...	False
3	positive	Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production. Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic...	False
4	negative	This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...	False

我们可以看到它由（相当长！）标记为积极或消极的评论组成。让我们回顾我们惯常的问题：

我们的输入和目标的类型是什么？文本，我们没有真正的目标，因为目标是从输入派生的。
数据在哪里？在一个数据框中。
我们如何知道样本是属于训练集还是验证集？我们有一个 is_valid 列。
我们如何获取输入？在 text 列中。

imdb_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True),
                    get_x=ColReader('text'),
                    splitter=ColSplitter())

由于这里没有目标，我们只需指定一个块。TextBlock 与其他 TransformBlock 有点特殊：为了能够在设置期间高效地标记所有文本，你需要使用类方法 from_folder 或 from_df。

注意：TestBlock 分词过程会将分词后的输入放入名为 text 的列中。get_x 的 ColReader 将始终引用 text，即使原始文本输入在数据框中位于另一个名称的列中。

然后，通过将数据框传递给 dataloaders 方法，我们可以将数据导入 DataLoaders 中。

dls = imdb_lm.dataloaders(df, bs=64, seq_len=72)
dls.show_batch(max_n=6)

	text	text_
0	xxbos xxmaj not sure if it was right or wrong , but i read thru the other comments before watching the xxunk have to say i disagree with most of the negative comments or problems people have had with it . \n\n xxmaj as a first time " lone xxmaj wolf " director / producer , i like to see things that i can xxunk to , not necessarily from the pro	xxmaj not sure if it was right or wrong , but i read thru the other comments before watching the xxunk have to say i disagree with most of the negative comments or problems people have had with it . \n\n xxmaj as a first time " lone xxmaj wolf " director / producer , i like to see things that i can xxunk to , not necessarily from the pro 's
1	and each and every actor . xxmaj it 's like they all think they 're the main part of the movie and scream " notice xxup me ! " over and over again . xxmaj the bad guy has his bad - guy music going on and says sinister bad - guy - like things , just in case you did n't quite catch on . xxmaj the good guy does brave	each and every actor . xxmaj it 's like they all think they 're the main part of the movie and scream " notice xxup me ! " over and over again . xxmaj the bad guy has his bad - guy music going on and says sinister bad - guy - like things , just in case you did n't quite catch on . xxmaj the good guy does brave and
2	innocently helps the xxmaj confederate hide . xxmaj later , when he returns to kill her father , the little girl 's xxunk is remembered . a sweet , small story from director xxup xxunk . xxmaj griffith . xxmaj location footage and humanity are xxunk displayed . \n\n xxrep 4 * xxmaj in the xxmaj border xxmaj states ( 6 / 13 / 10 ) xxup xxunk . xxmaj griffith ~	helps the xxmaj confederate hide . xxmaj later , when he returns to kill her father , the little girl 's xxunk is remembered . a sweet , small story from director xxup xxunk . xxmaj griffith . xxmaj location footage and humanity are xxunk displayed . \n\n xxrep 4 * xxmaj in the xxmaj border xxmaj states ( 6 / 13 / 10 ) xxup xxunk . xxmaj griffith ~ xxmaj
3	when they real winner should of been xxmaj xxunk xxmaj fiennes for " sunshine " . xxmaj if you have n't seen this movie yet , watch it and you 'll agree . " eyes xxmaj wide xxmaj shut " when released xxunk no nominations . xxmaj and as far as this year goes , well , the bad choices were all over the place ! xxmaj xxunk xxmaj xxunk gets no	they real winner should of been xxmaj xxunk xxmaj fiennes for " sunshine " . xxmaj if you have n't seen this movie yet , watch it and you 'll agree . " eyes xxmaj wide xxmaj shut " when released xxunk no nominations . xxmaj and as far as this year goes , well , the bad choices were all over the place ! xxmaj xxunk xxmaj xxunk gets no "
4	xxmaj in this case , however , when combined with the moody atmosphere , and the fact that the small town of xxmaj red xxmaj rock seems almost empty of normal daily life , the coincidences and unlikely timing suggest a story that , beyond " xxunk " , is … surreal . xxmaj it 's almost as if fate deliberately xxunk with improbable events so as to force xxmaj michael to	in this case , however , when combined with the moody atmosphere , and the fact that the small town of xxmaj red xxmaj rock seems almost empty of normal daily life , the coincidences and unlikely timing suggest a story that , beyond " xxunk " , is … surreal . xxmaj it 's almost as if fate deliberately xxunk with improbable events so as to force xxmaj michael to come
5	is not over the top and enough twists and turns to keep you interested until the end . \n\n xxmaj well directed , well acted and a good story . xxbos xxmaj not the worst movie xxmaj i 've seen but definitely not very good either . i myself am a paintball player , used to play airball a lot and going from woods to airball is quite a large change .	not over the top and enough twists and turns to keep you interested until the end . \n\n xxmaj well directed , well acted and a good story . xxbos xxmaj not the worst movie xxmaj i 've seen but definitely not very good either . i myself am a paintball player , used to play airball a lot and going from woods to airball is quite a large change . xxmaj

文本分类

对于文本分类，让我们回顾我们惯常的问题：

我们的输入和目标的类型是什么？文本和类别。
数据在哪里？在一个数据框中。
我们如何知道样本是属于训练集还是验证集？我们有一个 is_valid 列。
我们如何获取输入？在 text 列中。
我们如何获取目标？在 label 列中。

imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())

就像在前面的例子中一样，我们使用类方法来构建一个 TextBlock。我们可以向它传递语言模型的词汇表（这对于 ULMFit 方法非常有用）。我们还展示了 seq_len 参数（默认为 72），这只是因为你需要确保在这里以及在你的 text_classifier_learner 中使用相同的值。

警告

你需要确保在 TextBlock 和稍后将定义的 Learner 中使用相同的 seq_len。

dls = imdb_clas.dataloaders(df, bs=64)
dls.show_batch()

	text	category
0	xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is	negative
1	xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies	positive
2	xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after 20 - odd years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present ,	negative
3	xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' " la xxmaj xxunk , " based on a play by xxmaj arthur xxmaj xxunk , who is given an " inspired by " credit . xxmaj it starts from one person , a prostitute , standing on a street corner in xxmaj brooklyn . xxmaj she is picked up by a home contractor , who has sex with her on the hood of a car , but ca n't come . xxmaj he refuses to pay her . xxmaj when he 's off xxunk , she	positive
4	xxbos i really wanted to love this show . i truly , honestly did . \n\n xxmaj for the first time , gay viewers get their own version of the " the xxmaj bachelor " . xxmaj with the help of his obligatory " hag " xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance of love with 15 suitors ( or " mates " as they are referred to in the show ) . xxmaj the only problem is half of them are straight and xxmaj james does n't know this . xxmaj if xxmaj james picks a gay one , they get a trip to xxmaj new xxmaj zealand , and xxmaj if he picks a straight one , straight guy gets $ 25 , xxrep 3 0 . xxmaj how can this not be fun	negative
5	xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics that are terribly dated today , the game xxunk you into the role of xxunk even * think * xxmaj i 'm going to attempt spelling his last name ! ) , an xxmaj american xxup xxunk . caught in an underground bunker . xxmaj you fight and search your way through xxunk in order to achieve different xxunk for the six xxunk , let 's face it , most of them are just an excuse to hand you a weapon	positive
6	xxbos xxmaj the xxmaj blob starts with one of the most bizarre theme songs ever , xxunk by an uncredited xxmaj burt xxmaj xxunk of all people ! xxmaj you really have to hear it to believe it , xxmaj the xxmaj blob may be worth watching just for this song alone & my user comment summary is just a little taste of the classy lyrics … xxmaj after this xxunk opening credits sequence xxmaj the xxmaj blob introduces us , the viewer that is , to xxmaj steve xxmaj xxunk ( steve mcqueen as xxmaj steven mcqueen ) & his girlfriend xxmaj jane xxmaj martin ( xxunk xxmaj xxunk ) who are xxunk on their own somewhere & witness what looks like a meteorite falling to xxmaj earth in nearby woods . xxmaj an old man ( xxunk xxmaj xxunk as xxmaj xxunk xxmaj xxunk ) who lives in	negative
7	xxbos xxmaj the year 2005 saw no xxunk than 3 filmed productions of xxup h. xxup g. xxmaj wells ' great novel , " war of the xxmaj worlds " . xxmaj this is perhaps the least well - known and very probably the best of them . xxmaj no other version of xxunk has ever attempted not only to present the story very much as xxmaj wells wrote it , but also to create the atmosphere of the time in which it was supposed to take place : the last year of the 19th xxmaj century , 1900 … using xxmaj wells ' original setting , in and near xxmaj xxunk , xxmaj england . \n\n imdb seems xxunk to what they regard as " spoilers " . xxmaj that might apply with some films , where the ending might actually be a surprise , but with regard to	positive
8	xxbos xxmaj well , what can i say . \n\n " what the xxmaj xxunk do we xxmaj know " has achieved the nearly impossible - leaving behind such masterpieces of the genre as " the xxmaj xxunk " , " the xxmaj xxunk xxmaj master " , " xxunk " , and so fourth , it will go down in history as the single worst movie i have ever seen in its xxunk . xxmaj and that , ladies and gentlemen , is impressive indeed , for i have seen many a bad movie . \n\n xxmaj this masterpiece of modern cinema consists of two xxunk parts , xxunk between a silly and contrived plot about an extremely annoying photographer , abandoned by her husband and forced to take anti - xxunk to survive , and a bunch of talking heads going on about how quantum physics supposedly xxunk	negative

表格数据

表格数据并没有真正使用数据块 API，因为它依赖于另一个 API，即 TabularPandas，用于高效的预处理和批量处理（在不久的将来会添加一些与数据块 API 配合良好的效率较低的 API）。你仍然可以使用不同的块来处理目标。

from fastai.tabular.core import *

作为我们的示例，我们将查看 adult 数据集的一个子集，该数据集包含一些人口普查数据，任务是预测某人的收入是否超过 5 万美元。

adult_source = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult_source/'adult.csv')
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

在表格数据问题中，我们需要将列分为表示连续变量（如 age）的列和表示分类变量（如 education）的列。

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

fastai 中的标准预处理，使用这些预处理器：

procs = [Categorify, FillMissing, Normalize]

Categorify 将分类列转换为索引，FillMissing 将填充连续列中的缺失值（如果存在），并添加一个 na 分类列（如果需要）。Normalize 将对连续列进行归一化（减去均值并除以标准差）。

我们仍然可以使用任何分割器来创建我们想要的分割。

splits = RandomSplitter()(range_of(df))

然后所有内容都放入一个 TabularPandas 对象中。

to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits, y_block=CategoryBlock)

我们使用 y_block=CategoryBlock 只是为了向你展示如何为目标自定义块，但它通常会从数据中推断出来，所以通常不需要传递它。

dls = to.dataloaders()
dls.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Self-emp-not-inc	Prof-school	Never-married	Prof-specialty	Not-in-family	White	False	34.0	204374.999924	15.0	>=50k
1	Private	Some-college	Never-married	Adm-clerical	Not-in-family	White	False	62.0	141307.999756	10.0	<50k
2	Private	Assoc-acdm	Never-married	Other-service	Not-in-family	White	False	23.0	152188.999004	12.0	<50k
3	Private	HS-grad	Divorced	Craft-repair	Unmarried	White	False	38.0	27407.999090	9.0	<50k
4	Private	Bachelors	Never-married	Prof-specialty	Not-in-family	White	False	32.0	340917.004812	13.0	>=50k
5	Private	Bachelors	Never-married	Prof-specialty	Not-in-family	White	False	22.0	153515.999598	13.0	<50k
6	Self-emp-not-inc	Doctorate	Never-married	Prof-specialty	Not-in-family	White	False	46.0	165754.000335	16.0	<50k
7	Private	Masters	Married-civ-spouse	Prof-specialty	Husband	White	False	33.0	202050.999896	14.0	<50k
8	Private	Assoc-acdm	Divorced	Sales	Unmarried	White	False	40.0	197919.000079	12.0	<50k
9	?	Some-college	Never-married	?	Own-child	White	False	18.0	264924.000434	10.0	<50k

从头构建 DataBlock

图像分类

MNIST（单标签）

Pets（单标签）

Pascal（多标签）

图像定位

分割

点

边界框

文本

语言模型

文本分类

表格数据

从头构建 `DataBlock`