数据块教程

在所有应用中使用数据块

在本教程中,我们将学习如何在各种任务中使用数据块 API 以及如何调试数据块。数据块 API 的名称来源于它的设计方式:构建 DataLoaders 对象所需的所有信息(输入类型、目标类型、如何标记、如何分割等)都被封装在一个块中,并且你可以混合搭配这些块。

从头构建 DataBlock

本教程的其余部分将提供许多示例,但我们首先从头构建一个 DataBlock 来解决我们在视觉教程中看到的猫狗大战问题。首先,我们导入视觉所需的一切。

from fastai.data.all import *
from fastai.vision.all import *

第一步是下载并解压我们的数据(如果尚未完成)并获取其位置。

path = untar_data(URLs.PETS)

正如我们所见,所有文件名都在“images”文件夹中。get_image_files 函数有助于获取子文件夹中的所有图像。

fnames = get_image_files(path/"images")

让我们从一个空的 DataBlock 开始。

dblock = DataBlock()

本质上,DataBlock 只是一个关于如何组装数据的蓝图。除非你向它传递源数据,否则它不会做任何事情。你可以选择使用 DataBlock.datasetsDataBlock.dataloaders 方法将源数据转换为 DatasetsDataLoaders。由于我们尚未准备好用于批处理的数据,因此这里的 dataloaders 方法会失败,但我们可以看看它如何在 Datasets 中进行转换。在这里,我们传递数据源,即所有文件名。

dsets = dblock.datasets(fnames)
dsets.train[0]
(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/Birman_82.jpg'),
 Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/Birman_82.jpg'))

默认情况下,数据块 API 假定我们有一个输入和一个目标,这就是为什么我们看到文件名重复出现两次。

我们可以做的第一件事是使用一个 get_items 函数来实际组装数据块中的项目。

dblock = DataBlock(get_items = get_image_files)

区别在于,然后你将包含图像的文件夹作为源传递,而不是所有文件名。

dsets = dblock.datasets(path/"images")
dsets.train[0]
(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/english_cocker_spaniel_76.jpg'),
 Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/english_cocker_spaniel_76.jpg'))

我们的输入已准备好作为图像处理(因为图像可以从文件名构建),但我们的目标尚未准备好。由于我们面临猫狗大战问题,我们需要将文件名转换为“cat”或“dog”(或 TrueFalse)。让我们为此构建一个函数。

def label_func(fname):
    return "cat" if fname.name[0].isupper() else "dog"

然后,我们可以通过将其作为 get_y 传递来告诉我们的数据块使用它来标记我们的目标。

dblock = DataBlock(get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]
(Path('/home/jhoward/.fastai/data/oxford-iiit-pet/images/staffordshire_bull_terrier_77.jpg'),
 'dog')

现在我们的输入和目标已准备好,我们可以指定类型来告诉数据块 API 我们的输入是图像,目标是类别。在数据块 API 中,类型由块表示,这里我们使用 ImageBlockCategoryBlock

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func)

dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=361x500, TensorCategory(1))

我们可以看到 DataBlock 如何自动添加打开图像所需的转换,或者它如何将名称“dog”更改为索引 1(使用特殊张量类型 TensorCategory(1))。为此,它创建了一个从类别到索引的映射,称为“vocab”,我们可以通过这种方式访问。

dsets.vocab
['cat', 'dog']

请注意,你可以混合搭配任何块作为输入和目标,这就是为什么 API 被命名为数据块 API。你也可以拥有两个以上的块(如果你有多个输入和/或目标),你只需要将 n_inp 传递给 DataBlock 来告诉库有多少输入(其余的是目标),并向 get_x 和/或 get_y 传递函数列表(以解释如何处理每个项目以使其类型就绪)。请参阅下面的目标检测示例。

下一步是控制如何创建验证集。我们通过向 DataBlock 传递一个 splitter 来实现。例如,这里是如何进行随机分割。

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter())

dsets = dblock.datasets(path/"images")
dsets.train[0]
(PILImage mode=RGB size=320x480, TensorCategory(0))

最后一步是指定项目转换和批量转换(与我们在 ImageDataLoaders 工厂方法中做的方式相同)。

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

通过该调整大小操作,我们现在能够将项目批量处理,最终可以调用 dataloaders 将我们的 DataBlock 转换为 DataLoaders 对象。

dls = dblock.dataloaders(path/"images")
dls.show_batch()

我们通常一次性构建数据块的方式是回答一系列问题:

  • 你的输入/目标的类型是什么?这里是图像和类别。
  • 你的数据在哪里?这里在子文件夹的文件名中。
  • 是否需要对输入应用某些操作?这里不需要。
  • 是否需要对目标应用某些操作?这里是 label_func 函数。
  • 如何分割数据?这里是随机分割。
  • 是否需要对已形成的项目应用某些操作?这里是调整大小。
  • 是否需要对已形成的批量应用某些操作?这里不需要。

这给我们带来了这样的设计:

dblock = DataBlock(blocks    = (ImageBlock, CategoryBlock),
                   get_items = get_image_files,
                   get_y     = label_func,
                   splitter  = RandomSplitter(),
                   item_tfms = Resize(224))

对于两个回答为“否”的问题,如果答案不同,我们将传递的相应参数是 get_xbatch_tfms

图像分类

让我们从图像分类问题的示例开始。图像分类问题有两种:单标签问题(每张图像有一个给定标签)或多标签问题(每张图像可以有多个或没有标签)。我们将在这里介绍这两种类型。

from fastai.vision.all import *

MNIST(单标签)

MNIST 是一个包含 0 到 9 手写数字的数据集。通过回答以下问题,我们可以非常容易地将其加载到数据块 API 中:

  • 我们的输入和目标的类型是什么?黑白图像和标签。
  • 数据在哪里?在子文件夹中。
  • 我们如何知道样本是属于训练集还是验证集?通过查看祖父文件夹。
  • 我们如何知道图像的标签?通过查看父文件夹。

在 API 层面,这些答案翻译如下:

mnist = DataBlock(blocks=(ImageBlock(cls=PILImageBW), CategoryBlock), 
                  get_items=get_image_files, 
                  splitter=GrandparentSplitter(),
                  get_y=parent_label)

我们的类型变成了块:一个用于图像(使用黑白 PILImageBW 类),一个用于类别。通过 get_image_files 函数在子文件夹中搜索所有图像文件名。训练/验证集的分割通过 GrandparentSplitter 实现。获取目标(通常称为 y)的函数是 parent_label

要了解 fastai 库为读取、标记或分割提供哪些对象,请查阅 data.transforms 模块。

数据块本身只是一个蓝图。它不做任何事情,也不检查错误。你必须向它提供数据源才能实际收集数据。这通过 .dataloaders 方法完成。

dls = mnist.dataloaders(untar_data(URLs.MNIST_TINY))
dls.show_batch(max_n=9, figsize=(4,4))

如果上一步出了问题,或者如果你只是好奇幕后发生了什么,请使用 summary 方法。它会详细地一步一步执行,你会看到过程在哪里失败了。

mnist.summary(untar_data(URLs.MNIST_TINY))
Setting-up type transforms pipelines
Collecting items from /home/jhoward/.fastai/data/mnist_tiny
Found 2856 items
2 datasets of sizes 1418,1398
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}

Building one sample
  Pipeline: PILBase.create
    starting from
      /home/jhoward/.fastai/data/mnist_tiny/mnist_tiny/train/7/7420.png
    applying PILBase.create gives
      PILImageBW mode=L size=28x28
  Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
    starting from
      /home/jhoward/.fastai/data/mnist_tiny/mnist_tiny/train/7/7420.png
    applying parent_label gives
      7
    applying Categorize -- {'vocab': None, 'sort': True, 'add_na': False} gives
      TensorCategory(1)

Final sample: (PILImageBW mode=L size=28x28, TensorCategory(1))


Collecting items from /home/jhoward/.fastai/data/mnist_tiny
Found 2856 items
2 datasets of sizes 1418,1398
Setting up Pipeline: PILBase.create
Setting up Pipeline: parent_label -> Categorize -- {'vocab': None, 'sort': True, 'add_na': False}
Setting up after_item: Pipeline: ToTensor
Setting up before_batch: Pipeline: 
Setting up after_batch: Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}

Building one batch
Applying item_tfms to the first sample:
  Pipeline: ToTensor
    starting from
      (PILImageBW mode=L size=28x28, TensorCategory(1))
    applying ToTensor gives
      (TensorImageBW of size 1x28x28, TensorCategory(1))

Adding the next 3 samples

No before_batch transform to apply

Collating items in a batch

Applying batch_tfms to the batch built
  Pipeline: IntToFloatTensor -- {'div': 255.0, 'div_mask': 1}
    starting from
      (TensorImageBW of size 4x1x28x28, TensorCategory([1, 1, 1, 1], device='cuda:0'))
    applying IntToFloatTensor -- {'div': 255.0, 'div_mask': 1} gives
      (TensorImageBW of size 4x1x28x28, TensorCategory([1, 1, 1, 1], device='cuda:0'))

让我们看另一个例子!

Pets(单标签)

The Oxford IIIT Pets dataset 是一个包含狗和猫图片的 数据集,有 37 个不同的品种。与 MNIST 相比,一个微小但非常重要的区别是这里的图像尺寸不完全相同。在 MNIST 中,它们都是 28x28 像素,但在这里它们具有不同的纵横比或尺寸。因此,我们需要添加一些操作来使它们都具有相同的大小,以便能够将它们组合成一个批量。我们还将看到如何添加数据增强。

所以让我们回顾之前的问题并添加两个新问题:

  • 我们的输入和目标的类型是什么?图像和标签。
  • 数据在哪里?在子文件夹中。
  • 我们如何知道样本是属于训练集还是验证集?我们将进行随机分割。
  • 我们如何知道图像的标签?通过查看父文件夹。
  • 是否需要对给定样本应用函数?是的,我们需要将所有内容调整到给定的大小。
  • 是否需要对创建后的批量应用函数?是的,我们需要数据增强。
pets = DataBlock(blocks=(ImageBlock, CategoryBlock), 
                 get_items=get_image_files, 
                 splitter=RandomSplitter(),
                 get_y=Pipeline([attrgetter("name"), RegexLabeller(pat = r'^(.*)_\d+.jpg$')]),
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms())

就像 MNIST 一样,我们可以看到这些问题的答案如何直接转换为 API。我们的类型变成了块:一个用于图像,一个用于类别。通过 get_image_files 函数在子文件夹中搜索所有图像文件名。训练/验证集的分割通过 RandomSplitter 实现。获取目标(通常称为 y)的函数是两个转换的组合:我们获取 Path 文件名的名称属性,然后应用正则表达式获取类别。为了将这两个转换组合成一个,我们使用 Pipeline

最后,我们在项目级别应用调整大小,并在批量级别应用 aug_transforms()

dls = pets.dataloaders(untar_data(URLs.PETS)/"images")
dls.show_batch(max_n=9)

现在让我们看看如何将相同的 API 用于多标签问题。

Pascal(多标签)

The Pascal dataset 最初是一个目标检测数据集(我们需要预测图片中的物体在哪里)。但它包含许多带有各种物体的图片,因此它为多标签问题提供了一个很好的示例。让我们下载它并查看数据。

pascal_source = untar_data(URLs.PASCAL_2007)
df = pd.read_csv(pascal_source/"train.csv")
df.head()
fname labels is_valid
0 000005.jpg chair True
1 000007.jpg car True
2 000009.jpg horse person True
3 000012.jpg car False
4 000016.jpg bicycle True

所以看起来我们有一列是文件名,一列是标签(用空格分隔),还有一列告诉我们文件名是否应该放在验证集中。

有多种方法可以将这些数据放入 DataBlock 中,让我们逐一介绍,但首先,让我们回答我们惯常的问题:

  • 我们的输入和目标的类型是什么?图像和多个标签。
  • 数据在哪里?在一个数据框中。
  • 我们如何知道样本是属于训练集还是验证集?数据框中的一列。
  • 我们如何获取图像?通过查看 fname 列。
  • 我们如何知道图像的标签?通过查看 labels 列。
  • 是否需要对给定样本应用函数?是的,我们需要将所有内容调整到给定的大小。
  • 是否需要对创建后的批量应用函数?是的,我们需要数据增强。

请注意,与之前相比多了一个问题:这里我们无需使用 get_items 函数,因为我们已经将所有数据放在一个地方。但我们需要对原始数据框进行一些处理以获取输入,读取第一列并在文件名之前添加正确的文件夹。这就是我们作为 get_x 传递的内容。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=ColReader(0, pref=pascal_source/"train"),
                   get_y=ColReader(1, label_delim=' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

同样,我们可以看到问题的答案如何直接转换为 API。我们的类型变成了块:一个用于图像,一个用于多类别。分割通过 ColSplitter 实现(默认为名为 is_valid 的列)。获取输入(通常称为 x)的函数是针对第一列带有前缀的 ColReader,获取目标(通常称为 y)的函数是针对第二列带有空格分隔符的 ColReader。我们在项目级别应用调整大小,并在批量级别应用 aug_transforms()

dls = pascal.dataloaders(df)
dls.show_batch()

另一种方法是直接使用函数作为 get_xget_y 的值。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=lambda x:pascal_source/"train"/f'{x[0]}',
                   get_y=lambda x:x[1].split(' '),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()

或者,我们可以使用列名作为属性(因为数据框的行是 pandas series)。

pascal = DataBlock(blocks=(ImageBlock, MultiCategoryBlock),
                   splitter=ColSplitter(),
                   get_x=lambda o:f'{pascal_source}/train/'+o.fname,
                   get_y=lambda o:o.labels.split(),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())

dls = pascal.dataloaders(df)
dls.show_batch()

最有效的方法(为了避免迭代数据框的行,这可能需要很长时间)是使用 from_columns 方法。它将使用 get_items 将列转换为 numpy 数组。缺点是,由于提取相关列后丢失了数据框,我们不能再使用 ColSplitter。这里我们在手动从数据框中提取验证集索引后使用了 IndexSplitter

def _pascal_items(x): return (
    f'{pascal_source}/train/'+x.fname, x.labels.str.split())
valid_idx = df[df['is_valid']].index.values

pascal = DataBlock.from_columns(blocks=(ImageBlock, MultiCategoryBlock),
                   get_items=_pascal_items,
                   splitter=IndexSplitter(valid_idx),
                   item_tfms=Resize(224),
                   batch_tfms=aug_transforms())
dls = pascal.dataloaders(df)
dls.show_batch()

图像定位

有各种属于图像定位类别的问题:图像分割(预测图像每个像素类别的任务)、坐标预测(预测图像上的一个或多个关键点)和目标检测(在物体周围绘制框以进行检测)。

让我们看一个这些问题的示例,以及如何在每种情况下使用数据块 API。

分割

我们将使用 CamVid 数据集 的一个小子集作为示例。

path = untar_data(URLs.CAMVID_TINY)

让我们回顾我们惯常的问题:

  • 我们的输入和目标的类型是什么?图像和分割掩码。
  • 数据在哪里?在子文件夹中。
  • 我们如何知道样本是属于训练集还是验证集?我们将进行随机分割。
  • 我们如何知道图像的标签?通过查看“labels”文件夹中的相应文件。
  • 是否需要对创建后的批量应用函数?是的,我们需要数据增强。
camvid = DataBlock(blocks=(ImageBlock, MaskBlock(codes = np.loadtxt(path/'codes.txt', dtype=str))),
    get_items=get_image_files,
    splitter=RandomSplitter(),
    get_y=lambda o: path/'labels'/f'{o.stem}_P{o.suffix}',
    batch_tfms=aug_transforms())

The MaskBlock 是通过 codes 生成的,它提供了掩码像素值与其对应物体(如汽车、道路、行人等)之间的对应关系。其余部分现在应该看起来很熟悉了。

dls = camvid.dataloaders(path/"images")
dls.show_batch()
/home/jhoward/mambaforge/lib/python3.9/site-packages/torch/_tensor.py:1142: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  ret = func(*args, **kwargs)

对于这个例子,我们将使用 BiWi kinect head pose dataset 的一个小子集。它包含人物照片,任务是预测他们的头部中心在哪里。我们已经将这个小数据集保存为一个字典,将文件名映射到中心坐标。

biwi_source = untar_data(URLs.BIWI_SAMPLE)
fn2ctr = load_pickle(biwi_source/'centers.pkl')

然后我们可以回顾我们惯常的问题:

  • 我们的输入和目标的类型是什么?图像和点。
  • 数据在哪里?在子文件夹中。
  • 我们如何知道样本是属于训练集还是验证集?我们将进行随机分割。
  • 我们如何知道图像的标签?通过使用 fn2ctr 字典。
  • 是否需要对创建后的批量应用函数?是的,我们需要数据增强。
biwi = DataBlock(blocks=(ImageBlock, PointBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=lambda o:fn2ctr[o.name].flip(0),
                 batch_tfms=aug_transforms())

然后我们可以使用它来创建一个 DataLoaders

dls = biwi.dataloaders(biwi_source)
dls.show_batch(max_n=9)

边界框

对于这个任务,我们将使用 COCO 数据集 的一个小子集。它包含带有日常物体的图片,目标是通过在物体周围绘制一个框来预测物体的位置。

fastai 库带有一个名为 get_annotations 的函数,它将解释 train.json 的内容,并给我们一个从文件名到(边界框,标签)的字典。

coco_source = untar_data(URLs.COCO_TINY)
images, lbl_bbox = get_annotations(coco_source/'train.json')
img2bbox = dict(zip(images, lbl_bbox))

然后我们可以回顾我们惯常的问题:

  • 我们的输入和目标的类型是什么?图像和边界框。
  • 数据在哪里?在子文件夹中。
  • 我们如何知道样本是属于训练集还是验证集?我们将进行随机分割。
  • 我们如何知道图像的标签?通过使用 img2bbox 字典。
  • 是否需要对给定样本应用函数?是的,我们需要将所有内容调整到给定的大小。
  • 是否需要对创建后的批量应用函数?是的,我们需要数据增强。
coco = DataBlock(blocks=(ImageBlock, BBoxBlock, BBoxLblBlock),
                 get_items=get_image_files,
                 splitter=RandomSplitter(),
                 get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]], 
                 item_tfms=Resize(128),
                 batch_tfms=aug_transforms(),
                 n_inp=1)

请注意,我们提供了三种类型,因为我们有两个目标:边界框和标签。这就是为什么我们在最后传递 n_inp=1,以告诉库输入在哪里结束以及目标在哪里开始。

这也是为什么我们向 get_y 传递一个列表的原因:由于我们有两个目标,我们必须告诉库如何标记每个目标(如果你不想对其中一个做任何事情,可以使用 noop)。

dls = coco.dataloaders(coco_source)
dls.show_batch(max_n=9)

文本

我们将展示两个示例:语言模型和文本分类。请注意,使用数据块 API,你可以将前面的多标签示例应用于输入为文本的问题。

from fastai.text.all import *

语言模型

我们将使用一个由 IMDb 电影评论组成的数据集。像往常一样,我们可以用一行代码通过 untar_data 下载它。

path = untar_data(URLs.IMDB_SAMPLE)
df = pd.read_csv(path/'texts.csv')
df.head()
label text is_valid
0 negative Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff! False
1 positive This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.

But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...
False
2 negative Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.

Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...
False
3 positive Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.

Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie "Duty, Honor, Country" are not just mere words blathered from the lips of a high-brassed offic...
False
4 negative This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr... False

我们可以看到它由(相当长!)标记为积极或消极的评论组成。让我们回顾我们惯常的问题:

  • 我们的输入和目标的类型是什么?文本,我们没有真正的目标,因为目标是从输入派生的。
  • 数据在哪里?在一个数据框中。
  • 我们如何知道样本是属于训练集还是验证集?我们有一个 is_valid 列。
  • 我们如何获取输入?在 text 列中。
imdb_lm = DataBlock(blocks=TextBlock.from_df('text', is_lm=True),
                    get_x=ColReader('text'),
                    splitter=ColSplitter())

由于这里没有目标,我们只需指定一个块。TextBlock 与其他 TransformBlock 有点特殊:为了能够在设置期间高效地标记所有文本,你需要使用类方法 from_folderfrom_df

注意:TestBlock 分词过程会将分词后的输入放入名为 text 的列中。get_xColReader 将始终引用 text,即使原始文本输入在数据框中位于另一个名称的列中。

然后,通过将数据框传递给 dataloaders 方法,我们可以将数据导入 DataLoaders 中。

dls = imdb_lm.dataloaders(df, bs=64, seq_len=72)
dls.show_batch(max_n=6)
text text_
0 xxbos xxmaj not sure if it was right or wrong , but i read thru the other comments before watching the xxunk have to say i disagree with most of the negative comments or problems people have had with it . \n\n xxmaj as a first time " lone xxmaj wolf " director / producer , i like to see things that i can xxunk to , not necessarily from the pro xxmaj not sure if it was right or wrong , but i read thru the other comments before watching the xxunk have to say i disagree with most of the negative comments or problems people have had with it . \n\n xxmaj as a first time " lone xxmaj wolf " director / producer , i like to see things that i can xxunk to , not necessarily from the pro 's
1 and each and every actor . xxmaj it 's like they all think they 're the main part of the movie and scream " notice xxup me ! " over and over again . xxmaj the bad guy has his bad - guy music going on and says sinister bad - guy - like things , just in case you did n't quite catch on . xxmaj the good guy does brave each and every actor . xxmaj it 's like they all think they 're the main part of the movie and scream " notice xxup me ! " over and over again . xxmaj the bad guy has his bad - guy music going on and says sinister bad - guy - like things , just in case you did n't quite catch on . xxmaj the good guy does brave and
2 innocently helps the xxmaj confederate hide . xxmaj later , when he returns to kill her father , the little girl 's xxunk is remembered . a sweet , small story from director xxup xxunk . xxmaj griffith . xxmaj location footage and humanity are xxunk displayed . \n\n xxrep 4 * xxmaj in the xxmaj border xxmaj states ( 6 / 13 / 10 ) xxup xxunk . xxmaj griffith ~ helps the xxmaj confederate hide . xxmaj later , when he returns to kill her father , the little girl 's xxunk is remembered . a sweet , small story from director xxup xxunk . xxmaj griffith . xxmaj location footage and humanity are xxunk displayed . \n\n xxrep 4 * xxmaj in the xxmaj border xxmaj states ( 6 / 13 / 10 ) xxup xxunk . xxmaj griffith ~ xxmaj
3 when they real winner should of been xxmaj xxunk xxmaj fiennes for " sunshine " . xxmaj if you have n't seen this movie yet , watch it and you 'll agree . " eyes xxmaj wide xxmaj shut " when released xxunk no nominations . xxmaj and as far as this year goes , well , the bad choices were all over the place ! xxmaj xxunk xxmaj xxunk gets no they real winner should of been xxmaj xxunk xxmaj fiennes for " sunshine " . xxmaj if you have n't seen this movie yet , watch it and you 'll agree . " eyes xxmaj wide xxmaj shut " when released xxunk no nominations . xxmaj and as far as this year goes , well , the bad choices were all over the place ! xxmaj xxunk xxmaj xxunk gets no "
4 xxmaj in this case , however , when combined with the moody atmosphere , and the fact that the small town of xxmaj red xxmaj rock seems almost empty of normal daily life , the coincidences and unlikely timing suggest a story that , beyond " xxunk " , is … surreal . xxmaj it 's almost as if fate deliberately xxunk with improbable events so as to force xxmaj michael to in this case , however , when combined with the moody atmosphere , and the fact that the small town of xxmaj red xxmaj rock seems almost empty of normal daily life , the coincidences and unlikely timing suggest a story that , beyond " xxunk " , is … surreal . xxmaj it 's almost as if fate deliberately xxunk with improbable events so as to force xxmaj michael to come
5 is not over the top and enough twists and turns to keep you interested until the end . \n\n xxmaj well directed , well acted and a good story . xxbos xxmaj not the worst movie xxmaj i 've seen but definitely not very good either . i myself am a paintball player , used to play airball a lot and going from woods to airball is quite a large change . not over the top and enough twists and turns to keep you interested until the end . \n\n xxmaj well directed , well acted and a good story . xxbos xxmaj not the worst movie xxmaj i 've seen but definitely not very good either . i myself am a paintball player , used to play airball a lot and going from woods to airball is quite a large change . xxmaj

文本分类

对于文本分类,让我们回顾我们惯常的问题:

  • 我们的输入和目标的类型是什么?文本和类别。
  • 数据在哪里?在一个数据框中。
  • 我们如何知道样本是属于训练集还是验证集?我们有一个 is_valid 列。
  • 我们如何获取输入?在 text 列中。
  • 我们如何获取目标?在 label 列中。
imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())

就像在前面的例子中一样,我们使用类方法来构建一个 TextBlock。我们可以向它传递语言模型的词汇表(这对于 ULMFit 方法非常有用)。我们还展示了 seq_len 参数(默认为 72),这只是因为你需要确保在这里以及在你的 text_classifier_learner 中使用相同的值。

警告

你需要确保在 TextBlock 和稍后将定义的 Learner 中使用相同的 seq_len

dls = imdb_clas.dataloaders(df, bs=64)
dls.show_batch()
text category
0 xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n\n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , xxunk bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj victor xxmaj vargas became i was always aware that something did n't quite feel right . xxmaj victor xxmaj vargas suffers from a certain xxunk on the director 's part . xxmaj apparently , the director thought that the ethnic backdrop of a xxmaj latino family on the lower east side , and an xxunk storyline would make the film critic proof . xxmaj he was right , but it did n't fool me . xxmaj raising xxmaj victor xxmaj vargas is negative
1 xxbos xxup the xxup shop xxup around xxup the xxup corner is one of the xxunk and most feel - good romantic comedies ever made . xxmaj there 's just no getting around that , and it 's hard to actually put one 's feeling for this film into words . xxmaj it 's not one of those films that tries too hard , nor does it come up with the xxunk possible scenarios to get the two protagonists together in the end . xxmaj in fact , all its charm is xxunk , contained within the characters and the setting and the plot … which is highly believable to xxunk . xxmaj it 's easy to think that such a love story , as beautiful as any other ever told , * could * happen to you … a feeling you do n't often get from other romantic comedies positive
2 xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj xxunk , after xxunk ) , i can xxunk join both xxunk of " at xxmaj the xxmaj movies " in taking xxmaj steven xxmaj soderbergh to task . \n\n xxmaj it 's usually satisfying to watch a film director change his style / subject , but xxmaj soderbergh 's most recent stinker , xxmaj the xxmaj girlfriend xxmaj xxunk ) , was also missing a story , so narrative ( and editing ? ) seem to suddenly be xxmaj soderbergh 's main challenge . xxmaj strange , after 20 - odd years in the business . xxmaj he was probably never much good at narrative , just xxunk it well inside " edgy " projects . \n\n xxmaj none of this excuses him this present , negative
3 xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i xxunk a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj xxunk . \n\n xxmaj the format is the same as xxmaj max xxmaj xxunk ' " la xxmaj xxunk , " based on a play by xxmaj arthur xxmaj xxunk , who is given an " inspired by " credit . xxmaj it starts from one person , a prostitute , standing on a street corner in xxmaj brooklyn . xxmaj she is picked up by a home contractor , who has sex with her on the hood of a car , but ca n't come . xxmaj he refuses to pay her . xxmaj when he 's off xxunk , she positive
4 xxbos i really wanted to love this show . i truly , honestly did . \n\n xxmaj for the first time , gay viewers get their own version of the " the xxmaj bachelor " . xxmaj with the help of his obligatory " hag " xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance of love with 15 suitors ( or " mates " as they are referred to in the show ) . xxmaj the only problem is half of them are straight and xxmaj james does n't know this . xxmaj if xxmaj james picks a gay one , they get a trip to xxmaj new xxmaj zealand , and xxmaj if he picks a straight one , straight guy gets $ 25 , xxrep 3 0 . xxmaj how can this not be fun negative
5 xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first 3d game , or even the first xxunk - up . xxmaj it 's also one of the first xxunk games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - xxunk gaming experience in general . xxmaj with graphics that are terribly dated today , the game xxunk you into the role of xxunk even * think * xxmaj i 'm going to attempt spelling his last name ! ) , an xxmaj american xxup xxunk . caught in an underground bunker . xxmaj you fight and search your way through xxunk in order to achieve different xxunk for the six xxunk , let 's face it , most of them are just an excuse to hand you a weapon positive
6 xxbos xxmaj the xxmaj blob starts with one of the most bizarre theme songs ever , xxunk by an uncredited xxmaj burt xxmaj xxunk of all people ! xxmaj you really have to hear it to believe it , xxmaj the xxmaj blob may be worth watching just for this song alone & my user comment summary is just a little taste of the classy lyrics … xxmaj after this xxunk opening credits sequence xxmaj the xxmaj blob introduces us , the viewer that is , to xxmaj steve xxmaj xxunk ( steve mcqueen as xxmaj steven mcqueen ) & his girlfriend xxmaj jane xxmaj martin ( xxunk xxmaj xxunk ) who are xxunk on their own somewhere & witness what looks like a meteorite falling to xxmaj earth in nearby woods . xxmaj an old man ( xxunk xxmaj xxunk as xxmaj xxunk xxmaj xxunk ) who lives in negative
7 xxbos xxmaj the year 2005 saw no xxunk than 3 filmed productions of xxup h. xxup g. xxmaj wells ' great novel , " war of the xxmaj worlds " . xxmaj this is perhaps the least well - known and very probably the best of them . xxmaj no other version of xxunk has ever attempted not only to present the story very much as xxmaj wells wrote it , but also to create the atmosphere of the time in which it was supposed to take place : the last year of the 19th xxmaj century , 1900 … using xxmaj wells ' original setting , in and near xxmaj xxunk , xxmaj england . \n\n imdb seems xxunk to what they regard as " spoilers " . xxmaj that might apply with some films , where the ending might actually be a surprise , but with regard to positive
8 xxbos xxmaj well , what can i say . \n\n " what the xxmaj xxunk do we xxmaj know " has achieved the nearly impossible - leaving behind such masterpieces of the genre as " the xxmaj xxunk " , " the xxmaj xxunk xxmaj master " , " xxunk " , and so fourth , it will go down in history as the single worst movie i have ever seen in its xxunk . xxmaj and that , ladies and gentlemen , is impressive indeed , for i have seen many a bad movie . \n\n xxmaj this masterpiece of modern cinema consists of two xxunk parts , xxunk between a silly and contrived plot about an extremely annoying photographer , abandoned by her husband and forced to take anti - xxunk to survive , and a bunch of talking heads going on about how quantum physics supposedly xxunk negative

表格数据

表格数据并没有真正使用数据块 API,因为它依赖于另一个 API,即 TabularPandas,用于高效的预处理和批量处理(在不久的将来会添加一些与数据块 API 配合良好的效率较低的 API)。你仍然可以使用不同的块来处理目标。

from fastai.tabular.core import *

作为我们的示例,我们将查看 adult 数据集 的一个子集,该数据集包含一些人口普查数据,任务是预测某人的收入是否超过 5 万美元。

adult_source = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(adult_source/'adult.csv')
df.head()
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 49 Private 101320 Assoc-acdm 12.0 Married-civ-spouse NaN Wife White Female 0 1902 40 United-States >=50k
1 44 Private 236746 Masters 14.0 Divorced Exec-managerial Not-in-family White Male 10520 0 45 United-States >=50k
2 38 Private 96185 HS-grad NaN Divorced NaN Unmarried Black Female 0 0 32 United-States <50k
3 38 Self-emp-inc 112847 Prof-school 15.0 Married-civ-spouse Prof-specialty Husband Asian-Pac-Islander Male 0 0 40 United-States >=50k
4 42 Self-emp-not-inc 82297 7th-8th NaN Married-civ-spouse Other-service Wife Black Female 0 0 50 United-States <50k

在表格数据问题中,我们需要将列分为表示连续变量(如 age)的列和表示分类变量(如 education)的列。

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']

fastai 中的标准预处理,使用这些预处理器:

procs = [Categorify, FillMissing, Normalize]

Categorify 将分类列转换为索引,FillMissing 将填充连续列中的缺失值(如果存在),并添加一个 na 分类列(如果需要)。Normalize 将对连续列进行归一化(减去均值并除以标准差)。

我们仍然可以使用任何分割器来创建我们想要的分割。

splits = RandomSplitter()(range_of(df))

然后所有内容都放入一个 TabularPandas 对象中。

to = TabularPandas(df, procs, cat_names, cont_names, y_names="salary", splits=splits, y_block=CategoryBlock)

我们使用 y_block=CategoryBlock 只是为了向你展示如何为目标自定义块,但它通常会从数据中推断出来,所以通常不需要传递它。

dls = to.dataloaders()
dls.show_batch()
workclass education marital-status occupation relationship race education-num_na age fnlwgt education-num salary
0 Self-emp-not-inc Prof-school Never-married Prof-specialty Not-in-family White False 34.0 204374.999924 15.0 >=50k
1 Private Some-college Never-married Adm-clerical Not-in-family White False 62.0 141307.999756 10.0 <50k
2 Private Assoc-acdm Never-married Other-service Not-in-family White False 23.0 152188.999004 12.0 <50k
3 Private HS-grad Divorced Craft-repair Unmarried White False 38.0 27407.999090 9.0 <50k
4 Private Bachelors Never-married Prof-specialty Not-in-family White False 32.0 340917.004812 13.0 >=50k
5 Private Bachelors Never-married Prof-specialty Not-in-family White False 22.0 153515.999598 13.0 <50k
6 Self-emp-not-inc Doctorate Never-married Prof-specialty Not-in-family White False 46.0 165754.000335 16.0 <50k
7 Private Masters Married-civ-spouse Prof-specialty Husband White False 33.0 202050.999896 14.0 <50k
8 Private Assoc-acdm Divorced Sales Unmarried White False 40.0 197919.000079 12.0 <50k
9 ? Some-college Never-married ? Own-child White False 18.0 264924.000434 10.0 <50k