Data core

Core functionality for gathering data

from nbdev.cli import *

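The classes here provide functionality for applying a list of transforms to a set of items (TfmdLists, Datasets) or a DataLoader (TfmdDL), as well as the base class used to gather the data for model training: DataLoaders.

show_batch is a type-dispatched function that is responsible for showing decoded samples. x and y are the input and the target in the batch to be shown, and are passed along to dispatch on their types. There is a different implementation of show_batch if x is a TensorImage or a TensorText, for instance (see vision.core or text.data for more details). ctxs can be passed, but the function is responsible for creating them if necessary. kwargs depend on the specific implementation.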
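The dispatch idea can be illustrated without fastai. The sketch below is a simplified stand-in, not the library's implementation: it dispatches only on the type of x through the standard library's functools.singledispatch, whereas the text above describes dispatch on both x and y. The names FakeImage, FakeText and show_batch_sketch are hypothetical placeholders.

```python
from functools import singledispatch

# Hypothetical stand-ins for TensorImage / TensorText
class FakeImage(list): pass
class FakeText(str): pass

@singledispatch
def show_batch_sketch(x, samples, max_n=9):
    # Fallback used when no implementation is registered for type(x)
    return [f"item: {s}" for s in samples[:max_n]]

@show_batch_sketch.register
def _(x: FakeImage, samples, max_n=9):
    # Implementation chosen when x is an image-like batch
    return [f"image: {s}" for s in samples[:max_n]]

@show_batch_sketch.register
def _(x: FakeText, samples, max_n=9):
    # Implementation chosen when x is a text-like batch
    return [f"text: {s}" for s in samples[:max_n]]

img_out = show_batch_sketch(FakeImage([0]), ["a", "b"])
txt_out = show_batch_sketch(FakeText("hi"), ["a", "b"], max_n=1)
default_out = show_batch_sketch(3, ["a"])
```

Each call picks its implementation from the type of the first argument, which is the same reason a vision batch and a text batch can be displayed differently by one function name.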

source

get_show_batch_func

 get_show_batch_func (x_typ=typing.Any, y_typ=typing.Any,
                      samples_typ=typing.Any)

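Helper function to manually get the show_batch function for given input types.

show_results is a type-dispatched function that is responsible for showing decoded samples and their corresponding outs. As in show_batch, x and y are the input and the target in the batch to be shown, and are passed along to dispatch on their types. ctxs can be passed, but the function is responsible for creating them if necessary. kwargs depend on the specific implementation.

source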

TfmdDL

 TfmdDL (dataset, bs:int=64, shuffle:bool=False, num_workers:int=None,
         verbose:bool=False, do_setup:bool=True, pin_memory=False,
         timeout=0, batch_size=None, drop_last=False, indexed=None,
         n=None, device=None, persistent_workers=False,
         pin_memory_device='', wif=None, before_iter=None,
         after_item=None, before_batch=None, after_batch=None,
         after_iter=None, create_batches=None, create_item=None,
         create_batch=None, retain=None, get_idxs=None, sample=None,
         shuffle_fn=None, do_batch=None)

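Transformed DataLoader

	Type	Default	Details
dataset			Map- or iterable-style dataset from which to load the data
bs	int	64	Size of batch
shuffle	bool	False	Whether to shuffle data
num_workers	int	None	Number of CPU cores to use in parallel (default: all available, up to 16)
verbose	bool	False	Whether to print verbose logs
do_setup	bool	True	Whether to run `setup()` for batch transform(s)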
pin_memory	bool	False
timeout	int	0
batch_size	NoneType	None
drop_last	bool	False
indexed	NoneType	None
n	NoneType	None
device	NoneType	None
persistent_workers	bool	False
pin_memory_device	str
wif	NoneType	None
before_iter	NoneType	None
after_item	NoneType	None
before_batch	NoneType	None
after_batch	NoneType	None
after_iter	NoneType	None
create_batches	NoneType	None
create_item	NoneType	None
create_batch	NoneType	None
retain	NoneType	None
get_idxs	NoneType	None
sample	NoneType	None
shuffle_fn	NoneType	None
do_batch	NoneType	None

A TfmdDL is a DataLoader that creates a Pipeline from a list of Transforms for each of the callbacks after_item, before_batch and after_batch. As a result, it can decode or show a processed batch.
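To make the order of operations concrete, here is a minimal pure-Python sketch of a loader with an item-level hook and a batch-level hook. It is an illustration under simplifying assumptions, not fastai's TfmdDL: MiniLoader and its undo_batch parameter are hypothetical, batches are plain lists, and there is no before_batch step.

```python
class MiniLoader:
    "Toy loader: applies `after_item` per item, collates, then applies `after_batch`."
    def __init__(self, items, bs, after_item=None, after_batch=None, undo_batch=None):
        self.items, self.bs = list(items), bs
        self.after_item = after_item or (lambda o: o)
        self.after_batch = after_batch or (lambda b: b)
        self.undo_batch = undo_batch or (lambda b: b)  # inverse of after_batch

    def __len__(self):
        # Number of batches, counting a final partial batch
        return (len(self.items) - 1) // self.bs + 1

    def __iter__(self):
        for i in range(0, len(self.items), self.bs):
            batch = [self.after_item(o) for o in self.items[i:i + self.bs]]
            yield self.after_batch(batch)

    def one_batch(self): return next(iter(self))
    def decode(self, b): return self.undo_batch(b)

neg = lambda b: [-o for o in b]
dl = MiniLoader(range(50), bs=4, after_batch=neg, undo_batch=neg)
b = dl.one_batch()
```

Because every batch-level step has a known inverse, decoding a batch recovers values that can be displayed, which is the property the paragraph above relies on.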

class _Category(int, ShowTitle): pass

#Test retain type
class NegTfm(Transform):
    def encodes(self, x): return torch.neg(x)
    def decodes(self, x): return torch.neg(x)
    
tdl = TfmdDL([(TensorImage([1]),)] * 4, after_batch=NegTfm(), bs=4, num_workers=4)
b = tdl.one_batch()
test_eq(type(b[0]), TensorImage)
b = (tensor([1.,1.,1.,1.]),)
test_eq(type(tdl.decode_batch(b)[0][0]), TensorImage)

class A(Transform): 
    def encodes(self, x): return x 
    def decodes(self, x): return TitledInt(x) 

@Transform
def f(x)->None: return fastuple((x,x))

start = torch.arange(50)
test_eq_type(f(2), fastuple((2,2)))

a = A()
tdl = TfmdDL(start, after_item=lambda x: (a(x), f(x)), bs=4)
x,y = tdl.one_batch()
test_eq(type(y), fastuple)

s = tdl.decode_batch((x,y))
test_eq(type(s[0][1]), fastuple)

tdl = TfmdDL(torch.arange(0,50), after_item=A(), after_batch=NegTfm(), bs=4)
test_eq(tdl.dataset[0], start[0])
test_eq(len(tdl), (50-1)//4+1)
test_eq(tdl.bs, 4)
test_stdout(tdl.show_batch, '0\n1\n2\n3')
test_stdout(partial(tdl.show_batch, unique=True), '0\n0\n0\n0')

class B(Transform):
    parameters = 'a'
    def __init__(self): self.a = torch.tensor(0.)
    def encodes(self, x): return x
    
tdl = TfmdDL([(TensorImage([1]),)] * 4, after_batch=B(), bs=4)
test_eq(tdl.after_batch.fs[0].a.device, torch.device('cpu'))
tdl.to(default_device())
test_eq(tdl.after_batch.fs[0].a.device, default_device())

Methods

source

DataLoader.one_batch

 DataLoader.one_batch ()

Return one batch from DataLoader.

tfm = NegTfm()
tdl = TfmdDL(start, after_batch=tfm, bs=4)

b = tdl.one_batch()
test_eq(tensor([0,-1,-2,-3]), b)

source

TfmdDL.decode

 TfmdDL.decode (b)

Decode b using tfms

	Details
b	Batch to decode

test_eq(tdl.decode(b), tensor(0,1,2,3))

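source

TfmdDL.decode_batch

 TfmdDL.decode_batch (b, max_n:int=9, full:bool=True)

Decode b entirely

	Type	Default	Details
b			Batch to decode
max_n	int	9	Maximum number of items to decode
full	bool	True	Whether to decode all transforms. If `False`, decode up to the point the item knows how to show itself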

test_eq(tdl.decode_batch(b), [0,1,2,3])

source

TfmdDL.show_batch

 TfmdDL.show_batch (b=None, max_n:int=9, ctxs=None, show:bool=True,
                    unique:bool=False, **kwargs)

Show b (defaults to one_batch), a list of lists of pipeline outputs (i.e. the output of a DataLoader).

	Type	Default	Details
b	NoneType	None	Batch to show
max_n	int	9	Maximum number of items to show
ctxs	NoneType	None	List of `ctx` objects to show data. Could be a matplotlib axis, a DataFrame, etc.
show	bool	True	Whether to display data
unique	bool	False	Whether to show only one
kwargs	VAR_KEYWORD

source

DataLoader.to

 DataLoader.to (device)

Put self and the state of its transforms on device

source

DataLoaders

 DataLoaders (*loaders, path:str|pathlib.Path='.', device=None)

Basic wrapper around several DataLoaders.

	Type	Default	Details
loaders	VAR_POSITIONAL		`DataLoader` objects to wrap
path	str \| pathlib.Path	.	Path to store export objects
device	NoneType	None	Device to put `DataLoaders` on

dls = DataLoaders(tdl,tdl)
x = dls.train.one_batch()
x2 = first(tdl)
test_eq(x,x2)
x2 = dls.one_batch()
test_eq(x,x2)

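Multiple transforms can be added to multiple dataloaders using DataLoaders.add_tfms. You can specify the dataloaders by a list of names, dls.add_tfms(...,'valid',...), or by index, dls.add_tfms(...,1,...); by default, transforms are added to all dataloaders. event is a required argument that determines when the transform will be run; for more information on events, refer to TfmdDL. tfms is a list of Transform, and is also a required argument.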
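The selection rule (names, indices, or everything by default) can be sketched in plain Python. This is an illustrative sketch only: select_loaders and add_tfms_sketch are hypothetical helpers, and each toy loader is just a dict mapping an event name to a list of transforms.

```python
def select_loaders(loaders, names, selectors=None):
    "Resolve `selectors` (names or integer indices) to loaders; `None` selects all."
    if selectors is None:
        return list(loaders)
    picked = []
    for s in selectors:
        idx = names.index(s) if isinstance(s, str) else s
        picked.append(loaders[idx])
    return picked

def add_tfms_sketch(loaders, names, tfms, event, selectors=None):
    # Append the transforms to the chosen event of every selected loader
    for dl in select_loaders(loaders, names, selectors):
        dl.setdefault(event, []).extend(tfms)

train, valid = {}, {}
double = lambda x: x * 2
names = ["train", "valid"]
add_tfms_sketch([train, valid], names, [double], "after_batch", ["valid"])  # by name
add_tfms_sketch([train, valid], names, [double], "after_batch", [1])        # by index
```

Both calls target the validation loader, so it ends up with two transforms on after_batch while the training loader stays untouched.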

class _TestTfm(Transform):
    def encodes(self, o):  return torch.ones_like(o)
    def decodes(self, o):  return o
tdl1,tdl2 = TfmdDL(start, bs=4),TfmdDL(start, bs=4)
dls2 = DataLoaders(tdl1,tdl2)
dls2.add_tfms([_TestTfm()],'after_batch',['valid'])
dls2.add_tfms([_TestTfm()],'after_batch',[1])
dls2.train.after_batch,dls2.valid.after_batch,

(Pipeline: , Pipeline: _TestTfm -> _TestTfm)

class _T(Transform):  
    def encodes(self, o):  return -o
class _T2(Transform): 
    def encodes(self, o):  return o/2

#test tfms are applied on both train and valid dl
dls_from_ds = DataLoaders.from_dsets([1,], [5,], bs=1, after_item=_T, after_batch=_T2)
b = first(dls_from_ds.train)
test_eq(b, tensor([-.5]))
b = first(dls_from_ds.valid)
test_eq(b, tensor([-2.5]))

Methods

source

DataLoaders.__getitem__

 DataLoaders.__getitem__ (i)

Retrieve the DataLoader at index i (0 is training, 1 is validation).

x2

tensor([ 0, -1, -2, -3])

x2 = dls[0].one_batch()
test_eq(x,x2)

DataLoaders.train

 DataLoaders.train (x)

Training DataLoader (the loader at index 0).

DataLoaders.valid

 DataLoaders.valid (x)

Validation DataLoader (the loader at index 1).

DataLoaders.train_ds

 DataLoaders.train_ds (x)

Training Dataset (the dataset of the loader at index 0).

DataLoaders.valid_ds

 DataLoaders.valid_ds (x)

Validation Dataset (the dataset of the loader at index 1).
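One way such index-based accessors could be built is with properties whose getter is a partially applied index lookup. The sketch below is an assumption about the general pattern, not a copy of the library's code; LoadersSketch and _get are hypothetical names.

```python
from functools import partial

class LoadersSketch:
    "Toy container exposing indexed loaders through named properties."
    def __init__(self, *loaders): self.loaders = loaders
    def __getitem__(self, i): return self.loaders[i]

def _get(i, self):
    # Index lookup with the index as first argument so it can be pre-filled
    return self[i]

# One property per index: 0 -> train, 1 -> valid
LoadersSketch.train = property(partial(_get, 0))
LoadersSketch.valid = property(partial(_get, 1))

dls_sketch = LoadersSketch("train_dl", "valid_dl")
```

With this pattern, dls_sketch.train and dls_sketch[0] return the same object, so the named accessors are only a convenience over indexing.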

source

FilteredBase

 FilteredBase (*args, dl_type=None, **kwargs)

Base class for lists with subsets

source

FilteredBase.dataloaders

 FilteredBase.dataloaders (bs:int=64, shuffle_train:bool=None,
                           shuffle:bool=True, val_shuffle:bool=False,
                           n:int=None, path:str|Path='.',
                           dl_type:TfmdDL=None, dl_kwargs:list=None,
                           device:torch.device=None, drop_last:bool=None,
                           val_bs:int=None, num_workers:int=None,
                           verbose:bool=False, do_setup:bool=True,
                           pin_memory=False, timeout=0, batch_size=None,
                           indexed=None, persistent_workers=False,
                           pin_memory_device='', wif=None,
                           before_iter=None, after_item=None,
                           before_batch=None, after_batch=None,
                           after_iter=None, create_batches=None,
                           create_item=None, create_batch=None,
                           retain=None, get_idxs=None, sample=None,
                           shuffle_fn=None, do_batch=None)

	Type	Default	Details
bs	int	64	Batch size
shuffle_train	bool	None	(Deprecated, use `shuffle`) Shuffle training `DataLoader`
shuffle	bool	True	Shuffle training `DataLoader`
val_shuffle	bool	False	Shuffle validation `DataLoader`
n	int	None	Size of `Datasets` used to create `DataLoader`
path	str \| Path	.	Path to put in `DataLoaders`
dl_type	TfmdDL	None	Type of `DataLoader`
dl_kwargs	list	None	List of kwargs to pass to individual `DataLoader`s
device	torch.device	None	Device to put `DataLoaders` on
drop_last	bool	None	Drop last incomplete batch, defaults to `shuffle`
val_bs	int	None	Validation batch size, defaults to `bs`
num_workers	int	None	Number of CPU cores to use in parallel (default: all available, up to 16)
verbose	bool	False	Whether to print verbose logs
do_setup	bool	True	Whether to run `setup()` for batch transform(s)
pin_memory	bool	False
timeout	int	0
batch_size	NoneType	None
indexed	NoneType	None
persistent_workers	bool	False
pin_memory_device	str
wif	NoneType	None
before_iter	NoneType	None
after_item	NoneType	None
before_batch	NoneType	None
after_batch	NoneType	None
after_iter	NoneType	None
create_batches	NoneType	None
create_item	NoneType	None
create_batch	NoneType	None
retain	NoneType	None
get_idxs	NoneType	None
sample	NoneType	None
shuffle_fn	NoneType	None
do_batch	NoneType	None
Returns	DataLoaders

source

TfmdLists

 TfmdLists (items=None, *rest, use_list=False, match=None)

A Pipeline of tfms applied to a collection of items

	Type	Default	Details
items	list		Items to apply `Transform`s to
use_list	bool	None	Use `list` in `L`

source

decode_at

 decode_at (o, idx)

Decoded item at idx

Exported source

def decode_at(o, idx):
    "Decoded item at `idx`"
    return o.decode(o[idx])

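source

show_at

 show_at (o, idx, **kwargs)

Show item at idx

Exported source

def show_at(o, idx, **kwargs):
    "Show item at `idx`"
    return o.show(o[idx], **kwargs)

A TfmdLists combines a collection of objects with a Pipeline. tfms can either be a Pipeline or a list of transforms, in which case it will wrap them in a Pipeline. use_list is passed along to L with the items, and split_idx is passed to each transform of the Pipeline. do_setup indicates whether the Pipeline.setup method should be called during initialization.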

class _IntFloatTfm(Transform):
    def encodes(self, o):  return TitledInt(o)
    def decodes(self, o):  return TitledFloat(o)
int2f_tfm=_IntFloatTfm()

def _neg(o): return -o
neg_tfm = Transform(_neg, _neg)

items = L([1.,2.,3.]); tfms = [neg_tfm, int2f_tfm]
tl = TfmdLists(items, tfms=tfms)
test_eq_type(tl[0], TitledInt(-1))
test_eq_type(tl[1], TitledInt(-2))
test_eq_type(tl.decode(tl[2]), TitledFloat(3.))
test_stdout(lambda: show_at(tl, 2), '-3')
test_eq(tl.types, [float, float, TitledInt])
tl

TfmdLists: [1.0, 2.0, 3.0]
tfms - [_neg(enc:1,dec:1), _IntFloatTfm(enc:1,dec:1)]
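The behaviour tested above (encode on indexing, decode in reverse order) can be mimicked with a few lines of plain Python. This is a simplified sketch, not fastai's TfmdLists: MiniTfmdList is a hypothetical class, transforms are plain (encode, decode) pairs, and there is no setup or split handling.

```python
class MiniTfmdList:
    "Toy collection that lazily applies a list of (encode, decode) pairs on indexing."
    def __init__(self, items, tfms):
        self.items, self.tfms = list(items), list(tfms)

    def __len__(self): return len(self.items)

    def __getitem__(self, i):
        o = self.items[i]
        for enc, _ in self.tfms: o = enc(o)            # encode in order
        return o

    def decode(self, o):
        for _, dec in reversed(self.tfms): o = dec(o)  # decode in reverse order
        return o

neg = (lambda o: -o, lambda o: -o)   # negation is its own inverse
to_int = (int, float)                # encode to int, decode back to float
mini_tl = MiniTfmdList([1., 2., 3.], [neg, to_int])
first_item = mini_tl[0]
decoded = mini_tl.decode(mini_tl[2])
```

The raw items are never modified; transforms run each time an item is requested, which is why the printed representation still shows the original values.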

# add splits to TfmdLists
splits = [[0,2],[1]]
tl = TfmdLists(items, tfms=tfms, splits=splits)
test_eq(tl.n_subsets, 2)
test_eq(tl.train, tl.subset(0))
test_eq(tl.valid, tl.subset(1))
test_eq(tl.train.items, items[splits[0]])
test_eq(tl.valid.items, items[splits[1]])
test_eq(tl.train.tfms.split_idx, 0)
test_eq(tl.valid.tfms.split_idx, 1)
test_eq(tl.train.new_empty().split_idx, 0)
test_eq(tl.valid.new_empty().split_idx, 1)
test_eq_type(tl.splits, L(splits))
assert not tl.overlapping_splits()

df = pd.DataFrame(dict(a=[1,2,3],b=[2,3,4]))
tl = TfmdLists(df, lambda o: o.a+1, splits=[[0],[1,2]])
test_eq(tl[1,2], [3,4])
tr = tl.subset(0)
test_eq(tr[:], [2])
val = tl.subset(1)
test_eq(val[:], [3,4])

class _B(Transform):
    def __init__(self): self.m = 0
    def encodes(self, o): return o+self.m
    def decodes(self, o): return o-self.m
    def setups(self, items): 
        print(items)
        self.m = tensor(items).float().mean().item()

# test for setup, which updates `self.m`
tl = TfmdLists(items, _B())
test_eq(tl.m, 2)

TfmdLists: [1.0, 2.0, 3.0]
tfms - []

Here's how we can use TfmdLists.setup to implement a simple category list, getting labels from a mock file list:

class _Cat(Transform):
    order = 1
    def encodes(self, o):    return int(self.o2i[o])
    def decodes(self, o):    return TitledStr(self.vocab[o])
    def setups(self, items): self.vocab,self.o2i = uniqueify(L(items), sort=True, bidir=True)
tcat = _Cat()

def _lbl(o): return TitledStr(o.split('_')[0])

# Check that tfms are sorted by `order` & `_lbl` is called first
fns = ['dog_0.jpg','cat_0.jpg','cat_2.jpg','cat_1.jpg','dog_1.jpg']
tl = TfmdLists(fns, [tcat,_lbl])
exp_voc = ['cat','dog']
test_eq(tcat.vocab, exp_voc)
test_eq(tl.tfms.vocab, exp_voc)
test_eq(tl.vocab, exp_voc)
test_eq(tl, (1,0,0,0,1))
test_eq([tl.decode(o) for o in tl], ('dog','cat','cat','cat','dog'))

#Check only the training set is taken into account for setup
tl = TfmdLists(fns, [tcat,_lbl], splits=[[0,4], [1,2,3]])
test_eq(tcat.vocab, ['dog'])
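The effect of running setup on the training split only can be reproduced with a small stand-alone label encoder. The sketch below is illustrative and uses hypothetical names (CategorizeSketch, label); it is not the _Cat transform used above.

```python
class CategorizeSketch:
    "Toy label encoder whose vocabulary is built during `setup`."
    def setup(self, labels):
        self.vocab = sorted(set(labels))
        self.o2i = {v: i for i, v in enumerate(self.vocab)}
    def encode(self, o): return self.o2i[o]
    def decode(self, i): return self.vocab[i]

def label(fn):
    # 'dog_0.jpg' -> 'dog'
    return fn.split("_")[0]

fns = ["dog_0.jpg", "cat_0.jpg", "cat_2.jpg", "cat_1.jpg", "dog_1.jpg"]
splits = [[0, 4], [1, 2, 3]]

train_only = CategorizeSketch()
train_only.setup(label(fns[i]) for i in splits[0])   # only the training indices
full = CategorizeSketch()
full.setup(label(fn) for fn in fns)                  # every item
encoded = [full.encode(label(fn)) for fn in fns]
```

With this split the training items are all dogs, so the vocabulary built from the training set alone never sees the 'cat' label.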

tfm = NegTfm(split_idx=1)
tds = TfmdLists(start, A())
tdl = TfmdDL(tds, after_batch=tfm, bs=4)
x = tdl.one_batch()
test_eq(x, torch.arange(4))
tds.split_idx = 1
x = tdl.one_batch()
test_eq(x, -torch.arange(4))
tds.split_idx = 0
x = tdl.one_batch()
test_eq(x, torch.arange(4))

tds = TfmdLists(start, A())
tdl = TfmdDL(tds, after_batch=NegTfm(), bs=4)
test_eq(tdl.dataset[0], start[0])
test_eq(len(tdl), (len(tds)-1)//4+1)
test_eq(tdl.bs, 4)
test_stdout(tdl.show_batch, '0\n1\n2\n3')

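source

TfmdLists.subset

 TfmdLists.subset (i)

New TfmdLists with the same tfms that only includes items in the i-th split.

source

TfmdLists.infer_idx

 TfmdLists.infer_idx (x)

Find the index at which self.tfms can be applied to x, depending on the type of x

source

TfmdLists.infer

 TfmdLists.infer (x)

Apply self.tfms to x, starting from the right tfm depending on the type of x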

def mult(x): return x*2
mult.order = 2

fns = ['dog_0.jpg','cat_0.jpg','cat_2.jpg','cat_1.jpg','dog_1.jpg']
tl = TfmdLists(fns, [_lbl,_Cat(),mult])

test_eq(tl.infer_idx('dog_45.jpg'), 0)
test_eq(tl.infer('dog_45.jpg'), 2)

test_eq(tl.infer_idx(4), 2)
test_eq(tl.infer(4), 8)

test_fail(lambda: tl.infer_idx(2.0))
test_fail(lambda: tl.infer(2.0))
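A rough way to picture type-based entry into a pipeline is to tag each stage with the input type it expects and start from the first stage that matches. The sketch below assumes exactly that simplified rule; the stages list and both helper functions are hypothetical and do not reproduce how the library tracks types internally.

```python
def infer_idx(stages, x):
    "Index of the first stage whose expected input type matches `x`."
    for i, (in_type, _) in enumerate(stages):
        if isinstance(x, in_type): return i
    raise TypeError(f"No stage accepts {type(x).__name__}")

def infer(stages, x):
    "Apply the stages to `x`, starting from the stage found by `infer_idx`."
    for _, fn in stages[infer_idx(stages, x):]:
        x = fn(x)
    return x

vocab = ["cat", "dog"]
stages = [
    (str, lambda o: o.split("_")[0]),   # file name -> label
    (str, lambda o: vocab.index(o)),    # label -> integer category
    (int, lambda o: o * 2),             # category -> doubled value
]

from_name = infer(stages, "dog_45.jpg")   # runs all three stages
from_int = infer(stages, 4)               # skips straight to the last stage
```

A float matches no stage in this sketch, so both helpers raise for an input such as 2.0.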

source

Datasets

 Datasets (items:list=None, tfms:MutableSequence|Pipeline=None,
           tls:TfmdLists=None, n_inp:int=None, dl_type=None,
           use_list:bool=None, do_setup:bool=True, split_idx:int=None,
           train_setup:bool=True, splits:list=None, types=None,
           verbose:bool=False)

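A dataset that creates a tuple from each tfms

	Type	Default	Details
items	list	None	List of items to create `Datasets`
tfms	collections.abc.MutableSequence \| fasttransform.transform.Pipeline	None	List of `Transform`(s) or `Pipeline` to apply
tls	TfmdLists	None	If None, `self.tls` is generated from `items` and `tfms`
n_inp	int	None	Number of elements in `Datasets` tuple that should be considered part of the input
dl_type	NoneType	None	Default type of `DataLoader` used when function `FilteredBase.dataloaders` is called
use_list	bool	None	Use `list` in `L`
do_setup	bool	True	Call `setup()` for `Transform`
split_idx	int	None	Apply `Transform`(s) to training or validation set. `0` for training set and `1` for validation set
train_setup	bool	True	Apply `Transform`(s) only on training `DataLoader`
splits	list	None	Indices for training and validation sets
types	NoneType	None	Types of data in `items`
verbose	bool	False	Print verbose output

A Datasets creates a tuple from items (typically input and target) by applying to them each list of Transform (or Pipeline) in tfms. Note that if tfms contains only one list of tfms, the items given by Datasets will be tuples of one element.

n_inp is the number of elements in the tuples that should be considered part of the input. It defaults to 1 if tfms consists of one set of transforms, and to len(tfms)-1 otherwise. In most cases, the tuples produced by Datasets have 2 elements (input and target), but sometimes they have 3 (Siamese networks or tabular data), in which case we need to be able to determine where the inputs end and the targets begin.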
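The default rule for n_inp stated above is small enough to write out directly. The two helpers below are hypothetical illustrations of that rule and of how it splits a tuple into inputs and targets; they are not part of the library.

```python
def default_n_inp(n_tfms, n_inp=None):
    "Number of leading tuple elements treated as inputs."
    if n_inp is not None: return n_inp          # an explicit value always wins
    return 1 if n_tfms == 1 else n_tfms - 1

def split_inputs_targets(item, n_inp):
    "Split one dataset tuple into (inputs, targets)."
    return item[:n_inp], item[n_inp:]

# A Siamese-style item: two inputs followed by one target
xs, ys = split_inputs_targets(("img_a", "img_b", "same"), default_n_inp(3))
```

For the usual two-element case the rule gives one input and one target, and passing n_inp explicitly overrides the default.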

items = [1,2,3,4]
dsets = Datasets(items, [[neg_tfm,int2f_tfm], [add(1)]])
t = dsets[0]
test_eq(t, (-1,2))
test_eq(dsets[0,1,2], [(-1,2),(-2,3),(-3,4)])
test_eq(dsets.n_inp, 1)
dsets.decode(t)

(1.0, 2)

class Norm(Transform):
    def encodes(self, o): return (o-self.m)/self.s
    def decodes(self, o): return (o*self.s)+self.m
    def setups(self, items):
        its = tensor(items).float()
        self.m,self.s = its.mean(),its.std()

items = [1,2,3,4]
nrm = Norm()
dsets = Datasets(items, [[neg_tfm,int2f_tfm], [neg_tfm,nrm]])

x,y = zip(*dsets)
test_close(tensor(y).mean(), 0)
test_close(tensor(y).std(), 1)
test_eq(x, (-1,-2,-3,-4,))
test_eq(nrm.m, -2.5)
test_stdout(lambda:show_at(dsets, 1), '-2')

test_eq(dsets.m, nrm.m)
test_eq(dsets.norm.m, nrm.m)
test_eq(dsets.train.norm.m, nrm.m)

test_fns = ['dog_0.jpg','cat_0.jpg','cat_2.jpg','cat_1.jpg','kid_1.jpg']
tcat = _Cat()
dsets = Datasets(test_fns, [[tcat,_lbl]], splits=[[0,1,2], [3,4]])
test_eq(tcat.vocab, ['cat','dog'])
test_eq(dsets.train, [(1,),(0,),(0,)])
test_eq(dsets.valid[0], (0,))
test_stdout(lambda: show_at(dsets.train, 0), "dog")

inp = [0,1,2,3,4]
dsets = Datasets(inp, tfms=[None])

test_eq(*dsets[2], 2)          # Retrieve one item (subset 0 is the default)
test_eq(dsets[1,2], [(1,),(2,)])    # Retrieve two items by index
mask = [True,False,False,True,False]
test_eq(dsets[mask], [(0,),(3,)])   # Retrieve two items by mask

inp = pd.DataFrame(dict(a=[5,1,2,3,4]))
dsets = Datasets(inp, tfms=attrgetter('a')).subset(0)
test_eq(*dsets[2], 2)          # Retrieve one item (subset 0 is the default)
test_eq(dsets[1,2], [(1,),(2,)])    # Retrieve two items by index
mask = [True,False,False,True,False]
test_eq(dsets[mask], [(5,),(3,)])   # Retrieve two items by mask

#test n_inp
inp = [0,1,2,3,4]
dsets = Datasets(inp, tfms=[None])
test_eq(dsets.n_inp, 1)
dsets = Datasets(inp, tfms=[[None],[None],[None]])
test_eq(dsets.n_inp, 2)
dsets = Datasets(inp, tfms=[[None],[None],[None]], n_inp=1)
test_eq(dsets.n_inp, 1)

# splits can be indices
dsets = Datasets(range(5), tfms=[None], splits=[tensor([0,2]), [1,3,4]])

test_eq(dsets.subset(0), [(0,),(2,)])
test_eq(dsets.train, [(0,),(2,)])       # Subset 0 is aliased to `train`
test_eq(dsets.subset(1), [(1,),(3,),(4,)])
test_eq(dsets.valid, [(1,),(3,),(4,)])     # Subset 1 is aliased to `valid`
test_eq(*dsets.valid[2], 4)
#assert '[(1,),(3,),(4,)]' in str(dsets) and '[(0,),(2,)]' in str(dsets)
dsets

(#5) [(0,),(1,),(2,),(3,),(4,)]

# splits can be boolean masks (they don't have to cover all items, but must be disjoint)
splits = [[False,True,True,False,True], [True,False,False,False,False]]
dsets = Datasets(range(5), tfms=[None], splits=splits)

test_eq(dsets.train, [(1,),(2,),(4,)])
test_eq(dsets.valid, [(0,)])
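Boolean masks and index lists carry the same information, and the disjointness requirement is easy to check once masks are converted. The helpers below are a hypothetical sketch of that conversion and check, not library functions.

```python
def mask_to_idxs(mask):
    "Convert a boolean mask to the list of selected indices."
    return [i for i, keep in enumerate(mask) if keep]

def overlapping(splits):
    "True if any index appears in more than one split."
    seen = set()
    for s in splits:
        if seen & set(s): return True
        seen |= set(s)
    return False

masks = [[False, True, True, False, True], [True, False, False, False, False]]
idx_splits = [mask_to_idxs(m) for m in masks]
```

Index 3 belongs to neither split here, which is allowed: the splits must not overlap, but they need not cover every item.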

# apply transforms to all items
tfm = [[lambda x: x*2,lambda x: x+1]]
splits = [[1,2],[0,3,4]]
dsets = Datasets(range(5), tfm, splits=splits)
test_eq(dsets.train,[(3,),(5,)])
test_eq(dsets.valid,[(1,),(7,),(9,)])
test_eq(dsets.train[False,True], [(5,)])

# only transform subset 1
class _Tfm(Transform):
    split_idx=1
    def encodes(self, x): return x*2
    def decodes(self, x): return TitledStr(x//2)

dsets = Datasets(range(5), [_Tfm()], splits=[[1,2],[0,3,4]])
test_eq(dsets.train,[(1,),(2,)])
test_eq(dsets.valid,[(0,),(6,),(8,)])
test_eq(dsets.train[False,True], [(2,)])
dsets

(#5) [(0,),(1,),(2,),(3,),(4,)]

#A context manager to change the split_idx and apply the validation transform on the training set
ds = dsets.train
with ds.set_split_idx(1):
    test_eq(ds,[(2,),(4,)])
test_eq(dsets.train,[(1,),(2,)])
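The temporary switch of split_idx can be sketched with the standard library's contextlib. This is a toy model under stated assumptions: SplitAware is a hypothetical class holding a single transform that runs only when the dataset's split_idx equals the transform's own.

```python
from contextlib import contextmanager

class SplitAware:
    "Toy dataset whose transform runs only when `split_idx` matches."
    def __init__(self, items, split_idx, tfm, tfm_split_idx):
        self.items, self.split_idx = list(items), split_idx
        self.tfm, self.tfm_split_idx = tfm, tfm_split_idx

    def __getitem__(self, i):
        o = self.items[i]
        return self.tfm(o) if self.split_idx == self.tfm_split_idx else o

    @contextmanager
    def set_split_idx(self, i):
        old, self.split_idx = self.split_idx, i
        try: yield self
        finally: self.split_idx = old      # always restore the original value

train = SplitAware([1, 2], split_idx=0, tfm=lambda x: x * 2, tfm_split_idx=1)
before = [train[0], train[1]]
with train.set_split_idx(1):
    inside = [train[0], train[1]]          # validation transform now applies
after = [train[0], train[1]]
```

Restoring the value in a finally clause keeps the dataset consistent even if the body of the with block raises.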

dsets = Datasets(range(5), [_Tfm(),noop], splits=[[1,2],[0,3,4]])
test_eq(dsets.train,[(1,1),(2,2)])
test_eq(dsets.valid,[(0,0),(6,3),(8,4)])

start = torch.arange(0,50)
tds = Datasets(start, [A()])
tdl = TfmdDL(tds, after_item=NegTfm(), bs=4)
b = tdl.one_batch()
test_eq(tdl.decode_batch(b), ((0,),(1,),(2,),(3,)))
test_stdout(tdl.show_batch, "0\n1\n2\n3")

# only transform subset 1
class _Tfm(Transform):
    split_idx=1
    def encodes(self, x): return x*2

dsets = Datasets(range(8), [None], splits=[[1,2,5,7],[0,3,4,6]])
dls = dsets.dataloaders(bs=4, after_batch=_Tfm(), shuffle=False, device=torch.device('cpu'))
test_eq(dls.train, [(tensor([1,2,5, 7]),)])
test_eq(dls.valid, [(tensor([0,6,8,12]),)])
test_eq(dls.n_inp, 1)

Methods

items = [1,2,3,4]
dsets = Datasets(items, [[neg_tfm,int2f_tfm]])

source

Datasets.dataloaders

 Datasets.dataloaders (bs:int=64, shuffle_train:bool=None,
                       shuffle:bool=True, val_shuffle:bool=False,
                       n:int=None, path:str|Path='.', dl_type:TfmdDL=None,
                       dl_kwargs:list=None, device:torch.device=None,
                       drop_last:bool=None, val_bs:int=None,
                       num_workers:int=None, verbose:bool=False,
                       do_setup:bool=True, pin_memory=False, timeout=0,
                       batch_size=None, indexed=None,
                       persistent_workers=False, pin_memory_device='',
                       wif=None, before_iter=None, after_item=None,
                       before_batch=None, after_batch=None,
                       after_iter=None, create_batches=None,
                       create_item=None, create_batch=None, retain=None,
                       get_idxs=None, sample=None, shuffle_fn=None,
                       do_batch=None)

Get a DataLoaders.

	Type	Default	Details
bs	int	64	Batch size
shuffle_train	bool	None	(Deprecated, use `shuffle`) Shuffle training `DataLoader`
shuffle	bool	True	Shuffle training `DataLoader`
val_shuffle	bool	False	Shuffle validation `DataLoader`
n	int	None	Size of `Datasets` used to create `DataLoader`
path	str \| Path	.	Path to put in `DataLoaders`
dl_type	TfmdDL	None	Type of `DataLoader`
dl_kwargs	list	None	List of kwargs to pass to individual `DataLoader`s
device	torch.device	None	Device to put `DataLoaders` on
drop_last	bool	None	Drop last incomplete batch, defaults to `shuffle`
val_bs	int	None	Validation batch size, defaults to `bs`
num_workers	int	None	Number of CPU cores to use in parallel (default: all available, up to 16)
verbose	bool	False	Whether to print verbose logs
do_setup	bool	True	Whether to run `setup()` for batch transform(s)
pin_memory	bool	False
timeout	int	0
batch_size	NoneType	None
indexed	NoneType	None
persistent_workers	bool	False
pin_memory_device	str
wif	NoneType	None
before_iter	NoneType	None
after_item	NoneType	None
before_batch	NoneType	None
after_batch	NoneType	None
after_iter	NoneType	None
create_batches	NoneType	None
create_item	NoneType	None
create_batch	NoneType	None
retain	NoneType	None
get_idxs	NoneType	None
sample	NoneType	None
shuffle_fn	NoneType	None
do_batch	NoneType	None
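Returns	DataLoaders

Used to create dataloaders. You can prepend 'val_' to an argument, as in val_shuffle, to override behaviour for the validation set. dl_kwargs gives finer per-dataloader control if you need to work with more than one dataloader.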
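The interplay of shared arguments, 'val_' overrides and dl_kwargs can be sketched as a function that builds one kwargs dict per loader. This is an assumption-laden illustration, not the library's code: loader_kwargs is a hypothetical helper, and both the precedence order (dl_kwargs last) and drop_last=False for validation are choices made for this sketch.

```python
def loader_kwargs(bs=64, shuffle=True, drop_last=None, val_bs=None,
                  val_shuffle=False, dl_kwargs=None, **kwargs):
    "Build one kwargs dict for the training loader and one for the validation loader."
    if drop_last is None: drop_last = shuffle      # default described in the table
    dl_kwargs = dl_kwargs or [{}, {}]
    # Arguments prefixed with `val_` apply to the validation loader only
    val_over = {k[4:]: v for k, v in kwargs.items() if k.startswith("val_")}
    shared = {k: v for k, v in kwargs.items() if not k.startswith("val_")}
    train = {**shared, "bs": bs, "shuffle": shuffle, "drop_last": drop_last,
             **dl_kwargs[0]}
    valid = {**shared, "bs": bs if val_bs is None else val_bs, "shuffle": val_shuffle,
             "drop_last": False, **val_over, **dl_kwargs[1]}  # assumed validation defaults
    return train, valid

train_kw, valid_kw = loader_kwargs(bs=8, val_bs=16, num_workers=0, val_num_workers=2,
                                   dl_kwargs=[{"bs": 4}, {}])
```

In this example the training loader's batch size comes from its dl_kwargs entry, while the validation loader picks up val_bs and the val_num_workers override.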

source

Datasets.decode

 Datasets.decode (o, full=True)

Compose the decode of all tuple_tfms, then of all tfms, and apply them to o

test_eq(*dsets[0], -1)
test_eq(*dsets.decode((-1,)), 1)

source

Datasets.show

 Datasets.show (o, ctx=None, **kwargs)

Show item o in ctx

test_stdout(lambda:dsets.show(dsets[1]), '-2')

source

Datasets.new_empty

 Datasets.new_empty ()

Create a new empty version of self, keeping only the transforms

items = [1,2,3,4]
nrm = Norm()
dsets = Datasets(items, [[neg_tfm,int2f_tfm], [neg_tfm]])
empty = dsets.new_empty()
test_eq(empty.items, [])

Add test set for inference

source

test_set

 test_set (dsets:__main__.Datasets|__main__.TfmdLists, test_items,
           rm_tfms=None, with_labels:bool=False)

Create a test set from test_items using the validation transforms of dsets

	Type	Default	Details
dsets	__main__.Datasets \| __main__.TfmdLists		Map- or iterable-style dataset from which to load the data
test_items			Items in test dataset
rm_tfms	NoneType	None	Start index of `Transform`(s) from validation set in `dsets` to apply
with_labels	bool	False	Whether the test items contain labels

class _Tfm1(Transform):
    split_idx=0
    def encodes(self, x): return x*3

dsets = Datasets(range(8), [[_Tfm(),_Tfm1()]], splits=[[1,2,5,7],[0,3,4,6]])
test_eq(dsets.train, [(3,),(6,),(15,),(21,)])
test_eq(dsets.valid, [(0,),(6,),(8,),(12,)])

#Transforms of the validation set are applied
tst = test_set(dsets, [1,2,3])
test_eq(tst, [(2,),(4,),(6,)])
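The numbers in this example follow from one rule: a transform runs when its split_idx is unset or equal to the split being processed, and a test set is processed as split 1. The sketch below encodes that rule with a hypothetical apply_tfms helper and plain (split_idx, function) pairs.

```python
def apply_tfms(x, tfms, split_idx):
    "Apply each (tfm_split_idx, fn) pair whose split index is None or equals `split_idx`."
    for tfm_split_idx, fn in tfms:
        if tfm_split_idx is None or tfm_split_idx == split_idx:
            x = fn(x)
    return x

tfms = [(1, lambda x: x * 2),   # validation-only transform
        (0, lambda x: x * 3)]   # training-only transform

train_out = [apply_tfms(x, tfms, 0) for x in [1, 2, 5, 7]]
valid_out = [apply_tfms(x, tfms, 1) for x in [0, 3, 4, 6]]
test_out = [apply_tfms(x, tfms, 1) for x in [1, 2, 3]]   # test items reuse split_idx=1
```

Test items are therefore doubled like validation items and never tripled like training items.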

source

DataLoaders.test_dl

 DataLoaders.test_dl (test_items, rm_type_tfms=None,
                      with_labels:bool=False, bs:int=64,
                      shuffle:bool=False, num_workers:int=None,
                      verbose:bool=False, do_setup:bool=True,
                      pin_memory=False, timeout=0, batch_size=None,
                      drop_last=False, indexed=None, n=None, device=None,
                      persistent_workers=False, pin_memory_device='',
                      wif=None, before_iter=None, after_item=None,
                      before_batch=None, after_batch=None,
                      after_iter=None, create_batches=None,
                      create_item=None, create_batch=None, retain=None,
                      get_idxs=None, sample=None, shuffle_fn=None,
                      do_batch=None)

Create a test dataloader from test_items using the validation transforms of dls

	Type	Default	Details
test_items			Items in test dataset
rm_type_tfms	NoneType	None	Start index of `Transform`(s) from validation set in `dsets` to apply
with_labels	bool	False	Whether the test items contain labels
bs	int	64	Size of batch
shuffle	bool	False	Whether to shuffle data
num_workers	int	None	Number of CPU cores to use in parallel (default: all available, up to 16)
verbose	bool	False	Whether to print verbose logs
do_setup	bool	True	Whether to run `setup()` for batch transform(s)
pin_memory	bool	False
timeout	int	0
batch_size	NoneType	None
drop_last	bool	False
indexed	NoneType	None
n	NoneType	None
device	NoneType	None
persistent_workers	bool	False
pin_memory_device	str
wif	NoneType	None
before_iter	NoneType	None
after_item	NoneType	None
before_batch	NoneType	None
after_batch	NoneType	None
after_iter	NoneType	None
create_batches	NoneType	None
create_item	NoneType	None
create_batch	NoneType	None
retain	NoneType	None
get_idxs	NoneType	None
sample	NoneType	None
shuffle_fn	NoneType	None
do_batch	NoneType	None

dsets = Datasets(range(8), [[_Tfm(),_Tfm1()]], splits=[[1,2,5,7],[0,3,4,6]])
dls = dsets.dataloaders(bs=4, device=torch.device('cpu'))
tst_dl = dls.test_dl([2,3,4,5])
test_eq(tst_dl._n_inp, 1)
test_eq(list(tst_dl), [(tensor([ 4,  6,  8, 10]),)])
#Test you can change transforms
tst_dl = dls.test_dl([2,3,4,5], after_item=add1)
test_eq(list(tst_dl), [(tensor([ 5,  7,  9, 11]),)])