Tabular core

Basic functions to preprocess tabular data before assembling it in a DataLoaders.

Initial preprocessing

make_date

 make_date (df, date_field)

Make sure df[date_field] is of the right date type.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
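
As a rough illustration of what this conversion amounts to, the sketch below does the same thing with plain pandas. It assumes the column only needs pd.to_datetime when it is not already a datetime type; ensure_datetime is a hypothetical helper, not fastai's implementation.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

def ensure_datetime(df, date_field):
    "Hypothetical stand-in for make_date: convert the column in place if needed."
    if not is_datetime64_any_dtype(df[date_field]):
        df[date_field] = pd.to_datetime(df[date_field])

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
ensure_datetime(df, 'date')
print(df['date'].dtype)
```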

add_datepart

 add_datepart (df, field_name, prefix=None, drop=True, time=False)

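Helper function that adds columns relevant to a date in the column field_name of df.

For example, if we have a series of dates we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc., as shown below: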

df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()

	Year	Month	Week	Day	Dayofweek	Dayofyear	Is_month_end	Is_month_start	Is_quarter_end	Is_quarter_start	Is_year_end	Is_year_start	Elapsed
0	2019.0	12.0	49.0	4.0	2.0	338.0	False	False	False	False	False	False	1.575418e+09
1	NaN	NaN	NaN	NaN	NaN	NaN	False	False	False	False	False	False	NaN
2	2019.0	11.0	46.0	15.0	4.0	319.0	False	False	False	False	False	False	1.573776e+09
3	2019.0	10.0	43.0	24.0	3.0	297.0	False	False	False	False	False	False	1.571875e+09
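
Several of these columns can be reproduced with the pandas .dt accessor. The sketch below is only an approximation of the idea, not the library's code: it builds a handful of the features by hand and computes Elapsed as seconds since the Unix epoch, which is an assumption consistent with the values in the table above.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-12-04', None, '2019-11-15', '2019-10-24']))

parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day': dates.dt.day,
    'Dayofweek': dates.dt.dayofweek,          # Monday=0 ... Sunday=6
    'Dayofyear': dates.dt.dayofyear,
    'Is_month_start': dates.dt.is_month_start,
    # Seconds since the Unix epoch; NaN where the date is missing
    'Elapsed': (dates - pd.Timestamp('1970-01-01')).dt.total_seconds(),
})
print(parts)
```

Missing dates propagate as NaN in the numeric columns, which is why the second row of the table above is mostly NaN.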

add_elapsed_times

 add_elapsed_times (df, field_names, date_field, base_field)

Add to df, for each event in field_names, the elapsed time according to date_field, grouped by base_field.

df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()

	date	event	base	Afterevent	event_bw	event_fw
0	2019-12-04	False	1	5	1.0	0.0
1	2019-11-29	True	1	0	1.0	1.0
2	2019-11-15	False	2	22	1.0	0.0
3	2019-10-24	True	2	0	1.0	1.0
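
To make the Afterevent column concrete, here is a minimal pandas sketch that computes the number of days since the most recent event within each base group. It covers only that one column and is an illustration of the idea under that assumption; days_since_event is a hypothetical column name.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']),
                   'event': [False, True, False, True], 'base': [1, 1, 2, 2]})

# Walk each group in chronological order
ordered = df.sort_values(['base', 'date'])
# Keep the date only on event rows, then carry it forward within each group
last_event = ordered['date'].where(ordered['event']).groupby(ordered['base']).ffill()
# Days between each row and the most recent event at or before it (aligned on the index)
df['days_since_event'] = (ordered['date'] - last_event).dt.days
print(df)
```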

cont_cat_split

 cont_cat_split (df, max_card=20, dep_var=None)

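Helper function that returns the column names of the continuous and categorical variables in a given df.

This function decides whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), the column is added to cont_names; otherwise it is added to cat_names. As the second example below shows, category, datetime, and boolean columns end up in cat_names even with max_card=0. Examples: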

# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)

cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']

# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)

cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
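
The rule can be sketched in a few lines of pandas. This is a hypothetical re-implementation for illustration only; it assumes, as the max_card=0 example suggests, that the cardinality test applies to integer columns while float columns are always treated as continuous.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def split_by_cardinality(df, max_card=20, dep_var=None):
    "Hypothetical sketch of the continuous/categorical split rule."
    skip = [dep_var] if isinstance(dep_var, str) else list(dep_var or [])
    cont_names, cat_names = [], []
    for name in df.columns:
        if name in skip: continue
        col = df[name]
        high_card_int = is_integer_dtype(col) and col.nunique() > max_card
        if high_card_int or is_float_dtype(col): cont_names.append(name)
        else: cat_names.append(name)
    return cont_names, cat_names

df = pd.DataFrame({'id': range(30),                        # 30 distinct ints -> continuous
                   'grade': [i % 3 for i in range(30)],    # 3 distinct ints  -> categorical
                   'price': [i * 0.5 for i in range(30)],  # float            -> continuous
                   'color': ['red', 'blue'] * 15})         # strings          -> categorical
print(split_by_cardinality(df))
```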

df_shrink_dtypes

 df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)

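Return any possible smaller data types for DataFrame columns. Allows object->category and int->uint conversion, and excluding columns by name.

For example, we will make a sample DataFrame with int, float, bool, and object datatypes: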

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes

i         int64
f       float64
e          bool
date     object
dtype: object

We can then call df_shrink_dtypes to find the smallest possible datatype that can support the data:

dt = df_shrink_dtypes(df)
dt

{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
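
One way to picture how the integer result above is chosen: try candidate dtypes from narrowest to widest and keep the first whose range covers the column's minimum and maximum. The following is a minimal sketch of that idea using np.iinfo; smallest_int_dtype is a hypothetical helper, not the library's code.

```python
import numpy as np

SIGNED = (np.int8, np.int16, np.int32, np.int64)
UNSIGNED = (np.uint8, np.uint16, np.uint32, np.uint64)

def smallest_int_dtype(lo, hi, candidates=SIGNED):
    "Return the first (narrowest) candidate dtype whose range covers [lo, hi]."
    for t in candidates:
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max: return np.dtype(t)
    return None

print(smallest_int_dtype(-100, 100))          # int8
print(smallest_int_dtype(0, 254))             # int16
print(smallest_int_dtype(0, 254, UNSIGNED))   # uint8
```

The last two calls show why an int2uint-style option can matter: values up to 254 do not fit in int8, but they do fit in uint8.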

df_shrink

 df_shrink (df, skip=[], obj2cat=True, int2uint=False)

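Reduce DataFrame memory usage by casting to the smaller types returned by df_shrink_dtypes().

df_shrink(df) attempts to make a DataFrame use less memory by casting numeric columns to the smallest datatype that can hold their values. In addition:

- boolean, category, and datetime64[ns] dtype columns are ignored.
- 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True casts int types to uint types if all values in a column are >= 0.
- Columns can be excluded by name with skip=['col1','col2'].

To get only the new column data types without actually casting a DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().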

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])

Let's compare the two:

df.dtypes

i         int64
f       float64
u         int64
date     object
dtype: object

df2.dtypes

i          int8
f       float32
u         int16
date     object
dtype: object

We can see that the datatypes changed, and we can go further and look at their relative memory usage:

Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
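
A comparable measurement can be made with DataFrame.memory_usage. The sketch below is a pandas-only approximation under stated assumptions: it casts the numeric columns by hand instead of calling df_shrink, and it leaves out the date column and the index, so its byte counts will not match the output above.

```python
import pandas as pd

df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u': [0, 10, 254]})
small = df.astype({'i': 'int8', 'f': 'float32', 'u': 'int16'})

# index=False counts only the column buffers, not the index
before = df.memory_usage(index=False).sum()
after = small.memory_usage(index=False).sum()
print(f"Initial Dataframe: {before} bytes")
print(f"Reduced Dataframe: {after} bytes")
```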

Here's another example using the ADULT_SAMPLE dataset:

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)

Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes

We reduced the overall memory used by 79%!
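
The percentage follows directly from the two sizes printed above; a quick check of the arithmetic:

```python
initial_mb, reduced_mb = 3.907448, 0.818329

# Fraction of the original footprint that was saved
reduction = 1 - reduced_mb / initial_mb
print(f"{reduction:.1%}")   # 79.1%
```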

Tabular

 Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None,
          y_block=None, splits=None, do_setup=True, device=None,
          inplace=False, reduce_memory=True)

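A DataFrame wrapper that knows which columns are continuous, categorical, or y variables, and returns rows in __getitem__.

df: A DataFrame of your data
cat_names: Your categorical x variables
cont_names: Your continuous x variables
y_names: Your dependent y variables
- Note: Mixed y's, such as regression and classification together, are not currently supported, but multiple regression or multiple classification outputs are.
y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
splits: How to split your data
do_setup: Whether Tabular runs the data through the procs upon initialization
device: cuda or cpu
inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this.
reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink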

TabularPandas

 TabularPandas (df, procs=None, cat_names=None, cont_names=None,
                y_names=None, y_block=None, splits=None, do_setup=True,
                device=None, inplace=False, reduce_memory=True)

A Tabular object with transforms

TabularProc

 TabularProc (enc=None, dec=None, split_idx=None, order=None)

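Base class to write a non-lazy tabular processor for DataFrames

These transforms are applied as soon as the data is available, rather than when data is requested from the DataLoader.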

Categorify

 Categorify (enc=None, dec=None, split_idx=None, order=None)

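Transform the categorical variables to something similar to pd.Categorical

Although the DataFrame looks unchanged when displayed, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame: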

df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()

	a
0	0
1	1
2	2
3	0
4	2

Each column's unique values are stored in a dictionary of the form column:[values]:

cat = to.procs.categorify
cat.classes

{'a': ['#na#', 0, 1, 2]}
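
The relationship between the stored classes and the integer codes can be sketched as follows. This is a simplified illustration, not fastai's implementation: it assumes that each value is encoded as its index in the classes list and that index 0 ('#na#') is used for missing or unseen values.

```python
import pandas as pd

train = pd.Series([0, 1, 2, 0, 2])

# '#na#' is reserved at index 0; observed values follow in sorted order
classes = ['#na#'] + sorted(train.dropna().unique().tolist())
code_of = {v: i for i, v in enumerate(classes)}

def encode(values):
    # Anything not seen during setup falls back to index 0 ('#na#')
    return [code_of.get(v, 0) for v in values]

print(classes)           # ['#na#', 0, 1, 2]
print(encode(train))     # [1, 2, 3, 1, 3]
print(encode([2, 7]))    # [3, 0]
```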

FillStrategy

 FillStrategy ()

Namespace containing the various filling strategies.

Currently, filling with the median, a constant, and the mode are supported.

FillMissing

 FillMissing (fill_strategy=<function median>, add_col=True,
              fill_vals=None)

Fill the missing values in continuous columns.
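
With the defaults in the signature (median strategy, add_col=True), the effect on a single column can be approximated in plain pandas. The sketch below is an assumption-laden illustration rather than the library's code; the _na suffix mirrors the education-num_na column that appears in the outputs further down this page.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'education-num': [12.0, 14.0, np.nan, 15.0, np.nan]})
col = 'education-num'

missing = df[col].isna()            # remember which rows were missing
fill_value = df[col].median()       # median of the observed values
df[col + '_na'] = missing           # indicator column (the add_col=True behaviour)
df[col] = df[col].fillna(fill_value)
print(df)
```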

ReadTabBatch

 ReadTabBatch (to)

Transform TabularPandas values into a Tensor with the ability to decode

TabDataLoader

 TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None,
                num_workers=0, verbose:bool=False, do_setup:bool=True,
                pin_memory=False, timeout=0, batch_size=None,
                drop_last=False, indexed=None, n=None, device=None,
                persistent_workers=False, pin_memory_device='', wif=None,
                before_iter=None, after_item=None, before_batch=None,
                after_iter=None, create_batches=None, create_item=None,
                create_batch=None, retain=None, get_idxs=None,
                sample=None, shuffle_fn=None, do_batch=None)

A transformed DataLoader for Tabular data

TabWeightedDL

 TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False,
                after_batch=None, num_workers=0, verbose:bool=False,
                do_setup:bool=True, pin_memory=False, timeout=0,
                batch_size=None, drop_last=False, indexed=None, n=None,
                device=None, persistent_workers=False,
                pin_memory_device='', wif=None, before_iter=None,
                after_item=None, before_batch=None, after_iter=None,
                create_batches=None, create_item=None, create_batch=None,
                retain=None, get_idxs=None, sample=None, shuffle_fn=None,
                do_batch=None)

A transformed DataLoader for Tabular Weighted data

Integration example

For a more in-depth explanation, see the tabular tutorial

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Self-emp-not-inc	Prof-school	Divorced	Prof-specialty	Not-in-family	White	False	65.000000	316093.005287	15.0	<50k
1	Private	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	69.999999	280306.998091	13.0	<50k
2	Federal-gov	Some-college	Married-civ-spouse	Adm-clerical	Husband	Black	False	34.000000	199933.999862	10.0	>=50k
3	Private	HS-grad	Never-married	Handlers-cleaners	Unmarried	White	False	24.000001	300584.002430	9.0	<50k
4	Private	Assoc-voc	Never-married	Other-service	Not-in-family	White	False	34.000000	220630.999335	11.0	<50k
5	Private	Bachelors	Divorced	Prof-specialty	Unmarried	White	False	45.000000	289230.003178	13.0	>=50k
6	?	Some-college	Never-married	?	Own-child	White	False	26.000000	208993.999494	10.0	<50k
7	Private	Some-college	Divorced	Adm-clerical	Not-in-family	White	False	43.000000	174574.999446	10.0	<50k
8	Self-emp-not-inc	Assoc-voc	Married-civ-spouse	Other-service	Husband	White	False	63.000000	420628.997361	11.0	<50k
9	State-gov	Some-college	Married-civ-spouse	Adm-clerical	Husband	Black	False	25.000000	257064.003065	10.0	<50k

to.show()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
5516	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	49.0	140121.0	9.0	<50k
7184	Self-emp-inc	Some-college	Never-married	Exec-managerial	Not-in-family	White	False	70.0	207938.0	10.0	<50k
2336	Private	Some-college	Never-married	Priv-house-serv	Own-child	White	False	23.0	50953.0	10.0	<50k
4342	Private	Assoc-voc	Married-civ-spouse	Machine-op-inspct	Husband	White	False	46.0	27802.0	11.0	<50k
8474	Self-emp-not-inc	Assoc-acdm	Married-civ-spouse	Craft-repair	Husband	White	False	47.0	107231.0	12.0	<50k
5948	Local-gov	HS-grad	Married-civ-spouse	Transport-moving	Husband	White	False	40.0	55363.0	9.0	<50k
5342	Local-gov	HS-grad	Married-civ-spouse	Craft-repair	Husband	White	False	46.0	36228.0	9.0	<50k
9005	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	White	False	38.0	297449.0	13.0	>=50k
1189	Private	Assoc-voc	Divorced	Sales	Not-in-family	Amer-Indian-Eskimo	False	31.0	87950.0	11.0	<50k
8784	Private	Assoc-voc	Divorced	Prof-specialty	Own-child	Black	False	35.0	491000.0	11.0	<50k

We can decode any set of transformed data by calling to.decode_row with our raw data:

row = to.items.iloc[0]
to.decode_row(row)

age                             49.0
workclass                    Private
fnlwgt                      140121.0
education                    HS-grad
education-num                    9.0
marital-status              Divorced
occupation           Exec-managerial
relationship               Unmarried
race                           White
sex                             Male
capital-gain                       0
capital-loss                       0
hours-per-week                    50
native-country         United-States
salary                          <50k
education-num_na               False
Name: 5516, dtype: object
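
For continuous columns that went through Normalize, decoding amounts to undoing the standardisation. The round trip below is a minimal numpy sketch of that idea; it assumes a plain mean and standard deviation, and the exact statistics fastai stores may be computed differently.

```python
import numpy as np

ages = np.array([49., 70., 23., 46., 47.])
mean, std = ages.mean(), ages.std()

encoded = (ages - mean) / std        # standardised values, as the model would see them
decoded = encoded * std + mean       # inverse transform, as shown after decoding
print(np.allclose(decoded, ages))    # True
```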

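We can make new test datasets based on the training data with to.new()

Note

Since machine learning models cannot magically understand categories they were never trained on, the data should reflect this. If the missing values in your test data differ from those in the training data, you should address this before training.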

to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	hours-per-week	native-country	education-num_na
10000	0.465031	5	1.319553	10	1.176677	3	2	1	2	Male	40	Philippines	1
10001	-0.926675	5	1.233650	12	-0.420035	3	15	1	4	Male	40	United-States	1
10002	1.051012	5	0.145161	2	-1.218391	1	9	2	5	Female	37	United-States	1
10003	0.538279	5	-0.282370	12	-0.420035	7	2	5	5	Female	43	United-States	1
10004	0.758022	6	1.420768	9	0.378321	3	5	1	5	Male	60	United-States	1

We can then convert it to a DataLoader:

tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num
0	Private	Bachelors	Married-civ-spouse	Adm-clerical	Husband	Asian-Pac-Islander	False	45.000000	338105.005817	13.0
1	Private	HS-grad	Married-civ-spouse	Transport-moving	Husband	Other	False	26.000000	328663.002806	9.0
2	Private	11th	Divorced	Other-service	Not-in-family	White	False	52.999999	209022.000317	7.0
3	Private	HS-grad	Widowed	Adm-clerical	Unmarried	White	False	46.000000	162029.998917	9.0
4	Self-emp-inc	Assoc-voc	Married-civ-spouse	Exec-managerial	Husband	White	False	49.000000	349230.006300	11.0
5	Local-gov	Some-college	Married-civ-spouse	Exec-managerial	Husband	White	False	34.000000	124827.002059	10.0
6	Self-emp-inc	Some-college	Married-civ-spouse	Sales	Husband	White	False	52.999999	290640.002462	10.0
7	Private	Some-college	Never-married	Sales	Own-child	White	False	19.000000	106272.998239	10.0
8	Private	Some-college	Married-civ-spouse	Protective-serv	Husband	Black	False	71.999999	53684.001668	10.0
9	Private	Some-college	Never-married	Sales	Own-child	White	False	20.000000	505980.010609	10.0

# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)

train_dl.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary
0	Local-gov	Masters	Never-married	Prof-specialty	Not-in-family	White	False	31.000000	204469.999932	14.0	<50k
1	Self-emp-not-inc	HS-grad	Divorced	Farming-fishing	Not-in-family	White	False	32.000000	34572.002104	9.0	<50k
2	?	Some-college	Widowed	?	Not-in-family	White	False	64.000000	34099.998990	10.0	<50k
3	Private	Some-college	Divorced	Exec-managerial	Not-in-family	White	False	32.000000	251242.999189	10.0	>=50k
4	Federal-gov	HS-grad	Married-civ-spouse	Exec-managerial	Husband	White	False	55.000001	176903.999313	9.0	<50k
5	Private	11th	Married-civ-spouse	Transport-moving	Husband	White	False	50.000000	192203.000000	7.0	<50k
6	Private	10th	Never-married	Farming-fishing	Own-child	Black	False	36.000000	181720.999704	6.0	<50k
7	Local-gov	Masters	Divorced	Prof-specialty	Not-in-family	Amer-Indian-Eskimo	False	50.000000	220640.001490	14.0	>=50k
8	Private	HS-grad	Married-civ-spouse	Adm-clerical	Wife	White	False	36.000000	189381.999993	9.0	>=50k
9	Private	Masters	Divorced	Prof-specialty	Unmarried	White	False	42.000000	265697.997341	14.0	<50k
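
Conceptually, a weighted loader draws rows in proportion to their weights. The numpy sketch below illustrates that sampling step only, under the assumption that indices are drawn with replacement from normalised weights; it is not the TabWeightedDL implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
wgts = np.array([0.0, 1.0, 3.0])     # one weight per row
probs = wgts / wgts.sum()            # normalise to probabilities

# Draw row indices with replacement, proportionally to their weight
idxs = rng.choice(len(wgts), size=1000, replace=True, p=probs)
counts = np.bincount(idxs, minlength=len(wgts))
print(counts)
```

A row with weight 0 is never drawn, and a row with three times the weight of another is drawn roughly three times as often.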

TabDataLoader's create_item method

df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})

age    35
Name: 0, dtype: int8

Other target types

Multi-label categories

One-hot encoded labels

def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	male	white
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	True	False	True
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	True	True	True
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	False	False	False
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	True	True	False
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	False	False	False

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]

CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	age	fnlwgt	education-num	salary	male	white
0	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	47.000000	164423.000013	9.0	False	False	True
1	Private	Some-college	Married-civ-spouse	Transport-moving	Husband	White	False	74.999999	239037.999499	10.0	False	True	True
2	Private	HS-grad	Married-civ-spouse	Sales	Wife	White	False	45.000000	228570.000761	9.0	False	False	True
3	Self-emp-not-inc	HS-grad	Married-civ-spouse	Exec-managerial	Husband	Asian-Pac-Islander	False	45.000000	285574.998753	9.0	False	True	False
4	Private	Some-college	Never-married	Adm-clerical	Own-child	White	False	21.999999	184812.999966	10.0	False	True	True
5	Private	10th	Married-civ-spouse	Transport-moving	Husband	White	False	67.000001	274450.998865	6.0	False	True	True
6	Private	HS-grad	Divorced	Exec-managerial	Unmarried	White	False	53.999999	192862.000000	9.0	False	False	True
7	Federal-gov	Some-college	Divorced	Tech-support	Unmarried	Amer-Indian-Eskimo	False	37.000000	33486.997455	10.0	False	False	False
8	Private	HS-grad	Never-married	Machine-op-inspct	Other-relative	White	False	30.000000	219318.000010	9.0	False	False	True
9	Self-emp-not-inc	Bachelors	Married-civ-spouse	Sales	Husband	White	False	44.000000	167279.999960	13.0	False	True	True

Not one-hot encoded

def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

df_main.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	target
0	49	Private	101320	Assoc-acdm	12.0	Married-civ-spouse	NaN	Wife	White	Female	0	1902	40	United-States	>=50k	>50k white
1	44	Private	236746	Masters	14.0	Divorced	Exec-managerial	Not-in-family	White	Male	10520	0	45	United-States	>=50k	>50k male white
2	38	Private	96185	HS-grad	NaN	Divorced	NaN	Unmarried	Black	Female	0	0	32	United-States	<50k
3	38	Self-emp-inc	112847	Prof-school	15.0	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	United-States	>=50k	>50k male
4	42	Self-emp-not-inc	82297	7th-8th	NaN	Married-civ-spouse	Other-service	Wife	Black	Female	0	0	50	United-States	<50k
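
The space-delimited target strings above carry the same information as the one-hot columns of the previous section. As an illustration of that relationship (plain Python, independent of fastai), the labels can be split on spaces and expanded into one boolean per vocabulary entry:

```python
targets = ['>50k white', '>50k male white', '', '>50k male', '']

# Vocabulary: every distinct label, in sorted order
vocab = sorted({label for t in targets for label in t.split()})

# One row per sample, one boolean per vocabulary entry
one_hot = [[label in t.split() for label in vocab] for t in targets]

print(vocab)        # ['>50k', 'male', 'white']
print(one_hot[0])   # [True, False, True]
```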

@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms

to.procs[2].vocab

['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']

Regression

path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)

cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))

CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms

to.procs[-1].means

{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}

dls = to.dataloaders()
dls.valid.show_batch()

	workclass	education	marital-status	occupation	relationship	race	education-num_na	fnlwgt	education-num	age
0	Private	12th	Never-married	Adm-clerical	Other-relative	Black	False	503454.004078	8.0	47.0
1	Federal-gov	Bachelors	Married-civ-spouse	Exec-managerial	Husband	White	False	586656.993690	13.0	49.0
2	Self-emp-not-inc	Assoc-voc	Married-civ-spouse	Farming-fishing	Husband	White	False	164607.001243	11.0	29.0
3	Private	HS-grad	Never-married	Adm-clerical	Not-in-family	Black	False	155508.999873	9.0	48.0
4	Private	11th	Never-married	Other-service	Own-child	White	False	318189.998679	7.0	18.0
5	Private	HS-grad	Never-married	Adm-clerical	Other-relative	White	False	140219.001104	9.0	47.0
6	Private	Masters	Divorced	#na#	Unmarried	White	True	235683.001562	10.0	47.0
7	Private	Bachelors	Married-civ-spouse	Craft-repair	Husband	White	False	187321.999825	13.0	43.0
8	Private	Bachelors	Married-civ-spouse	Prof-specialty	Husband	White	False	104196.002410	13.0	40.0
9	Private	Some-college	Separated	Priv-house-serv	Other-relative	White	False	184302.999784	10.0	25.0

Not being used now - for multi-modal

class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])

# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)

# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')

# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a
# dtype: object""")