Tabular core
Basic functions to preprocess tabular data before assembling it in a DataLoaders.
Initial preprocessing
make_date
make_date (df, date_field)
Make sure df[date_field] is of the right date type.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
add_datepart
add_datepart (df, field_name, prefix=None, drop=True, time=False)
Helper function that adds columns relevant to a date in the column field_name of df.
For example, if we have a series of dates we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, etc. as shown below:
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
|  | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 | 
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN | 
| 2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 | 
| 3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 | 
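The time parameter is not demonstrated above. As a quick sketch (the timestamp values below are made up for illustration), passing time=True additionally extracts Hour, Minute and Second features; the column prefix is derived from the field name, so here the generated columns are tsYear, ..., tsHour, tsMinute, tsSecond:
# Sketch only: time=True also extracts Hour, Minute and Second from a timestamp column
df_t = pd.DataFrame({'ts': ['2019-12-04 10:30:00', '2019-11-15 23:45:10']})
df_t = add_datepart(df_t, 'ts', time=True)
df_t.columns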
add_elapsed_times
add_elapsed_times (df, field_names, date_field, base_field)
Add in df, for each event in field_names, the elapsed time according to date_field grouped by base_field.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
|  | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
|---|---|---|---|---|---|---|---|
| 0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 | 
| 1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 | 
| 2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 | 
| 3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 | 
cont_cat_split
cont_cat_split (df, max_card=20, dep_var=None)
Helper function that returns the column names of continuous and categorical variables from a given df.
This function works by determining whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column is a float datatype), it is added to cont_names, otherwise to cat_names. An example below:
# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                    'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                    'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                    'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                    'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                    })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
df_shrink_dtypes
df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)
Return any possible smaller data types for DataFrame columns. Allows object->category, int->uint, and exclusion of specified columns.
For example, we will create a sample DataFrame with int, float, bool, and object datatypes:
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i         int64
f       float64
e          bool
date     object
dtype: object
We can then call df_shrink_dtypes to find the smallest possible datatypes that can support the data:
dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
df_shrink
df_shrink (df, skip=[], obj2cat=True, int2uint=False)
Reduce DataFrame memory usage by casting to the smaller types returned by df_shrink_dtypes().
df_shrink(df) attempts to reduce the memory usage of a DataFrame by casting numerical columns to their smallest possible datatypes. In addition:
- boolean, category, and datetime64[ns] dtype columns are ignored.
- 'object' type columns are categorified, which can save a lot of memory in large datasets. This can be turned off with obj2cat=False.
- int2uint=True casts int types to uint types when all data in the column is >= 0.
- Columns can be excluded by name using skip=['col1','col2'].
To get only the new column data types without actually casting the DataFrame, use df_shrink_dtypes() with all the same parameters as df_shrink().
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                  'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
Let's compare the two:
df.dtypes
i         int64
f       float64
u         int64
date     object
dtype: object
df2.dtypes
i          int8
f       float32
u         int16
date     object
dtype: object
We can see that the data types have changed, and we can even go further and look at their relative memory usage:
Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
Here is another example using the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes
We reduced the overall memory usage by 79%!
Tabular
Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A DataFrame wrapper that knows which columns are continuous/categorical/y variables, and returns rows in __getitem__.
- df: a DataFrame of your data
- cat_names: your categorical x variables
- cont_names: your continuous x variables
- y_names: your dependent y variables. Note: mixed y types (e.g. regression and classification together) are not currently supported, however multiple regression or classification outputs are.
- y_block: how to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
- splits: how to split your data
- do_setup: a parameter determining whether Tabular will run the data through the procs upon initialization
- device: cuda or cpu
- inplace: if True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this.
- reduce_memory: fastai will attempt to reduce the overall memory usage of your input DataFrame with df_shrink
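As a quick illustration of how these arguments fit together (a minimal sketch only; the column names are borrowed from the ADULT_SAMPLE integration example later on this page):
# Minimal sketch: wiring the arguments described above into a TabularPandas object
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
to = TabularPandas(df,
                   procs=[Categorify, FillMissing, Normalize],  # preprocessing steps
                   cat_names=['workclass', 'education'],        # categorical x variables
                   cont_names=['age', 'fnlwgt'],                # continuous x variables
                   y_names='salary',                            # dependent variable
                   y_block=CategoryBlock(),                     # optional; inferred if omitted
                   splits=RandomSplitter()(range_of(df)))       # train/valid split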
TabularPandas
TabularPandas (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A Tabular object with transforms.
TabularProc
TabularProc (enc=None, dec=None, split_idx=None, order=None)
Base class to write a non-lazy tabular processor for dataframes.
These transforms are applied as soon as the data is available, rather than as the data is called from the DataLoader.
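As an illustration of the pattern (a minimal sketch, not part of the library): a custom processor that eagerly log-transforms the continuous columns, overriding encodes/decodes the same way the built-in procs below do.
# Hypothetical custom TabularProc: log-transform continuous columns as soon as the data is available.
# Assumes the continuous values are non-negative.
class LogTransform(TabularProc):
    order = 2  # relative ordering among procs
    def encodes(self, to):
        to.conts = np.log1p(to.conts)
        return to
    def decodes(self, to):
        to.conts = np.expm1(to.conts)
        return to
# It would then be passed alongside the built-in procs, e.g. procs = [Categorify, FillMissing, LogTransform]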
Categorify
Categorify (enc=None, dec=None, split_idx=None, order=None)
Transform the categorical variables to something similar to pd.Categorical.
While you will not see a visual change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame:
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
|  | a |
|---|---|
| 0 | 0 | 
| 1 | 1 | 
| 2 | 2 | 
| 3 | 0 | 
| 4 | 2 | 
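Under the hood, to.items now stores integer codes rather than the raw values, with index 0 reserved for #na#. A small illustrative check (the expected output is an assumption based on that encoding):
# The raw items hold the integer codes; 0 is reserved for '#na#'
to.items['a'].tolist()  # expected: [1, 2, 3, 1, 3]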
Each column's unique values are stored in a dictionary of column:[values]:
cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
FillStrategy
FillStrategy ()
Namespace containing the various filling strategies.
Currently, filling with the median, a constant, and the mode are supported.
FillMissing
FillMissing (fill_strategy=<function median>, add_col=True, fill_vals=None)
Fill the missing values in continuous columns.
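For instance (a minimal sketch), to fill with each column's mode instead of the default median, instantiate the proc with a different strategy:
# Use the mode rather than the default median when filling missing values
fill = FillMissing(fill_strategy=FillStrategy.mode)
procs = [Categorify, fill, Normalize]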
ReadTabBatch
ReadTabBatch (to)
Transform TabularPandas values into a Tensor with the ability to decode.
TabDataLoader
TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for Tabular data.
TabWeightedDL
TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for weighted Tabular data.
Integration example
For a more in-depth explanation, see the tabular tutorial.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | 
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | 
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | 
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | 
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | 
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Self-emp-not-inc | Prof-school | Divorced | Prof-specialty | Not-in-family | White | False | 65.000000 | 316093.005287 | 15.0 | <50k | 
| 1 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 69.999999 | 280306.998091 | 13.0 | <50k | 
| 2 | Federal-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 34.000000 | 199933.999862 | 10.0 | >=50k | 
| 3 | Private | HS-grad | Never-married | Handlers-cleaners | Unmarried | White | False | 24.000001 | 300584.002430 | 9.0 | <50k | 
| 4 | Private | Assoc-voc | Never-married | Other-service | Not-in-family | White | False | 34.000000 | 220630.999335 | 11.0 | <50k | 
| 5 | Private | Bachelors | Divorced | Prof-specialty | Unmarried | White | False | 45.000000 | 289230.003178 | 13.0 | >=50k | 
| 6 | ? | Some-college | Never-married | ? | Own-child | White | False | 26.000000 | 208993.999494 | 10.0 | <50k | 
| 7 | Private | Some-college | Divorced | Adm-clerical | Not-in-family | White | False | 43.000000 | 174574.999446 | 10.0 | <50k | 
| 8 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Other-service | Husband | White | False | 63.000000 | 420628.997361 | 11.0 | <50k | 
| 9 | State-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 25.000000 | 257064.003065 | 10.0 | <50k | 
to.show()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 5516 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 49.0 | 140121.0 | 9.0 | <50k | 
| 7184 | Self-emp-inc | Some-college | Never-married | Exec-managerial | Not-in-family | White | False | 70.0 | 207938.0 | 10.0 | <50k | 
| 2336 | Private | Some-college | Never-married | Priv-house-serv | Own-child | White | False | 23.0 | 50953.0 | 10.0 | <50k | 
| 4342 | Private | Assoc-voc | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 46.0 | 27802.0 | 11.0 | <50k | 
| 8474 | Self-emp-not-inc | Assoc-acdm | Married-civ-spouse | Craft-repair | Husband | White | False | 47.0 | 107231.0 | 12.0 | <50k | 
| 5948 | Local-gov | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 40.0 | 55363.0 | 9.0 | <50k | 
| 5342 | Local-gov | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 36228.0 | 9.0 | <50k | 
| 9005 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | White | False | 38.0 | 297449.0 | 13.0 | >=50k | 
| 1189 | Private | Assoc-voc | Divorced | Sales | Not-in-family | Amer-Indian-Eskimo | False | 31.0 | 87950.0 | 11.0 | <50k | 
| 8784 | Private | Assoc-voc | Divorced | Prof-specialty | Own-child | Black | False | 35.0 | 491000.0 | 11.0 | <50k | 
We can decode any set of transformed data by calling to.decode_row with our raw data:
row = to.items.iloc[0]
to.decode_row(row)
age                             49.0
workclass                    Private
fnlwgt                      140121.0
education                    HS-grad
education-num                    9.0
marital-status              Divorced
occupation           Exec-managerial
relationship               Unmarried
race                           White
sex                             Male
capital-gain                       0
capital-loss                       0
hours-per-week                    50
native-country         United-States
salary                          <50k
education-num_na               False
Name: 5516, dtype: object
We can make new test datasets based on the training data with to.new().
Since machine learning models can't magically understand categories they were never trained on, the data should reflect this. If there are different missing values in your test data, you should address this before training.
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()
|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10000 | 0.465031 | 5 | 1.319553 | 10 | 1.176677 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 | 
| 10001 | -0.926675 | 5 | 1.233650 | 12 | -0.420035 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 | 
| 10002 | 1.051012 | 5 | 0.145161 | 2 | -1.218391 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 | 
| 10003 | 0.538279 | 5 | -0.282370 | 12 | -0.420035 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 | 
| 10004 | 0.758022 | 6 | 1.420768 | 9 | 0.378321 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 | 
We can then convert it to a DataLoader:
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.005817 | 13.0 | 
| 1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.002806 | 9.0 | 
| 2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 52.999999 | 209022.000317 | 7.0 | 
| 3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.998917 | 9.0 | 
| 4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349230.006300 | 11.0 | 
| 5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002059 | 10.0 | 
| 6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 52.999999 | 290640.002462 | 10.0 | 
| 7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998239 | 10.0 | 
| 8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 71.999999 | 53684.001668 | 10.0 | 
| 9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.010609 | 10.0 | 
# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)
train_dl.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Local-gov | Masters | Never-married | Prof-specialty | Not-in-family | White | False | 31.000000 | 204469.999932 | 14.0 | <50k | 
| 1 | Self-emp-not-inc | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 32.000000 | 34572.002104 | 9.0 | <50k | 
| 2 | ? | Some-college | Widowed | ? | Not-in-family | White | False | 64.000000 | 34099.998990 | 10.0 | <50k | 
| 3 | Private | Some-college | Divorced | Exec-managerial | Not-in-family | White | False | 32.000000 | 251242.999189 | 10.0 | >=50k | 
| 4 | Federal-gov | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | False | 55.000001 | 176903.999313 | 9.0 | <50k | 
| 5 | Private | 11th | Married-civ-spouse | Transport-moving | Husband | White | False | 50.000000 | 192203.000000 | 7.0 | <50k | 
| 6 | Private | 10th | Never-married | Farming-fishing | Own-child | Black | False | 36.000000 | 181720.999704 | 6.0 | <50k | 
| 7 | Local-gov | Masters | Divorced | Prof-specialty | Not-in-family | Amer-Indian-Eskimo | False | 50.000000 | 220640.001490 | 14.0 | >=50k | 
| 8 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Wife | White | False | 36.000000 | 189381.999993 | 9.0 | >=50k | 
| 9 | Private | Masters | Divorced | Prof-specialty | Unmarried | White | False | 42.000000 | 265697.997341 | 14.0 | <50k | 
TabDataLoader's create_item method
df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})
age    35
Name: 0, dtype: int8
Other target types
Multi-label categories
one-hot encoded label
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male']   = np.array(sex)
    df['white']  = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True | 
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True | 
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False | 
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False | 
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False | 
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]CPU times: user 66 ms, sys: 0 ns, total: 66 ms
dls = to.dataloaders()
dls.valid.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 47.000000 | 164423.000013 | 9.0 | False | False | True | 
| 1 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | False | 74.999999 | 239037.999499 | 10.0 | False | True | True | 
| 2 | Private | HS-grad | Married-civ-spouse | Sales | Wife | White | False | 45.000000 | 228570.000761 | 9.0 | False | False | True | 
| 3 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | False | 45.000000 | 285574.998753 | 9.0 | False | True | False | 
| 4 | Private | Some-college | Never-married | Adm-clerical | Own-child | White | False | 21.999999 | 184812.999966 | 10.0 | False | True | True | 
| 5 | Private | 10th | Married-civ-spouse | Transport-moving | Husband | White | False | 67.000001 | 274450.998865 | 6.0 | False | True | True | 
| 6 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 53.999999 | 192862.000000 | 9.0 | False | False | True | 
| 7 | Federal-gov | Some-college | Divorced | Tech-support | Unmarried | Amer-Indian-Eskimo | False | 37.000000 | 33486.997455 | 10.0 | False | False | False | 
| 8 | Private | HS-grad | Never-married | Machine-op-inspct | Other-relative | White | False | 30.000000 | 219318.000010 | 9.0 | False | False | True | 
| 9 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 44.000000 | 167279.999960 | 13.0 | False | True | True | 
Not one-hot encoded
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male':   labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white | 
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white | 
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male | 
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |  |
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to
@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
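The TabularPandas construction cell is again not shown above; a plausible construction (an assumption, not the original cell) uses the space-delimited target column with a plain MultiCategoryBlock:
# Assumed construction: a non-encoded multi-label target paired with MultiCategoryBlock()
to = TabularPandas(df_main, procs, cat_names, cont_names,
                   y_names="target",
                   y_block=MultiCategoryBlock(),
                   splits=splits)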
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
Regression
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
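Once more the TabularPandas construction cell is not shown; a plausible version (an assumption, not the original cell) simply sets y_names to the continuous target, here 'age'. A numeric target is treated as regression; the block can also be passed explicitly:
# Assumed construction: regression on 'age'
to = TabularPandas(df_main, procs, cat_names, cont_names,
                   y_names='age',
                   y_block=RegressionBlock(),
                   splits=splits)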
to.procs[-1].means
{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}
dls = to.dataloaders()
dls.valid.show_batch()
|  | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | 12th | Never-married | Adm-clerical | Other-relative | Black | False | 503454.004078 | 8.0 | 47.0 | 
| 1 | Federal-gov | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 586656.993690 | 13.0 | 49.0 | 
| 2 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Farming-fishing | Husband | White | False | 164607.001243 | 11.0 | 29.0 | 
| 3 | Private | HS-grad | Never-married | Adm-clerical | Not-in-family | Black | False | 155508.999873 | 9.0 | 48.0 | 
| 4 | Private | 11th | Never-married | Other-service | Own-child | White | False | 318189.998679 | 7.0 | 18.0 | 
| 5 | Private | HS-grad | Never-married | Adm-clerical | Other-relative | White | False | 140219.001104 | 9.0 | 47.0 | 
| 6 | Private | Masters | Divorced | #na# | Unmarried | White | True | 235683.001562 | 10.0 | 47.0 | 
| 7 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | 187321.999825 | 13.0 | 43.0 | 
| 8 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 104196.002410 | 13.0 | 40.0 | 
| 9 | Private | Some-college | Separated | Priv-house-serv | Other-relative | White | False | 184302.999784 | 10.0 | 25.0 | 
Not currently used - for multi-modal
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]
    def display(self, ctxs): display_df(pd.DataFrame(ctxs))
class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)
class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())
    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))
class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a               1
# b_na        False
# b               1
# category        a
# dtype: object""")