df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
make_date(df, 'date')
test_eq(df['date'].dtype, np.dtype('datetime64[ns]'))
Tabular core
Basic functions to preprocess tabular data before it is assembled into a DataLoaders.
Initial preprocessing
make_date
make_date (df, date_field)
Make sure df[date_field] is of the right date type.
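The idea can be sketched in plain pandas, without fastai. This is only an illustration of what "the right date type" means; `ensure_datetime` is a hypothetical helper and is not taken from the fastai implementation of `make_date`.

```python
import pandas as pd

def ensure_datetime(df, date_field):
    # Hypothetical helper: convert the column in place only when it is
    # not already stored with a datetime dtype.
    if not pd.api.types.is_datetime64_any_dtype(df[date_field]):
        df[date_field] = pd.to_datetime(df[date_field])

dates = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']})
ensure_datetime(dates, 'date')
print(dates['date'].dtype)
```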
add_datepart
add_datepart (df, field_name, prefix=None, drop=True, time=False)
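Helper function that adds date-related columns derived from the field_name column of df.
For example, given a series of dates we can generate features such as Year, Month, Day, Dayofweek, Is_month_start, and so on, as shown below: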
df = pd.DataFrame({'date': ['2019-12-04', None, '2019-11-15', '2019-10-24']})
df = add_datepart(df, 'date')
df.head()
| | Year | Month | Week | Day | Dayofweek | Dayofyear | Is_month_end | Is_month_start | Is_quarter_end | Is_quarter_start | Is_year_end | Is_year_start | Elapsed |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2019.0 | 12.0 | 49.0 | 4.0 | 2.0 | 338.0 | False | False | False | False | False | False | 1.575418e+09 |
1 | NaN | NaN | NaN | NaN | NaN | NaN | False | False | False | False | False | False | NaN |
2 | 2019.0 | 11.0 | 46.0 | 15.0 | 4.0 | 319.0 | False | False | False | False | False | False | 1.573776e+09 |
3 | 2019.0 | 10.0 | 43.0 | 24.0 | 3.0 | 297.0 | False | False | False | False | False | False | 1.571875e+09 |
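A few of these features can be reproduced directly with the pandas `.dt` accessor. The sketch below is not how `add_datepart` is implemented; it only shows where such values come from, using three of the dates from the table above.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2019-12-04', '2019-11-15', '2019-10-24']))
parts = pd.DataFrame({
    'Year': dates.dt.year,
    'Month': dates.dt.month,
    'Day': dates.dt.day,
    'Dayofweek': dates.dt.dayofweek,          # Monday=0 ... Sunday=6
    'Is_month_start': dates.dt.is_month_start,
})
print(parts)
```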
add_elapsed_times
add_elapsed_times (df, field_names, date_field, base_field)
Add to df, for each event in field_names, the elapsed time according to date_field, grouped by base_field.
df = pd.DataFrame({'date': ['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24'],
                   'event': [False, True, False, True], 'base': [1,1,2,2]})
df = add_elapsed_times(df, ['event'], 'date', 'base')
df.head()
| | date | event | base | Afterevent | Beforeevent | event_bw | event_fw |
---|---|---|---|---|---|---|---|
0 | 2019-12-04 | False | 1 | 5 | 0 | 1.0 | 0.0 |
1 | 2019-11-29 | True | 1 | 0 | 0 | 1.0 | 1.0 |
2 | 2019-11-15 | False | 2 | 22 | 0 | 1.0 | 0.0 |
3 | 2019-10-24 | True | 2 | 0 | 0 | 1.0 | 1.0 |
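To make the Afterevent column concrete, here is a minimal pandas sketch that computes "days since the last event within each group" for the same data. The column name `days_after_event` is hypothetical, and the approach (carrying the last event date forward within each group) is an assumption about the idea, not a copy of fastai's code.

```python
import pandas as pd

events = pd.DataFrame({
    'date': pd.to_datetime(['2019-12-04', '2019-11-29', '2019-11-15', '2019-10-24']),
    'event': [False, True, False, True],
    'base': [1, 1, 2, 2],
})

# Sort chronologically inside each group so "last event" is well defined
ordered = events.sort_values(['base', 'date'])
# Keep the date only on event rows, then carry it forward within each group
last_event = ordered['date'].where(ordered['event']).groupby(ordered['base']).ffill()
ordered['days_after_event'] = (ordered['date'] - last_event).dt.days
# Restore the original row order
result = ordered.sort_index()
print(result)
```

The values agree with the Afterevent column in the table above: 5 and 22 days for the non-event rows, 0 for the event rows.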
cont_cat_split
cont_cat_split (df, max_card=20, dep_var=None)
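Helper function that returns the column names of the continuous and categorical variables in the given df.
This function decides whether a column is continuous or categorical based on the cardinality of its values. If the cardinality is above the max_card parameter (or the column has a float datatype), the column is added to cont_names; otherwise it is added to cat_names. An example is below: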
# Example with simple numpy types
df = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a'],
                   'i8': pd.Series([1, 2, 3, 4], dtype='int8'),
                   'u8': pd.Series([1, 2, 3, 4], dtype='uint8'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='float16'),
                   'y1': [1, 0, 1, 0], 'y2': [2, 1, 1, 0]})
cont_names, cat_names = cont_cat_split(df)
cont_names: ['cont1', 'f16']
cat_names: ['cat1', 'cat2', 'i8', 'u8', 'y1', 'y2']
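The rule described above can be sketched in a few lines of pandas. `split_by_cardinality` is a hypothetical stand-in, not the library function, and it assumes that the cardinality test is applied to integer columns only, so that string columns always count as categorical.

```python
import pandas as pd

def split_by_cardinality(df, max_card=20):
    # Hypothetical helper illustrating the rule: floats are continuous,
    # integers are continuous only when they have more than max_card
    # distinct values, everything else is categorical.
    cont, cat = [], []
    for name in df.columns:
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col)
        high_card_int = pd.api.types.is_integer_dtype(col) and col.nunique() > max_card
        (cont if is_float or high_card_int else cat).append(name)
    return cont, cat

sample = pd.DataFrame({'cat1': [1, 2, 3, 4], 'cont1': [1., 2., 3., 2.], 'cat2': ['a', 'b', 'b', 'a']})
cont, cat = split_by_cardinality(sample)
print(cont, cat)
```

Lowering `max_card` to 0 moves every integer column to the continuous list, which is the effect used in the second example of this section.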
# Example with pandas types and generated columns
df = pd.DataFrame({'cat1': pd.Series(['l','xs','xl','s'], dtype='category'),
                   'ui32': pd.Series([1, 2, 3, 4], dtype='UInt32'),
                   'i64': pd.Series([1, 2, 3, 4], dtype='Int64'),
                   'f16': pd.Series([1, 2, 3, 4], dtype='Float64'),
                   'd1_date': ['2021-02-09', None, '2020-05-12', '2020-08-14'],
                   })
df = add_datepart(df, 'd1_date', drop=False)
df['cat1'] = df['cat1'].cat.set_categories(['xl','l','m','s','xs'], ordered=True)
cont_names, cat_names = cont_cat_split(df, max_card=0)
cont_names: ['ui32', 'i64', 'f16', 'd1_Year', 'd1_Month', 'd1_Week', 'd1_Day', 'd1_Dayofweek', 'd1_Dayofyear', 'd1_Elapsed']
cat_names: ['cat1', 'd1_date', 'd1_Is_month_end', 'd1_Is_month_start', 'd1_Is_quarter_end', 'd1_Is_quarter_start', 'd1_Is_year_end', 'd1_Is_year_start']
df_shrink_dtypes
df_shrink_dtypes (df, skip=[], obj2cat=True, int2uint=False)
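Return any possible smaller data types for the DataFrame columns. Allows object to category conversion, int to uint conversion, and exclusion of named columns.
For example, we will make a sample DataFrame with int, float, bool, and object datatypes: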
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'e': [True, False, True],
                   'date':['2019-12-04','2019-11-29','2019-11-15',]})
df.dtypes
i int64
f float64
e bool
date object
dtype: object
We can then call df_shrink_dtypes to find the smallest possible datatype that can hold the data:
dt = df_shrink_dtypes(df)
dt
{'i': dtype('int8'), 'f': dtype('float32'), 'date': 'category'}
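For integer columns, "smallest possible datatype" can be understood as the first integer type whose range covers the column's minimum and maximum. The sketch below illustrates that idea with `np.iinfo`; `smallest_int_dtype` and its `allow_unsigned` flag are hypothetical names, and this is not the library's own code.

```python
import numpy as np
import pandas as pd

def smallest_int_dtype(col, allow_unsigned=False):
    # Hypothetical helper: return the first integer dtype whose range
    # covers every value in the column.
    lo, hi = int(col.min()), int(col.max())
    candidates = [np.int8, np.int16, np.int32, np.int64]
    if allow_unsigned and lo >= 0:
        candidates = [np.uint8, np.uint16, np.uint32, np.uint64]
    for t in candidates:
        info = np.iinfo(t)
        if info.min <= lo and hi <= info.max:
            return np.dtype(t)
    return col.dtype

print(smallest_int_dtype(pd.Series([-100, 0, 100])))
print(smallest_int_dtype(pd.Series([0, 10, 254])))
print(smallest_int_dtype(pd.Series([0, 10, 254]), allow_unsigned=True))
```

The value 254 does not fit in a signed 8-bit integer (maximum 127), which is why a column like that needs int16 unless unsigned types are allowed.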
df_shrink
df_shrink (df, skip=[], obj2cat=True, int2uint=False)
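Reduce DataFrame memory usage by casting columns to the smaller types returned by df_shrink_dtypes().
df_shrink(df) attempts to make a DataFrame use less memory by fitting numeric columns into the smallest possible datatypes. In addition:
- boolean, category, and datetime64[ns] dtype columns are ignored.
- object type columns are categorified, which can save a lot of memory on large datasets. This can be turned off with obj2cat=False.
- int2uint=True fits int types into uint types if all data in the column is >= 0.
- Columns can be excluded by name using skip=['col1','col2'].
To get only the new column data types without actually casting the DataFrame, use df_shrink_dtypes() with the same parameters as df_shrink().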
df = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0], 'u':[0, 10,254],
                   'date':['2019-12-04','2019-11-29','2019-11-15']})
df2 = df_shrink(df, skip=['date'])
Let's compare the two:
df.dtypes
i int64
f float64
u int64
date object
dtype: object
df2.dtypes
i int8
f float32
u int16
date object
dtype: object
We can see that the datatypes changed, and we can go further and look at their relative memory usage:
Initial Dataframe: 224 bytes
Reduced Dataframe: 173 bytes
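The code that printed these figures is not shown on this page. One way to take such a measurement, sketched here under the assumption that plain `DataFrame.memory_usage` is acceptable, is to sum the per-column byte counts before and after casting. The numbers differ from those above because this sketch leaves out the index and uses only two numeric columns.

```python
import pandas as pd

frame = pd.DataFrame({'i': [-100, 0, 100], 'f': [-100.0, 0.0, 100.0]})
reduced = frame.astype({'i': 'int8', 'f': 'float32'})

# memory_usage returns bytes per column; index=False leaves the index out
initial_bytes = frame.memory_usage(index=False).sum()
reduced_bytes = reduced.memory_usage(index=False).sum()
print(f"Initial: {initial_bytes} bytes, reduced: {reduced_bytes} bytes")
```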
Here is another example using the ADULT_SAMPLE dataset:
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
new_df = df_shrink(df, int2uint=True)
Initial Dataframe: 3.907448 megabytes
Reduced Dataframe: 0.818329 megabytes
We reduced the overall memory used by 79%!
Tabular
Tabular (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
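A DataFrame wrapper that knows which columns are continuous, categorical, or y variables, and returns rows in __getitem__.
- df: A DataFrame of your data
- cat_names: Your categorical x variables
- cont_names: Your continuous x variables
- y_names: Your dependent y variables
  - Note: Mixed y variables, such as regression and classification together, are not currently supported; multiple regression outputs or multiple classification outputs are.
- y_block: How to sub-categorize the type of y_names (CategoryBlock or RegressionBlock)
- splits: How to split your data
- do_setup: Whether Tabular runs the data through the procs upon initialization
- device: cuda or cpu
- inplace: If True, Tabular will not keep a separate copy of your original DataFrame in memory. You should ensure pd.options.mode.chained_assignment is None before setting this.
- reduce_memory: fastai will attempt to reduce the overall memory usage of the input DataFrame with df_shrink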
TabularPandas
TabularPandas (df, procs=None, cat_names=None, cont_names=None, y_names=None, y_block=None, splits=None, do_setup=True, device=None, inplace=False, reduce_memory=True)
A Tabular object with transforms
TabularProc
TabularProc (enc=None, dec=None, split_idx=None, order=None)
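Base class to write a non-lazy tabular processor for DataFrames
These transforms are applied as soon as the data is available, rather than when data is requested from the DataLoader.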
Categorify
Categorify (enc=None, dec=None, split_idx=None, order=None)
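Transform the categorical variables into something similar to pd.Categorical
While you will not see a visible change in the DataFrame, the classes are stored in to.procs.categorify, as we can see below on a dummy DataFrame: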
df = pd.DataFrame({'a':[0,1,2,0,2]})
to = TabularPandas(df, Categorify, 'a')
to.show()
| | a |
---|---|
0 | 0 |
1 | 1 |
2 | 2 |
3 | 0 |
4 | 2 |
Each column's unique values are stored in a dictionary of the form column:[values]:
cat = to.procs.categorify
cat.classes
{'a': ['#na#', 0, 1, 2]}
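The shape of that dictionary, with `'#na#'` in first position, suggests that index 0 is reserved for missing values. A minimal pandas sketch of such an encoding follows; it assumes that reading of the output and uses `pd.Categorical` directly instead of fastai.

```python
import pandas as pd

cat = pd.Categorical(['b', 'a', 'c', 'b', None, 'c'])
# pandas codes missing entries as -1, so shifting every code by one
# reserves index 0 for the '#na#' placeholder
classes = ['#na#'] + list(cat.categories)
codes = (cat.codes + 1).tolist()
print(classes)
print(codes)
```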
FillStrategy
FillStrategy ()
Namespace containing the various filling strategies.
Currently, filling with the median, a constant, and the mode are supported.
FillMissing
FillMissing (fill_strategy=<function median>, add_col=True, fill_vals=None)
Fill the missing values in continuous columns.
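As a rough illustration of median filling with an indicator column, here is a plain pandas sketch. It assumes that `add_col=True` corresponds to adding a boolean `<column>_na` column (the integration example further down shows an `education-num_na` column); the code is not taken from fastai.

```python
import pandas as pd

frame = pd.DataFrame({'education-num': [12.0, 14.0, None, 15.0]})
col = 'education-num'
# Record which rows were missing before filling, then fill with the median
frame[col + '_na'] = frame[col].isna()
frame[col] = frame[col].fillna(frame[col].median())
print(frame)
```

Keeping the indicator column lets a model distinguish a genuinely observed median value from a filled-in one.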
ReadTabBatch
ReadTabBatch (to)
Transform TabularPandas values into a Tensor with the ability to decode
TabDataLoader
TabDataLoader (dataset, bs=16, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for tabular data
TabWeightedDL
TabWeightedDL (dataset, bs=16, wgts=None, shuffle=False, after_batch=None, num_workers=0, verbose:bool=False, do_setup:bool=True, pin_memory=False, timeout=0, batch_size=None, drop_last=False, indexed=None, n=None, device=None, persistent_workers=False, pin_memory_device='', wif=None, before_iter=None, after_item=None, before_batch=None, after_iter=None, create_batches=None, create_item=None, create_batch=None, retain=None, get_idxs=None, sample=None, shuffle_fn=None, do_batch=None)
A transformed DataLoader for weighted tabular data
Integration example
For a more in-depth explanation, see the tabular tutorial.
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_test.drop('salary', axis=1, inplace=True)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
to = TabularPandas(df_main, procs, cat_names, cont_names, y_names="salary", splits=splits)
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Self-emp-not-inc | Prof-school | Divorced | Prof-specialty | Not-in-family | White | False | 65.000000 | 316093.005287 | 15.0 | <50k |
1 | Private | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 69.999999 | 280306.998091 | 13.0 | <50k |
2 | Federal-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 34.000000 | 199933.999862 | 10.0 | >=50k |
3 | Private | HS-grad | Never-married | Handlers-cleaners | Unmarried | White | False | 24.000001 | 300584.002430 | 9.0 | <50k |
4 | Private | Assoc-voc | Never-married | Other-service | Not-in-family | White | False | 34.000000 | 220630.999335 | 11.0 | <50k |
5 | Private | Bachelors | Divorced | Prof-specialty | Unmarried | White | False | 45.000000 | 289230.003178 | 13.0 | >=50k |
6 | ? | Some-college | Never-married | ? | Own-child | White | False | 26.000000 | 208993.999494 | 10.0 | <50k |
7 | Private | Some-college | Divorced | Adm-clerical | Not-in-family | White | False | 43.000000 | 174574.999446 | 10.0 | <50k |
8 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Other-service | Husband | White | False | 63.000000 | 420628.997361 | 11.0 | <50k |
9 | State-gov | Some-college | Married-civ-spouse | Adm-clerical | Husband | Black | False | 25.000000 | 257064.003065 | 10.0 | <50k |
to.show()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
5516 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 49.0 | 140121.0 | 9.0 | <50k |
7184 | Self-emp-inc | Some-college | Never-married | Exec-managerial | Not-in-family | White | False | 70.0 | 207938.0 | 10.0 | <50k |
2336 | Private | Some-college | Never-married | Priv-house-serv | Own-child | White | False | 23.0 | 50953.0 | 10.0 | <50k |
4342 | Private | Assoc-voc | Married-civ-spouse | Machine-op-inspct | Husband | White | False | 46.0 | 27802.0 | 11.0 | <50k |
8474 | Self-emp-not-inc | Assoc-acdm | Married-civ-spouse | Craft-repair | Husband | White | False | 47.0 | 107231.0 | 12.0 | <50k |
5948 | Local-gov | HS-grad | Married-civ-spouse | Transport-moving | Husband | White | False | 40.0 | 55363.0 | 9.0 | <50k |
5342 | Local-gov | HS-grad | Married-civ-spouse | Craft-repair | Husband | White | False | 46.0 | 36228.0 | 9.0 | <50k |
9005 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | White | False | 38.0 | 297449.0 | 13.0 | >=50k |
1189 | Private | Assoc-voc | Divorced | Sales | Not-in-family | Amer-Indian-Eskimo | False | 31.0 | 87950.0 | 11.0 | <50k |
8784 | Private | Assoc-voc | Divorced | Prof-specialty | Own-child | Black | False | 35.0 | 491000.0 | 11.0 | <50k |
We can decode any row of transformed data by calling to.decode_row and passing in the raw row:
row = to.items.iloc[0]
to.decode_row(row)
age 49.0
workclass Private
fnlwgt 140121.0
education HS-grad
education-num 9.0
marital-status Divorced
occupation Exec-managerial
relationship Unmarried
race White
sex Male
capital-gain 0
capital-loss 0
hours-per-week 50
native-country United-States
salary <50k
education-num_na False
Name: 5516, dtype: object
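We can make new test datasets based on the training data with to.new().
Since a machine learning model cannot magically understand categories it was never trained on, the data should reflect this. If your test data contains different missing values, you should address this before training.
to_tst = to.new(df_test)
to_tst.process()
to_tst.items.head()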
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | education-num_na |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10000 | 0.465031 | 5 | 1.319553 | 10 | 1.176677 | 3 | 2 | 1 | 2 | Male | 0 | 0 | 40 | Philippines | 1 |
10001 | -0.926675 | 5 | 1.233650 | 12 | -0.420035 | 3 | 15 | 1 | 4 | Male | 0 | 0 | 40 | United-States | 1 |
10002 | 1.051012 | 5 | 0.145161 | 2 | -1.218391 | 1 | 9 | 2 | 5 | Female | 0 | 0 | 37 | United-States | 1 |
10003 | 0.538279 | 5 | -0.282370 | 12 | -0.420035 | 7 | 2 | 5 | 5 | Female | 0 | 0 | 43 | United-States | 1 |
10004 | 0.758022 | 6 | 1.420768 | 9 | 0.378321 | 3 | 5 | 1 | 5 | Male | 0 | 0 | 60 | United-States | 1 |
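The processed values above are category codes and normalized numbers. The general principle, that test data is encoded with the vocabulary and statistics learned from the training data, can be sketched in plain pandas. The frames, the choice of population standard deviation, and the convention that unseen categories map to code 0 are all assumptions made for this illustration.

```python
import pandas as pd

train = pd.DataFrame({'race': ['White', 'Black', 'White'], 'age': [30.0, 40.0, 50.0]})
test = pd.DataFrame({'race': ['Black', 'Other'], 'age': [40.0, 60.0]})

# Vocabulary and statistics are computed on the training data only
vocab = sorted(train['race'].unique())
mean, std = train['age'].mean(), train['age'].std(ddof=0)

# Categories absent from the training vocabulary get code -1, shifted to 0 here
test_codes = (pd.Categorical(test['race'], categories=vocab).codes + 1).tolist()
test_age = ((test['age'] - mean) / std).tolist()
print(test_codes, test_age)
```

The category 'Other' never appeared in training, so it collapses to the reserved code 0, which is why unseen categories and new kinds of missing values need attention before training.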
We can then convert it to a DataLoader:
tst_dl = dls.valid.new(to_tst)
tst_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Bachelors | Married-civ-spouse | Adm-clerical | Husband | Asian-Pac-Islander | False | 45.000000 | 338105.005817 | 13.0 |
1 | Private | HS-grad | Married-civ-spouse | Transport-moving | Husband | Other | False | 26.000000 | 328663.002806 | 9.0 |
2 | Private | 11th | Divorced | Other-service | Not-in-family | White | False | 52.999999 | 209022.000317 | 7.0 |
3 | Private | HS-grad | Widowed | Adm-clerical | Unmarried | White | False | 46.000000 | 162029.998917 | 9.0 |
4 | Self-emp-inc | Assoc-voc | Married-civ-spouse | Exec-managerial | Husband | White | False | 49.000000 | 349230.006300 | 11.0 |
5 | Local-gov | Some-college | Married-civ-spouse | Exec-managerial | Husband | White | False | 34.000000 | 124827.002059 | 10.0 |
6 | Self-emp-inc | Some-college | Married-civ-spouse | Sales | Husband | White | False | 52.999999 | 290640.002462 | 10.0 |
7 | Private | Some-college | Never-married | Sales | Own-child | White | False | 19.000000 | 106272.998239 | 10.0 |
8 | Private | Some-college | Married-civ-spouse | Protective-serv | Husband | Black | False | 71.999999 | 53684.001668 | 10.0 |
9 | Private | Some-college | Never-married | Sales | Own-child | White | False | 20.000000 | 505980.010609 | 10.0 |
# Create a TabWeightedDL
train_ds = to.train
weights = np.random.random(len(train_ds))
train_dl = TabWeightedDL(train_ds, wgts=weights, bs=64, shuffle=True)

train_dl.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Local-gov | Masters | Never-married | Prof-specialty | Not-in-family | White | False | 31.000000 | 204469.999932 | 14.0 | <50k |
1 | Self-emp-not-inc | HS-grad | Divorced | Farming-fishing | Not-in-family | White | False | 32.000000 | 34572.002104 | 9.0 | <50k |
2 | ? | Some-college | Widowed | ? | Not-in-family | White | False | 64.000000 | 34099.998990 | 10.0 | <50k |
3 | Private | Some-college | Divorced | Exec-managerial | Not-in-family | White | False | 32.000000 | 251242.999189 | 10.0 | >=50k |
4 | Federal-gov | HS-grad | Married-civ-spouse | Exec-managerial | Husband | White | False | 55.000001 | 176903.999313 | 9.0 | <50k |
5 | Private | 11th | Married-civ-spouse | Transport-moving | Husband | White | False | 50.000000 | 192203.000000 | 7.0 | <50k |
6 | Private | 10th | Never-married | Farming-fishing | Own-child | Black | False | 36.000000 | 181720.999704 | 6.0 | <50k |
7 | Local-gov | Masters | Divorced | Prof-specialty | Not-in-family | Amer-Indian-Eskimo | False | 50.000000 | 220640.001490 | 14.0 | >=50k |
8 | Private | HS-grad | Married-civ-spouse | Adm-clerical | Wife | White | False | 36.000000 | 189381.999993 | 9.0 | >=50k |
9 | Private | Masters | Divorced | Prof-specialty | Unmarried | White | False | 42.000000 | 265697.997341 | 14.0 | <50k |
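What per-item weights mean for sampling can be shown with NumPy alone. The sketch below assumes that weights are normalized into probabilities and that indices are drawn with replacement; whether TabWeightedDL samples in exactly this way is not stated on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = np.array([0.0, 1.0, 3.0, 0.0])
# Normalize so the weights form a probability distribution
probs = weights / weights.sum()
# Draw a batch of indices with replacement; zero-weight items are never drawn
idxs = rng.choice(len(weights), size=64, replace=True, p=probs)
print(np.bincount(idxs, minlength=len(weights)))
```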
TabDataLoader's create_item method
df = pd.DataFrame([{'age': 35}])
to = TabularPandas(df)
dls = to.dataloaders()
print(dls.create_item(0))
# test_eq(dls.create_item(0).items.to_dict(), {'age': 0.5330614747286777, 'workclass': 5, 'fnlwgt': -0.26305443080666174, 'education': 10, 'education-num': 1.169790230219763, 'marital-status': 1, 'occupation': 13, 'relationship': 5, 'race': 3, 'sex': ' Female', 'capital-gain': 0, 'capital-loss': 0, 'hours-per-week': 35, 'native-country': 'United-States', 'salary': 1, 'education-num_na': 1})
age 35
Name: 0, dtype: int8
Other target types
Multi-label categories
One-hot encoded label
def _mock_multi_label(df):
    sal,sex,white = [],[],[]
    for row in df.itertuples():
        sal.append(row.salary == '>=50k')
        sex.append(row.sex == ' Male')
        white.append(row.race == ' White')
    df['salary'] = np.array(sal)
    df['male'] = np.array(sex)
    df['white'] = np.array(white)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | male | white |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | True | False | True |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | True | True | True |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | False | False | False |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | True | True | False |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | False | False | False |
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
y_names=["salary", "male", "white"]
CPU times: user 66 ms, sys: 0 ns, total: 66 ms
Wall time: 65.3 ms
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | male | white |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 47.000000 | 164423.000013 | 9.0 | False | False | True |
1 | Private | Some-college | Married-civ-spouse | Transport-moving | Husband | White | False | 74.999999 | 239037.999499 | 10.0 | False | True | True |
2 | Private | HS-grad | Married-civ-spouse | Sales | Wife | White | False | 45.000000 | 228570.000761 | 9.0 | False | False | True |
3 | Self-emp-not-inc | HS-grad | Married-civ-spouse | Exec-managerial | Husband | Asian-Pac-Islander | False | 45.000000 | 285574.998753 | 9.0 | False | True | False |
4 | Private | Some-college | Never-married | Adm-clerical | Own-child | White | False | 21.999999 | 184812.999966 | 10.0 | False | True | True |
5 | Private | 10th | Married-civ-spouse | Transport-moving | Husband | White | False | 67.000001 | 274450.998865 | 6.0 | False | True | True |
6 | Private | HS-grad | Divorced | Exec-managerial | Unmarried | White | False | 53.999999 | 192862.000000 | 9.0 | False | False | True |
7 | Federal-gov | Some-college | Divorced | Tech-support | Unmarried | Amer-Indian-Eskimo | False | 37.000000 | 33486.997455 | 10.0 | False | False | False |
8 | Private | HS-grad | Never-married | Machine-op-inspct | Other-relative | White | False | 30.000000 | 219318.000010 | 9.0 | False | False | True |
9 | Self-emp-not-inc | Bachelors | Married-civ-spouse | Sales | Husband | White | False | 44.000000 | 167279.999960 | 13.0 | False | True | True |
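The row-by-row loop in `_mock_multi_label` can also be written with vectorized comparisons. The sketch below uses a small hand-made frame (hypothetical data, with the same leading-space string values that the helper compares against) and shows only the column construction, not the fastai pipeline.

```python
import pandas as pd

people = pd.DataFrame({
    'salary': ['>=50k', '<50k', '>=50k'],
    'sex': [' Female', ' Male', ' Male'],
    'race': [' White', ' Black', ' Asian-Pac-Islander'],
})
# Each comparison yields a boolean column, one per label
people['salary'] = people['salary'] == '>=50k'
people['male'] = people['sex'] == ' Male'
people['white'] = people['race'] == ' White'
print(people[['salary', 'male', 'white']])
```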
Not one-hot encoded
def _mock_multi_label(df):
    targ = []
    for row in df.itertuples():
        labels = []
        if row.salary == '>=50k': labels.append('>50k')
        if row.sex == ' Male': labels.append('male')
        if row.race == ' White': labels.append('white')
        targ.append(' '.join(labels))
    df['target'] = np.array(targ)
    return df
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
df_main.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k | >50k white |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k | >50k male white |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k | |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k | >50k male |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k | |
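A space-delimited target column like this one carries the same information as the one-hot version. In plain pandas, `Series.str.get_dummies` converts between the two; the sketch below uses a few hand-written target strings in the format shown in the table and is independent of fastai's MultiCategorize.

```python
import pandas as pd

target = pd.Series(['>50k white', '>50k male white', '', '>50k male'])
# Split each string on spaces and build one indicator column per distinct label
one_hot = target.str.get_dummies(sep=' ')
print(one_hot)
```

Rows with an empty target string end up with all indicators set to zero.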
@MultiCategorize
def encodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_apply_cats, {n: self.vocab for n in to.y_names}, 0))
    return to

@MultiCategorize
def decodes(self, to:Tabular):
    #to.transform(to.y_names, partial(_decode_cats, {n: self.vocab for n in to.y_names}))
    return to
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['age', 'fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 68.6 ms, sys: 0 ns, total: 68.6 ms
Wall time: 67.9 ms
to.procs[2].vocab
['-', '_', 'a', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y']
Regression
path = untar_data(URLs.ADULT_SAMPLE)
df = pd.read_csv(path/'adult.csv')
df_main,df_test = df.iloc[:10000].copy(),df.iloc[10000:].copy()
df_main = _mock_multi_label(df_main)
cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race']
cont_names = ['fnlwgt', 'education-num']
procs = [Categorify, FillMissing, Normalize]
splits = RandomSplitter()(range_of(df_main))
CPU times: user 70.7 ms, sys: 290 µs, total: 71 ms
Wall time: 70.3 ms
to.procs[-1].means
{'fnlwgt': 192085.701, 'education-num': 10.059124946594238}
dls = to.dataloaders()
dls.valid.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | fnlwgt | education-num | age |
---|---|---|---|---|---|---|---|---|---|---|
0 | Private | 12th | Never-married | Adm-clerical | Other-relative | Black | False | 503454.004078 | 8.0 | 47.0 |
1 | Federal-gov | Bachelors | Married-civ-spouse | Exec-managerial | Husband | White | False | 586656.993690 | 13.0 | 49.0 |
2 | Self-emp-not-inc | Assoc-voc | Married-civ-spouse | Farming-fishing | Husband | White | False | 164607.001243 | 11.0 | 29.0 |
3 | Private | HS-grad | Never-married | Adm-clerical | Not-in-family | Black | False | 155508.999873 | 9.0 | 48.0 |
4 | Private | 11th | Never-married | Other-service | Own-child | White | False | 318189.998679 | 7.0 | 18.0 |
5 | Private | HS-grad | Never-married | Adm-clerical | Other-relative | White | False | 140219.001104 | 9.0 | 47.0 |
6 | Private | Masters | Divorced | #na# | Unmarried | White | True | 235683.001562 | 10.0 | 47.0 |
7 | Private | Bachelors | Married-civ-spouse | Craft-repair | Husband | White | False | 187321.999825 | 13.0 | 43.0 |
8 | Private | Bachelors | Married-civ-spouse | Prof-specialty | Husband | White | False | 104196.002410 | 13.0 | 40.0 |
9 | Private | Some-college | Separated | Priv-house-serv | Other-relative | White | False | 184302.999784 | 10.0 | 25.0 |
Not being used now - for multi-modal
class TensorTabular(fastuple):
    def get_ctxs(self, max_n=10, **kwargs):
        n_samples = min(self[0].shape[0], max_n)
        df = pd.DataFrame(index = range(n_samples))
        return [df.iloc[i] for i in range(n_samples)]

    def display(self, ctxs): display_df(pd.DataFrame(ctxs))

class TabularLine(pd.Series):
    "A line of a dataframe that knows how to show itself"
    def show(self, ctx=None, **kwargs): return self if ctx is None else ctx.append(self)

class ReadTabLine(ItemTransform):
    def __init__(self, proc): self.proc = proc

    def encodes(self, row):
        cats,conts = (o.map(row.__getitem__) for o in (self.proc.cat_names,self.proc.cont_names))
        return TensorTabular(tensor(cats).long(),tensor(conts).float())

    def decodes(self, o):
        to = TabularPandas(o, self.proc.cat_names, self.proc.cont_names, self.proc.y_names)
        to = self.proc.decode(to)
        return TabularLine(pd.Series({c: v for v,c in zip(to.items[0]+to.items[1], self.proc.cat_names+self.proc.cont_names)}))

class ReadTabTarget(ItemTransform):
    def __init__(self, proc): self.proc = proc
    def encodes(self, row): return row[self.proc.y_names].astype(np.int64)
    def decodes(self, o): return Category(self.proc.classes[self.proc.y_names][o])
# tds = TfmdDS(to.items, tfms=[[ReadTabLine(proc)], ReadTabTarget(proc)])
# enc = tds[1]
# test_eq(enc[0][0], tensor([2,1]))
# test_close(enc[0][1], tensor([-0.628828]))
# test_eq(enc[1], 1)
# dec = tds.decode(enc)
# assert isinstance(dec[0], TabularLine)
# test_close(dec[0], pd.Series({'a': 1, 'b_na': False, 'b': 1}))
# test_eq(dec[1], 'a')
# test_stdout(lambda: print(show_at(tds, 1)), """a 1
# b_na False
# b 1
# category a
# dtype: object""")