# Tabular training

```python
from fastai.tabular.all import *
```
To illustrate the tabular application, we will use the example of the Adult dataset, where we have to predict whether a person earns more or less than $50k per year from some general data.
We can download a sample of this dataset with the usual untar_data command:
```python
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
```

```
(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]
```
Then we can have a look at how the data is structured:
```python
df = pd.read_csv(path/'adult.csv')
df.head()
```

|    | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Some of the columns are continuous (like age), and we will treat them as floats that we can feed directly to our model. Others are categorical (like workclass or education), and we will convert them to unique indices that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:
```python
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
```

The last part is the list of pre-processors we apply to our data:
- Categorify takes every categorical variable, builds a map from its unique category values to integer indices, and replaces each value with the corresponding index.
- FillMissing fills missing values in continuous variables with the median of the existing values (you can choose a specific value instead if you prefer).
- Normalize normalizes the continuous variables (subtracting the mean and dividing by the standard deviation).
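The three pre-processors can be sketched in plain pandas. This is only an illustrative approximation of what they do, not fastai's actual implementation, and the toy DataFrame below is made up:

```python
import pandas as pd

# Toy data: one categorical column and one continuous column with a missing value.
df = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "age": [49.0, 44.0, None, 38.0],
})

# Categorify: map each category to an index (0 reserved for unknown/#na#).
cat2idx = {c: i for i, c in enumerate(sorted(df["workclass"].dropna().unique()), start=1)}
df["workclass"] = df["workclass"].map(cat2idx).fillna(0).astype(int)

# FillMissing: record where values were missing, then fill with the median.
df["age_na"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Normalize: subtract the mean and divide by the standard deviation.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()

print(df)
```

After these steps the frame contains only numbers, which is exactly the fully encoded form we will see in to.xs below.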
To show what happens under the hood, let's rewrite this using fastai's TabularPandas class. One adjustment we need to make is to define how we want to split our data. By default, the factory method above used a random 80/20 split, so we will do the same:
```python
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)
```

Once we build our TabularPandas object, our data is completely preprocessed, as seen below:
```python
to.xs.iloc[:2]
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 15780 | 2 | 16 | 1 | 5 | 2 | 5 | 1 | 0.984037 | 2.210372 | -0.033692 |
| 17442 | 5 | 12 | 5 | 8 | 2 | 5 | 1 | -1.509555 | -0.319624 | -0.425324 |
Now we can rebuild our DataLoaders:
```python
dls = to.dataloaders(bs=64)
```

Later we will explore why preprocessing with TabularPandas is valuable.
The show_batch method works like in every other application:
```python
dls.show_batch()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | 41.000000 | 75409.001182 | 13.0 | >=50k |
| 1 | Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | 24.000000 | 38455.005013 | 10.0 | <50k |
| 2 | Private | Assoc-acdm | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 101299.003093 | 12.0 | <50k |
| 3 | Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | 42.000000 | 227465.999281 | 9.0 | <50k |
| 4 | State-gov | Some-college | Never-married | Prof-specialty | Not-in-family | White | False | 20.999999 | 258489.997130 | 10.0 | <50k |
| 5 | Local-gov | 12th | Married-civ-spouse | Tech-support | Husband | White | False | 39.000000 | 207853.000067 | 8.0 | <50k |
| 6 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | False | 36.000000 | 238414.998930 | 11.0 | >=50k |
| 7 | Private | HS-grad | Never-married | Craft-repair | Not-in-family | White | False | 19.000000 | 445727.998937 | 9.0 | <50k |
| 8 | Local-gov | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 59.000000 | 196013.000174 | 10.0 | >=50k |
| 9 | Private | HS-grad | Married-civ-spouse | Prof-specialty | Wife | Black | False | 39.000000 | 147500.000403 | 9.0 | <50k |
We can define a model using the tabular_learner method. When we define our model, fastai will try to infer the loss function based on the y_names we set earlier.
Note: Sometimes with tabular data, your y's may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = CategoryBlock in your constructor so fastai won't presume you are doing regression.
```python
learn = tabular_learner(dls, metrics=accuracy)
```

We can train that model with the fit_one_cycle method (the fine_tune method won't be useful here since we don't have a pretrained model):
```python
learn.fit_one_cycle(1)
```

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.369360 | 0.348096 | 0.840756 | 00:05 |
We can then have a look at some predictions:
```python
learn.show_results()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.324868 | -1.138177 | -0.424022 | 0.0 | 0.0 |
| 1 | 5.0 | 10.0 | 5.0 | 2.0 | 2.0 | 5.0 | 1.0 | -0.482055 | -1.351911 | 1.148438 | 0.0 | 0.0 |
| 2 | 5.0 | 12.0 | 6.0 | 12.0 | 3.0 | 5.0 | 1.0 | -0.775482 | 0.138709 | -0.424022 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 5.0 | 2.0 | 4.0 | 4.0 | 1.0 | -1.362335 | -0.227515 | -0.030907 | 0.0 | 0.0 |
| 4 | 5.0 | 2.0 | 5.0 | 0.0 | 4.0 | 5.0 | 1.0 | -1.509048 | -0.191191 | -1.210252 | 0.0 | 0.0 |
| 5 | 5.0 | 16.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | 1.498575 | -0.051096 | -0.030907 | 1.0 | 1.0 |
| 6 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.555412 | 0.039167 | -0.424022 | 0.0 | 0.0 |
| 7 | 5.0 | 1.0 | 5.0 | 6.0 | 4.0 | 5.0 | 1.0 | -1.582405 | -1.396391 | -1.603367 | 0.0 | 0.0 |
| 8 | 5.0 | 3.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -1.362335 | 0.158354 | -0.817137 | 0.0 | 0.0 |
Or use the predict method on a row:
```python
row, clas, probs = learn.predict(df.iloc[0])
row.show()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.99788 | 12.0 | >=50k |
```python
clas, probs
```

```
(tensor(1), tensor([0.4995, 0.5005]))
```
To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to contain the dependent variable in its columns:
```python
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
```

Then Learner.get_preds will give you the predictions:
```python
learn.get_preds(dl=dl)
```

```
(tensor([[0.4995, 0.5005],
         [0.4882, 0.5118],
         [0.9824, 0.0176],
         ...,
         [0.5324, 0.4676],
         [0.7628, 0.2372],
         [0.5934, 0.4066]]), None)
```
Note: Since machine learning models can't magically understand categories they were never trained on, your data should reflect this. If there are different missing values in your test data, you should address this before training.
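To see why this matters, recall that the category-to-index mapping is built from the training data only, with index 0 reserved for unknown/#na# values; any category that never appeared in training collapses into that single "unknown" bucket and carries no information. A rough pandas sketch of that behaviour (toy values, not fastai's code):

```python
import pandas as pd

train = pd.Series(["Private", "State-gov", "Private"])
test = pd.Series(["Private", "Never-seen-before"])

# Build the vocabulary from training data only; 0 is reserved for unknowns.
vocab = {c: i for i, c in enumerate(sorted(train.unique()), start=1)}

encoded_test = test.map(vocab).fillna(0).astype(int)
print(encoded_test.tolist())  # the unseen category falls into bucket 0
```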
## fastai with Other Libraries
As mentioned earlier, TabularPandas is a powerful and easy preprocessing tool for tabular data. Integration with libraries such as Random Forests and XGBoost requires only one extra step, which the .dataloaders call did for us. Let's look at our to object again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want:
```python
to.xs[:3]
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 25387 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 0.471582 | -1.467756 | -0.030907 |
| 16872 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.215622 | -0.649792 | -0.030907 |
| 25852 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 1.865358 | -0.218915 | -0.030907 |
Now that everything is encoded, you can send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:
```python
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
```

And now we can directly send this in!
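As a sketch of that last step, here is what handing such encoded arrays to scikit-learn's RandomForestClassifier could look like. The arrays below are synthetic stand-ins for to.train.xs and to.train.ys (fastai and the Adult data are not loaded here), so treat the numbers as illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fully encoded fastai outputs:
# an integer-coded categorical column plus a normalized continuous column.
X_train = np.column_stack([
    rng.integers(0, 5, size=200),          # a "categorical" column as indices
    rng.normal(size=200),                  # a normalized continuous column
])
y_train = (X_train[:, 1] > 0).astype(int)  # toy binary target

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
```

Because TabularPandas has already turned every column into numbers, nothing else is required before calling fit.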