Tabular training

from fastai.tabular.all import *
To illustrate the tabular application, we will use the example of the Adult dataset, where we have to predict whether a person earns more or less than $50k per year based on some general data.

We can download a sample of this dataset with the usual untar_data command:
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]
Then we can have a look at how the data is structured:
df = pd.read_csv(path/'adult.csv')
df.head()
| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Some of the columns are continuous (like age) and we will treat them as float numbers we can feed directly to our model. Others are categorical (like workclass or education) and we will convert them to a unique index that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory method:
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
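If you are unsure which columns should be treated as categorical versus continuous, fastai ships a cont_cat_split helper (exported by the fastai.tabular.all import above) that suggests a split based on column cardinality. A minimal sketch, assuming the default cutoff of 20 unique values is suitable for this dataset:

```python
# Suggest continuous vs. categorical columns, excluding the dependent variable.
# Numeric columns with at most max_card unique values are treated as categorical.
cont_names, cat_names = cont_cat_split(df, max_card=20, dep_var='salary')
```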
The last part of the call above is the list of preprocessors we apply to our data:

Categorify will take every categorical variable and build a mapping from integers to the unique categories, then replace the values with the corresponding index. FillMissing will fill the missing values in the continuous variables with the median of the existing values (you can pick a specific value if you prefer). Normalize will normalize the continuous variables (subtract the mean and divide by the standard deviation).
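To make the effect of each preprocessor concrete, here is a rough pandas-only sketch of what they do conceptually; this is not fastai's actual implementation, just an illustration on a fresh copy of the data:

```python
df_sketch = pd.read_csv(path/'adult.csv')

# Categorify (conceptually): replace each category of 'workclass' with an integer code
df_sketch['workclass'] = df_sketch['workclass'].astype('category').cat.codes

# FillMissing (conceptually): fill missing 'education-num' values with the column median
df_sketch['education-num'] = df_sketch['education-num'].fillna(df_sketch['education-num'].median())

# Normalize (conceptually): subtract the mean and divide by the standard deviation
df_sketch['age'] = (df_sketch['age'] - df_sketch['age'].mean()) / df_sketch['age'].std()
```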
To further expose what is going on below the surface, let's rewrite this using fastai's TabularPandas class. We need to make one adjustment, which is defining how we want to split our data. By default, the factory method above used a random 80/20 split, so we will do the same:
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)
Once we build our TabularPandas object, our data is completely preprocessed, as seen below:
to.xs.iloc[:2]
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
---|---|---|---|---|---|---|---|---|---|---|
15780 | 2 | 16 | 1 | 5 | 2 | 5 | 1 | 0.984037 | 2.210372 | -0.033692 |
17442 | 5 | 12 | 5 | 8 | 2 | 5 | 1 | -1.509555 | -0.319624 | -0.425324 |
Now we can build our DataLoaders again:
dls = to.dataloaders(bs=64)
Later we will explore why preprocessing with TabularPandas is valuable.

The show_batch method works like in every other application:
dls.show_batch()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | 41.000000 | 75409.001182 | 13.0 | >=50k |
1 | Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | 24.000000 | 38455.005013 | 10.0 | <50k |
2 | Private | Assoc-acdm | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 101299.003093 | 12.0 | <50k |
3 | Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | 42.000000 | 227465.999281 | 9.0 | <50k |
4 | State-gov | Some-college | Never-married | Prof-specialty | Not-in-family | White | False | 20.999999 | 258489.997130 | 10.0 | <50k |
5 | Local-gov | 12th | Married-civ-spouse | Tech-support | Husband | White | False | 39.000000 | 207853.000067 | 8.0 | <50k |
6 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | False | 36.000000 | 238414.998930 | 11.0 | >=50k |
7 | Private | HS-grad | Never-married | Craft-repair | Not-in-family | White | False | 19.000000 | 445727.998937 | 9.0 | <50k |
8 | Local-gov | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 59.000000 | 196013.000174 | 10.0 | >=50k |
9 | Private | HS-grad | Married-civ-spouse | Prof-specialty | Wife | Black | False | 39.000000 | 147500.000403 | 9.0 | <50k |
We can define a model using the tabular_learner method. When we define our model, fastai will try to infer the loss function based on the y_names we set earlier.
Note: sometimes with tabular data, your y values may already be encoded (such as 0 and 1). In that case you should explicitly pass y_block = CategoryBlock in your constructor so fastai won't presume you are doing regression.
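A minimal sketch of what that could look like, assuming a hypothetical dataframe df_enc whose salary column is already encoded as 0/1:

```python
# The dependent variable is already numeric, so pass y_block=CategoryBlock()
# explicitly to get classification rather than regression.
dls_enc = TabularDataLoaders.from_df(df_enc, path=path, y_names="salary",
    y_block=CategoryBlock(),
    cat_names=['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names=['age', 'fnlwgt', 'education-num'],
    procs=[Categorify, FillMissing, Normalize])
```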
learn = tabular_learner(dls, metrics=accuracy)
We can train that model with the fit_one_cycle method (the fine_tune method won't be useful here since we don't have a pretrained model):
learn.fit_one_cycle(1)
epoch | train_loss | valid_loss | accuracy | time |
---|---|---|---|---|
0 | 0.369360 | 0.348096 | 0.840756 | 00:05 |
We can then have a look at some predictions:
learn.show_results()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 5.0 | 12.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.324868 | -1.138177 | -0.424022 | 0.0 | 0.0 |
1 | 5.0 | 10.0 | 5.0 | 2.0 | 2.0 | 5.0 | 1.0 | -0.482055 | -1.351911 | 1.148438 | 0.0 | 0.0 |
2 | 5.0 | 12.0 | 6.0 | 12.0 | 3.0 | 5.0 | 1.0 | -0.775482 | 0.138709 | -0.424022 | 0.0 | 0.0 |
3 | 5.0 | 16.0 | 5.0 | 2.0 | 4.0 | 4.0 | 1.0 | -1.362335 | -0.227515 | -0.030907 | 0.0 | 0.0 |
4 | 5.0 | 2.0 | 5.0 | 0.0 | 4.0 | 5.0 | 1.0 | -1.509048 | -0.191191 | -1.210252 | 0.0 | 0.0 |
5 | 5.0 | 16.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | 1.498575 | -0.051096 | -0.030907 | 1.0 | 1.0 |
6 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.555412 | 0.039167 | -0.424022 | 0.0 | 0.0 |
7 | 5.0 | 1.0 | 5.0 | 6.0 | 4.0 | 5.0 | 1.0 | -1.582405 | -1.396391 | -1.603367 | 0.0 | 0.0 |
8 | 5.0 | 3.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -1.362335 | 0.158354 | -0.817137 | 0.0 | 0.0 |
Or use the predict method on a row:
row, clas, probs = learn.predict(df.iloc[0])
row.show()
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.99788 | 12.0 | >=50k |
clas, probs
(tensor(1), tensor([0.4995, 0.5005]))
To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to have the dependent variable in its columns:
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
Then Learner.get_preds will give you the predictions:
learn.get_preds(dl=dl)
(tensor([[0.4995, 0.5005],
[0.4882, 0.5118],
[0.9824, 0.0176],
...,
[0.5324, 0.4676],
[0.7628, 0.2372],
[0.5934, 0.4066]]), None)
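If you want hard class labels rather than probabilities, a small sketch of decoding them with argmax (here index 1 corresponds to '>=50k', matching the predict example above):

```python
preds, _ = learn.get_preds(dl=dl)
# Index of the highest-probability class for each row (0 or 1)
pred_idx = preds.argmax(dim=1)
pred_idx[:5]
```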
Note: since machine learning models can't magically understand categories they were never trained on, your data should reflect this. If there are different missing values in your test data, you should address them before training.
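As a quick sanity check, a pandas-only sketch that compares which columns contain missing values in the training dataframe versus the new test dataframe:

```python
# Columns with missing values in the original data vs. the new test data
train_missing = set(df.columns[df.isna().any()])
test_missing = set(test_df.columns[test_df.isna().any()])
print("Missing only in the test data:", test_missing - train_missing)
```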
fastai with other libraries
As mentioned earlier, TabularPandas is a powerful and easy preprocessing tool for tabular data. Integration with libraries such as Random Forests and XGBoost requires only one extra step, which the .dataloaders call did for us. Let's look at our to object again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want to:
to.xs[:3]
| | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
---|---|---|---|---|---|---|---|---|---|---|
25387 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 0.471582 | -1.467756 | -0.030907 |
16872 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.215622 | -0.649792 | -0.030907 |
25852 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 1.865358 | -0.218915 | -0.030907 |
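Likewise, a small sketch of the other accessors mentioned above (outputs omitted): cats returns just the encoded categorical columns, conts the normalized continuous columns, and ys the targets:

```python
to.cats[:3]   # encoded categorical columns only
to.conts[:3]  # normalized continuous columns only
to.ys[:3]     # the dependent variable
```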
Now that everything is encoded, you can send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
And now we can directly send this in!
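For example, a minimal sketch of handing these arrays to scikit-learn's RandomForestClassifier (assuming scikit-learn is installed; this is not part of the tutorial's recorded output):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Fit a random forest on the fastai-preprocessed features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# Evaluate on the validation split produced by RandomSplitter
print("validation accuracy:", accuracy_score(y_test, forest.predict(X_test)))
```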