# Tabular training

```python
from fastai.tabular.all import *
```
To illustrate the tabular application, we will use the example of the Adult dataset, where we have to predict whether a person earns more or less than $50k per year from some general data.
We can download a sample of this dataset with the usual untar_data command:
```python
path = untar_data(URLs.ADULT_SAMPLE)
path.ls()
```

```
(#3) [Path('/home/ml1/.fastai/data/adult_sample/models'),Path('/home/ml1/.fastai/data/adult_sample/export.pkl'),Path('/home/ml1/.fastai/data/adult_sample/adult.csv')]
```
Then we can have a look at how the data is structured:
```python
df = pd.read_csv(path/'adult.csv')
df.head()
```

|    | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 49 | Private | 101320 | Assoc-acdm | 12.0 | Married-civ-spouse | NaN | Wife | White | Female | 0 | 1902 | 40 | United-States | >=50k |
| 1 | 44 | Private | 236746 | Masters | 14.0 | Divorced | Exec-managerial | Not-in-family | White | Male | 10520 | 0 | 45 | United-States | >=50k |
| 2 | 38 | Private | 96185 | HS-grad | NaN | Divorced | NaN | Unmarried | Black | Female | 0 | 0 | 32 | United-States | <50k |
| 3 | 38 | Self-emp-inc | 112847 | Prof-school | 15.0 | Married-civ-spouse | Prof-specialty | Husband | Asian-Pac-Islander | Male | 0 | 0 | 40 | United-States | >=50k |
| 4 | 42 | Self-emp-not-inc | 82297 | 7th-8th | NaN | Married-civ-spouse | Other-service | Wife | Black | Female | 0 | 0 | 50 | United-States | <50k |
Some of the columns are continuous (like age), and we will treat them as floats that we can feed directly to our model. Others are categorical (like workclass or education), and we will convert them to unique indices that we will feed to embedding layers. We can specify our categorical and continuous column names, as well as the name of the dependent variable, in the TabularDataLoaders factory methods:
```python
dls = TabularDataLoaders.from_csv(path/'adult.csv', path=path, y_names="salary",
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    procs = [Categorify, FillMissing, Normalize])
```

The last part is the list of pre-processors we apply to our data:
- Categorify takes every categorical variable, builds a map from its unique category values to integer indices, and replaces each value with the corresponding index.
- FillMissing fills missing values in continuous variables with the median of the existing values (you can choose a specific value instead if you prefer).
- Normalize normalizes the continuous variables (subtracting the mean and dividing by the standard deviation).
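The three pre-processors can be sketched in plain pandas. This is only an illustrative approximation of what they do, not fastai's actual implementation, and the toy DataFrame below is made up:

```python
import pandas as pd

# Toy data: one categorical column and one continuous column with a missing value.
df = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "age": [49.0, 44.0, None, 38.0],
})

# Categorify: map each category to an index (0 reserved for unknown/#na#).
cat2idx = {c: i for i, c in enumerate(sorted(df["workclass"].dropna().unique()), start=1)}
df["workclass"] = df["workclass"].map(cat2idx).fillna(0).astype(int)

# FillMissing: record where values were missing, then fill with the median.
df["age_na"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Normalize: subtract the mean and divide by the standard deviation.
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()

print(df)
```

After these steps the frame contains only numbers, which is exactly the fully encoded form we will see in to.xs below.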
To show what happens under the hood, let's rewrite this using fastai's TabularPandas class. One adjustment we need to make is to define how we want to split our data. By default, the factory method above used a random 80/20 split, so we will do the same:
```python
splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
    cat_names = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race'],
    cont_names = ['age', 'fnlwgt', 'education-num'],
    y_names='salary',
    splits=splits)
```

Once we build our TabularPandas object, our data is completely preprocessed, as seen below:
```python
to.xs.iloc[:2]
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 15780 | 2 | 16 | 1 | 5 | 2 | 5 | 1 | 0.984037 | 2.210372 | -0.033692 |
| 17442 | 5 | 12 | 5 | 8 | 2 | 5 | 1 | -1.509555 | -0.319624 | -0.425324 |
Now we can rebuild our DataLoaders:
```python
dls = to.dataloaders(bs=64)
```

Later we will explore why preprocessing with TabularPandas is valuable.
The show_batch method works like in every other application:
```python
dls.show_batch()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | State-gov | Bachelors | Married-civ-spouse | Prof-specialty | Wife | White | False | 41.000000 | 75409.001182 | 13.0 | >=50k |
| 1 | Private | Some-college | Never-married | Craft-repair | Not-in-family | White | False | 24.000000 | 38455.005013 | 10.0 | <50k |
| 2 | Private | Assoc-acdm | Married-civ-spouse | Prof-specialty | Husband | White | False | 48.000000 | 101299.003093 | 12.0 | <50k |
| 3 | Private | HS-grad | Never-married | Other-service | Other-relative | Black | False | 42.000000 | 227465.999281 | 9.0 | <50k |
| 4 | State-gov | Some-college | Never-married | Prof-specialty | Not-in-family | White | False | 20.999999 | 258489.997130 | 10.0 | <50k |
| 5 | Local-gov | 12th | Married-civ-spouse | Tech-support | Husband | White | False | 39.000000 | 207853.000067 | 8.0 | <50k |
| 6 | Private | Assoc-voc | Married-civ-spouse | Sales | Husband | White | False | 36.000000 | 238414.998930 | 11.0 | >=50k |
| 7 | Private | HS-grad | Never-married | Craft-repair | Not-in-family | White | False | 19.000000 | 445727.998937 | 9.0 | <50k |
| 8 | Local-gov | Bachelors | Married-civ-spouse | #na# | Husband | White | True | 59.000000 | 196013.000174 | 10.0 | >=50k |
| 9 | Private | HS-grad | Married-civ-spouse | Prof-specialty | Wife | Black | False | 39.000000 | 147500.000403 | 9.0 | <50k |
We can define a model using the tabular_learner method. When we define our model, fastai will try to infer the loss function based on the y_names we set earlier.
Note: Sometimes with tabular data, your y's may be encoded (such as 0 and 1). In such a case you should explicitly pass y_block = CategoryBlock in your constructor so fastai won't presume you are doing regression.
```python
learn = tabular_learner(dls, metrics=accuracy)
```

We can train that model with the fit_one_cycle method (the fine_tune method won't be useful here since we don't have a pretrained model):
```python
learn.fit_one_cycle(1)
```

| epoch | train_loss | valid_loss | accuracy | time |
|---|---|---|---|---|
| 0 | 0.369360 | 0.348096 | 0.840756 | 00:05 |
We can then have a look at some predictions:
```python
learn.show_results()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary | salary_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5.0 | 12.0 | 3.0 | 8.0 | 1.0 | 5.0 | 1.0 | 0.324868 | -1.138177 | -0.424022 | 0.0 | 0.0 |
| 1 | 5.0 | 10.0 | 5.0 | 2.0 | 2.0 | 5.0 | 1.0 | -0.482055 | -1.351911 | 1.148438 | 0.0 | 0.0 |
| 2 | 5.0 | 12.0 | 6.0 | 12.0 | 3.0 | 5.0 | 1.0 | -0.775482 | 0.138709 | -0.424022 | 0.0 | 0.0 |
| 3 | 5.0 | 16.0 | 5.0 | 2.0 | 4.0 | 4.0 | 1.0 | -1.362335 | -0.227515 | -0.030907 | 0.0 | 0.0 |
| 4 | 5.0 | 2.0 | 5.0 | 0.0 | 4.0 | 5.0 | 1.0 | -1.509048 | -0.191191 | -1.210252 | 0.0 | 0.0 |
| 5 | 5.0 | 16.0 | 3.0 | 13.0 | 1.0 | 5.0 | 1.0 | 1.498575 | -0.051096 | -0.030907 | 1.0 | 1.0 |
| 6 | 5.0 | 12.0 | 3.0 | 15.0 | 1.0 | 5.0 | 1.0 | -0.555412 | 0.039167 | -0.424022 | 0.0 | 0.0 |
| 7 | 5.0 | 1.0 | 5.0 | 6.0 | 4.0 | 5.0 | 1.0 | -1.582405 | -1.396391 | -1.603367 | 0.0 | 0.0 |
| 8 | 5.0 | 3.0 | 5.0 | 13.0 | 2.0 | 5.0 | 1.0 | -1.362335 | 0.158354 | -0.817137 | 0.0 | 0.0 |
Or use the predict method on a row:
```python
row, clas, probs = learn.predict(df.iloc[0])
row.show()
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Private | Assoc-acdm | Married-civ-spouse | #na# | Wife | White | False | 49.0 | 101319.99788 | 12.0 | >=50k |
```python
clas, probs
```

```
(tensor(1), tensor([0.4995, 0.5005]))
```
To get predictions on a new dataframe, you can use the test_dl method of the DataLoaders. That dataframe does not need to contain the dependent variable in its columns:
```python
test_df = df.copy()
test_df.drop(['salary'], axis=1, inplace=True)
dl = learn.dls.test_dl(test_df)
```

Then Learner.get_preds will give you the predictions:
```python
learn.get_preds(dl=dl)
```

```
(tensor([[0.4995, 0.5005],
         [0.4882, 0.5118],
         [0.9824, 0.0176],
         ...,
         [0.5324, 0.4676],
         [0.7628, 0.2372],
         [0.5934, 0.4066]]), None)
```
Note: Since machine learning models can't magically understand categories they were never trained on, your data should reflect this. If there are different missing values in your test data, you should address this before training.
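To see why this matters, recall that the category-to-index mapping is built from the training data only, with index 0 reserved for unknown/#na# values; any category that never appeared in training collapses into that single "unknown" bucket and carries no information. A rough pandas sketch of that behaviour (toy values, not fastai's code):

```python
import pandas as pd

train = pd.Series(["Private", "State-gov", "Private"])
test = pd.Series(["Private", "Never-seen-before"])

# Build the vocabulary from training data only; 0 is reserved for unknowns.
vocab = {c: i for i, c in enumerate(sorted(train.unique()), start=1)}

encoded_test = test.map(vocab).fillna(0).astype(int)
print(encoded_test.tolist())  # the unseen category falls into bucket 0
```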
## fastai with Other Libraries
As mentioned earlier, TabularPandas is a powerful and easy preprocessing tool for tabular data. Integration with libraries such as Random Forests and XGBoost requires only one extra step, which the .dataloaders call did for us. Let's look at our to object again. Its values are stored in a DataFrame-like object, from which we can extract the cats, conts, xs and ys if we want:
```python
to.xs[:3]
```

|    | workclass | education | marital-status | occupation | relationship | race | education-num_na | age | fnlwgt | education-num |
|---|---|---|---|---|---|---|---|---|---|---|
| 25387 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 0.471582 | -1.467756 | -0.030907 |
| 16872 | 1 | 16 | 5 | 1 | 4 | 5 | 1 | -1.215622 | -0.649792 | -0.030907 |
| 25852 | 5 | 16 | 3 | 5 | 1 | 5 | 1 | 1.865358 | -0.218915 | -0.030907 |
Now that everything is encoded, you can send this off to XGBoost or Random Forests by extracting the train and validation sets and their values:
```python
X_train, y_train = to.train.xs, to.train.ys.values.ravel()
X_test, y_test = to.valid.xs, to.valid.ys.values.ravel()
```

And now we can directly send this in!
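As a sketch of that last step, here is what handing such encoded arrays to scikit-learn's RandomForestClassifier could look like. The arrays below are synthetic stand-ins for to.train.xs and to.train.ys (fastai and the Adult data are not loaded here), so treat the numbers as illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fully encoded fastai outputs:
# an integer-coded categorical column plus a normalized continuous column.
X_train = np.column_stack([
    rng.integers(0, 5, size=200),          # a "categorical" column as indices
    rng.normal(size=200),                  # a normalized continuous column
])
y_train = (X_train[:, 1] > 0).astype(int)  # toy binary target

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
```

Because TabularPandas has already turned every column into numbers, nothing else is required before calling fit.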