Feature Selection

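Feature selection is an important step in practical machine learning. Most datasets come with far more features than a model needs, so identifying the useful ones deserves attention.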

1. Feature selection: All-relevant selection with the Boruta package

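Feature selection can be approached in two ways:

(1) Minimal-optimal feature selection identifies a small set of features (ideally the smallest possible) that gives the best achievable classification result;

(2) All-relevant feature selection identifies every feature that is relevant to the classification.

This section uses the Boruta package, which is built on the random forest classification algorithm and measures the importance of each feature as a z-score.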
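As a minimal sketch of all-relevant selection, the snippet below runs Boruta on the Pima Indians Diabetes data that the caret examples in this article also use. The dataset choice, the seed and the doTrace setting are assumptions made for illustration; the original text does not prescribe them.

```r
# Sketch: all-relevant feature selection with Boruta
# (dataset, seed and doTrace value are illustrative assumptions)
library(mlbench)
library(Boruta)

set.seed(7)
data(PimaIndiansDiabetes)

# Boruta compares each feature's random forest importance (a z-score)
# with the importance of shuffled "shadow" copies of the features
boruta_fit <- Boruta(diabetes ~ ., data = PimaIndiansDiabetes, doTrace = 0)

# decision per feature: Confirmed, Tentative or Rejected
print(boruta_fit)

# importance statistics (z-scores) and the final decision for every feature
print(attStats(boruta_fit))

# names of the features confirmed as relevant
confirmed <- getSelectedAttributes(boruta_fit, withTentative = FALSE)
print(confirmed)
```

Features left as Tentative can be included in the returned names by passing withTentative = TRUE instead.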

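2. Using the caret package

Recursive feature elimination is performed with the rfe function. Its main arguments are:

x, a matrix or data frame of predictor variables

y, a vector of outcomes (numeric or factor)

sizes, an integer vector of the specific subset sizes to test

rfeControl, a list of options specifying the prediction model and the resampling method

Several predefined sets of functions can be supplied as rfeControl$functions: linear regression (lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs), bagged trees (treebagFuncs), and functions that work with caret's train function (caretFuncs).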
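To show how the four arguments fit together, here is a small sketch that uses the linear-regression function set (lmFuncs) on a numeric outcome. The built-in mtcars dataset, the 5-fold cross-validation and the subset sizes are assumptions chosen only to keep the example short.

```r
# Sketch: the rfe arguments x, y, sizes and rfeControl with lmFuncs
# (mtcars, 5-fold CV and the sizes below are illustrative assumptions)
library(caret)

set.seed(7)

x <- mtcars[, -1]   # predictors: every column except mpg
y <- mtcars$mpg     # numeric outcome

# rfeControl selects the function set and the resampling method
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

# sizes lists the subset sizes to evaluate; the full predictor set is evaluated as well
lm_profile <- rfe(x = x, y = y, sizes = c(2, 4, 6, 8), rfeControl = ctrl)

# resampled performance per subset size and the selected predictors
print(lm_profile)
print(predictors(lm_profile))
```

Swapping lmFuncs for rfFuncs, nbFuncs or treebagFuncs changes the model used at each step while the rest of the call stays the same; nbFuncs requires a factor outcome.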

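1) Removing redundant features

Highly correlated features can be removed.

The caret R package provides the findCorrelation function, which analyzes the correlation matrix of the features and reports the redundant ones that can be removed.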

set.seed(7)

# load the library

library(mlbench)

library(caret)

# load the data

data(PimaIndiansDiabetes)

# calculate correlation matrix

correlationMatrix <- cor(PimaIndiansDiabetes[,1:8])

# summarize the correlation matrix

print(correlationMatrix)

# find attributes that are highly correlated (ideally > 0.75; a looser cutoff of 0.5 is used here)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)

# print indexes of highly correlated attributes

print(highlyCorrelated)
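findCorrelation only returns the column indexes it suggests removing; dropping the columns is a separate step. The sketch below shows one way to do that. The variable names predictors_df and reduced are hypothetical, and the guard is needed because negative indexing with an empty vector would drop every column.

```r
# Sketch: drop the columns flagged by findCorrelation
# (predictors_df and reduced are hypothetical names)
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)
predictors_df <- PimaIndiansDiabetes[, 1:8]

correlationMatrix <- cor(predictors_df)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.5)

# with no flagged columns, predictors_df[, -integer(0)] would return zero columns
if (length(highlyCorrelated) > 0) {
  reduced <- predictors_df[, -highlyCorrelated, drop = FALSE]
} else {
  reduced <- predictors_df
}

# names of the features that remain
print(names(reduced))
```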

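2) Ranking features by importance

Feature importance can be estimated by building a model. Some models, such as decision trees, have a built-in mechanism for reporting feature importance. For other models, the importance of each feature is estimated with an ROC curve analysis.

The example below loads the Pima Indians Diabetes dataset and builds a Learning Vector Quantization (LVQ) model. varImp is then used to obtain the feature importance. The resulting plot shows that glucose, mass and age are the three most important features and that insulin is the least important.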

# ensure results are repeatable

set.seed(7)

# load the library

library(mlbench)

library(caret)

# load the dataset

data(PimaIndiansDiabetes)

# prepare training scheme

control <- trainControl(method="repeatedcv", number=10, repeats=3)

# train the model

model <- train(diabetes~., data=PimaIndiansDiabetes, method="lvq", preProcess="scale", trControl=control)

# estimate variable importance

importance <- varImp(model, scale=FALSE)

# summarize importance

print(importance)

# plot importance

plot(importance)
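The ROC-based importance mentioned above can also be computed without training a model, using caret's filterVarImp function, which evaluates each predictor on its own. This is a sketch of that model-free route, not a reproduction of the LVQ result; the name roc_imp is hypothetical.

```r
# Sketch: model-free, ROC-based importance for each predictor
# (roc_imp is a hypothetical name)
library(mlbench)
library(caret)

data(PimaIndiansDiabetes)

# each predictor is scored by the area under its own ROC curve
roc_imp <- filterVarImp(x = PimaIndiansDiabetes[, 1:8],
                        y = PimaIndiansDiabetes$diabetes)

# the result has one column per class level; sort by the first one
roc_imp <- roc_imp[order(roc_imp[, 1], decreasing = TRUE), , drop = FALSE]
print(roc_imp)
```

Because every predictor is scored in isolation, this ranking ignores interactions between features, which is one reason to compare it with a model-based ranking.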

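3) Feature selection

Automatic feature selection builds many models on different subsets of the features and identifies which features help build an accurate model and which contribute little.

A popular automatic method for feature selection is Recursive Feature Elimination (RFE).

The example below applies RFE to the Pima Indians Diabetes dataset. A random forest is used to evaluate the model in each iteration. The sizes argument is set so that subsets of every size from 1 to 8 are evaluated; RFE ranks the features and tests one subset per size rather than every possible combination. The resulting plot shows that using only 4 features already gives performance close to the best.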

# ensure the results are repeatable

set.seed(7)

# load the library

library(mlbench)

library(caret)

# load the data

data(PimaIndiansDiabetes)

# define the control using a random forest selection function

control <- rfeControl(functions=rfFuncs, method="cv", number=10)

# run the RFE algorithm

results <- rfe(PimaIndiansDiabetes[,1:8], PimaIndiansDiabetes[,9], sizes=c(1:8), rfeControl=control)

# summarize the results

print(results)

# list the chosen features

predictors(results)

# plot the results

plot(results, type=c("g", "o"))
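The observation that a smaller subset performs almost as well as the best one can be checked numerically instead of being read off the plot. The sketch below reruns the same RFE and uses caret's pickSizeTolerance to find the smallest subset size whose accuracy is within a tolerance of the best. The 1% tolerance and the name small_size are assumptions for illustration.

```r
# Sketch: pick the smallest subset size that is nearly as accurate as the best
# (the 1% tolerance and the name small_size are illustrative assumptions)
library(mlbench)
library(caret)

set.seed(7)
data(PimaIndiansDiabetes)

control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)
results <- rfe(PimaIndiansDiabetes[, 1:8], PimaIndiansDiabetes[, 9],
               sizes = 1:8, rfeControl = control)

# resampled performance for every subset size that was evaluated
print(results$results[, c("Variables", "Accuracy", "Kappa")])

# smallest size whose accuracy is within 1 percent of the best accuracy
small_size <- pickSizeTolerance(results$results, metric = "Accuracy",
                                tol = 1, maximize = TRUE)
print(small_size)
```

A larger tol trades more accuracy for fewer features; the size rfe itself selected is available as results$optsize.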

Source:

http://blog.csdn.net/python_learn/article/details/45008073


Posted by hqy on 2025-03-01 20:50:33
