Python入门教程：一个强大的交叉验证Scikit - ModelSelection库

你们有没有过那种挑衣服挑到崩溃的时候？面对一柜子的衣服，不知道该选哪件去赴约，选得不好，一晚上都别扭。

在机器学习里，选模型也是这个道理。Scikit - ModelSelection 库就像你的时尚顾问，帮你从众多模型里挑出最合身的那个，还能通过交叉验证这些操作，让你选得明明白白。今天就给大家分享 10 个玩转这个库的高级技巧，新手小白也能轻松掌握！

技巧一：安装 Scikit - learn 及相关依赖

想用 Scikit - ModelSelection，得先把 Scikit - learn 这个 “时尚大管家” 请回家，它可是包含了我们要用的库。安装超简单，打开命令行，输入：

pip install -U scikit - learn

要是网速不给力，半天装不上，别上火，多试几次，或者换个网络环境。要是安装报错，说缺少依赖啥的，也别慌，按照错误提示，把缺少的库装上就行。安装好 Scikit - learn，就像把一柜子衣服都整理好了，就等你用 Scikit - ModelSelection 来挑啦！

技巧二：使用 KFold 交叉验证

假设你要训练一个预测房价的模型，可又担心模型在不同数据上表现不一样，这时候 KFold 交叉验证就派上用场了。它能把你的数据分成好几份，轮流用不同的部分做训练和测试，看看模型是不是真的靠谱。

import numpy as npfrom sklearn.model_selection import KFoldfrom sklearn.linear_model import LinearRegression# 模拟房价数据，假设有10个样本，每个样本有3个特征（面积、房龄、楼层）data = np.array([ [100, 5, 3], [120, 3, 5], [80, 8, 2], [90, 6, 4], [110, 4, 6], [70, 9, 1], [130, 2, 7], [85, 7, 3], [105, 5, 4], [115, 3, 5]])# 模拟房价标签labels = np.array([200, 250, 150, 180, 220, 130, 280, 160, 210, 230])# 创建KFold实例，设置分成5份kf = KFold(n_splits=5)# 创建线性回归模型model = LinearRegression()# 进行KFold交叉验证for train_index, test_index in kf.split(data): X_train, X_test = data[train_index], data[test_index] y_train, y_test = labels[train_index], labels[test_index] model.fit(X_train, y_train) score = model.score(X_test, y_test) print(f"测试集得分: {score}")

代码拆解

import numpy as np导入numpy库，处理数值计算那叫一个溜。from sklearn.model_selection import KFold从sklearn的model_selection模块导入KFold，这就是我们的交叉验证工具。from sklearn.linear_model import LinearRegression导入线性回归模型，这里用它来预测房价。

data = np.array([[100, 5, 3], [120, 3, 5], [80, 8, 2], [90, 6, 4], [110, 4, 6], [70, 9, 1], [130, 2, 7], [85, 7, 3], [105, 5, 4], [115, 3, 5]])模拟房价数据，每一行是一个样本，包含面积、房龄、楼层这 3 个特征。labels = np.array([200, 250, 150, 180, 220, 130, 280, 160, 210, 230])模拟房价标签。

kf = KFold(n_splits=5)创建KFold实例，n_splits=5表示把数据分成 5 份。

model = LinearRegression()创建线性回归模型。

for train_index, test_index in kf.split(data):开始循环，kf.split(data)会生成 5 组训练集和测试集的索引。X_train, X_test = data[train_index], data[test_index]根据索引把数据分成训练集和测试集，y_train, y_test = labels[train_index], labels[test_index]把标签也分成训练集和测试集。model.fit(X_train, y_train)用训练集数据训练模型，score = model.score(X_test, y_test)用测试集数据评估模型，最后打印测试集得分，看看模型在不同数据上的表现咋样。

行业应用案例

在房地产中介行业，中介们要根据房屋的各种特征预测房价，给客户提供参考。通过 KFold 交叉验证，用不同的房屋数据组合训练和测试模型，能更准确地评估模型的性能。比如，在一个新的小区，中介可以用这个方法训练模型，预测不同房屋的价格，吸引更多客户，提高业务量。

技巧三：使用 StratifiedKFold 交叉验证

当你的数据类别分布不均衡时，普通的 KFold 就不太合适了，这时候 StratifiedKFold 就闪亮登场。它能保证每个折叠中的类别比例和原始数据差不多。假设你要训练一个预测客户是否会购买产品的模型，客户购买和不购买的比例差异较大。

import numpy as npfrom sklearn.model_selection import StratifiedKFoldfrom sklearn.linear_model import LogisticRegression# 模拟客户数据，假设有10个客户，每个客户有2个特征（收入、消费频率）data = np.array([ [5000, 3], [6000, 4], [4000, 2], [7000, 5], [5500, 3], [4500, 2], [6500, 4], [7500, 5], [5200, 3], [4800, 2]])# 模拟客户购买产品的标签，1表示购买，0表示未购买labels = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])# 创建StratifiedKFold实例，设置分成5份skf = StratifiedKFold(n_splits=5)# 创建逻辑回归模型model = LogisticRegression()# 进行StratifiedKFold交叉验证for train_index, test_index in skf.split(data, labels): X_train, X_test = data[train_index], data[test_index] y_train, y_test = labels[train_index], labels[test_index] model.fit(X_train, y_train) score = model.score(X_test, y_test) print(f"测试集得分: {score}")

代码拆解

导入numpy库，from sklearn.model_selection import StratifiedKFold导入分层交叉验证工具，from sklearn.linear_model import LogisticRegression导入逻辑回归模型，因为这是个分类问题。

data = np.array([[5000, 3], [6000, 4], [4000, 2], [7000, 5], [5500, 3], [4500, 2], [6500, 4], [7500, 5], [5200, 3], [4800, 2]])模拟客户数据，labels = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])模拟客户购买产品的标签。

skf = StratifiedKFold(n_splits=5)创建StratifiedKFold实例，分成 5 份。它会根据标签的类别分布来划分数据，保证每个折叠里购买和未购买的客户比例和原始数据差不多。

model = LogisticRegression()创建逻辑回归模型。

循环部分和 KFold 类似，for train_index, test_index in skf.split(data, labels):根据StratifiedKFold生成的索引划分数据，model.fit(X_train, y_train)训练模型，score = model.score(X_test, y_test)评估模型，打印测试集得分。

行业应用案例

在电商营销中，商家要预测哪些客户会购买自己的产品，以便制定精准的营销策略。由于购买和未购买的客户数量往往不均衡，使用 StratifiedKFold 交叉验证训练模型，能更准确地评估模型在不同客户群体中的性能。比如，电商平台可以用这个方法训练模型，找出最有可能购买产品的客户，针对性地推送广告，提高营销效果，增加销售额。

技巧四：使用 LeaveOneOut 交叉验证

LeaveOneOut 交叉验证每次只留一个样本做测试，其余的都用来训练，能最大程度利用数据。假设你有一个很小的数据集，想充分利用每一个数据点来训练模型。

import numpy as npfrom sklearn.model_selection import LeaveOneOutfrom sklearn.linear_model import LinearRegression# 模拟数据，假设有6个样本，每个样本有2个特征data = np.array([ [1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])# 模拟标签labels = np.array([5, 10, 15, 20, 25, 30])# 创建LeaveOneOut实例loo = LeaveOneOut()# 创建线性回归模型model = LinearRegression()# 进行LeaveOneOut交叉验证for train_index, test_index in loo.split(data): X_train, X_test = data[train_index], data[test_index] y_train, y_test = labels[train_index], labels[test_index] model.fit(X_train, y_train) score = model.score(X_test, y_test) print(f"测试集得分: {score}")

代码拆解

导入numpy库，from sklearn.model_selection import LeaveOneOut导入留一法交叉验证工具，from sklearn.linear_model import LinearRegression导入线性回归模型。

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])模拟数据，labels = np.array([5, 10, 15, 20, 25, 30])模拟标签。

loo = LeaveOneOut()创建LeaveOneOut实例。

model = LinearRegression()创建线性回归模型。

在循环中，for train_index, test_index in loo.split(data):，loo.split(data)每次返回一组训练集和测试集索引，测试集只有一个样本。然后划分数据，训练模型，评估模型并打印测试集得分。

行业应用案例

在医学研究中，有时候收集到的患者数据非常有限，比如某种罕见病的病例。使用 LeaveOneOut 交叉验证，能充分利用每一个患者的数据来训练疾病诊断模型。例如，医生可以用这个方法训练模型，根据患者的症状和检查指标，更准确地诊断疾病，为患者提供更好的治疗方案。

技巧五：使用 ShuffleSplit 交叉验证

ShuffleSplit 交叉验证会随机打乱数据，然后划分成训练集和测试集，而且可以指定划分的次数和比例。假设你想看看不同的数据划分方式对模型的影响。

import numpy as npfrom sklearn.model_selection import ShuffleSplitfrom sklearn.linear_model import LinearRegression# 模拟数据，假设有10个样本，每个样本有3个特征data = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24], [25, 26, 27], [28, 29, 30]])# 模拟标签labels = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])# 创建ShuffleSplit实例，设置划分次数为5，测试集比例为0.2ss = ShuffleSplit(n_splits=5, test_size=0.2)# 创建线性回归模型model = LinearRegression()# 进行ShuffleSplit交叉验证for train_index, test_index in ss.split(data): X_train, X_test = data[train_index], data[test_index] y_train, y_test = labels[train_index], labels[test_index] model.fit(X_train, y_train) score = model.score(X_test, y_test) print(f"测试集得分: {score}")

代码拆解

导入numpy库，from sklearn.model_selection import ShuffleSplit导入随机划分交叉验证工具，from sklearn.linear_model import LinearRegression导入线性回归模型。

定义数据和标签。

ss = ShuffleSplit(n_splits=5, test_size=0.2)创建ShuffleSplit实例，n_splits=5表示划分 5 次，test_size=0.2表示每次测试集占数据总量的 20%。

创建线性回归模型，在循环中根据ShuffleSplit生成的索引划分数据，训练模型，评估模型并打印测试集得分。每次划分的数据都不一样，能更全面地评估模型在不同数据划分下的性能。

行业应用案例

在市场调研中，企业收集了消费者对产品的反馈数据，想训练一个模型预测消费者的满意度。使用 ShuffleSplit 交叉验证，可以通过不同的数据划分方式，更全面地评估模型的稳定性和准确性。比如，企业可以用这个方法训练模型，找出影响消费者满意度的关键因素，优化产品设计和服务，提高消费者满意度和忠诚度。

技巧六：使用 GridSearchCV 进行超参数调优

超参数对模型的性能影响很大，GridSearchCV 可以帮你找到最优的超参数组合。假设你要训练一个决策树模型，调整它的最大深度和最小样本数。

import numpy as npfrom sklearn.model_selection import GridSearchCVfrom sklearn.tree import DecisionTreeClassifier# 模拟数据，假设有10个样本，每个样本有3个特征data = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24], [25, 26, 27], [28, 29, 30]])# 模拟标签labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])# 定义决策树模型model = DecisionTreeClassifier()# 定义超参数搜索空间param_grid = {max_depth: [3, 5, 7],min_samples_split: [2, 5, 10]}# 创建GridSearchCV实例，设置交叉验证次数为3grid_search = GridSearchCV(model, param_grid, cv=3)# 进行超参数调优grid_search.fit(data, labels)# 打印最优超参数组合和得分print("最优超参数组合:", grid_search.best_params_)print("最优得分:", grid_search.best_score_)

代码拆解

导入numpy库，from sklearn.model_selection import GridSearchCV导入网格搜索工具，from sklearn.tree import DecisionTreeClassifier导入决策树模型。

定义数据和标签。

model = DecisionTreeClassifier()创建决策树模型。

4. param_grid = {max_depth: [3, 5, 7],min_samples_split: [2, 5, 10]}定义超参数搜索空间，max_depth尝试 3、5、7 这三个值，min_samples_split尝试 2、5、10 这三个值，组合起来就有多种超参数组合。

5. grid_search = GridSearchCV(model, param_grid, cv=3)创建GridSearchCV实例，把决策树模型model、超参数搜索空间param_grid传进去，cv=3表示进行 3 折交叉验证，也就是把数据分成 3 份，轮流用不同的部分做训练和测试，评估每种超参数组合下模型的性能。

6. grid_search.fit(data, labels)开始超参数调优，GridSearchCV会遍历超参数搜索空间里的每一种组合，用交叉验证评估模型性能。

7. 最后，print("最优超参数组合:", grid_search.best_params_)打印出找到的最优超参数组合，print("最优得分:", grid_search.best_score_)打印出在最优超参数组合下模型的得分，这个得分能反映模型在验证集上的表现，得分越高，说明模型性能越好。

行业应用案例

在金融风控领域，银行需要预测客户的信用风险，决策树模型是常用的工具之一。通过GridSearchCV调整决策树的超参数，如最大深度和最小样本数，可以让模型更准确地识别高风险客户。比如，经过超参数调优后的模型，能更精准地判断哪些客户可能会违约，银行就能提前采取措施，如提高贷款利率、减少贷款额度等，降低坏账风险，保障金融安全。

技巧七：使用 RandomizedSearchCV 进行超参数调优

GridSearchCV是对所有超参数组合进行穷举搜索，当搜索空间很大时，计算量会非常大。这时候RandomizedSearchCV就派上用场了，它会随机选择一些超参数组合进行评估，能大大节省计算时间。假设你要训练一个随机森林模型，调整它的n_estimators（树的数量）和max_features（每次分裂考虑的最大特征数）。

import numpy as npfrom sklearn.model_selection import RandomizedSearchCVfrom sklearn.ensemble import RandomForestClassifierfrom scipy.stats import randint# 模拟数据，假设有10个样本，每个样本有3个特征data = np.array([ [1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12], [13, 14, 15], [16, 17, 18], [19, 20, 21], [22, 23, 24], [25, 26, 27], [28, 29, 30]])# 模拟标签labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])# 定义随机森林模型model = RandomForestClassifier()# 定义超参数搜索空间，使用randint生成随机整数param_dist = {n_estimators: randint(50, 200),max_features: randint(1, 3)}# 创建RandomizedSearchCV实例，设置交叉验证次数为3，随机搜索10次random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=3)# 进行超参数调优random_search.fit(data, labels)# 打印最优超参数组合和得分print("最优超参数组合:", random_search.best_params_)print("最优得分:", random_search.best_score_)

代码拆解

导入numpy库用于数值计算，from sklearn.model_selection import RandomizedSearchCV导入随机搜索超参数工具，from sklearn.ensemble import RandomForestClassifier导入随机森林模型，from scipy.stats import randint用于生成随机整数，在定义超参数搜索空间时会用到。

定义模拟数据和标签，和前面的例子类似。

model = RandomForestClassifier()创建随机森林模型实例。

param_dist = {n_estimators: randint(50, 200),max_features: randint(1, 3)}定义超参数搜索空间，n_estimators在 50 到 200 之间随机取值，max_features在 1 到 3 之间随机取值。

random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=3)创建RandomizedSearchCV实例，传入随机森林模型model、超参数搜索空间param_distributions，n_iter=10表示随机搜索 10 次，cv=3表示进行 3 折交叉验证。

random_search.fit(data, labels)开始超参数调优，RandomizedSearchCV会随机选择 10 组超参数组合，用交叉验证评估模型性能。

最后打印出最优超参数组合和得分，了解模型在最优设置下的表现。

行业应用案例

在电商推荐系统中，为了给用户推荐更符合他们喜好的商品，需要训练一个推荐模型。随机森林模型可以用来分析用户的行为数据，如浏览记录、购买历史等。由于超参数众多，使用RandomizedSearchCV能在相对短的时间内找到较优的超参数组合，提高推荐的准确性。例如，通过调优后的模型，能为用户推荐更感兴趣的商品，增加用户的购买意愿，提升电商平台的销售额。

技巧八：利用学习曲线评估模型

学习曲线可以帮助我们了解模型在不同训练数据量下的性能表现，判断模型是否过拟合或欠拟合。假设你要训练一个支持向量机模型，看看它的学习曲线。

import numpy as npfrom sklearn.model_selection import learning_curvefrom sklearn.svm import SVCimport matplotlib.pyplot as plt# 模拟数据，假设有100个样本，每个样本有5个特征data = np.random.randn(100, 5)# 模拟标签labels = np.random.randint(0, 2, 100)# 定义支持向量机模型model = SVC()# 生成学习曲线数据，cv设置为5折交叉验证train_sizes, train_scores, test_scores = learning_curve(model, data, labels, cv=5)# 计算训练集和测试集的平均得分train_mean = np.mean(train_scores, axis=1)test_mean = np.mean(test_scores, axis=1)# 绘制学习曲线plt.plot(train_sizes, train_mean, label=Training Score)plt.plot(train_sizes, test_mean, label=Test Score)plt.xlabel(Training Examples)plt.ylabel(Score)plt.title(Learning Curve)plt.legend()plt.show()

代码拆解

导入numpy库用于数值计算，from sklearn.model_selection import learning_curve导入生成学习曲线的工具，from sklearn.svm import SVC导入支持向量机模型，import matplotlib.pyplot as plt导入绘图库，用于绘制学习曲线。

生成模拟数据和标签，data = np.random.randn(100, 5)生成 100 个样本，每个样本 5 个特征，labels = np.random.randint(0, 2, 100)生成 0 或 1 的随机标签。

model = SVC()创建支持向量机模型实例。

train_sizes, train_scores, test_scores = learning_curve(model, data, labels, cv=5)生成学习曲线数据，cv=5表示进行 5 折交叉验证。train_sizes是不同训练数据量的数组，train_scores是对应训练数据量下训练集的得分，test_scores是对应训练数据量下测试集的得分。

train_mean = np.mean(train_scores, axis=1)计算每个训练数据量下训练集得分的平均值，test_mean = np.mean(test_scores, axis=1)计算每个训练数据量下测试集得分的平均值。

使用matplotlib绘制学习曲线，plt.plot(train_sizes, train_mean, label=Training Score)绘制训练集得分曲线，plt.plot(train_sizes, test_mean, label=Test Score)绘制测试集得分曲线，设置xlabel、ylabel、title和legend，最后plt.show()显示图形。

行业应用案例

在图像识别领域，训练一个识别不同动物的模型时，通过学习曲线可以了解模型对数据量的需求。如果随着训练数据量的增加，测试集得分持续上升，说明模型还需要更多数据来提升性能；如果训练集和测试集得分差距很大，可能存在过拟合问题。比如，在训练一个识别濒危动物的模型时，通过学习曲线调整数据量和模型参数，能提高模型的识别准确率，更好地保护濒危动物。

技巧九：使用验证曲线调整超参数

验证曲线可以帮助我们了解某个超参数对模型性能的影响，从而选择合适的超参数值。假设你要训练一个 K 近邻模型，调整它的n_neighbors（邻居数）超参数。

import numpy as npfrom sklearn.model_selection import validation_curvefrom sklearn.neighbors import KNeighborsClassifierimport matplotlib.pyplot as plt# 模拟数据，假设有50个样本，每个样本有4个特征data = np.random.randn(50, 4)# 模拟标签labels = np.random.randint(0, 2, 50)# 定义K近邻模型model = KNeighborsClassifier()# 定义要调整的超参数和其取值范围param_range = np.arange(1, 11)# 生成验证曲线数据，cv设置为5折交叉验证train_scores, test_scores = validation_curve(model, data, labels, param_name=n_neighbors, param_range=param_range, cv=5)# 计算训练集和测试集的平均得分train_mean = np.mean(train_scores, axis=1)test_mean = np.mean(test_scores, axis=1)# 绘制验证曲线plt.plot(param_range, train_mean, label=Training Score)plt.plot(param_range, test_mean, label=Test Score)plt.xlabel(Number of Neighbors)plt.ylabel(Score)plt.title(Validation Curve)plt.legend()plt.show()

代码拆解

导入numpy库用于数值计算，from sklearn.model_selection import validation_curve导入生成验证曲线的工具，from sklearn.neighbors import KNeighborsClassifier导入 K 近邻模型，import matplotlib.pyplot as plt导入绘图库。

生成模拟数据和标签。

model = KNeighborsClassifier()创建 K 近邻模型实例。

param_range = np.arange(1, 11)定义要调整的超参数n_neighbors的取值范围，从 1 到 10。

train_scores, test_scores = validation_curve(model, data, labels, param_name=n_neighbors, param_range=param_range, cv=5)生成验证曲线数据，param_name=n_neighbors指定要调整的超参数，cv=5表示进行 5 折交叉验证。train_scores是不同超参数值下训练集的得分，test_scores是不同超参数值下测试集的得分。

计算训练集和测试集得分的平均值，绘制验证曲线，通过观察曲线走势，选择使测试集得分较高且稳定的超参数值。

行业应用案例

在客户细分领域，企业要根据客户的各种特征将客户分成不同的群体，以便制定个性化的营销策略。K 近邻模型可以用于客户细分，通过验证曲线调整n_neighbors超参数，能找到最适合的邻居数，提高客户细分的准确性。比如，企业通过调优后的模型，能更准确地将客户分类，为不同类别的客户提供更符合他们需求的产品和服务，提高客户满意度和忠诚度。

技巧十：进行嵌套交叉验证

嵌套交叉验证是在交叉验证的基础上再进行一层交叉验证，能更准确地评估模型性能，尤其是在进行超参数调优时。假设你要训练一个逻辑回归模型，同时进行超参数调优和模型评估。

import numpy as npfrom sklearn.model_selection import GridSearchCV, StratifiedKFoldfrom sklearn.linear_model import LogisticRegressionfrom sklearn.datasets import make_classificationfrom sklearn.metrics import accuracy_score# 生成分类数据集，包含100个样本，20个特征，10个有效特征，5个类别data, labels = make_classification(n_samples=100, n_features=20, n_informative=10, n_classes=5)# 定义逻辑回归模型model = LogisticRegression()# 定义超参数搜索空间param_grid = {C: [0.01, 0.1, 1]}# 外层交叉验证，使用StratifiedKFold，分成5份outer_cv = StratifiedKFold(n_splits=5)# 存储每次外层交叉验证的得分outer_scores = []for train_index, test_index in outer_cv.split(data, labels): X_train, X_test = data[train_index], data[test_index] y_train, y_test = labels[train_index], labels[test_index] # 内层交叉验证，使用GridSearchCV进行超参数调优，cv设置为3 grid_search = GridSearchCV(model, param_grid, cv=3) grid_search.fit(X_train, y_train) # 使用最优模型在测试集上进行预测 best_model = grid_search.best_estimator_ y_pred = best_model.predict(X_test) # 计算并存储得分 score = accuracy_score(y_test, y_pred) outer_scores.append(score)# 打印平均得分print("平均得分:", np.mean(outer_scores))

代码拆解

导入numpy库用于数值计算，from sklearn.model_selection import GridSearchCV, StratifiedKFold导入网格搜索和分层交叉验证工具，from sklearn.linear_model import LogisticRegression导入逻辑回归模型，from sklearn.datasets import make_classification用于生成分类数据集，from sklearn.metrics import accuracy_score用于计算准确率。

data, labels = make_classification(n_samples=100, n_features=20, n_informative=10, n_classes=5)生成包含 100 个样本，20 个特征，其中 10 个有效特征，5 个类别的分类数据集。

model = LogisticRegression()创建逻辑回归模型实例。

param_grid = {C: [0.01, 0.1, 1]}定义超参数搜索空间，这里调整逻辑回归的正则化参数C。

outer_cv = StratifiedKFold(n_splits=5)创建外层交叉验证实例，使用StratifiedKFold分成 5 份。

在外层循环中，根据外层交叉验证的索引划分数据，X_train, X_test = data[train_index], data[test_index]，y_train, y_test = labels[train_index], labels[test_index]。

在内层，使用GridSearchCV进行超参数调优，grid_search = GridSearchCV(model, param_grid, cv=3)，在训练集上找到最优超参数组合。

best_model = grid_search.best_estimator_获取最优模型，y_pred = best_model.predict(X_test)在测试集上进行预测，score = accuracy_score(y_test, y_pred)计算准确率，将得分存储在outer_scores列表中。

最后计算并打印平均得分，这个得分能更准确地反映模型在不同数据划分和超参数组合下的性能。

行业应用案例

在自动驾驶领域，训练一个识别交通标志的模型时，需要确保模型的准确性和稳定性。使用嵌套交叉验证，既能在不同的数据划分下进行超参数调优，又能准确评估模型性能。比如，通过嵌套交叉验证训练的模型，能更准确地识别各种交通标志，提高自动驾驶的安全性，为实现更可靠的自动驾驶技术提供支持。

掌握了这 10 个使用 Scikit - ModelSelection 的高级技巧，你在选择和训练模型时就能更有把握。要是在实际应用中遇到问题，比如代码报错、模型性能不理想，仔细检查代码里的数据处理、参数设置这些地方，也可以回顾这些技巧。

HQY

要和谐，要有爱~

Python入门教程：一个强大的交叉验证Scikit - ModelSelection库

hqy 发表于2025-02-28 18:41:04 浏览16 评论0百度已收录

少长咸集

« 2025年6月 »
一	二	三	四	五	六	日
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30