Stochastic Gradient Boosting with XGBoost and scikit-learn in Python

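A simple technique for ensembling decision trees involves training trees on subsamples of the training dataset. Training individual trees on subsets of the rows in the training data is called bagging. When a subset of the columns (features) is also sampled when calculating each split point, the result is called random forest. These techniques can also be used in the gradient tree boosting model, in a technique called stochastic gradient boosting.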

In this post you will discover stochastic gradient boosting and how to tune the sampling parameters using XGBoost with scikit-learn in Python. After reading this post you will know:

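The rationale behind training trees on subsamples of the data and how this can be used in gradient boosting.
How to tune row-based subsampling in XGBoost using scikit-learn.
How to tune column-based subsampling by both tree and split point in XGBoost.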

Stochastic Gradient Boosting

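Gradient boosting is a greedy procedure. New decision trees are added to the model to correct the residual error of the existing model. Each decision tree is created using a greedy search procedure that selects the split points that best minimize an objective function. This can result in trees that use the same attributes, and even the same split points, again and again.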
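
The residual-correcting loop described above can be sketched with a toy model. This is a minimal illustration under stated assumptions, not XGBoost's implementation: it uses squared error, one-dimensional inputs, and decision stumps, and every function name in it is made up for this example.

```python
# Toy gradient boosting for squared error on 1-D inputs, using decision stumps.
# All names are illustrative; this is not how XGBoost is implemented.

def fit_stump(xs, residuals):
    """Greedily pick the threshold that minimizes squared error on the residuals."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds, learning_rate=0.5):
    """Add one stump per round, each fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * predict_stump(stump, x)
                       for p, x in zip(predictions, xs)]
    return stumps, predictions


def mean_squared_error(ys, predictions):
    return sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)


xs = [float(i) for i in range(10)]
ys = [x * x for x in xs]
stumps_1, preds_1 = boost(xs, ys, n_rounds=1)
stumps_20, preds_20 = boost(xs, ys, n_rounds=20)
print("MSE after 1 round:  ", mean_squared_error(ys, preds_1))
print("MSE after 20 rounds:", mean_squared_error(ys, preds_20))
print("distinct thresholds in 20 stumps:", len({s[0] for s in stumps_20}))
```

With ten distinct inputs there are only nine candidate thresholds, so the twenty stumps necessarily reuse split points, which is the repetition the paragraph above describes.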

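Bagging is a technique in which a collection of decision trees is created, each from a different random subset of rows of the training data. The ensemble performs better because the randomness in the sample allows slightly different trees to be created, adding diversity to the trees whose predictions are combined. Random forest takes this one step further by also subsampling the features (columns) when choosing split points, which further increases the diversity among the trees. The same techniques can be used when constructing the decision trees in gradient boosting, in a variation called stochastic gradient boosting. It is common to use aggressive subsamples of the training data, such as 40% to 80%.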
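
As a rough sketch of the sampling step itself, the snippet below draws a fresh subset of row and column indices for each tree, without replacement. The helper name, the dataset dimensions, and the rounding rule are assumptions chosen for illustration; XGBoost's internal sampling may differ in detail.

```python
import random


def sample_without_replacement(n_items, ratio, rng):
    """Return sorted indices for a random subset of about ratio * n_items items."""
    k = max(1, int(round(ratio * n_items)))
    return sorted(rng.sample(range(n_items), k))


rng = random.Random(7)
n_rows, n_cols = 1000, 20  # hypothetical dataset shape

for tree in range(3):
    rows = sample_without_replacement(n_rows, 0.5, rng)  # rows seen by this tree
    cols = sample_without_replacement(n_cols, 0.8, rng)  # columns seen by this tree
    print(tree, len(rows), len(cols), rows[:5])
```

Each tree is then trained only on its own rows and columns, so successive trees see slightly different views of the data.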

Tutorial Overview

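In this tutorial we will look at the effect of different subsampling techniques in gradient boosting. We will tune three different flavors of stochastic gradient boosting supported by the XGBoost library in Python, specifically: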

Subsampling of rows in the dataset when creating each tree.
Subsampling of columns in the dataset when creating each tree.
Subsampling of columns for each split in the dataset when creating each tree.
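
The three flavors map onto the three XGBoost parameters tuned in the sections that follow. As a small configuration sketch (the dictionary name is an assumption, the parameter names and values are the ones used later in this tutorial):

```python
# Candidate ratios evaluated for each parameter in this tutorial.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]

sampling_grids = {
    "subsample": values,          # rows, sampled for each tree
    "colsample_bytree": values,   # columns, sampled for each tree
    "colsample_bylevel": values,  # columns, the parameter this tutorial tunes for split-level sampling
}

for name, grid in sampling_grids.items():
    print(name, grid)
```

Each entry is tuned on its own in this tutorial, with the other two parameters left at their default of 1.0.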

Problem Description: Otto Dataset

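In this tutorial we will use the Otto Group Product Classification Challenge dataset. The dataset is available for free from Kaggle (you will need to sign up to Kaggle to download it). You can download the training dataset train.csv.zip from the Data page and place the unzipped train.csv file in your working directory. The dataset describes 93 obfuscated details of more than 61,000 products grouped into 9 product categories (for example fashion, electronics, and so on). The input attributes are counts of different events of some kind. The goal is to make predictions for new products as an array of probabilities, one for each of the 9 categories, and models are evaluated using multiclass logarithmic loss (also called cross entropy). The competition was completed in May 2015, and the dataset is a good challenge for XGBoost because of the nontrivial number of examples, the difficulty of the problem, and the fact that little data preparation is required (other than encoding the string class values as integers).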
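
To make the evaluation metric concrete, here is a minimal pure-Python sketch of multiclass log loss on a made-up three-class example. The function name and the clipping constant are assumptions for illustration; in practice a library implementation would normally be used.

```python
import math


def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Average negative log of the probability assigned to the true class."""
    total = 0.0
    for label, row in zip(true_labels, probabilities):
        p = min(max(row[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(true_labels)


true_labels = [0, 2, 1]
confident = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8], [0.2, 0.7, 0.1]]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3

print("confident:", multiclass_log_loss(true_labels, confident))
print("uniform:  ", multiclass_log_loss(true_labels, uniform))  # equals log(3)
```

Lower is better. The listings in this tutorial pass scoring="neg_log_loss" to scikit-learn, which reports the negated value so that larger scores are better; this is why the reported scores are negative.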

Tuning Row Subsampling in XGBoost

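Row subsampling involves selecting a random sample of the training dataset without replacement. Row subsampling can be specified in the scikit-learn wrapper of the XGBoost class via the subsample parameter. The default is 1.0, which means no subsampling. We can use the grid search capability built into scikit-learn to evaluate the effect of different subsample values from 0.1 to 1.0 on the Otto dataset.

  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]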

There are 9 variations of subsample, and each model will be evaluated using 10-fold cross validation, meaning that 9×10 or 90 models need to be trained and tested.
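
The count above generalizes to grids over several parameters: multiply the number of values of each parameter together, then multiply by the number of folds. A small sketch, with a helper name invented for this example:

```python
def count_cv_fits(param_grid, n_splits):
    """Number of models a grid search trains during cross validation."""
    n_combinations = 1
    for values in param_grid.values():
        n_combinations *= len(values)
    return n_combinations * n_splits


param_grid = {"subsample": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]}
print(count_cv_fits(param_grid, n_splits=10))  # 90
```

Note that scikit-learn's GridSearchCV, with its default refit=True, also fits the best configuration once more on the full dataset after the search.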

The complete code listing is provided below.

  # XGBoost on Otto dataset, tune subsample
  from pandas import read_csv
  from xgboost import XGBClassifier
  from sklearn.model_selection import GridSearchCV
  from sklearn.model_selection import StratifiedKFold
  from sklearn.preprocessing import LabelEncoder
  import matplotlib
  matplotlib.use('Agg')
  from matplotlib import pyplot
  # load data
  data = read_csv('train.csv')
  dataset = data.values
  # split data into X (columns 0-93: the id column plus the 93 features) and y (column 94: the class label)
  X = dataset[:,0:94]
  y = dataset[:,94]
  # encode string class values as integers
  label_encoded_y = LabelEncoder().fit_transform(y)
  # grid search
  model = XGBClassifier()
  subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
  param_grid = dict(subsample=subsample)
  kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
  grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
  grid_result = grid_search.fit(X, label_encoded_y)
  # summarize results
  print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
  means = grid_result.cv_results_['mean_test_score']
  stds = grid_result.cv_results_['std_test_score']
  params = grid_result.cv_results_['params']
  for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))
  # plot mean score with standard deviation error bars
  pyplot.errorbar(subsample, means, yerr=stds)
  pyplot.title("XGBoost subsample vs Log Loss")
  pyplot.xlabel('subsample')
  pyplot.ylabel('Log Loss')
  pyplot.savefig('subsample.png')

Running this example will print the best configuration as well as the log loss for each tested configuration.

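Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.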

We can see that the best result was achieved with a value of 0.3, that is, training each tree on a 30% sample of the training dataset.

  Best: -0.000647 using {'subsample': 0.3}
  -0.001156 (0.000286) with: {'subsample': 0.1}
  -0.000765 (0.000430) with: {'subsample': 0.2}
  -0.000647 (0.000471) with: {'subsample': 0.3}
  -0.000659 (0.000635) with: {'subsample': 0.4}
  -0.000717 (0.000849) with: {'subsample': 0.5}
  -0.000773 (0.000998) with: {'subsample': 0.6}
  -0.000877 (0.001179) with: {'subsample': 0.7}
  -0.001007 (0.001371) with: {'subsample': 0.8}
  -0.001239 (0.001730) with: {'subsample': 1.0}

We can plot these mean and standard deviation log loss values to get a better understanding of how performance varies with the subsample value.

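We can see that 30% does indeed have the best mean performance, but we can also see that the variance in performance grows markedly as the ratio increases. Interestingly, the mean performance of every subsample value is better than the mean performance without subsampling (subsample=1.0).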
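
These observations can be checked directly against the scores reported above. The snippet below simply re-enters the printed means and standard deviations by hand; the variable names are made up for this check.

```python
# subsample -> (mean neg_log_loss, standard deviation), copied from the output above
results = {
    0.1: (-0.001156, 0.000286),
    0.2: (-0.000765, 0.000430),
    0.3: (-0.000647, 0.000471),
    0.4: (-0.000659, 0.000635),
    0.5: (-0.000717, 0.000849),
    0.6: (-0.000773, 0.000998),
    0.7: (-0.000877, 0.001179),
    0.8: (-0.001007, 0.001371),
    1.0: (-0.001239, 0.001730),
}

# Scores are negated log loss, so the largest value is the best.
best = max(results, key=lambda s: results[s][0])

# Does every subsampled configuration beat subsample=1.0 on the mean?
beats_no_subsampling = all(
    results[s][0] > results[1.0][0] for s in results if s != 1.0
)

# Does the standard deviation grow with the subsample ratio?
stds = [results[s][1] for s in sorted(results)]
spread_grows = all(a < b for a, b in zip(stds, stds[1:]))

print(best, beats_no_subsampling, spread_grows)
```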

Tuning Column Subsampling by Tree in XGBoost

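We can also create a random sample of the features (or columns) to use prior to creating each decision tree in the boosted model. In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter. The default value is 1.0, meaning that all columns are used in each decision tree. We can evaluate values for colsample_bytree between 0.1 and 1.0 in increments of 0.1 (the grid below omits 0.9).

  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]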

The complete example is listed below:

  # XGBoost on Otto dataset, tune colsample_bytree
  from pandas import read_csv
  from xgboost import XGBClassifier
  from sklearn.model_selection import GridSearchCV
  from sklearn.model_selection import StratifiedKFold
  from sklearn.preprocessing import LabelEncoder
  import matplotlib
  matplotlib.use('Agg')
  from matplotlib import pyplot
  # load data
  data = read_csv('train.csv')
  dataset = data.values
  # split data into X (columns 0-93: the id column plus the 93 features) and y (column 94: the class label)
  X = dataset[:,0:94]
  y = dataset[:,94]
  # encode string class values as integers
  label_encoded_y = LabelEncoder().fit_transform(y)
  # grid search
  model = XGBClassifier()
  colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
  param_grid = dict(colsample_bytree=colsample_bytree)
  kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
  grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
  grid_result = grid_search.fit(X, label_encoded_y)
  # summarize results
  print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
  means = grid_result.cv_results_['mean_test_score']
  stds = grid_result.cv_results_['std_test_score']
  params = grid_result.cv_results_['params']
  for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))
  # plot mean score with standard deviation error bars
  pyplot.errorbar(colsample_bytree, means, yerr=stds)
  pyplot.title("XGBoost colsample_bytree vs Log Loss")
  pyplot.xlabel('colsample_bytree')
  pyplot.ylabel('Log Loss')
  pyplot.savefig('colsample_bytree.png')

Running this example will print the best configuration as well as the log loss for each tested configuration.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision.

We can see that the best performance for the model was achieved with colsample_bytree=1.0. This suggests that subsampling columns by tree does not add value on this problem.

  Best: -0.001239 using {'colsample_bytree': 1.0}
  -0.298955 (0.002177) with: {'colsample_bytree': 0.1}
  -0.092441 (0.000798) with: {'colsample_bytree': 0.2}
  -0.029993 (0.000459) with: {'colsample_bytree': 0.3}
  -0.010435 (0.000669) with: {'colsample_bytree': 0.4}
  -0.004176 (0.000916) with: {'colsample_bytree': 0.5}
  -0.002614 (0.001062) with: {'colsample_bytree': 0.6}
  -0.001694 (0.001221) with: {'colsample_bytree': 0.7}
  -0.001306 (0.001435) with: {'colsample_bytree': 0.8}
  -0.001239 (0.001730) with: {'colsample_bytree': 1.0}

Plotting the results, we can see the performance of the model plateau (at least at this scale) for values from 0.5 to 1.0.

Tuning Column Subsampling by Split in XGBoost

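Rather than subsampling the columns once for each tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest. We can set the size of the column sample with the colsample_bylevel parameter of the XGBoost wrapper classes for scikit-learn. Note that the XGBoost documentation describes colsample_bylevel as sampling columns once for each depth level of a tree; sampling at every individual split is controlled by the separate colsample_bynode parameter. As before, we will vary the ratio from 10% to the default of 100%.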
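
The difference between sampling columns once per tree and resampling them further down the tree can be sketched as follows. This is an illustration under assumptions, not XGBoost's code: the helper name, the column count, and the truncating rounding rule are invented here, and the resampling is done once per depth level, which is how the XGBoost documentation describes colsample_bylevel.

```python
import random


def pick_columns(columns, ratio, rng):
    """Sample a subset of the given column indices without replacement."""
    k = max(1, int(ratio * len(columns)))
    return sorted(rng.sample(columns, k))


rng = random.Random(7)
all_columns = list(range(12))  # hypothetical dataset with 12 features

# Per-tree sampling: one draw, reused by every split in the tree.
tree_columns = pick_columns(all_columns, 0.5, rng)

# Per-level sampling: a fresh draw at each depth, taken from the tree's columns.
level_columns = [pick_columns(tree_columns, 0.5, rng) for depth in range(3)]

print("tree: ", tree_columns)
for depth, cols in enumerate(level_columns):
    print("depth", depth, cols)
```

Here the level-wise draw is taken from the columns already chosen for the tree, following the XGBoost documentation's statement that the colsample_by* ratios act cumulatively. Moving the draw inside a loop over individual nodes would give per-split sampling instead.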

The complete code listing is provided below.

  # XGBoost on Otto dataset, tune colsample_bylevel
  from pandas import read_csv
  from xgboost import XGBClassifier
  from sklearn.model_selection import GridSearchCV
  from sklearn.model_selection import StratifiedKFold
  from sklearn.preprocessing import LabelEncoder
  import matplotlib
  matplotlib.use('Agg')
  from matplotlib import pyplot
  # load data
  data = read_csv('train.csv')
  dataset = data.values
  # split data into X (columns 0-93: the id column plus the 93 features) and y (column 94: the class label)
  X = dataset[:,0:94]
  y = dataset[:,94]
  # encode string class values as integers
  label_encoded_y = LabelEncoder().fit_transform(y)
  # grid search
  model = XGBClassifier()
  colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
  param_grid = dict(colsample_bylevel=colsample_bylevel)
  kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
  grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
  grid_result = grid_search.fit(X, label_encoded_y)
  # summarize results
  print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
  means = grid_result.cv_results_['mean_test_score']
  stds = grid_result.cv_results_['std_test_score']
  params = grid_result.cv_results_['params']
  for mean, stdev, param in zip(means, stds, params):
      print("%f (%f) with: %r" % (mean, stdev, param))
  # plot mean score with standard deviation error bars
  pyplot.errorbar(colsample_bylevel, means, yerr=stds)
  pyplot.title("XGBoost colsample_bylevel vs Log Loss")
  pyplot.xlabel('colsample_bylevel')
  pyplot.ylabel('Log Loss')
  pyplot.savefig('colsample_bylevel.png')

Running this example will print the best configuration as well as the log loss for each tested configuration.

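We can see that the best results were achieved by setting colsample_bylevel to 70%, resulting in an (inverted) log loss of -0.001062, which is better than the -0.001239 seen when setting the per-tree column sampling to 100%.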

This suggests not giving up on column subsampling if the per-tree results suggest using 100% of the columns. Instead, try per-split column subsampling.

  Best: -0.001062 using {'colsample_bylevel': 0.7}
  -0.159455 (0.007028) with: {'colsample_bylevel': 0.1}
  -0.034391 (0.003533) with: {'colsample_bylevel': 0.2}
  -0.007619 (0.000451) with: {'colsample_bylevel': 0.3}
  -0.002982 (0.000726) with: {'colsample_bylevel': 0.4}
  -0.001410 (0.000946) with: {'colsample_bylevel': 0.5}
  -0.001182 (0.001144) with: {'colsample_bylevel': 0.6}
  -0.001062 (0.001221) with: {'colsample_bylevel': 0.7}
  -0.001071 (0.001427) with: {'colsample_bylevel': 0.8}
  -0.001239 (0.001730) with: {'colsample_bylevel': 1.0}
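
The size of the improvement over the per-tree result can be worked out from the reported figures. The snippet below is plain arithmetic on the numbers printed above, with the sign flipped so that the values are ordinary log losses; the variable names are made up.

```python
per_tree_best = 0.001239   # log loss at colsample_bytree=1.0
per_level_best = 0.001062  # log loss at colsample_bylevel=0.7
per_level_std = 0.001221   # standard deviation across folds at colsample_bylevel=0.7

improvement = per_tree_best - per_level_best
relative = improvement / per_tree_best

print("absolute improvement:", improvement)
print("relative improvement: %.1f%%" % (100 * relative))
print("smaller than one standard deviation:", improvement < per_level_std)
```

The mean log loss drops by roughly 14%, but the difference is smaller than the fold-to-fold standard deviation, so the gain is best read as suggestive rather than conclusive.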

We can plot the performance of each colsample_bylevel variation. The results show relatively low variance and, at this scale, an apparent plateau in performance for values above 0.3.
