This entry explores different ways to combine supervised machine learning models to maximize their predictive capability.
When building a machine learning model, people generally choose the one that performs best according to some evaluation metric such as accuracy or F1 score. However, choosing one model over another, say a decision tree over k-Nearest Neighbors (KNN) or logistic regression, means discarding the strengths of the remaining models, which may have learned different and potentially useful patterns.
When you conduct a survey, you don’t accept only one “best” answer. You consider the combined responses of all the participants and use statistics like the mode or the mean to represent them. The combined responses will likely lead to a better decision than relying on a single response. The same principle applies to ensemble methods, where we form a new model by combining existing ones. The combined model will perform better than any of the individual models, or at least as well as the best individual model. In other words, ensemble learning is the use of multiple machine learning models to maximize predictive capability.
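To make the idea concrete before we touch the mushroom data, here is a minimal sketch (with made-up predictions, purely for illustration) of how the mode turns three individual classifiers’ outputs into a single collective prediction:

import numpy as np

# Hypothetical binary predictions from three classifiers for five observations
pred_a = np.array([1, 0, 1, 1, 0])
pred_b = np.array([1, 1, 1, 0, 0])
pred_c = np.array([0, 0, 1, 1, 1])

# Majority vote: a class wins an observation when at least 2 of the 3 models pick it
votes = np.vstack([pred_a, pred_b, pred_c])
majority = (votes.sum(axis=0) >= 2).astype(int)
print(majority)  # -> [1 0 1 1 0]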
In this post, I will explore the use of ensemble machine learning models to predict which mushrooms are edible based on their properties (e.g., cap size, color, odor). The data set is from the UC Irvine Machine Learning Repository and is currently distributed for practice on Kaggle. I will explore the use of a single model, model voting, bootstrap aggregating (aka bagging), model boosting, and model stacking via Python.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from mlxtend.classifier import StackingClassifier
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
df = pd.read_csv("mushrooms.csv")
# data set shape
print("There are {} rows and {} columns in this dataset".format(df.shape[0], df.shape[1]))
There are 8124 rows and 23 columns in this dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 8124 non-null object
1 cap-shape 8124 non-null object
2 cap-surface 8124 non-null object
3 cap-color 8124 non-null object
4 bruises 8124 non-null object
5 odor 8124 non-null object
6 gill-attachment 8124 non-null object
7 gill-spacing 8124 non-null object
8 gill-size 8124 non-null object
9 gill-color 8124 non-null object
10 stalk-shape 8124 non-null object
11 stalk-root 8124 non-null object
12 stalk-surface-above-ring 8124 non-null object
13 stalk-surface-below-ring 8124 non-null object
14 stalk-color-above-ring 8124 non-null object
15 stalk-color-below-ring 8124 non-null object
16 veil-type 8124 non-null object
17 veil-color 8124 non-null object
18 ring-number 8124 non-null object
19 ring-type 8124 non-null object
20 spore-print-color 8124 non-null object
21 population 8124 non-null object
22 habitat 8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
class_dict = {'e': 'edible', 'p': 'poisonous'}
df['class'] = df['class'].map(class_dict)
import matplotlib.pyplot as plt
import seaborn as sns
="darkgrid")
sns.set_theme(style= sns.countplot(x="class", data = df)
ax plt.show()
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()

for col in df.columns:
    df[col] = labelencoder.fit_transform(df[col])
# The machine does not know about mushrooms. It only learns the patterns in the data as reflected by the encoded categories.
# Checking the encoded values
df['stalk-color-above-ring'].unique()
array([7, 3, 6, 4, 0, 2, 5, 1, 8])
print(df.groupby('class').size())
class
0 4208
1 3916
dtype: int64
="darkgrid")
sns.set_theme(style= sns.countplot(x="class", data = df)
ax plt.show()
RANDOM_STATE = 123

X = df.drop('class', axis=1)  # features
y = df['class']               # label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30,
                                                    random_state=RANDOM_STATE)
# Instantiate a Naive Bayes classifier
clf_nb = GaussianNB()

# Fit the model to the training set
clf_nb.fit(X_train, y_train)
GaussianNB()

# Calculate the predictions on the test set
pred_nb = clf_nb.predict(X_test)
# Evaluate the performance using the accuracy score
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_nb)))
Accuracy: 0.9130
print("F1: {:0.4f}".format(f1_score(y_test, pred_nb)))
F1: 0.9091
clf_lr = LogisticRegression(max_iter=450, random_state=RANDOM_STATE)

clf_lr.fit(X_train, y_train)
LogisticRegression(max_iter=450, random_state=123)

pred_lr = clf_lr.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_lr)))
Accuracy: 0.9471
print("F1: {:0.4f}".format(f1_score(y_test, pred_lr)))
F1: 0.9444
One type of ensemble method is majority voting, which combines the output of several classifiers by taking the mode of the individual predictions. It is recommended to use an odd number of classifiers: with four classifiers, for example, the votes for the positive and negative classes could be tied. We therefore need at least three classifiers and, when problem constraints allow it, five or more.
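As a quick illustration of why an even number of voters is risky (the votes below are hypothetical, not drawn from our data), four classifiers can split their votes evenly and leave the ensemble without a majority:

from collections import Counter

# Hypothetical votes from four classifiers for a single mushroom
votes = ['edible', 'poisonous', 'edible', 'poisonous']

counts = Counter(votes)
print(counts.most_common())  # both classes get 2 votes, so there is no clear winner
                             # and the ensemble must fall back on a tie-breaking rule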
There are some characteristics you need in your “crowd” for a voting ensemble to be effective. First, the ensemble needs to be diverse: you can achieve this by using different algorithms or different data sets. Second, each prediction needs to be independent of and uncorrelated with the rest. Third, each model should make its own prediction without relying on the predictions of the others. Finally, the ensemble should aggregate the individual predictions into a collective one. Keep in mind that majority voting can only be applied to classification problems.
In addition to the Naive Bayes model we trained earlier, I will bring in two additional models to participate in this majority voting scheme, namely k-Nearest Neighbors (KNN) and Decision Tree (DT). KNN is a classification algorithm in which a new data point is classified based on its similarity to a group of neighboring data points. DT is another algorithm that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility, to predict the outcome for a data point.
clf_knn = KNeighborsClassifier(n_neighbors=5)

clf_dt = DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9, random_state=RANDOM_STATE)
from mlxtend.classifier import EnsembleVoteClassifier
clf_vote = EnsembleVoteClassifier(clfs=[clf_nb, clf_knn, clf_dt], voting="hard")
clf_vote.fit(X_train, y_train)
EnsembleVoteClassifier(clfs=[GaussianNB(), KNeighborsClassifier(),
DecisionTreeClassifier(min_samples_leaf=3,
min_samples_split=9,
random_state=123)])
pred_vote = clf_vote.predict(X_test)

score_vote = f1_score(pred_vote, y_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_vote)))
Accuracy: 0.9996
print('F1-Score: {:.3f}'.format(score_vote))
F1-Score: 1.000
clf_ave = EnsembleVoteClassifier(clfs=[clf_nb, clf_lr, clf_knn, clf_dt], voting="soft")
clf_ave.fit(X_train, y_train)
EnsembleVoteClassifier(clfs=[GaussianNB(),
LogisticRegression(max_iter=450, random_state=123),
KNeighborsClassifier(),
DecisionTreeClassifier(min_samples_leaf=3,
min_samples_split=9,
random_state=123)],
voting='soft')
pred_ave = clf_ave.predict(X_test)

score_ave = f1_score(pred_ave, y_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_ave)))
Accuracy: 0.9979
print('F1-Score: {:.3f}'.format(score_ave))
F1-Score: 0.998
Model voting and averaging work by combining the predictions of already trained models. The estimators we included in the voting process are well trained enough that, in some cases, they produce decent results on their own. Voting is therefore appropriate when you already have optimized models and want to improve performance further by combining them. But what happens when you don’t have such estimators trained beforehand? That’s when “weak” estimators come into play.
“Weak” doesn’t mean a model is bad, just that it is not as strong as a highly optimized, fine-tuned model. Bootstrap aggregating (aka bagging) is the ensemble method behind powerful machine learning algorithms such as random forests; it works by combining several weak models to work on the same task. To clarify, a weak model (e.g., a single DT) is a model that works only slightly better than random guessing (approximately 50%): its error rate is below 50%, but close to it. In exchange, weak models are light in terms of space and computational requirements, and fast to train and evaluate.
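Before moving on to random forests, here is a minimal sketch of plain bagging with scikit-learn’s BaggingClassifier (my own illustration, not part of the original analysis): one hundred deliberately weak, depth-one trees, each fit on a bootstrap sample of the mushroom training set.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# A depth-1 "stump" is a deliberately weak learner
weak_tree = DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE)

# 100 stumps, each fit on a bootstrap sample of the training data;
# newer scikit-learn versions (>= 1.2) rename base_estimator to estimator
clf_bag = BaggingClassifier(base_estimator=weak_tree,
                            n_estimators=100,
                            bootstrap=True,
                            random_state=RANDOM_STATE)
clf_bag.fit(X_train, y_train)
pred_bag = clf_bag.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_bag)))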
You might be wondering how a large group of “weak” models can achieve good performance. This is the “wisdom of the crowd” at work. Do ensembles built from the same model have that potential? For this, we refer to Condorcet’s Jury Theorem: if each member of the jury (or each model, in our case) has more than a 50% probability of getting the right answer, adding more voters increases the probability that the majority decision is correct, approaching 100% as the jury grows.
The requirements for this theorem are the following: first, all the models must be independent; second, each model must perform better than random guessing; and finally, all individual models should have similar performance. If these three conditions are met, adding more models increases the probability that the ensemble is correct, and this probability tends to 1, i.e., 100%. The second and third requirements can be fulfilled by using the same “weak” model for all the estimators, since then all of them will have similar performance and be better than random guessing. Several jury theorems carry the optimistic message that, in suitable circumstances, “crowds are wise”: many individuals together (using, for instance, majority voting) tend to make good decisions, outperforming fewer individuals or just one.
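A short calculation (my own illustration) makes the theorem tangible. Suppose each independent model is correct with probability 0.6; the probability that the majority of the ensemble is correct can be computed from the binomial distribution and grows quickly with the number of models:

from scipy.stats import binom

p = 0.6  # each model is correct 60% of the time, i.e., better than guessing

for n in [1, 11, 51, 101]:
    # Probability that more than half of the n independent models are correct
    majority_correct = 1 - binom.cdf(n // 2, n, p)
    print("{:>3} models -> P(majority is correct) = {:.3f}".format(n, majority_correct))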
Let’s try the Random Forest classifier, which is an ensemble of a large number of individual decision trees that are designed to be uncorrelated through randomization. Each tree is unique and makes its own errors. Provided the requirements mentioned above are fulfilled, a group of such trees produces results that move in the right direction.
# random forest model creation
clf_rf = RandomForestClassifier(random_state=RANDOM_STATE)

clf_rf.fit(X_train, y_train)
RandomForestClassifier(random_state=123)

# predictions
pred_rfc = clf_rf.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_rfc)))
Accuracy: 1.0000
print("F1: {:0.4f}".format(f1_score(y_test, pred_rfc)))
F1: 1.0000
from sklearn import tree

fn = X.columns       # feature names
cn = df.columns[0]   # class

fig, axes = plt.subplots(nrows=1, ncols=5, figsize=(10, 2), dpi=900)
for index in range(0, 5):
    tree.plot_tree(clf_rf.estimators_[index],
                   feature_names=fn,
                   class_names=cn,
                   filled=True,
                   ax=axes[index])
    axes[index].set_title('Estimator: ' + str(index), fontsize=11)
fig.savefig('rf_5trees.png')
The ensemble methods you’ve seen so far are based on an idea known as collective learning, that is, the wisdom of the crowd. For collective learning to be efficient, the estimators need to be independent and uncorrelated. In addition, all the estimators learn the same task toward the same goal: predicting the target variable from the features. Gradual learning methods, on the other hand, are based on the principle of iterative learning, in which each subsequent model tries to fix the errors of the previous model. Gradual learning therefore creates dependent estimators, as each model builds on the knowledge of the previous one.
In gradual learning, instead of the same model being corrected in every iteration, a new model is built that tries to fix the errors of the previous one. While this approach sounds promising, you should remain vigilant: some incorrect predictions may be due to noise in the data, not because those data points are genuinely hard to predict. Gradient Boosting is another popular and powerful gradual learning ensemble model.
To explain how it works, suppose you want to predict the values of a variable (in our case, whether a mushroom is edible). On the first iteration, an initial weak estimator is fit to the dataset. Then, on each subsequent iteration, a new model is built based on the errors of the previous iteration. We repeat this process until the error is small enough that further improvement is negligible. A peculiarity of Gradient Boosting is that the individual estimators are not combined through voting or averaging, but by addition: we let each model learn from its predecessor’s mistakes.
# Call the model
clf_gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=RANDOM_STATE)

# Fit the model
clf_gbm.fit(X_train, y_train)
GradientBoostingClassifier(random_state=123)

pred_gbm = clf_gbm.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_gbm)))
Accuracy: 1.0000
print("F1: {:0.4f}".format(f1_score(y_test, pred_gbm)))
F1: 1.0000
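To watch this gradual error-fixing in action, we can reuse the fitted clf_gbm and score the ensemble after each boosting round with scikit-learn’s staged_predict (a quick check I am adding here, not part of the original write-up):

# Test accuracy of the gradient boosting ensemble after each boosting round
for i, staged_pred in enumerate(clf_gbm.staged_predict(X_test), start=1):
    if i % 20 == 0:  # print every 20th round to keep the output short
        print("Round {:>3}: accuracy = {:0.4f}".format(i, accuracy_score(y_test, staged_pred)))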
# n_estimators is the number of boosting rounds
clf_xgb = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=RANDOM_STATE, use_label_encoder=False)

clf_xgb.fit(X_train, y_train)
[23:57:05] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.5.1/src/learner.cc:1115: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.300000012,
max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=10, n_jobs=8,
num_parallel_tree=1, predictor='auto', random_state=123,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=123,
subsample=1, tree_method='exact', use_label_encoder=False,
validate_parameters=1, verbosity=None)
# predictions
pred_xgb = clf_xgb.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_xgb)))
Accuracy: 1.0000
print("F1: {:0.4f}".format(f1_score(y_test, pred_xgb)))
F1: 1.0000
The final type of ensemble model we will discuss here is model stacking. Consider a relay race, in which each sprinter runs their leg and then passes the baton to the next runner on the track. This is a good example of teamwork. While all team members must be strong competitors, each individual has a special role to play based on their abilities. First, the team has an anchor (the final runner) who knows the individual strengths and weaknesses of each team member. Second, the team members should have clearly defined tasks: each must know their responsibilities and focus on them. Finally, the anchor must participate in the race by carrying out the final leg.
Here’s a diagram depicting the architecture of stacking ensembles. Each individual model uses the same data set and input features. These are the first-layer estimators. Then, estimators pass their predictions as additional input features to the second-layer estimator.
So far, we have seen ensemble methods that use simple arithmetic operations like the mean or the mode as combiners. In stacking, however, the combiner is itself a trainable model. In addition, the combiner can take not only the first-layer predictions as input features, but also the original dataset, which allows it to learn which estimator is more accurate depending on the input features. In other words, it brings out the best in its team members to complete the task. The combiner model (aka meta learner) plays a role similar to the anchor in the relay race: it is the last team member and the one that provides the final predictions.
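Note that, by default, the mlxtend StackingClassifier used below passes only the first-layer predictions to the meta learner. If you want the meta learner to also see the original features, as described above, the library exposes a use_features_in_secondary flag; here is a minimal sketch of that variant (not the configuration used in the rest of this post):

from mlxtend.classifier import StackingClassifier

# Meta learner receives the first-layer predictions *and* the original features
clf_stack_full = StackingClassifier(classifiers=[clf_knn, clf_dt, clf_nb, clf_rf],
                                    meta_classifier=clf_lr,
                                    use_features_in_secondary=True)
clf_stack_full.fit(X_train, y_train)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, clf_stack_full.predict(X_test))))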
In this post, I will stack Knn, DT, Naive Bayes, and Random Forest classifiers together with logistic regression as the meta model (team leader).
clf_stack = StackingClassifier(classifiers=[clf_knn, clf_dt, clf_nb, clf_rf], meta_classifier=clf_lr)
clf_stack.fit(X_train, y_train)
StackingClassifier(classifiers=[KNeighborsClassifier(),
DecisionTreeClassifier(min_samples_leaf=3,
min_samples_split=9,
random_state=123),
GaussianNB(),
RandomForestClassifier(random_state=123)],
meta_classifier=LogisticRegression(max_iter=450,
random_state=123))
pred_stack = clf_stack.predict(X_test)
print("Accuracy: {:0.4f}".format(accuracy_score(y_test, pred_stack)))
Accuracy: 1.0000
print("F1: {:0.4f}".format(f1_score(y_test, pred_stack)))
F1: 1.0000
report_stack = classification_report(y_test, pred_stack)
print(report_stack)
precision recall f1-score support
0 1.00 1.00 1.00 1271
1 1.00 1.00 1.00 1167
accuracy 1.00 2438
macro avg 1.00 1.00 1.00 2438
weighted avg 1.00 1.00 1.00 2438
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, pred_stack)
array([[1271, 0],
[ 0, 1167]], dtype=int64)
params = {'kneighborsclassifier__n_neighbors': [1, 5],
          'randomforestclassifier__n_estimators': [10, 50],
          'meta_classifier__C': [0.1, 10.0]}

grid = GridSearchCV(estimator=clf_stack,
                    param_grid=params,
                    cv=5,
                    refit=True)
grid.fit(X_train, y_train)
GridSearchCV(cv=5,
estimator=StackingClassifier(classifiers=[KNeighborsClassifier(),
DecisionTreeClassifier(min_samples_leaf=3,
min_samples_split=9,
random_state=123),
GaussianNB(),
RandomForestClassifier(random_state=123)],
meta_classifier=LogisticRegression(max_iter=450,
random_state=123)),
param_grid={'kneighborsclassifier__n_neighbors': [1, 5],
'meta_classifier__C': [0.1, 10.0],
'randomforestclassifier__n_estimators': [10, 50]})
cv_keys = ('mean_test_score', 'std_test_score', 'params')

for r, _ in enumerate(grid.cv_results_['mean_test_score']):
    print("%0.3f +/- %0.2f %r"
          % (grid.cv_results_[cv_keys[0]][r],
             grid.cv_results_[cv_keys[1]][r] / 2.0,
             grid.cv_results_[cv_keys[2]][r]))
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 50}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 10}
1.000 +/- 0.00 {'kneighborsclassifier__n_neighbors': 5, 'meta_classifier__C': 10.0, 'randomforestclassifier__n_estimators': 50}
print('Best parameters: %s' % grid.best_params_)
Best parameters: {'kneighborsclassifier__n_neighbors': 1, 'meta_classifier__C': 0.1, 'randomforestclassifier__n_estimators': 10}
print('Accuracy: %.2f' % grid.best_score_)
Accuracy: 1.00
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
import matplotlib.gridspec as gridspec
import itertools
from sklearn.decomposition import PCA
pca = PCA(n_components=2)

X_np = X_train.to_numpy()
y_np = y_train.to_numpy()

X_np_reduced = pca.fit_transform(X_np)
gs = gridspec.GridSpec(2, 2)

fig = plt.figure(figsize=(10, 8))

for clf, lab, grd in zip([clf_knn, clf_dt, clf_nb, clf_stack],
                         ['KNN',
                          'Decision Tree',
                          'Naive Bayes',
                          'StackingClassifier'],
                         itertools.product([0, 1], repeat=2)):

    clf.fit(X_np_reduced, y_np)
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X_np_reduced, y=y_np, clf=clf)
    plt.title(lab)
KNeighborsClassifier()
Text(0.5, 1.0, 'KNN')
DecisionTreeClassifier(min_samples_leaf=3, min_samples_split=9,
random_state=123)
Text(0.5, 1.0, 'Decision Tree')
GaussianNB()
Text(0.5, 1.0, 'Naive Bayes')
StackingClassifier(classifiers=[KNeighborsClassifier(),
DecisionTreeClassifier(min_samples_leaf=3,
min_samples_split=9,
random_state=123),
GaussianNB(),
RandomForestClassifier(random_state=123)],
meta_classifier=LogisticRegression(max_iter=450,
random_state=123))
Text(0.5, 1.0, 'StackingClassifier')
plt.show()
To summarize: voting combines the predictions of individual models using the mode; because the mode is a categorical measure, voting can only be applied to classification. The averaging method combines the individual predictions using the mean and, in contrast to voting, can be applied to both classification and regression. Bagging uses a large number of “weak” estimators whose predictions are then aggregated by voting or averaging. Boosting is based on the iterative learning principle, in which each model attempts to fix the errors of the previous one, so the models are built sequentially. Finally, model stacking also combines individual estimators, but the combiner is itself an estimator rather than a simple operation.
Using ensemble techniques is an effective way to maximize the performance of our predictive model. However, a major drawback of this approach is model interpretability: while performance is good, explaining how the features interact to produce the result can be challenging. This is the black box problem, which arises when a model is so complex that we don’t know how it actually works. If a single model performs relatively well, going with it can be a good compromise; the simpler the model, the easier it is to explain the result to our audience. Rather than going fully technical, we should also consider the audience’s perspective to make our work accessible. As always, thank you very much for reading! Have a good one!