Ensembles
Ensembles are machine learning techniques that combine the predictions of multiple models to improve accuracy, robustness, and reliability in classification and regression tasks. Scikit-learn provides tools for building such ensembles; two of the most common techniques are bagging and boosting.
Bagging (Bootstrap Aggregating)
Bagging refers to training multiple models in parallel on different subsets of the data generated using bootstrapping or random sampling with replacement. The predictions from the models are combined.
This approach reduces variance and helps prevent overfitting. Popular algorithms based on bagging include Random Forest and `BaggingClassifier`.
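Random Forest is the canonical bagging algorithm: each tree is trained on a bootstrap sample of the data. A minimal sketch using scikit-learn's `RandomForestClassifier` on the Iris dataset (the dataset choice here is illustrative):

```python
# A minimal sketch: Random Forest applies bagging to decision trees
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.3, random_state=42
)

# Each of the 100 trees is fit on a bootstrap sample of the training data;
# predictions are combined by majority vote across the trees
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print(f"Test accuracy: {forest.score(X_test, y_test):.2f}")
```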
Boosting
Boosting creates models sequentially, where each new model corrects the mistakes of the previous one by focusing on the harder instances that the former model failed to predict. Well-known boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
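To illustrate the sequential idea, here is a minimal sketch using scikit-learn's `AdaBoostClassifier` on the Iris dataset (the dataset and hyperparameters are illustrative choices, not prescribed by this article):

```python
# A minimal sketch of boosting with AdaBoostClassifier on the Iris dataset
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.3, random_state=42
)

# Estimators are trained one after another; each successive estimator
# gives more weight to the samples the previous ones misclassified
boost = AdaBoostClassifier(n_estimators=50, random_state=42)
boost.fit(X_train, y_train)
print(f"Test accuracy: {boost.score(X_test, y_test):.2f}")
```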
Syntax
Sklearn offers the `BaggingClassifier` class for performing classification tasks:
```python
BaggingClassifier(estimator=None, n_estimators=10, max_samples=1.0, max_features=1.0, bootstrap=True, bootstrap_features=False, oob_score=False, warm_start=False, n_jobs=None, random_state=None, verbose=0)
```
- `estimator` (default=`None`): The base estimator to fit on random subsets of the dataset. If `None`, the algorithm uses a decision tree as the default estimator.
- `n_estimators` (int, default=`10`): Number of estimators in the ensemble.
- `max_samples` (float, default=`1.0`): The fraction of samples used to fit each estimator; must be between `0` and `1`.
- `max_features` (float, default=`1.0`): The fraction of features used to fit each estimator; must be between `0` and `1`.
- `bootstrap` (bool, default=`True`): Whether to use bootstrap sampling (sampling with replacement) when creating the dataset for each estimator.
- `bootstrap_features` (bool, default=`False`): Whether to sample features with replacement for each estimator.
- `oob_score` (bool, default=`False`): Whether to use out-of-bag samples to estimate the generalization error.
- `warm_start` (bool, default=`False`): If `True`, calling `fit` adds more estimators to the existing ensemble instead of starting from scratch.
- `n_jobs` (int, default=`None`): The number of jobs to run in parallel when fitting the base estimators. `None` means `1` core; `-1` uses all available cores.
- `random_state` (int, default=`None`): Controls the randomness of the estimator fitting process, ensuring reproducibility.
- `verbose` (int, default=`0`): Controls the verbosity of the fitting process; higher values produce more detailed output.
Example
This example code demonstrates the use of `BaggingClassifier` to build an ensemble of decision trees and examine its performance on the Iris dataset:
```python
# Import all the necessary libraries
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize a BaggingClassifier with a DecisionTreeClassifier as the base estimator
bagging_clf = BaggingClassifier(
  estimator=DecisionTreeClassifier(),
  n_estimators=50,
  max_samples=0.8,
  max_features=0.8,
  bootstrap=True,
  random_state=42
)

# Train the BaggingClassifier
bagging_clf.fit(X_train, y_train)

# Predict on the test set
y_pred = bagging_clf.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
```
The code results in the following output:
Accuracy: 1.00
Codebyte Example
This example demonstrates the use of a `VotingClassifier` to combine multiple classifiers (Decision Tree, Support Vector Classifier, and K-Nearest Neighbors) for a classification task on the Iris dataset: