Feature Selection
Feature Selection is a critical step in machine learning that helps identify a dataset’s most relevant features, improving model performance, reducing overfitting, and decreasing computation time.
Sklearn offers various methods for feature selection, including statistical tests, model-based selection, and iterative approaches.
Types of Feature Selection in Sklearn
Variance Threshold
- A simple baseline technique that removes features with variance below a predefined threshold.
- Features with very low variance across samples typically contribute little to the predictive power of the model.
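A minimal sketch of VarianceThreshold on a small made-up matrix (the data and the 0.1 threshold below are illustrative assumptions):
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy feature matrix: column 0 is constant, column 2 varies only slightly
X = np.array([[0.0, 2.0, 1.1],
              [0.0, 1.5, 0.9],
              [0.0, 3.2, 1.0],
              [0.0, 2.8, 1.2]])

# Remove features whose variance falls below 0.1
selector = VarianceThreshold(threshold=0.1)
X_reduced = selector.fit_transform(X)

# Only column 1 clears the threshold
print("Kept feature indices:", selector.get_support(indices=True))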
Univariate Feature Selection
- Selects features based on univariate statistical tests.
- Commonly used methods include SelectKBest and SelectPercentile.
- Example tests:
  - f_classif: Calculates the ANOVA F-value for classification tasks.
  - chi2: For non-negative feature values in classification.
  - mutual_info_classif: Captures non-linear dependencies for classification.
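The Syntax section below demonstrates SelectKBest; as a quick sketch of the alternatives listed above, the snippet here applies SelectPercentile with mutual_info_classif (the Iris dataset and the 50% cutoff are assumptions chosen for illustration):
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Sample classification data with 4 features
X, y = load_iris(return_X_y=True)

# Keep the top 50% of features ranked by mutual information
selector = SelectPercentile(score_func=mutual_info_classif, percentile=50)
X_new = selector.fit_transform(X, y)

print("Selected Features:", selector.get_support(indices=True))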
Recursive Feature Elimination (RFE)
- Iteratively fits a model and removes the least important features, refining the subset with each iteration.
- Works best with models that provide feature importance, such as linear models or tree-based algorithms.
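A minimal RFE sketch, assuming logistic regression as the base estimator and the Iris dataset (both chosen here for illustration):
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursively drop the weakest feature until two remain
estimator = LogisticRegression(max_iter=1000)
selector = RFE(estimator, n_features_to_select=2)
selector.fit(X, y)

print("Selected Features:", selector.get_support(indices=True))
print("Feature ranking (1 = selected):", selector.ranking_)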
Sequential Feature Selection (SFS)
- Sequentially adds or removes features to optimize a performance metric (e.g., accuracy).
- Two approaches:
- Forward Selection: Starts with no features, adding one at a time.
- Backward Elimination: Starts with all features, removing one at a time.
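A sketch of forward selection with SequentialFeatureSelector (available in scikit-learn 0.24+); the k-nearest-neighbors estimator and the Iris dataset are illustrative assumptions:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features, then greedily add the one
# that most improves cross-validated accuracy until two are selected
selector = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
)
selector.fit(X, y)

print("Selected Features:", selector.get_support(indices=True))
Passing direction="backward" instead starts from the full feature set and removes one feature at a time.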
Advantages of Feature Selection
- Enhances model accuracy and efficiency.
- Reduces overfitting by removing irrelevant features.
- Simplifies model interpretability.
Syntax
Below is an example of using SelectKBest for univariate feature selection (the breast cancer dataset is used here as sample data):
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

# Loading a sample classification dataset with 30 features
X, y = load_breast_cancer(return_X_y=True)

# Applying SelectKBest to select the top 5 features with the highest ANOVA F-values
selector = SelectKBest(score_func=f_classif, k=5)
X_new = selector.fit_transform(X, y)

# Selected feature indices
selected_features = selector.get_support(indices=True)
print("Selected Features:", selected_features)
Example
Below is a complete example showcasing SelectFromModel with Lasso regression for feature selection:
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split

# Generating a synthetic regression dataset
X, y = make_regression(n_samples=100, n_features=10, noise=10, random_state=42)

# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fitting Lasso regression with stronger regularization
model = Lasso(alpha=0.5).fit(X_train, y_train)

# Use the pre-trained Lasso model for feature selection
selector = SelectFromModel(model, prefit=True)
X_train_selected = selector.transform(X_train)
X_test_selected = selector.transform(X_test)

# Selected feature indices
selected_features = selector.get_support(indices=True)
print("Selected Features:", selected_features)
The code above produces the following output:
Selected Features: [1 4 7]
Note: The exact selected features depend on the dataset and Lasso regularization strength.