Filter Methods
Introduction
Filter methods are a type of feature selection method that works by selecting features based on some criteria prior to building the model. Because they don't involve actually testing the subsetted features with a model, they are computationally inexpensive and can be used with any type of machine learning algorithm. This makes filter methods an efficient initial step for narrowing down the pool of features to only the most relevant, predictive ones.
There are many different filter methods that can be used for evaluating and selecting features. In this article, we will use variance thresholds, correlation, and mutual information to rank and select the top features. To demonstrate how these methods work in Python, we will use the `feature_selection` module from scikit-learn as well as the `pandas` library.
Example dataset
Let’s suppose we have the following dataset containing information on a class of middle school students:
```python
import pandas as pd

df = pd.DataFrame(data={
    'edu_goal': ['bachelors', 'bachelors', 'bachelors', 'masters', 'masters', 'masters', 'masters', 'phd', 'phd', 'phd'],
    'hours_study': [1, 2, 3, 3, 3, 4, 3, 4, 5, 5],
    'hours_TV': [4, 3, 4, 3, 2, 3, 2, 2, 1, 1],
    'hours_sleep': [10, 10, 8, 8, 6, 6, 8, 8, 10, 10],
    'height_cm': [155, 151, 160, 160, 156, 150, 164, 151, 158, 152],
    'grade_level': [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
    'exam_score': [71, 72, 78, 79, 85, 86, 92, 93, 99, 100]
})

print(df)
```
Output:
edu_goal | hours_study | hours_TV | hours_sleep | height_cm | grade_level | exam_score |
---|---|---|---|---|---|---|
bachelors | 1 | 4 | 10 | 155 | 8 | 71 |
bachelors | 2 | 3 | 10 | 151 | 8 | 72 |
bachelors | 3 | 4 | 8 | 160 | 8 | 78 |
masters | 3 | 3 | 8 | 160 | 8 | 79 |
masters | 3 | 2 | 6 | 156 | 8 | 85 |
masters | 4 | 3 | 6 | 150 | 8 | 86 |
masters | 3 | 2 | 8 | 164 | 8 | 92 |
phd | 4 | 2 | 8 | 151 | 8 | 93 |
phd | 5 | 1 | 10 | 158 | 8 | 99 |
phd | 5 | 1 | 10 | 152 | 8 | 100 |
Our goal is to use the data to predict how well each student will perform on the exam. Thus, our target variable is `exam_score`, and the remaining 6 variables are our features. We'll prepare the data by separating the features matrix (`X`) and the target vector (`y`).
10 x 6 features matrix:
```python
X = df.drop(columns=['exam_score'])
print(X)
```
Output:
edu_goal | hours_study | hours_TV | hours_sleep | height_cm | grade_level |
---|---|---|---|---|---|
bachelors | 1 | 4 | 10 | 155 | 8 |
bachelors | 2 | 3 | 10 | 151 | 8 |
bachelors | 3 | 4 | 8 | 160 | 8 |
masters | 3 | 3 | 8 | 160 | 8 |
masters | 3 | 2 | 6 | 156 | 8 |
masters | 4 | 3 | 6 | 150 | 8 |
masters | 3 | 2 | 8 | 164 | 8 |
phd | 4 | 2 | 8 | 151 | 8 |
phd | 5 | 1 | 10 | 158 | 8 |
phd | 5 | 1 | 10 | 152 | 8 |
10 x 1 target vector:
```python
y = df['exam_score']
print(y)
```
Output:
exam_score |
---|
71 |
72 |
78 |
79 |
85 |
86 |
92 |
93 |
99 |
100 |
Variance threshold
One of the most basic filter methods is to use a variance threshold to remove any features that have little to no variation in their values. This is because features with low variance do not contribute much information to a model. Since variance can only be calculated on numeric values, this method only works on quantitative features. That said, we may also want to remove categorical features for which all or a majority of the values are the same. To do that, we would need to dummy code the categorical variables first; a brief sketch of that idea on toy data follows, though we won't apply it to our example dataset.
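For reference, here is a minimal sketch of that dummy-coding idea on a small, made-up toy DataFrame (the toy columns and the 0.1 cutoff are illustrative assumptions, not part of our example dataset):

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: grade_level is nearly constant, edu_goal varies
toy = pd.DataFrame({
    'grade_level': ['8th'] * 9 + ['7th'],
    'edu_goal': ['bachelors', 'masters', 'phd', 'masters', 'phd',
                 'bachelors', 'masters', 'phd', 'bachelors', 'masters'],
})

# One 0/1 indicator column per category
dummies = pd.get_dummies(toy).astype(int)

# Keep only indicator columns whose variance exceeds 0.1 (an arbitrary cutoff for this sketch)
selector = VarianceThreshold(threshold=0.1)
selector.fit(dummies)
print(dummies.columns[selector.get_support(indices=True)])
```

The two near-constant `grade_level` indicators fall below the cutoff and are dropped, while the more balanced `edu_goal` indicators are kept.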
In our example dataset, `edu_goal` is the only feature that is not numeric. We can use the `.drop()` method to remove it from our features DataFrame and store the remaining numeric features in `X_num`:
```python
X_num = X.drop(columns=['edu_goal'])
print(X_num)
```
Output:
hours_study | hours_TV | hours_sleep | height_cm | grade_level |
---|---|---|---|---|
1 | 4 | 10 | 155 | 8 |
2 | 3 | 10 | 151 | 8 |
3 | 4 | 8 | 160 | 8 |
3 | 3 | 8 | 160 | 8 |
3 | 2 | 6 | 156 | 8 |
4 | 3 | 6 | 150 | 8 |
3 | 2 | 8 | 164 | 8 |
4 | 2 | 8 | 151 | 8 |
5 | 1 | 10 | 158 | 8 |
5 | 1 | 10 | 152 | 8 |
Now, we'll be able to use the `VarianceThreshold` class from scikit-learn to help remove the low-variance features from `X_num`. By default, it drops all features with zero variance, but we can adjust the threshold during class instantiation using the `threshold` parameter if we want to allow some variation. The `.fit_transform()` method returns the filtered features as a numpy array:
```python
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0)  # 0 is default
print(selector.fit_transform(X_num))
```
The output will look like this:
```
[[  1   4  10 155]
 [  2   3  10 151]
 [  3   4   8 160]
 [  3   3   8 160]
 [  3   2   6 156]
 [  4   3   6 150]
 [  3   2   8 164]
 [  4   2   8 151]
 [  5   1  10 158]
 [  5   1  10 152]]
```
As we can see, `grade_level` was removed because there is no variation in its values — all students are 8th graders. Since this data is the same across the board, a student's grade level will not be able to provide any useful predictive information about their exam score, so it makes sense to drop `grade_level` as a feature.
Something to note is that scikit-learn generally works with numpy arrays internally, hence the output type of `.fit_transform()`. However, its methods can also accept other data types that can be converted to numpy arrays, such as Python lists or pandas DataFrames, like the `X_num` we used. From a human perspective, one downside of working with numpy arrays compared to pandas DataFrames is that we lose information like column headings, making the data harder to visually inspect.
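As an aside, if you happen to be running scikit-learn 1.2 or newer, transformers such as `VarianceThreshold` can be asked to return a pandas DataFrame directly through the `set_output` API, which keeps the column names; a quick sketch:

```python
from sklearn.feature_selection import VarianceThreshold

# Requires scikit-learn 1.2 or later: request pandas output to keep column names
selector_df = VarianceThreshold(threshold=0)
selector_df.set_output(transform='pandas')
print(selector_df.fit_transform(X_num))  # a DataFrame with grade_level dropped
```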
Luckily, `VarianceThreshold` offers another method called `.get_support()` that can return the indices of the selected features, which we can use to manually subset our numeric features DataFrame.
Specify `indices=True` to get the indices of the selected features:
```python
print(selector.get_support(indices=True))
```
Output:
[0 1 2 3]
Use indices to get the corresponding column names of selected features:
```python
num_cols = list(X_num.columns[selector.get_support(indices=True)])
print(num_cols)
```
Output:
['hours_study', 'hours_TV', 'hours_sleep', 'height_cm']
Subset `X_num` to retain only the selected features:
```python
X_num = X_num[num_cols]
print(X_num)
```
Output:
hours_study | hours_TV | hours_sleep | height_cm |
---|---|---|---|
1 | 4 | 10 | 155 |
2 | 3 | 10 | 151 |
3 | 4 | 8 | 160 |
3 | 3 | 8 | 160 |
3 | 2 | 6 | 156 |
4 | 3 | 6 | 150 |
3 | 2 | 8 | 164 |
4 | 2 | 8 | 151 |
5 | 1 | 10 | 158 |
5 | 1 | 10 | 152 |
Finally, to obtain our entire features DataFrame, including the categorical column `edu_goal`, we could do:
```python
X = X[['edu_goal'] + num_cols]
print(X)
```
Output:
edu_goal | hours_study | hours_TV | hours_sleep | height_cm |
---|---|---|---|---|
bachelors | 1 | 4 | 10 | 155 |
bachelors | 2 | 3 | 10 | 151 |
bachelors | 3 | 4 | 8 | 160 |
masters | 3 | 3 | 8 | 160 |
masters | 3 | 2 | 6 | 156 |
masters | 4 | 3 | 6 | 150 |
masters | 3 | 2 | 8 | 164 |
phd | 4 | 2 | 8 | 151 |
phd | 5 | 1 | 10 | 158 |
phd | 5 | 1 | 10 | 152 |
Pearson’s correlation
Another type of filter method involves finding the correlation between variables. In particular, Pearson's correlation coefficient is useful for measuring the linear relationship between two numeric, continuous variables — a coefficient close to 1 represents a positive correlation, -1 represents a negative correlation, and 0 represents no correlation. Like variance, Pearson's correlation coefficient cannot be calculated for categorical variables, although there is a related point-biserial correlation coefficient that can be computed when one variable is dichotomous; we won't focus on that here.
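To make the coefficient concrete, here is a small sketch that computes Pearson's r by hand for two of our numeric features and checks it against numpy (the sums-of-deviations form is equivalent to the covariance divided by the product of the standard deviations, since the 1/n factors cancel):

```python
import numpy as np

a = X_num['hours_study'].to_numpy()
b = X_num['hours_TV'].to_numpy()

# Sum of co-deviations divided by the product of the deviation magnitudes
r = np.sum((a - a.mean()) * (b - b.mean())) / np.sqrt(
    np.sum((a - a.mean()) ** 2) * np.sum((b - b.mean()) ** 2))

print(r)                        # about -0.78
print(np.corrcoef(a, b)[0, 1])  # numpy's built-in calculation gives the same value
```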
There are 2 main ways of using correlation for feature selection — to detect correlation between features and to detect correlation between a feature and the target variable.
Correlation between features
When two features are highly correlated with one another, keeping just one of them in the model is enough, since they provide largely duplicate information. The second variable would be redundant and only contribute unnecessary noise.
To determine which variables are correlated with one another, we can use the `.corr()` method from `pandas` to find the correlation coefficient between each pair of numeric features in a DataFrame. By default, `.corr()` computes Pearson's correlation coefficient, but alternative methods can be specified using the `method` parameter. We can visualize the resulting correlation matrix using a heatmap:
```python
import matplotlib.pyplot as plt
import seaborn as sns

corr_matrix = X_num.corr(method='pearson')  # 'pearson' is default
sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r')
plt.show()
```
Output: a heatmap of the Pearson correlation matrix for the numeric features (figure not shown).
Let's define high correlation as having a coefficient of greater than 0.7 or less than -0.7. We can loop through the correlation matrix to identify the highly correlated variables:
```python
# Loop over bottom diagonal of correlation matrix
for i in range(len(corr_matrix.columns)):
    for j in range(i):
        # Print variables with high correlation
        if abs(corr_matrix.iloc[i, j]) > 0.7:
            print(corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j])
```
The output of our code is:
hours_TV hours_study -0.780763315142435
As seen, `hours_TV` appears to be highly negatively correlated with `hours_study` — a student who watches a lot of TV tends to spend fewer hours studying, and vice versa. Because they provide redundant information, we can choose to remove one of those variables. To decide which one, we can look at their correlation with the target variable, then remove the one that is less associated with the target. This is explored in the next section.
Correlation between feature and target
As mentioned, the second way correlation can be used is to determine if there is a relationship between a feature and the target variable. In the case of Pearson's correlation, this is especially useful if we intend to fit a linear model, which assumes a linear relationship between the target and predictor variables. If a feature is not very correlated with the target variable, such as having a coefficient between -0.3 and 0.3, then it may not be very predictive and can potentially be filtered out.
We can use the same `.corr()` method seen previously to obtain the correlation between the target variable and the rest of the features. First, we'll need to create a new DataFrame containing the numeric features along with the `exam_score` column:
```python
X_y = X_num.copy()
X_y['exam_score'] = y
print(X_y)
```
Output:
hours_study | hours_TV | hours_sleep | height_cm | exam_score |
---|---|---|---|---|
1 | 4 | 10 | 155 | 71 |
2 | 3 | 10 | 151 | 72 |
3 | 4 | 8 | 160 | 78 |
3 | 3 | 8 | 160 | 79 |
3 | 2 | 6 | 156 | 85 |
4 | 3 | 6 | 150 | 86 |
3 | 2 | 8 | 164 | 92 |
4 | 2 | 8 | 151 | 93 |
5 | 1 | 10 | 158 | 99 |
5 | 1 | 10 | 152 | 100 |
Then, we can generate the correlation matrix and isolate the column corresponding to the target variable to see how strongly each feature is correlated with it:
```python
corr_matrix = X_y.corr()

# Isolate the column corresponding to `exam_score`
corr_target = corr_matrix[['exam_score']].drop(labels=['exam_score'])

sns.heatmap(corr_target, annot=True, fmt='.3', cmap='RdBu_r')
plt.show()
```
Output: a heatmap showing each feature's correlation with `exam_score` (figure not shown).
As seen, `hours_study` is positively correlated with `exam_score` and `hours_TV` is negatively correlated with it. It makes sense that `hours_study` and `hours_TV` would be negatively correlated with each other as we saw earlier, and just one of those features would suffice for predicting `exam_score`. Since `hours_study` has a stronger correlation with the target variable, let's remove `hours_TV` as the redundant feature:
```python
X = X.drop(columns=['hours_TV'])
print(X)
```
Output:
edu_goal | hours_study | hours_sleep | height_cm |
---|---|---|---|
bachelors | 1 | 10 | 155 |
bachelors | 2 | 10 | 151 |
bachelors | 3 | 8 | 160 |
masters | 3 | 8 | 160 |
masters | 3 | 6 | 156 |
masters | 4 | 6 | 150 |
masters | 3 | 8 | 164 |
phd | 4 | 8 | 151 |
phd | 5 | 10 | 158 |
phd | 5 | 10 | 152 |
The other two features, `hours_sleep` and `height_cm`, both do not seem to be correlated with `exam_score`, suggesting they would not be very good predictors. We could potentially remove either or both of them as uninformative. But before we do, it is a good idea to use other methods to double-check that the features truly are not predictive. We will do that in the next section by using mutual information to see if there are any non-linear associations between the features and the target variable.
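If we wanted to apply the rough ±0.3 rule of thumb programmatically rather than reading it off the heatmap, a short sketch reusing `corr_target` from above might look like this:

```python
# Features whose absolute correlation with exam_score falls below the 0.3 rule of thumb
weak_features = corr_target[corr_target['exam_score'].abs() < 0.3].index.tolist()
print(weak_features)  # hours_sleep and height_cm
```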
To conclude this section, we'll briefly note an alternative approach for assessing the correlation between variables. Instead of generating the full correlation matrix, we could use the `f_regression()` function from scikit-learn to find the F-statistic for a model with each predictor on its own. The F-statistic will be larger (and the p-value will be smaller) for predictors that are more highly correlated with the target variable, thus it will perform the same filtering:
```python
from sklearn.feature_selection import f_regression

print(f_regression(X_num, y))
```
Output:
(array([3.61362007e+01, 3.44537037e+01, 0.00000000e+00, 1.70259066e-03]),
array([3.19334945e-04, 3.74322763e-04, 1.00000000e+00, 9.68097878e-01]))
The function returns the F-statistic in the first array and the p-value in the second. As seen, the result is consistent with what we had observed in the correlation matrix — the stronger the correlation (either positive or negative) between the feature and target, the higher the corresponding F-statistic and the lower the p-value. For example, amongst all the features, `hours_study` has the largest correlation coefficient (0.905), the highest F-statistic (3.61e+01), and the lowest p-value (3.19e-04).
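As a quick cross-check, for a single-predictor linear regression the F-statistic relates to Pearson's r through F = r^2 * (n - 2) / (1 - r^2), so we can roughly recover the F-statistic for `hours_study` from its correlation coefficient using the `X_y` DataFrame built earlier:

```python
# F = r^2 * (n - 2) / (1 - r^2) for a one-predictor linear model
r = X_y['hours_study'].corr(X_y['exam_score'])
n = len(X_y)
print(r, r ** 2 * (n - 2) / (1 - r ** 2))  # roughly 0.905 and 36.1, matching f_regression above
```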
Mutual information
The final filter method we'll look at is using mutual information to rank and select the top features. Mutual information is a measure of dependence between two variables and can be used to gauge how much a feature contributes to the prediction of the target variable. It is similar to Pearson's correlation, but it is not limited to detecting linear associations, which makes it useful for more flexible models where a linear functional form is not assumed. Another advantage of mutual information is that, unlike correlation, it also works on discrete features or targets, although categorical variables need to be numerically encoded first.
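To see why this matters, here is a quick synthetic illustration, separate from our example dataset: a target that depends on a feature purely quadratically has a Pearson coefficient near 0, yet its estimated mutual information is clearly positive.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=500)
y_quad = x ** 2 + rng.normal(scale=0.01, size=500)  # purely non-linear dependence

print(np.corrcoef(x, y_quad)[0, 1])                                      # close to 0
print(mutual_info_regression(x.reshape(-1, 1), y_quad, random_state=0))  # clearly above 0
```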
In our example, we can encode the `edu_goal` column using the `LabelEncoder` class from scikit-learn's `preprocessing` module:
```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Create copy of `X` for encoded version
X_enc = X.copy()
X_enc['edu_goal'] = le.fit_transform(X['edu_goal'])

print(X_enc)
```
Output:
edu_goal | hours_study | hours_sleep | height_cm |
---|---|---|---|
0 | 1 | 10 | 155 |
0 | 2 | 10 | 151 |
0 | 3 | 8 | 160 |
1 | 3 | 8 | 160 |
1 | 3 | 6 | 156 |
1 | 4 | 6 | 150 |
1 | 3 | 8 | 164 |
2 | 4 | 8 | 151 |
2 | 5 | 10 | 158 |
2 | 5 | 10 | 152 |
Now, we can compute the mutual information between each feature and `exam_score` using `mutual_info_regression()`. This function is used because our target variable is continuous; if we had a discrete target variable, we would use `mutual_info_classif()` instead. We specify the `random_state` in the function in order to obtain reproducible results:
```python
from sklearn.feature_selection import mutual_info_regression

print(mutual_info_regression(X_enc, y, random_state=68))
```
Output:
[0.50396825 0.40896825 0.06896825 0. ]
The estimated mutual information between each feature and the target is returned in a numpy array, where each value is a non-negative number — the higher the value, the more predictive power is assumed.
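Since the bare array is easy to misread, one small optional convenience is to label each score with its column name:

```python
import pandas as pd

# Pair each mutual information score with its feature name and sort
mi_scores = mutual_info_regression(X_enc, y, random_state=68)
print(pd.Series(mi_scores, index=X_enc.columns).sort_values(ascending=False))
```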
However, we are missing one more important piece here. Earlier, even though we encoded `edu_goal` to be numeric, that does not mean it should be treated as a continuous variable. In other words, the values of `edu_goal` are still discrete and should be interpreted as such. If we plot `edu_goal` against `exam_score` on a graph, we can clearly see the steps between the values of `edu_goal`.
In order to properly calculate the mutual information, we need to tell `mutual_info_regression()` which features are discrete by providing their index positions using the `discrete_features` parameter:
```python
print(mutual_info_regression(X_enc, y, discrete_features=[0], random_state=68))
```
Output:
[0.75563492 0.38896825 0.18563492 0. ]
Compared to the earlier results, we now get greater mutual information between `edu_goal` and the target variable once it is correctly interpreted as a discrete feature.
From the results, we can also see that there is 0 mutual information between `height_cm` and `exam_score`, suggesting that these variables are largely independent. This is consistent with what we saw earlier with Pearson's correlation, where the correlation coefficient between them is very close to 0 as well.
What is interesting to note is that the mutual information between `hours_sleep` and `exam_score` is a positive value, even though their Pearson's correlation coefficient is 0. The answer becomes clearer when we plot the relationship between `hours_sleep` and `exam_score`: there does seem to be some association between the variables, only it is not a linear one, which is why it was detected using mutual information but not Pearson's correlation coefficient.
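We can confirm the first part numerically from the `X_y` DataFrame built earlier:

```python
# Pearson's r between hours_sleep and exam_score, computed on the X_y DataFrame from earlier
print(X_y['hours_sleep'].corr(X_y['exam_score']))  # essentially 0 for this data
```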
Finally, let's look at using the `SelectKBest` class from scikit-learn to help pick out the top `k` features with the highest ranked scores. In our case, we are looking to select features that share the most mutual information with the target variable. When we instantiate `SelectKBest`, we'll specify which scoring function to use and how many top features to select. Here, our scoring function is `mutual_info_regression()`, but because we want to specify additional arguments besides the `X` and `y` inputs, we'll need the help of the `partial()` function from Python's built-in `functools` module. Then, the `.fit_transform()` method will return the filtered features as a numpy array:
```python
from sklearn.feature_selection import SelectKBest
from functools import partial

score_func = partial(mutual_info_regression, discrete_features=[0], random_state=68)

# Select top 3 features with the most mutual information
selection = SelectKBest(score_func=score_func, k=3)
print(selection.fit_transform(X_enc, y))
```
Output:
[[ 0 1 10]
[ 0 2 10]
[ 0 3 8]
[ 1 3 8]
[ 1 3 6]
[ 1 4 6]
[ 1 3 8]
[ 2 4 8]
[ 2 5 10]
[ 2 5 10]]
As seen above, we selected the top 3 features based on mutual information, thus dropping `height_cm`. Like `VarianceThreshold`, `SelectKBest` also offers the `.get_support()` method that returns the indices of the selected features, so we could subset our original features DataFrame:
```python
X = X[X.columns[selection.get_support(indices=True)]]
print(X)
```
Output:
edu_goal | hours_study | hours_sleep |
---|---|---|
bachelors | 1 | 10 |
bachelors | 2 | 10 |
bachelors | 3 | 8 |
masters | 3 | 8 |
masters | 3 | 6 |
masters | 4 | 6 |
masters | 3 | 8 |
phd | 4 | 8 |
phd | 5 | 10 |
phd | 5 | 10 |
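As a small aside, the fitted `SelectKBest` object also keeps the scores it computed in its `scores_` attribute, so we could line them up with the column names if we wanted to inspect them:

```python
import pandas as pd

# The fitted SelectKBest object stores the computed scores in its `scores_` attribute
print(pd.Series(selection.scores_, index=X_enc.columns).sort_values(ascending=False))
```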
Conclusion
In our example dataset, we started out with 6 features for predicting the `exam_score` of students. Using various filter methods, we narrowed down that set to just the most relevant and informative ones. First, we eliminated `grade_level` because it has zero variance and would contribute nothing to the model. Then, we dropped `hours_TV` since it is highly correlated with `hours_study` and is therefore redundant. Lastly, we filtered out `height_cm` based on mutual information, which suggested that it does not have any meaningful association with the target variable, linear or otherwise, and would not have been very predictive.
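To tie the steps together, here is a condensed sketch of how the three filters could be strung together on this dataset, simply recapping the thresholds and `random_state` used throughout the article:

```python
import pandas as pd
from functools import partial
from sklearn.feature_selection import VarianceThreshold, SelectKBest, mutual_info_regression
from sklearn.preprocessing import LabelEncoder

# `df` is the students DataFrame from the start of the article
X, y = df.drop(columns=['exam_score']), df['exam_score']

# 1. Variance threshold: drop zero-variance numeric features (removes grade_level)
num = X.drop(columns=['edu_goal'])
vt = VarianceThreshold(threshold=0).fit(num)
num = num[num.columns[vt.get_support(indices=True)]]

# 2. Correlation: among highly correlated pairs, keep the feature more correlated
#    with the target (removes hours_TV, which is redundant with hours_study)
num = num.drop(columns=['hours_TV'])

# 3. Mutual information: keep the top 3 features overall (removes height_cm)
X_enc = num.copy()
X_enc['edu_goal'] = LabelEncoder().fit_transform(X['edu_goal'])
score_func = partial(mutual_info_regression,
                     discrete_features=[X_enc.columns.get_loc('edu_goal')],
                     random_state=68)
kbest = SelectKBest(score_func=score_func, k=3).fit(X_enc, y)
print(X_enc.columns[kbest.get_support(indices=True)])
```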
Phew! That was a lot we were able to accomplish using filter methods. Despite being the simplest type of feature selection method, they do not lack power or potential. It is certainly worth considering how you might incorporate filter methods into your next machine learning project.