EDA Prior To Fitting a Regression Model

Codecademy Team
Learn about recommended EDA steps before fitting a regression model.

Introduction

Before fitting any model, it is often important to conduct an exploratory data analysis (EDA) in order to check assumptions, inspect the data for anomalies (such as missing, duplicated, or mis-coded data), and inform feature selection/transformation. In this article, we will use pandas to explore some of the EDA techniques that are generally employed prior to fitting a regression model.

The data

For our example analysis, we’ve downloaded a dataset from Kaggle which contains data on Major League Baseball (MLB) games from the 2016 season. We’ve saved this data as a DataFrame named bb. Suppose that we want to fit a linear regression to predict attendance using the following predictors:

  • game_type — is the game during the day or at night?
  • day_of_week — what day of the week did the game occur?
  • temperature — average game temperature (Fahrenheit).
  • sky — description of sky condition at the time of the game.
  • total_runs — total runs scored in the game.

Preview the dataset

Any EDA process will probably begin by inspecting a subset of data. For a pandas DataFrame, this can be done by using the .head() method:

bb.head()
   attendance   game_type day_of_week temperature       sky total_runs
0     40030.0  Night Game      Sunday          74     Sunny          7
1     21621.0  Night Game   Wednesday          55  Overcast          5
2     12622.0  Night Game   Wednesday          48   Unknown          6
3     18531.0  Night Game   Wednesday          65    Cloudy          4
4     18572.0    Day Game   Wednesday          77   In Dome          7

By looking at the first few rows of the data, we can often figure out what kind of data we have (e.g., discrete or continuous) and get a sense of how they are coded. For example, we can see that attendance, temperature, and total_runs are numbers, while game_type, day_of_week, and sky appear to be text.

After our initial inspection, we’ll want to dig deeper to investigate the following:

  • The data type of each variable.
  • How discrete/categorical data is coded (and whether we need to make any changes).
  • How the data are scaled.
  • Whether there is missing data and how it is coded.
  • Whether there are outliers.
  • The distributions of continuous features.
  • The relationships between pairs of features.

Data types

It is important to check the data type for each feature. The quantitative variables should be read in as numbers — either int64 or float64 — and categorical variables should be stored as strings (columns of strings have a dtype of object because of how they are stored in Python). We can check data types of columns in a pandas DataFrame using the .dtypes property.

bb.dtypes

The output will look like this:

attendance     float64
game_type       object
day_of_week     object
temperature     object
sky             object
total_runs      object
dtype: object

From this output, we can see that temperature and total_runs are both quantitative variables but they were read in as object dtypes. This can happen when there is a non-numeric character — such as a letter or punctuation symbol — in the same column. We would need to explore further in order to figure out what’s going on. For example, we might inspect a different set of rows and see the following:

   attendance   game_type day_of_week temperature      sky total_runs
5     12757.0  Night Game     Tuesday          72  In Dome          5
6     28329.0  Night Game     Tuesday          70  Unknown          -
7     26049.0  Night Game     Tuesday     Unknown    Sunny         11
8     10478.0  Night Game      Tuesda          70      NaN          9
9     47820.0    Day Game     Tuesday          36    Sunny          8

We note that temperature has missing data coded as Unknown rather than NaN, and total_runs uses - for the same purpose. If we fit a regression on this data as is, temperature will be treated as a categorical variable, and the model will estimate a separate coefficient for every distinct temperature value; instead, we probably want a single slope. To fix this, we’ll need to replace every Unknown with some other value (or remove those rows from the data altogether) and convert the temperature column to a numeric type (int, or float if any NaN values remain).
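One possible way to do this conversion is sketched below. It assumes the only non-numeric entries are placeholders such as Unknown and -, and it uses a small made-up DataFrame (bb_demo, a hypothetical stand-in) rather than the real bb data. pd.to_numeric with errors="coerce" turns anything it cannot parse into NaN, so it should be used only after confirming that the unparseable values really are missing-data markers.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of bb; not the real dataset.
bb_demo = pd.DataFrame({
    "temperature": ["72", "70", "Unknown", "70", "36"],
    "total_runs": ["5", "-", "11", "9", "8"],
})

# errors="coerce" converts any value that cannot be parsed as a number to NaN.
for col in ["temperature", "total_runs"]:
    bb_demo[col] = pd.to_numeric(bb_demo[col], errors="coerce")

print(bb_demo.dtypes)
print(bb_demo)
```

Because NaN is a floating-point value, the converted columns end up as floats rather than ints unless the missing rows are dropped or filled first.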

Categorical encoding

EDA is also important during the feature engineering process in order to inform decisions around categorical encoding. Categorical features with many levels are “expensive” to include in a regression model, because we need to estimate a separate coefficient for each level beyond the reference level. If one of the levels has only a few observations, we might want to combine it with another level or delete those records from the data before fitting the model. We can check the number of observations per level using .value_counts():

bb['game_type'].value_counts(dropna=False)

Output:

Night Game    1664
Day Game       799
Name: game_type, dtype: int64

Based on the output, we can see here that there are two levels for game_type; about one-third of games are day games and two-thirds are night games.

The .value_counts() method can also illuminate other issues. For example, in the following output, we notice that one instance of Tuesday was miscoded as Tuesda. This can either be corrected or removed before proceeding with a regression model.

bb['day_of_week'].value_counts(dropna=False)

Output:

Saturday     396
Friday       394
Sunday       392
Wednesday    379
Tuesday      375
Monday       278
Thursday     248
Tuesda         1
Name: day_of_week, dtype: int64
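
If we decide to correct the miscoded label rather than drop the row, one simple option is .replace() with an explicit mapping. The sketch below uses a short made-up Series (day_demo, a hypothetical name) instead of the real column:

```python
import pandas as pd

# Hypothetical example values; not the real day_of_week column.
day_demo = pd.Series(["Tuesday", "Tuesda", "Friday", "Tuesday"], name="day_of_week")

# Map each known typo to its intended label; other values are left unchanged.
day_fixed = day_demo.replace({"Tuesda": "Tuesday"})

print(day_fixed.value_counts(dropna=False))
```

An explicit mapping keeps the correction auditable: only the typos we have actually seen are changed.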

There are a few different options for how we might want to code the day_of_week variable. If attendance increases approximately linearly throughout the week, we might argue that day_of_week is ordinal and code it as an int in our model. However, if attendance goes up and down throughout the week, we’re better off leaving it as an unordered category (str). Finally, if we see that games on Friday-Sunday simply have higher attendance than other days of the week, we might re-code this feature to only have two levels: Weekend and Weekday. We can check this by using boxplots:

Boxplot of attendance by day of the week where weekend values are higher than weekday values

We can see here that attendance on Friday, Saturday, and Sunday is on average higher than the other days of the week. Therefore it may be beneficial to re-code this feature to either Weekend or Weekday.
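
A minimal sketch of that re-coding follows. It uses made-up attendance figures, and the column name day_type is a hypothetical choice. The group medians at the end are a numeric counterpart to comparing the boxes in the plot.

```python
import numpy as np
import pandas as pd

# Made-up illustrative values; not taken from the real dataset.
demo = pd.DataFrame({
    "day_of_week": ["Monday", "Friday", "Saturday", "Sunday", "Wednesday", "Thursday"],
    "attendance": [21000.0, 33000.0, 38000.0, 36000.0, 22000.0, 24000.0],
})

# Following the article, Friday through Sunday count as the weekend.
weekend_days = ["Friday", "Saturday", "Sunday"]
demo["day_type"] = np.where(demo["day_of_week"].isin(weekend_days), "Weekend", "Weekday")

# Median attendance for each of the two new levels.
medians = demo.groupby("day_type")["attendance"].median()
print(medians)
```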

Scaling

For quantitative features, it is important to think about how each feature is scaled. Some features will be on vastly different scales than others just based on the nature of what the feature is measuring. For example, let’s look at temperature and total_runs using the .describe() method:

bb.describe()

The output will look like this:

         attendance  temperature   total_runs
count   2457.000000  2457.000000  2457.000000
mean   30380.462352    73.834959     8.949187
std     9874.626652    10.567219     4.579542
min     8766.000000    31.000000     1.000000
25%    22437.000000    67.000000     6.000000
50%    30628.000000    74.000000     8.000000
75%    38412.000000    81.000000    12.000000
max    54449.000000   101.000000    60.000000

These two features are on different scales because they measure different things (temperature is in degrees Fahrenheit, while total_runs is the number of runs scored in a game). As a result, their ranges and standard deviations differ: temperature has a standard deviation of about 10.57, while total_runs has a standard deviation of about 4.58.

When features are on very different scales, it is often a good idea to standardize them so that each has a mean of 0 and a standard deviation of 1.

A feature without any values close to zero may also make it more difficult to estimate and interpret the intercept of a regression model. Standardizing or otherwise re-scaling the feature can fix this issue.
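
As a rough sketch, standardizing can be done directly in pandas by subtracting each column’s mean and dividing by its standard deviation. The DataFrame scale_demo below is a hypothetical stand-in containing only a few values:

```python
import pandas as pd

# Small illustrative sample; a stand-in for the numeric columns of bb.
scale_demo = pd.DataFrame({
    "temperature": [74.0, 55.0, 48.0, 65.0, 77.0],
    "total_runs": [7.0, 5.0, 6.0, 4.0, 7.0],
})

# z-score: subtract the column mean, then divide by the column standard deviation.
# pandas .std() uses the sample standard deviation (ddof=1) by default.
standardized = (scale_demo - scale_demo.mean()) / scale_demo.std()

print(standardized.describe().loc[["mean", "std"]])
```

After this transformation, a value of 0 corresponds to the feature’s mean, which also makes the regression intercept easier to interpret.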

Missing data

When we initially inspected the data, we saw some evidence that missing data is coded in a few different ways:

   attendance   game_type day_of_week temperature      sky total_runs
5     12757.0  Night Game     Tuesday          72  In Dome          5
6     28329.0  Night Game     Tuesday          70  Unknown          -
7     26049.0  Night Game     Tuesday     Unknown    Sunny         11
8     10478.0  Night Game      Tuesda          70      NaN          9
9     47820.0    Day Game     Tuesday          36    Sunny          8

For example, temperature uses the term Unknown, sky uses both Unknown and NaN, and total_runs has - to represent a missing value. The observations with missing values will either have to be removed or replaced (with an imputed value or a missing-data marker that pandas can recognize, such as np.nan) in order to proceed with fitting a regression model.
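
One way to unify these codes, sketched on a made-up frame (missing_demo, a hypothetical name) that mirrors the rows above, is to replace each placeholder with np.nan and then count the missing values per column:

```python
import numpy as np
import pandas as pd

# Values mirror rows 5-9 shown above; the frame itself is a stand-in for bb.
missing_demo = pd.DataFrame({
    "temperature": ["72", "70", "Unknown", "70", "36"],
    "sky": ["In Dome", "Unknown", "Sunny", np.nan, "Sunny"],
    "total_runs": ["5", "-", "11", "9", "8"],
})

# Treat both placeholder strings as missing everywhere in the frame.
cleaned = missing_demo.replace(["Unknown", "-"], np.nan)

# Number of missing values per column after unifying the codes.
print(cleaned.isna().sum())
```

This sketch assumes Unknown should count as missing in every column. If Unknown were a meaningful category for sky, we would restrict the replacement to the other columns.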

Outliers

In our EDA, it is important to check for outliers and skew in the data. One way to check for outliers is to use scatter plots:

bb.plot.scatter(x='total_runs', y='attendance')

Scatter plot where the points are compressed to the left side with one extreme value on the right side

We can see here that there is one game in which the total number of runs is about 60, which is much larger than in the other games. Depending on the situation, we may first want to verify that this value is correct; we can then decide whether or not to remove it prior to fitting the model.
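
A scatter plot is a visual check. As a complementary numeric check, which is not part of the analysis above, we could flag values outside the common 1.5 × IQR fences. The sketch below uses a short made-up sample that includes one extreme value:

```python
import pandas as pd

# Made-up sample of total runs with one extreme value.
runs_demo = pd.Series([7, 5, 6, 4, 7, 11, 9, 8, 60], name="total_runs")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = runs_demo.quantile([0.25, 0.75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are candidates for closer inspection, not automatic removal.
flagged = runs_demo[(runs_demo < lower_fence) | (runs_demo > upper_fence)]
print(flagged)
```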

Distributions and associations

Prior to fitting a linear regression model, it can be important to inspect the distributions of quantitative features and investigate the relationships between features. We can visually inspect both of these by using a pair plot:

Pair plot between attendance, temperature, and total runs

Looking at the histograms along the diagonal, total_runs appears to be somewhat right-skewed. This indicates that we may want to transform this feature to make it more normally distributed.
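
To put a number on the skew, pandas offers the .skew() method. The sketch below compares the skewness of a made-up right-skewed sample before and after a log transform. The log transform (np.log1p) is just one candidate and is an assumption here, not a step taken in this analysis.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample standing in for total_runs.
runs_demo = pd.Series([4, 5, 6, 6, 7, 8, 9, 11, 14, 24], name="total_runs")

# log1p computes log(1 + x), which stays defined when x is 0.
log_runs = np.log1p(runs_demo)

# Positive skewness indicates a longer right tail.
print("raw skew:", runs_demo.skew())
print("log skew:", log_runs.skew())
```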

We can explore the relationships between pairs of features by looking at the scatterplots off the diagonal. This is useful for a few different reasons. For example, if we see non-linear associations between any of the predictors and the outcome variable, that might lead us to test out polynomial terms in our model. We can also get a sense of which features are most highly related to our outcome variable and check for collinearity. In this example, there appears to be a slight positive linear association between temperature and the total number of runs. We can further investigate this using a heat map of the correlation matrix:

Map of the correlation between temperature, total number of runs, and attendance where stronger correlation is more darkly colored

There is a correlation of 0.35 between temperature and the total number of runs. This is not large enough to cause concern; however, if two or more predictors are highly correlated, we may consider leaving only one in our analysis. On the other hand, features that are highly correlated with our outcome variable are especially important to include in the model.
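
The numbers behind a heat map like this come from a correlation matrix, which pandas computes with .corr(). The sketch below uses only the five rows shown in the earlier preview (as corr_demo, a hypothetical name), so its values will not match the 0.35 reported for the full dataset:

```python
import pandas as pd

# The five preview rows shown earlier; far too few for reliable estimates.
corr_demo = pd.DataFrame({
    "attendance": [40030.0, 21621.0, 12622.0, 18531.0, 18572.0],
    "temperature": [74, 55, 48, 65, 77],
    "total_runs": [7, 5, 6, 4, 7],
})

# Pairwise Pearson correlations between the numeric columns.
corr_matrix = corr_demo.corr()
print(corr_matrix.round(2))
```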

Conclusion

Let’s review the ways we were able to explore this data set in preparation for a regression model:

  • We previewed the first few rows of the data set using the .head() method.
  • We checked the data type of each variable in the data set using .dtypes and identified variables that need to be converted to a different data type.
  • We investigated our categorical data to inform categorical encoding.
  • We investigated the scale of our quantitative variables and considered whether standardizing/scaling might be appropriate.
  • We investigated missing data.
  • We checked for outliers.
  • We inspected the distributions of our quantitative variables.
  • We looked at the relationships between pairs of features using both scatter plots and box plots.

By going through these steps, we are more prepared to make decisions about feature selection/engineering and have learned valuable information about how to build a more accurate predictive model.