# EDA Prior To Fitting a Regression Model

Learn about recommended EDA steps before fitting a regression model.### Introduction

Before fitting any model, it is often important to conduct an exploratory data analysis (EDA) in order to check assumptions, inspect the data for anomalies (such as missing, duplicated, or mis-coded data), and inform feature selection/transformation. In this article, we will use pandas to explore some of the EDA techniques that are generally employed prior to fitting a regression model.

## The data

For our example analysis, we’ve downloaded a dataset from Kaggle which contains data on Major League Baseball (MLB) games from the 2016 season. We’ve saved this data as a `DataFrame`

named `bb`

. Suppose that we want to fit a linear regression to predict `attendance`

using the following predictors:

`game_type`

— is the game during the day or at night?`day_of_week`

— what day of the week did the game occur?`temperature`

— average game temperature (Fahrenheit).`sky`

— description of sky condition at the time of the game.`total_runs`

— total runs scored in the game.

### Preview the dataset

Any EDA process will probably begin by inspecting a subset of data. For a pandas `DataFrame`

, this can be done by using the `.head()`

method:

bb.head()

attendance | game_type | day_of_week | temperature | sky | total_runs | |
---|---|---|---|---|---|---|

0 | 40030.0 | Night Game | Sunday | 74 | Sunny | 7 |

1 | 21621.0 | Night Game | Wednesday | 55 | Overcast | 5 |

2 | 12622.0 | Night Game | Wednesday | 48 | Unknown | 6 |

3 | 18531.0 | Night Game | Wednesday | 65 | Cloudy | 4 |

4 | 18572.0 | Day Game | Wednesday | 77 | In Dome | 7 |

By looking at the first few rows of the data, we can often figure out what kind of data we have (eg., discrete or continuous) and get a sense of how they are coded. For example, we can see that `attendance`

, `temperature`

, and `total_runs`

are numbers, while `game_type`

, `day_of_week`

, and `sky`

appear to be text.

After our initial inspection, we’ll want to dig deeper to investigate the following:

- The data type of each variable.
- How discrete/categorical data is coded (and whether we need to make any changes).
- How the data are scaled.
- Whether there is missing data and how it is coded.
- Whether there are outliers.
- The distributions of continuous features.
- The relationships between pairs of features.

## Data types

It is important to check the data type for each feature. The quantitative variables should be read in as numbers — either `int64`

or `float64`

— and categorical variables should be stored as strings (columns of strings have a `dtype`

of `object`

because of how they are stored in Python). We can check data types of columns in a pandas `DataFrame`

using the `.dtypes`

property.

bb.dtypes

The output will look like this:

```
attendance float64
game_type object
day_of_week object
temperature object
sky object
total_runs object
dtype: object
```

From this output, we can see that `temperature`

and `total_runs`

are both quantitative variables but they were read in as `object`

dtypes. This can happen when there is a non-numeric character — such as a letter or punctuation symbol — in the same column. We would need to explore further in order to figure out what’s going on. For example, we might inspect a different set of rows and see the following:

attendance | game_type | day_of_week | temperature | sky | total_runs | |
---|---|---|---|---|---|---|

5 | 12757.0 | Night Game | Tuesday | 72 | In Dome | 5 |

6 | 28329.0 | Night Game | Tuesday | 70 | Unknown | - |

7 | 26049.0 | Night Game | Tuesday | Unknown | Sunny | 11 |

8 | 10478.0 | Night Game | Tuesda | 70 | NaN | 9 |

9 | 47820.0 | Day Game | Tuesday | 36 | Sunny | 8 |

We note that `temperature`

has missing data coded as `Unknown`

rather than `NaN`

. If we fit a regression on this data as is, we will end up treating temperature as a categorical variable and therefore fitting separate slopes for every value of temperature; instead, we probably want a single slope. To fix this, we’ll need to replace every `Unknown`

with some other value (or remove them from the data altogether) and recode the temperature column as an `int`

.

## Categorical encoding

EDA is also important during the feature engineering process in order to inform decisions around categorical encoding. This is important because categorical features with many levels are “expensive” to include in a regression model (we need to calculate a separate slope for each level). If one of the levels has only a few observations, we might want to delete those records from the data before fitting the model. We can check this using `.value_counts()`

:

bb['game_type'].value_counts(dropna=False)

Output:

```
Night Game 1664
Day Game 799
Name: game_type, dtype: int64
```

Based on the output, we can see here that there are two levels for `game_type`

; about one-third of games are day games and two-thirds are night games.

The `.value_counts()`

accessor can also illuminate other issues. For example, in the following output, we notice that one instance of `'Tuesday'`

was miscoded as `Tuesda`

. This can either be corrected or removed before proceeding with a regression model.

bb['day_of_week'].value_counts(dropna=False)

Output:

```
Saturday 396
Friday 394
Sunday 392
Wednesday 379
Tuesday 375
Monday 278
Thursday 248
Tuesda 1
Name: day_of_week, dtype: int64
```

There are a few different options for how we might want to code the `day_of_week`

variable. If attendance increases approximately linearly throughout the week, we might argue that `day_of_week`

is ordinal and code it as an `int`

in our model. However, attendance goes up and down throughout the week, we’re better off leaving it as an unordered category (`str`

). Finally, if we see that games on Friday-Sunday simply have higher attendance than other days of the week, we might re-code this feature to only have two levels: `Weekend`

and `Weekday`

. We can check this by using boxplots:

We can see here that attendance on Friday, Saturday, and Sunday is on average higher than the other days of the week. Therefore it may be beneficial to re-code this feature to either `Weekend`

or `Weekday`

.

## Scaling

For quantitative features, it is important to think about how each feature is scaled. Some features will be on vastly different scales than others just based on the nature of what the feature is measuring. For example, let’s look at `temperature`

and `total_runs`

using the `.describe()`

method:

bb.describe()

The output will look like this:

```
attendance temperature total_runs
count 2457.000000 2457.000000 2457.000000
mean 30380.462352 73.834959 8.949187
std 9874.626652 10.567219 4.579542
min 8766.000000 31.000000 1.000000
25% 22437.000000 67.000000 6.000000
50% 30628.000000 74.000000 8.000000
75% 38412.000000 81.000000 12.000000
max 54449.000000 101.000000 60.000000
```

These two features are on different scales because what they are measuring are different (`temperature`

is in degrees Fahrenheit, `total_runs`

is the number of runs scored in a game). Because of this, the ranges of values and the standard deviations for each are very different from one another. We can see here that `temperature`

has a standard deviation of about 10.57, while `total_runs`

has a standard deviation of about 4.58.

When working with features with largely differing scales, it is often a good idea to *standardize* the features so that they all have a mean of 0 and a standard deviation of 1.

A feature without any values close to zero may also make it more difficult to estimate and interpret the intercept of a regression model. Standardizing or otherwise re-scaling the feature can fix this issue.

## Missing data

When we initially inspected the data, we saw some evidence that missing data is coded in a few different ways:

attendance | game_type | day_of_week | temperature | sky | total_runs | |
---|---|---|---|---|---|---|

5 | 12757.0 | Night Game | Tuesday | 72 | In Dome | 5 |

6 | 28329.0 | Night Game | Tuesday | 70 | Unknown | - |

7 | 26049.0 | Night Game | Tuesday | Unknown | Sunny | 11 |

8 | 10478.0 | Night Game | Tuesda | 70 | NaN | 9 |

9 | 47820.0 | Day Game | Tuesday | 36 | Sunny | 8 |

For example, `temperature`

uses the term `Unknown`

, `sky`

uses both `Unknown`

and `NaN`

, and `total_runs`

has `-`

to represent a missing value. The observations with missing values will either have to be removed or replaced (with an imputed value or missing data type that Python can recognize, such as `np.NaN`

) in order to proceed with fitting a regression model.

## Outliers

In our EDA, it is important to check for outliers and skew in the data. One way to check for outliers is to use scatter plots:

bb.plot.scatter(x = 'total_runs',y = 'attendance')

We can see here that there is one instance where the total runs in a single game is about 60, which is much larger than in the other games. Depending on the situation, we may first want to verify that this value is correct, then we can decide whether or not to remove it prior to fitting the model.

## Distributions and associations

Prior to fitting a linear regression model, it can be important to inspect the distributions of quantitative features and investigate the relationships between features. We can visually inspect both of these by using a pair plot:

Looking at the histograms along the diagonal, `total_runs`

appears to be somewhat right-skewed. This indicates that we may want to transform this feature to make it more normally distributed.

We can explore the relationships between pairs of features by looking at the scatterplots off of the diagonal. This is useful for a few different reasons. For example, if we see non-linear associations between any of the predictors and the outcome variable, that might lead us to test out polynomial terms in our model. We can also get a sense for which features are most highly related to our outcome variable and check for collinearity. In this example, there appears to be a slight positive linear association between temperature and the total number of runs. We can further investigate this using a heat map of the correlation matrix:

There is a correlation of 0.35 between temperature and the total number of runs. This is not large enough to cause concern; however, if two or more predictors are highly correlated, we may consider leaving only one in our analysis. On the other hand, features that are highly correlated with our outcome variable are especially important to include in the model.

### Conclusion

Let’s review the ways we were able to explore this data set in preparation for a regression model:

- We previewed the first few rows of the data set using the
`.head()`

method. - We checked the data type of each variable in the data set using
`.dtypes`

and corrected variables with incorrect data types. - We investigated our categorical data to inform categorical encoding.
- We investigated the scale of our quantitative variables and considered whether standardizing/scaling might be appropriate.
- We investigated missing data.
- We checked for outliers.
- We inspected the distributions of our quantitative variables.
- We looked at the relationships between pairs of features using both scatter plots and box plots.

By going through these steps, we are more prepared to make decisions about feature selection/engineering and have learned valuable information about how to build a more accurate predictive model.