NFL Stats Case Study

NFL Game Stats

Offensive and defensive production is a set of team-level NFL stats that can be used to predict the outcome of a game.

Offensive stats include:

  • First down conversions by a team’s offense
  • Total yards gained by a team’s offense
  • Total passing yards gained by a team’s offense
  • Total rushing yards gained by a team’s offense
  • Turnovers committed by a team’s offense

Defensive stats include:

  • First down conversions allowed by a team’s defense
  • Total yards allowed by a team’s defense
  • Total passing yards allowed by a team’s defense
  • Total rushing yards allowed by a team’s defense
  • Turnovers in favor of the defensive team

We can explore further relationships within the data. For example, total yards is the sum of passing yards and rushing yards. Say two teams have high total yards, but one team has most of their yards gained from rushing yards. Based on this data point, we may predict that this team is likely to win against another team that defends poorly against a rushing offense, despite generally defending well.
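As a minimal sketch of this relationship, we can build a tiny table with made-up numbers and compute each team's share of yards gained by rushing. The column names PassY_offense, RushY_offense, and TotalY_offense are assumptions for illustration and may not match the dataset's headers.

```python
import pandas as pd

# hypothetical offensive stats for two teams (made-up numbers)
games = pd.DataFrame({
    'team': ['A', 'B'],
    'PassY_offense': [120, 290],
    'RushY_offense': [280, 110],
})
# total yards is the sum of passing yards and rushing yards
games['TotalY_offense'] = games['PassY_offense'] + games['RushY_offense']
# share of each team's total yards that came from rushing
games['rush_share'] = games['RushY_offense'] / games['TotalY_offense']
print(games)
```

Both teams gain the same total yards here, but team A relies far more heavily on rushing.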

We can view the stat names and the first five rows of our dataset using the pandas method .head().

# import library and data
import pandas as pd
arizona_cardinals = pd.read_csv('2021_CRD.csv')
# view first 5 rows
arizona_cardinals.head()

Identifying Important Game Stats

The importance of each stat to a game prediction model can be assessed using a feature importance metric, such as comparing standardized coefficients from the model. Feature importance tells us how much each feature influences the final prediction.

For NFL data, this might suggest that a team’s strong suit is its defense, or perhaps its ability to pass the football. In the example plot, rush yards was the most important stat for modeling wins.

Bar plot titled "Feature Importance for NFL Model". The x-axis is labeled "Score" running from 0.0 to 3.5. The y-axis is labeled "Stat" with three categories: "First Downs" is at 0.6, "Pass Yards" is at 1.4, and "Rush Yards" is at 3.4.

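# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
# get the importance coefficients from the fitted model
importance = abs(model.coef_[0])
# visualize feature importance (features_names holds the stat names in column order)
sns.barplot(x=importance, y=features_names)
# add labels and titles
plt.suptitle('Feature Importance for NFL Model')
plt.xlabel('Score')
plt.ylabel('Stat')
plt.show()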
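Coefficients are only comparable across stats when the stats share a scale. The sketch below shows one way to obtain standardized coefficients, assuming we scale the features with StandardScaler before fitting. The data is randomly generated as a stand-in for real game stats, and the outcome is deliberately built to depend mostly on rush yards.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for game stats (random numbers, not real NFL data)
rng = np.random.default_rng(1)
X = pd.DataFrame({
    'First Downs': rng.normal(20, 4, 200),
    'Pass Yards': rng.normal(230, 60, 200),
    'Rush Yards': rng.normal(110, 40, 200),
})
# synthetic outcome driven mostly by rush yards, plus some noise
y = (X['Rush Yards'] + rng.normal(0, 20, 200) > 110).astype(int)

# put every stat on the same scale so the coefficients are comparable
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression()
model.fit(X_scaled, y)

# absolute standardized coefficient per stat, largest first
importance = pd.Series(abs(model.coef_[0]), index=X.columns)
importance = importance.sort_values(ascending=False)
print(importance)
```

Without the scaling step, a stat measured in hundreds of yards would tend to receive a smaller coefficient than a stat measured in single digits, regardless of how much it matters.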

Summarizing Game Stats

Python can be used to summarize NFL stats using counts and basic statistics. For example, say we have a column called Day that tells us which day of the week a game was played. We can get counts of how many games were played each day using the pandas method .value_counts().

Day   Count
Sun   15
Mon   2
Sat   1
Thu   1

# import library and data
import pandas as pd
buffalo_bills = pd.read_csv('2021_BUF.csv')
# get counts for days of the week
buffalo_bills.value_counts('Day')
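Counts cover categorical columns; for numeric stats we can use basic statistics such as the mean or maximum. The sketch below uses a small made-up table, and the column name RushY_offense is an assumption for illustration.

```python
import pandas as pd

# hypothetical mini-dataset (made-up numbers)
games = pd.DataFrame({
    'Day': ['Sun', 'Sun', 'Mon', 'Sun', 'Thu'],
    'RushY_offense': [100, 150, 90, 120, 140],
})
# counts of games per day, most frequent first
day_counts = games['Day'].value_counts()
# basic statistics for a numeric stat
mean_rush = games['RushY_offense'].mean()
max_rush = games['RushY_offense'].max()
print(day_counts)
print(mean_rush, max_rush)
```

Calling .describe() on a numeric column returns several of these statistics at once.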

Improving Prediction Accuracy

A prediction model’s accuracy can be improved using tuning techniques in Python that adjust different parameters of a model.

For example, we can sweep across a range for two different model values:

  • C: a regularization parameter of our model
  • test_size: the proportion of data reserved for model testing.

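By iterating over both lists, we can create a model for each combination of C and test_size. We can then evaluate each model on the testing data from its own split and select the combination with the best performance. Because the testing data changes with test_size, accuracies from different splits are not measured on identical games.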

# import functions from library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# model values to try
C = [1.0, 5.0, 10.0]
test_sizes = [0.05, 0.10, 0.15]
for c in C:
    for test_size in test_sizes:
        # split the data with the value of test_size
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
        # fit the model with the value of C
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        # evaluate model accuracy with this combination
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        print(c, test_size, accuracy)
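To select the best combination, we need to record the accuracy from every pass through the loop. The sketch below shows one way to do that. It uses data generated by make_classification as a stand-in for the real features X and labels y, so the numbers it prints say nothing about NFL games.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for features X and labels y (not real NFL data)
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

results = []
for c in [1.0, 5.0, 10.0]:
    for test_size in [0.05, 0.10, 0.15]:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=1)
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        # record this combination along with its accuracy
        results.append((accuracy, c, test_size))

# pick the combination with the highest accuracy (ties go to the first one found)
best_accuracy, best_c, best_test_size = max(results, key=lambda r: r[0])
print(best_c, best_test_size, best_accuracy)
```

Small testing sets give noisy accuracy estimates, so small differences between combinations should be read with caution.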

Plotting Game Stats

Python can be used to visualize trends in NFL stats across winning and losing games. Say we wanted to look at how many passing yards a team’s defense typically allows, both for games won and for games lost. We can use the seaborn library in Python to generate a box plot that shows the passing yards allowed by the defense, PassY_defense, grouped by the game outcome, result.

A plot showing PassY_defense vs result. The box plot for "loss/tie" runs from about 175 to 380, with the box edges at about 225 and 345, and the center line at about 260. The box plot for "win" runs from about 110 to 320, with the box edges at about 175 and 275, and the center line at about 200.

# import libraries and data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dallas_cowboys = pd.read_csv('2021_DAL.csv')
# generate box plot
plot = sns.boxplot(x='result', y='PassY_defense', data=dallas_cowboys)
plot.set_xticklabels(['loss/tie','win'])
plt.show()
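The same comparison can be made numerically by grouping on the outcome. This is a minimal sketch with made-up numbers; it assumes result is coded as 0 for a loss or tie and 1 for a win, which matches the order of the tick labels above but is not confirmed by the dataset itself.

```python
import pandas as pd

# hypothetical mini-dataset (made-up numbers); result: 0 = loss/tie, 1 = win
games = pd.DataFrame({
    'result': [0, 0, 1, 1, 1],
    'PassY_defense': [300, 260, 180, 200, 220],
})
# mean and median passing yards allowed for each outcome
summary = games.groupby('result')['PassY_defense'].agg(['mean', 'median'])
print(summary)
```

The median column corresponds to the center line of each box in the plot.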

Modeling Wins with Scikit-Learn

The sklearn library in Python can be used to run a logistic regression that predicts winning a game from NFL game stats. To do this, we load our dataset and perform the following four steps:

  1. Separate the game stats (known as features X) from the game outcomes (known as labels y).
  2. Randomly split our dataset into training data (X_train and y_train) and testing data (X_test and y_test) using the train_test_split() function. Setting the random_state parameter to a fixed integer ensures we can reproduce our work.
  3. Create an instance of the model LogisticRegression() and fit it to the training data. Our model will learn the patterns of game stats associated with wins in the training data.
  4. Call the predict() method of our trained model, model, on the testing data game stats to predict wins.

# import libraries and data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
seattle_seahawks = pd.read_csv('2021_SEA.csv')
# separate features and labels
X = seattle_seahawks.drop(columns='result') # keep just the features, drop the label column
y = seattle_seahawks['result'] # keep only the labels
# perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)
# save predictions for the test data
y_predicted = model.predict(X_test)

Checking Prediction Accuracy

Python can be used to check a logistic regression model’s accuracy, which is the percentage of correct predictions on a testing set of NFL stats with known game outcomes. The accuracy_score() function from sklearn.metrics will compare the model’s predicted outcomes to the known outcomes of the testing data and output the proportion of correct predictions.

For example, say we saved the known testing data outcomes to the variable y_test and the predicted testing data outcomes in the variable y_predicted. Using the accuracy_score() function with these two variables might produce an output like the following:

0.7246376811594203

This indicates that about 72% of our model’s predictions for the testing data were correct.

# import function from library
from sklearn.metrics import accuracy_score
# compute model accuracy
accuracy_score(y_test, y_predicted)
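To see what accuracy_score() computes, we can reproduce it by hand on a few made-up outcomes. In this sketch, y_test and y_predicted are invented values for eight games, not output from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# made-up outcomes for eight test games (1 = win, 0 = loss/tie)
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])
# proportion of predictions that match the known outcomes
manual_accuracy = (y_test == y_predicted).mean()
print(manual_accuracy)
print(accuracy_score(y_test, y_predicted))
```

Six of the eight predictions match, so both lines print 0.75.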
