Offensive and defensive production is a set of team-level NFL stats that can be used to predict the outcome of a game.
Offensive stats include:
Defensive stats include:
We can explore further relationships within the data. For example, total yards is the sum of passing yards and rushing yards. Say two teams have similarly high total yards, but one of them gains most of its yards by rushing. From this, we might predict that the rushing-heavy team is likely to beat an opponent that defends well in general but poorly against a rushing offense.
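As a minimal sketch of that relationship, the example below builds a tiny table with made-up season totals and computes each team's total yards and rushing share. The team names, the numbers, and the column names pass_yards and rush_yards are hypothetical and do not come from the datasets used here.

```python
import pandas as pd

# Hypothetical season totals for two teams (made-up numbers for illustration)
teams = pd.DataFrame(
    {
        "team": ["Team A", "Team B"],
        "pass_yards": [4200, 2900],
        "rush_yards": [1800, 3100],
    }
)

# Total yards is the sum of passing and rushing yards
teams["total_yards"] = teams["pass_yards"] + teams["rush_yards"]

# Share of total yards gained by rushing
teams["rush_share"] = teams["rush_yards"] / teams["total_yards"]

print(teams)
```

Both teams finish with the same total yards, but Team B gains just over half of its yards by rushing, while Team A gains 30%. A derived column like rush_share could then serve as an extra feature.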
We can view the stats and the first few rows of our dataset using .head() in Python.

    # import library and data
    import pandas as pd
    arizona_cardinals = pd.read_csv('2021_CRD.csv')

    # view first 5 rows
    arizona_cardinals.head()

The importance of each stat to a game prediction model can be assessed using a feature importance metric, such as comparing standardized coefficients from the model. Feature importance tells us how much each feature influences the final prediction.
For NFL data, feature importance might suggest that a team’s defense is its strong suit, or that its ability to pass the football is. In the example plot, rush yards was the most important stat for modeling wins.
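
    # import plotting libraries
    import matplotlib.pyplot as plt
    import seaborn as sns

    # get the importance coefficients from the fitted model
    # (model and features_names are assumed to be defined earlier)
    importance = abs(model.coef_[0])

    # visualize feature importance
    sns.barplot(x=importance, y=features_names)

    # add labels and titles
    plt.suptitle('Feature Importance for NFL Model')
    plt.xlabel('Score')
    plt.ylabel('Stat')
    plt.show()
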
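Coefficients are only comparable across stats when the features share a common scale. The sketch below shows one way this might be done, by standardizing the features with StandardScaler before fitting. The data is synthetic, and the column names RushY, PassY, and TO, along with the rule that generates wins, are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for game stats (not real NFL data)
rng = np.random.default_rng(0)
n_games = 200
X = pd.DataFrame(
    {
        "RushY": rng.normal(120, 40, n_games),
        "PassY": rng.normal(240, 60, n_games),
        "TO": rng.integers(0, 5, n_games),
    }
)

# Made-up rule: wins depend mostly on rushing yards, plus noise
y = (X["RushY"] + rng.normal(0, 20, n_games) > 120).astype(int)

# Standardize so coefficients are comparable across stats with different units
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)

# Absolute standardized coefficients as a rough importance score
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False))
```

Because the synthetic outcomes were generated from rushing yards, RushY comes out as the most important stat here. With real data the ranking would depend on the team and the season.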
Python can be used to summarize NFL stats using counts and basic statistics. For example, say we have a variable or header called Day that tells us which day of the week a game was played. We can get counts of how many games were played each day using the function .value_counts().
Day | Count |
---|---|
Sun | 15 |
Mon | 2 |
Sat | 1 |
Thu | 1 |

    # import library and data
    import pandas as pd
    buffalo_bills = pd.read_csv('2021_BUF.csv')

    # get counts for days of the week
    buffalo_bills.value_counts('Day')

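Beyond raw counts, proportions and basic statistics can be computed in much the same way. The sketch below uses a small made-up game log. The points column and all values are hypothetical and are not taken from the team datasets.

```python
import pandas as pd

# Hypothetical game log (made-up values; "points" is an assumed column name)
games = pd.DataFrame(
    {
        "Day": ["Sun", "Sun", "Mon", "Sun", "Thu"],
        "points": [27, 31, 17, 24, 35],
    }
)

# Proportion of games played on each day instead of raw counts
day_share = games["Day"].value_counts(normalize=True)

# Count, mean, standard deviation, min, quartiles, and max of points scored
points_summary = games["points"].describe()

print(day_share)
print(points_summary)
```

Passing normalize=True turns the counts into fractions of all games, and .describe() gathers the common summary statistics for a numeric column in one call.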
A prediction model’s accuracy can be improved using tuning techniques in Python that adjust different parameters of a model.
For example, we can sweep across a range of values for two settings:
C: the inverse of the model’s regularization strength (smaller values regularize more strongly)
test_size: the proportion of data reserved for model testing
By iterating over both lists, we can create a model for each combination of C and test_size. Then we can evaluate each model on its testing data and select the combination that has the best performance.
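
    # import model, splitting, and scoring functions
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # model values to try
    C = [1.0, 5.0, 10.0]
    test_sizes = [0.05, 0.10, 0.15]

    for c in C:
        for test_size in test_sizes:
            # split the data with the value of test_size
            # (X and y are assumed to hold the features and labels)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
            # fit the model with the value of C
            model = LogisticRegression(C=c)
            model.fit(X_train, y_train)
            # evaluate model accuracy with this combination
            predictions = model.predict(X_test)
            accuracy = accuracy_score(y_test, predictions)
            print(c, test_size, accuracy)
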
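To actually select a model, the accuracy of each combination has to be recorded and compared. The sketch below is one possible way to do that. It uses synthetic data from make_classification as a stand-in for real game stats, and the results list and the best_* names are introduced only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for game stats and outcomes (not real NFL data)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

results = []  # one (accuracy, C, test_size) tuple per combination
for c in [1.0, 5.0, 10.0]:
    for test_size in [0.05, 0.10, 0.15]:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=1
        )
        model = LogisticRegression(C=c, max_iter=1000)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        results.append((accuracy, c, test_size))

# Pick the tuple with the highest accuracy
# (ties are broken by the larger C, then the larger test_size)
best_accuracy, best_c, best_test_size = max(results)
print(best_accuracy, best_c, best_test_size)
```

Since each test_size produces a different test set, and small test sets give noisy accuracy estimates, the winning combination should be treated as a rough guide, not a definitive answer.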
Python can be used to visualize trends in NFL stats across winning and losing games. Say we wanted to compare how many yards a team’s defense typically allows in games won versus games lost. We can use the seaborn library in Python to generate a box plot that shows the distribution of yards allowed by the defense (PassY_defense) for each game outcome (result).

    # import libraries and data
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    dallas_cowboys = pd.read_csv('2021_DAL.csv')

    # generate box plot
    plot = sns.boxplot(x='result', y='PassY_defense', data=dallas_cowboys)
    plot.set_xticklabels(['loss/tie','win'])
    plt.show()

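A numeric summary can complement the box plot. The sketch below groups a small made-up table by game outcome and reports the mean and median yards allowed. The values are hypothetical, and it assumes result is coded as 1 for a win and 0 for a loss or tie.

```python
import pandas as pd

# Hypothetical games (made-up values); result is assumed to be 1 = win, 0 = loss/tie
games = pd.DataFrame(
    {
        "result": [1, 0, 1, 0, 1, 0],
        "PassY_defense": [180, 260, 210, 300, 195, 250],
    }
)

# Mean and median yards allowed, split by game outcome
summary = games.groupby("result")["PassY_defense"].agg(["mean", "median"])
print(summary)
```

In this made-up data the defense allows fewer yards on average in wins (195) than in losses (270). The median is the center line that a box plot draws for each group.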
The sklearn library in Python can be used to run a logistic regression that predicts winning a game from NFL game stats. To use this, we simply load our dataset and perform the following three steps:
1. Separate the game stats (known as features X) from the game outcomes (known as labels y).
2. Split the data into training data (X_train and y_train) and testing data (X_test and y_test) using the train_test_split() function. Setting the random_state parameter to a fixed integer ensures we can reproduce our work.
3. Create the model with LogisticRegression() and fit it to the training data. Our model will learn the patterns of game stats associated with wins in the training data.
We can then call the predict() function of our trained model (model) on the testing data game stats to predict wins.

    # import libraries and data
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    seattle_seahawks = pd.read_csv('2021_SEA.csv')

    # separate features and labels
    X = seattle_seahawks.drop(columns='result')  # keep just the features, drop the label
    y = seattle_seahawks['result']  # keep only the labels

    # perform train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

    # fit the model
    model = LogisticRegression()
    model.fit(X_train, y_train)

    # save predictions for the test data
    y_predicted = model.predict(X_test)

Python can be used to check a logistic regression model’s accuracy, which is the percentage of correct predictions on a testing set of NFL stats with known game outcomes. The accuracy_score() function from sklearn.metrics will compare the model’s predicted outcomes to the known outcomes of the testing data and output the proportion of correct predictions.
For example, say we saved the known testing data outcomes to the variable y_test and the predicted testing data outcomes in the variable y_predicted. Using the accuracy_score() function with these two variables might produce an output like the following:
0.7246376811594203
This indicates that about 72% of our model’s predictions for the testing data were correct.

    # import function from library
    from sklearn.metrics import accuracy_score

    # compute model accuracy
    accuracy_score(y_test, y_predicted)

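To make the calculation concrete, the sketch below computes accuracy by hand and compares it with accuracy_score(). The eight outcomes are made up for illustration, with 1 standing for a win and 0 for a loss.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical known and predicted outcomes for eight test games (1 = win)
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Accuracy is the fraction of games where the prediction matches the outcome
manual_accuracy = (y_test == y_predicted).mean()
sklearn_accuracy = accuracy_score(y_test, y_predicted)

print(manual_accuracy, sklearn_accuracy)
```

Six of the eight predictions match the known outcomes, so both calculations give 0.75.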