Offensive and defensive production is a set of team-level NFL stats that can be used to predict the outcome of a game.
Offensive stats include:
Defensive stats include:
We can explore further relationships within the data. For example, total yards is the sum of passing yards and rushing yards. Say two teams both have high total yards, but one team gains most of its yards by rushing. Based on this data point, we may predict that this team is likely to beat an opponent that defends poorly against the run, even if that opponent generally defends well.
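As a minimal sketch of this relationship, the snippet below builds a tiny made-up table for two teams with the same total yards but a different mix of passing and rushing. The column names (pass_yards, rush_yards, rush_share) and all values are assumptions for illustration, not fields from the actual dataset.

```python
import pandas as pd

# Hypothetical per-team stats; column names and values are illustrative only
games = pd.DataFrame({
    "team": ["A", "B"],
    "pass_yards": [280, 120],
    "rush_yards": [90, 250],
})

# Total yards is the sum of passing yards and rushing yards
games["total_yards"] = games["pass_yards"] + games["rush_yards"]

# Share of total yards gained by rushing
games["rush_share"] = games["rush_yards"] / games["total_yards"]

print(games)
```

Both teams gain the same total yards here, but team B's rush share is much higher, which is the kind of difference a matchup against a weak run defense could expose.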
We can view the stats and first few rows of our dataset using .head() in Python.
# import library and data
import pandas as pd
arizona_cardinals = pd.read_csv('2021_CRD.csv')

# view first 5 rows
arizona_cardinals.head()
The importance of each stat to a game prediction model can be assessed using a feature importance metric, such as comparing standardized coefficients from the model. Feature importance tells us how much each feature influences the final prediction.
For NFL data, feature importance might suggest that a team’s defense is its strong suit, or that its ability to pass the football matters most. In the example plot, rush yards was the most important stat for modeling wins.
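# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# get the importance coefficients from the model
# (coef_ has one row per class; take the first row for a binary outcome)
importance = abs(model.coef_[0])

# visualize feature importance
sns.barplot(x=importance, y=features_names)

# add labels and titles
plt.suptitle('Feature Importance for NFL Model')
plt.xlabel('Score')
plt.ylabel('Stat')
plt.show()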
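Coefficients are only comparable across stats when the features share a common scale. The sketch below shows one way to get standardized coefficients, assuming scikit-learn's StandardScaler is used before fitting. The data is synthetic and the stat names (rush_yards, pass_yards, turnovers) are hypothetical; the outcome is deliberately constructed to depend mostly on rushing yards.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only; stat names are hypothetical
rng = np.random.default_rng(0)
n_games = 200
X = pd.DataFrame({
    "rush_yards": rng.normal(120, 40, n_games),
    "pass_yards": rng.normal(230, 60, n_games),
    "turnovers": rng.integers(0, 4, n_games).astype(float),
})

# Made-up outcome that depends mostly on rushing yards
y = (X["rush_yards"] + rng.normal(0, 20, n_games) > 120).astype(int)

# Standardize so each stat has mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)

# coef_ has shape (1, n_features) for a binary outcome
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False))
```

Because the features were standardized first, a larger absolute coefficient can be read as a stronger influence on the predicted outcome per standard deviation of that stat.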
Python can be used to summarize NFL stats using counts and basic statistics. For example, say we have a variable or header called Day that tells us which day of the week a game was played. We can get counts of how many games were played each day using the function .value_counts().
# import library and data
import pandas as pd
buffalo_bills = pd.read_csv('2021_BUF.csv')

# get counts for days of the week
buffalo_bills.value_counts('Day')
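The same idea extends to basic statistics. The sketch below uses a small made-up game log (the points column and all values are assumptions for illustration) to show counts for a categorical column alongside pandas' .describe() summary for a numeric one.

```python
import pandas as pd

# Hypothetical game log; column names and values are illustrative only
games = pd.DataFrame({
    "Day": ["Sun", "Sun", "Mon", "Thu", "Sun"],
    "points": [24, 17, 31, 10, 28],
})

# Counts of games played on each day of the week
day_counts = games["Day"].value_counts()

# Basic statistics (count, mean, std, min, quartiles, max) for a numeric stat
points_summary = games["points"].describe()

print(day_counts)
print(points_summary)
```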
A prediction model’s accuracy can be improved using tuning techniques in Python that adjust different parameters of a model.
For example, we can sweep across a range of values for two different settings:
C: a regularization parameter of our model (in scikit-learn's LogisticRegression, the inverse of regularization strength)
test_size: the proportion of data reserved for model testing
By iterating over both lists, we can create a model for each combination of C and test_size. Then we can evaluate each model on its testing data and select the model that has the best performance.
# import functions from library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# model values to try
C = [1.0, 5.0, 10.0]
test_sizes = [0.05, 0.10, 0.15]

for c in C:
    for test_size in test_sizes:
        # split the data with the value of test_size
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
        # fit the model with the value of C
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        # evaluate model accuracy with this combination
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        print(c, test_size, accuracy)
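To actually select the best-performing combination, the score for each pair of values needs to be stored and compared. The following is a minimal sketch of that bookkeeping; it uses synthetic data from scikit-learn's make_classification as a stand-in for the NFL features and labels, and the names results, best_c, and best_test_size are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features X and labels y
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

results = []
for c in [1.0, 5.0, 10.0]:
    for test_size in [0.05, 0.10, 0.15]:
        # split, fit, and score one combination of C and test_size
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=1)
        model = LogisticRegression(C=c).fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        results.append((accuracy, c, test_size))

# pick the combination with the highest accuracy
best_accuracy, best_c, best_test_size = max(results, key=lambda r: r[0])
print(best_c, best_test_size, best_accuracy)
```

Note that each value of test_size produces a different testing set, so accuracies across different test sizes are measured on different games and are only roughly comparable.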
Python can be used to visualize trends in NFL stats across winning and losing games. Say we wanted to look at how many yards a team’s defense allows on average, both for games won and for games lost. We can use the seaborn library in Python to generate a box plot that shows the variable for yards allowed by the defense, PassY_defense, by the game outcome, result.
# import libraries and data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dallas_cowboys = pd.read_csv('2021_DAL.csv')

# generate box plot
plot = sns.boxplot(x='result', y='PassY_defense', data=dallas_cowboys)
plot.set_xticklabels(['loss/tie','win'])
plt.show()
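A numeric summary can complement the box plot. The sketch below groups a small made-up table by game outcome and computes the mean and median yards allowed. It assumes result is coded 0 for a loss or tie and 1 for a win, matching the order of the tick labels in the plot; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical games; result is assumed to be 0 = loss/tie, 1 = win
games = pd.DataFrame({
    "result": [0, 1, 1, 0, 1, 0],
    "PassY_defense": [310, 190, 220, 280, 205, 295],
})

# Mean and median yards allowed, split by game outcome
summary = games.groupby("result")["PassY_defense"].agg(["mean", "median"])
print(summary)
```

In this toy data the defense allows fewer yards on average in wins than in losses, which is the kind of gap the box plot would show visually.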
The sklearn library in Python can be used to run a logistic regression that predicts winning a game from NFL game stats. To use this, we simply load our dataset and perform the following three steps:
1. Separate the game stats (known as features, X) from the game outcomes (known as labels, y), then split the data into training data (X_train, y_train) and testing data (X_test, y_test) using the train_test_split() function. Setting the random_state parameter to a fixed integer ensures we can reproduce our work.
2. Create the model with LogisticRegression() and fit it to the training data. Our model will learn the patterns of game stats associated with wins in the training data.
3. Use the predict() function of our trained model, model, on the testing data game stats to predict wins.
# import libraries and data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
seattle_seahawks = pd.read_csv('2021_SEA.csv')

# separate features and labels
X = seattle_seahawks.drop(columns='result') # keep just the features, drop the label
y = seattle_seahawks['result'] # keep only the labels

# perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# save predictions for the test data
y_predicted = model.predict(X_test)
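For readers without the CSV file at hand, the same three steps can be tried on synthetic data. This is only a sketch: the stat names, the rule used to generate result, and the max_iter setting are all assumptions made so the example runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic games for illustration; stat names and the win rule are made up
rng = np.random.default_rng(1)
n_games = 100
games = pd.DataFrame({
    "rush_yards": rng.normal(120, 40, n_games),
    "pass_yards": rng.normal(230, 60, n_games),
})
games["result"] = (games["rush_yards"] + games["pass_yards"] > 350).astype(int)

# Step 1: separate features and labels, then split into training and testing data
X = games.drop(columns="result")
y = games["result"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Step 2: create the model and fit it to the training data
# (max_iter is raised because the features are not scaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 3: predict outcomes for the testing data
y_predicted = model.predict(X_test)
print(y_predicted[:5])
```

With test_size=0.25 and 100 games, 75 games go to training and 25 to testing, and predict() returns one 0 or 1 label per testing game.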
Python can be used to check a logistic regression model’s accuracy, which is the percentage of correct predictions on a testing set of NFL stats with known game outcomes. The accuracy_score() function from sklearn.metrics will compare the model’s predicted outcomes to the known outcomes of the testing data and output the proportion of correct predictions.
For example, say we saved the known testing data outcomes to the variable y_test and the predicted testing data outcomes in the variable y_predicted. Using the accuracy_score() function with these two variables might produce an output of about 0.72. This indicates that about 72% of our model’s predictions for the testing data were correct.
# import function from library
from sklearn.metrics import accuracy_score

# compute model accuracy
accuracy_score(y_test, y_predicted)
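To make the definition concrete, the sketch below computes accuracy by hand as the number of matching predictions divided by the number of games, and compares it to accuracy_score(). The eight outcomes are made up for illustration, and manual_accuracy and library_accuracy are illustrative names.

```python
from sklearn.metrics import accuracy_score

# Made-up known outcomes and predictions (1 = win, 0 = loss/tie)
y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy by hand: correct predictions divided by total predictions
correct = sum(t == p for t, p in zip(y_test, y_predicted))
manual_accuracy = correct / len(y_test)

# The same value from scikit-learn
library_accuracy = accuracy_score(y_test, y_predicted)

print(manual_accuracy, library_accuracy)
```

Six of the eight predictions match the known outcomes, so both calculations give 0.75.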