# NFL Stats Case Study

### NFL Game Stats

Team-level offensive and defensive production stats can be used to predict the outcome of an NFL game.

Offensive stats include:

• First down conversions by a team’s offense
• Total yards gained by a team’s offense
• Total passing yards gained by a team’s offense
• Total rushing yards gained by a team’s offense
• Turnovers committed by a team’s offense

Defensive stats include:

• First down conversions allowed by a team’s defense
• Total yards allowed by a team’s defense
• Total passing yards allowed by a team’s defense
• Total rushing yards allowed by a team’s defense
• Turnovers in favor of the defensive team

We can explore further relationships within the data. For example, total yards is the sum of passing yards and rushing yards. Say two teams both have high total yards, but one gains most of its yards by rushing. Based on this, we might predict that the rushing-heavy team is likely to beat an opponent that defends well in general but poorly against the run.
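As a rough sketch of this kind of comparison, the snippet below computes the share of each team's total yards that came from rushing. The team names, the column names (`PassY`, `RushY`, `TotY`), and the numbers are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# hypothetical season totals for two teams (illustrative values only)
teams = pd.DataFrame({
    'team': ['Team A', 'Team B'],
    'PassY': [2400, 3900],
    'RushY': [2600, 1100],
})

# total yards is the sum of passing yards and rushing yards
teams['TotY'] = teams['PassY'] + teams['RushY']

# share of total yards gained by rushing
teams['rush_share'] = teams['RushY'] / teams['TotY']

print(teams[['team', 'TotY', 'rush_share']])
```

Both hypothetical teams gain the same total yards, but the rushing share separates them, which is the kind of relationship a single total would hide.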

We can view the stats and first few rows of our dataset using `.head()` in Python.

```
# import library and data
import pandas as pd
arizona_cardinals = pd.read_csv('2021_CRD.csv')

# view the first few rows
arizona_cardinals.head()
```

### Identifying Important Game Stats

The importance of each stat to a game prediction model can be assessed using a feature importance metric, such as comparing standardized coefficients from the model. Feature importance tells us how much each feature influences the final prediction.
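One way to get standardized coefficients, sketched here under assumptions, is to rescale every stat with `StandardScaler` before fitting a logistic regression. The data below is synthetic and the column names (`RushY`, `PassY`, `TO`) are placeholders, so the ranking says nothing about real NFL games.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# synthetic game stats (illustrative only, not real NFL data)
rng = np.random.default_rng(1)
n_games = 200
X = pd.DataFrame({
    'RushY': rng.normal(120, 40, n_games),
    'PassY': rng.normal(240, 60, n_games),
    'TO': rng.integers(0, 5, n_games).astype(float),
})
# in this toy data, wins are driven mainly by rushing yards
y = (X['RushY'] + rng.normal(0, 20, n_games) > 120).astype(int)

# put every stat on the same scale so coefficients are comparable
X_scaled = StandardScaler().fit_transform(X)

model = LogisticRegression()
model.fit(X_scaled, y)

# rank stats by the absolute size of their standardized coefficient
importance = pd.Series(np.abs(model.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False))
```

Without the scaling step, a stat measured in hundreds of yards and a stat measured in single digits would have coefficients on very different scales, and comparing them directly would be misleading.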

For NFL data, this might suggest that a team’s defense is its strong suit, or that its strength lies in its ability to pass the football. In the example plot, rushing yards was the most important stat for modeling wins.

```
# import plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# get the importance coefficients from the model
# (coef_ has shape (1, n_features) for a binary model, so take row 0)
importance = abs(model.coef_[0])

# visualize feature importance
sns.barplot(x=importance, y=features_names)

# add labels and titles
plt.suptitle('Feature Importance for NFL Model')
plt.xlabel('Score')
plt.ylabel('Stat')
plt.show()
```

### Summarizing Game Stats

Python can be used to summarize NFL stats using counts and basic statistics. For example, say we have a column called `Day` that tells us which day of the week a game was played. We can get counts of how many games were played each day using the `.value_counts()` method.

| Day | Count |
| --- | --- |
| Sun | 15 |
| Mon | 2 |
| Sat | 1 |
| Thu | 1 |

```
# import library and data
import pandas as pd
buffalo_bills = pd.read_csv('2021_BUF.csv')

# get counts for days of the week
buffalo_bills.value_counts('Day')
```
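Counts cover categorical columns; for numeric stats, `.describe()` is one common way to get basic statistics. The sketch below uses a small made-up table, and the `TotY` column name is an assumption rather than something confirmed by the dataset.

```python
import pandas as pd

# hypothetical per-game stats (illustrative values only)
games = pd.DataFrame({
    'Day': ['Sun', 'Sun', 'Mon', 'Sun', 'Thu'],
    'TotY': [350, 410, 298, 377, 325],
})

# counts of games per day of the week
day_counts = games['Day'].value_counts()
print(day_counts)

# basic statistics (count, mean, std, min, quartiles, max) for a numeric stat
print(games['TotY'].describe())
```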

### Improving Prediction Accuracy

A prediction model’s accuracy can be improved using tuning techniques in Python that adjust different parameters of a model.

For example, we can sweep across a range of values for two settings:

• `C`: a regularization parameter of our model
• `test_size`: the proportion of data reserved for model testing

By iterating over both lists, we can fit a model for each combination of `C` and `test_size`. Then we can evaluate each model on its held-out testing data and select the combination that performs best.

```
# import functions from library
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# model values to try
C = [1.0, 5.0, 10.0]
test_sizes = [0.05, 0.10, 0.15]

# X and y are the features and labels from the dataset
for c in C:
    for test_size in test_sizes:
        # split the data with the value of test_size
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
        # fit the model with the value of C
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        # evaluate model accuracy with this combination
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
```
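The sweep only becomes useful once each accuracy is recorded and compared. The following is a minimal sketch of that bookkeeping, using synthetic data from `make_classification` as a stand-in for real game stats; the `results` list and the `best_*` names are illustrative choices, not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in for game stats (X) and outcomes (y)
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

C_values = [1.0, 5.0, 10.0]
test_sizes = [0.05, 0.10, 0.15]

results = []  # one (accuracy, C, test_size) tuple per combination
for c in C_values:
    for test_size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=1)
        model = LogisticRegression(C=c)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_test, model.predict(X_test))
        results.append((accuracy, c, test_size))

# pick the combination with the highest accuracy
# (ties fall back to the larger C, then the larger test_size)
best_accuracy, best_c, best_test_size = max(results)
print(best_c, best_test_size, best_accuracy)
```

Because each `test_size` produces a different testing set, accuracies across different `test_size` values are only roughly comparable, and very small test sets give noisy estimates.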

### Plotting Game Stats

Python can be used to visualize trends in NFL stats across winning and losing games. Say we wanted to look at how many passing yards a team’s defense typically allows, both for games won and for games lost. We can use the seaborn library in Python to generate a box plot of passing yards allowed by the defense (`PassY_defense`) grouped by the game outcome (`result`).

```
# import libraries and data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
dallas_cowboys = pd.read_csv('2021_DAL.csv')

# generate box plot
plot = sns.boxplot(x='result', y='PassY_defense', data=dallas_cowboys)
plot.set_xticklabels(['loss/tie','win'])
plt.show()
```
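The averages behind such a plot can also be computed directly with `groupby`. This sketch uses a small made-up game log and assumes `result` is coded as 1 for a win and 0 for a loss or tie, which matches the tick labels in the box plot but is not confirmed by the dataset itself.

```python
import pandas as pd

# hypothetical game log (illustrative values only); result: 1 = win, 0 = loss/tie
games = pd.DataFrame({
    'result': [1, 0, 1, 1, 0, 0],
    'PassY_defense': [180, 290, 210, 195, 310, 270],
})

# average passing yards allowed, split by game outcome
avg_allowed = games.groupby('result')['PassY_defense'].mean()
print(avg_allowed)
```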

### Modeling Wins with Scikit-Learn

The `sklearn` library in Python can be used to run a logistic regression that predicts winning a game from NFL game stats. To use this, we simply load our dataset and perform the following four steps:

1. Separate the game stats (known as features `X`) from the game outcomes (known as labels `y`).
2. Randomly assign a proportion of our dataset to be training data (`X_train` and `y_train`) and the rest to be testing data (`X_test` and `y_test`) using the `train_test_split()` function. Setting the `random_state` parameter to a fixed integer ensures we can reproduce our work.
3. Create an instance of the model `LogisticRegression()` and fit it to the training data. Our model will learn the patterns of game stats associated with wins in the training data.
4. Call the `predict()` method of our trained model `model` on the testing data game stats to use our model to predict wins.

```
# import libraries and data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
seattle_seahawks = pd.read_csv('2021_SEA.csv')

# separate features and labels
X = seattle_seahawks.drop(columns='result') # keep just the features, drop the label
y = seattle_seahawks['result'] # keep only the labels

# perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# fit the model
model = LogisticRegression()
model.fit(X_train, y_train)

# save predictions for the test data
y_predicted = model.predict(X_test)
```

### Checking Prediction Accuracy

Python can be used to check a logistic regression model’s accuracy, which is the percentage of correct predictions on a testing set of NFL stats with known game outcomes. The `accuracy_score()` function from `sklearn.metrics` will compare the model’s predicted outcomes to the known outcomes of the testing data and output the proportion of correct predictions.

For example, say we saved the known testing data outcomes to the variable `y_test` and the predicted testing data outcomes in the variable `y_predicted`. Using the `accuracy_score()` function with these two variables might produce an output like the following:

```
0.7246376811594203
```

This indicates that about 72% of our model’s predictions for the testing data were correct.

```
# import function from library
from sklearn.metrics import accuracy_score

# compute model accuracy
accuracy_score(y_test, y_predicted)
```
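To make the definition concrete, the sketch below computes the proportion of matching predictions by hand and compares it with `accuracy_score()`. The two arrays are made-up outcomes (1 for a win, 0 for a loss or tie), not output from a real model.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# hypothetical known and predicted outcomes (1 = win, 0 = loss/tie)
y_test = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_predicted = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# accuracy is the proportion of predictions that match the known outcomes
manual_accuracy = (y_test == y_predicted).mean()

print(manual_accuracy)
print(accuracy_score(y_test, y_predicted))
```

Six of the eight predictions match, so both lines print `0.75`.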