Another way of choosing a model to make predictions for new data (also called out-of-sample prediction) is by using training and test datasets. The idea is that we use only *part* of our data to fit the model, then see how well the model predicts the outcome of interest for the rest of the data. The process is as follows:
- First, we split our data into two subsets: a training set and a test set. Often, the training set is a larger proportion of the data.
- Some Python libraries (scikit-learn, for example) provide built-in functions to split a dataframe, but for the sake of understanding, we'll do it explicitly here by randomly sampling from a list of row indices (a built-in alternative is sketched after the code):
```python
import numpy as np

# Create a list of indices
indices = range(len(rentals))

# Determine the size of the training set (s)
s = int(0.8 * len(indices))

# Randomly select 80% of the indices
train_ind = np.random.choice(indices, size=s, replace=False)

# Create a list of the remaining 20% of indices
test_ind = list(set(indices) - set(train_ind))

# Split the data into the training and test sets
rentals_train = rentals.iloc[train_ind]
rentals_test = rentals.iloc[test_ind]
```
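For comparison, here is a minimal sketch of the same 80/20 split using scikit-learn's `train_test_split` (this assumes scikit-learn is installed; the `random_state` value is arbitrary and only makes the split reproducible):

```python
from sklearn.model_selection import train_test_split

# Split the dataframe into 80% training rows and 20% test rows
rentals_train, rentals_test = train_test_split(
    rentals, train_size=0.8, random_state=42)
```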
- Next, we fit the models we want to compare using the training set data only:
```python
import statsmodels.api as sm

model1 = sm.OLS.from_formula(
    'rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor',
    data=rentals_train).fit()

model2 = sm.OLS.from_formula(
    'rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor + borough',
    data=rentals_train).fit()
```
- Then, we use those models to predict the rental price for the apartments in the test set:
```python
fitted1 = model1.predict(rentals_test)
fitted2 = model2.predict(rentals_test)
```
- Finally, we can compare the predicted rents to the true rents in the test set and use a metric to determine how well each model performed.
- In this example, we’ll use a metric called predictive root mean squared error (PRMSE), which is exactly what the name sounds like: the square root of the mean squared difference between predicted and true values of the outcome variable. A smaller PRMSE means that the model performed better (the predicted values were more similar to the true values):
```python
true = rentals_test.rent

prmse1 = np.mean((true - fitted1)**2)**0.5
prmse2 = np.mean((true - fitted2)**2)**0.5

print(prmse1)
# Output: 1326.258
print(prmse2)
# Output: 1224.269
```
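If you prefer a library function, the same quantity can be computed from scikit-learn's `mean_squared_error` by taking a square root (a sketch assuming scikit-learn is installed; it gives the same numbers as the manual calculation above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# PRMSE is the square root of the mean squared prediction error
prmse1 = np.sqrt(mean_squared_error(true, fitted1))
prmse2 = np.sqrt(mean_squared_error(true, fitted2))
```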
Based on this metric, we would choose the second model over the first one because it has a smaller PRMSE.
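One practical caveat: because the split is random, the exact PRMSE values (and, for very similar models, occasionally even their ranking) will vary from run to run. A common way to make the comparison reproducible is to fix NumPy's random seed before sampling the indices (the seed value 123 below is arbitrary):

```python
import numpy as np

# Fix the seed so np.random.choice draws the same indices every run
np.random.seed(123)
```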
Instructions
In `script.py`, we've provided you with code to split the `bikes` dataset into training and test sets (`bikes_train` and `bikes_test`). We've then fit two different models with the training set and saved those models as `model1` and `model2`, respectively.

- Use `model1` to predict the number of bikes rented (`cnt`) for each day in the test dataset and save the result as `fitted1`.
- Use `model2` to predict the number of bikes rented (`cnt`) for each day in the test dataset and save the result as `fitted2`.
- Calculate the PRMSE for `model1` and save it as `prmse1`.
- Calculate the PRMSE for `model2` and save it as `prmse2`.
- Print out the PRMSE for both models.
- Which model would you choose based on PRMSE? Indicate your answer by setting a variable named `which_model` equal to `1` if you would choose `model1` and equal to `2` if you would choose `model2`.