Another way of choosing a model to make predictions for new data (also called out-of-sample prediction) is by using training and test datasets. The idea is that we use only part of our data to fit the model, then see how well the model predicts the outcome of interest for the rest of our data. The process is as follows:

  • First, we split our data into two subsets: a training set and a test set. Often, the training set is a larger proportion of the data.
  • Some Python libraries provide built-in functions for splitting a dataframe, but for the sake of understanding, we’ll do it explicitly here by randomly sampling from a list of row indices:
    import numpy as np

    # Create a list of indices
    indices = range(len(rentals))

    # Determine the size of the training set (s)
    s = int(0.8*len(indices))

    # Randomly select 80% of the indices
    train_ind = np.random.choice(indices, size = s, replace = False)

    # Create a list of the remaining 20% of indices
    test_ind = list(set(indices) - set(train_ind))

    # Split the data into the training and test sets
    rentals_train = rentals.iloc[train_ind]
    rentals_test = rentals.iloc[test_ind]
  • Next, we fit the models we want to compare using the training set data only:
    import statsmodels.api as sm

    model1 = sm.OLS.from_formula('rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor', data=rentals_train).fit()
    model2 = sm.OLS.from_formula('rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor + borough', data=rentals_train).fit()
  • Then, we use those models to predict the rental price for the apartments in the test set:
    fitted1 = model1.predict(rentals_test)
    fitted2 = model2.predict(rentals_test)
  • Finally, we can compare the predicted rents to the true rents in the test set and use a metric to determine how well each model performed.
  • In this example, we’ll use a metric called predictive root mean squared error (PRMSE), which is exactly what the name sounds like: the square root of the mean squared difference between predicted and true values of the outcome variable. A smaller PRMSE means that the model performed better (the predicted values were more similar to the true values):
    true = rentals_test.rent
    prmse1 = np.mean((true-fitted1)**2)**.5
    prmse2 = np.mean((true-fitted2)**2)**.5
    print(prmse1) #output: 1326.258
    print(prmse2) #output: 1224.269
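
For reference, the verbal definition of PRMSE above can be written as a formula. The symbols are notation assumed here for illustration, not taken from the lesson: n is the number of rows in the test set, y_i is the true outcome for test row i, and ŷ_i is the model's prediction for that row.

```latex
\mathrm{PRMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
```

This corresponds term by term to the expression `np.mean((true-fitted1)**2)**.5`: the difference is squared, averaged over the test rows, and then the square root is taken.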

Based on this metric, we would choose the second model over the first one because it has a smaller PRMSE.
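
The whole workflow can also be sketched end to end on made-up data. This is a minimal illustration under stated assumptions, not the lesson's own code: the data are synthetic, the helper names (`prmse`, `split_indices`, `fit_and_score`) are hypothetical, and the models are fit with NumPy's `np.linalg.lstsq` rather than statsmodels so that the example depends on NumPy alone.

```python
import numpy as np

def prmse(true, predicted):
    """Square root of the mean squared difference between true and predicted values."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((true - predicted) ** 2)))

def split_indices(n_rows, train_frac=0.8, seed=0):
    """Randomly split range(n_rows) into non-overlapping train and test indices."""
    rng = np.random.default_rng(seed)  # seeded so the split is reproducible
    shuffled = rng.permutation(n_rows)
    s = int(train_frac * n_rows)
    return shuffled[:s], shuffled[s:]

def fit_and_score(X, y, train_ind, test_ind):
    """Fit least squares (with intercept) on the training rows; return test-set PRMSE."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(X1[train_ind], y[train_ind], rcond=None)
    return prmse(y[test_ind], X1[test_ind] @ coef)

# Synthetic data (invented for illustration): y depends on both x1 and x2
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 2.0 * x1 + 5.0 * x2 + rng.normal(scale=0.5, size=n)

train_ind, test_ind = split_indices(n)

# Model 1 uses x1 only; model 2 uses x1 and x2
prmse_small = fit_and_score(x1.reshape(-1, 1), y, train_ind, test_ind)
prmse_large = fit_and_score(np.column_stack([x1, x2]), y, train_ind, test_ind)

# Choose the model with the smaller test-set PRMSE
chosen = 1 if prmse_small < prmse_large else 2
print(prmse_small, prmse_large, chosen)
```

Because both models are scored on rows that neither of them saw during fitting, the comparison reflects out-of-sample performance rather than how closely each model fits its own training data.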



In script.py, we’ve provided you with code to split the bikes dataset into training and test sets (bikes_train and bikes_test). We’ve then fit two different models with the training set and saved those models as model1 and model2, respectively.

Use model1 to predict the number of bikes rented (cnt) for each day in the test dataset and save the result as fitted1.


Use model2 to predict the number of bikes rented (cnt) for each day in the test dataset and save the result as fitted2.


Calculate the PRMSE for model1 and save it as prmse1.


Calculate the PRMSE for model2 and save it as prmse2.


Print out the PRMSE for both models.


Which model would you choose based on PRMSE? Indicate your answer by setting a variable named which_model equal to 1 if you would choose model1 and equal to 2 if you would choose model2.
