Another way of choosing a model to make predictions for new data (also called out-of-sample prediction) is by using *training* and *test* datasets. The idea is that we use only *part* of our data to fit the model, then see how well the model predicts the outcome of interest for the rest of our data. The process is as follows:

- First, we split our data into two subsets: a training set and a test set. Often, the training set is a larger proportion of the data.
- Other Python libraries provide built-in functions for splitting a dataframe (for example, scikit-learn's `train_test_split`), but for the sake of understanding, we'll do it explicitly here by randomly sampling from a list of row indices:

```python
import numpy as np

# Create a list of indices
indices = range(len(rentals))

# Determine the size of the training set (s)
s = int(0.8 * len(indices))

# Randomly select 80% of the indices
train_ind = np.random.choice(indices, size=s, replace=False)

# Create a list of the remaining 20% of indices
test_ind = list(set(indices) - set(train_ind))

# Split the data into the training and test sets
rentals_train = rentals.iloc[train_ind]
rentals_test = rentals.iloc[test_ind]
```
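To see this split logic run end to end, here is the same procedure on a small toy DataFrame (the `df` name, its contents, and the random seed below are illustrative, not part of the lesson's `rentals` data):

```python
import numpy as np
import pandas as pd

# A tiny toy DataFrame standing in for the rentals data
df = pd.DataFrame({'rent': np.arange(10) * 100})

np.random.seed(0)  # seed only so the split is reproducible

indices = range(len(df))
s = int(0.8 * len(indices))                                   # 8 training rows
train_ind = np.random.choice(indices, size=s, replace=False)  # 80% sample
test_ind = list(set(indices) - set(train_ind))                # remaining 20%

df_train = df.iloc[train_ind]
df_test = df.iloc[test_ind]

print(len(df_train), len(df_test))  # 8 2
```

Every row lands in exactly one of the two sets: the training and test indices are disjoint, and together they cover the whole DataFrame.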

- Next, we fit the models we want to compare using the training set data only:

```python
model1 = sm.OLS.from_formula(
    'rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor',
    data=rentals_train).fit()
model2 = sm.OLS.from_formula(
    'rent ~ bedrooms + bathrooms + size_sqft + min_to_subway + floor + borough',
    data=rentals_train).fit()
```
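The key idea here, fitting on the training rows only, doesn't depend on statsmodels. A minimal self-contained sketch with made-up data (all names and numbers below are illustrative) shows that the test rows never influence the fitted coefficients:

```python
import numpy as np

rng = np.random.default_rng(42)

# Fake data: rent depends linearly on apartment size, plus noise
size = rng.uniform(400, 2000, 100)
rent = 2.5 * size + 500 + rng.normal(0, 50, 100)

# Treat the first 80 rows as "training" and the last 20 as "test"
X_train = np.column_stack([np.ones(80), size[:80]])
coef, *_ = np.linalg.lstsq(X_train, rent[:80], rcond=None)

# The test rows are used only for prediction, never for fitting
X_test = np.column_stack([np.ones(20), size[80:]])
predicted = X_test @ coef
```

Because only the training rows enter `lstsq`, the predictions for the held-out rows are genuinely out-of-sample, which is what makes them a fair basis for comparing models.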

- Then, we use those models to predict the rental price for the apartments in the test set:

```python
fitted1 = model1.predict(rentals_test)
fitted2 = model2.predict(rentals_test)
```

- Finally, we can compare the predicted rents to the true rents in the test set and use a metric to determine how well each model performed.
- In this example, we’ll use a metric called *predictive root mean squared error (PRMSE)*, which is exactly what the name sounds like: the square root of the mean squared difference between predicted and true values of the outcome variable. A smaller PRMSE means that the model performed better (the predicted values were more similar to the true values):

```python
true = rentals_test.rent
prmse1 = np.mean((true - fitted1)**2)**.5
prmse2 = np.mean((true - fitted2)**2)**.5

print(prmse1) # output: 1326.258
print(prmse2) # output: 1224.269
```
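To see the PRMSE arithmetic on its own, here it is computed for two tiny made-up vectors of true and predicted values (these numbers are illustrative, not from the rentals data):

```python
import numpy as np

true = np.array([1000., 1200., 1400.])
pred = np.array([1100., 1100., 1400.])  # errors of 100, 100, and 0

# Mean squared error: (100**2 + 100**2 + 0**2) / 3
# PRMSE is its square root
prmse = np.mean((true - pred) ** 2) ** 0.5
print(round(prmse, 2))  # 81.65
```

A PRMSE of about 81.65 says the predictions miss the true values by roughly 82 units on average, in the same units as the outcome variable.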

Based on this metric, we would choose the second model over the first one because it has a smaller PRMSE.

### Instructions

**1.**

In **script.py**, we’ve provided you with code to split the `bikes` dataset into training and test sets (`bikes_train` and `bikes_test`). We’ve then fit two different models with the training set and saved those models as `model1` and `model2`, respectively.

Use `model1` to predict the number of bikes rented (`cnt`) for each day in the test dataset and save the result as `fitted1`.

**2.**

Use `model2` to predict the number of bikes rented (`cnt`) for each day in the test dataset and save the result as `model2_predicted`.

**3.**

Calculate the PRMSE for `model1` and save it as `prmse1`.

**4.**

Calculate the PRMSE for `model2` and save it as `prmse2`.

**5.**

Print out the PRMSE for both models.

**6.**

Which model would you choose based on PRMSE? Indicate your answer by setting a variable named `which_model` equal to `1` if you would choose `model1` and equal to `2` if you would choose `model2`.