As with most machine learning algorithms, we have to split our dataset into:

  • Training set: the data used to fit the model
  • Test set: the data partitioned away at the very start of the experiment (to provide an unbiased evaluation of the model)

Training Set vs. Testing Set

In general, putting 80% of your data in the training set and 20% of your data in the test set is a good place to start.

Suppose you have some values in x and some values in y:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)

Here are the parameters:

  • train_size: the proportion of the dataset to include in the train split (between 0.0 and 1.0)
  • test_size: the proportion of the dataset to include in the test split (between 0.0 and 1.0)
  • random_state: the seed used by the random number generator [optional]
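To see what random_state does, here is a minimal sketch on made-up toy data. The arrays and the seed value 42 are assumptions chosen only for illustration. Calling train_test_split() twice with the same seed should give the same split both times.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each, and 10 labels.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same random_state shuffle the rows identically.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42
)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape)        # (8, 2) (2, 2)
print(np.array_equal(x_test, x_test_b))   # True
```

Without random_state, each run may produce a different split, which makes results harder to compare between runs.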

Import train_test_split from sklearn.model_selection.


Create a DataFrame x that selects the following columns from the main df DataFrame:

  • 'bedrooms'
  • 'bathrooms'
  • 'size_sqft'
  • 'min_to_subway'
  • 'floor'
  • 'building_age_yrs'
  • 'no_fee'
  • 'has_roofdeck'
  • 'has_washer_dryer'
  • 'has_doorman'
  • 'has_elevator'
  • 'has_dishwasher'
  • 'has_patio'
  • 'has_gym'

Create a DataFrame y that selects the rent column from the main df DataFrame.

These are the columns we want to use for our regression model.
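The column selection might look like the sketch below. The three-row df is a made-up stand-in for the real dataset (its values are invented), and feature_columns is a helper name introduced here for illustration.

```python
import pandas as pd

# The 14 feature columns listed above (feature_columns is a hypothetical name).
feature_columns = [
    'bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor',
    'building_age_yrs', 'no_fee', 'has_roofdeck', 'has_washer_dryer',
    'has_doorman', 'has_elevator', 'has_dishwasher', 'has_patio', 'has_gym',
]

# Made-up stand-in for the main df DataFrame: 14 feature values plus rent.
rows = [
    [1, 1, 480, 9, 2, 17, 1, 1, 0, 0, 1, 1, 0, 1, 2550],
    [2, 2, 2000, 4, 1, 96, 0, 0, 0, 0, 0, 0, 0, 0, 11500],
    [0, 1, 916, 2, 51, 29, 0, 1, 0, 1, 1, 1, 0, 0, 4500],
]
df = pd.DataFrame(rows, columns=feature_columns + ['rent'])

# A list of column names returns a DataFrame with just those columns.
x = df[feature_columns]

# Double brackets keep y as a one-column DataFrame rather than a Series.
y = df[['rent']]

print(x.shape)  # (3, 14)
print(y.shape)  # (3, 1)
```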


Use scikit-learn’s train_test_split() function to split x and y into an 80% training set and a 20% testing set, generating:

  • x_train
  • x_test
  • y_train
  • y_test

Set the random_state to 6.
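Put together, the split could be sketched as follows. Since the real dataset is not shown here, df is filled with random integers as a placeholder. Only the column names and random_state=6 come from the steps above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

feature_columns = [
    'bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor',
    'building_age_yrs', 'no_fee', 'has_roofdeck', 'has_washer_dryer',
    'has_doorman', 'has_elevator', 'has_dishwasher', 'has_patio', 'has_gym',
]

# Placeholder df (assumption): 100 rows of random integers, not real listings.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 100, size=(100, 15)),
    columns=feature_columns + ['rent'],
)

x = df[feature_columns]
y = df[['rent']]

# Rows of x and y are shuffled together, so features stay aligned with labels.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

print(len(x_train), len(x_test))  # 80 20
```

Because x and y are passed in the same call, each row of x_train keeps the same index as its label in y_train.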


Let’s take a look at the shapes of x_train, x_test, y_train, and y_test to confirm that we got the proportions we wanted.

Each apartment has 14 features (the columns of x) and 1 label (its rent, stored in y).
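A quick way to check the shapes is sketched below. The arrays are zero-filled placeholders with 14 feature columns and 1 label column, and the row count of 100 is an assumption, so the printed shapes will differ for the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 100  # hypothetical row count, for illustration only
x = np.zeros((n_rows, 14))  # 14 features per apartment
y = np.zeros((n_rows, 1))   # 1 label (rent) per apartment

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

# Each shape is (number of rows, number of columns).
print(x_train.shape)  # (80, 14)
print(x_test.shape)   # (20, 14)
print(y_train.shape)  # (80, 1)
print(y_test.shape)   # (20, 1)

# Fraction of rows that ended up in the training set.
print(len(x_train) / n_rows)  # 0.8
```

The first number in each shape is the row count, so comparing it with the total number of rows shows whether the 80/20 proportion holds.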

