Multiple Linear Regression
Training Set vs. Test Set

As with most machine learning algorithms, we have to split our dataset into:

• Training set: the data used to fit the model
• Test set: the data partitioned away at the very start of the experiment (to provide an unbiased evaluation of the model)

In general, putting 80% of your data in the training set and 20% of your data in the test set is a good place to start.

Suppose you have some values in x and some values in y:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)

Here are the parameters:

• train_size: the proportion of the dataset to include in the train split (between 0.0 and 1.0)
• test_size: the proportion of the dataset to include in the test split (between 0.0 and 1.0)
• random_state: the seed used by the random number generator; set it to get the same split on every run [optional]
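To see what random_state does, here is a minimal sketch. The toy arrays and the seed value 42 are assumptions chosen for illustration and are not part of the lesson's data. Calling train_test_split() twice with the same seed should give the same split both times:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42
)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=42
)

# True when both calls put the same samples in the test set
same_split = np.array_equal(x_test_a, x_test_b) and np.array_equal(y_test_a, y_test_b)
print(same_split)
print(len(x_train_a), len(x_test_a))
```

Without random_state, each run shuffles differently, so results that depend on the split can change from run to run.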

To learn more, here is a Training Set vs Validation Set vs Test Set article.

### Instructions

1. Import train_test_split from sklearn.model_selection.

2. Create a DataFrame x that selects the following columns from the main df DataFrame:

• 'bedrooms'
• 'bathrooms'
• 'size_sqft'
• 'min_to_subway'
• 'floor'
• 'building_age_yrs'
• 'no_fee'
• 'has_roofdeck'
• 'has_washer_dryer'
• 'has_doorman'
• 'has_elevator'
• 'has_dishwasher'
• 'has_patio'
• 'has_gym'

Create a DataFrame y that selects the rent column from the main df DataFrame.

These are the columns we want to use for our regression model.
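As a rough sketch of this step: the df below is a hypothetical stand-in filled with made-up numbers, since the lesson's real df is loaded elsewhere. Only the column names come from the list above. Passing a list of column names inside double brackets returns a DataFrame rather than a Series:

```python
import numpy as np
import pandas as pd

feature_columns = [
    'bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor',
    'building_age_yrs', 'no_fee', 'has_roofdeck', 'has_washer_dryer',
    'has_doorman', 'has_elevator', 'has_dishwasher', 'has_patio', 'has_gym',
]

# Hypothetical stand-in for the lesson's df: 10 rows of made-up values
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.integers(0, 5, size=(10, len(feature_columns))),
    columns=feature_columns,
)
df['rent'] = rng.integers(1500, 6000, size=10)

# Double brackets keep the result a DataFrame, even for a single column
x = df[feature_columns]
y = df[['rent']]

print(x.shape)
print(y.shape)
```

With single brackets, df['rent'] would give a Series instead, which has a one-dimensional shape.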

3. Use scikit-learn’s train_test_split() function to split x and y into an 80% training set and a 20% test set, generating:

• x_train
• x_test
• y_train
• y_test

Set the random_state to 6.
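One property worth checking after the split is that each apartment's features stay paired with its rent. The sketch below uses a small made-up DataFrame (the values and the reduced column set are assumptions for illustration) and compares the row labels on both sides of the split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the lesson's data: 10 made-up apartments
df = pd.DataFrame({
    'bedrooms': [0, 1, 1, 2, 2, 3, 1, 2, 0, 3],
    'size_sqft': [450, 600, 650, 900, 950, 1300, 700, 880, 400, 1250],
    'rent': [2100, 2600, 2750, 3600, 3800, 5200, 2900, 3500, 1950, 5000],
})
x = df[['bedrooms', 'size_sqft']]
y = df[['rent']]

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

# True when the same row labels appear, in the same order, in x and y
rows_aligned = (
    x_train.index.equals(y_train.index)
    and x_test.index.equals(y_test.index)
)
print(rows_aligned)
```

This is why x and y are passed to a single call: splitting them separately would shuffle each one independently.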

4. Let’s take a look at the shapes of x_train, x_test, y_train, and y_test to confirm we got the proportions we wanted.

For each apartment we have 14 features and 1 label, so x_train and x_test should have 14 columns, and y_train and y_test should have 1.
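Here is a small sketch of that check. The 50-row dataset, the feature_0 to feature_13 column names, and the random values are assumptions standing in for the real apartment data, so the row counts will differ from the lesson's:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 apartments, 14 features, 1 label
n_rows, n_features = 50, 14
rng = np.random.default_rng(0)
x = pd.DataFrame(
    rng.random((n_rows, n_features)),
    columns=[f'feature_{i}' for i in range(n_features)],
)
y = pd.DataFrame({'rent': rng.integers(1500, 6000, size=n_rows)})

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

# .shape is (rows, columns)
print(x_train.shape)  # (40, 14)
print(x_test.shape)   # (10, 14)
print(y_train.shape)  # (40, 1)
print(y_test.shape)   # (10, 1)
```

The first number in each shape shows the 80/20 split of the rows; the second shows the number of features or labels.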