
As with most machine learning algorithms, we have to split our dataset into:

  • Training set: the data used to fit the model
  • Test set: the data held out at the very start of the experiment and used only to give an unbiased evaluation of the fitted model

Training Set vs. Testing Set

In general, putting 80% of your data in the training set and 20% of your data in the test set is a good place to start.

Suppose you have some values in x and some values in y:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, test_size=0.2)

Here are the parameters:

  • train_size: the proportion of the dataset to include in the train split (between 0.0 and 1.0)
  • test_size: the proportion of the dataset to include in the test split (between 0.0 and 1.0)
  • random_state: the seed for the random number generator that shuffles the data before splitting; passing the same value makes the split reproducible [optional]
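As a rough illustration of how these parameters interact, here is a minimal sketch on made-up toy arrays (the data and the variable names ending in _a and _b are assumptions for this example, not part of the lesson). It calls train_test_split() twice with the same random_state and checks that both calls produce the same split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up for illustration): 10 samples with 2 features each
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same random_state means the same shuffle, so both calls split identically
x_train_a, x_test_a, y_train_a, y_test_a = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)
x_train_b, x_test_b, y_train_b, y_test_b = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

print(len(x_train_a), len(x_test_a))       # 8 2
print(np.array_equal(x_test_a, x_test_b))  # True
```

Leaving random_state out would still give an 8/2 split, but the rows that land in each set could differ from run to run.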

To learn more, here is a Training Set vs Validation Set vs Test Set article.

Instructions

1.

Import train_test_split from sklearn.model_selection.

2.

Create a DataFrame x that selects the following columns from the main df DataFrame:

  • 'bedrooms'
  • 'bathrooms'
  • 'size_sqft'
  • 'min_to_subway'
  • 'floor'
  • 'building_age_yrs'
  • 'no_fee'
  • 'has_roofdeck'
  • 'has_washer_dryer'
  • 'has_doorman'
  • 'has_elevator'
  • 'has_dishwasher'
  • 'has_patio'
  • 'has_gym'

Create a DataFrame y that selects the rent column from the main df DataFrame.

These are the columns we want to use for our regression model.
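Selecting columns like this can be sketched with pandas as follows. The small df below is a hypothetical stand-in with invented values (the lesson's real df is loaded elsewhere), and only three of the fourteen feature columns are shown to keep the example short:

```python
import pandas as pd

# Hypothetical stand-in for the lesson's df: two made-up apartments
df = pd.DataFrame({
    'rent': [2550, 11500],
    'bedrooms': [0.0, 2.0],
    'bathrooms': [1, 2],
    'size_sqft': [480, 2000],
    'neighborhood': ['Upper East Side', 'Soho'],  # a column we do not select
})

# Passing a list of column names inside the brackets returns a DataFrame
x = df[['bedrooms', 'bathrooms', 'size_sqft']]
y = df[['rent']]

print(x.shape)  # (2, 3)
print(y.shape)  # (2, 1)
```

Note the double brackets in df[['rent']]: a single pair, df['rent'], would return a Series rather than a one-column DataFrame.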

3.

Use scikit-learn’s train_test_split() function to split x and y into an 80% training set and a 20% testing set, generating:

  • x_train
  • x_test
  • y_train
  • y_test

Set the random_state to 6.

4.

Let’s take a look at the shapes of x_train, x_test, y_train, and y_test to confirm that we got the proportions we wanted.

Each apartment has 14 features (the columns of x) and 1 label (its rent), so x_train and x_test should each have 14 columns, and y_train and y_test should each have 1.
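The shape check described above might look like the sketch below. It assumes a synthetic 100-row dataset of random numbers in place of the lesson's real apartment data, so only the shapes are meaningful, not the values:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

feature_names = [
    'bedrooms', 'bathrooms', 'size_sqft', 'min_to_subway', 'floor',
    'building_age_yrs', 'no_fee', 'has_roofdeck', 'has_washer_dryer',
    'has_doorman', 'has_elevator', 'has_dishwasher', 'has_patio', 'has_gym',
]

# Synthetic stand-in for the lesson's data: 100 rows of random numbers
rng = np.random.default_rng(0)
x = pd.DataFrame(rng.random((100, len(feature_names))), columns=feature_names)
y = pd.DataFrame({'rent': rng.random(100)})

x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.8, test_size=0.2, random_state=6
)

print(x_train.shape)  # (80, 14)
print(x_test.shape)   # (20, 14)
print(y_train.shape)  # (80, 1)
print(y_test.shape)   # (20, 1)
```

With the real dataset the row counts will differ, but the training rows should still be roughly four times the test rows, and the column counts should stay at 14 and 1.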
