
Linear Regression with scikit-learn: A Step-by-Step Guide Using Python

Discover the fundamentals of linear regression and learn how to build linear regression and multiple regression models using the sklearn library in Python.

In the real world, events often follow patterns. A person with a high BMI is more likely to have a high blood sugar level. Similarly, a company’s stock prices depend on its profits, order book value, and liabilities. By identifying and modeling these patterns, we can predict outcomes that will help us make better decisions across domains. To achieve this, we can build a linear regression model using the sklearn module in Python.

In this article, we will discuss linear regression and how it works. We will also implement linear regression models using the sklearn module in Python to predict the disease progression of diabetic patients using features like BMI, blood pressure, and age. Finally, we will discuss the assumptions and use cases for linear regression models that will help you decide whether to use linear regression for a given dataset or not.

What is linear regression?

In statistics and machine learning, regression is the process of modeling the relationship between independent and dependent variables. Linear regression is a supervised machine learning algorithm that models the relationship between independent and dependent variables, assuming that the dependent variable is a linear combination of the input features. For example, we can model the relationship between age and blood sugar level of a given population as follows:

Blood Sugar Level = 70 + 2 × age

Here, we have assumed that people’s blood sugar levels are linearly dependent on their age. According to the formula, a newborn child (age 0) will have a blood sugar level of 70, and a 20-year-old person will have a blood sugar level of 110.
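To see this arithmetic in code, here is a quick sketch that plugs a few ages into the assumed formula:

# Evaluate the assumed blood sugar formula for a few ages
def blood_sugar(age):
    return 70 + 2 * age

for age in [0, 20, 45]:
    print(f"Age {age}: predicted blood sugar level {blood_sugar(age)}")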

Now, suppose we have other population features, such as body mass index (BMI), blood pressure, and age. In that case, we can model the relationship between the features and the blood sugar level of a given population as follows:

Blood Sugar Level = 110 + 0.5 × age - 1.2 × BMI + 0.1 × Blood Pressure

We can use linear regression in tasks like revenue prediction, drug dosage calculation, rent estimation, property price prediction, demand forecasting, etc. For these tasks, we build a linear regression model using the historical data and use the model to predict values for a given set of input features. Let’s discuss what a linear regression model is.


What is a linear regression model?

A linear regression model mathematically represents the relationship between independent variables and a dependent variable. We can represent a linear regression model that predicts a variable Y based on an input variable X as follows:

Y = α + βX

In this equation,

  • Y is the predicted output, like blood sugar level or house rent.
  • X is an input feature, like the age of a person or the carpet area of a house.
  • α is the intercept.
  • β is the slope of the linear equation, i.e., the coefficient of X.

If we have multiple independent variables, we can estimate the relationship between Y and the input features Xi by training a linear regression model using the following equation:

Y = α + β1X1 + β2X2 + β3X3 + … + βNXN

Here,

  • Y is the predicted value for a dataset having input features X1, X2, X3, … XN.
  • α is the intercept.
  • β1, β2, β3,… βN are coefficients of input features X1, X2, X3, … XN.

What do α and β represent in the linear regression model?

In a linear regression model,

  • The intercept α represents the portion of the output not explained by the input features included in the model. It is the model’s baseline prediction when all input features equal zero. When the features are standardized (so each has a mean of 0), this baseline corresponds to all features being at their mean values.
  • The coefficients βi of the input features represent the strength and direction of the relationship between each independent and dependent variable.
  • The magnitude of the coefficient βi represents how much Y changes for a one-unit increase in Xi, holding all other variables constant.
  • The sign of the coefficient βi represents the direction of change in Y with a change in Xi: if βi > 0, Y increases as Xi increases; if βi < 0, Y decreases as Xi increases. The sketch after this list shows how these parameters combine into a prediction.
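To make this interpretation concrete, here is a minimal sketch with hand-picked, hypothetical coefficient values (not fitted to any data) showing how the intercept and coefficients combine into a prediction:

# Hypothetical parameters, chosen only for illustration
alpha = 110.0                # intercept
betas = [0.5, -1.2, 0.1]     # coefficients for age, BMI, and blood pressure
sample = [45, 27.5, 120]     # one observation: age, BMI, blood pressure

# Y = alpha + beta_1*X_1 + beta_2*X_2 + beta_3*X_3
prediction = alpha + sum(b * x for b, x in zip(betas, sample))
print(prediction)  # 110 + 22.5 - 33.0 + 12.0 = 111.5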

By training a linear regression model to find the linear equation representing the relation between the input and output variables, we can predict the output for a given set of inputs. For this task, we use the LinearRegression() function defined in the sklearn module in Python.

For training the linear regression model, let’s first install the required libraries.

To implement linear regression using the sklearn module in Python, we need to install the scikit-learn library along with helper libraries like pandas, NumPy, Matplotlib, and seaborn. You can install them all with pip by executing the following command on your machine.

pip install scikit-learn numpy pandas matplotlib seaborn

To check the version of the installed sklearn library, use the following command:

pip list | grep scikit-learn

Executing this command will give you an output as follows:

scikit-learn 1.6.1

You can check for installed versions of the other modules by replacing scikit-learn with other library names:

pip list | grep library_name
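Note that grep is available in Unix-like shells; on Windows, you can pipe to findstr instead. A shell-independent alternative is to print the version from Python itself:

import sklearn
print(sklearn.__version__)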

After installing the sklearn and other necessary modules to implement linear regression in Python, let’s download and analyze the dataset.

Downloading and analyzing the diabetes dataset for linear regression

We will use the diabetes dataset in the sklearn.datasets module to implement linear regression.

Download the dataset

We will use the load_diabetes() function in the sklearn.datasets module to download the dataset. In the load_diabetes() function, we will set the as_frame parameter to True so that the function returns the dataset as a pandas dataframe. By default, the scaled parameter is set to True, which returns mean-centered and rescaled feature values. We will set the scaled parameter to False so that the function returns the original measurement values.

from sklearn.datasets import load_diabetes
dataset = load_diabetes(as_frame=True, scaled=False)

Get input and output features of the dataset

We can get the feature names from the dataset using the feature_names attribute.

input_features = dataset.feature_names
print("The features in the input dataset are:", input_features)

Output:

The features in the input dataset are: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

If you want to get more details about the dataset, its features, and its description, you can use the DESCR attribute of the dataset as follows:

print(dataset.DESCR)

To get the dataframe containing the input features, you can use the data attribute of the dataset:

data_df=dataset.data
print("The input dataset is:")
print(data_df.head())

Output:

The input dataset is:
    age  sex   bmi     bp     s1     s2    s3   s4      s5    s6
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  4.2905  80.0

The output feature in the dataset is the target attribute that contains numbers representing the disease progression one year after baseline. You can get the target column using the target attribute:

disease_progression = dataset.target
print("The output feature values are:")
print(disease_progression.head())

Output:

The output feature values are:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64

Now that we have the dataset, let’s explore the relationships between the input features and the target variable.

Visualize the relationship between the input features and the output

To build a good linear regression model, the relationship between the input and output should be approximately linear, so that the data can be fit well by a linear equation.

In the following sections, we will develop a linear regression model to predict diabetes progression based on the BMI feature from the input data. To see if the target value is linearly dependent on BMI, let’s create a scatter plot.

import matplotlib.pyplot as plt
plt.scatter(data_df["bmi"], disease_progression, color="midnightblue")
plt.title("BMI vs Diabetes Progression One Year After Baseline")
plt.xlabel("BMI")
plt.ylabel("Diabetes Progression")
plt.show()

The scatter plot for disease progression vs BMI looks as follows:

BMI vs. Diabetes Progression

You can observe that the change in target value is somewhat proportional to the change in BMI. Thus, we can build a linear regression model that estimates the disease progression value using BMI. For this, we can represent the model using a straight line, as shown in the following image:

BMI vs. Diabetes Progression regression line

We can model this straight line using the LinearRegression() function in the sklearn module. To do this, let’s first preprocess the dataset.

Preprocess data for linear regression

We will build the linear regression model to predict diabetes progression using BMI values. Hence, we will use the bmi column of the data_df as input and the disease_progression series as output. Let’s assign the input to a variable X and the target values to a variable y.

X = data_df[["bmi"]]
y = disease_progression

After assigning the input and output features to the X and y variables, we will divide the data into training and test sets for model training and evaluation.

Split dataset into train and test sets

We will divide the input dataset into training and test sets. The training set is used to train the linear regression model, and the test set is reserved for evaluating its performance. We will use the train_test_split() function defined in the sklearn.model_selection module to split the input and output features into train and test sets. It takes the following inputs:

  • The input X and the target y are the first and second inputs to the train_test_split() function.
  • Using the test_size parameter, we define the data portion to be used as the test set. We will set the test_size to 0.2 to use twenty percent of the input data as the test dataset.
  • By default, the train_test_split() function generates random train and test sets in each execution. We set the random_state parameter to an integer value to ensure that the data splits can be replicated. For this tutorial, we will set the random_state to 0. After this, the function will split the dataset in the same manner every time we execute the function.

After execution, the train_test_split() function returns training and test sets for input and target features, as follows:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

Scaling the dataset

Before feeding the training data into the model, we must also normalize it. For linear regression models, we scale the input data using standard scaling. It transforms the data so that each feature has a mean of zero and a standard deviation of one, as in standard normal distribution. Scaling the dataset helps us in the following ways:

  • If the input features have very different scales, features with large magnitudes dominate gradient-based training, making convergence slow and uneven. Scaling the feature values helps gradient-based solvers converge faster. (sklearn’s LinearRegression() uses a closed-form least-squares solution, but scaling remains good practice and matters for gradient-based regressors.)
  • Without scaling, the trained model’s coefficients depend on the units of measurement of the input features. Coefficients trained on scaled data tell us how much the output changes per standard deviation change in an input feature, which makes them easier to compare and interpret.
  • Even when building a linear regression model with a single input feature, scaling keeps the values in a consistent, unit-free range and makes the intercept correspond to the prediction at the feature’s mean. Note that standard scaling rescales extreme values but does not remove outliers.

To scale our dataset, we will use the StandardScaler() function defined in the sklearn.preprocessing module. First, we will create the scaler object as follows:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Next, we will train the scaler using the training input dataset by passing the X_train data to the fit() method.

scaler = scaler.fit(X_train)

Here, we have trained the standard scaler using only the training dataset, not the entire dataset. If we train it on the entire dataset, including the test data, we will be using information from the test set to normalize the training set, violating the principle that the test set must be completely unseen.

After training the standard scaler, we will transform the training and test dataset:

X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

Scaling the target values is generally not required unless there are many extreme values. If we scale the target values during training, we must also inverse transform predictions generated by the trained model to the original scale, which creates an overhead. Hence, we won’t scale the target values.
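For completeness, if you ever do scale the target, the round trip looks like the following sketch (illustrative only; we skip this step in this tutorial):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a separate scaler on the training targets (reshaped to a 2D array)
y_scaler = StandardScaler()
y_train_scaled = y_scaler.fit_transform(np.asarray(y_train).reshape(-1, 1))

# After training on y_train_scaled, predictions come back on the scaled
# scale and must be inverted, e.g.:
# y_pred = y_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1))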

With the input dataset scaled, we can proceed to build a linear regression model using the sklearn module in Python.

Implement linear regression using the sklearn module in Python

To implement linear regression in Python, we use the LinearRegression() function defined in the sklearn.linear_model module. Let’s discuss the steps to build a linear regression model using the LinearRegression() function.

Step 1: Create an untrained model

When we execute the LinearRegression() function, it returns an untrained linear regression model.

from sklearn.linear_model import LinearRegression
untrained_model=LinearRegression()

We can train this linear regression model using the fit() method.

Step 2: Train the linear regression model

The fit() method, when invoked on the untrained linear regression model, takes the scaled training dataset and target values as its first and second input arguments. After execution, it returns a trained linear regression model.

We will pass the X_train and y_train variables as the first and second input arguments to the fit() method, respectively.

trained_model = untrained_model.fit(X_train, y_train)

The trained model fits a linear equation of the form Y = α + βX. We can get the intercept α and the coefficient β from the model to recover this equation.

  • To get the intercept α, we use the intercept_ attribute of the trained model.
  • To get the coefficients of the input features, we use the coef_ attribute. It contains a list of the coefficients of all the input features.

intercept = trained_model.intercept_
coefficient = trained_model.coef_
print("The intercept is:", intercept)
print("The coefficient is:", coefficient)

Output:

The intercept is: 151.60623229461754
The coefficient is: [47.98832087]

You can see that the intercept of the model is 151.606, and the coefficient is 47.988. Hence, we can represent this model as Y = 151.606 + 47.988*bmi, where bmi is the standardized BMI value (since the model was trained on scaled inputs). We can use this model to predict the diabetes progression for BMI values in the test dataset.
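As a sanity check, you can reproduce a prediction by hand from these fitted parameters. Here is a minimal sketch using the scaled X_test from the preprocessing step:

# Manually apply Y = intercept + coefficient * scaled_bmi
scaled_bmi = X_test[0][0]  # first scaled BMI value in the test set
manual_prediction = trained_model.intercept_ + trained_model.coef_[0] * scaled_bmi
print("Manual prediction:", manual_prediction)

This value should match the first prediction returned by the predict() method, which we cover next.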

Step 3: Predict values using the trained linear regression model

To predict outputs for new input values using the trained model, we use the predict() method. It takes an array or dataframe of independent feature values as its input and returns a numpy array with predictions. For example, we can predict the disease progression for the data points in the test dataset, as shown below:

y_predicted = trained_model.predict(X_test)
print("The predicted values are:")
print(y_predicted)

Output:

The predicted values are:
[255.17426905 211.79462571 161.0087018  129.26749936 196.98206457 ...]

Instead of the test dataset, we can also pass a list of new BMI values to predict disease progression. For this, we will use the following steps:

  • First, we will create a 2D list containing the new BMI values for which we need the predicted output.
  • Next, we will scale the new BMI values with the scaler we trained on the training dataset.
  • Finally, we will pass the scaled values to the predict() method to produce the output.

After execution, the predict() method returns a list with predicted output for each input value.

new_bmi_values=[[20],[22],[25],[27],[30]]
scaled_values=scaler.transform(new_bmi_values)
print("New BMI values are:")
print(new_bmi_values)
print("The scaled values are:")
print(scaled_values)
predictions = trained_model.predict(scaled_values)
print("The predicted values are:")
print(predictions)

Output:

New BMI values are:
[[20], [22], [25], [27], [30]]
The scaled values are:
[[-1.39151392]
[-0.95055659]
[-0.2891206 ]
[ 0.15183672]
[ 0.81327271]]
The predicted values are:
[ 84.82981594 105.99061756 137.73182001 158.89262164 190.63382408]

In this output, you can see that the model predicts a disease progression of 84.83 for BMI 20, 105.99 for BMI 22, and 190.63 for BMI 30. These values may or may not be accurate. Hence, we need to evaluate the model’s performance.

Step 4: Evaluate the model performance

We use the test dataset created using the train_test_split() function to evaluate the model’s performance. To do this, we will first predict the output of the model for the data in the test set.

Then, we will compare the predicted values with the actual values given in the target test set using metrics like mean squared error (MSE) and the coefficient of determination (R2 score).

  • MSE calculates the average squared difference between predicted and actual values. For a well-performing model, MSE should be low.
  • The R2 score measures how much of the variance in the target the model explains. An R2 score of 1 represents perfect prediction, and a score of 0 means the model does no better than always predicting the mean; on test data, the R2 score can even be negative if the model performs worse than that baseline. The sketch after this list shows what both metrics compute.
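Under the hood, both metrics are simple to compute. Here is a quick NumPy sketch of what they measure, using the first five actual target values and made-up predictions:

import numpy as np

y_true = np.array([151.0, 75.0, 141.0, 206.0, 135.0])  # actual values
y_pred = np.array([160.0, 80.0, 130.0, 190.0, 140.0])  # hypothetical predictions

mse = np.mean((y_true - y_pred) ** 2)              # average squared error
ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1 - ss_res / ss_tot
print("MSE: %.2f, R2 score: %.2f" % (mse, r2))     # MSE: 101.60, R2 score: 0.94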

You can compute these metrics using the mean_squared_error() and r2_score() functions defined in the sklearn.metrics module.

from sklearn.metrics import mean_squared_error, r2_score
y_predicted = trained_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print("Mean squared error (MSE): %.2f" % mse)
print("R² score: %.2f" % r2)

Output:

Mean squared error (MSE): 4150.68
R² score: 0.19

You can see that the model has a very high MSE for the test set and a low R2 score of 0.19. This suggests that the trained model isn’t good for predicting diabetes disease progression using BMI.

In the previous subsections, we discussed how to build and evaluate a linear regression model using a single independent variable. Now, let’s discuss how to build linear regression models with multiple independent features.

Implement multiple linear regression using the sklearn module in Python

We will use the age, bmi, and bp features in the diabetes dataset to build a linear regression model with multiple input variables. The target values will be the disease progression values.

from sklearn.datasets import load_diabetes
dataset = load_diabetes(as_frame=True, scaled=False)
data_df=dataset.data
disease_progression = dataset.target
X = data_df[["age","bmi","bp"]]
y = disease_progression

Now, let’s preprocess this data.

Preprocess data for multiple linear regression

To implement multiple linear regression using the sklearn module, the model training, prediction, and evaluation steps remain the same as those discussed for linear regression with a single input feature. However, we need to preprocess the data in a slightly different manner. Let’s discuss how to do this.

Correlation analysis

To build linear regression models with multiple features, we must ensure that two or more independent features aren’t highly correlated. When the input dataset contains highly correlated features, the model struggles to assign clear weights to correlated features while training, leading to large variances in coefficient estimates.

We can use the corr() method on the input dataframe X to check the correlation between the input features.

correlation_df=X.corr()
print("Correlation between features:")
print(correlation_df)

Output:

Correlation between features:
          age       bmi        bp
age  1.000000  0.185085  0.335428
bmi  0.185085  1.000000  0.395411
bp   0.335428  0.395411  1.000000

As you can see, we don’t have highly correlated features in the dataset, so we can train the linear regression model on it. If your dataset does have highly correlated features, you can keep one feature from each correlated group and drop the rest. Alternatively, you can use dimensionality reduction techniques like principal component analysis, which replaces correlated features with uncorrelated components.
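For datasets with many features, scanning the correlation matrix by eye becomes tedious. Here is a small sketch of how you might flag highly correlated pairs automatically (the 0.8 threshold is a common rule of thumb, not a fixed standard):

import numpy as np

# Flag feature pairs whose absolute correlation exceeds a threshold
corr = X.corr().abs()
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # upper triangle, no diagonal
high_pairs = [
    (row, col, corr.loc[row, col])
    for i, row in enumerate(corr.index)
    for j, col in enumerate(corr.columns)
    if mask[i, j] and corr.loc[row, col] > 0.8
]
print(high_pairs)  # empty for this dataset: no pair exceeds 0.8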

After correlation analysis, we can split the data into train and test sets, and scale the features as follows:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Normalize data using standard scaling
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train=scaler.transform(X_train)
X_test=scaler.transform(X_test)

Train the regression model using multiple features

After getting the scaled dataset, the model training process for multiple regression is the same as that of a regression with a single variable. You can train the model as follows:

from sklearn.linear_model import LinearRegression
# Create an untrained linear regression model
untrained_model=LinearRegression()
# Train the model with multiple independent features
trained_model = untrained_model.fit(X_train, y_train)
# Get the intercept and coefficient from the model
intercept = trained_model.intercept_
coefficient = trained_model.coef_
print("The intercept is:",intercept)
print("The coefficients are:",coefficient)

Output:

The intercept is: 151.60623229461757
The coefficients are: [-0.51656823 40.63115896 18.26516127]

In this output, the intercept of the model is 151.6. The coefficient list has three values, i.e., -0.51, 40.63, and 18.26, which are the coefficients of the features age, bmi, and bp, respectively. Hence, we can represent this model as Y = 151.606 - 0.516*age + 40.63*bmi + 18.26*bp, where age, bmi, and bp are the standardized feature values.
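As with the single-feature model, we can predict disease progression for new, hypothetical patients by scaling their raw feature values with the scaler fitted above before calling predict():

# Hypothetical new patients: [age, bmi, bp]
new_patients = [[50, 25.0, 90.0], [65, 32.0, 110.0]]
scaled_patients = scaler.transform(new_patients)
print(trained_model.predict(scaled_patients))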

Evaluate the model performance

After training the model, we can predict values and evaluate the model as follows:

from sklearn.metrics import mean_squared_error, r2_score
y_predicted = trained_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print("Mean squared error (MSE): %.2f" % mse)
print("R² score: %.2f" % r2)

Output:

Mean squared error (MSE): 3709.35
R² score: 0.28

You can see that the MSE for the linear regression model is 3709.35, and the R2 score is 0.28. It is still not a good enough model. However, the MSE for this model is less than the MSE for the model with one variable, and the R2 score of this model is greater than the R2 score of the previous model. Hence, this model is slightly better.
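One caveat when judging MSE values: MSE is expressed in squared units of the target, which makes the raw number hard to read. Taking its square root gives the root mean squared error (RMSE), which is on the same scale as the disease progression values:

import numpy as np

rmse = np.sqrt(mse)  # mse computed above for the three-feature model
print("Root mean squared error (RMSE): %.2f" % rmse)  # roughly 60.9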

Assumptions for building linear regression models

Linear regression models are one of the simplest models to implement. However, we cannot use linear regression models for all types of datasets. This is because linear regression models depend on a set of assumptions:

  • Linearity: Linear regression assumes that the output variable is linearly dependent on the input features. The model performs poorly if the dependent and independent features aren’t linearly related.
  • Continuous target values: The dependent variable should be continuous, since the linear equation the model estimates produces values on a continuous numeric scale.
  • Independence: The data points in the input dataset should be independent of each other.
  • No multi-collinearity: Linear regression assumes that the input features aren’t highly correlated. If highly correlated features are present in the training dataset, the model struggles to assign clear weights to correlated features, leading to unstable coefficients while training.
  • Normality of Errors: Residuals are the differences between the actual values and the values predicted by the model. For a linear regression model, the residuals should have a normal distribution. If the residuals aren’t normally distributed, we cannot conduct hypothesis tests on coefficients or use confidence intervals for predictions.
  • Homoscedasticity: For a linear regression model, the variance of residuals is assumed to be constant across all the levels of independent variables. Thus, the spread of the residuals should remain roughly the same for a good model, no matter what the predicted value or the feature value is.

Verifying the assumptions of a linear regression model is important before applying its results to real-world scenarios. We should use linear regression models only if the datasets or the outputs satisfy the above assumptions.
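Several of these assumptions can be checked visually from the residuals. Here is a minimal diagnostic sketch, assuming the trained_model, X_test, and y_test from the multiple regression section above:

import matplotlib.pyplot as plt

# Residuals = actual values - predicted values on the test set
y_pred = trained_model.predict(X_test)
residuals = y_test - y_pred

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Homoscedasticity check: residuals vs. predictions should show no pattern
ax1.scatter(y_pred, residuals, color="midnightblue")
ax1.axhline(0, color="red", linestyle="--")
ax1.set_xlabel("Predicted value")
ax1.set_ylabel("Residual")
ax1.set_title("Residuals vs. Predictions")

# Normality check: the residual histogram should look roughly bell-shaped
ax2.hist(residuals, bins=20, color="midnightblue")
ax2.set_xlabel("Residual")
ax2.set_title("Residual Distribution")

plt.tight_layout()
plt.show()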

Use cases for linear regression models

We use linear regression to model the relationship between one or more independent features and a continuous dependent variable. Linear regression has applications in multiple domains.

  • In business and marketing, linear regression can be used for sales forecasting, budget optimization, or predicting customer lifetime value.
  • In healthcare, we can use linear regression to predict blood pressure or blood sugar based on age, BMI, weight, etc.
  • In real estate, we can build linear regression models to predict property prices and rent.
  • In finance and economics, we can use linear regression to estimate stock returns or credit scores or to model GDP based on economic factors.

Conclusion

Linear regression models are very helpful in predictive modeling due to their simplicity and interpretability. Whether you want to predict housing prices, analyze health trends, or build credit risk models, linear regression offers a reliable starting point for the analytical process. In this article, we discussed the basics of linear regression, along with its assumptions and use cases. We also discussed how to implement linear regression with single and multiple variables using the sklearn module in Python.

To learn more about linear regression, you can take this course on Introduction to regression in machine learning, which discusses linear regression and multiple linear regression in detail. You might also like this machine learning engineer career path that teaches building end-to-end ML pipelines in Python.

Frequently asked questions

1. When to use linear regression?

We use linear regression to model the relationship between a continuous dependent variable and independent variables, assuming that the output is linearly dependent on the independent variables.

2. What are the different types of regression?

The different types of regression algorithms include linear regression, multiple linear regression, polynomial regression, logistic regression, and regularization techniques like Lasso regression and Ridge regression.

3. Is regression supervised or unsupervised?

Regression is a supervised machine learning algorithm.

4. What is the difference between correlation and regression?

Correlation denotes the relationship between two variables. Regression is used to model this relationship and predict the value of one variable based on the other.

5. Is ANOVA a regression analysis?

We can consider the Analysis of Variance (ANOVA) a special case of linear regression in which the independent variables are categorical. In practice, however, ANOVA is treated as a hypothesis-testing technique rather than a regression analysis technique.

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
