Linear Regression with scikit-learn: A Step-by-Step Guide Using Python
In the real world, events often follow patterns. A person with a high BMI is more likely to have a high blood sugar level. Similarly, a company’s stock prices depend on its profits, order book value, and liabilities. By identifying and modeling these patterns, we can predict outcomes that will help us make better decisions across domains. To achieve this, we can build a linear regression model using the sklearn module in Python.
In this article, we will discuss linear regression and how it works. We will also implement linear regression models using the sklearn module in Python to predict the disease progression of diabetic patients using features like BMI, blood pressure, and age. Finally, we will discuss the assumptions and use cases for linear regression models that will help you decide whether to use linear regression for a given dataset or not.
What is linear regression?
In statistics and machine learning, regression is the process of modeling the relationship between independent and dependent variables. Linear regression is a supervised machine learning algorithm that models the relationship between independent and dependent variables, assuming that the dependent variable is a linear combination of the input features. For example, we can model the relationship between age and blood sugar level of a given population as follows:

Blood Sugar = 70 + 2 × Age

Here, we have assumed that people’s blood sugar levels are linearly dependent on their age. According to the formula, a newborn child will have a blood sugar level in the 70s, and a 20-year-old person will have a blood sugar level of 110.
Now, suppose we have other population features, such as body mass index (BMI), blood pressure, and age. In that case, we can model the relationship between the features and the blood sugar level of a given population as follows:

Blood Sugar = α + β1 × BMI + β2 × Blood Pressure + β3 × Age
We can use linear regression in tasks like revenue prediction, drug dosage calculation, rent estimation, property price prediction, demand forecasting, etc. For these tasks, we build a linear regression model using the historical data and use the model to predict values for a given set of input features. Let’s discuss what a linear regression model is.
What is a linear regression model?
A linear regression model mathematically represents the relationship between independent variables and a dependent variable. We can represent a linear regression model that predicts a variable Y based on an input variable X as follows:

Y = α + βX
In this equation,
- Y is the predicted output, like blood sugar level or house rent.
- X is an input feature, like the age of a person or the carpet area of a house.
- α is the intercept.
- β is the slope of the linear equation, i.e., the coefficient of X.
If we have multiple independent variables, we can estimate the relationship between the output Y and input features Xi by training a linear regression model using the following equation:

Y = α + β1X1 + β2X2 + β3X3 + … + βNXN
Here,
- Y is the predicted value for a dataset having input features X1, X2, X3, … XN.
- α is the intercept.
- β1, β2, β3,… βN are coefficients of input features X1, X2, X3, … XN.
What do α and β represent in the linear regression model?
In a linear regression model,
- The intercept α represents the portion of the output not explained by the input features included in the model. It serves as the starting point for evaluating the effects of the features on the output, and it is the model’s prediction when all the input features are zero. When the features are standardized before training (so each has a mean of 0), the intercept represents the baseline output of the model when all the input features are at their mean values.
- The coefficients βi of the input features represent the strength and direction of the relationship between each independent and dependent variable.
- The magnitude of the coefficient βi represents how much Y changes for an increase in Xi, holding all other variables constant.
- The sign of the coefficient βi represents the direction of change in Y with a change in Xi. If βi > 0, Y increases as Xi increases; if βi < 0, Y decreases as Xi increases.
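To make this concrete, here is a tiny sketch with made-up numbers (not values from any real model) showing how the intercept and coefficients combine into a prediction:

```python
import numpy as np

# Hypothetical fitted parameters
alpha = 100.0                  # intercept
beta = np.array([2.0, -0.5])   # coefficients of X1 and X2

x = np.array([20.0, 8.0])      # one observation: X1 = 20, X2 = 8

# A linear regression prediction is alpha + beta1*X1 + beta2*X2
y_hat = alpha + beta @ x
print(y_hat)  # 100 + 2*20 - 0.5*8 = 136.0
```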
By training a linear regression model to find the linear equation representing the relation between the input and output variables, we can predict the output for a given set of inputs. For this task, we use the `LinearRegression()` function defined in the sklearn module in Python.
For training the linear regression model, let’s first install the required libraries.
Installing the scikit-learn and related libraries
To implement linear regression using the sklearn module in Python, we need to install the scikit-learn library along with some helper libraries like pandas, NumPy, Matplotlib, and seaborn on our machine. You can install these libraries using pip by executing the following command:
```bash
pip install scikit-learn numpy pandas matplotlib seaborn
```
To check the version of the installed `sklearn` library, use the following command (this uses `grep`, so on Windows you can use `pip show scikit-learn` instead):

```bash
pip list | grep scikit-learn
```
Executing this command will give you an output as follows:

```
scikit-learn 1.6.1
```
You can check the installed versions of the other modules by replacing `scikit-learn` with the other library names:

```bash
pip list | grep library_name
```
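You can also check the installed versions from within Python itself; a minimal sketch:

```python
import sklearn
import numpy
import pandas
import matplotlib
import seaborn

# Each of these libraries exposes its version string as __version__
for lib in (sklearn, numpy, pandas, matplotlib, seaborn):
    print(lib.__name__, lib.__version__)
```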
After installing sklearn and the other modules necessary to implement linear regression in Python, let’s download and analyze the dataset.
Downloading and analyzing the diabetes dataset for linear regression
We will use the diabetes dataset in the `sklearn.datasets` module to implement linear regression.
Download the dataset
We will use the `load_diabetes()` function in the `sklearn.datasets` module to download the dataset. In the `load_diabetes()` function, we will set the `as_frame` parameter to `True` so that the function returns the dataset as a pandas dataframe. By default, the `scaled` parameter in the function is set to `True`, and it returns the dataset scaled using standard scaling. We will also set the `scaled` parameter to `False` so that the function returns the original dataset values.
```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True, scaled=False)
```
Get input and output features of the dataset
We can get the feature names from the dataset using the `feature_names` attribute.
```python
input_features = dataset.feature_names
print("The features in the input dataset are:", input_features)
```
Output:
```
The features in the input dataset are: ['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
```
If you want to get more details about the dataset, its features, and its description, you can use the `DESCR` attribute of the dataset as follows:
```python
print(dataset.DESCR)
```
To get the dataframe containing the input features, you can use the `data` attribute of the dataset:
```python
data_df = dataset.data
print("The input dataset is:")
print(data_df.head())
```
Output:
```
The input dataset is:
    age  sex   bmi     bp     s1     s2    s3   s4      s5    s6
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  4.2905  80.0
```
The output feature in the dataset is the `target` attribute, which contains numbers representing the disease progression one year after baseline. You can get the target column using the `target` attribute:
```python
disease_progression = dataset.target
print("The output feature values are:")
print(disease_progression.head())
```
Output:
```
The output feature values are:
0    151.0
1     75.0
2    141.0
3    206.0
4    135.0
Name: target, dtype: float64
```
Now that we have the dataset, let’s explore the relationships between the input features and the target variable.
Visualize the relationship between the input features and the output
To build a good linear regression model, the relationship between the input and the output should be approximately linear, so that the data can be fit with a linear equation.
In the following sections, we will develop a linear regression model to predict diabetes progression based on the BMI feature from the input data. To see if the target value is linearly dependent on BMI, let’s create a scatter plot.
```python
import matplotlib.pyplot as plt

plt.scatter(data_df["bmi"], disease_progression, color="midnightblue")
plt.title("BMI vs Diabetes Progression One Year After Baseline")
plt.xlabel("BMI")
plt.ylabel("Diabetes Progression")
plt.show()
```
The scatter plot for disease progression vs BMI looks as follows:
You can observe that the change in target value is somewhat proportional to the change in BMI. Thus, we can build a linear regression model that estimates the disease progression value using BMI. For this, we can represent the model using a straight line drawn through the data points.
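If you want to draw that straight line yourself, here is a minimal sketch that overlays a least-squares fit on the scatter plot using `numpy.polyfit` (used here only for the visual; the actual model is trained with sklearn in the following sections):

```python
import numpy as np
import matplotlib.pyplot as plt

# Fit a degree-1 polynomial (a straight line) to BMI vs disease progression
slope, intercept = np.polyfit(data_df["bmi"], disease_progression, deg=1)

plt.scatter(data_df["bmi"], disease_progression, color="midnightblue")
plt.plot(data_df["bmi"], intercept + slope * data_df["bmi"], color="crimson")
plt.title("BMI vs Diabetes Progression One Year After Baseline")
plt.xlabel("BMI")
plt.ylabel("Diabetes Progression")
plt.show()
```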
We can model this straight line using the `LinearRegression()` function in the `sklearn` module. To do this, let’s first preprocess the dataset.
Preprocess data for linear regression
We will build the linear regression model to predict diabetes progression using BMI values. Hence, we will use the `bmi` column of `data_df` as input and the `disease_progression` series as output. Let’s assign the input to a variable `X` and the target values to a variable `y`.
```python
X = data_df[["bmi"]]
y = disease_progression
```
After assigning the input and output features to the `X` and `y` variables, we will divide the data into training and test sets for model training and evaluation.
Split dataset into train and test sets
We will divide the input dataset into training and test sets. The training set is used to train the linear regression model, and the test set is reserved for evaluating its performance. We will use the `train_test_split()` function defined in the `sklearn.model_selection` module to split the input and output features into train and test sets. It takes the following inputs:
- The input `X` and the target `y` are the first and second inputs to the `train_test_split()` function.
- Using the `test_size` parameter, we define the portion of the data to be used as the test set. We will set `test_size` to 0.2 to use twenty percent of the input data as the test dataset.
- By default, the `train_test_split()` function generates random train and test sets in each execution. We set the `random_state` parameter to an integer value to ensure that the data splits can be replicated. For this tutorial, we will set `random_state` to `0`. After this, the function will split the dataset in the same manner every time we execute it.
After execution, the `train_test_split()` function returns training and test sets for the input and target features, as follows:
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
Scaling the dataset
Before feeding the training data into the model, we must also normalize it. For linear regression models, we scale the input data using standard scaling. It transforms the data so that each feature has a mean of zero and a standard deviation of one, as in a standard normal distribution. Scaling the dataset helps us in the following ways:
- If we have multiple input features with different scales, features with large magnitudes dominate the gradient, and the linear model converges slowly and unevenly while training. Scaling the feature values makes the model converge faster, leading to less training time.
- Without scaling, the trained model’s coefficients are influenced by the units of measurement of the input features. The model’s coefficients trained on scaled data tell us how much the output changes per standard deviation change in the input feature, which is more interpretable.
- Even when building a linear regression model with a single input feature, scaling keeps the values in a small, consistent range, which stabilizes gradient updates and speeds up model convergence.
To scale our dataset, we will use the `StandardScaler()` function defined in the `sklearn.preprocessing` module. First, we will create the scaler object as follows:
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
```
Next, we will train the scaler using the training input dataset by passing the `X_train` data to the `fit()` method.
```python
scaler = scaler.fit(X_train)
```
Here, we have trained the standard scaler using only the training dataset, not the entire dataset. If we train it on the entire dataset, including the test data, we will be using information from the test set to normalize the training set, violating the principle that the test set must be completely unseen.
After training the standard scaler, we will transform the training and test dataset:
```python
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
Scaling the target values is generally not required unless there are many extreme values. If we scale the target values during training, we must also inverse transform predictions generated by the trained model to the original scale, which creates an overhead. Hence, we won’t scale the target values.
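If you ever do need to scale the target, scikit-learn’s `TransformedTargetRegressor` removes the inverse-transform overhead by applying it automatically; a minimal sketch, assuming the `X_train`/`y_train` split from above:

```python
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# The wrapper scales y before fitting and automatically
# inverse-transforms predictions back to the original scale
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)  # already on the original target scale
```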
With the input dataset scaled, we can proceed to build a linear regression model using the sklearn module in Python.
Implement linear regression using the sklearn module in Python
To implement linear regression in Python, we use the `LinearRegression()` function defined in the `sklearn.linear_model` module. Let’s discuss the steps to build a linear regression model using the `LinearRegression()` function.
Step 1: Create an untrained model
When we execute the `LinearRegression()` function, it returns an untrained linear regression model.
```python
from sklearn.linear_model import LinearRegression

untrained_model = LinearRegression()
```
We can train this linear regression model using the `fit()` method.
Step 2: Train the linear regression model
The `fit()` method, when invoked on the untrained linear regression model, takes the scaled training dataset and target values as its first and second input arguments. After execution, it returns a trained linear regression model.
We will pass the `X_train` and `y_train` variables as the first and second input arguments to the `fit()` method, respectively.
```python
trained_model = untrained_model.fit(X_train, y_train)
```
The trained model is fitted to a linear equation of the form `Y = α + βX`. We can get the intercept α and the coefficient β from the model to recover the linear equation.
- To get the intercept α, we use the `intercept_` attribute of the trained model.
- To get the coefficients of the input features, we use the `coef_` attribute. It contains a list of the coefficients of all the input features.
```python
intercept = trained_model.intercept_
coefficient = trained_model.coef_
print("The intercept is:", intercept)
print("The coefficient is:", coefficient)
```
Output:
```
The intercept is: 151.60623229461754
The coefficient is: [47.98832087]
```
You can see that the intercept of the model is 151.606, and the coefficient is 47.988. Hence, we can represent this model as `Y = 151.606 + 47.988*bmi`, where `bmi` is the standard-scaled BMI value. We can use this model to predict the diabetes progression for BMI values in the test dataset.
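As a quick sanity check, you can confirm that `predict()` simply evaluates this equation; a minimal sketch, assuming the `intercept`, `coefficient`, and scaled `X_test` from above:

```python
# The model's prediction for the first test row...
model_prediction = trained_model.predict(X_test[:1])[0]

# ...equals the intercept plus the coefficient times the scaled BMI value
manual_prediction = intercept + coefficient[0] * X_test[0, 0]

print(model_prediction, manual_prediction)  # the two values match
```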
Step 3: Predict values using the trained linear regression model
To predict outputs for new input values using the trained model, we use the `predict()` method. The `predict()` method takes an array or dataframe of independent features as its input and returns a numpy array of predictions. For example, we can predict the disease progression for the data points in the test dataset, as shown below:
```python
y_predicted = trained_model.predict(X_test)
print("The predicted values are:")
print(y_predicted)
```
Output:
```
The predicted values are:
[255.17426905 211.79462571 161.0087018  129.26749936 196.98206457
 .....]
```
Instead of the test dataset, we can also pass a list of new BMI values to predict disease progression. For this, we will use the following steps:

- First, we will create a 2D list containing the new BMI values for which we need the predicted output.
- Next, we will scale the new BMI values with the scaler we trained on the training dataset.
- Finally, we will pass the scaled values to the `predict()` method to produce the output.

After execution, the `predict()` method returns a list with the predicted output for each input value.
```python
new_bmi_values = [[20], [22], [25], [27], [30]]
scaled_values = scaler.transform(new_bmi_values)
print("New BMI values are:")
print(new_bmi_values)
print("The scaled values are:")
print(scaled_values)
predictions = trained_model.predict(scaled_values)
print("The predicted values are:")
print(predictions)
```
Output:
```
New BMI values are:
[[20], [22], [25], [27], [30]]
The scaled values are:
[[-1.39151392]
 [-0.95055659]
 [-0.2891206 ]
 [ 0.15183672]
 [ 0.81327271]]
The predicted values are:
[ 84.82981594 105.99061756 137.73182001 158.89262164 190.63382408]
```
In this output, you can see that the model predicts a disease progression of 84.83 for BMI 20, 105.99 for BMI 22, and 190.63 for BMI 30. These values may or may not be accurate. Hence, we need to evaluate the model’s performance.
Step 4: Evaluate the model performance
We use the test dataset created using the `train_test_split()` function to evaluate the model’s performance. To do this, we will first predict the output of the model for the data in the test set.
Then, we will compare the predicted values with the actual values from the target test set using metrics like mean squared error (MSE) and the coefficient of determination (R² score).

- MSE calculates the average squared difference between the predicted and actual values. For a well-performing model, the MSE should be low.
- The R² score evaluates how well a regression model explains the variance in the target. An R² score of 1 represents perfect prediction, and a score of 0 means the model does no better than always predicting the mean; it can even be negative for a model that performs worse than that.
You can compute these metrics using the `mean_squared_error` and `r2_score` functions defined in the `sklearn.metrics` module.
```python
from sklearn.metrics import mean_squared_error, r2_score

y_predicted = trained_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print("Mean squared error (MSE): %.2f" % mse)
print("R² score: %.2f" % r2)
```
Output:
```
Mean squared error (MSE): 4150.68
R² score: 0.19
```
You can see that the model has a very high MSE on the test set and a low R² score of 0.19. This suggests that the trained model isn’t good at predicting diabetes disease progression using BMI alone.
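If you want to see where these numbers come from, both metrics can be reproduced directly from their definitions; a minimal numpy sketch, assuming `y_test` and `y_predicted` from above:

```python
import numpy as np

residuals = np.asarray(y_test) - y_predicted

# MSE: the mean of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R²: one minus the ratio of residual variance to total variance
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((np.asarray(y_test) - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, r2_manual)  # matches mean_squared_error and r2_score
```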
In the previous subsections, we discussed how to build and evaluate a linear regression model using a single independent variable. Now, let’s discuss how to build linear regression models with multiple independent features.
Implement multiple linear regression using the sklearn module in Python
We will use the `age`, `bmi`, and `bp` features in the diabetes dataset to build a linear regression model with multiple input variables. The target values will be the disease progression values.
```python
from sklearn.datasets import load_diabetes

dataset = load_diabetes(as_frame=True, scaled=False)
data_df = dataset.data
disease_progression = dataset.target
X = data_df[["age", "bmi", "bp"]]
y = disease_progression
```
Now, let’s preprocess this data.
Preprocess data for multiple linear regression
To implement multiple linear regression using the sklearn module, the model training, prediction, and evaluation process remains the same as for linear regression with a single input feature. However, we need to preprocess the data in a slightly different manner. Let’s discuss how to do this.
Correlation analysis
To build linear regression models with multiple features, we must ensure that two or more independent features aren’t highly correlated. When the input dataset contains highly correlated features, the model struggles to assign clear weights to correlated features while training, leading to large variances in coefficient estimates.
We can use the `corr()` method on the input dataframe `X` to check the correlation between the input features.
```python
correlation_df = X.corr()
print("Correlation between features:")
print(correlation_df)
```
Output:
```
Correlation between features:
          age       bmi        bp
age  1.000000  0.185085  0.335428
bmi  0.185085  1.000000  0.395411
bp   0.335428  0.395411  1.000000
```
As you can see, we don’t have highly correlated features in the dataset. Hence, we can train the linear regression model with the dataset. If your dataset has highly correlated features, you can choose one feature from each correlated group and drop the rest. You can also use a technique like principal component analysis (PCA) to derive uncorrelated components from the correlated features, as sketched below.
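Here is a minimal sketch of that PCA approach, assuming a dataframe `X` with correlated features; the choice of `n_components=2` is illustrative, not a fixed rule:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA works on centered data, so standardize the features first
X_scaled = StandardScaler().fit_transform(X)

# Project the correlated features onto two uncorrelated components
pca = PCA(n_components=2)
X_uncorrelated = pca.fit_transform(X_scaled)

# Fraction of the original variance each component retains
print(pca.explained_variance_ratio_)
```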
After correlation analysis, we can split the data into train and test sets, and scale the features as follows:
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Normalize data using standard scaling
scaler = StandardScaler()
scaler = scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```
Train the regression model using multiple features
After getting the scaled dataset, the model training process for multiple regression is the same as that of a regression with a single variable. You can train the model as follows:
```python
from sklearn.linear_model import LinearRegression

# Create an untrained linear regression model
untrained_model = LinearRegression()

# Train the model with multiple independent features
trained_model = untrained_model.fit(X_train, y_train)

# Get the intercept and coefficients from the model
intercept = trained_model.intercept_
coefficient = trained_model.coef_
print("The intercept is:", intercept)
print("The coefficients are:", coefficient)
```
Output:
```
The intercept is: 151.60623229461757
The coefficients are: [-0.51656823 40.63115896 18.26516127]
```
In this output, the intercept of the model is 151.6. The coefficient list has three values, i.e., -0.51, 40.63, and 18.26, which are the coefficients of the features `age`, `bmi`, and `bp`, respectively. Hence, we can represent this model as `Y = 151.606 - 0.516*age + 40.63*bmi + 18.26*bp`, where the features are the standard-scaled values.
Evaluate the model performance
After training the model, we can predict values and evaluate the model as follows:
```python
from sklearn.metrics import mean_squared_error, r2_score

y_predicted = trained_model.predict(X_test)
mse = mean_squared_error(y_test, y_predicted)
r2 = r2_score(y_test, y_predicted)
print("Mean squared error (MSE): %.2f" % mse)
print("R² score: %.2f" % r2)
```
Output:
```
Mean squared error (MSE): 3709.35
R² score: 0.28
```
You can see that the MSE for this linear regression model is 3709.35, and the R² score is 0.28. It is still not a good model. However, its MSE is lower than that of the single-variable model, and its R² score is higher. Hence, this model is slightly better.
Assumptions for building linear regression models
Linear regression models are among the simplest models to implement. However, we cannot use them for all types of datasets, because linear regression depends on a set of assumptions:
- Linearity: Linear regression assumes that the output variable is linearly dependent on the input features. The model performs poorly if the dependent and independent features aren’t linearly related.
- Continuous target values: The dependent variable should be continuous as the linear regression model estimates a linear equation.
- Independence: The data points in the input dataset should be independent of each other.
- No multi-collinearity: Linear regression assumes that the input features aren’t highly correlated. If highly correlated features are present in the training dataset, the model struggles to assign clear weights to correlated features, leading to unstable coefficients while training.
- Normality of Errors: Residuals are the differences between the actual values and the values predicted by the model. For a linear regression model, the residuals should have a normal distribution. If the residuals aren’t normally distributed, we cannot conduct hypothesis tests on coefficients or use confidence intervals for predictions.
- Homoscedasticity: For a linear regression model, the variance of residuals is assumed to be constant across all the levels of independent variables. Thus, the spread of the residuals should remain roughly the same for a good model, no matter what the predicted value or the feature value is.
Verifying the assumptions of a linear regression model is important before applying its results to real-world scenarios. We should use linear regression models only if the datasets or the outputs satisfy the above assumptions.
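A common way to eyeball the normality and homoscedasticity assumptions is to plot the residuals; here is a minimal sketch, assuming the trained model and test split from the sections above:

```python
import matplotlib.pyplot as plt

# Residuals: actual minus predicted values on the test set
y_predicted = trained_model.predict(X_test)
residuals = y_test - y_predicted

# Homoscedasticity check: points should scatter evenly around zero,
# with no funnel shape as the predicted value grows
plt.scatter(y_predicted, residuals, color="midnightblue")
plt.axhline(0, color="crimson")
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residuals vs Predicted Values")
plt.show()

# Normality check: the histogram of residuals should look roughly bell-shaped
plt.hist(residuals, bins=20, color="midnightblue")
plt.xlabel("Residual")
plt.ylabel("Frequency")
plt.show()
```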
Use cases for linear regression models
We use linear regression to model the relationship between one or more independent features and a continuous dependent variable. Linear regression has applications in multiple domains.
- In business and marketing, linear regression can be used for sales forecasting, budget optimization, or predicting customer lifetime value.
- In healthcare, we can use linear regression to predict blood pressure or blood sugar based on age, BMI, weight, etc.
- In real estate, we can build linear regression models to predict property prices and rent.
- In finance and economics, we can use linear regression to estimate stock returns or credit scores or to model GDP based on economic factors.
Conclusion
Linear regression models are very helpful in predictive modeling due to their simplicity and interpretability. Whether you want to predict housing prices, analyze health trends, or build credit risk models, linear regression offers a reliable starting point for the analytical process. In this article, we discussed the basics of linear regression, along with its assumptions and use cases. We also discussed how to implement linear regression with single and multiple variables using the sklearn module in Python.
To learn more about linear regression, you can take this course on Introduction to regression in machine learning, which discusses linear regression and multiple linear regression in detail. You might also like this machine learning engineer career path that teaches building end-to-end ML pipelines in Python.
Frequently asked questions
1. When to use linear regression?
We use linear regression to model the relationship between a continuous dependent variable and independent variables, assuming that the output is linearly dependent on the independent variables.
2. What are the different types of regression?
The different types of regression algorithms include linear regression, multiple linear regression, polynomial regression, logistic regression, and regularization techniques like Lasso regression and Ridge regression.
3. Is regression supervised or unsupervised?
Regression is a supervised machine learning algorithm.
4. What is the difference between correlation and regression?
Correlation denotes the relationship between two variables. Regression is used to model this relationship and predict the value of one variable based on the other.
5. Is ANOVA a regression analysis?
Analysis of Variance (ANOVA) is not itself a regression technique, but we can consider it a special case of linear regression in which the independent variables are categorical, as the sketch below shows.
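To make the connection concrete, here is a small sketch with made-up data showing that a linear regression on one-hot encoded group labels reproduces the per-group means that ANOVA compares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one measurement per sample, in three groups
groups = np.array([["A"], ["A"], ["B"], ["B"], ["C"], ["C"]])
values = np.array([10.0, 12.0, 20.0, 22.0, 30.0, 34.0])

# One-hot encode the categorical predictor and fit a linear regression
X_cat = OneHotEncoder(sparse_output=False).fit_transform(groups)
model = LinearRegression().fit(X_cat, values)

# The fitted values are exactly the per-group means: 11, 21, 32
print(model.predict(X_cat))
```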