Log Transformation in Linear Regression: When and How to Use It
What is log transformation in regression?
When analyzing data using linear regression, we often assume a linear relationship between the independent and dependent variables. However, real-world data doesn’t always behave that way. In many cases, the dependent variable may have a skewed distribution or show non-linear patterns, which can violate regression assumptions like normality and homoscedasticity.
In linear regression, a log transformation (or logarithmic transformation) refers to applying the natural logarithm (log base e) to the dependent variable to address issues like non-linearity, skewed distributions, or unequal variance in residuals. Log transformation compresses large values and spreads smaller ones, making the data more suitable for modeling with a linear approach.
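For intuition, here is a quick NumPy illustration of that compression (a toy example, not from the dataset used below):

```python
import numpy as np

values = np.array([1, 10, 100, 1000, 10000])
print(np.log(values))
# Output (rounded): [0.  2.303  4.605  6.908  9.210]
# Equal ratios in the raw data become equal spacing on the log scale
```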
Let’s look at when using a log transformation is the right choice.
When to use log transform
Using a logarithm to transform the response variable may make sense if we notice either or both of the following when checking the assumptions for linear regression:
- The residuals appear skewed, violating the normality assumption. This can happen if the relationship we are trying to model is nonlinear.
- There appears to be a pattern or asymmetry in the plot of residuals vs. fitted values, violating the homoscedasticity assumption. This can (also) happen due to a non-linear relationship or if there is more variation in the outcome variable for particular values of a predictor.
Sometimes violated regression assumptions indicate that we should not use a linear regression model. However, if transforming the dependent variable appears to correct these violations, we may be justified in proceeding carefully.
Let’s walk through a real example using a dataset to see this in action.
Linear regression example
As an example, we'll use a dataset called `countries`, which is a cleaned subset of a larger dataset from Kaggle. This dataset contains variables for 221 countries for the years 1970-2017, including the following:

- `Birthrate`: A country's birth rate as births per 1000 people
- `Phones (per 1000)`: A country's number of phones per 1000 people
Though the concepts in this article certainly apply to multiple linear regression, we'll use a simple linear regression as an example. Let's say we are interested in predicting `phones` from `Birthrate` using a linear model. First, let's read in the CSV dataset, examine the first few observations, and look at a scatter plot of the number of phones versus birth rate.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load and clean data
countries = pd.read_csv('countries.csv')
print(countries.head())
countries['Birthrate'] = countries['Birthrate'].str.replace(',', '.').astype(float)
countries['Phones (per 1000)'] = countries['Phones (per 1000)'].str.replace(',', '.').astype(float)

# Plot
sns.lmplot(x='Birthrate', y='Phones (per 1000)', ci=None, data=countries)
plt.title('Number of Phones vs Birth Rate', fontsize=16, weight='bold')
plt.show()
```
The output of `countries.head()` shows the first five rows of the dataset. The resulting plot is a scatter plot of `Phones (per 1000)` against `Birthrate` with a fitted regression line.
The scatter plot shows a negative correlation between `phones` and `Birthrate`. However, there are some indications that a simple linear regression may not be appropriate for this data:
- The relationship between `phones` and `Birthrate` is more curved than linear
- There is more variation in `phones` for small values of `Birthrate` than for large values
The plot highlights several countries to illustrate residuals - the vertical distances from data points to the regression line. Countries like Bermuda and Australia show large residuals at low birth rates, while countries like Mayotte and Angola show smaller residuals at high birth rates. We can see a lot of variability in the size of residuals for low birth rates, with very minimal variability for higher birth rates.
To better check our regression assumptions, we can fit the regression in Python using the following code and save both the residuals and predicted response values as the objects `residuals1` and `fitted_values1`, respectively.
```python
import pandas as pd
import statsmodels.api as sm

# Load data
countries = pd.read_csv('countries.csv')

# Clean numeric values
countries['Birthrate'] = countries['Birthrate'].str.replace(',', '.').astype(float)
countries['Phones (per 1000)'] = countries['Phones (per 1000)'].str.replace(',', '.').astype(float)

# Rename column to avoid spaces and parentheses in the formula
countries = countries.rename(columns={'Phones (per 1000)': 'phones_per_1000'})

# Fit regression model
model1 = sm.OLS.from_formula('phones_per_1000 ~ Birthrate', data=countries).fit()

# Save fitted values and residuals
fitted_values1 = model1.predict(countries)
residuals1 = countries['phones_per_1000'] - fitted_values1
```
Now we’ll produce some plots to check the modeling assumptions of normality and homoscedasticity of the residuals:
```python
import matplotlib.pyplot as plt

# Check normality of residuals
plt.hist(residuals1)
plt.title('Model 1: Histogram of Residuals', fontsize=16, weight='bold')
plt.show()

# Check variance of residuals
plt.scatter(fitted_values1, residuals1)
plt.axhline(y=0, color='black', linestyle='-', linewidth=3)
plt.title('Model 1: Residuals vs Fitted Values', fontsize=16, weight='bold')
plt.show()
```
In the histogram, we see some right skew caused by the few very high residuals for countries like Bermuda, indicating we may not be meeting the normality assumption. Perhaps more concerning, the scatter plot of residuals against fitted values widens from a narrow to a wide spread, rather than showing the constant spread we look for to indicate that homoscedasticity has been met. The same countries are highlighted again in this scatter plot, so we can see where their residuals fall here compared to where we saw them in the original plot.
How to apply log transformation in Python
Since we see two potential assumption violations, we are going to try a log transformation of the `phones` variable and check whether it addresses our concerns. In Python, we can easily take the log of `phones` using the NumPy function `np.log()`. Let's add this new variable to our dataset and see how it looks compared to `phones`. Note that, generally, when we see log with no specified base in a statistics equation, we can assume the base is e (the mathematical constant 2.718…). In other words, log with no base means we are taking the natural log, or ln. Also, note that we can only take the log of a variable with values greater than zero; the log of a value less than or equal to zero is undefined.
```python
import numpy as np

# The log of zero or a negative value is undefined, so filter out non-positive values
countries = countries[countries['phones_per_1000'] > 0]

# Apply log transformation
countries['log_phones'] = np.log(countries['phones_per_1000'])

# Print the result
print(countries.head())
```
The output shows the first few rows of the dataset with the new `log_phones` column added.
We can see that this transformation has drastically reduced the range of values for our dependent variable. Let's run a second model predicting `log_phones` from `Birthrate` and see what else has changed.
```python
# Fit regression model
model2 = sm.OLS.from_formula('log_phones ~ Birthrate', data=countries).fit()

# Save fitted values and residuals
fitted_values2 = model2.predict(countries)
residuals2 = countries['log_phones'] - fitted_values2
```
If we examine the scatter plot of `log_phones` against `Birthrate`, we can see a big change in the appearance of our data.
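A minimal sketch for producing this plot, reusing the seaborn and matplotlib calls from earlier:

```python
# Scatter plot of the log-transformed outcome against birth rate, with a fitted line
sns.lmplot(x='Birthrate', y='log_phones', ci=None, data=countries)
plt.title('Log of Phones vs Birth Rate', fontsize=16, weight='bold')
plt.show()
```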
While there’s some crowding in the upper left-hand corner, the pattern now appears much more linear and more evenly spaced about the regression line. Specifically, countries that had larger residuals earlier (like Bermuda and Australia) are now much closer to the line and each other vertically. Likewise, countries that had small residuals earlier (like Mayotte and Angola) are now further from the line and each other vertically. This change is reflected in both the histogram of the residuals (now much less skewed) and the scatter plot of the residuals versus the fitted values (now much more evenly spaced across the line y = 0).
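To verify these improvements, we can repeat the residual checks for the second model, mirroring the Model 1 code:

```python
# Check normality of residuals for the log model
plt.hist(residuals2)
plt.title('Model 2: Histogram of Residuals', fontsize=16, weight='bold')
plt.show()

# Check variance of residuals for the log model
plt.scatter(fitted_values2, residuals2)
plt.axhline(y=0, color='black', linestyle='-', linewidth=3)
plt.title('Model 2: Residuals vs Fitted Values', fontsize=16, weight='bold')
plt.show()
```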
How to interpret a regression model
While it's great that our new variable seems to better meet our model assumptions, how do we interpret the coefficients in our model now that logs are involved? First, let's look at the output of the model predicting `log_phones` from `Birthrate` and write out the regression equation:
```python
print(model2.params)

# Output:
# Intercept    7.511024
# Birthrate   -0.130456
# dtype: float64
```
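Written out with rounded coefficients, the fitted regression equation is:

log(phones per 1000) = 7.51 - 0.13 * birth rate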
We can always interpret the coefficient on `Birthrate` in the traditional way: for every increase of one birth per 1000 people, the natural log of phones per 1000 people decreases by 0.13. While this is accurate, it's not very informative about the relationship between `phones` and `Birthrate`. To examine this relationship, we need to do a little math with logs and exponentiation.
To get a more direct relationship between `phones` and `Birthrate`, we first have to exponentiate the coefficient on `Birthrate`. This means we raise e to the power of the coefficient, which we may write as e^(-0.13), or more simply as exp(-0.13), and we can use NumPy to compute this in Python. In short, we're doing this because exponentiating both sides of the regression equation cancels out the log on `phones`.
```python
import numpy as np

np.exp(-0.13)
# Output: 0.8780954309205613
```
Then we also subtract 1 to change our coefficient into an easily readable percentage change:
```python
np.exp(-0.13) - 1
# Output: -0.1219045690794387
```
We are now ready to interpret this coefficient: for every additional birth per 1000 people, the number of phones per 1000 people decreases by about 12.2 PERCENT. Our interpretation changes from the traditional additive relationship, where increases in the predictor are associated with differences in UNITS of the outcome, to a multiplicative relationship, where increases in the predictor are associated with differences in the PERCENTAGE of the outcome.
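We can sanity-check this multiplicative interpretation by comparing the model's predictions at two birth rates one unit apart (a quick sketch using the fitted coefficients above):

```python
import numpy as np

# Predicted log(phones) at birth rates of 10 and 11, from the fitted equation
pred_at_10 = 7.511024 - 0.130456 * 10
pred_at_11 = 7.511024 - 0.130456 * 11

# Back-transform and compare on the original scale
print(np.exp(pred_at_11) / np.exp(pred_at_10))
# Output: ~0.8777, i.e. roughly a 12.2% decrease per additional birth
```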
We also see this change in the interpretation of the intercept: rather than the arithmetic mean, the exponentiated intercept exp(7.51) is the geometric mean number of phones for countries with a birth rate of 0. The arithmetic mean is computed by SUMMING values, while the geometric mean is computed by MULTIPLYING values.
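A small illustration of the difference, using made-up numbers:

```python
import numpy as np

values = np.array([10.0, 100.0, 1000.0])

print(np.mean(values))                  # arithmetic mean: 370.0
print(np.exp(np.mean(np.log(values))))  # geometric mean: 100.0
```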
Conclusion
Log transformations help improve linear regression models by addressing skewed data and non-constant variance. In our example, applying a log transformation to the dependent variable led to a model that better met key regression assumptions and offered a clearer interpretation.
To practice techniques like this with hands-on code, explore Codecademy’s Linear Regression in Python course, a great next step for building reliable and interpretable models.
Frequently asked questions
1. When should I use a log transformation?
Use a log transformation when your dependent variable is highly skewed, shows exponential growth, or violates linear regression assumptions like homoscedasticity and normality. It helps stabilize variance and linearize relationships.
2. What is the use of log transformation in image processing?
In image processing, log transformations are used to enhance details in darker regions of an image while compressing the range of brighter intensities. This is particularly helpful for improving contrast in low-light areas.
3. How does log transformation reduce skewness?
Log transformation compresses larger values and expands smaller ones, bringing extreme values closer to the center. This reduces positive skew and helps achieve a more symmetrical (often bell-shaped) distribution.
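A quick demonstration with synthetic right-skewed data (assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0, sigma=1, size=10000)  # right-skewed sample

print(skew(data))          # strongly positive skew
print(skew(np.log(data)))  # approximately zero after the log transform
```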
4. What is the purpose of log transformation in time series analysis?
In time series analysis, log transformation is often used to stabilize variance over time, especially in series with exponential trends. It helps improve model fit and interpretability by converting multiplicative relationships into additive ones.
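The key identity is that logs turn products into sums: if a series follows y = trend × seasonality × noise, then log(y) = log(trend) + log(seasonality) + log(noise), which standard additive models can handle.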
5. What is the difference between log transformation and logarithmic transformation?
Log transformation and logarithmic transformation refer to the same mathematical process - applying a logarithmic function to transform data. The terms are used interchangeably in statistics and data science.