Log Transformations (And More)
When fitting a linear regression model, we use interaction and polynomial terms to capture complex relationships and improve predictive accuracy. We create these new terms by multiplying predictors together or by raising them to higher powers, and then add these new predictors to our model. These are examples of transformations of predictor variables, but sometimes we may want to transform the response (dependent) variable instead. This article will specifically explore when it might make sense to perform a log transformation of the response variable to improve a multiple linear regression model and how to interpret the resulting regression equation.
When to use a log transform
Using a logarithm to transform the response variable may make sense if we notice either or both of the following when checking the assumptions for linear regression:
- The residuals appear skewed, violating the normality assumption. This can happen if the relationship we are trying to model is non-linear.
- There appears to be a pattern or asymmetry in the plot of residuals vs. fitted values, violating the homoscedasticity assumption. This can (also) happen due to a non-linear relationship or if there is more variation in the outcome variable for particular values of a predictor.
Sometimes, violated regression assumptions (as described above) indicate that we simply should not use a linear regression model; but if transforming the response variable appears to correct these violations, we may be justified in (carefully) proceeding!
As an example, we’ll use a dataset called countries, which is a cleaned subset of a larger dataset from Kaggle. This dataset contains variables for 221 countries for the years 1970-2017, including the following:
- birth_rate – a country’s birth rate as births per 1000 people
- phones – a country’s number of phones per 1000 people
Though the concepts in this article certainly apply to multiple linear regression, we’ll use a simple linear regression as an example. Let’s say we are interested in predicting phones from birth_rate using a linear model. First, let’s read in the CSV dataset, examine the first few observations, and look at a scatter plot of the number of phones versus the birth rate.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

countries = pd.read_csv('countries.csv')
print(countries.head())

# Scatter plot with regression line
sns.lmplot(x='birth_rate', y='phones', ci=None, data=countries)
plt.title('Number of Phones vs Birth Rate', fontsize=16, weight='bold')
plt.show()
```
The scatter plot shows a negative correlation between phones and birth_rate. However, there are some indications that a simple linear regression may not be appropriate for this data:
- The relationship between phones and birth_rate is more curved than linear
- There is more variation in phones for small values of birth_rate than for large values
To highlight this, we’ve circled some countries in the plot and have drawn arrows from the points down to the regression line – these are the residuals for these points. We can see a lot of variability in the size of residuals for low birth rates, with very minimal variability for higher birth rates.
To better check our regression assumptions, we can fit the regression in Python using the following code and save both the predicted response values and the residuals as the objects fitted_values1 and residuals1.
```python
import statsmodels.api as sm

# Fit regression model
model1 = sm.OLS.from_formula('phones ~ birth_rate', data=countries).fit()

# Save fitted values and residuals
fitted_values1 = model1.predict(countries)
residuals1 = countries.phones - fitted_values1
```
Now we’ll produce some plots to check the modeling assumptions of normality and homoscedasticity of the residuals.
```python
# Check normality of residuals
plt.hist(residuals1)
plt.title('Model 1: Histogram of Residuals', fontsize=16, weight='bold')
plt.show()

# Check variance of residuals
plt.scatter(fitted_values1, residuals1)
plt.axhline(y=0, color='black', linestyle='-', linewidth=3)
plt.title('Model 1: Residuals vs Fitted Values', fontsize=16, weight='bold')
plt.show()
```
In the histogram, we see some right skewing caused by the few very high residuals for countries like Bermuda, indicating we may not be meeting the normality assumption. Perhaps more concerning, the scatter plot of residuals against fitted values shows a wave-like pattern from narrow to wide, rather than the constant spread we look for to indicate that homoscedasticity has been met. We’ve additionally highlighted the same countries in the scatter plot again so we can see how their residuals map out in this plot compared to where we saw them in the original.
Log transformation in Python
Since we see two potential assumption violations, we are going to try a log transformation of the phones variable and check whether it addresses these concerns. In Python, we can easily take the log of phones using the NumPy function np.log(). Let’s add this new variable to our dataset and see how it looks compared to phones. Note that, generally, when we see log with no specified base in a statistics equation, we can assume the base is e (the mathematical constant 2.718…). In other words, log with no base means we are taking the natural log, or ln. Also, note that we can only take the log of a variable with values greater than zero; the log of a value less than or equal to zero is undefined.
```python
import numpy as np

# Save log_phones to dataset
countries['log_phones'] = np.log(countries.phones)
print(countries.head())
```
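To see why values must be positive, here is a quick standalone sketch of how np.log behaves at the boundaries (the inputs are illustrative values, not from the countries dataset):

```python
import numpy as np

# np.log computes the natural log (base e)
log_of_e = np.log(np.e)            # ~1.0

# Logs of non-positive values are undefined; NumPy returns -inf or nan
# (normally with a RuntimeWarning, suppressed here)
with np.errstate(divide='ignore', invalid='ignore'):
    log_of_zero = np.log(0.0)      # -inf
    log_of_negative = np.log(-5.0) # nan

print(log_of_e, log_of_zero, log_of_negative)
```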
We can see that this transformation has drastically reduced the range of values for our dependent variable. Let’s run a second model predicting log_phones from birth_rate and see what else has changed.
```python
# Fit regression model
model2 = sm.OLS.from_formula('log_phones ~ birth_rate', data=countries).fit()

# Save fitted values and residuals
fitted_values2 = model2.predict(countries)
residuals2 = countries.log_phones - fitted_values2
```
If we examine the scatter plot of log_phones versus birth_rate, we can see a big change in the appearance of our data:
While there’s some crowding in the upper left-hand corner, the pattern now appears much more linear and more evenly spaced about the regression line. Specifically, countries that had larger residuals earlier (like Bermuda and Australia) are now much closer to the line and each other vertically. Likewise, countries that had small residuals earlier (like Mayotte and Angola) are now further from the line and each other vertically. This change is reflected in both the histogram of the residuals (now much less skewed) and the scatter plot of the residuals versus the fitted values (now much more evenly spaced across the line y = 0).
While it’s great that our new variable seems to be better meeting our model assumptions, how do we interpret the coefficients in our model now that logs are involved? First, let’s look at the output of the model predicting log_phones from birth_rate and write out the regression equation:
```python
print(model2.params)
# Output:
# Intercept     7.511024
# birth_rate   -0.130456
```

This gives us the regression equation:

log(phones) = 7.51 - 0.13 * birth_rate
We can always interpret the coefficient on birth_rate in the traditional way: for every increase of one birth per 1000 people, the natural log of phones decreases by 0.13. While this is accurate, it’s not very informative about the relationship between phones and birth_rate. To examine this relationship, we need to do a little math with logs and exponentiation.
To get a more direct relationship between phones and birth_rate, we first have to exponentiate the coefficient on birth_rate. This means we raise e to the power of the coefficient on birth_rate. We may write this as e^-0.13, or more simply as exp(-0.13), and we can use NumPy to compute this in Python. In short, we’re doing this because exponentiating both sides of the regression equation cancels out the log on phones, but we’ll save the more thorough explanation for the bonus section at the end of this article.
```python
import numpy as np

np.exp(-0.13)
# Output:
# 0.8780954309205613
```
Then we also subtract 1 to change our coefficient into an easily readable percentage change:
```python
np.exp(-0.13) - 1
# Output:
# -0.1219045690794387
```
We are now ready to interpret this coefficient: for every additional birth per 1000 people, the number of phones per 1000 people decreases by about 12.2 PERCENT. Our interpretation changes from the traditional additive relationship, where increases in the predictor are associated with differences in UNITS of the outcome, to a multiplicative relationship, where increases in the predictor are associated with differences in the PERCENTAGE of the outcome.
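To see this multiplicative interpretation in action, here is a minimal sketch using the rounded coefficients from the model output above (the two birth rates are arbitrary illustrative values):

```python
import numpy as np

# Fitted equation from model 2 (coefficients rounded from the output above):
# log(phones) = 7.51 - 0.13 * birth_rate
intercept, slope = 7.51, -0.13

# Predicted number of phones at two birth rates one unit apart
# (20 and 21 are arbitrary illustrative values)
phones_at_20 = np.exp(intercept + slope * 20)
phones_at_21 = np.exp(intercept + slope * 21)

# The ratio of the predictions equals exp(slope), regardless of which
# two adjacent birth rates we pick
ratio = phones_at_21 / phones_at_20
print(ratio)        # ~0.878
print(ratio - 1)    # ~-0.122, i.e. about a 12.2% decrease
```

No matter where on the birth_rate scale we start, one additional birth multiplies the predicted number of phones by the same factor.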
We also see this change in the interpretation of the intercept: rather than the arithmetic mean, the exponentiated intercept exp(7.51) is the geometric mean number of phones for countries with a birth rate of 0. The arithmetic mean is computed by SUMMING values, while the geometric mean is computed by MULTIPLYING values.
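The difference between the two means can be verified numerically; the phone counts below are hypothetical, not from the countries dataset:

```python
import numpy as np

# Hypothetical phone counts (not from the countries dataset)
phones = np.array([10.0, 100.0, 1000.0])

# Arithmetic mean: SUM the values, divide by n
arith_mean = phones.mean()                 # 370.0

# Geometric mean: exp of the mean of the logs
# (equivalent to MULTIPLYING the values and taking the nth root)
geo_mean = np.exp(np.log(phones).mean())   # 100.0

print(arith_mean, geo_mean)
```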
Log transformations of the dependent variable are a way to overcome issues with meeting the requirements of normality and homoscedasticity of the residuals for multiple linear regression. Unfortunately, a log transformation won’t fix these issues in every case (it may even make things worse!), so it’s important to reassess normality and homoscedasticity after making the transformation and running the new model. Log transformations can also be performed on predictors, and there are other dependent variable transformations available as well (e.g., square-rooting). To learn more about some of these transformations, check out the Penn State Statistics Department’s website.
Bonus: Logs in more detail
Why did taking the log of the dependent variable help?
As we recall from the scatter plot of phones vs. birth_rate, there were a lot of large positive residuals for lower birth rates and a lot of smaller residuals for higher birth rates. Taking the log of phones brought the large residuals lower and the small residuals higher, which gave us a more even spread with fewer extremes. But why did this happen? Let’s take a quick look at what happens as e is raised to higher exponents. Note that we use 2.718 as an approximation of e here.

| Power | Output (2.718^power) |
| --- | --- |
| 1 | 2.718 |
| 2 | 7.388 |
| 3 | 20.079 |
| 4 | 54.576 |

As we can see from the table, every time the power that e is raised to increases by 1, the output nearly triples. This means the difference in the outputs between low powers is smaller than the difference in outputs between larger powers. Taking the log of the output column “undoes” this process, returning the corresponding value in the power column (e.g., log(2.718) = 1, log(7.388) = 2, etc.).
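We can reproduce this pattern with NumPy (using the exact value of e, so the outputs differ slightly from the rounded 2.718-based values):

```python
import numpy as np

# Each increase of 1 in the power multiplies the output by e (~2.718);
# np.exp uses the exact value of e rather than the 2.718 approximation
powers = np.arange(1, 5)
outputs = np.exp(powers)
print(outputs)

# Taking the log "undoes" the exponentiation, recovering the power column
recovered = np.log(outputs)
print(recovered)    # [1. 2. 3. 4.]
```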
In terms of our dataset, the output column is like the raw phones values, and the power column is the new log_phones variable. Big differences in the upper values of phones translate to the same size jump on the log_phones scale as small differences in the lower values of phones. Thus, translated to the log scale, the large values of phones (like those of Bermuda and Australia) pull in, while the small values of phones (like those of Mayotte and Angola) spread out.
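Put differently, equal ratios on the original scale become equal differences on the log scale; a quick sketch with hypothetical phone counts:

```python
import numpy as np

# Doubling phones always adds log(2) on the log scale, whether phones
# goes from 50 to 100 or from 500 to 1000 (hypothetical values)
jump_low = np.log(100) - np.log(50)
jump_high = np.log(1000) - np.log(500)
print(jump_low, jump_high)   # both ~0.693, i.e. log(2)
```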
Why do we interpret the exponentiated coefficients on the predictors as percentage differences of the dependent variable?
Let’s say birth_rate0 is a value of birth_rate and phones0 is the value of phones at birth_rate0 such that:

log(phones0) = 7.51 - 0.13 * birth_rate0

Let’s also say phones1 is the value of phones when birth_rate is increased by 1 birth from birth_rate0. Then,

log(phones1) = 7.51 - 0.13 * (birth_rate0 + 1)
Next, we distribute the -0.13 and substitute log(phones0) for 7.51 - 0.13 * birth_rate0. Then we subtract log(phones0) from both sides to isolate the birth_rate coefficient of -0.13.

log(phones1) = 7.51 - 0.13 * birth_rate0 - 0.13

log(phones1) = log(phones0) - 0.13

log(phones1) - log(phones0) = -0.13
Finally, using the quotient rule for logs, we find that our coefficient on birth_rate is equal to a single log:

log(phones1 / phones0) = -0.13

We exponentiate both sides to find that our exponentiated coefficient on birth_rate is equal to a simple quotient:

phones1 / phones0 = exp(-0.13) ≈ 0.878

Subtracting 1 from this ratio gives the percentage change in the phones variable between phones0 and phones1.
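We can check this algebra numerically with the rounded coefficients from the model (the starting birth rate of 15 is arbitrary):

```python
import numpy as np

# Rounded coefficients from model 2; the starting birth rate is arbitrary
intercept, slope = 7.51, -0.13
birth_rate0 = 15.0

# phones0 and phones1 from the fitted equation, one birth apart
phones0 = np.exp(intercept + slope * birth_rate0)
phones1 = np.exp(intercept + slope * (birth_rate0 + 1))

# The log difference equals the coefficient...
log_diff = np.log(phones1) - np.log(phones0)
print(log_diff)              # ~-0.13

# ...so the ratio equals exp(-0.13), about a 12.2% decrease
print(phones1 / phones0)     # ~0.878
```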