Data Visualizations for Messy Data
Learn how to work around problems with visualizing messy and missing data.
Introduction
Data visualization tutorials generally use pre-processed data. But what about datasets in the wild? What do we do about missing data? Or outliers that largely skew visualizations? What do we do when there are too many observations to be interpretable in a scatterplot? This article will introduce some of the methods we can use to work around these problems.
Let’s say we are new real estate agents who want to use data to better understand the relationship between the price and the number of bedrooms in a home. We will be using a dataset called housing, from Kaggle's USA Housing Listings.
Missing data
Incomplete observations — or missing data — are generally ignored by plotting functions in commonly-used Python libraries, such as matplotlib and seaborn. Therefore, we may want to remove those rows or impute the missing values before plotting. We can check for missing data using .info():
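For reference, a minimal sketch of producing this summary, assuming the Kaggle CSV has been saved locally as housing.csv (a hypothetical filename), might look like:
import pandas as pd

# load the USA Housing Listings data from a local CSV (the filename here is an assumption)
housing = pd.read_csv('housing.csv')

# print the non-null count and dtype of every column
print(housing.info())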
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384977 entries, 0 to 384976
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   region                   384977 non-null  object
 1   price                    384977 non-null  int64
 2   type                     384977 non-null  object
 3   sqfeet                   384977 non-null  int64
 4   beds                     384977 non-null  int64
 5   baths                    384977 non-null  float64
 6   cats_allowed             384977 non-null  int64
 7   dogs_allowed             384977 non-null  int64
 8   smoking_allowed          384977 non-null  int64
 9   wheelchair_access        384977 non-null  int64
 10  electric_vehicle_charge  384977 non-null  int64
 11  comes_furnished          384977 non-null  int64
 12  laundry_options          305951 non-null  object
 13  parking_options          244290 non-null  object
 14  lat                      383059 non-null  float64
 15  long                     383059 non-null  float64
 16  state                    384977 non-null  object
dtypes: float64(3), int64(9), object(5)
memory usage: 49.9+ MB
None
Based on this output, we may be concerned about the columns laundry_options and parking_options because they have more missing values than the other columns.
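If we did want to handle these missing values before plotting, a minimal sketch of the two options mentioned earlier, dropping the incomplete rows or imputing a placeholder value, might look like:
# option 1: drop rows that are missing either value
housing_complete = housing.dropna(subset=['laundry_options', 'parking_options'])

# option 2: impute a placeholder category for the missing values instead
housing_filled = housing.fillna({'laundry_options': 'unknown', 'parking_options': 'unknown'})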
Preliminary view
Let’s take a first look at two variables and see what issues we run into. Here is a plot of price vs. area in square feet:
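A plot like the one described here can be generated with a single scatterplot call; this is a sketch assuming seaborn has been imported as sns:
import seaborn as sns

# plot raw price against square footage, with no filtering yet
sns.scatterplot(x=housing['sqfeet'], y=housing['price'])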
It doesn’t look like there are many points on this plot, even though there should be over 300,000. The 1e6 and 1e9 on the x- and y-axes, respectively, indicate that the scale and range of both features are incredibly large. For example, we have at least one housing listing that costs almost 3,000,000,000 dollars per month. Dealing with these outliers is the first thing we will have to do in order to visualize the data more effectively.
Plotting with outliers
We can whittle down each feature in the plot to cut out outliers until we have a better feel for the data. It can take some trial and error to find the right values, so let’s start by limiting price to less than $10,000,000 and sqfeet to less than 2,000,000:
## limit price to < $10,000,000 and sqfeet to < 2,000,000
housing2 = housing[(housing.price < 10000000) & (housing.price > 0)]
housing2 = housing2[(housing2.sqfeet < 2000000) & (housing2.sqfeet > 0)]
sns.scatterplot(x=housing2['sqfeet'], y=housing2['price'])
This scatterplot is a little bit better. We can see more points showing in the bottom left-hand side of the plot. Let’s get closer to that cluster of points by limiting both price and sqfeet to values less than 20,000:
## limit price and sqfeet to < 20,000
housing2 = housing[(housing.price < 20000) & (housing.price > 0)]
housing2 = housing2[(housing2.sqfeet < 20000) & (housing2.sqfeet > 0)]
sns.scatterplot(x=housing2['sqfeet'], y=housing2['price'])
Now we are starting to see all of the points! There is still a lot of white space on the right-hand side, so let’s limit our data one more time, this time limiting both price and sqfeet to values less than 3,000:
## limit price and sqfeet to < 3,000
housing2 = housing[(housing.price < 3000) & (housing.price > 0)]
housing2 = housing2[(housing2.sqfeet < 3000) & (housing2.sqfeet > 0)]
sns.scatterplot(x=housing2['sqfeet'], y=housing2['price'])
Now we can really see the bulk of the points from our dataset. However, there are still so many points that they are all plotted on top of one another. This means that we cannot visualize the density of the points, and therefore the overall relationship between price and area.
Visualizing many data points
When there are too many data points to visualize, one thing we can do is take a random subset of the data. This means fewer dots, and because it is a random subset, it should still be approximately representative of the full dataset. Let’s try using a random 5% of the data:
## randomly sample 5% of the filtered listings
perc = 0.05
housing_sub = housing2.sample(n = int(housing2.shape[0]*perc))
sns.scatterplot(x=housing_sub['sqfeet'], y=housing_sub['price'])
There’s still a lot of overlap, but we can actually see the positive linear association between area and price that was difficult to visualize originally.
We can still improve upon this. We can try making each point smaller to better see places of higher concentration of plotted points:
sns.scatterplot(x=housing_sub['sqfeet'], y=housing_sub['price'], s = 5)
This plot is better than the previous one because, at a glance, we can see the higher concentration of points in the 500 to 1500 sqfeet range and the 500 to 2000 price range. However, this still doesn’t give us a great understanding of just how many points are in this middle cluster. Rather than plotting the points smaller, we may want to make them more see-through. This way, we can interpret color intensity to understand the overlap:
sns.scatterplot(x=housing_sub['sqfeet'], y=housing_sub['price'], alpha = 0.2)
We can see that the bottom section of the plot is darker than the top section. This is due to many more points overlapping each other at the lower price levels and fewer points overall as price increases.
We also might consider plotting a LOWESS (Locally Weighted Scatterplot Smoothing) smoother over our data points. This will draw a line through the approximate average price for each value of sqfeet:
sns.lmplot(x='sqfeet', y='price', data = housing_sub, line_kws={'color': 'black'}, lowess=True)
Though the individual points are more difficult to read, the line gives us information about the relationship between these two features.
Visualizing discrete variables
Let’s say we wanted to look at the relationship between beds and baths in our dataset. We can easily plot the scatterplot:
sns.scatterplot(x='beds', y='baths', data = housing_sub)
While this plot tells us each combination of the number of beds and bathrooms in our dataset, it doesn’t tell us how many observations there are. This is because both features are discrete values, in this case meaning limited to whole numbers for beds and half numbers for baths. So every data point that represents 3 beds and 2 bathrooms is plotted at the exact same spot as the others, perfectly overlapping to look like one point.
Adding a jitter adjusts the spread of points along either (or both) axes in order to more easily see how many points there are in each group:
sns.lmplot(x='beds', y='baths', data = housing_sub, x_jitter = .15, y_jitter = .15, fit_reg = False)
We can look at this plot and learn a lot more than from the previous one. For example, we know that there are fewer points at every baths level when beds is equal to 6 compared to 5.
Log transformation
Sometimes when data spans several orders of magnitude, it can be hard to visualize the distribution of the values. Features with positive values that are highly right-skewed are prime candidates for log transformation. Let’s look at the distribution of price from our dataset:
sns.displot(housing.price)
Here we can see one tall peak on the left-hand side and a very long right tail along the x-axis. While we could try to trim down the price values like before, it might be beneficial to try plotting the distribution of log price instead:
## keep only positive prices, then take the natural log
log_price = housing.price[housing.price > 0]
log_price = np.log(log_price)
sns.displot(log_price)
plt.xlabel('log price')
This histogram provides a lot more information than the data in its original form. We can even limit the plot to the range between 5 and 10 to see the distribution more clearly:
sns.displot(log_price)
plt.xlabel('log price')
plt.xlim(5, 10)
This plot indicates that log price is unimodal and approximately normally distributed. This is helpful knowledge if we want to build a model to predict prices in the future.
Conclusion
Making interpretable data visualizations is not always as easy as just plotting all of the data. Oftentimes, visualizations require some additional steps, such as jittering, making points smaller or more transparent, or transforming the data. Following these steps will help you to make more dynamic and interpretable visualizations in the future.