Another way to explore the relationship between a quantitative and a categorical variable in more detail is by inspecting overlapping histograms. In the code below, setting alpha = .5 ensures that the histograms are see-through enough that we can see both of them at once. We have also set density=True (normed=True in older versions of matplotlib, which have since renamed the parameter) to make sure that the y-axis is a density rather than a frequency:
import matplotlib.pyplot as plt
plt.hist(scores_GP, color="blue", label="GP", density=True, alpha=0.5)  # density=True replaces the older normed=True
plt.hist(scores_MS, color="red", label="MS", density=True, alpha=0.5)
plt.legend()
plt.show()
By inspecting this histogram, we can clearly see that the entire distribution of scores at GP (not just the mean or median) appears slightly shifted to the right (higher) compared to the scores at MS. However, there is also still a lot of overlap between the scores, suggesting that the association is relatively weak.
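To make the density setting more concrete, here is a minimal sketch of the rescaling a density histogram applies: each bar height is the bin count divided by the total count times the bin width, so the bars of any one group cover a total area of 1. The helper density_heights, the scores, and the bin edges are all hypothetical and exist only for illustration; this is not matplotlib's own code.

```python
def density_heights(values, bin_edges):
    """Return bar heights scaled so that the total bar area equals 1."""
    counts = [0] * (len(bin_edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # Bins are half-open [lo, hi), except the last one, which includes hi.
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    n = sum(counts)
    return [c / (n * (bin_edges[i + 1] - bin_edges[i])) for i, c in enumerate(counts)]


# Hypothetical scores and bin edges, chosen only for illustration.
scores = [10, 11, 12, 12, 13, 14, 15, 15, 16, 18]
edges = [10, 12, 14, 16, 18]

heights = density_heights(scores, edges)
area = sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))
print(heights)  # bar heights on the density scale
print(area)     # total area under the bars
```

Because every group is rescaled to the same total area, a small group and a large group end up on a comparable vertical scale.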
Note that there are only 46 students at MS, but there are 349 students at GP. If we hadn’t used density=True (normed=True in older versions of matplotlib), the y-axis would show raw counts and our histogram would have looked like this, making it impossible to compare the distributions fairly:
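A rough way to see this effect for yourself is sketched below. It draws the same two groups twice, once as frequencies and once as densities. The scores here are synthetic: only the group sizes (349 and 46) come from the text, while the means and spreads passed to random.gauss are assumptions made up for illustration, so the shapes will not match the real data.

```python
import random

import matplotlib.pyplot as plt

random.seed(0)  # make the synthetic data reproducible

# Synthetic stand-ins for the real lists; the means and spreads are assumptions.
scores_GP = [random.gauss(11, 3) for _ in range(349)]
scores_MS = [random.gauss(10, 3) for _ in range(46)]

fig, (ax_freq, ax_dens) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: raw counts. Right panel: densities.
for ax, use_density, title in [(ax_freq, False, "Frequency"), (ax_dens, True, "Density")]:
    ax.hist(scores_GP, color="blue", label="GP", density=use_density, alpha=0.5)
    ax.hist(scores_MS, color="red", label="MS", density=use_density, alpha=0.5)
    ax.set_title(title)
    ax.legend()

plt.show()
```

In the frequency panel the smaller group is dwarfed by the larger one simply because it has fewer students; in the density panel both groups share a comparable scale.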
While overlapping histograms and side-by-side box plots can convey similar information, histograms give us more detail and can be useful in spotting patterns that are not visible in a box plot (e.g., a bimodal distribution). For example, the following set of box plots and overlapping histograms illustrate the same hypothetical data:
While the box plots and means/medians appear similar, the overlapping histograms illuminate the differences between these two distributions of scores.
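A comparable contrast can be sketched with made-up data. In the example below, group_a is drawn from a single wide bell curve and group_b from two narrower clusters. Both groups, their sizes, and all parameters are hypothetical, chosen so that the medians and quartiles should come out roughly alike while the shapes differ.

```python
import random

import matplotlib.pyplot as plt

random.seed(1)  # make the synthetic data reproducible

# Hypothetical data: group_a is unimodal, group_b is bimodal around the same center.
group_a = [random.gauss(10, 4.5) for _ in range(300)]
group_b = ([random.gauss(7, 1) for _ in range(150)]
           + [random.gauss(13, 1) for _ in range(150)])

fig, (ax_box, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Box plots summarize each group with its median and quartiles.
ax_box.boxplot([group_a, group_b])
ax_box.set_xticklabels(["A", "B"])
ax_box.set_title("Box plots")

# Overlapping density histograms show the full shape of each group.
ax_hist.hist(group_a, color="blue", label="A", density=True, alpha=0.5)
ax_hist.hist(group_b, color="red", label="B", density=True, alpha=0.5)
ax_hist.set_title("Overlapping histograms")
ax_hist.legend()

plt.show()
```

The two peaks of group_b should stand out in the histogram panel, while the box plot reduces them to a single box.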
Your lists from the previous exercise (including scores_rural) have been created for you in script.py. Use them to create an overlaid histogram of scores for students who live in urban and rural locations.
Remember to use different colors for each histogram, set density = True (normed = True in older versions of matplotlib) and alpha = 0.5, and use the labels
Based on the overlaid histogram, do you think there is an association between these two variables?