Another way to explore the relationship between a quantitative and categorical variable in more detail is by inspecting overlapping histograms. In the code below, setting alpha = .5
ensures that the histograms are see-through enough that we can see both of them at once. We have also used normed=True
make sure that the y-axis is a density rather than a frequency (note: the newest version of matplotlib renamed this parameter density
instead of normed
):
plt.hist(scores_GP , color="blue", label="GP", normed=True, alpha=0.5) plt.hist(scores_MS , color="red", label="MS", normed=True, alpha=0.5) plt.legend() plt.show()
By inspecting this histogram, we can clearly see that the entire distribution of scores at GP (not just the mean or median) appears slightly shifted to the right (higher) compared to the scores at MS. However, there is also still a lot of overlap between the scores, suggesting that the association is relatively weak.
Note that there are only 46 students at MS, but there are 349 students at GP. If we hadn’t used normed = True
, our histogram would have looked like this, making it impossible to compare the distributions fairly:
While overlapping histograms and side by side boxplots can convey similar information, histograms give us more detail and can be useful in spotting patterns that were not visible in a box plot (eg., a bimodal distribution). For example, the following set of box plots and overlapping histograms illustrate the same hypothetical data:
While the box plots and means/medians appear similar, the overlapping histograms illuminate the differences between these two distributions of scores.
Instructions
Your lists from the previous exercise (scores_urban
and scores_rural
) have been created for you in script.py. Use them to create an overlaid histogram of scores for students who live in urban and rural locations.
Remember to use different colors for each histogram, set normed = True
, alpha = 0.5
, and use the labels 'Urban'
and 'Rural'
, respectively.
Based on the overlaid histogram, do you think there is an association between these two variables?