If we want to compare two different distributions, we can put multiple histograms on the same plot. This could be useful, for example, in comparing the heights of a bunch of men and the heights of a bunch of women. However, it can be hard to read two histograms on top of each other. For example, in this histogram, we can’t see all of the blue plot, because it’s covered by the orange one:
We have two ways we can solve a problem like this:
Use the keyword alpha, which can be a value between 0 and 1. This sets the transparency of the histogram. A value of 0 would make the bars entirely transparent. A value of 1 would make the bars completely opaque.

    plt.hist(a, range=(55, 75), bins=20, alpha=0.5)
    plt.hist(b, range=(55, 75), bins=20, alpha=0.5)
This would make both histograms visible on the plot:
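As a self-contained sketch of the alpha approach, the snippet below overlays two synthetic samples. The means, spreads, sample sizes, and seed are illustrative assumptions, not values from this lesson, and the Agg backend line is only there so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed for repeatability

# Hypothetical height samples (illustrative parameters only)
a = rng.normal(loc=64, scale=2, size=1000)
b = rng.normal(loc=66, scale=2, size=1000)

# Same range and bins for both so the bars line up; alpha=0.5 lets
# the first histogram show through the second one.
counts_a, edges_a, bars_a = plt.hist(a, range=(55, 75), bins=20, alpha=0.5)
counts_b, edges_b, bars_b = plt.hist(b, range=(55, 75), bins=20, alpha=0.5)
```

In an interactive session, dropping the `matplotlib.use("Agg")` line and ending with `plt.show()` should display the overlaid plot.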
Use the keyword histtype with the argument 'step' to draw just the outline of a histogram:

    plt.hist(a, range=(55, 75), bins=20, histtype='step')
    plt.hist(b, range=(55, 75), bins=20, histtype='step')
which results in a chart like:
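With outlines only, it can be harder to tell which line belongs to which dataset. One possible way to handle that, sketched here with synthetic data, is to pass a label to each call and add a legend. The sample parameters, seed, and label strings are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; remove this line to use an interactive window
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # arbitrary fixed seed

# Hypothetical samples (illustrative parameters only)
a = rng.normal(loc=64, scale=2, size=1000)
b = rng.normal(loc=66, scale=2, size=1000)

# histtype='step' draws a single unfilled outline per dataset
counts_a, _, outline_a = plt.hist(a, range=(55, 75), bins=20,
                                  histtype='step', label='a')
counts_b, _, outline_b = plt.hist(b, range=(55, 75), bins=20,
                                  histtype='step', label='b')
plt.legend()  # uses the label given to each hist call
```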
Another problem we face is that our histograms might have different numbers of samples, making one much bigger than the other. We can see how this makes a qualitative comparison difficult by adding a dataset b with a much bigger size value:

    # normal is assumed here to be numpy.random.normal
    from numpy.random import normal
    import matplotlib.pyplot as plt

    a = normal(loc=64, scale=2, size=10000)
    b = normal(loc=70, scale=2, size=100000)
    plt.hist(a, range=(55, 75), bins=20)
    plt.hist(b, range=(55, 75), bins=20)
    plt.show()
The result is two histograms that are very difficult to compare:
To solve this, we can normalize our histograms using density=True. This keyword argument divides the height of each column by a constant such that the total shaded area of the histogram sums to 1.
    a = normal(loc=64, scale=2, size=10000)
    b = normal(loc=70, scale=2, size=100000)
    plt.hist(a, range=(55, 75), bins=20, alpha=0.5, density=True)
    plt.hist(b, range=(55, 75), bins=20, alpha=0.5, density=True)
    plt.show()
Now, we can more easily see the differences between the blue set and the orange set:
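To make the normalization concrete, here is a small numerical check. It uses np.histogram directly rather than plt.hist, on the understanding that plt.hist relies on NumPy's binning. The seed is an arbitrary assumption, and the helper names (areas, peak_counts) are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary fixed seed
a = rng.normal(loc=64, scale=2, size=10000)
b = rng.normal(loc=70, scale=2, size=100000)

areas = {}        # total area under each normalized histogram
peak_counts = {}  # tallest raw bar for each dataset

for name, data in (("a", a), ("b", b)):
    # Raw counts per bin
    counts, edges = np.histogram(data, range=(55, 75), bins=20)
    # Normalized heights: counts / (total in-range count * bin width)
    density, _ = np.histogram(data, range=(55, 75), bins=20, density=True)
    widths = np.diff(edges)

    peak_counts[name] = int(counts.max())
    areas[name] = float((density * widths).sum())

print(peak_counts)  # raw peaks differ by roughly the ratio of sample sizes
print(areas)        # both areas are 1 up to floating-point error
```

Note that the normalization only counts samples that fall inside the given range, so values outside (55, 75) do not contribute to the area.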
We’ve provided another dataset in the file sales_times_s2.csv that represents the 371 sales at MatplotSip’s second location from 8am to 10pm on the same day. This data has the same structure as the sales times data from store 1, including a card_no and a time. Take a look at the data in the csv and familiarize yourself with it.
Using script.py, we’ve imported the times into a list called sales_times2. You can see how we did this in script.py, but you’ll only be interacting with the resulting lists, such as sales_times2, in histogram.py, so don’t worry if you don’t understand the conversion from csv to list.
Plot the histogram of times from the second location on top of the one from the last exercise.
Notice that the histogram we plotted second completely obscures the first histogram we plotted.
Modify the transparency value of both histograms to be 0.4 so that we can see the separate histograms better.
Normalize both the histograms so that we can compare the patterns between them despite the differences in sample size.