Sometimes we want to get a feel for a large dataset beyond the basic metrics of mean, median, or standard deviation. To get a more intuitive sense of a dataset, we can use a histogram to display how its values are distributed.
A histogram tells us how many values in a dataset fall between different sets of numbers (e.g., how many numbers fall between 0 and 10? Between 10 and 20? Between 20 and 30?). Each of these questions represents a bin. For instance, our first bin might be between 0 and 10.
In the histograms in this lesson, all bins are the same size. The width of each bin is the distance between its minimum and maximum values. In our example, the width of each bin would be 10.
Each bin is represented by a different rectangle whose height is the number of elements from the dataset that fall within that bin.
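To make the idea of bins concrete, here is a minimal sketch that counts values by hand, without Matplotlib. The list values and the choice of three bins are made-up assumptions for illustration. A value that sits exactly on a boundary (such as 10) is counted in the higher bin, which is also the convention Matplotlib follows.

```python
# Hypothetical sample data; any list of numbers between 0 and 30 would work.
values = [3, 7, 12, 15, 18, 21, 24, 29, 29, 5]

bin_width = 10
num_bins = 3  # bins cover 0-10, 10-20, and 20-30

counts = [0] * num_bins
for v in values:
    # Integer division maps a value to the index of its bin.
    # min() folds the top edge (30) into the last bin.
    index = min(int(v // bin_width), num_bins - 1)
    counts[index] += 1

# Each count is the height of one rectangle in the histogram.
for i, count in enumerate(counts):
    low = i * bin_width
    print(f"{low}-{low + bin_width}: {count}")
```

The three counts printed here are the heights the three rectangles would have if this data were drawn as a histogram.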
Here is an example:
To make a histogram in Matplotlib, we use the command plt.hist. By default, plt.hist finds the minimum and the maximum values in your dataset and creates 10 equally-spaced bins between those values.
The histogram above, for example, was created with the following code:
plt.hist(dataset)
plt.show()
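One way to check the default behavior is to look at what plt.hist returns: the count in each bin, the bin edges, and the drawn rectangles. The sketch below assumes a made-up dataset of 1,000 random values, since the lesson's own dataset is not shown here.

```python
import random

import matplotlib.pyplot as plt

random.seed(0)
# Hypothetical stand-in for `dataset`: 1,000 draws from a normal distribution.
dataset = [random.gauss(68, 2) for _ in range(1000)]

# plt.hist draws the histogram and also returns what it computed.
counts, bin_edges, patches = plt.hist(dataset)

print(len(counts))      # number of bins
print(len(bin_edges))   # always one more edge than there are bins
print(bin_edges[0], bin_edges[-1])  # first and last edges span the data
```

With the default settings there should be 10 counts and 11 edges, and the outermost edges should match the smallest and largest values in the data. Calling plt.show() afterwards displays the figure.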
If we want more than 10 bins, we can use the keyword bins to set how many bins we want to divide the data into.
The keyword range selects the minimum and maximum values to plot. For example, if we wanted to take our data from the last example and make a new histogram that just displayed the values from 66 to 69, divided into 40 bins (instead of 10), we could use this function call:
plt.hist(dataset, range=(66,69), bins=40)
which would result in a histogram that looks like this:
Histograms are best for showing the shape of a dataset. For example, you might see that values are close together, or skewed to one side. With this added intuition, we often discover other types of analysis we want to perform.
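As a rough check on how the range and bins keywords described earlier interact, the sketch below inspects the bin edges that plt.hist returns. The dataset is again a made-up assumption (1,000 random values), not the one used in the lesson.

```python
import random

import matplotlib.pyplot as plt

random.seed(1)
# Hypothetical stand-in for `dataset`.
dataset = [random.gauss(68, 2) for _ in range(1000)]

# Only values from 66 to 69 are binned; everything else is ignored.
counts, bin_edges, _ = plt.hist(dataset, range=(66, 69), bins=40)

# Each bin is (69 - 66) / 40 wide.
bin_width = bin_edges[1] - bin_edges[0]
print(bin_width)

# The counts add up to the number of values inside the range,
# not to the size of the whole dataset.
in_range = sum(1 for v in dataset if 66 <= v <= 69)
print(int(counts.sum()), in_range)
```

Because range narrows the span while bins raises the number of divisions, each bin ends up much narrower than in the default histogram, which shows finer detail but also more noise from bin to bin.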
Instructions
We’ve provided data in the file sales_times.csv and loaded it into a list called sales_times. You can see how we did this in the script.py file. This set represents the 270 sales at MatplotSip’s first location from 8am to 10pm on a certain day.
Make a histogram out of this data in histogram.py using the plt.hist function.
Use the bins keyword to create 20 bins instead of the default 10.