On to the next chart: what if we want to understand how data is distributed? Matplotlib has a graph function for that, of course! Let’s use plt.hist() to make a histogram, a graph that shows the distribution of a single continuous variable.
plt.hist() takes the following parameters:
x: the value being distributed (like shoe size, or height) – note that it should not be an aggregated value
bins: specifies how many bins to make (e.g. 10) OR where the edges of the bins are (as a list of values)
range: the lower and upper limits of the bins, given as a (min, max) pair. If unspecified, it is set to the min and max values of x
color: sets the color of the bars
To make a histogram of women’s shoe sales by size, with bins every half-size, we can split the range from 5 to 12 into 14 bins (7 sizes divided by 14 bins gives a width of 0.5):
plt.hist(x=df.shoe_size, bins=14, range=(5, 12), color='dodgerblue')
The imaginary shoe size data spans from size 5.5 to 11.5, so this range will help the graph appear balanced and not squeezed.
As with most of our other graph functions, only the data portion (x) is strictly necessary, but adjusting other parameters makes a much more useful graph.
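The two ways of specifying bins can be compared in a short sketch. The shoe sizes below are randomly generated purely for illustration (they are not real sales data), and the variable names are assumptions rather than part of the lesson's dataset. Fourteen bins over the range 5 to 12 and a list of explicit half-size edges should describe the same histogram.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: 200 shoe sizes drawn from the half sizes 5.5, 6.0, ..., 11.5
rng = np.random.default_rng(0)
sizes = np.arange(5.5, 12.0, 0.5)
shoe_size = rng.choice(sizes, size=200)

# Option 1: bins as a count, combined with a range
counts_a, edges_a, _ = plt.hist(x=shoe_size, bins=14, range=(5, 12), color='dodgerblue')

# Option 2: bins as an explicit list of edges (5.0, 5.5, ..., 12.0), drawn on a new figure
plt.figure()
half_size_edges = np.arange(5, 12.5, 0.5)
counts_b, edges_b, _ = plt.hist(x=shoe_size, bins=half_size_edges, color='dodgerblue')
```

plt.hist() returns the bin counts and the bin edges along with the drawn bars, which makes it easy to check that both calls produce the same bins.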
For this Jupyter exercise, our data comes from the cold waters of Maine, where lobster fishing is a crucial industry. Minimum- and maximum-size regulations limit which lobsters may be kept: any lobster too small or too big is thrown back into the ocean. Let’s see if this human intervention in the population shows up in the size distribution of lobsters caught and tagged during a field experiment.
Run the Setup cells above. In the cell below, run the code to check out the first few rows of hist_data to see what we’re working with. (Note: carapace_length refers to the length of the lobster’s shell.)
Now that we’ve looked at some of the data in table form, let’s see it as a histogram. Write the code to visualize the distribution of lobster shell lengths. (Remember the shells are also called ‘carapaces.’) Set the bins to 10.
Change the number of bins to 20. Let’s also make this graph more readable: title the graph “Lobsters tagged by size”, and label the x and y axes “Carapace length (mm)” and “Number tagged”, respectively.
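One way this step might look is sketched below. Because the Setup cells are not shown here, the array carapace_length is a hypothetical stand-in for hist_data.carapace_length, filled with made-up lengths; in the notebook you would pass the real column instead.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for hist_data.carapace_length: made-up shell lengths in mm
rng = np.random.default_rng(1)
carapace_length = np.concatenate([rng.normal(100, 6, 300), rng.normal(130, 6, 300)])

# Histogram with 20 bins, plus a title and axis labels for readability
counts, edges, _ = plt.hist(x=carapace_length, bins=20)
plt.title('Lobsters tagged by size')
plt.xlabel('Carapace length (mm)')
plt.ylabel('Number tagged')
```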
Set the bins equal to 30. Let’s also give the distribution a little breathing room on either side by specifying a range of (75, 155), just slightly outside of our minimum and maximum x-values. Additionally, set the histogram color to 'gold'… because why not?
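A sketch of this step follows, assuming the breathing room is set through the range parameter and the bar color through color. As before, carapace_length is a hypothetical stand-in for hist_data.carapace_length, with made-up values clipped to 80–150 mm so that they sit just inside the chosen range.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for hist_data.carapace_length: made-up shell lengths in mm
rng = np.random.default_rng(2)
raw = np.concatenate([rng.normal(100, 6, 300), rng.normal(130, 6, 300)])
carapace_length = np.clip(raw, 80, 150)

# 30 bins spread over 75 to 155, so the bars do not touch the edges of the plot
counts, edges, bars = plt.hist(x=carapace_length, bins=30, range=(75, 155), color='gold')
```

With a range of 80 and 30 bins, each bin is roughly 2.7 mm wide.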
OPTIONAL: Play around with the number of bins. How many bins is too many? At what numbers do we gain and lose the understanding that this is a bimodal distribution?
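For the optional exploration, it may help to draw several bin counts side by side rather than rerunning one cell. The sketch below is only one possible approach: the bin counts tried are arbitrary, and carapace_length is again a hypothetical stand-in with made-up, deliberately two-peaked values rather than the real hist_data column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical stand-in for hist_data.carapace_length: made-up, two-peaked lengths in mm
rng = np.random.default_rng(3)
carapace_length = np.concatenate([rng.normal(100, 6, 300), rng.normal(130, 6, 300)])

# One panel per bin count, all sharing the same x-axis for easy comparison
bin_counts = [5, 10, 30, 100]
fig, axes = plt.subplots(1, len(bin_counts), figsize=(16, 3), sharex=True)
for ax, n_bins in zip(axes, bin_counts):
    ax.hist(carapace_length, bins=n_bins, range=(75, 155), color='gold')
    ax.set_title(f'{n_bins} bins')
```

Comparing the panels shows how very few bins can blur the two peaks together, while very many bins make the bars noisy.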