On to the next chart: what if we want to understand how data is distributed? Matplotlib has a graph function for that, of course! Let’s use plt.hist() to make a histogram, a graph that shows the spread of one variable of continuous data.
plt.hist() takes the following parameters:
x: the value being distributed (like shoe size, or height). Note that it should not be an aggregated value.
bins: specifies how many bins to make (e.g. 10) OR where the edges of the bins are (as a list of values)
range: the lower and upper limits of the bins, given as a pair such as (5, 12). If unspecified, set to the min and max values for x.
color: sets the color of the bars
To make a histogram of women’s shoe sales by size, with bins every half-size, we might use the code below. The range from 5 to 12 covers 7 full sizes, so half-size bins means 14 bins:
plt.hist(x = df.shoe_size, bins = 14, range = (5, 12), color = 'dodgerblue')
The imaginary shoe size data spans from size 5.5 to 11.5, so this range will help the graph appear balanced and not squeezed.
As with most of our other graph functions, only the data portion (x) is strictly necessary, but adjusting other parameters makes a much more useful graph.
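The bins parameter can also take a list of edges instead of a count. Here is a minimal sketch of that form, assuming made-up data: the shoe_sizes array and its random values are hypothetical stand-ins for df.shoe_size, not real sales figures.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 500 made-up shoe sizes in half-size steps from 5.5 to 11.5
rng = np.random.default_rng(0)
shoe_sizes = rng.choice(np.arange(5.5, 12.0, 0.5), size=500)

# Explicit bin edges every half size from 5 to 12 (15 edges make 14 bins)
edges = np.arange(5, 12.5, 0.5)

# plt.hist() returns the count in each bin, the bin edges, and the bar objects
counts, bin_edges, bars = plt.hist(x=shoe_sizes, bins=edges, color='dodgerblue')
plt.xlabel("Shoe size")
plt.ylabel("Pairs sold")
```

Passing edges makes the bin width explicit, so there is no need to work out how many bins fit in the range. In a Jupyter notebook the figure appears when the cell finishes running.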
For this Jupyter exercise, our data comes from the cold waters of Maine, where lobster fishing is a crucial industry. There are minimum- and maximum-size regulations in place that limit which lobsters may be caught and which are thrown back into the ocean. Any lobster too small or too big is thrown back. Let’s see if this human intervention in the population shows up in the size distribution of lobsters caught and tagged during a field experiment.
Instructions
Run the Setup cells above. In the cell below, run the code to check out the first few rows of hist_data to see what we’re working with. (Note: carapace_length refers to the length of the lobster’s shell.)
Now that we’ve looked at some of the data in table form, let’s see it as a histogram. Write the code to visualize the distribution of lobster shell lengths. (Remember the shells are also called ‘carapaces.’) Set the bins to 10.
Change the number of bins to 20. Let’s also make this graph more readable: title the graph “Lobsters tagged by size”, and label the x and y axes “Carapace length (mm)” and “Number tagged”, respectively.
Set the bins equal to 30. Let’s also give the distribution a little breathing room on either side by specifying a range of (75, 155), just slightly outside of our minimum and maximum x-values. Additionally, set the histogram color to 'gold'… because why not?
OPTIONAL: Play around with the number of bins. How many bins is too many? At what numbers do we gain and lose the understanding that this is a bimodal distribution?
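One way to compare bin counts is to draw several histograms side by side. The sketch below does not use the lobster data: it assumes a synthetic two-peaked sample (the lengths array, its peak locations of 95 and 135, and the bin counts 2, 20, and 200 are all hypothetical choices for illustration).

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical bimodal sample: two made-up clusters of lengths, 300 values each
rng = np.random.default_rng(42)
lengths = np.concatenate([rng.normal(95, 5, 300), rng.normal(135, 5, 300)])

# Draw the same data with very few, a moderate number, and very many bins
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, n_bins in zip(axes, [2, 20, 200]):
    ax.hist(lengths, bins=n_bins, range=(75, 155), color='gold')
    ax.set_title(f"{n_bins} bins")
    ax.set_xlabel("Carapace length (mm)")
axes[0].set_ylabel("Number tagged")
plt.tight_layout()
```

With very few bins the two peaks can merge into bars of similar height, and with very many bins the count in each bin becomes small and noisy. Swapping in other values for the list of bin counts is a quick way to find where the two-peak shape is easiest to read.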