When beginning to work with a dataset, one of the first pieces of information you might want to investigate is the spread — is the data close together or far apart? One of the tools in our statistics toolbelt to do this is the descriptive statistic variance:
By finding the variance of a dataset, we can get a numeric representation of the spread of the data. If you want to take a deeper dive into how to calculate variance, check out our variance course.
But what does that number really mean? How can we use this number to interpret the spread?
It turns out, using variance isn’t necessarily the best statistic to use to describe spread. Luckily, there is another statistic — standard deviation — that can be used instead.
In this lesson, we’ll be working with two datasets. The first dataset contains the heights (in inches) of a random selection of players from the NBA. The second dataset contains the heights (in inches) of a random selection of users on the dating platform OkCupid — let’s hope these users were telling the truth about their height!
Run the code to see a histogram of the datasets. Look at the console to see the mean and variance of each dataset. Try to answer the following questions:
- What does it mean for the OkCupid dataset to have a larger variance than the NBA dataset?
- What are the units of the mean? Is someone who is 80 inches tall taller than the average of either group? Which group(s)?
- In this example, the units of variance are inches squared. Can you interpret what it means for the variance of the NBA dataset to be