Finding the mean, median, and mode of a dataset is a good way to start understanding the general shape of your data.
However, those three descriptive statistics only tell part of the story. Consider the two datasets below:
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)
These two datasets have the same mean and median: both values are 0. If we reported only these two statistics, we would not be communicating any meaningful difference between the datasets, even though the values in the second are 100 times larger in magnitude.
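A quick check of that claim, as a minimal sketch: the variable names `mean_one`, `median_one`, and so on are only illustrative and are not part of the lesson's code.

```r
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)

# Both datasets are symmetric around 0, so the mean and median coincide.
mean_one <- mean(dataset_one)
median_one <- median(dataset_one)
mean_two <- mean(dataset_two)
median_two <- median(dataset_two)

# Print the (mean, median) pair for each dataset.
print(c(mean = mean_one, median = median_one))
print(c(mean = mean_two, median = median_two))
```

The two printed pairs are identical, which is exactly why the mean and median alone cannot tell these datasets apart.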
This is where variance comes into play. Variance is a descriptive statistic that describes how spread out the points in a dataset are.
Run the code and take a look at the two histograms that get created. Also look at the mean of each dataset.
These two histograms show the test grades of students from two different teachers' classes. While the datasets have very similar means, their variances are very different. Think about the following questions:
- Which dataset looks more spread out?
- When looking at the spread of a histogram, why are the units on the x-axis so important?
- What does the spread of the data tell you about these two teachers that you might not understand if you only looked at the mean?
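
To put a number on spread, here is a rough sketch that returns to `dataset_one` and `dataset_two` from the start of this section. It uses R's built-in `var()`, which computes the sample variance (dividing by n - 1). The helper `population_variance()` is a hypothetical function written for this example, not something the lesson provides; it divides by n instead. Which of the two definitions the rest of the lesson adopts is not stated here, so both are shown.

```r
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)

# Built-in sample variance: sum of squared deviations divided by (n - 1).
sample_var_one <- var(dataset_one)
sample_var_two <- var(dataset_two)

# Hypothetical helper (an assumption for this sketch): population variance,
# the average squared deviation from the mean (divides by n).
population_variance <- function(x) {
  mean((x - mean(x))^2)
}

pop_var_one <- population_variance(dataset_one)
pop_var_two <- population_variance(dataset_two)

print(c(sample = sample_var_one, population = pop_var_one))
print(c(sample = sample_var_two, population = pop_var_two))
```

Under either definition, the variance of `dataset_two` is 10,000 times that of `dataset_one`. Scaling every value by 100 scales the variance by 100 squared, because variance is measured in squared units of the data.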