As we can see, the mean is a helpful way to quickly understand different parts of our data. However, the mean is highly influenced by the specific values in our dataset. What happens when one of those values is significantly different from the rest?
Values that don’t fit with the majority of a dataset are known as outliers. It’s important to identify outliers because, if they go unnoticed, they can skew summary statistics such as the mean and lead to errors in our analysis. They can also point to errors in our data collection.
Once we’ve identified an outlier, we can determine whether it was due to an error in sample collection or whether it represents a significant but real deviation from the mean.
Suppose we want to determine the average height for 3rd graders. We measure several students at the local school, but accidentally measure one student in centimeters rather than in inches. If we’re not paying attention, our dataset could end up looking like this:
[50, 50, 51, 49, 48, 127]
In this case, 127 would be an outlier.
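To see how much this one mistaken entry moves the mean, here is a minimal Python sketch that compares the mean with and without it. The variable names are illustrative, and it assumes the mistaken measurement is the last entry in the list.

```python
from statistics import mean

# Heights in inches, except the last value, which was recorded in centimeters.
heights = [50, 50, 51, 49, 48, 127]

mean_with_outlier = mean(heights)          # all six values
mean_without_outlier = mean(heights[:-1])  # drop the last (mistaken) value

print(mean_with_outlier)     # 62.5
print(mean_without_outlier)  # 49.6
```

A single bad value pulls the mean up by almost 13 inches, well above the height of every correctly measured student.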
Some outliers aren’t the result of a mistake. For instance, suppose that one of our 3rd graders had skipped a grade and was actually a year younger than everyone else in the class:
[50, 50, 51, 49, 48, 45]
At 45”, she is significantly shorter than her classmates. Her measurement is correct, but her height would still be an outlier.
Suppose that another student was just unusually tall for his age:
[50, 50, 51, 49, 48, 58.5]
His height of 58.5” would also be an outlier.
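Real outliers shift the mean too, just less dramatically than the unit mistake. The sketch below is one way to measure that shift for the two students above. The helper name `shift_in_mean` is hypothetical, and the sketch treats the first five heights as the typical values.

```python
from statistics import mean

# The five heights shared by both example datasets (inches).
typical = [50, 50, 51, 49, 48]
baseline = mean(typical)

def shift_in_mean(unusual_height):
    """Return how far one extra height moves the mean away from the baseline."""
    return mean(typical + [unusual_height]) - baseline

for height in (45, 58.5):
    print(height, round(shift_in_mean(height), 2))
# 45 -0.77
# 58.5 1.48
```

The shorter student lowers the mean by under an inch, while the taller student raises it by about an inch and a half.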
Explore the interactive visualization by clicking and dragging the circle* to change the value of the cluster of outliers.
How does the value of the outliers affect the mean? What happens to the mean when the outliers are more similar to the rest of the set? What happens when the cluster is outside the expected range?
* If you don’t see the small blue circle on the axis, try opening this lesson in Chrome or refreshing.
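The questions above can also be explored without the visualization. As a rough stand-in, this Python sketch adds a single extra value to the 3rd-grade heights and recomputes the mean as that value moves away from the rest. It is an assumption-laden simplification: the visualization uses a cluster of outliers, the function name `mean_with_extra` is made up for illustration, and the test values are arbitrary.

```python
from statistics import mean

# Typical 3rd-grade heights from the examples above (inches).
typical = [50, 50, 51, 49, 48]

def mean_with_extra(value):
    """Return the mean of the typical heights plus one extra value."""
    return mean(typical + [value])

# Move the extra value farther and farther from the rest of the data.
for value in (50, 60, 90, 127):
    print(value, round(mean_with_extra(value), 2))
# 50 49.67
# 60 51.33
# 90 56.33
# 127 62.5
```

When the extra value is similar to the rest of the set, the mean barely moves. The farther the value sits from the other heights, the farther the mean is dragged toward it.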