Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure 10 trees and get an average height of 32 ft. On Tuesday, you measure 12 different trees and reach an average height of 35 ft. On Wednesday, you measure the remaining 11 trees in the park, whose average height is 31 ft. The average height for all 33 trees in your local park is 32.8 ft.
The collections of individual height measurements from Monday, Tuesday, and Wednesday are each called samples. A sample is a subset of the entire population (here, all the oak trees in the park). The mean of each sample is a sample mean, and it is an estimate of the population mean.
Note: the sample means (32 ft., 35 ft., and 31 ft.) were all close to the population mean (32.8 ft.), but were all slightly different from the population mean and from each other.
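The 32.8 ft. figure is a weighted average of the three daily averages, where each day's average counts in proportion to the number of trees measured that day. A minimal sketch of that arithmetic, using only the numbers from the example (the variable names are illustrative):

```python
# Each entry is (number of trees measured, average height in ft) for one day.
samples = [(10, 32), (12, 35), (11, 31)]

# Total number of trees across all three days.
total_trees = sum(n for n, _ in samples)

# Multiplying each average by its tree count recovers the summed heights.
total_height = sum(n * avg for n, avg in samples)

# Weighted mean: total height divided by total number of trees.
population_mean = total_height / total_trees

print(total_trees)                # 33
print(round(population_mean, 1))  # 32.8
```

Simply averaging 32, 35, and 31 would give about 32.7 instead, because it ignores that the three days covered different numbers of trees.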
For a population, the mean is a constant value no matter how many times it’s recalculated. But for a sample, the mean depends on exactly which members of the population are selected. From a sample mean, we can then estimate the mean of the population as a whole. There are three main reasons we might use sampling:
- data on the entire population is not available
- data on the entire population is available, but it is so large that it is unfeasible to analyze
- meaningful answers to questions can be found faster with sampling
In the workspace, we’ve generated a random population of size 300 that follows a normal distribution with a mean of 65. Update the value of population_mean to store the mean of the population. Does it closely match your expectation?
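As a rough illustration of this step, the sketch below builds a comparable population using only Python's standard library and computes its mean. The workspace code is not shown here, so the standard deviation of 3.5, the seed, and the use of random.gauss are assumptions made for this example, not the lesson's actual setup.

```python
import random
import statistics

random.seed(42)  # assumed seed, used only to make the run repeatable

# 300 values drawn from a normal distribution with mean 65.
# The standard deviation (3.5) is an assumption; the lesson does not state it.
population = [random.gauss(65, 3.5) for _ in range(300)]

# The population mean is computed once over every member of the population.
population_mean = statistics.mean(population)

print(len(population))
print(round(population_mean, 2))
```

The computed mean should land near 65 but will rarely equal it exactly, since the 300 values are themselves random draws.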
Let’s look at how the means of different samples can vary within the same population.
The code in the notebook generates 5 random samples from the population. The first, sample_1, is displayed, and sample_1_mean has been calculated. Replace the "Not calculated" strings with calculations of the means for the remaining four samples.
Look at the population mean and the sample means. Are they all the same? All different? Why?
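One way to explore this question outside the notebook is sketched below. It draws five samples from a simulated population and prints each sample mean next to the population mean. The sample size of 30, the standard deviation, the seed, and the list-based layout are all assumptions for illustration; the notebook's own variables (sample_1, sample_1_mean, and so on) may be set up differently.

```python
import random
import statistics

random.seed(0)  # assumed seed, used only to make the run repeatable

# Simulated population: 300 values, normal with mean 65 (std dev 3.5 is assumed).
population = [random.gauss(65, 3.5) for _ in range(300)]
population_mean = statistics.mean(population)

sample_size = 30  # assumed; the lesson does not state the sample size

# Draw 5 samples, each chosen without replacement from the population.
samples = [random.sample(population, sample_size) for _ in range(5)]

# Each sample has its own mean, which estimates the population mean.
sample_means = [statistics.mean(s) for s in samples]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample_{i}_mean:   {m:.2f}")
```

Each sample contains a different subset of the population, so the sample means typically differ from one another and from the population mean while still falling close to it.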