Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure 10 trees and get an average height of 32 ft. On Tuesday, you measure 12 different trees and reach an average height of 35 ft. On Wednesday, you measure the remaining 11 trees in the park, whose average height is 31 ft. The average height for all 33 trees in your local park is 32.8 ft.
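The 32.8 ft figure can be checked from the three daily averages. The sketch below weights each sample mean by the number of trees measured that day; the variable names are illustrative and not part of the lesson workspace.

```python
# Sample sizes and sample means from the three days of measuring.
sample_sizes = [10, 12, 11]   # trees measured on Monday, Tuesday, Wednesday
sample_means = [32, 35, 31]   # average height in feet for each day

# Multiplying a day's mean by its sample size recovers that day's total height.
total_height = sum(n * m for n, m in zip(sample_sizes, sample_means))
total_trees = sum(sample_sizes)

# The population mean is the total height divided by the total number of trees.
population_mean = total_height / total_trees
print(round(population_mean, 1))  # 32.8
```

Weighting matters here because the samples differ in size: simply averaging the three daily means would give about 32.7 ft rather than 32.8 ft.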
The height measurements collected on Monday, Tuesday, and Wednesday are each called a sample. A sample is a subset of the entire population (here, all the oak trees in the park). The mean of each sample is a sample mean, and it is an estimate of the population mean.
Note: the sample means (32 ft, 35 ft, and 31 ft) were all close to the population mean (32.8 ft), but were all slightly different from the population mean and from each other.
For a population, the mean is a constant value no matter how many times it’s recalculated. For a sample, however, the mean depends on exactly which members of the population are selected. From a sample mean, we can then estimate the mean of the population as a whole. There are three main reasons we might use sampling:
- data on the entire population is not available
- data on the entire population is available, but it is so large that it is unfeasible to analyze
- meaningful answers to questions can be found faster with sampling
Instructions
In the workspace, we’ve generated a random population of size 300 that follows a normal distribution with a mean of 65. Update the value of population_mean to store the mean() of population. Does it closely match your expectation?
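As a rough sketch of what this step involves, the code below builds a comparable population using only the Python standard library. The workspace code itself is not shown in the lesson, so the seed, the standard deviation of 3.5, and the use of random.gauss and statistics.mean are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so repeated runs give the same numbers

# ASSUMPTION: the lesson does not state a standard deviation; 3.5 is a placeholder.
population = [rng.gauss(65, 3.5) for _ in range(300)]

# The population mean is fixed once the population exists.
population_mean = statistics.mean(population)
print(population_mean)
```

Because the values are drawn at random, the computed mean should land near 65 without being exactly 65.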
Let’s look at how the means of different samples can vary within the same population.
The code in the notebook generates 5 random samples from population. sample_1 is displayed and sample_1_mean has been calculated.
Replace the "Not calculated" strings with calculations of the means for sample_2, sample_3, sample_4, and sample_5.
Look at the population mean and the sample means. Are they all the same? All different? Why?
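One way to explore this question outside the workspace is sketched below. It draws five samples without replacement from a simulated population and prints each sample mean next to the population mean. The sample size of 30, the seed, and the standard deviation are assumptions, since the lesson does not specify them, and the list-based names stand in for sample_1 through sample_5.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# ASSUMPTION: standard deviation 3.5 and sample size 30 are placeholders.
population = [rng.gauss(65, 3.5) for _ in range(300)]
population_mean = statistics.mean(population)

# Draw five samples without replacement; each is a subset of the population.
samples = [rng.sample(population, 30) for _ in range(5)]
sample_means = [statistics.mean(s) for s in samples]

print(f"population mean: {population_mean:.2f}")
for i, m in enumerate(sample_means, start=1):
    print(f"sample_{i} mean: {m:.2f}")
```

Each sample contains a different set of values, so each sample mean is expected to sit close to the population mean while differing slightly from it and from the other sample means.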