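Well done! You’ve calculated the variance of a dataset. The full equation for the variance is as follows:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Here $X_i$ is the $i$-th point in the dataset, $\mu$ is the mean, and $N$ is the number of points.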
Let’s dissect this equation a bit.
- Variance is usually represented by the symbol sigma squared.
- We start by taking every point in the dataset, from point number 1 to point number N, and finding the difference between that point and the mean.
- Next, we square each difference to make all differences positive.
- Finally, we average those squared differences by adding them together and dividing by N, the total number of points in the dataset.
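To make the three steps concrete, here is a minimal sketch in plain Python that follows them one at a time. The variable names (`differences`, `squared_differences`) are only illustrative and are not part of the lesson's code.

```python
# Small example dataset (the same values used with NumPy in this lesson)
dataset = [3, 5, -2, 49, 10]

n = len(dataset)          # N, the total number of points
mean = sum(dataset) / n   # the mean of the dataset

# Step 1: find the difference between each point and the mean
differences = [x - mean for x in dataset]

# Step 2: square each difference so none of them is negative
squared_differences = [d ** 2 for d in differences]

# Step 3: average the squared differences
variance = sum(squared_differences) / n

print(variance)
```

For this dataset the mean is 13, the squared differences add up to 1694, and dividing by 5 gives a variance of 338.8.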
All of this work can be done quickly using Python’s NumPy library. The var() function takes a list of numbers as a parameter and returns the variance of that dataset.
    import numpy as np

    dataset = [3, 5, -2, 49, 10]
    variance = np.var(dataset)
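One detail worth knowing: by default, `np.var()` divides by N, which matches the equation in this lesson (the population variance). The short sketch below shows the default next to the optional `ddof=1` argument, which divides by N - 1 instead. The variable names are illustrative only.

```python
import numpy as np

dataset = [3, 5, -2, 49, 10]

# Default behaviour: divide the summed squared differences by N
population_variance = np.var(dataset)

# With ddof=1, NumPy divides by N - 1 instead (the sample variance)
sample_variance = np.var(dataset, ddof=1)

print(population_variance)
print(sample_variance)
```

For the exercises in this lesson, the default call without `ddof` is the one that corresponds to the equation.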
We’ve imported the same two datasets from the beginning of the lesson. Run the code to see a histogram of the two datasets. This time, the histograms are plotted on the same graph to help visualize the difference in spread.
Which dataset do you expect to have a larger variance?
Scroll down in the code to find where we’ve defined teacher_two_variance. Set those variables equal to the variance of each dataset using the