Beyond visualizing relationships, we can also use summary statistics to quantify the strength of certain associations. Covariance is a summary statistic that describes the strength of a linear relationship. A linear relationship is one where a straight line would best describe the pattern of points in a scatter plot.
Covariance can range from negative infinity to positive infinity. A positive covariance indicates that a larger value of one variable is associated with a larger value of the other. A negative covariance indicates a larger value of one variable is associated with a smaller value of the other. A covariance of 0 indicates no linear relationship. Here are some examples:
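As a rough illustration of these three cases, the sketch below computes a sample covariance by hand for a few made-up data sets. The helper name sample_covariance and all of the values are assumptions chosen for illustration; they are not part of the housing data used in this lesson.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance: summed product of deviations, divided by n - 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

# Made-up data, chosen only to illustrate the sign of the covariance
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]    # rises as x rises
y_down = [10, 8, 6, 4, 2]  # falls as x rises
y_flat = [2, 1, 4, 1, 2]   # no linear trend in x

cov_up = sample_covariance(x, y_up)
cov_down = sample_covariance(x, y_down)
cov_flat = sample_covariance(x, y_flat)

print(cov_up)    # positive
print(cov_down)  # negative
print(cov_flat)  # zero: no linear relationship
```

Note that y_flat still has a pattern (it peaks in the middle); a covariance of 0 only rules out a linear one.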
To calculate covariance, we can use the cov() function from NumPy, which produces a covariance matrix for two or more variables. A covariance matrix for two variables looks something like this:
| | variable 1 | variable 2 |
|---|---|---|
| variable 1 | variance(variable 1) | covariance |
| variable 2 | covariance | variance(variable 2) |
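To make this layout concrete, here is a minimal sketch that checks each cell of a covariance matrix against the table. The arrays var1 and var2 hold made-up values and are assumptions for illustration only. By default, np.cov divides by n - 1, so the matching variance call is np.var with ddof=1.

```python
import numpy as np

# Made-up values standing in for two variables
var1 = np.array([2.0, 4.0, 6.0, 8.0])
var2 = np.array([1.0, 3.0, 2.0, 5.0])

cov_mat = np.cov(var1, var2)
print(cov_mat.shape)  # a 2 x 2 matrix for two variables

# Diagonal entries match the sample variance of each variable
diag_matches = (
    np.isclose(cov_mat[0, 0], np.var(var1, ddof=1))
    and np.isclose(cov_mat[1, 1], np.var(var2, ddof=1))
)
print(diag_matches)

# Both off-diagonal entries hold the same covariance
off_diag_matches = np.isclose(cov_mat[0, 1], cov_mat[1, 0])
print(off_diag_matches)
```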
In Python, we can calculate this matrix as follows:

    import numpy as np

    # housing is assumed to be an already-loaded DataFrame
    cov_mat_price_sqfeet = np.cov(housing.price, housing.sqfeet)
    print(cov_mat_price_sqfeet)
    # output:
    # [[184332.9  57336.2]
    #  [ 57336.2 122045.2]]
Notice that the covariance appears twice in this matrix and is equal to 57336.2.
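One way to pull that single number out of the matrix is by indexing. In the sketch below, the matrix printed above is typed in by hand, because the housing data itself is not reproduced here. The variable name cov_price_sqfeet is an assumption made for illustration.

```python
import numpy as np

# The matrix printed above, entered by hand for illustration
cov_mat_price_sqfeet = np.array([[184332.9, 57336.2],
                                 [57336.2, 122045.2]])

# Row 0, column 1 holds the covariance of price and sqfeet
cov_price_sqfeet = cov_mat_price_sqfeet[0, 1]
print(cov_price_sqfeet)

# Row 1, column 0 holds the same value
print(cov_mat_price_sqfeet[1, 0] == cov_price_sqfeet)
```

Either off-diagonal position works, since the covariance of price with sqfeet is the same as the covariance of sqfeet with price.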
Instructions
Use the cov() function from NumPy to calculate the covariance matrix for the sqfeet variable and the beds variable. Save the covariance matrix as cov_mat_sqfeet_beds. Print out the value stored in the variable cov_mat_sqfeet_beds.
Look at the covariance matrix you just printed and find the covariance of sqfeet and beds. Save that number as a variable named cov_sqfeet_beds.