If we want to understand whether the outcomes of two categorical variables are associated, we can use a Chi-Square test. It is useful in situations like:
- An A/B test where half of users were shown a green submit button and the other half were shown a purple submit button. Was one group more likely to click the submit button?
- People under and over age 40 were given a survey asking “Which of the following three products is your favorite?” Did these age groups have significantly different preferences?
In SciPy, we can use the function chi2_contingency() to perform a Chi-Square test. The input to chi2_contingency() is a contingency table, which can be created using the pandas crosstab() function as follows:
```python
# create table:
import pandas as pd
table = pd.crosstab(variable_1, variable_2)

# run the test:
from scipy.stats import chi2_contingency
chi2, pval, dof, expected = chi2_contingency(table)
```
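To make the two steps concrete, here is a minimal sketch based on the button-color A/B scenario. The click counts are invented for illustration, and the names `button_color` and `clicked` are hypothetical stand-ins for whatever two categorical variables you are comparing.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical A/B test data, one entry per user (counts are made up).
button_color = ["green"] * 50 + ["purple"] * 50
clicked = (["yes"] * 30 + ["no"] * 20      # green group: 30 of 50 clicked
           + ["yes"] * 18 + ["no"] * 32)   # purple group: 18 of 50 clicked

# Cross-tabulate the two categorical variables into a table of counts.
table = pd.crosstab(pd.Series(button_color, name="button_color"),
                    pd.Series(clicked, name="clicked"))
print(table)

# The function returns the test statistic, the p-value, the degrees of
# freedom, and the counts expected if the variables were not associated.
chi2, pval, dof, expected = chi2_contingency(table)
print("chi2:", chi2)
print("p-value:", pval)
print("degrees of freedom:", dof)
print("expected counts under no association:")
print(expected)
```

Note that for a 2x2 table like this one, chi2_contingency() applies Yates' continuity correction by default, so the statistic may differ slightly from a hand calculation that omits the correction.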
For example, suppose we want to know whether gender is associated with the probability of a website visitor making a purchase. The null hypothesis is that there is no association between the variables (e.g., men, women, and non-binary people are all equally likely to make a purchase on the website, so gender and purchase status are not associated). If the p-value is below our chosen threshold (often 0.05), we reject the null hypothesis and conclude that there is a statistically significant association between the two variables (e.g., men, women, and non-binary people appear to have different probabilities of making a purchase, so gender is associated with purchase status).
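The sketch below illustrates what "no association" means numerically and how the threshold comparison might look in code. The counts in `observed` are invented, and the names `observed`, `manual_expected`, and `alpha` are assumptions made for this example only.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (made up). Rows are gender groups;
# columns are [made a purchase, did not make a purchase].
observed = np.array([
    [40, 160],   # men
    [55, 145],   # women
    [12,  38],   # non-binary people
])

chi2, pval, dof, expected = chi2_contingency(observed)

# Under the null hypothesis, each expected count is
# (row total * column total) / grand total.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
manual_expected = row_totals * col_totals / observed.sum()
print("matches SciPy's expected counts:", np.allclose(expected, manual_expected))

alpha = 0.05  # chosen significance threshold
if pval < alpha:
    print(f"p = {pval:.3f}: reject the null hypothesis")
else:
    print(f"p = {pval:.3f}: fail to reject the null hypothesis")
```

The test statistic grows as the observed counts move further from these expected counts, which is why a small p-value is read as evidence against the null hypothesis.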
The management at the VeryAnts ant store wants to know if their two most popular species of ants, the Leaf Cutter and the Harvester, vary in popularity between 1st, 2nd, and 3rd graders.
We have provided a dataset named ants with a sample of 108 sales to 1st, 2nd, and 3rd grade teachers. The dataset has two columns: Grade (the teacher's grade level) and Ant (the ant species, such as 'Leaf Cutter').
Use this data to create a contingency table of the Grade and Ant columns, and save the table as table.
Use the chi2_contingency() function from SciPy to run a Chi-Square test using the contingency table you just created (saved as table). Save the p-value as pval and print it out.
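As a rough sketch of these two steps, the code below builds a stand-in for the ants dataset and runs the test on it. The real dataset is not reproduced here, so every row is invented, and the exact labels ('1st', '2nd', '3rd', 'Harvester') are assumptions about how the values are encoded. The p-value it prints therefore says nothing about the actual VeryAnts data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Stand-in for the provided `ants` dataset (assumption: invented rows,
# chosen only so that the sketch runs with 108 sales in total).
ants = pd.DataFrame({
    "Grade": ["1st"] * 36 + ["2nd"] * 36 + ["3rd"] * 36,
    "Ant": (["Leaf Cutter"] * 22 + ["Harvester"] * 14
            + ["Leaf Cutter"] * 18 + ["Harvester"] * 18
            + ["Leaf Cutter"] * 12 + ["Harvester"] * 24),
})

# Contingency table: one row per grade, one column per ant species.
table = pd.crosstab(ants["Grade"], ants["Ant"])
print(table)

# Run the Chi-Square test and keep the p-value.
chi2, pval, dof, expected = chi2_contingency(table)
print(pval)
```

With three grades and two species, the table has 3 rows and 2 columns, so the test has (3 - 1) * (2 - 1) = 2 degrees of freedom.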
Are certain types of ants more popular among specific grades (is there an association between grade and ant type)? Using a significance threshold of 0.05, indicate your answer by changing the value of the answer variable to True if there is a significant association between these variables and to False otherwise.