ANOVA and Hypothesis Testing
ANOVA (Analysis of Variance) is a statistical method that compares the means of several groups (typically three or more) to determine whether the differences between them are statistically significant.
ANOVA tests the null hypothesis that all group means are equal. A significant result suggests that at least one group mean differs from the others, although it does not indicate which one.
This method is commonly used in hypothesis testing to check whether a categorical factor, such as a treatment or teaching method, is associated with differences in a numeric outcome.
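To make the null hypothesis concrete, the F-statistic behind a one-way ANOVA can be computed by hand as the ratio of between-group variance to within-group variance. The sketch below uses only the Python standard library and borrows the scores from the Example section of this page; the variable names (`groups`, `ss_between`, `ss_within`, `f_stat`) are illustrative choices, not part of any library.

```python
from statistics import mean

# Scores grouped by teaching method (same values as the Example section)
groups = {
    "A": [85, 89, 76],
    "B": [71, 80, 78],
    "C": [93, 95, 88],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())

# Within-group sum of squares: spread of the scores around their own group mean
ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1               # number of groups minus 1
df_within = len(all_scores) - len(groups)  # observations minus number of groups

# F-statistic: ratio of the two mean squares
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(round(f_stat, 3))  # about 6.958
```

A large F-statistic means the group means differ more than the spread within each group would explain by chance. Turning it into a p-value requires the F distribution, which statistical libraries handle.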
In Python, the statsmodels library provides tools to perform ANOVA and other hypothesis tests. These tools are widely used in data analysis to uncover meaningful insights about relationships within a dataset.
Syntax
The basic syntax for performing ANOVA using statsmodels is as follows:
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Define the model
model = ols('dependent_variable ~ C(independent_variable)', data=dataset).fit()
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=1)
print(anova_table)
- import statsmodels.api as sm: Imports the statsmodels library.
- from statsmodels.formula.api import ols: Imports the ols function, which fits an Ordinary Least Squares (OLS) linear regression model.
- dependent_variable ~ C(independent_variable): This formula defines the relationship between the dependent and independent variables. The C() function treats the independent variable as categorical.
- data=dataset: This specifies the dataset for analysis. It must be a structured data format, such as a Pandas DataFrame, where the variables in the formula are columns in the dataset.
- sm.stats.anova_lm(model, typ=1): This function performs ANOVA on the fitted model:
  - model: This is the fitted model created by the ols function.
  - typ=1: This specifies the type of sum of squares to use in the ANOVA calculation. Type 1 is a sequential sum of squares, which evaluates each variable in the order it appears in the formula.
Example
The following example shows how to evaluate differences in average test scores across three teaching methods:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Sample data
data = pd.DataFrame({
    'Score': [85, 89, 76, 71, 80, 78, 93, 95, 88],
    'Method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
})
# Define the model
model = ols('Score ~ C(Method)', data=data).fit()
# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=1)
print(anova_table)
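The following table presents the ANOVA output for this example. The p-value (the PR(>F) column) is about 0.027, which is below the conventional 0.05 threshold, so the teaching method has a statistically significant effect on the scores:
            df      sum_sq     mean_sq         F    PR(>F)
C(Method)  2.0  369.555556  184.777778  6.958159  0.027342
Residual   6.0  159.333333   26.555556       NaN       NaN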