scipy.stats
The scipy.stats module is part of the broader SciPy library for scientific computing in Python. It provides functionality for working with various probability distributions, conducting hypothesis tests, and computing descriptive statistics. By leveraging scipy.stats, data scientists and analysts can quickly explore their data, model it using theoretical distributions, and draw meaningful conclusions through statistical inference.
Probability Distributions
scipy.stats provides a wide range of distributions (e.g., Normal, Exponential, Binomial) with methods to work with them. For example, for the Normal distribution:
stats.norm.pdf(x) # Probability Density Function
stats.norm.cdf(x) # Cumulative Distribution Function
stats.norm.rvs(size=n) # Generate random samples
- pdf: Returns the probability density function (PDF) value at a given point for continuous distributions.
- cdf: Gives the probability that a random variable is less than or equal to a given value.
- rvs: Draws random samples from the specified distribution.
These methods can be used with other distributions available in scipy.stats by replacing norm with the desired distribution (e.g., expon, binom). Discrete distributions such as binom provide pmf (probability mass function) in place of pdf.
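As a minimal sketch of the methods above, the following evaluates the standard Normal distribution at a point, draws a few samples, and computes a Binomial probability. The variable names, the input values, and the fixed random_state seed are illustrative assumptions, not part of the original page.

```python
from scipy import stats

# Standard normal distribution (mean 0, standard deviation 1)
density_at_zero = stats.norm.pdf(0)   # height of the density curve at x = 0
prob_below_zero = stats.norm.cdf(0)   # P(X <= 0)

# Five random draws; random_state is fixed here only to make the run repeatable
samples = stats.norm.rvs(size=5, random_state=42)

# loc and scale shift and stretch the distribution (here: mean 10, std dev 2)
prob_below_mean = stats.norm.cdf(10, loc=10, scale=2)

# binom is discrete, so it exposes pmf rather than pdf:
# probability of exactly 3 successes in 10 trials with success probability 0.5
prob_three_successes = stats.binom.pmf(3, n=10, p=0.5)

print(density_at_zero, prob_below_zero, prob_below_mean, prob_three_successes)
print(samples)
```

Passing loc and scale as keyword arguments is one way to parameterize a distribution; the same keywords work with pdf, cdf, and rvs.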
Descriptive Statistics
Compute common statistical measures with both numpy and scipy.stats:
np.mean(data)
np.median(data)
stats.mode(data)
stats.describe(data)
- mean(): Computes the average value of the data.
- median(): Finds the middle value separating the higher and lower halves of the data.
- mode(): Returns the most frequently occurring value (for multi-modal data, it returns the smallest mode).
- describe(): Provides a quick summary of the data, including count, min, max, mean, variance, skewness, and kurtosis.
Note: While mean and median are part of numpy, mode and describe belong to scipy.stats.
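A short sketch of these four calls on a small, made-up dataset may help. The data array and variable names are assumptions chosen for illustration; the values 3 and 7 each appear twice so that the tie-breaking behavior of mode is visible.

```python
import numpy as np
from scipy import stats

# Hypothetical sample; 3 and 7 are both modes
data = np.array([2, 3, 3, 5, 7, 7, 9])

mean_value = np.mean(data)       # sum 36 over 7 values
median_value = np.median(data)   # middle value of the sorted data
mode_result = stats.mode(data)   # reports the smaller of the tied modes
summary = stats.describe(data)   # nobs, minmax, mean, variance, skewness, kurtosis

print(mean_value, median_value)
print(mode_result.mode, mode_result.count)
print(summary.nobs, summary.minmax, summary.variance)
```

The exact shape of mode_result.mode (scalar or one-element array) depends on the installed SciPy version, so the sketch prints it rather than assuming one form.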
Hypothesis Testing
Perform a variety of statistical tests to assess differences or relationships:
stats.ttest_ind(group1, group2) # Independent t-test
stats.chisquare(observed, expected) # Chi-square test
stats.mannwhitneyu(group1, group2) # Mann-Whitney U test
- ttest_ind(): Checks if the means of two independent samples differ significantly.
- chisquare(): Compares observed frequencies to expected frequencies for a goodness-of-fit test.
- mannwhitneyu(): Tests for differences in the distribution of two independent samples (non-parametric).
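The three tests above can be sketched on small, made-up inputs as follows. The group values and frequency counts are assumptions for illustration only. Each call returns a result that unpacks into a test statistic and a p-value.

```python
from scipy import stats

# Hypothetical samples: every value in group1 is below every value in group2
group1 = [1, 2, 3, 4, 5]
group2 = [6, 7, 8, 9, 10]

# Independent two-sample t-test (assumes equal variances by default)
t_stat, t_p = stats.ttest_ind(group1, group2)

# Mann-Whitney U test: rank-based alternative with no normality assumption
u_stat, u_p = stats.mannwhitneyu(group1, group2)

# Chi-square goodness-of-fit: observed and expected counts must have equal totals
observed = [18, 22, 20, 20]
expected = [20, 20, 20, 20]
chi2_stat, chi2_p = stats.chisquare(observed, expected)

print(t_stat, t_p)
print(u_stat, u_p)
print(chi2_stat, chi2_p)
```

A small p-value (commonly below 0.05, though the threshold is a choice) suggests rejecting the null hypothesis of no difference.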
Correlation and Regression
Evaluate relationships between variables:
stats.pearsonr(x, y) # Pearson correlation
stats.spearmanr(x, y) # Spearman rank correlation
stats.kendalltau(x, y) # Kendall’s Tau correlation
- pearsonr(): Measures linear correlation between two datasets.
- spearmanr(): Measures rank-based correlation; it captures monotonic relationships, including non-linear ones, and is less sensitive to outliers than Pearson correlation.
- kendalltau(): Measures the association between two measured quantities using rank correlation.
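To illustrate how the three coefficients differ, the sketch below uses a made-up relationship that is perfectly monotonic but not linear (y = x squared). The data and variable names are assumptions for illustration.

```python
from scipy import stats

# Hypothetical data: y grows with x, but along a curve rather than a line
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

pearson_r, pearson_p = stats.pearsonr(x, y)       # linear correlation
spearman_rho, spearman_p = stats.spearmanr(x, y)  # correlation of the ranks
kendall_tau, kendall_p = stats.kendalltau(x, y)   # agreement of pairwise orderings

print(pearson_r, spearman_rho, kendall_tau)
```

Because the ordering of y matches the ordering of x exactly, both rank-based coefficients reach 1, while the Pearson coefficient stays slightly below 1 because the relationship is curved.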