Normal Distribution
The normal distribution, also known as the Gaussian distribution, is one of the most significant probability distributions for continuous data. It is defined by two parameters:
- Mean (μ): The central value of the distribution.
- Standard Deviation (σ): Measures the amount of variation or dispersion in the data.
Mathematically, the probability density function (PDF) of the normal distribution is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Where:
- x is the random variable
- μ is the mean
- σ is the standard deviation
- e is Euler's number (approximately 2.71828)
- π is Pi (approximately 3.14159)
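The density can be evaluated term by term from its definition. The following is a minimal sketch, not a reference implementation: the helper name `normal_pdf` is hypothetical, the parameters 70 and 10 are illustrative, and the standard library's `statistics.NormalDist` is used only as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Evaluate the normal density at x directly from the formula."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    # Normalizing coefficient: 1 / (sigma * sqrt(2 * pi))
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    # Exponent: -(x - mu)^2 / (2 * sigma^2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


# Illustrative parameters (assumed for this sketch)
mu, sigma = 70.0, 10.0
reference = NormalDist(mu, sigma)

for x in (50.0, 70.0, 85.0):
    by_formula = normal_pdf(x, mu, sigma)
    by_library = reference.pdf(x)
    print(f"x = {x}: formula = {by_formula:.6f}, NormalDist.pdf = {by_library:.6f}")
```

The density peaks at x = μ, where the exponent is zero and the value reduces to the coefficient 1/(σ√(2π)).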
Key Properties
- Bell-shaped and Symmetric: The distribution is perfectly symmetrical around its mean.
- Mean, Median, and Mode are Equal: All three measures of central tendency have the same value.
- Empirical Rule (68-95-99.7 Rule):
  - Approximately 68% of the data falls within 1 standard deviation of the mean
  - Approximately 95% falls within 2 standard deviations
  - Approximately 99.7% falls within 3 standard deviations
- Standardized Form: Any normal distribution can be converted to a standard normal distribution (μ = 0, σ = 1) using the formula z = (x - μ)/σ.
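The empirical rule and the standardized form can both be checked against the cumulative distribution function. The sketch below uses the standard library's `statistics.NormalDist`; the helper names `coverage_within` and `z_score` and the values 70, 10, and 85 are assumptions made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


def z_score(x: float, mu: float, sigma: float) -> float:
    """Standardize x using z = (x - mu) / sigma."""
    return (x - mu) / sigma


# Empirical rule: exact coverage for 1, 2, and 3 standard deviations
for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {coverage_within(k) * 100:.2f}%")

# Standardization: illustrative values (assumed for this sketch)
mu, sigma, x = 70.0, 10.0, 85.0
z = z_score(x, mu, sigma)

# P(X <= x) under N(mu, sigma) matches P(Z <= z) under the standard normal
print(f"z = {z}")
print(f"P(X <= {x}) = {NormalDist(mu, sigma).cdf(x):.4f}")
print(f"P(Z <= {z}) = {standard_normal.cdf(z):.4f}")
```

The exact coverages are about 68.27%, 95.45%, and 99.73%, which the rule rounds to 68, 95, and 99.7.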
Applications
The normal distribution is widely used across many fields:
- Finance: Modeling stock returns
- Natural Sciences: Measurement errors
- Social Sciences: IQ scores, heights, and other human characteristics
- Machine Learning: Assumptions in many algorithms
- Quality Control: Manufacturing processes
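As a rough illustration of one of these applications, the sketch below treats IQ scores as normally distributed. The mean of 100 and standard deviation of 15 are assumptions (a common scaling convention, not something stated in this entry), and the variable names are hypothetical.

```python
from statistics import NormalDist

# Assumed scaling for this sketch: mean 100, standard deviation 15
iq = NormalDist(mu=100.0, sigma=15.0)

# Share of the population expected to score above 130 (two standard deviations above the mean)
share_above_130 = 1.0 - iq.cdf(130.0)

# Score that marks the 90th percentile, from the inverse CDF
score_at_p90 = iq.inv_cdf(0.90)

print(f"Share above 130: {share_above_130 * 100:.2f}%")
print(f"90th percentile score: {score_at_p90:.1f}")
```

The same two calls, `cdf` for "what share falls below a value" and `inv_cdf` for "what value marks a given share", cover most of the questions that arise in the other fields listed.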
Example
The following code creates a sample of 1,000 normally distributed data points with a mean of 70 and a standard deviation of 10, and displays this data in a 2×2 grid of plots for analysis:
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats
    import seaborn as sns

    # Set seed for reproducibility
    np.random.seed(42)

    # Generate random data from a normal distribution
    # Parameters: mean=70, standard deviation=10, size=1000
    data = np.random.normal(70, 10, 1000)

    # Create visualizations
    plt.figure(figsize=(12, 8))

    # Histogram with density curve
    plt.subplot(2, 2, 1)
    sns.histplot(data, kde=True, stat="density")
    plt.title('Histogram with Density Curve')
    plt.xlabel('Value')
    plt.ylabel('Density')

    # Q-Q plot to check normality
    plt.subplot(2, 2, 2)
    stats.probplot(data, plot=plt)
    plt.title('Q-Q Plot')

    # Box plot
    plt.subplot(2, 2, 3)
    sns.boxplot(x=data)
    plt.title('Box Plot')
    plt.xlabel('Value')

    # Verify the empirical rule
    plt.subplot(2, 2, 4)
    mean = np.mean(data)
    std = np.std(data)
    within_1_std = np.sum((mean - std <= data) & (data <= mean + std)) / len(data) * 100
    within_2_std = np.sum((mean - 2*std <= data) & (data <= mean + 2*std)) / len(data) * 100
    within_3_std = np.sum((mean - 3*std <= data) & (data <= mean + 3*std)) / len(data) * 100

    plt.bar(['1σ', '2σ', '3σ'], [within_1_std, within_2_std, within_3_std])
    plt.axhline(y=68, color='r', linestyle='-', label='68% (theoretical)')
    plt.axhline(y=95, color='g', linestyle='-', label='95% (theoretical)')
    plt.axhline(y=99.7, color='b', linestyle='-', label='99.7% (theoretical)')
    plt.title('Empirical Rule Verification')
    plt.xlabel('Standard Deviation Range')
    plt.ylabel('Percentage of Data (%)')
    plt.legend()

    plt.tight_layout()
    plt.show()

    # Statistical summary
    print("Statistical Summary:")
    print(f"Mean: {np.mean(data):.2f}")
    print(f"Median: {np.median(data):.2f}")
    print(f"Standard Deviation: {np.std(data):.2f}")
    print(f"Skewness: {stats.skew(data):.4f}")
    print(f"Kurtosis: {stats.kurtosis(data):.4f}")

    print("\nEmpirical Rule Verification:")
    print(f"Data within 1 standard deviation: {within_1_std:.2f}% (theoretical: 68%)")
    print(f"Data within 2 standard deviations: {within_2_std:.2f}% (theoretical: 95%)")
    print(f"Data within 3 standard deviations: {within_3_std:.2f}% (theoretical: 99.7%)")
The output of the above code will be:
    Statistical Summary:
    Mean: 70.19
    Median: 70.25
    Standard Deviation: 9.79
    Skewness: 0.1168
    Kurtosis: 0.0662

    Empirical Rule Verification:
    Data within 1 standard deviation: 68.60% (theoretical: 68%)
    Data within 2 standard deviations: 95.60% (theoretical: 95%)
    Data within 3 standard deviations: 99.70% (theoretical: 99.7%)
The histogram with density curve shows the bell shape characteristic of normal distributions.
The Q-Q plot compares the data quantiles against the quantiles of a theoretical normal distribution. Points that follow the diagonal line indicate that the data is approximately normal.
The box plot visualizes the central tendency and spread of the data.
The bar chart checks the 68-95-99.7 rule by showing the percentage of data points that fall within 1, 2, and 3 standard deviations of the mean, with the theoretical values drawn as horizontal lines.
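The comparison behind a Q-Q plot can also be sketched numerically, without plotting. The code below is an illustration under stated assumptions rather than a reproduction of `stats.probplot`: the plotting positions (i - 0.5)/n are one common convention chosen here, the helper names `qq_points` and `pearson_r` are hypothetical, and the samples come from the standard library's `random` module, so they differ from the NumPy sample above.

```python
import random
from statistics import NormalDist, fmean


def qq_points(sample):
    """Pair each sorted observation with the standard normal quantile at the same rank."""
    n = len(sample)
    ordered = sorted(sample)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed convention for this sketch
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
normal_sample = [rng.gauss(70, 10) for _ in range(1000)]
skewed_sample = [rng.expovariate(1 / 70) for _ in range(1000)]  # deliberately non-normal

# A correlation close to 1 means the Q-Q points lie close to a straight line
r_normal = pearson_r(*qq_points(normal_sample))
r_skewed = pearson_r(*qq_points(skewed_sample))

print(f"Q-Q correlation, normal sample: {r_normal:.4f}")
print(f"Q-Q correlation, skewed sample: {r_skewed:.4f}")
```

A correlation near 1 corresponds to points hugging the diagonal in the plot; the skewed sample should score noticeably lower because its right tail bends away from the line.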