Histograms

Matplotlib Function To Create Histogram

In Python, the hist() function in Matplotlib's pyplot module plots a histogram. It accepts the data (for example, a NumPy array) and, optionally, the range of the dataset and the number of bins.

import numpy as np
from matplotlib import pyplot as plt
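# NumPy array holding the dataset
data_array = np.array([1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7])
# Plot a histogram with 7 equal-width bins spanning 1 to 7
plt.hist(data_array, range=(1, 7), bins=7)
# Display the figure
plt.show()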

Mean of a Dataset

The mean, or average, of a dataset is calculated by adding all the values in the dataset and then dividing by the number of values in the set.

For example, for the dataset [1, 2, 3], the mean is (1 + 2 + 3) / 3 = 2.
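The same calculation can be sketched in Python. This is a minimal illustration using only the standard library; the variable names are chosen here for the example.

```python
from statistics import mean

values = [1, 2, 3]

# Add all the values first, then divide by how many there are.
manual_mean = sum(values) / len(values)

# The standard library's mean() performs the same calculation.
library_mean = mean(values)

print(manual_mean, library_mean)
```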

Histogram Bins

In a histogram, the range of the data is divided into sub-ranges represented by bins. The width of the bin is calculated by dividing the range of the dataset by the number of bins, giving each bin in a histogram the same width.
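As a rough sketch of that calculation, the helper below (bin_edges is a hypothetical name, not a library function) computes the bin width and the resulting bin edges. The inputs 20, 70, and 5 are taken from the example histogram of ages described later on this page.

```python
def bin_edges(low, high, bins):
    """Return the common bin width and the list of bin edges."""
    # Width = range of the dataset divided by the number of bins.
    width = (high - low) / bins
    # There is always one more edge than there are bins.
    edges = [low + i * width for i in range(bins + 1)]
    return width, edges

width, edges = bin_edges(20, 70, 5)
print(width)
print(edges)
```

Every bin gets the same width because a single width value is used to place all of the edges.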

What is a Histogram?

A histogram is a plot that displays the spread, or distribution, of a dataset. In a histogram, the data is split into intervals called bins. Each bin shows the number of data points that fall within its interval.

Histogram Bin Count

In a histogram, the bin count is the number of data points that fall within the bin’s range.
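To illustrate how bin counts come about, here is a small pure-Python sketch. The function bin_counts is a hypothetical helper written for this example, and the data is the array from the Matplotlib example at the top of this page. It assumes equal-width bins where the last bin also includes the upper edge, which is the convention documented for NumPy's histogram function.

```python
def bin_counts(data, low, high, bins):
    """Count how many values fall in each of `bins` equal-width bins."""
    width = (high - low) / bins
    counts = [0] * bins
    for value in data:
        # Values outside the requested range are ignored.
        if low <= value <= high:
            # Clamp the index so the upper edge lands in the last bin.
            index = min(int((value - low) / width), bins - 1)
            counts[index] += 1
    return counts

data = [1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 6, 7]
counts = bin_counts(data, low=1, high=7, bins=7)
print(counts)
```

The counts always add up to the number of data points that lie inside the chosen range.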

Histogram’s X and Y Axis

A histogram is a graphical representation of the distribution of numerical data. In a histogram, the bin ranges are on the x-axis and the counts are on the y-axis.

An example histogram.

The title of the histogram is 'Exercise Class Age Distribution'.

Along the x-axis, labeled 'Ages', the bin ranges are each 10 years, starting with age 20 and ending at age 70. The 5 bin ranges are as follows: 20-30, 30-40, 40-50, 50-60, and 60-70.

Along the y-axis, labeled 'Count', the values are whole numbers. The values start at 0 and end at 7.

The first bin range, ages 20-30, has a count of 7. The second bin range, ages 30-40, and the third bin range, ages 40-50, each have a count of 4. The fourth bin range, ages 50-60, has a count of 3. The last bin range, ages 60-70, has a count of 2.

Each bin range is the shape of a rectangle starting at count 0 and ending at the bin ranges' respective count. These rectangles are all the same color blue with a black outline. Also, there is no white space between the bin ranges.

Unimodal Distribution

Modality describes the number of peaks in a dataset. A unimodal distribution in a histogram means there is one distinct peak indicating the most frequent value in a histogram.

An example of unimodal distribution in a histogram.

The example unimodal distribution histogram does not have any labels or units. Instead, it has numerous blue rectangles with light blue outlines next to each other. These rectangles form a bell shape where the center (top) of the bell shape illustrates the one distinct peak of a unimodal distribution.

Left-Skewed Dataset

A left-skewed dataset has a long left tail with one prominent peak to the right. The median of this dataset is greater than the mean of this dataset.

An example of a left-skewed dataset. The example is titled 'Skew-Left'.

The example is a histogram without labels or units. There is a key in the upper left corner of the histogram for the mean, which is a dark blue line, and the median, which is a green line. The histogram has a long left tail with a peak to the right. The histogram is light blue and there are two lines drawn on top of the histogram. The first line from the left is a dark blue line (mean) at the bottom of the peak, but on the right side of the long tail. The second line from the left is a green line (median) directly to the left of the center of the peak.
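To make the relationship between the median and the mean of a left-skewed dataset concrete, here is a minimal sketch. The values are hypothetical, picked so that most of them cluster on the right with a few small values forming the left tail.

```python
from statistics import mean, median

# Hypothetical left-skewed data: a long tail of small values, peak near 9-10.
left_skewed = [1, 5, 8, 9, 9, 10, 10, 10]

data_mean = mean(left_skewed)
data_median = median(left_skewed)

# The few small values pull the mean down more than the median.
print(data_mean, data_median)
print(data_median > data_mean)
```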

Multimodal Dataset

If a histogram has more than two peaks, then the dataset is referred to as multimodal.

An example of a multimodal dataset.

There is a histogram labeled 'Multimodal Distribution'. There are many thin blue bars that make up the histogram. The histogram has three peaks. Two of the peaks are about the same height and are left of the center of the histogram. The third peak is taller and is on the far right side of the histogram.

Bimodal Dataset

A bimodal dataset has two distinct peaks. This typically happens when the dataset contains two different populations.

An example of a bimodal dataset.

There is a histogram shown that is labeled 'Bimodal Distribution'. There are many bars on the histogram and the histogram has two distinct peaks, one peak on the left side of the histogram, and one peak on the right side of the histogram.

Uniform Dataset

A uniform dataset does not have any distinct peaks.

As seen in the histogram below, uniform datasets have approximately the same number of values in each group represented by a bar, so there is no obvious clustering.

An example of a uniform dataset.

There is a histogram shown labeled 'Uniform Distribution'. Each of the many bars on the histogram is about the same width, and the bars are also similar in height.

Right-Skewed Dataset

In a histogram, if the prominent peak lies to the left with the tail extending to the right, then it is called a right-skewed dataset. In this case, the median is less than the mean of the dataset.

An example of a right-skewed dataset.

There is a histogram that is labeled Skew-Right. The histogram has a peak on the left and a long tail to the right. There is a key in the upper right corner of the histogram with a blue line labeled mean and a green line labeled median. Layered on top of the light blue histogram bars, there is a solid green line slightly to the right of the histogram's peak. There is a solid blue line at the bottom of the histogram's peak on the right.
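One way to see why the mean ends up above the median in a right-skewed dataset is to add a long right tail to otherwise balanced data and watch how each statistic moves. The sketch below uses hypothetical values chosen only for illustration.

```python
from statistics import mean, median

# Hypothetical balanced data: mean and median coincide.
base = [1, 1, 2, 2, 3, 3]

# Adding two large values creates a long tail to the right.
right_skewed = base + [9, 15]

mean_shift = mean(right_skewed) - mean(base)
median_shift = median(right_skewed) - median(base)

# The mean is dragged toward the tail much further than the median.
print(mean(right_skewed), median(right_skewed))
print(mean_shift, median_shift)
```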

Symmetric Distribution in Histogram

In a histogram, the distribution of the data is symmetric if it has one prominent peak and equal tails to the left and the right. The median and the mean of a symmetric dataset are similar.

An example of symmetric distribution in a histogram. The example is titled 'Symmetric'.

The example is a histogram without labels or units. It also has a key in the upper right for the mean, indicated by a dark blue line, and median, indicated by a green line. The histogram is light blue with one peak and equal tails to the left and right of the peak. From left to right: there is a dark blue line (mean) and green line (median) next to each other at the center of the histogram's peak.

Dataset Outliers

An outlier is a data point that differs significantly from the rest of the values in a dataset.

For example, in the dataset [1, 2, 3, 4, 100] the value 100 is an outlier because it lies a large distance from the rest of the data.
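As a small illustration of how far the value 100 sits from the rest of the data, the sketch below compares the mean and median with and without it. It is only a demonstration on this one dataset, not a general rule for detecting outliers.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]
without_outlier = [1, 2, 3, 4]

# The single extreme value changes the mean dramatically...
mean_with = mean(data)
mean_without = mean(without_outlier)

# ...while the median barely moves.
median_with = median(data)
median_without = median(without_outlier)

print(mean_with, mean_without)
print(median_with, median_without)
```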

Spread of a Dataset

The spread of a dataset is the dispersion from the dataset’s center. The descriptive statistics that describe the spread are range, variance and standard deviation.

For example, for the dataset [1, 4, 7, 10], the range is the maximum value of the set minus the minimum value of the set, or 10 - 1 = 9.
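All three spread statistics can be computed for this dataset with the standard library, as sketched below. One assumption to note: the example treats the four values as a whole population and so uses pvariance and pstdev; the sample versions (variance and stdev) divide by n - 1 and give larger results.

```python
from statistics import pstdev, pvariance

data = [1, 4, 7, 10]

# Range: maximum value minus minimum value.
data_range = max(data) - min(data)

# Population variance: average squared distance from the mean.
variance = pvariance(data)

# Population standard deviation: square root of the variance.
std_dev = pstdev(data)

print(data_range, variance, std_dev)
```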

Peak of Unimodal Distribution

In a unimodal distribution, the peak marks the center of the dataset. The statistics that describe the center of a dataset are the mean and median.