Aggregation refers to using one value to describe multiple datapoints. Calculating an average is the classic example of aggregation, because we use one value (the average) to describe the “center” of multiple datapoints.
Aggregations like the average are also called summary statistics because they summarize an entire group of data using a statistic.
NumPy is a Python library for mathematical computations, including summary statistics.
The standard convention to import NumPy is to use the alias np
:
import numpy as np
The pandas library is built on top of NumPy, which means we can apply NumPy functions to pandas objects like Series and DataFrames.
Some common NumPy aggregation functions include
np.mean()
, np.median()
for the mean and mediannp.max()
, np.min()
for maximum and minimum valuesnp.sum()
to sum all the values in an arrayFor example, let’s use NumPy to calculate the mean test score in scores
:
student_ID | test_score |
---|---|
001 | 87 |
002 | 93 |
003 | 90 |
004 | 96 |
005 | 85 |
np.mean(scores['test_score'])# Output: 90.2
Pandas provides built-in methods to aggregate DataFrame columns.
Common built-in summary methods include:
.mean()
returns the mean.median()
returns the median.std()
returns the standard deviation.max()
and .min()
return the maximum and minimum values respectively.nunique()
returns the count of unique values.count()
returns the count of non-null values.sum()
returns the sum# Summarize a single columndf['col'].summary_method()# Summarize multiple columnsdf[['col1', 'col2']].summary_method()
.groupby()
MethodThe pandas .groupby()
method splits a DataFrame into groups corresponding to the unique values in a column.
Here is the df
referred to in the code snippet:
date | home_team | away_team | win_margin |
---|---|---|---|
2021-09-24 | El Salvador | Guatemala | -2 |
2019-07-07 | United States | Mexico | -1 |
2021-06-27 | El Salvador | Guatemala | 0 |
2016-03-25 | El Salvador | Honduras | 0 |
# Split df into groups based on each home_teamdf.groupby('home_team')# Split df into groups based on home_team vs away_team matchupsdf.groupby(['home_team', 'away_team'])
Calling an aggregation method in Pandas after performing a groupby will apply that aggregation to the individual groups in the groupby.
For example, the code snippet uses .sum()
to sum the total number of home_score
goals by each home_team
in df
:
home_team | home_score |
---|---|
England | 3 |
Brazil | 3 |
South Korea | 2 |
England | 2 |
# Applying the aggregation function .sum()df.groupby('home_team')['home_score'].sum()# Output:# home_team# Brazil 3# England 5# South Korea 2
.agg()
MethodPandas’ .agg()
method is a flexible way to apply aggregation functions to groups of a groupby
.
The input to .agg()
is a dictionary where
For example, the code snippet applies sum
to compute the total number of home_score
goals by each home_team
in df
:
home_team | home_score |
---|---|
England | 3 |
Brazil | 3 |
South Korea | 2 |
England | 2 |
# Using .agg() to apply aggregation functionsdf.groupby('home_team').agg({'home_score':'sum'})# Output:# home_team# Brazil 3# England 5# South Korea 2
.pivot_table()
MethodThe pandas method pd.pivot_table()
transforms a DataFrame into wide format that is more human-readable using the input parameters:
index
is used to label the rows of the tablecolumns
is used to label the columns of the tablevalues
is used to fill the tableaggfunc
is the aggregation function applied to values
For example, say df
contains sales data for two products in two regions:
Product | Region | Sales |
---|---|---|
A | Asia | 1000 |
B | Asia | 1500 |
A | Asia | 800 |
B | North America | 1200 |
A | North America | 1100 |
B | North America | 1700 |
The code snippet pivots df
to calculate the mean sales of each product in each region:
Product | A | B |
---|---|---|
Region | ||
Asia | 900 | 1500 |
North America | 1100 | 1450 |
# Pivot df to calculate the mean sales of each product in each regionpivot_table = pd.pivot_table(df,values='Sales',index='Region',columns='Product',aggfunc='mean')
Split-apply-combine or SAC is a common workflow in data science to answer data questions.
SAC involves a three-step process where we
An inner merge combines two DataFrames by identifying matching values. For example, here are two DataFrames
X | Y | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
In an inner merge on the Category
columns, only the matching category B
would be retained, together with the corresponding data from both DataFrames:
Category | Value_x | Value_y |
---|---|---|
B | 2 | 3 |
The pandas code is
pd.merge(left = X,right = Y,left_on = 'Category',right_on = 'Category',how = 'inner')
In Pandas, a left or right merge combines two DataFrames, preserving all rows on either the left or the right DataFrame respectively.
The type is specified using the how
keyword:
pd.merge(left = X,right = Y,left_on = 'Category',right_on = 'Category',how = 'right')
For example, here are two DataFrames:
X | Y | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
Performing a right merge, we maintain all rows of the right DataFrame:
Category | Value_x | Value_y |
---|---|---|
B | 3 | |
C | 4 |
and then add on any existing data from the left DataFrame:
Category | Value_x | Value_y |
---|---|---|
B | 2.0 | 3 |
C | NaN | 4 |
In Pandas, an outer merge combines two (or more) DataFrames, maintaining all rows from both DataFrames. To perform an outer merge, specify outer
as the how
argument:
pd.merge(left = X,right = Y,left_on = 'Category',right_on = 'Category',how = 'outer')
For example, here are two DataFrames
X | Y | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
To outer merge them, we match values to the different categories, filling in NaN
s where no value exists:
Category | Value_x | Value_y |
---|---|---|
A | 1 | NaN |
B | 2 | 3 |
C | NaN | 4 |