Seaborn can also calculate aggregate statistics for large datasets. To understand why this is helpful, we must first understand what an aggregate is.
An aggregate statistic, or aggregate, is a single number used to describe a set of data. One example of an aggregate is the average, or mean, of a data set. There are many other aggregate statistics as well.
Suppose we have a grade book with columns `student`, `assignment_name`, and `grade`, as shown below.
student | assignment_name | grade |
---|---|---|
Amy | Assignment 1 | 75 |
Amy | Assignment 2 | 82 |
Bob | Assignment 1 | 99 |
Bob | Assignment 2 | 90 |
Chris | Assignment 1 | 72 |
Chris | Assignment 2 | 66 |
… | … | … |
To calculate a student’s current grade in the class, we need to aggregate the grade data by student. To do this, we’ll calculate the average of each student’s grades, resulting in the following data set:
student | grade |
---|---|
Amy | 78.5 |
Bob | 94.5 |
Chris | 69 |
… | … |
On the other hand, we may be interested in understanding the relative difficulty of each assignment. In this case, we would aggregate by assignment, taking the average of all students’ scores on each assignment:
assignment_name | grade |
---|---|
Assignment 1 | 82 |
Assignment 2 | 79.3 |
… | … |
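The two aggregations above can be reproduced with a short sketch. It assumes pandas is available and uses only the six rows visible in the sample table, so the name `gradebook_sample` and its contents are illustrative and not the full grade book.

```python
import pandas as pd

# Assumption: only the six rows shown in the sample table; the real grade book has more.
gradebook_sample = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob", "Chris", "Chris"],
    "assignment_name": ["Assignment 1", "Assignment 2"] * 3,
    "grade": [75, 82, 99, 90, 72, 66],
})

# Aggregate by student: one mean grade per student.
by_student = gradebook_sample.groupby("student")["grade"].mean()

# Aggregate by assignment: one mean grade per assignment.
by_assignment = gradebook_sample.groupby("assignment_name")["grade"].mean()

print(by_student)
print(by_assignment.round(1))
```

Grouping by a different column changes the question being answered: the same grades yield one number per student in the first case and one number per assignment in the second.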
In both of these cases, the function we used to aggregate our data was the average, or mean, but there are many other types of aggregate statistics, including:
- Median
- Mode
- Standard Deviation
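As a rough illustration of the statistics listed above, the sketch below computes each one. The `grades` array reuses the six sample grades from the first table. The `repeated_grades` list is a hypothetical example added only because the sample grades contain no repeated value. NumPy has no built-in mode function, so this sketch uses the standard library's `statistics.mode` for that one.

```python
import statistics

import numpy as np

# The six grades from the sample table (illustrative subset of the grade book).
grades = np.array([75, 82, 99, 90, 72, 66])

# Median: the middle value of the sorted data (mean of the two middle values here).
median_grade = np.median(grades)

# Standard deviation: np.std computes the population value (ddof=0) by default.
std_grade = np.std(grades)

# Mode: the most frequent value. Hypothetical data with one repeated grade.
repeated_grades = [70, 85, 85, 90]
mode_grade = statistics.mode(repeated_grades)

print(median_grade, std_grade, mode_grade)
```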
In Python, you can compute aggregates quickly and easily using NumPy, a popular Python library for numerical computing. You’ll use NumPy in this exercise to compute aggregates for a DataFrame.
Instructions
To calculate aggregates using NumPy, you’ll first need to import the NumPy library at the top of script.py.
Type the following at the top of your file:
import numpy as np
Next, take a minute to understand the data you’ll analyze. The DataFrame `gradebook` contains the complete gradebook for a hypothetical classroom. Use `print` to examine `gradebook`.
Select all rows from the `gradebook` DataFrame where `assignment_name` is equal to `Assignment 1`. Save the result to the variable `assignment1`.
Check out the DataFrame you just created. Print `assignment1`.
Now use NumPy to calculate the median grade in `assignment1`.
Use `np.median()` to calculate the median of the column `grade` from `assignment1` and save it to `asn1_median`.
Display `asn1_median` using `print`. What is the median grade on Assignment 1?
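The steps above might look roughly like the following sketch. In the exercise, `gradebook` is already provided. Here a small stand-in DataFrame is built from the six sample rows shown earlier so the code runs on its own, which means the printed median reflects only this stand-in data and not the full gradebook.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in for the exercise's gradebook, using only the six sample rows.
gradebook = pd.DataFrame({
    "student": ["Amy", "Amy", "Bob", "Bob", "Chris", "Chris"],
    "assignment_name": ["Assignment 1", "Assignment 2"] * 3,
    "grade": [75, 82, 99, 90, 72, 66],
})
print(gradebook)

# Keep only the rows for Assignment 1 using a boolean mask.
assignment1 = gradebook[gradebook.assignment_name == "Assignment 1"]
print(assignment1)

# Median of the grade column for the selected rows.
asn1_median = np.median(assignment1.grade)
print(asn1_median)
```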