In this lesson, you'll learn how to calculate quantiles using NumPy.

Quantiles are points that split a dataset into groups of equal size. For example, let's say you just took a test and wanted to know whether you're in the top 10% of the class. One way to determine this would be to split the data into ten groups with an equal number of datapoints in each group and see which group you fall into.

<img src="https://content.codecademy.com/courses/statistics/quantiles/deciles.svg" alt="Thirty students grades split into ten groups of three."> 

There are nine values that split the dataset into ten groups of equal size &mdash; each group has 3 different test scores in it. 

Those nine values that split the data are quantiles! Specifically, they are the 10-quantiles, or deciles.

You can find any number of quantiles. For example, if you split the dataset into 100 groups of equal size, the 99 values that split the data are the 100-quantiles, or percentiles.

The <a href="https://www.codecademy.com/courses/quartiles-quantiles-and-interquartile-range/lessons/quartiles/exercises/quartiles">quartiles</a> are some of the most commonly used quantiles. The quartiles split the data into four groups of equal size.

In this lesson, we'll show you how to calculate quantiles using NumPy and discuss some of the most commonly used quantiles.

Quantiles

The NumPy library has a function named `quantile()` that will quickly calculate the quantiles of a dataset for you.

`quantile()` takes two parameters. The first is the dataset that you are using. The second parameter is a single number or a list of numbers between `0` and `1`. These numbers represent the places in the data where you want to split.

For example, if you only wanted the value that split the first 10% of the data apart from the remaining 90%, you could use this code:

```py
import numpy as np

dataset = [5, 10, -20, 42, -9, 10]
ten_percent = np.quantile(dataset, 0.10)
```

`ten_percent` now holds the value `-14.5`. This result _technically_ isn't a quantile, because it isn't splitting the dataset into groups of equal sizes &mdash; this value splits the data into one group with 10% of the data and another with 90%. 

However, it would still be useful if you were curious about whether a data point was in the bottom 10% of the dataset.


Quantiles in NumPy

In the last exercise, we found a single "quantile" &mdash; we split the first 23% of the data away from the remaining 77%.

However, quantiles are usually a set of values that split the data into groups of equal size. For example, if you wanted to get the 5-quantiles, or the four values that split the data into five groups of equal size, you could use this code:

```py
import numpy as np

dataset = [5, 10, -20, 42, -9, 10]
quintiles = np.quantile(dataset, [0.2, 0.4, 0.6, 0.8])
```
Note that we had to do a little math in our head to make sure that the values `[0.2, 0.4, 0.6, 0.8]` split the data into groups of equal size. Each group has 20% of the data.

<img src="https://content.codecademy.com/courses/statistics/quantiles/even.svg" alt="The data is split into 5 groups where each group has 4 datapoints.">

If we used the values `[0.2, 0.4, 0.7, 0.8]`, the function would return the four values at those split points. However, those values wouldn't split the data into five equally sized groups. One group would only have 10% of the data and another group would have 30% of the data!

<img src="https://content.codecademy.com/courses/statistics/quantiles/uneven.svg" alt="The data is split into groups of uneven size. One group has 6 data points and one group only has 2.">
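One way to avoid doing that math in your head is to generate the split points programmatically. The sketch below assumes `np.linspace()` is an acceptable way to build them; the names `split_points` and `quintiles` are ours, not part of the lesson.

```py
import numpy as np

dataset = [5, 10, -20, 42, -9, 10]

# np.linspace(0, 1, 6) returns six evenly spaced numbers from 0 to 1.
# Dropping the endpoints 0 and 1 leaves the four split points
# 0.2, 0.4, 0.6, 0.8 needed for the 5-quantiles.
split_points = np.linspace(0, 1, 6)[1:-1]

quintiles = np.quantile(dataset, split_points)

print(split_points)
print(quintiles)
```

To split the data into a different number of groups, replace `6` with the number of groups plus one.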

Many Quantiles

One of the most common quantiles is the 2-quantile. This value splits the data into two groups of equal size. Half the data will be above this value, and half the data will be below it. This is also known as the <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction">median</a>!

<img src="https://content.codecademy.com/courses/statistics/quantiles/median.svg" alt="Ten points are below the median and ten points are above the median.">

The 4-quantiles, or the <a href="https://www.codecademy.com/courses/quartiles-quantiles-and-interquartile-range/lessons/quartiles/exercises/quartiles">quartiles</a>, split the data into four groups of equal size. We found the quartiles in the previous exercise. 

<img src="https://content.codecademy.com/courses/statistics/quantiles/quartiles.svg" alt="Quartiles split a dataset of 20 points into 4 groups with 5 points each">

Finally, the percentiles, or the values that split the data into 100 groups, are commonly used to compare new data points to the dataset. You might hear statements like "You are above the 80th percentile in height". This means that your height is above whatever value splits the first 80% of the data from the remaining 20%.
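The median, the quartiles, and a percentile can all be computed with the same `np.quantile()` function. The sketch below uses a small set of heights that we made up for illustration; `heights` and `my_height` are hypothetical.

```py
import numpy as np

# Hypothetical heights in centimeters, made up for this example.
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185]

# The 2-quantile (median) is the value at the 0.5 split point.
median = np.quantile(heights, 0.5)

# The 4-quantiles (quartiles) sit at 0.25, 0.5, and 0.75.
quartiles = np.quantile(heights, [0.25, 0.5, 0.75])

# The 80th percentile splits the first 80% of the data from the remaining 20%.
eightieth_percentile = np.quantile(heights, 0.8)

# A hypothetical new data point to compare against the dataset.
my_height = 182
print(median, quartiles, eightieth_percentile)
print(my_height > eightieth_percentile)
```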


Common Quantiles

Nice work! Here are some of the major takeaways about quantiles:

* Quantiles are values that split a dataset into groups of equal size.
* If you have `n` quantiles, the dataset will be split into `n+1` groups of equal size.
* The median is a quantile. It is the only 2-quantile. Half the data falls below the median and half falls above the median.
* Quartiles and percentiles are other common quantiles. Quartiles split the data into 4 groups while percentiles split the data into 100 groups.
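The first two takeaways can be checked with a short sketch. It assumes a made-up dataset of thirty distinct scores (`scores` is hypothetical) and uses `np.digitize()` and `np.bincount()` to count how many scores land in each group.

```py
import numpy as np

# Thirty hypothetical test scores: 1, 2, ..., 30.
scores = np.arange(1, 31)

# Nine split points (0.1 through 0.9) give the nine deciles.
deciles = np.quantile(scores, np.linspace(0.1, 0.9, 9))

# np.digitize labels each score with the index of the group it falls in,
# and np.bincount counts how many scores received each label.
group_labels = np.digitize(scores, deciles)
group_sizes = np.bincount(group_labels)

print(len(deciles))
print(group_sizes)
```

Nine quantiles produce ten groups of three scores each. With repeated values, or a dataset whose size doesn't divide evenly, the groups may be only approximately equal.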

Quantiles Review

In this course, you will learn how to calculate three different statistics that will help you articulate your understanding of the spread of a dataset.

In this course, you will learn how to calculate three important descriptive statistics that describe the spread of the data.

Quartiles, Quantiles, and Interquartile Range

In this lesson, you will learn how to find the median of a dataset by hand, and using Python's NumPy. 

We will also discuss the strengths and limitations of using median as a descriptive statistic of a dataset.

In this lesson, you will learn how to find the *median* of a dataset &mdash; a common measure of a dataset's center. Each of the next three exercises will cover the following topics:
- Manually finding the median of a dataset
- Using Python's NumPy library to find the median of a dataset
- Interpreting what it means for a dataset to have similar and different median and mean values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the median to answer the question: 

*When are great authors most likely to publish their best work?*

If you are not familiar with mean, also known as average, we recommend that you learn about it in our lesson on <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/average/exercises/introduction" target="_blank">average</a>.




Introduction

The formal definition for the median of a dataset is:

*The value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, you either report both of the middle two values or their average.*

There are always two steps to finding the median of a dataset:
1. Order the values in the dataset from smallest to largest
2. Identify the number(s) that fall(s) in the middle 

#### Example One: Even Number of Values

Say we have a dataset with the following ten numbers: 
```tex
24,\ 16,\ 30,\ 10,\ 12,\ 28,\ 38,\ 2,\ 4,\ 36
```

The first step is to order these numbers from smallest to largest:

```tex
2,\ 4,\ 10,\ 12,\ [16,\ 24],\ 28,\ 30,\ 36,\ 38
```

Because this dataset has an even number of values, there are two medians: `16` and `24` &mdash; `16` has four datapoints to the left, and `24` has four datapoints to the right.        

Although you can report both values as the median, people often average them. If you averaged `16` and `24`, you could report the median as `20`.

#### Example Two: Odd Number of Values

If we added another value (say, `24`) to the dataset and sorted it, we would have: 
```tex
2,\ 4,\ 10,\ 12,\ 16,\ [24],\ 24,\ 28,\ 30,\ 36,\ 38
```
The new median is equal to `24`, because there are 5 values to the left of it, and 5 values to the right of it.
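The two steps can also be written as a short function. This is a sketch for illustration only; `manual_median` is a name we made up, and it reports the average of the two middle values when the dataset has an even number of values.

```py
def manual_median(values):
    # Step 1: order the values from smallest to largest.
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2

    # Step 2: identify the number(s) in the middle.
    if n % 2 == 1:
        # Odd number of values: a single middle value.
        return ordered[middle]
    # Even number of values: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


even_dataset = [24, 16, 30, 10, 12, 28, 38, 2, 4, 36]
odd_dataset = even_dataset + [24]

print(manual_median(even_dataset))
print(manual_median(odd_dataset))
```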


Median

Finding the median of a dataset becomes increasingly time-consuming as the size of your dataset increases &mdash; imagine finding the median of an unsorted dataset with 10,000 observations.

The NumPy `np.median()` function can do the work of sorting and then finding the median for you. In the example below, we use `np.median()` to calculate the median of a dataset with eleven values:

```py
import numpy as np

example_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36, 42])

example_median = np.median(example_array)

print(example_median)
```

The code above prints the median of the dataset, `24.0`. The mean of this dataset is `22`. These two values are close to one another, but not equal.


Median NumPy

In this lesson, you learned how to find the median of a dataset in two steps:
1. Sort the dataset
2. Identify the one or two numbers that fall in the middle of the sorted dataset

You also learned how to calculate the median using NumPy:
```py
np.median(my_array)
```

#### Discussion

Take a look at the histogram to the right. It displays the author age distribution with vertical lines for the mean (red) and median (yellow).

Do you feel like the median of our dataset, 40.5, provides us enough information to claim when authors publish their greatest work?

We argue it does not.

Although the median is a good measure of the dataset's center, we cannot make a definitive claim about when authors publish their greatest work &mdash; the youngest author published at 18 and the oldest at 76. It would be irresponsible to say anything but, "it seems to be possible at almost any age."

Notice that the mean and the median are nearly equal. This is not a surprising result, as both statistics are a measure of the dataset's center. However, it's worth noting that these results will not always be so close.
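To see how far apart the two statistics can drift, here is a sketch that takes the ten-number dataset from the median examples and adds one extreme value. The outlier `1000` is made up for illustration.

```py
import numpy as np

values = [2, 4, 10, 12, 16, 24, 28, 30, 36, 38]

# Add a single extreme value (hypothetical) to the dataset.
values_with_outlier = values + [1000]

# Without the outlier, the mean and median agree.
print(np.mean(values), np.median(values))

# With the outlier, the mean jumps while the median barely moves.
print(np.mean(values_with_outlier), np.median(values_with_outlier))
```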

In the instructions below, we've written a brief explanation that puts median in the context of our problem.

Review and Discussion

In this project, you will use your knowledge of mean, median and mode to make conclusions about three boroughs in New York City: Brooklyn, Manhattan, and Queens.

We've imported data about one-bedroom apartments in three of New York City's boroughs: Brooklyn, Manhattan, and Queens. We saved the values to:

- `brooklyn_one_bed`
- `manhattan_one_bed`
- `queens_one_bed`

In this project, we only care about the price of apartments, so we saved the price of apartments in each borough to:

- `brooklyn_price`
- `manhattan_price`
- `queens_price`


If you want to see what these arrays look like, you can use print statements to see them in the output terminal.

Before starting the next few steps, delete any `print()` statements you've added.

Find the average value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_mean`.

Find the average value of one-bedroom apartments in Manhattan and save the value to `manhattan_mean`.

Find the average value of one-bedroom apartments in Queens and save the value to `queens_mean`.

Find the median value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_median`.

Find the median value of one-bedroom apartments in Manhattan and save the value to `manhattan_median`.

Find the median value of one-bedroom apartments in Queens and save the value to `queens_median`.

Find the mode value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_mode`.

Find the mode value of one-bedroom apartments in Manhattan and save the value to `manhattan_mode`.

Find the mode value of one-bedroom apartments in Queens and save the value to `queens_mode`.

Now what?

We don't find the mean, median, and mode of a dataset for the sake of it.

The point is to make inferences from our data. What can you say about the housing prices in Brooklyn, Queens, and Manhattan? Besides, "It's really expensive to live in any of them."

Take a minute to think through it. We added our thoughts to the hint.

Did you make any assumptions when you drew inferences in the previous task?

If so, what assumptions did you make? We added our thoughts to the hint.



Finally, think about what the histogram for each dataset will look like. 

If you have the time, take a minute to make a rough sketch of the histograms for the cost of a one-bedroom apartment in Brooklyn, Manhattan, and Queens. 

You can see someone else's attempt at a sketch of the Brooklyn histogram.

![Brooklyn Sketch](https://content.codecademy.com/courses/statistics/central-tendency/brooklyn-histogram.png)

When you're finished, open the hint to take a look at the actual histograms for Brooklyn, Manhattan, and Queens.

Central Tendency for Housing Data

In this lesson, you will learn how to find the mode of a dataset manually, and using Python's SciPy library.

In this lesson, you will learn how to find the *mode* of a dataset. Each of the next three exercises will cover the following:
- Manually finding the mode of a dataset
- Using Python's SciPy library to find the mode
- Comparing mode to mean and median values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the mode to answer the question: 

*What is the most common age for a great author to publish their best work?*

If you are not familiar with mean, also known as average, or median, we recommend that you learn about it in our lessons on <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/average/exercises/introduction" target="_blank">average</a> and <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction" target="_blank">median</a>.




The formal definition for the mode of a dataset is:

*The most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.*

While you may be able to find the mode of a small dataset by simply looking through it, if you have trouble, we recommend you follow these two steps:
1. Find the frequency of every unique number in the dataset
2. Determine which number has the highest frequency

#### Example

Say we have a dataset with the following ten numbers: 
```tex
24,\ 16,\ 12,\ 10,\ 12,\ 28,\ 38,\ 12,\ 28,\ 24
```

Let's find the frequency of each number: 

<div class="narrative-table-container">

|*24*|*16*|*12*|*10*|*28*|*38*|
|-|-|-|-|-|-|
|2|1|3|1|2|1|

</div>
    

From the table, we can see that our mode is `12`, the most frequent number in our dataset. 
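The same two steps can be sketched with Python's built-in `collections.Counter`. This is one possible approach rather than the lesson's method; the names `frequencies`, `mode_value`, and `mode_count` are ours.

```py
from collections import Counter

dataset = [24, 16, 12, 10, 12, 28, 38, 12, 28, 24]

# Step 1: find the frequency of every unique number.
frequencies = Counter(dataset)

# Step 2: determine which number has the highest frequency.
mode_value, mode_count = frequencies.most_common(1)[0]

print(frequencies)
print(mode_value, mode_count)
```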


Mode

Finding the mode of a dataset becomes increasingly time-consuming as the size of your dataset increases &mdash; imagine finding the mode of a dataset with 10,000 observations.

The SciPy `stats.mode()` function can do the work of finding the mode for you. In the example below, we import `stats` then use `stats.mode()` to calculate the mode of a dataset with ten values:


#### Example: One Mode

```py
import numpy as np
from scipy import stats

example_array = np.array([24, 16, 12, 10, 12, 28, 38, 12, 28, 24])

example_mode = stats.mode(example_array)
```

The code above calculates the mode of the values in `example_array` and saves it to `example_mode`. 

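The result of `stats.mode()` is an object with the mode value and its count. The exact printed form depends on your SciPy version: older releases wrap each value in an array, as shown here, while newer releases return plain scalars by default.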
```
>>> example_mode
ModeResult(mode=array([12]), count=array([3]))
```


#### Example: Two Modes

If there are multiple modes, the `stats.mode()` function will always return the smallest mode in the dataset.

Let's look at an array with two modes, `12` and `24`:

```py
import numpy as np
from scipy import stats

example_array = np.array([24, 16, 12, 10, 12, 24, 38, 12, 28, 24])

example_mode = stats.mode(example_array)
```

The result of `stats.mode()` is an object with the smallest mode value, and its count.
```
>>> example_mode
ModeResult(mode=array([12]), count=array([3]))
```
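If you need every mode rather than just the smallest, one option is to count the values yourself. The sketch below swaps in NumPy's `np.unique()` in place of `stats.mode()`; this is our substitution, not part of the lesson, and `all_modes` is a name we chose.

```py
import numpy as np

example_array = np.array([24, 16, 12, 10, 12, 24, 38, 12, 28, 24])

# np.unique returns the sorted unique values and how often each occurs.
values, counts = np.unique(example_array, return_counts=True)

# Keep every value whose count ties for the maximum.
all_modes = values[counts == counts.max()]

print(all_modes)
```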

Mode SciPy

In this lesson, you learned how to find the mode of a dataset in two steps:
1. Find the frequency of every unique number in the dataset
2. Determine which number has the highest frequency

You also learned how to calculate the mode using SciPy:
```py
stats.mode(my_array)
```

#### Discussion

In this lesson, you found that 38 was the most common age, at publication, for an author from the Le Monde survey. How does this number compare to your guess from the beginning of the lesson?

The mode is close to the median and mean of the dataset, but it is not in the tallest bucket. This should not be surprising, as the histogram indicates the data is centered between the ages of 30 and 50 &mdash; there is a higher chance of a mode in that range than outside of it.

The mode is not always this close to the median and mean, and often will not be in the tallest bucket.

Look at the 25-30 year-old bin. There are nine observations in it. If all the values in that bin happened to be 27, then the dataset's mode would be 27. Although unlikely, it is possible. Below, we show what this would look like:

![Mode set to 27](https://content.codecademy.com/courses/statistics/mode/mode-at-27.png)

Based on this graph, it is fair to say the mode may not always be a great measure of where the data is centered. Simply put, mode is a measure of the most frequent observation in the dataset, and is not an indication of the tallest bin in a histogram.

In the instructions below, we've written a brief explanation that puts mode in the context of our problem.



In this lesson, you will learn how to calculate the average of a dataset by hand, and use Python's NumPy library to calculate it for you. We will also discuss the strengths and limitations of using average as a summary statistic of a dataset.

Finding the center of a dataset is one of the most common ways to summarize statistical findings. Often, people communicate the center of data using words like, *on average*, *usually*, or *often*.

In this lesson, you will learn how to calculate the *mean* of a dataset, a common measure of a dataset's center. We will use the mean to help us answer the question, 

*When are adults their most creative and productive?*

You could define "creative" and "productive" in a lot of ways, making this question impossible to fully answer by the end of this lesson. However, you will form an informed opinion on the question using data of the one hundred greatest novels of all time. 

We collected the dataset from a survey administered by the French literary magazine, Le Monde. From the dataset, you will calculate the average age of the authors when their books were published.


The *mean*, often referred to as the *average*, is a way to measure the center of a dataset. 

The average of a set is calculated using a two-step process:
1. Add all of the observations in your dataset.
2. Divide the total sum from step one by the number of points in your dataset.

```tex
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
```

The equation above is used to calculate mean. `x1`, `x2`, ... `xn` are observations from a dataset of `n` observations. 


#### Example

Imagine that we wanted to calculate the average of a dataset with the following four observations:
```py
data = [4, 6, 2, 8]
```
##### Step One: Calculate the total

```tex

4 + 6 + 2 + 8 = 20

```

##### Step Two: Divide by the number of observations

The total is equal to 20, and the number of observations is equal to 4.
```tex
\frac{20}{4} = 5
```

The average of this dataset is equal to 5.
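The same two steps can be written in plain Python. This is a minimal sketch using the built-in `sum()` and `len()` functions; the names `total` and `average` are ours.

```py
data = [4, 6, 2, 8]

# Step one: calculate the total.
total = sum(data)

# Step two: divide by the number of observations.
average = total / len(data)

print(total)
print(average)
```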


Calculating Mean

While you've shown that you can calculate the average yourself, it becomes time-consuming as the size of your dataset increases &mdash; imagine adding all of the numbers in a dataset with 10,000 observations.

The NumPy `.average()` or `.mean()` function can do the work of adding and dividing for you. In the example below, we use `np.average()` to calculate the average of a dataset with ten values.

```py
import numpy as np

example_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36])

example_average = np.average(example_array)

print(example_average)
```

The code above calculates the average of `example_array` and saves the value to `example_average`. The resulting average of this array is `20.0`.
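For a plain, unweighted average, `np.mean()` gives the same result as `np.average()`. A quick check, as a sketch:

```py
import numpy as np

example_array = np.array([24, 16, 30, 10, 12, 28, 38, 2, 4, 36])

# With no extra arguments, both calls add the values
# and divide by how many there are.
average_result = np.average(example_array)
mean_result = np.mean(example_array)

print(average_result, mean_result)
```

The two functions differ in their optional arguments: `np.average()` also accepts a `weights` argument for weighted averages.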


NumPy Average

In this lesson, you learned how to calculate the average of a dataset using the formula:

```tex
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
```

and the NumPy function:
```py
np.average(my_array)
```

---

Circling back to the original question, do you feel like the average of our dataset, 42.12, provides us enough information to claim when someone is their most creative and productive?

Take a look at the histogram and mean (in red) to the right as you consider this question. 

We would say, **No**. Though we could argue against its use for a few reasons, below, we've highlighted two:
- The date of publication is not necessarily an author's most creative year. When did they start authoring the book? What factors impacted their writing during those years?
- The average age of the publishing dates for 100 authors may not accurately measure peak creativity in other professions. The average age of painters or sculptors may be very different.

So, what kind of information does the average provide us, and why would we use the average to describe something when we could display a histogram? 

The most important outcome is that we're able to use a single number as a measure of centrality. Although histograms provide more information, they are not a concise or precise measure of centrality &mdash; the reader must interpret it for themselves.


Mean

The sum of the values in the dataset must be divided by the number of values in the dataset (`n`).

The values in the dataset should be multiplied together.

You need to take the square root of all values in the dataset.

The mean of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle.

The median of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, the middle two values are the median.

The median of a dataset is calculated by adding all of the values of the set together, then dividing the sum by the number of values in the set.

The median of a dataset is the most frequently occurring observation in the dataset. A dataset can have multiple medians if there is more than one value with the same maximum frequency.

The median of a dataset is a measure of its spread. It is calculated by finding the average of the squared differences between every observation and the mean. The resulting value is in units squared.

\{6, 8, 9, 22, 90, 45, 2, 22, 45, 8, 22, 6, 7\}

\{15, 8, 9, 15, 12, 13, 2, 15, 13, 8, 13, 6, 7\}

Test your central tendency knowledge with this quiz on mean, median and mode.

Mean, Median, and Mode

The difference between the `i`th data point and the mean.

The sum of the difference between every data point and the mean.

The difference between the median and the mean of a dataset.

This is the entire equation for variance.

Squaring the difference results in a positive number. This prevents data points above and below the mean canceling each other out.

Squaring the difference makes the units of variance units squared, which is easier to interpret.

You don’t square the difference between every data point and the mean. Instead, you average the distance between every data point and the mean and _then_ square the result.

By squaring the difference, you calculate the standard deviation, which is an easier statistic to compare to other descriptive statistics like the mean.

Square the difference between each data point and the mean.

Divide the variance by the number of points in the dataset.

A datapoint that is `3.5` standard deviations below the mean.

A datapoint that is `3` standard deviations above the mean.

A datapoint that is `1` standard deviation below the mean.

Test your understanding of the descriptive statistics variance and standard deviation.

Variance and Standard Deviation

Find the best time to visit London by examining weather data.

All of the weather data is stored in a variable named `london_data`. 

Print the first few rows of the dataset by calling `print(london_data.head())`.

Take a look at the browser to see the columns of this dataset. Here are two questions to ask yourself:
* How often were measurements taken?
* Which columns might be the most useful when thinking about planning a trip?

If you want to see different rows of the data, you can try something like this:

```py
print(london_data.iloc[100:200])
```

This will print rows 100 through 199.

Comment out these print statements after looking through the results.

Let's also take a look at how many data points we have. Print `len(london_data)`.

Now that we've seen what the data looks like, let's dive into one of the more promising columns &mdash; `"TemperatureC"`. This column stores the temperature in Celsius.

To get a single column from a DataFrame, you can use this syntax:

```py
one_column = london_data["column_name"]
```

Create a variable named `temp` and set it equal to the `"TemperatureC"` column of `london_data`.

We can now calculate descriptive statistics about this column. To begin, find the average temperature in London in 2015. Store it in a variable named `average_temp`.

Calculate the variance of the temperature column and store the results in the variable `temperature_var`. Print the results.

Calculate the standard deviation of the temperature column and store it in a variable named `temperature_standard_deviation`. Print this variable.

How would the variance and standard deviation help you plan a trip?

The statistics we just calculated aren't very helpful when trying to plan a vacation since they describe the weather throughout an entire year.

If we could find a way to use the rows from only a certain month, that might help us find the best month to plan our trip.

Once again, print `london_data.head()` to see the first few rows of our DataFrame. Which column will help us get only the data points from January? In the browser you can scroll to the right to see more columns.

We want to filter by the `"month"` column! The following line of code will create a variable that gets the temperature from the rows where `"month"` is `6`. These will be all of the rows from the month of June.

```py
june = london_data.loc[london_data["month"] == 6]["TemperatureC"]
```

Create this variable for June.


Create a variable named `july` that contains all of the data points from July. The code to do this should look very similar to your code that created the June variable. This time, we're interested in month `7`.

Calculate and print the mean temperature in London for both June and July using the `np.mean()` function.

What do these numbers tell you? If you wanted to visit London in the month that was, on average, cooler, which month would you pick? Look at the hint to see our thoughts!

Calculate and print the standard deviation of temperature in London for both June and July. Remember, the function you should use is `np.std()`.

What do these numbers tell you? How might the standard deviation change your decision on when to visit London? Click on the hint to see our thoughts.

If you want to quickly see the mean and standard deviation of every month, use this block of code. 

```py
for i in range(1, 13):
  month = london_data.loc[london_data["month"] == i]["TemperatureC"]
  print("The mean temperature in month "+str(i) +" is "+ str(np.mean(month)))
  print("The standard deviation of temperature in month "+str(i) +" is "+ str(np.std(month)) +"\n")
```

During which month would you most like to visit? If you wanted to pick the month with the least variable temperature, which one would you pick?


By looking at the mean and standard deviation of the temperature in London during each month of the year, we can get a sense of the best time to visit.

The spread of the data is important to consider if you are particularly sensitive to extreme days. For example, if you pick a month with a large standard deviation, you might have one day that is relatively cold while the following day is very hot.

Take some time to see if you can find more insights in this dataset. Here are some ideas we have for you:
* Look at columns other than `"TemperatureC"`. Can you find something interesting about the humidity or the air pressure? Can you find the rainiest month? London is notoriously rainy!
* Filter based on `"hour"`. Similar to how you filtered based on the month, are there certain hours that have higher variance than others?

Variance in Weather

In this lesson, you will learn how to calculate and interpret the variance of a dataset.

Finding the mean, median, and mode of a dataset is a good way to start getting an understanding of the general shape of your data.

However, those three descriptive statistics only tell part of the story. Consider the two datasets below:

```py
dataset_one = [-4, -2, 0, 2, 4]
dataset_two = [-400, -200, 0, 200, 400]
```

These two datasets have the same mean and median &mdash; both of those values happen to be `0`. If we only reported these two statistics, we would not be communicating any meaningful difference between these two datasets.

This is where *variance* comes into play. Variance is a descriptive statistic that describes how spread out the points in a data set are.
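As a preview, the sketch below compares the two datasets using NumPy's `np.var()` function, which is covered in more detail later in this lesson. The names `var_one` and `var_two` are ours.

```py
import numpy as np

dataset_one = [-4, -2, 0, 2, 4]
dataset_two = [-400, -200, 0, 200, 400]

# The center of both datasets is the same...
print(np.mean(dataset_one), np.median(dataset_one))
print(np.mean(dataset_two), np.median(dataset_two))

# ...but the variance tells the two datasets apart.
var_one = np.var(dataset_one)
var_two = np.var(dataset_two)
print(var_one, var_two)
```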


Variance

Now that you have learned the importance of describing the spread of a dataset, let's figure out how to mathematically compute this number.

How would you attempt to capture the spread of the data in a single number? 

Let's start with our intuition &mdash; we want the variance of a dataset to be a large number if the data is spread out, and a small number if the data is close together.

<img src="https://content.codecademy.com/courses/statistics/variance/two_histograms.svg" alt = "Two histograms. One with a large spread and one with a smaller spread.">

A lot of people may initially consider using the range of the data. But that only considers two points in your entire dataset. Instead, we can include every point in our calculation by finding the difference between every data point and the mean. 

<img src="https://content.codecademy.com/courses/statistics/variance/difference.svg" alt="The difference between the mean and four different points.">

If the data is close together, then each data point will tend to be close to the mean, and the difference will be small. If the data is spread out, the difference between every data point and the mean will be larger.

Mathematically, we can write this comparison as 

```tex
\text{difference} = X - \mu
```
Where `X` is a single data point and the Greek letter `mu` is the mean.
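With a NumPy array, this subtraction can be applied to every data point at once. The sketch below uses a small dataset chosen for illustration; `mu` and `differences` are names we picked to mirror the equation.

```py
import numpy as np

# A small dataset chosen for illustration.
dataset = np.array([3, 5, -2, 49, 10])

# mu is the mean of the dataset.
mu = np.mean(dataset)

# Subtracting a single number from an array computes X - mu for every point.
differences = dataset - mu

print(mu)
print(differences)
```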


Distance From Mean

We now have five different values that describe how far away each point is from the mean. That seems to be a good start in describing the spread of the data. But the whole point of calculating variance was to get one number that describes the dataset. We don't want to report five values &mdash; we want to combine those into one descriptive statistic.

To do this, we'll take the average of those five numbers. By adding those numbers together and dividing by `5`, we'll end up with a single number that describes the average distance between our data points and the mean.

Note that we're not _quite_ done yet &mdash; our final answer is going to look a bit strange here. There's a small problem that we'll fix in the next exercise.


Average Distances

We're almost there! We have one small problem with our equation. Consider this very small dataset:

```py
[-5, 5]
```
The mean of this dataset is `0`, so when we find the difference between each point and the mean we get `-5 - 0 = -5` and `5 - 0 = 5`.

When we take the average of `-5` and `5` to get the variance, we get `0`! 

Now think about what would happen if the dataset were `[-200, 200]`. We'd get the same result! That can't possibly be right &mdash; the dataset with `200` is much more spread out than the dataset with `5`, so the variance should be much larger!

The problem here is with negative numbers. Because one of our data points was `5` units below the mean and the other was `5` units above the mean, they canceled each other out! 

When calculating variance, we don't care if a data point was above or below the mean &mdash; all we care about is how far away it was. To get rid of those pesky negative numbers, we'll square the difference between each data point and the mean.

Our equation for finding the difference between a data point and the mean now looks like this:

```tex
\text{difference} = (X - \mu)^2
```
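
To see the effect of squaring, here is a small sketch that compares the two approaches on the datasets discussed above. The helper names `mean_difference` and `mean_squared_difference` are hypothetical and exist only for this example.

```py
import numpy as np

def mean_difference(data):
    # Average of the raw differences from the mean (positives and negatives cancel)
    data = np.array(data)
    return np.mean(data - np.mean(data))

def mean_squared_difference(data):
    # Average of the squared differences from the mean
    data = np.array(data)
    return np.mean((data - np.mean(data)) ** 2)

small = [-5, 5]
large = [-200, 200]

print(mean_difference(small), mean_difference(large))  # 0.0 0.0
print(mean_squared_difference(small))                  # 25.0
print(mean_squared_difference(large))                  # 40000.0
```

With squared differences, the more spread out dataset now produces the much larger number.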


Square The Differences

Well done! You've calculated the variance of a dataset. The full equation for the variance is as follows:

```tex
\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}
```
Let's dissect this equation a bit. 
* Variance is usually represented by the symbol sigma squared. 
* We start by taking every point in the dataset &mdash; from point number `1` to point number `N` &mdash; and finding the difference between that point and the mean. 
* Next, we square each difference to make all differences positive.
* Finally, we average those squared differences by adding them together and dividing by `N`, the total number of points in the dataset.

All of this work can be done quickly using Python's NumPy library. The `var()` function takes a list of numbers as a parameter and returns the variance of that dataset.

```py
import numpy as np

dataset = [3, 5, -2, 49, 10]
variance = np.var(dataset)
```
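
If you'd like to check that `np.var()` matches the equation, the sketch below computes the variance step by step on the same dataset. By default, `np.var()` divides by `N`, just like the formula above.

```py
import numpy as np

dataset = [3, 5, -2, 49, 10]

# Step 1: find the mean
mean = sum(dataset) / len(dataset)

# Step 2: square the difference between each point and the mean
squared_differences = [(x - mean) ** 2 for x in dataset]

# Step 3: average the squared differences
manual_variance = sum(squared_differences) / len(dataset)

print(manual_variance)   # approximately 338.8
print(np.var(dataset))   # the same value, up to floating-point rounding
```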


Variance In NumPy

Great work! In this lesson you've learned about variance and how to calculate it.

In the example used in this lesson, the importance of variance was highlighted by showing data from test scores in classes taught by two different teachers. What story does variance tell? What conclusions can we draw from this statistic?

<img src = "https://content.codecademy.com/courses/statistics/variance/teachers.png" alt = "The histogram of scores from two different teacher's classes">

In the class with low variance, it seems like the teacher strives to make sure all students have a firm understanding of the subject, but nobody is exemplary.

In the class with high variance, the teacher might focus more of their attention on certain students. This might enable some students to ace their tests, but other students get left behind.

If we only looked at statistics like mean, median, and mode, these nuances in the data wouldn't be represented.

Review

In this lesson you will learn how to calculate and interpret the standard deviation of a dataset.

When beginning to work with a dataset, one of the first pieces of information you might want to investigate is the spread &mdash; is the data close together or far apart? One of the tools in our statistics toolbelt to do this is the descriptive statistic _variance_:

```tex
\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}
```

By finding the variance of a dataset, we can get a numeric representation of the spread of the data. But what does that number really mean? How can we use this number to interpret the spread?

It turns out, using variance isn't necessarily the best statistic to use to describe spread. Luckily, there is another statistic &mdash; standard deviation &mdash; that can be used instead.

In this lesson, we'll be working with two datasets. The first dataset contains the heights (in inches) of a random selection of players from the NBA. The second dataset contains the heights (in inches) of a random selection of users on the dating platform OkCupid &mdash; let's hope these users were telling the truth about their height!

Variance Recap

Variance is a tricky statistic to use because its units are different from both the mean and the data itself. For example, the mean of our NBA dataset is `77.98` inches. Because of this, we can say someone who is `80` inches tall is about two inches taller than the average NBA player.

However, because the formula for variance includes _squaring_ the difference between the data and the mean, the variance is measured in _units squared_. This means that the variance for our NBA dataset is `13.32` inches squared.

This result is hard to interpret in context with the mean or the data because their units are different. This is where the statistic _standard deviation_ is useful.

Standard deviation is computed by taking the square root of the variance. `sigma` is the symbol commonly used for standard deviation. Conveniently, `sigma` squared is the symbol commonly used for variance:

```tex
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}}
```

In Python, you can take the square root of a number using `** 0.5`:

```py
num = 25
num_square_root = num ** 0.5
```


Standard Deviation

There is a NumPy function dedicated to finding the standard deviation of a dataset &mdash; we can cut out the step of first finding the variance. The NumPy function `std()` takes a dataset as a parameter and returns the standard deviation of that dataset:

```py
import numpy as np

dataset = [4, 8, 15, 16, 23, 42]
standard_deviation = np.std(dataset)
```
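
As a quick sanity check, the sketch below confirms that `np.std()` gives the same result as taking the square root of `np.var()` on the same dataset.

```py
import numpy as np

dataset = [4, 8, 15, 16, 23, 42]

# Square root of the variance, using ** 0.5
from_variance = np.var(dataset) ** 0.5

# NumPy's dedicated function
from_std = np.std(dataset)

print(from_variance, from_std)  # both are roughly 12.32
```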


Standard Deviation in NumPy

Now that we're able to compute the standard deviation of a dataset, what can we do with it? 

Now that our units match, our measure of spread is easier to interpret. By finding the number of standard deviations a data point is away from the mean, we can begin to investigate how unusual that datapoint truly is. In fact, when your data is roughly normally distributed (bell-shaped), you can expect around 68% of it to fall within one standard deviation of the mean, 95% within two standard deviations, and 99.7% within three standard deviations.

<img src="https://content.codecademy.com/courses/statistics/variance/normal_curve.svg" alt="A histogram showing where the standard deviations fall">

If you have a data point that is over three standard deviations away from the mean, that's an incredibly unusual piece of data!
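
Here is a minimal sketch of how you might count the number of standard deviations between a data point and the mean. The dataset and the chosen `data_point` are hypothetical, used only to show the arithmetic.

```py
import numpy as np

# Hypothetical dataset for illustration
dataset = np.array([4, 8, 15, 16, 23, 42])

mean = np.mean(dataset)
standard_deviation = np.std(dataset)

# How many standard deviations is this point from the mean?
data_point = 42
num_deviations = (data_point - mean) / standard_deviation

print(num_deviations)  # roughly 1.95
```

A positive result means the point is above the mean, and a negative result means it is below.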


Using Standard Deviation

In the last exercise you saw that LeBron James was `0.55` standard deviations above the mean of NBA player heights. He's taller than average, but compared to the other NBA players, he's not absurdly tall.

However, compared to the OkCupid dating pool, he is extremely rare! He's almost three full standard deviations above the mean. You'd expect only about `0.15%` of people on OkCupid to be more than 3 standard deviations above the mean.

This is the power of standard deviation. By taking the square root of the variance, the standard deviation gives you a statistic about spread that can be easily interpreted and compared to the mean.


Use histograms to find the best time of year to visit Acadia, Maine.

Use Matplotlib to create a flights histogram.



If you haven't done so already, set the range of your histogram to `(0, 365)`


Set the number of bins in your plot to 365, so you have a separate bin for each day of the year.



Add an x-label, y-label, and title to your figure.



Now, we're going to set up our figure so it displays two plots at once. Above your `plt.hist()` line, add the following:

```py
plt.figure(1)
plt.subplot(211)
```



Between the last line for plotting your histogram and the show command, add `plt.subplot(212)`.

At this point, your code should look something like:

```py
plt.figure(1)
plt.subplot(211)

plt.hist(flights, range=(0, 365), bins=365)
plt.title("Flights by Day")
plt.xlabel("Day of the Year")
plt.ylabel("Flight Count")

plt.subplot(212)
# Next histogram below


plt.show()
```


Under `plt.subplot(212)`, use `plt.hist()` to make a histogram that displays the number of flowers that begin to bloom each day of the year.


Label the y-axis of the plot. In the hint, we've added our code.

Right before calling `plt.show()` call `plt.tight_layout()`. This will prevent the labels from overlapping with the graphs.

How would you use these histograms to help inform customers when they should go to Acadia, Maine? For example, if someone said they wanted to visit Acadia while there weren't many people there, but while flowers were blooming, when would you recommend?

Check the hint to see how we answered this question.

Traveling to Acadia

Statistics is often pitched as a way to find certainty through data. As you'll learn in this lesson, the power of statistics is more often used to communicate that certainty doesn't really exist. Instead, it provides tools to communicate how uncertain we are about a problem.

There's no better tool to visualize the uncertainty and chaos in data than a histogram. A histogram displays the distribution of your underlying data.

Histograms reveal, through numbers, interpretable trends in your data. They don't provide a yes or no answer, but are often used as a starting point for discussion and informing an answer to your data.

Whether you're brand new to statistics or know it well, this lesson will teach you how to use histograms to visualize your data and inform decision-making.


The purpose of a histogram is to summarize data that you can use to inform a decision or explain a distribution.

While a histogram is one of the most useful tools for communicating trends, people often use the *average* of a dataset to make broad claims about its underlying trends.

While the average value of data may be useful to interpret where the data is centered, it can also be misleading. 

### Average grocery store time

Throughout this lesson, you will analyze grocery store data to understand how a histogram may be a better tool for communicating trends in your data. You will use this data to determine the best times for a manager to staff their store.

Let's start by looking at our data and calculating the average time that a person will enter the grocery store.



Summarizing Your Data

Histograms are helpful for understanding how your data is distributed. While the average time a customer may arrive at the grocery store is 3 pm, the manager knows 3 pm is not the busiest time of day.

Before identifying the busiest times of the day, it's important to understand the extremes of your data: the minimum and maximum values in your dataset. With the minimums and maximums, you can calculate the *range*.

The range of your data is the difference between the maximum value and the minimum value in your dataset. 

```tex
range = max(data)\ -\ min(data)
``` 

#### Exercise Class Example

In the example below, we have a NumPy array with the ages of people in an exercise class. Before looking at the data, let's think about what minimum, maximum, and range values are reasonable for a group of people in an exercise class:

- The minimum cannot be below 0, because people don't have negative ages.
- The maximum is probably lower than 122 (the oldest person ever).

Now, let's take a look at our data.

```py
import numpy as np

exercise_ages = np.array([22, 27, 45, 62, 34, 52, 42, 22, 34, 26])
```
The minimum age in `exercise_ages` is 22, the maximum age is 62, and the range is 40.

You can use the following Python commands to verify this result:

```py
min_age = np.amin(exercise_ages) # Answer is 22
max_age = np.amax(exercise_ages) # Answer is 62
age_range = max_age - min_age # Answer is 40
```


Range

In the previous exercise, you found that the earliest transaction time is close to 0, and the latest transaction is close to 24, making your range nearly 24 hours.

Now, we have the information we need to start building our histogram. The two key features of a histogram are *bins* and  *counts*. 

##### Bins

A bin is a sub-range of values that falls within the range of a dataset. In the grocery store example, a valid bin may be from 0 hours to 6 hours. This bin includes all times from just after midnight (0) until 6 am (6).

Additionally, all bins in a histogram must be the same width. 

If the range of values in our dataset is from 0 to 24, and the first bin in our grocery store example is from 0 to 6, can you figure out the minimums and maximums of the other bins?

The grocery store bins are:
- 0 to 6 hours
- 6 to 12 hours
- 12 to 18 hours
- 18 to 24 hours
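
If you'd rather not work out the bin edges by hand, one possible approach is NumPy's `linspace()` function, which returns evenly spaced values between two endpoints. The variable names below are our own choices for this sketch.

```py
import numpy as np

range_min, range_max = 0, 24
num_bins = 4

# Four equal-width bins need five edges
bin_edges = np.linspace(range_min, range_max, num_bins + 1)

print(bin_edges)  # [ 0.  6. 12. 18. 24.]
```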

Bins and Count I

A *count* is the number of values that fall within a bin's range. For example, if 100 customers arrive at your grocery store between midnight (0) and 6 am (6), your count for that bin is equal to 100.

##### Exercise Class Example

In the example below, we have an array with ten values in it, each representing the age of a person in an exercise class. We want to calculate the number of students who are in their 20s, 30s, 40s, 50s, and 60s.

```py
exercise_ages = np.array([20, 27, 45, 69, 34, 52, 42, 22, 34, 26])
```

**Question:** What is our range?
- **Answer**: Between 20 and 69. 

**Question:** How many bins do we need? 
- **Answer:** The bins are 20s, 30s, 40s, 50s, and 60s, so we need five bins, each covering ten years.

The table below shows the number of people in each binned age group:

<div class="narrative-table-container">

|*20-29*|*30-39*|*40-49*|*50-59*|*60-69*|
|-|-|-|-|-|
|4|2|2|1|1|

</div>
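
The counts in the table can be reproduced with a short loop. This is only a sketch of the counting logic: for each decade, we count the ages that are at least the lower bound and below the next decade.

```py
import numpy as np

exercise_ages = np.array([20, 27, 45, 69, 34, 52, 42, 22, 34, 26])

counts = []
for lower in range(20, 70, 10):
    # True for ages inside this decade, False otherwise
    in_bin = (exercise_ages >= lower) & (exercise_ages < lower + 10)
    counts.append(int(np.sum(in_bin)))

print(counts)  # [4, 2, 2, 1, 1]
```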




Bins and Count II

While counting the number of values in a bin is straightforward, it is also time-consuming. How long do you think it would take you to count the number of values in each bin for:
- an exercise class of 50 people?
- a grocery store with 300 loaves of bread?

Most of the data you will analyze with histograms includes far more than ten values. 

For these situations, we can use the `numpy.histogram()` function. In the example below, we use this function to find the counts for a twenty-person exercise class.

```py
import numpy as np

exercise_ages = np.array([22, 27, 45, 62, 34, 52, 42, 22, 34, 26, 24, 65, 34, 25, 45, 23, 45, 33, 52, 55])

np.histogram(exercise_ages, range = (20, 70), bins = 5)
```
Below, we explain each of the function's inputs:
- `exercise_ages` is the input array
- `range = (20, 70)` is the range of values we expect in our array. Values from 20 up to and including 70 are counted; anything outside this range is ignored.
- `bins = 5` is the number of bins. NumPy will automatically calculate equal-width bins based on the range and number of bins.

Below, you can see the output of the `numpy.histogram()` function:

```bash
(array([7, 4, 4, 3, 2]), array([20., 30., 40., 50., 60., 70.]))
```

The first array, `array([7, 4, 4, 3, 2])`, is the counts for each bin. The second array, `array([20., 30., 40., 50., 60., 70.])`, includes the minimum and maximum values for each bin: 
- **Bin 1:**  20 to <30
- **Bin 2:** 30 to <40
- **Bin 3:** 40 to <50
- **Bin 4:** 50 to <60
- **Bin 5:** 60 to 70 (the last bin also includes its upper edge)
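
Because `numpy.histogram()` returns two arrays, it can be handy to unpack them into separate variables. The sketch below does that and prints one line per bin. The names `counts` and `bin_edges` are our own. Note that in NumPy the last bin is closed on both sides, so a value of exactly 70 would be counted in the final bin.

```py
import numpy as np

exercise_ages = np.array([22, 27, 45, 62, 34, 52, 42, 22, 34, 26,
                          24, 65, 34, 25, 45, 23, 45, 33, 52, 55])

# Unpack the two returned arrays
counts, bin_edges = np.histogram(exercise_ages, range=(20, 70), bins=5)

# Pair each count with the lower and upper edge of its bin
for count, lower, upper in zip(counts, bin_edges[:-1], bin_edges[1:]):
    print(f"{lower:.0f} to {upper:.0f}: {count}")
```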


Histograms

At this point, you've learned how to find the numerical inputs to a histogram. Thus far the size of our datasets and bins have produced results that we can interpret. This becomes increasingly difficult as the number of bins in a histogram increases.

Because of this, histograms are typically viewed graphically, with bin ranges on the x-axis and counts on the y-axis. The figure below shows the graphical representation of the histogram for our exercise class example from the last exercise. Notice that there are five equally spaced bars, each displaying a count for an age range. Compare the graph to the table just below it.

<img src="https://content.codecademy.com/courses/statistics/histograms/age-distribution.png" alt="Histogram">


<div class="narrative-table-container">

|*20-29*|*30-39*|*40-49*|*50-59*|*60-69*|
|-|-|-|-|-|
|7|4|4|3|2|

</div>




Histograms are an easy way to visualize trends in your data. When I look at the above graph, I think, "More people in the exercise class are in their twenties than any other decade. Additionally, the histogram is skewed, indicating the class is made of more younger people than older people."

We created the plot above using the `matplotlib.pyplot` package. We imported the package using the following code:

```py
from matplotlib import pyplot as plt
```

We plotted the histogram with the following code. Notice, the `range` and `bins` arguments are the same as we used in the last exercise:

```py
plt.hist(exercise_ages, range = (20, 70), bins = 5, edgecolor='black')

plt.title("Decade Frequency")
plt.xlabel("Ages")
plt.ylabel("Count")

plt.show()
```

In the code above, we used the `plt.hist()` function to create the plot, then added a title, x-label, and y-label before showing the graph with `plt.show()`.


Plotting a Histogram

The figure below displays the graph that you created in the last exercise:

<img src="https://content.codecademy.com/courses/statistics/histograms/customer-frequency.png" alt="Histogram">

This histogram is helpful for our store manager. The last six hours of the day are the busiest &mdash; from 6 pm until midnight. Does this mean the manager should staff their grocery store with the most employees between 6 pm and midnight? 

To the manager, this doesn't make much sense. The manager knows the store is busy when many people get off work, but the rush certainly doesn't continue later than 9 pm. 

The issue with this histogram is that we have too few bins. When plotting a histogram, it's essential to select bins that fully capture the trends in the underlying data. Often, this will require some guessing and checking. There isn't much of a science to selecting bin size.

How many bins do you think makes sense for this example?
I would try 24 because there are 24 hours in a day.
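
To get a feel for how the number of bins changes the story, here is a sketch using `np.histogram()` on a small, made-up `times` array. These are not the lesson's grocery store data, just twelve hypothetical arrival times.

```py
import numpy as np

# Hypothetical arrival times, in hours after midnight
times = np.array([7.5, 8.2, 12.1, 12.6, 17.2, 17.8,
                  18.1, 18.4, 18.9, 19.3, 20.5, 23.0])

# Four wide bins: the whole evening looks equally busy
counts_4, _ = np.histogram(times, range=(0, 24), bins=4)
print(counts_4)  # [0 2 4 6]

# Twenty-four bins: one per hour, so the rush is easier to pin down
counts_24, _ = np.histogram(times, range=(0, 24), bins=24)
busiest_hour = int(np.argmax(counts_24))
print(busiest_hour)  # 18
```

With four bins, all we learn is that the last six hours are the busiest. With 24 bins, the peak shows up in the hour starting at 6 pm.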


Finding your Best Bin Size

In this lesson, you learned what a histogram is, how to calculate it using NumPy, and how to plot one with Matplotlib.

The example that we used throughout this lesson, finding the busiest times of day at a grocery store, shows how histograms can be used to derive meaning and inform decisions with our data.

Looking at the hourly customer count histogram was more helpful to inform staffing decisions than knowing the average time that a customer arrives at the grocery store. 

Although the average is an important and quick way to summarize your data, histograms tell you much more of the story.

When you want to see the distribution of the data.

When you want to know the center of your data.

When you want to capture details of your data with a single number.

When you want to show the total number of values in a dataset.

The exact value of the minimum in your dataset.

All bins must have the same number of data samples in it.

Each bin in a histogram must be the same width.

Bins separate subsets of your data, based on numerical proximity.

The number of bins in a histogram can be set manually.

Decrease the range of the histogram to 12.

```py
plt.hist(times, range=(0, 24), bins = 4)
```

```py
plt.hist(times, range=(0, 24), bins = 5)
```

```py
plt.hist(times, range=(0, 4), bins = 24)
```

```py
plt.hist(times, range=(0, 48), bins = 4)
```

new_array = np.array([1, 2, 2, 6, 9, 10, 11, 3, 6])

This quiz contains questions related to introductory level histogram concepts.

Two, with one centered at 20 and one centered at 40.

Three, with one centered at 20, one centered at 40, and one at 60.

There is no skew. The distribution is symmetric.

![histogram](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/unimodal.png)

![histogram](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/skew.png)

![histogram](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/multimodal.png)

An outlier is a data point that is far away from the rest of the dataset.

The maximum and minimum values of a dataset are always the outliers of a dataset.

All points that are more than one standard deviation away from the mean are outliers.

Outliers are visualized as points in a histogram.

Test your knowledge of histograms by taking this quiz on describing a dataset using histograms.

Describe a Histogram

Use histograms to explain exam score distributions in a math class.

**Exam 1:** Use the following information, and the distribution to the right to fill in the Exam 1 blanks in **summary.txt**.

**Average:** 80

**Median:** 80

Check the hint for the answer and an interpretation of the results.




**Exam 2:** Use the following information, and the distribution to the right to fill in the Exam 2 blanks in **summary.txt**.

**Average:** 82

**Median:** 84

Check the hint for the answer and an interpretation of the results.

**Exam 3:** Use the following information, and the distribution to the right to fill in the Exam 3 blanks in **summary.txt**.

**Average:** 77

**Median:** 80

Check the hint for the answer and an interpretation of the results.

**Final Exam:** Use the following information, and the distribution to the right to fill in the Final Exam blanks in **summary.txt**.

**Average:** 80

**Median:** 80

Check the hint for the answer and an interpretation of the results.

Describe Exam Grade Distributions

In this lesson, you will learn how to describe a statistical distribution by considering its center, shape, spread, and outliers.

At this point, you should be familiar with what a histogram displays. In this lesson, we're going to build on those skills by learning the best way to describe a statistical distribution.

While many people know the functions to plot a histogram, few spend the time to learn how to fully and concisely communicate what it means.

In this lesson, you will learn how to interpret a distribution using the following five features of a dataset:

- Center
- Spread
- Skew
- Modality
- Outliers

If you're one for mnemonics, maybe this will help:

**C**ream **S**hoes are **S**tylish, **M**odern, and **O**utstanding.

Throughout this lesson, we will use <a href="https://data.cms.gov/summary-statistics-on-use-and-payments/medicare-geographic-comparisons/medicare-geographic-variation-by-national-state-county" target="_blank">data from the United States Health and Human Services Department</a> to compare the cost of the same medical procedure at over 2,000 hospitals across the country.

One of the most common ways to summarize a dataset is to communicate its center. In this lesson, we will use average and median as our measures of centrality. Take the Codecademy lessons on <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/average/exercises/introduction?action=resume_content_item" target="_blank">average</a> and <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction?action=resume_content_item" target="_blank">median</a> if you're interested in how to calculate them by hand or using NumPy functions.

The figure below shows the average and median ages of a dataset of 100 authors. As expected, the average and median values are near the center of the distribution.

![title](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/mean-median.png)

While it's good practice to communicate both the average and median values, the average is generally more common.
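
As a brief sketch, both measures of center can be computed with NumPy's `mean()` and `median()` functions. The `ages` array below is hypothetical and much smaller than the 100-author dataset in the figure.

```py
import numpy as np

# Hypothetical ages, for illustration only
ages = np.array([18, 25, 31, 38, 42, 45, 51, 58, 64, 76])

average_age = np.mean(ages)
median_age = np.median(ages)

print(average_age)  # 44.8
print(median_age)   # 43.5
```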

Center

Once you've found the center of your data, you can shift to identifying the extremes of your dataset: the minimum and maximum values. These values, taken with the mean and median, begin to indicate the shape of the underlying dataset. Take the histogram below as an example:

![title](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/mean-median.png)

The minimum value of this data is 18, and the maximum value is 76.

You can calculate the range using the following:

```tex
range = max - min
```

The range of this dataset is
 
```tex
range = 76 - 18 = 58
```

    


Spread

Once you have the center and range of your data, you can begin to describe its shape. The skew of a dataset is a description of the data's symmetry.

A dataset with one prominent peak and similar tails to the left and right is called symmetric. The median and mean of a symmetric dataset are similar.

![histogram](https://content.codecademy.com/courses/learn-pandas/distribution-types-ii-symmetric.svg)

A histogram with a tail that extends to the right is called a right-skewed dataset. The median of this dataset is less than the mean.

![histogram](https://content.codecademy.com/courses/learn-pandas/distribution-types-ii-skew-right.svg)


A histogram with one prominent peak to the right and a tail that extends to the left is called a left-skewed dataset. The median of this dataset is greater than the mean.

![histogram](https://content.codecademy.com/courses/learn-pandas/distribution-types-ii-skew-left.svg)
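
One way to check the relationship between skew, mean, and median is to compare them directly. The two small arrays below are made up so that one has a long right tail and the other a long left tail.

```py
import numpy as np

# Hypothetical data: most values are small, with one large value on the right
right_skewed = np.array([1, 2, 2, 3, 3, 3, 4, 20])

# Hypothetical data: most values are large, with one small value on the left
left_skewed = np.array([1, 17, 18, 18, 18, 19, 19, 20])

print(np.mean(right_skewed), np.median(right_skewed))  # 4.75 3.0
print(np.mean(left_skewed), np.median(left_skewed))    # 16.25 18.0
```

In both cases the mean is pulled toward the tail, while the median stays closer to the peak.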


Skew

The modality describes the number of peaks in a dataset. Thus far, we have only looked at datasets with one distinct peak, known as _unimodal_. This is the most common.

![histogram](https://content.codecademy.com/courses/numpy/distribution_type_i/unimodal_new.svg)

A _bimodal_ dataset has two distinct peaks. 

![histogram](https://content.codecademy.com/courses/numpy/distribution_type_i/bimodal_new.svg)

A _multimodal_ dataset has more than two peaks. The histogram below displays three peaks.

![histogram](https://content.codecademy.com/courses/numpy/distribution_type_i/multimodal_new.svg)

You may also see datasets with no obvious clustering. Datasets such as these are called _uniform distributions_.

![histogram](https://content.codecademy.com/courses/numpy/distribution_type_i/uniform_new.svg)

Modality

An outlier is a data point that is far away from the rest of the dataset. Outliers do not have a formal definition, but are often easy to spot by looking at a histogram. The histogram below shows an example of an outlier. There is one datapoint that is much larger than the rest.

![title](https://content.codecademy.com/courses/statistics/histograms/histograms-ii/outlier.png)

If you see an outlier in your dataset, it's worth reporting and investigating. This data can often indicate an error in your data or an interesting insight.
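
Since there is no formal definition, any rule for flagging outliers in code is a judgment call. The sketch below uses one possible rule of thumb, which is our assumption and not part of this lesson: flag any point more than two standard deviations from the mean. The data is hypothetical.

```py
import numpy as np

# Hypothetical data with one value far from the rest
data = np.array([10, 12, 11, 13, 12, 11, 10, 95])

mean = np.mean(data)
standard_deviation = np.std(data)

# Assumed rule of thumb for this sketch: more than two
# standard deviations from the mean counts as an outlier
threshold = 2
distances = np.abs(data - mean) / standard_deviation
outliers = data[distances > threshold]

print(outliers)  # [95]
```

A different threshold, or a different rule entirely, could be just as reasonable depending on your data.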


Outliers

In this lesson, you learned a framework for describing the distribution of a dataset, which includes the following five features:
- Center
- Spread
- Skew
- Modality 
- Outliers

If you're curious, while the dataset in this study showed the average amount charged by hospitals for a given condition, the amount paid is far less. This is often due to the nature of the United States healthcare system, where private insurers negotiate down the charges levied by hospitals.

The small histogram on the left side of the plot displays the actual amount of money paid to the hospital.

In this lesson you will learn how to calculate and interpret the quartiles of a dataset.

A common way to communicate a high-level overview of a dataset is to find the values that split the data into four groups of equal size.

By doing this, we can then say whether a new datapoint falls in the first, second, third, or fourth quarter of the data.

<img src="https://content.codecademy.com/courses/statistics/quantiles/quartiles.svg" alt = "20 data points, with three lines splitting the data into 4 groups of 5.">

The values that split the data into fourths are the _quartiles_. 

Those values are called the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).

In the image above, Q1 is `10`, Q2 is `13`, and Q3 is `22`. Those three values split the data into four groups that each contain five datapoints.

In this lesson, you will learn to calculate the quartiles by hand, and by using Python's NumPy library.

Quartiles


We'll come back to the music dataset in a bit, but let's first practice on a small dataset.

Let's begin by finding the second quartile (Q2). Q2 happens to be exactly the <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction">median</a>. Half of the data falls below Q2 and half of the data falls above Q2.

The first step in finding the quartiles of a dataset is to sort the data from smallest to largest. For example, below is an unsorted dataset:

```tex
[8, 15, 4, -108, 16, 23, 42]
```

After sorting the dataset, it looks like this:

```tex
[-108, 4, 8, 15, 16, 23, 42]
```

Now that the list is sorted, we can find Q2. In the example dataset above, Q2 (and the median) is `15` &mdash; there are three points below `15` and three points above `15`.

### Even Number of Datapoints

You might be wondering what happens if there is an even number of points in the dataset. For example, if we remove the `-108` from our dataset, it will now look like this:

```tex
[4, 8, 15, 16, 23, 42]
```

Q2 now falls somewhere between `15` and `16`. There are a couple of different strategies that you can use to calculate Q2 in this situation. One of the more common ways is to take the average of those two numbers. In this case, that would be `15.5`. 

Recall that you can find the average of two numbers by adding them together and dividing by two.
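
If you want to check your work, NumPy's `median()` function handles both cases. For an even number of points, it averages the two middle values, which matches the strategy described above.

```py
import numpy as np

odd_dataset = [8, 15, 4, -108, 16, 23, 42]
even_dataset = [8, 15, 4, 16, 23, 42]

# np.median() sorts the data internally, so unsorted input is fine
print(np.median(odd_dataset))   # 15.0
print(np.median(even_dataset))  # 15.5
```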


The Second Quartile

Now that we've found Q2, we can use that value to help us find Q1 and Q3. Recall our demo dataset:

```tex
[-108, 4, 8, 15, 16, 23, 42]
```
In this example, Q2 is `15`. To find Q1, we take all of the data points smaller than Q2 and find the median of _those_ points. In this case, the points smaller than Q2 are:

```tex
[-108, 4, 8]
```
The median of that smaller dataset is `4`. That's Q1!

To find Q3, do the same process using the points that are larger than Q2. We have the following points:

```tex
[16, 23, 42]
```
The median of _those_ points is `23`. That's Q3! We now have three points that split the original dataset into four groups of equal size.
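
The steps above can be sketched in Python as follows. This sketch assumes the dataset has an odd number of points, so that Q2 is an actual value in the data. The variable names are our own.

```py
import numpy as np

dataset = [8, 15, 4, -108, 16, 23, 42]

sorted_data = sorted(dataset)
middle = len(sorted_data) // 2  # index of Q2 when the length is odd

q2 = sorted_data[middle]

# Exclude Q2 from both halves
lower_half = sorted_data[:middle]       # [-108, 4, 8]
upper_half = sorted_data[middle + 1:]   # [16, 23, 42]

q1 = np.median(lower_half)
q3 = np.median(upper_half)

print(q1, q2, q3)  # 4.0 15 23.0
```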


Q1 and Q3

You just learned a commonly used method to calculate the quartiles of a dataset. However, there is another method that is equally accepted that results in different values! 

Note that there is no universally agreed upon method of calculating quartiles, and as a result, two different tools might report different results.

The second method includes Q2 when trying to calculate Q1 and Q3. Let's take a look at an example:

```tex
[-108, 4, 8, 15, 16, 23, 42]
```

Using the first method, we found Q1 to be `4`. When looking at all of the points below Q2, we excluded Q2. Using this second method, we _include_ Q2 in each half.

For example, when calculating Q1 using this new method, we would now find the median of this dataset:

```tex
[-108, 4, 8, 15]
```
Using this method, Q1 is `6`.
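
Here is a sketch of the second method on the same dataset, again assuming an odd number of points. The only change from the first method is that Q2 is kept in both halves. For this particular dataset, NumPy's `quantile()` function with its default settings happens to return the same values, though that will not be true of every dataset or every tool.

```py
import numpy as np

dataset = [8, 15, 4, -108, 16, 23, 42]

sorted_data = sorted(dataset)
middle = len(sorted_data) // 2  # index of Q2 when the length is odd

# Include Q2 in both halves
lower_half = sorted_data[:middle + 1]   # [-108, 4, 8, 15]
upper_half = sorted_data[middle:]       # [15, 16, 23, 42]

q1 = np.median(lower_half)
q3 = np.median(upper_half)
print(q1, q3)  # 6.0 19.5

# Comparison with NumPy's default quantile calculation
numpy_q1, numpy_q3 = np.quantile(dataset, [0.25, 0.75])
print(numpy_q1, numpy_q3)  # 6.0 19.5
```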

Method Two: Including Q2

### Why Learn Statistics? 

Statistics is a tool used to communicate our understanding of data. It helps us understand the world better, make assertions, and communicate our confidence in the statements we are making.

### Take-Away Skills 
This course focuses on a number of different descriptive statistics including the mean, median, mode, standard deviation, and variance of different datasets. Not only will you learn how to calculate these statistics, but you will learn how to interpret them. By getting an understanding of what these statistics represent, you will be able to better describe your own datasets.

Learn how to calculate and interpret several descriptive statistics using the Python library NumPy.