This lesson covers the different types of hypothesis tests and the situations they are most appropriate for.

Say you work for a major social media website. Your boss comes to you with two questions:
* does the demographic of users on your site match the company's expectation?
* did the new interface update affect user engagement?

With terabytes of user data at your hands, you decide the best way to answer these questions is with statistical hypothesis tests!

_Statistical hypothesis testing_ is a process that allows you to evaluate if a change or difference seen in a dataset is "real", or if it’s just a result of random fluctuation in the data.

Hypothesis testing can be an integral component of any decision-making process. It provides a framework for evaluating how confident one can be in making conclusions based on data. Some instances where this might come up include:
* a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?
* a product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

In this lesson, you will cover the fundamental concepts that will help you run and evaluate hypothesis tests:
* Sample and Population Mean
* P-Values
* Significance Level
* Type I and Type II Errors

You will then learn about three different hypothesis tests you can perform to answer the kinds of questions discussed above:
* One Sample T-Test
* Two Sample T-Test
* ANOVA (Analysis of Variance)

Let's get started!

Introduction

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure `10` trees and get an average height of `32` ft. On Tuesday, you measure `12` different trees and reach an average height of `35` ft. On Wednesday, you measure the remaining `11` trees in the park, whose average height is `31` ft. The average height for all `33` trees in your local park is `32.8` ft.

The sets of individual height measurements taken on Monday, Tuesday, and Wednesday are each called samples. A *sample* is a subset of the entire population (all the oak trees in the park). The mean of each sample is a *sample mean*, and it is an estimate of the *population mean*.

Note: the sample means (`32` ft., `35` ft., and `31` ft.) were all close to the population mean (`32.8` ft.), but were all slightly different from the population mean and from each other.
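The park-wide average can be recovered from the three daily samples by weighting each sample mean by the number of trees measured that day. A minimal sketch in R (the variable names are illustrative and not part of the lesson):

```r
# Sample means (ft) and sample sizes for Monday, Tuesday, and Wednesday
sample_means <- c(32, 35, 31)
sample_sizes <- c(10, 12, 11)

# Weight each sample mean by its sample size to recover the population mean
population_mean <- weighted.mean(sample_means, sample_sizes)

print(round(population_mean, 1))
```

Because the three samples together cover all `33` trees, the weighted mean of the sample means equals the population mean.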

For a population, the mean is a constant value no matter how many times it's recalculated. The mean of a sample, however, depends on exactly which members of the population are selected. From a sample mean, we can then estimate the mean of the population as a whole. There are three main reasons we might use sampling:

- data on the entire population is not available
- data on the entire population is available, but it is so large that it is unfeasible to analyze
- meaningful answers to questions can be found faster with sampling

Sample Mean and Population Mean - I

In the previous exercise, the sample means you calculated closely approximated the population mean. This won't always be the case!

Consider a tailor of school uniforms at a school for students aged `11` to `13`. The tailor needs to know the average height of all the students in order to know which sizes to make the uniforms.

The tailor measures the heights of a random sample of `20` students out of the `300` in the school. The average height of the sample is `57.5` inches. Using this sample mean, the tailor makes uniforms that fit students of this height, some smaller, and some larger.

After delivering the uniforms, the tailor starts to receive some feedback &mdash; many of the uniforms are too small! They go back to take measurements on the rest of the students, collecting the following data:
* 11 year olds average height: `56.7` inches
* 12 year olds average height: `59` inches
* 13 year olds average height: `62.8` inches
* All students average height (population mean): `59.5` inches

The original sample mean was off from the population mean by `2` inches! How did this happen?

The random sample of `20` students was skewed to one direction of the total population. More `11` year olds were chosen in the sample than is representative of the whole school, bringing down the average height of the sample. This is called a _sampling error_, and occurs when a sample is not representative of the population it comes from. How do you get an average sample height that looks more like the average population height, and reduce the chance of a sampling error?

Selecting only `20` students for the sample allowed for the chance that only younger, shorter students were included. This is a natural consequence of the fact that a sample has less data than the population to which it belongs. If the sample selection is poor, then you will have a sample mean seriously skewed from the population mean.

There is one reliable way to reduce the risk of a skewed sample mean: take a larger sample! The mean of a larger random sample tends to approximate the population mean more closely, which reduces the chance of a large sampling error.
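The effect of sample size can be seen in a small simulation. The sketch below assumes a hypothetical population of `300` student heights drawn from a normal distribution (the standard deviation of `3` inches and all names are assumptions, not data from the tailor example). It repeatedly draws samples of `20` and of `100` students and compares how much the sample means vary.

```r
set.seed(42)  # fixed seed so the simulation is repeatable

# Hypothetical population: heights (inches) of 300 students (assumed values)
population <- rnorm(300, mean = 59.5, sd = 3)

# Draw many random samples of size n and record each sample mean
sample_means_for <- function(n, draws = 1000) {
  replicate(draws, mean(sample(population, n)))
}

means_small <- sample_means_for(20)
means_large <- sample_means_for(100)

# Spread of the sample means: a smaller spread means the sample
# means sit closer to the population mean
spread_small <- sd(means_small)
spread_large <- sd(means_large)

print(c(n_20 = spread_small, n_100 = spread_large))
```

The sample means from the larger samples should cluster more tightly, which is why a larger sample is less likely to land far from the population mean.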

Sample Mean and Population Mean - II

You begin the statistical hypothesis testing process by defining a _hypothesis_, or an assumption about your population that you want to test. A hypothesis can be written in words, but can also be explained in terms of the sample and population means you just learned about.

Say you are developing a website and want to compare the time spent on different versions of a homepage. You could run a hypothesis test to see if version A or B makes users stay on the page significantly longer. Your hypothesis might be:

`"The average time spent on homepage A is greater than the average time spent on homepage B."`

While this is a fine hypothesis to make, data analysts are often very hesitant people. They don't like to make bold claims without having data to back them up! Thus when constructing hypotheses for a hypothesis test, you want to formulate a null hypothesis. A _null hypothesis_ states that there is no difference between the populations you are comparing, and it implies that any difference seen in the sample data is due to sampling error. A null hypothesis for the same scenario is as follows:

`"The average time spent on homepage A is the same as the average time spent on homepage B."`

You could also restate this in terms of population mean:

`"The population mean of time spent on homepage A is the same as the population mean of time spent on homepage B."`

After collecting some sample data on how users interact with each homepage, you can then run a hypothesis test using the data collected to determine whether your null hypothesis can be rejected (i.e. whether there is a difference in time spent on homepage A and homepage B). Note that a hypothesis test never proves the null hypothesis true; it only tells you whether the data give you enough evidence to reject it.

Hypothesis Formulation

Suppose you want to know if students who study history are more interested in volleyball than students who study chemistry. Before doing anything else to answer your original question, you come up with a null hypothesis: `"History and chemistry students are interested in volleyball at the same rates."`

To test this hypothesis, you need to design an experiment and collect data. You invite `100` history majors and `100` chemistry majors from your university to join an extracurricular volleyball team. After one week, `34` history majors sign up (`34%`), and `39` chemistry majors sign up (`39%`). More chemistry majors than history majors signed up, but is this a “real”, or significant difference? Can you conclude that students who study chemistry are more interested in volleyball than students who study history?

In your experiment, the `100` history and `100` chemistry majors at your university are samples of their respective populations (all history and chemistry majors). The sample means are the percentages of history majors (`34%`) and chemistry majors (`39%`) that signed up for the team, and the difference in sample means is `39%` - `34%` = `5%`. The population means are the percentages of history and chemistry majors worldwide that would sign up for an extracurricular volleyball team if given the chance.

You want to know if the difference you observed in these sample means (`5%`) reflects a difference in the population means, or if the difference was caused by sampling error, and the samples of students you chose do not represent the greater populations of history and chemistry students.

Restating the null hypothesis in terms of the population means yields the following:

`"The percentage of all history majors who would sign up for volleyball is the same as the percentage of all chemistry majors who would sign up for volleyball, and the observed difference in sample means is due to sampling error."`

This is the same as saying, “If you gave the same volleyball invitation to every history and chemistry major in the world, they would sign up at the same rate, and the `200` students you selected are not representative of their populations.”

Designing an Experiment

When using automated processes to make decisions, you need to be aware of how this automation can lead to mistakes. Computer programs can be as fallible as the humans who design them. Because of this, there is a responsibility to understand what can go wrong and what can be done to contain these foreseeable problems.

In statistical hypothesis testing, there are two types of error. A _Type I error_ occurs when a hypothesis test finds a correlation between things that are not related. This error is sometimes called a "false positive" and occurs when the null hypothesis is rejected even though it is true.

For example, consider the history and chemistry major experiment from the previous exercise. Say you run a hypothesis test on the sample data you collected and conclude that there is a significant difference in interest in volleyball between history and chemistry majors. You have rejected the null hypothesis that there is no difference between the two populations of students. If, in reality, your results were due to the groups you happened to pick (sampling error), and there actually is no significant difference in interest in volleyball between history and chemistry majors in the greater population, you have become the victim of a false positive, or a Type I error.

The second kind of error, a _Type II error_, is failing to find a correlation between things that are actually related. This error is referred to as a "false negative" and occurs when the null hypothesis is not rejected even though it is false.

For example, with the history and chemistry student experiment, say that after you perform the hypothesis test, you conclude that there is no significant difference in interest in volleyball between history and chemistry majors. You did _not_ reject the null hypothesis. If there actually is a difference in the populations as a whole, and there is a significant difference in interest in volleyball between history and chemistry majors, your test has resulted in a false negative, or a Type II error.


Type I and Type II Errors

You know that a hypothesis test is used to determine the validity of a null hypothesis. Once again, the null hypothesis states that there is no actual difference between the two populations of data. But what result does a hypothesis test actually return, and how can you interpret it?

A hypothesis test returns a few numeric measures, most of which are out of the scope of this introductory lesson. Here we will focus on one: p-values. P-values help you decide whether your data give you enough evidence to reject the null hypothesis. In this context, a _p-value_ is the probability that, assuming the null hypothesis is true, you would see at least such a difference in the sample means of your data.

Consider the experiment on history and chemistry majors and their interest in volleyball from a previous exercise:
* Null Hypothesis: `"History and chemistry students are interested in volleyball at the same rates"`
* Experiment Sample Means: `34%` of history majors and `39%` of chemistry majors sign up for the volleyball team

Assuming the null hypothesis is true, there is no actual difference in preference for volleyball between all history and chemistry majors, and any difference present in the experiment data is the result of sampling error. Imagine you run a hypothesis test on this experiment data and it returns a p-value of `0.04`. A p-value of `0.04` indicates that, if the null hypothesis is true, you could expect to see a difference of at least `5%` (calculated as `39%` - `34%` = `5%`) in the sample means only `4%` of the time.

Essentially, if you ran this same experiment `100` times, you would expect to see as large a difference in the sample means only `4` times given the assumption that there is no actual difference between the populations (i.e. they have the same mean).

Seems like a really small probability, right? Are you thinking about rejecting the null hypothesis you originally stated?

P-Values

While a hypothesis test will return a p-value indicating how compatible your data are with the null hypothesis, it does not definitively claim whether you should reject the null hypothesis. To make this decision, you need to determine a threshold p-value for which all p-values below it will result in rejecting the null hypothesis. This threshold is known as the _significance level_.

A higher significance level is more likely to give a false positive, as it makes it "easier" to state that there is a difference in the populations of your data when such a difference might not actually exist. If you want to be very sure that the result is not due to sampling error, you should select a very small significance level.

It is important to choose the significance level before you perform a statistical hypothesis test. If you wait until after you receive a p-value from a test, you might pick a significance level such that you get the result you want to see. For instance, if someone is trying to publish the results of their scientific study in a journal, they might set a higher significance level that makes their results appear statistically significant. Choosing a significance level in advance helps keep everyone honest.

It is an industry standard to set a significance level of `0.05` or less, meaning that if the null hypothesis is true, there is a `5%` or less chance of rejecting it by mistake because of sampling error.
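The link between the significance level and false positives can be checked with a simulation. The sketch below is an illustration under stated assumptions: every experiment draws two samples from the same normal population, so the null hypothesis is true by construction, and it uses R's `t.test()` (covered later in this lesson) to get a p-value. The sample sizes and names are arbitrary choices.

```r
set.seed(1)  # fixed seed so the simulation is repeatable

significance_level <- 0.05
n_experiments <- 2000

# Each experiment draws two samples from the SAME population,
# so any "significant" result is a false positive (Type I error)
p_values <- replicate(n_experiments, {
  group_a <- rnorm(50, mean = 0, sd = 1)
  group_b <- rnorm(50, mean = 0, sd = 1)
  t.test(group_a, group_b)$p.value
})

# Fraction of experiments that wrongly reject the null hypothesis
false_positive_rate <- mean(p_values < significance_level)
print(false_positive_rate)
```

The printed rate should land near the chosen significance level. Raising `significance_level` raises the false positive rate with it.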

Significance Level

Consider the fictional business BuyPie, which sends ingredients for pies to your household so that you can make them from scratch. Suppose that a product manager hypothesizes the average age of visitors to BuyPie.com is `30`. In the past hour, the website had `100` visitors and the average age was `31`. Are the visitors older than expected? Or is this just the result of chance (sampling error) and a small sample size? 

You can test this using a One Sample T-Test. A _One Sample T-Test_ compares a sample mean to a hypothetical population mean. It answers the question "If the sample came from a distribution with the desired mean, how likely is a sample mean at least this far from it?"

The first step is formulating a null hypothesis, which again is the hypothesis that there is no difference between the populations you are comparing. The second population in a One Sample T-Test is the hypothetical population you choose. The null hypothesis that this test examines can be phrased as follows: `"The set of samples belongs to a population with the target mean"`.

One result of a One Sample T-Test will be a _p-value_, which tells you whether or not you can reject this null hypothesis. If the p-value you receive is less than your significance level, normally `0.05`, you can reject the null hypothesis and state that there is a significant difference.

R has a function called `t.test()` in the `stats` package which can perform a One Sample T-Test for you.

`t.test()` requires two arguments, a distribution of values and an expected mean:

```r
results <- t.test(sample_distribution, mu = expected_mean)
```
* `sample_distribution` is the sample of values that were collected
* `mu` is an argument indicating the desired mean of the hypothetical population
* `expected_mean` is the value of the desired mean

`t.test()` will return, among other information we will not cover here, a p-value. This tells you how likely you would be to see a sample mean at least this far from the specified mean if the sample really came from a distribution with that mean.

P-values give you an idea of how confident you can be in a result. Just because you don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the more samples you have, the smaller a difference you can detect. 
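To make the BuyPie scenario concrete, here is a minimal sketch of a One Sample T-Test. The visitor ages are simulated (the standard deviation of `5` years and the variable names are assumptions for illustration), since the lesson does not list the raw data.

```r
set.seed(10)  # fixed seed so the example is repeatable

# Hypothetical ages of 100 visitors (simulated, not real BuyPie data)
ages <- rnorm(100, mean = 31, sd = 5)

# Null hypothesis: the visitors come from a population with mean age 30
results <- t.test(ages, mu = 30)

# t.test() returns a list; the p-value is stored in `p.value`
p_value <- results$p.value
print(p_value)

# Compare against a significance level chosen in advance
significance_level <- 0.05
reject_null <- p_value < significance_level
print(reject_null)
```

Printing `results` directly shows the full test output, including the t statistic and a confidence interval for the mean.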


One Sample T-Test

Suppose that last week, the average amount of time spent per visitor to a website was `25` minutes. This week, the average amount of time spent per visitor to a website was `29` minutes. Did the average time spent per visitor change (i.e. was there a statistically significant bump in user time on the site)? Or is this just part of natural fluctuations?

One way of testing whether this difference is significant is by using a Two Sample T-Test. A _Two Sample T-Test_ compares the means of two sets of data that are each approximately normally distributed.

The null hypothesis, in this case, is that the two distributions have the same mean.

You can use R's `t.test()` function to perform a Two Sample T-Test, as shown below:

```r
results <- t.test(distribution_1, distribution_2)
```

When performing a Two Sample T-Test, `t.test()` takes two distributions as arguments and returns, among other information, a p-value. Remember, the p-value lets you know the probability of seeing a difference in the means at least this large by chance (sampling error) if the null hypothesis is true.
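As an illustration of the website scenario, the sketch below runs a Two Sample T-Test on simulated visit times. The data, the sample size of `200` visitors per week, and the standard deviation of `10` minutes are all assumptions made for the example.

```r
set.seed(7)  # fixed seed so the example is repeatable

# Hypothetical minutes spent per visitor (simulated data)
last_week <- rnorm(200, mean = 25, sd = 10)
this_week <- rnorm(200, mean = 29, sd = 10)

# Null hypothesis: both weeks come from populations with the same mean
results <- t.test(last_week, this_week)

print(results$estimate)  # the two sample means
print(results$p.value)   # chance of a difference at least this large if the null is true
```

By default, `t.test()` runs Welch's version of the test, which does not assume the two groups have equal variances. Passing `var.equal = TRUE` runs the classic pooled-variance version instead.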

Two Sample T-Test

Suppose that you own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. You want to know if the average ant sales over the past year are significantly different between the three locations.

At first, it seems that you could perform T-tests between each pair of stores.

You know that the significance level is the probability that you incorrectly reject a true null hypothesis on each t-test. The more t-tests you perform, the more likely you are to get a false positive, a Type I error.

For a significance level of `0.05`, if the null hypothesis is true, then the probability of correctly failing to reject it is `1 - 0.05` = `0.95`. When you run another t-test, the probability of still getting a correct result is `0.95` * `0.95`, or `0.9025`. That means your probability of making an error is now close to `10%`! This error probability only gets bigger with the more t-tests you do.
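The arithmetic generalizes: with a threshold of `0.05` per test and `k` tests, the chance of at least one false positive is `1 - 0.95^k`. The sketch below computes this for one, two, and three tests. It treats the tests as independent, which is a simplifying assumption (pairwise comparisons between three stores share data, so the true figure can differ somewhat).

```r
significance_level <- 0.05
number_of_tests <- 1:3  # three t-tests cover the pairs A-B, A-C, and B-C

# Probability of at least one false positive across k independent tests
error_probability <- 1 - (1 - significance_level)^number_of_tests

print(round(error_probability, 4))
```

For the three pairwise comparisons between stores A, B, and C, the error probability under this assumption is roughly `14%`, nearly three times the intended `5%`.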



Dangers of Multiple T-Tests

In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of `0.05` is to use ANOVA. _ANOVA (Analysis of Variance)_ tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you're saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the `stats` package function `aov()` to perform ANOVA on multiple datasets. `aov()` takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame `df_scores` as follows:

|group|score|
|-----|-----|
|math major|88|
|math major|81|
|writing major|92|
|writing major|80|
|psychology major|94|
|psychology major|83|

You can then run an ANOVA test with this line:

```r
results <- aov(score ~ group, data = df_scores)
```
Note: `score ~ group` indicates the relationship you want to analyze (i.e. how each `group`, or major, relates to `score` on the video game)

To retrieve the p-value from the results of calling `aov()`, use the `summary()` function:

```r
summary(results)
```

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than `0.05`), you can say you are reasonably confident that at least one pair of datasets is significantly different. After using only ANOVA, however, you can't make any conclusions about which populations differ.

Let's look at an example of ANOVA in action.
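As a minimal sketch, the table of video game scores above can be turned into a data frame and passed to `aov()`. With only two scores per group this is far too little data for a meaningful conclusion, so treat it purely as a demonstration of the mechanics. The names `anova_table` and `p_value` are illustrative.

```r
# Build the data frame from the table of scores
df_scores <- data.frame(
  group = c("math major", "math major",
            "writing major", "writing major",
            "psychology major", "psychology major"),
  score = c(88, 81, 92, 80, 94, 83)
)

results <- aov(score ~ group, data = df_scores)

# summary() returns a list whose first element is the ANOVA table;
# the p-value for `group` is the first entry of the "Pr(>F)" column
anova_table <- summary(results)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```

Extracting the p-value this way is handy when you want to compare it to a significance level in code rather than read it off the printed summary.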

ANOVA

Before you use numerical hypothesis tests, you need to be sure that the following things are true:

#### 1. The samples should each be normally distributed...ish

Data analysts in the real world often still perform hypothesis tests on datasets that aren't exactly normally distributed. What is more important is to recognize if there is some reason to believe that a normal distribution is especially unlikely. If your dataset is definitively not normal, the numerical hypothesis tests won't work as intended.

For example, imagine you have three datasets, each representing a day of traffic data in three different cities. Each dataset is independent, as traffic in one city should not impact traffic in another city. However, it is unlikely that each dataset is normally distributed. In fact, each dataset probably has two distinct peaks, one at the morning rush hour and one during the evening rush hour. The histogram of a day of traffic data might look something like this:

![histogram](https://content.codecademy.com/courses/learn-hypothesis-testing/lesson_ii/histogram_data_traffic.svg)

In this scenario, using a numerical hypothesis test would be inappropriate.

#### 2. The population standard deviations of the groups should be equal

For ANOVA and Two Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means.

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is "close enough" to 1. "Close enough" may differ in different contexts, but generally staying within `10%` should suffice.
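The ratio check takes a couple of lines in R. The sketch below uses two hypothetical samples (the values are made up for illustration; `group_b` is simply `group_a` shifted up by `10`, so the two groups have different means but identical spread).

```r
# Hypothetical samples (illustrative values, not from the lesson)
group_a <- c(12, 15, 14, 10, 13, 16, 11)
group_b <- c(22, 25, 24, 20, 23, 26, 21)

# Ratio of the two sample standard deviations
sd_ratio <- sd(group_a) / sd(group_b)
print(sd_ratio)

# "Close enough" to 1 here means within 10%
similar_spread <- abs(sd_ratio - 1) <= 0.10
print(similar_spread)
```

With real data the ratio will rarely be exactly `1`. The `10%` tolerance is the rule of thumb from the text, not a hard statistical cutoff.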

#### 3. The samples must be independent

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where it would seem the samples are not independent:

* the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen
* a group of patients' blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.

Assumptions of Numerical Hypothesis Tests

Phew! Nobody said hypothesis testing is easy, but you made it to the end of the lesson. Congratulations! The world of hypothesis testing is vast. There is much more you can learn, and so many applications where you can use it.

Let's review what you've learned in this lesson:
* _Samples_ are subsets of an entire _population_, and the _sample mean_ can be used to approximate the _population mean_
* The _null hypothesis_ is an assumption that there is no difference between the populations you are comparing in a hypothesis test
* _Type I Errors_ occur when a hypothesis test finds a correlation between things that are not related, and _Type II Errors_ occur when a hypothesis test fails to find a correlation between things that are actually related
* _P-Values_ indicate the probability that, assuming the null hypothesis is true, such differences in the samples you are comparing would exist
* The _Significance Level_ is a threshold p-value for which all p-values below it will result in rejecting the null hypothesis
* _One Sample T-Tests_ indicate whether a dataset belongs to a distribution with a given mean 
* _Two Sample T-Tests_ indicate whether there is a significant difference between two datasets 
* _ANOVA (Analysis of Variance)_ allows you to detect if there is a significant difference in means among three or more datasets

Review

Hypothesis Testing with R

Learn about the statistics used to run hypothesis tests. Then, learn how to use R to run different t-tests that compare distributions.

Learn R: Hypothesis Testing

In this lesson, you will learn how to find the median of a dataset by hand, and using base R. 

We will also discuss the strengths and limitations of using median as a descriptive statistic of a dataset.

In this lesson, you will learn how to find the *median* of a dataset &mdash; a common measure of a dataset's center. Each of the next three exercises will cover the following topics:
- Manually finding the median of a dataset
- Using R's median function to find the median of a dataset
- Interpreting what it means for a dataset to have similar and different median and mean values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the median to answer the question: 

*When are great authors most likely to publish their best work?*

If you are not familiar with mean, also known as average, we recommend that you learn about it in our lesson on <a href="https://www.codecademy.com/courses/learn-r/lessons/mean-r/exercises/introduction" target="_blank">average</a>.




The formal definition for the median of a dataset is:

*The value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, you either report both of the middle two values or their average.*

There are always two steps to finding the median of a dataset:
1. Order the values in the dataset from smallest to largest
2. Identify the number(s) that fall(s) in the middle 

#### Example One: Even Number of Values

Say we have a dataset with the following ten numbers: 
```tex
24,\ 16,\ 30,\ 10,\ 12,\ 28,\ 38,\ 2,\ 4,\ 36
```

The first step is to order these numbers from smallest to largest:

```tex
2,\ 4,\ 10,\ 12,\ [16,\ 24],\ 28,\ 30,\ 36,\ 38
```

Because this dataset has an even number of values, there are two medians: `16` and `24` &mdash; `16` has four datapoints to the left, and `24` has four datapoints to the right.        

Although you can report both values as the median, people often average them. If you averaged `16` and `24`, you could report the median as `20`.

#### Example Two: Odd Number of Values

If we added another value (say, `24`) to the dataset and sorted it, we would have: 
```tex
2,\ 4,\ 10,\ 12,\ 16,\ [24],\ 24,\ 28,\ 30,\ 36,\ 38
```
The new median is equal to `24`, because there are 5 values to the left of it, and 5 values to the right of it.
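The two steps can be written out in R to mirror the by-hand process. This is a sketch for illustration (the helper name `manual_median` is made up here); it averages the two middle values when the count is even.

```r
# Find the median by hand: sort, then pick the middle value(s)
manual_median <- function(values) {
  sorted <- sort(values)
  n <- length(sorted)
  if (n %% 2 == 1) {
    # Odd count: a single middle value
    sorted[(n + 1) / 2]
  } else {
    # Even count: average the two middle values
    mean(sorted[c(n / 2, n / 2 + 1)])
  }
}

ten_values <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)
print(manual_median(ten_values))         # 20
print(manual_median(c(ten_values, 24)))  # 24
```

The two calls reproduce Example One (ten values, median `20`) and Example Two (eleven values, median `24`).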


Median

Finding the median of a dataset becomes increasingly time-consuming as the size of your dataset increases &mdash; imagine finding the median of an unsorted dataset with 10,000 observations.

The R `median()` function can do the work of sorting, then finding the median for you. In the example below, we use `median()` to calculate the median of a dataset with eleven values:

```R
# Dataset with eleven values
example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36, 42)

# median() sorts the values internally and returns the middle one
example_median <- median(example_data)

print(example_median)
```

The code above prints the median of the dataset, `24`. The mean of this dataset is `22`. It's worth noting these two values are close to one another, but not equal.


Median in R

In this lesson, you learned how to find the median of a dataset in two steps:
1. Sort the dataset
2. Identify the one or two numbers that fall in the middle of the sorted dataset

You also learned how to calculate the median using R:
```R
median(my_data)
```

#### Discussion

Take a look at the histogram to the right. It displays the author age distribution with vertical lines for the mean (red) and median (blue).

Do you feel like the median of our dataset, 40.5, provides us enough information to claim when authors publish their greatest work?

We argue it does not.

Although the median is a good measure of the dataset's center, we cannot make a definitive claim about when authors publish their greatest work &mdash; the youngest author published at 18 and the oldest at 76. It would be irresponsible to say anything but, "it seems to be possible at almost any age."

Notice that the mean and the median are nearly equal. This is not a surprising result, as both statistics are a measure of the dataset's center. However, it's worth noting that these results will not always be so close.
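A small example shows how far apart the two statistics can drift. The dataset below is hypothetical and chosen so that one very large value pulls the mean upward while the median stays put.

```r
# Hypothetical dataset with one very large value
skewed_data <- c(1, 2, 3, 4, 100)

print(median(skewed_data))  # 3
print(mean(skewed_data))    # 22
```

When the mean and median differ this much, it is usually a sign that the data are skewed or contain outliers.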

In the instructions below, we've written a brief explanation that puts median in the context of our problem.

Review and Discussion

The sum of the values in the dataset must be divided by the number of values in the dataset (`n`).

The values in the dataset should be multiplied together.

You need to take the square root of all values in the dataset.

The mean of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle.

The median of a dataset is the value that, assuming the dataset is ordered from smallest to largest, falls in the middle. If there are an even number of values in a dataset, the middle two values are the median.

The median of a dataset is calculated by adding all of the values of the set together, then dividing the sum by the number of values in the set.

The median of a dataset is a measure of its spread. It is calculated by finding the average of the squared differences between every observation and the mean. The resulting value is in units squared.

\{6, 8, 9, 22, 90, 45, 2, 22, 45, 8, 22, 6, 7\}

\{15, 8, 9, 15, 12, 13, 2, 15, 13, 8, 13, 6, 7\}

Test your central tendency knowledge with this quiz on mean, median and mode.

Mean, Median, and Mode in R

In this project, you will use your knowledge of mean, median and mode to make conclusions about three boroughs in New York City: Brooklyn, Manhattan, and Queens.



We've imported data about one-bedroom apartments in three of New York City's boroughs: Brooklyn, Manhattan, and Queens. We saved the values to:

- `brooklyn_one_bed`
- `manhattan_one_bed`
- `queens_one_bed`

In this project, we only care about the price of apartments, so we saved the price of apartments in each borough to:

- `brooklyn_price`
- `manhattan_price`
- `queens_price`


If you want to see what the data stored in these variables looks like, you can type the variable names in a code block. When you click run, the variables (and their contents) will appear in the rendered notebook on the right.

Find the average value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_mean`.

Find the average value of one-bedroom apartments in Manhattan and save the value to `manhattan_mean`.

Find the average value of one-bedroom apartments in Queens and save the value to `queens_mean`.

Find the median value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_median`.

Find the median value of one-bedroom apartments in Manhattan and save the value to `manhattan_median`.

Find the median value of one-bedroom apartments in Queens and save the value to `queens_median`.

Find the mode value of one-bedroom apartments in Brooklyn and save the value to `brooklyn_mode`.

Find the mode value of one-bedroom apartments in Manhattan and save the value to `manhattan_mode`.

Find the mode value of one-bedroom apartments in Queens and save the value to `queens_mode`.
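Base R has no built-in function for the statistical mode (the built-in `mode()` reports an object's storage type instead). The project environment may supply its own helper, so the function below is only a sketch of one way to compute it with `table()`. The name `find_mode` and the example prices are assumptions for illustration.

```r
# Return the most frequent value(s) in a numeric vector
find_mode <- function(values) {
  counts <- table(values)
  modes <- names(counts)[counts == max(counts)]
  # table() stores the values as character names, so convert back to numbers
  as.numeric(modes)
}

# Hypothetical prices (illustrative values, not the project data)
example_prices <- c(2500, 3000, 2500, 3200, 3000, 2500)
print(find_mode(example_prices))  # 2500
```

If several values tie for most frequent, this sketch returns all of them.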

Now what?

We don't find the mean, median, and mode of a dataset for the sake of it.

The point is to make inferences from our data. What can you say about the housing prices in Brooklyn, Queens, and Manhattan, besides "It's really expensive to live in any of them"?

Take a minute to think through it. We added our thoughts to the hint.

Did you make any assumptions when you drew inferences in the previous task?

If so, what assumptions did you make? We added our thoughts to the hint.



Finally, think about what the histogram for each dataset will look like. 

If you have the time, take a minute to make a rough sketch of the histograms for the cost of a one-bedroom apartment in Brooklyn, Manhattan, and Queens. 

You can see someone else's attempt at a sketch of the Brooklyn histogram.

![Brooklyn Sketch](https://content.codecademy.com/courses/statistics/central-tendency/brooklyn-histogram.png)

When you're finished, open the hint to take a look at the actual histograms for Brooklyn, Manhattan, and Queens.

Central Tendency for Housing Data in R

In this lesson, you will learn how to calculate the average of a dataset by hand, and how to use R to calculate it for you. We will also discuss the strengths and limitations of using the average as a summary statistic of a dataset.


Finding the center of a dataset is one of the most common ways to summarize statistical findings. Often, people communicate the center of data using words like, *on average*, *usually*, or *often*.

In this lesson, you will learn how to calculate the *mean* of a dataset, a common measure of a dataset's center. We will use the mean to help us answer the question, 

*When are adults their most creative and productive?*

You could define "creative" and "productive" in a lot of ways, making this question impossible to fully answer by the end of this lesson. However, you will form an informed opinion on the question using data of the one hundred greatest novels of all time. 

We collected the dataset from a survey administered by the French literary magazine, Le Monde. From the dataset, you will calculate the average age of the authors when their books were published.


The *mean*, often referred to as the *average*, is a way to measure the center of a dataset. 

The average of a set is calculated using a two-step process:
1. Add all of the observations in your dataset.
2. Divide the total sum from step one by the number of points in your dataset.

```tex
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
```

The equation above is used to calculate mean. `x1`, `x2`, ... `xn` are observations from a dataset of `n` observations. 


#### Example

Imagine that we wanted to calculate the average of a dataset with the following four observations:
```R
data <- c(4, 6, 2, 8)
```
##### Step One: Calculate the total

```tex

4 + 6 + 2 + 8 = 20

```

##### Step Two: Divide by the number of observations

The total is equal to 20, and the number of observations is equal to 4.
```tex
\frac{20}{4} = 5
```

The average of this dataset is equal to 5.
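The two steps above can be sketched directly in R. This is a minimal illustration that reuses the four observations from the example; the variable names `observations`, `total`, `n`, and `average` are our own choices for this sketch.

```r
# The four observations from the example above
observations <- c(4, 6, 2, 8)

# Step one: add all of the observations
total <- sum(observations)

# Step two: divide the total by the number of observations
n <- length(observations)
average <- total / n

print(average)
```

Writing the steps out this way mirrors the hand calculation, which can make it easier to check your own arithmetic.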


Calculating Mean

While you've shown that you can calculate the average yourself, it becomes time-consuming as the size of your dataset increases &mdash; imagine adding all of the numbers in a dataset with 10,000 observations.

The R `mean()` function can do the work of adding and dividing for you. In the example below, we use `mean()` to calculate the average of a dataset with ten values:

```r
example_data <- c(24, 16, 30, 10, 12, 28, 38, 2, 4, 36)

example_average <- mean(example_data)

print(example_average)
```

The code above calculates the average of `example_data` and saves the value to `example_average`. The resulting average of this vector is `20`.


Mean in R

In this lesson, you learned how to calculate the average of a dataset using the formula:

```tex
\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}
```

and the R function:
```R
mean(my_data)
```

---

Circling back to the original question, do you feel like the average of our dataset, 42.12, provides us enough information to claim when someone is their most creative and productive?

Take a look at the histogram and mean (in red) to the right as you consider this question. 

We would say, **No**. Though we could argue against its use for a few reasons, below, we've highlighted two:
- The date of publication is not necessarily an author's most creative year. When did they start authoring the book? What factors impacted their writing during those years?
- The average age of the publishing dates for 100 authors may not accurately measure peak creativity in other professions. The average age of painters or sculptors may be very different.

So, what kind of information does the average provide us, and why would we use the average to describe something when we could display a histogram? 

The most important outcome is that we're able to use a single number as a measure of centrality. Although histograms provide more information, they are not a concise or precise measure of centrality &mdash; the reader must interpret it for themselves.


In this lesson, you will learn how to find the mode of a dataset manually, and using base R.


In this lesson, you will learn how to find the *mode* of a dataset. Each of the next three exercises will cover the following:
- Manually finding the mode of a dataset
- Using R's functions to find the mode
- Comparing mode to mean and median values

In the lesson, we will use a dataset of the 100 greatest novels, determined by a French literary magazine, Le Monde. From the dataset, you will use the mode to answer the question: 

*What is the most common age for a great author to publish their best work?*

If you are not familiar with mean, also known as average, or median, we recommend that you learn about it in our lessons on <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/average/exercises/introduction" target="_blank">average</a> and <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction" target="_blank">median</a>.




The formal definition for the mode of a dataset is:

*The most frequently occurring observation in the dataset. A dataset can have multiple modes if there is more than one value with the same maximum frequency.*

While you may be able to find the mode of a small dataset by simply looking through it, if you have trouble, we recommend you follow these two steps:
1. Find the frequency of every unique number in the dataset
2. Determine which number has the highest frequency

#### Example

Say we have a dataset with the following ten numbers: 
```tex
24,\ 16,\ 12,\ 10,\ 12,\ 28,\ 38,\ 12,\ 28,\ 24
```

Let's find the frequency of each number: 

<div class="narrative-table-container">

|*24*|*16*|*12*|*10*|*28*|*38*|
|-|-|-|-|-|-|
|2|1|3|1|2|1|

</div>
    

From the table, we can see that our mode is `12`, the most frequent number in our dataset. 
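The two manual steps can also be sketched in base R. This is one possible approach, not the only one: it assumes the base function `table()` for counting, and the names `frequencies`, `highest_count`, and `modes` are introduced here purely for illustration.

```r
example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

# Step one: count how often each unique value appears
frequencies <- table(example_data)
print(frequencies)

# Step two: keep the value(s) whose count equals the highest count
highest_count <- max(frequencies)
modes <- as.numeric(names(frequencies)[frequencies == highest_count])

print(modes)
```

Because the comparison keeps every value that reaches the highest count, the same sketch would return more than one number for a dataset with multiple modes.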


Mode

Finding the mode of a dataset becomes increasingly time-consuming as the size of your dataset increases &mdash; imagine finding the mode of a dataset with 10,000 observations.

The R package `DescTools` includes a handy `Mode()` function which can do the work of finding the mode for us. In the example below, we use `Mode()` to calculate the mode of a dataset with ten values:


#### Example: One Mode

```R
library(DescTools)

example_data <- c(24, 16, 12, 10, 12, 28, 38, 12, 28, 24)

example_mode <- Mode(example_data)
```

The code above calculates the mode of the values in `example_data` and saves it to `example_mode`. 

The result of `Mode()` is a vector with the mode value:
```
> example_mode
[1] 12
```


#### Example: Two Modes

If there are multiple modes, the `Mode()` function will return them as a vector.

Let's look at a vector with two modes, `12` and `24`:

```R
example_data <- c(24, 16, 12, 10, 12, 24, 38, 12, 28, 24)

example_mode <- Mode(example_data)
```

The result is:
```R
> example_mode
[1] 12 24
```

Mode with DescTools

In this lesson, you learned how to find the mode of a dataset in two steps:
1. Find the frequency of every unique number in the dataset
2. Determine which number has the highest frequency

You also learned how to calculate the mode using DescTools:
```R
Mode(my_data)
```

#### Discussion

In this lesson, you found that 38 was the most common age, at publication, for an author from the Le Monde survey. How does this number compare to your guess from the beginning of the lesson?

The mode is close to the median and mean of the dataset, but it is not in the tallest bucket. This should not be surprising, as the histogram indicates the data is centered between the ages of 30 and 50 &mdash; there is a higher chance of a mode in that range than outside of it.

The mode is not always this close to the median and mean, and often will not be in the tallest bucket.

Look at the 25-30 year-old bin. There are nine observations in it. If all the values in that bin happened to be 27, then the dataset's mode would be 27. Although unlikely, it is possible. Below, we show what this would look like:

![Mode set to 27](https://content.codecademy.com/courses/statistics/mode/mode-at-27.png)

Based on this graph, it is fair to say the mode may not always be a great measure of where the data is centered. Simply put, mode is a measure of the most frequent observation in the dataset, and is not an indication of the tallest bin in a histogram.
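To make the distinction concrete, here is a small sketch with made-up ages (not the Le Monde data) in which the mode sits outside the tallest bin. The five-year bins built with `cut()` are an assumption chosen only to imitate a histogram.

```r
# Hypothetical ages, made up for illustration only
ages <- c(27, 27, 27, 41, 42, 43, 44)

# Most frequent single value (the mode)
value_counts <- table(ages)
mode_value <- as.numeric(names(value_counts)[value_counts == max(value_counts)])

# Group the same ages into five-year bins, as a histogram would
bins <- cut(ages, breaks = seq(25, 45, by = 5))
bin_counts <- table(bins)
tallest_bin <- names(bin_counts)[which.max(bin_counts)]

print(mode_value)   # the single most frequent age
print(bin_counts)   # number of observations in each bin
print(tallest_bin)  # label of the bin with the most observations
```

Here the value `27` repeats three times, yet the bin covering ages 40 to 45 holds four different values and is therefore taller.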

In the instructions below, we've written a brief explanation that puts mode in the context of our problem.



Mode in R

The difference between the `i`th data point and the mean.

The sum of the difference between every data point and the mean.

The difference between the median and the mean of a dataset.

This is the entire equation for variance.

Squaring the difference results in a positive number. This prevents data points above and below the mean canceling each other out.

Squaring the difference makes the units of variance units squared, which is easier to interpret.

You don’t square the difference between every data point and the mean. Instead, you average the distance between every data point and the mean and _then_ square the result.

By squaring the difference, you calculate the standard deviation, which is an easier statistic to compare to other descriptive statistics like the mean.

Square the difference between each data point and the mean.

Divide the variance by the number of points in the dataset

A datapoint that is `3.5` standard deviations below the mean.

A datapoint that is `3` standard deviations above the mean.

A datapoint that is `1` standard deviation below the mean.

Test your understanding of the descriptive statistics variance and standard deviation.

Variance and Standard Deviation in R

In this lesson you will learn how to calculate and interpret the standard deviation of a dataset.


When beginning to work with a dataset, one of the first pieces of information you might want to investigate is the spread &mdash; is the data close together or far apart? One of the tools in our statistics toolbelt to do this is the descriptive statistic _variance_:

```tex
\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}
```

By finding the variance of a dataset, we can get a numeric representation of the spread of the data. If you want to take a deeper dive into how to calculate variance, check out our <a href = "https://www.codecademy.com/courses/statistics-variance-and-standard-deviation/lessons/variance/exercises/variance" target = "_blank">variance course</a>.

But what does that number really mean? How can we use this number to interpret the spread?

It turns out that variance isn't necessarily the best statistic for describing spread. Luckily, there is another statistic, standard deviation, that can be used instead.

In this lesson, we'll be working with two datasets. The first dataset contains the heights (in inches) of a random selection of players from the NBA. The second dataset contains the heights (in inches) of a random selection of users on the dating platform OkCupid.

Variance Recap

Variance is a tricky statistic to use because its units are different from both the mean and the data itself. For example, the mean of our NBA dataset is `77.98` inches. Because of this, we can say someone who is `80` inches tall is about two inches taller than the average NBA player.

However, because the formula for variance includes _squaring_ the difference between the data and the mean, the variance is measured in _units squared_. This means that the variance for our NBA dataset is `13.32` inches squared.

This result is hard to interpret in context with the mean or the data because their units are different. This is where the statistic _standard deviation_ is useful.

Standard deviation is computed by taking the square root of the variance. `sigma` is the symbol commonly used for standard deviation. Conveniently, `sigma` squared is the symbol commonly used for variance:

```tex
\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}}
```

In R, you can take the square root of a number using either `^ 0.5` or `sqrt()`; which one you use is up to you:

```R
num <- 25
num_square_root <- num ^ 0.5
```


Standard Deviation

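There is an R function dedicated to finding the standard deviation of a dataset, so we can cut out the step of first finding the variance. The R function `sd()` takes a dataset as a parameter and returns the standard deviation of that dataset. Note that `sd()` computes the *sample* standard deviation, which divides by `N - 1` rather than `N`, so its result is slightly larger than the value given by the population formula from the previous exercise: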

```R
dataset <- c(4, 8, 15, 16, 23, 42)
standard_deviation <- sd(dataset)
```
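The sketch below compares `sd()` with a direct translation of the population formula (dividing by `N`). It is meant only as an illustration; the names `population_sd`, `sample_sd`, and `rescaled_sd` are our own, and the rescaling factor follows from the two denominators, `N` and `N - 1`.

```r
dataset <- c(4, 8, 15, 16, 23, 42)
n <- length(dataset)

# Population standard deviation: square root of the average squared difference
population_sd <- sqrt(sum((dataset - mean(dataset))^2) / n)

# sd() returns the sample standard deviation, which divides by n - 1
sample_sd <- sd(dataset)

# Rescaling the sample value recovers the population value
rescaled_sd <- sample_sd * sqrt((n - 1) / n)

print(population_sd)
print(sample_sd)
print(rescaled_sd)
```

For large datasets the two versions are nearly identical, so the distinction matters most when there are only a handful of observations.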


Standard Deviation in R

Now that we're able to compute the standard deviation of a dataset, what can we do with it? 

Now that our units match, our measure of spread is easier to interpret. By finding the number of standard deviations a data point is away from the mean, we can begin to investigate how unusual that data point truly is. In fact, for data that is roughly normally distributed, you can expect around 68% of your data to fall within one standard deviation of the mean, 95% of your data to fall within two standard deviations of the mean, and 99.7% of your data to fall within three standard deviations of the mean.

<img src="https://content.codecademy.com/courses/statistics/variance/normal_curve.svg" alt="A histogram showing where the standard deviations fall">
If you have a data point that is over three standard deviations away from the mean, that's an incredibly unusual piece of data!
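As a rough sketch, the number of standard deviations between a data point and the mean can be computed by dividing their difference by the standard deviation. The mean and variance below are the NBA values quoted earlier in this lesson; the height of `80` inches is a hypothetical player chosen for illustration.

```r
# Values quoted in this lesson for the NBA dataset
nba_mean <- 77.98      # mean height in inches
nba_variance <- 13.32  # variance in inches squared

# Standard deviation is the square root of the variance
nba_sd <- sqrt(nba_variance)

# Hypothetical player height in inches, chosen for illustration
player_height <- 80

# Number of standard deviations between the player and the mean
deviations_from_mean <- (player_height - nba_mean) / nba_sd

print(round(deviations_from_mean, 2))
```

A positive result means the data point is above the mean, and a negative result means it is below.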


Using Standard Deviation

In the last exercise you saw that Lebron James was `0.55` standard deviations above the mean of NBA player heights. He's taller than average, but compared to the other NBA players, he's not absurdly tall.

However, compared to the OkCupid dating pool, he is extremely rare! He's almost three full standard deviations above the mean. You'd expect only about `0.15%` of people on OkCupid to be more than 3 standard deviations above the mean.

This is the power of standard deviation. By taking the square root of the variance, the standard deviation gives you a statistic about spread that can be easily interpreted and compared to the mean.


Side by side histograms of the two datasets and their respective stats lines including standard deviation and mean.

In this lesson, you will learn how to calculate and interpret the variance of a dataset.


Finding the mean, median, and mode of a dataset is a good way to start getting an understanding of the general shape of your data.

However, those three descriptive statistics only tell part of the story. Consider the two datasets below:

```R
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)
```

These two datasets have the same mean and median: both of those values happen to be `0`. If we only reported these two statistics, we would not be communicating any meaningful difference between these two datasets.

This is where *variance* comes into play. Variance is a descriptive statistic that describes how spread out the points in a data set are.
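A quick check in R confirms the point about the two datasets. This sketch uses only base functions; `range()` is included as a first, rough hint at the difference in spread and is not the statistic this lesson goes on to develop.

```r
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)

# Both measures of center are identical for the two datasets
print(c(mean(dataset_one), median(dataset_one)))
print(c(mean(dataset_two), median(dataset_two)))

# The smallest and largest values already hint at the difference in spread
print(range(dataset_one))
print(range(dataset_two))
```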


Variance

Now that you have learned the importance of describing the spread of a dataset, let's figure out how to mathematically compute this number.

How would you attempt to capture the spread of the data in a single number? 

Let's start with our intuition &mdash; we want the variance of a dataset to be a large number if the data is spread out, and a small number if the data is close together.

<img src="https://content.codecademy.com/courses/statistics/variance/two_histograms.svg" alt = "Two histograms. One with a large spread and one with a smaller spread.">

A lot of people may initially consider using the range of the data. But that only considers two points in your entire dataset. Instead, we can include every point in our calculation by finding the difference between every data point and the mean. 

<img src="https://content.codecademy.com/courses/statistics/variance/difference.svg" alt="The difference between the mean and four different points.">

If the data is close together, then each data point will tend to be close to the mean, and the difference will be small. If the data is spread out, the difference between every data point and the mean will be larger.

Mathematically, we can write this comparison as 

```tex
\text{difference} = X - \mu
```
Where `X` is a single data point and the Greek letter `mu` is the mean.
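In R, subtracting the mean from a vector applies the formula to every data point at once. The dataset below is hypothetical and the names `observations`, `mu`, and `differences` are our own.

```r
# Hypothetical dataset for illustration
observations <- c(4, 6, 2, 8)

# The mean of the dataset (mu in the formula)
mu <- mean(observations)

# Difference between each data point and the mean
differences <- observations - mu

print(differences)
```

Points below the mean produce negative differences, and points above the mean produce positive ones.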


Distance From Mean

We now have five different values that describe how far away each point is from the mean. That seems to be a good start in describing the spread of the data. But the whole point of calculating variance was to get one number that describes the dataset. We don't want to report five values &mdash; we want to combine those into one descriptive statistic.

To do this, we'll take the average of those five numbers. By adding those numbers together and dividing by `5`, we'll end up with a single number that describes the average distance between our data points and the mean.

Note that we're not _quite_ done yet &mdash; our final answer is going to look a bit strange here. There's a small problem that we'll fix in the next exercise.


Average Distances

We're almost there! We have one small problem with our equation. Consider this very small dataset:

```R
c(-5, 5)
```
The mean of this dataset is `0`, so when we find the difference between each point and the mean we get `-5 - 0 = -5` and `5 - 0 = 5`.

When we take the average of `-5` and `5` to get the variance, we get `0`:

```tex
\frac{-5 + 5}{2} = 0
``` 

Now think about what would happen if the dataset were `c(-200, 200)`. We'd get the same result! That can't possibly be right &mdash; the dataset with `200` is much more spread out than the dataset with `5`, so the variance should be much larger!

The problem here is with negative numbers. Because one of our data points was `5` units below the mean and the other was `5` units above the mean, they canceled each other out! 

When calculating variance, we don't care whether a data point is above or below the mean; all we care about is how far away it is. To get rid of those pesky negative numbers, we'll square the difference between each data point and the mean.

Our equation for finding the difference between a data point and the mean now looks like this:

```tex
\text{difference} = (X - \mu)^2
```
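The effect of squaring can be sketched with the two small datasets discussed above. The variable names are our own; the calculation simply averages the raw differences and then the squared differences.

```r
small_spread <- c(-5, 5)
large_spread <- c(-200, 200)

# Averaging the raw differences: positive and negative values cancel out
raw_small <- mean(small_spread - mean(small_spread))
raw_large <- mean(large_spread - mean(large_spread))

# Averaging the squared differences: every term is non-negative
squared_small <- mean((small_spread - mean(small_spread))^2)
squared_large <- mean((large_spread - mean(large_spread))^2)

print(c(raw_small, raw_large))
print(c(squared_small, squared_large))
```

With squaring, the more spread-out dataset finally produces the larger number.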


Square the Differences

Well done! You've calculated the variance of a data set. The full equation for the variance is as follows:

```tex
\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}
```
Let's dissect this equation a bit. 
* Variance is usually represented by the symbol sigma squared. 
* We start by taking every point in the dataset &mdash; from point number `1` to point number `N` &mdash; and finding the difference between that point and the mean. 
* Next, we square each difference to make all differences positive.
* Finally, we average those squared differences by adding them together and dividing by `N`, the total number of points in the dataset.

All of this work can be done quickly using a function we provided. The `variance()` function takes a vector of numbers as a parameter and returns the variance of that dataset.

```R
dataset <- c(3, 5, -2, 49, 10)
dataset_variance <- variance(dataset)
```
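The provided `variance()` function is not shown in this lesson, so the sketch below uses a hypothetical stand-in, `population_variance()`, that follows the equation above by dividing by `N`. It also prints the result of base R's `var()`, which divides by `N - 1` and therefore gives a larger value.

```r
# Hypothetical stand-in for the lesson's variance() helper (an assumption,
# not the course's actual code): divides by N, as in the equation above
population_variance <- function(x) {
  sum((x - mean(x))^2) / length(x)
}

dataset <- c(3, 5, -2, 49, 10)

print(population_variance(dataset))

# Base R's var() divides by N - 1 instead of N
print(var(dataset))
```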


Variance in R

Great work! In this lesson you've learned about variance and how to calculate it.

In the example used in this lesson, the importance of variance was highlighted by showing data from test scores in classes taught by two different teachers. What story does variance tell? What conclusions can we draw from this statistic?

<img src = "https://content.codecademy.com/courses/statistics/variance/teachers.png" alt = "The histogram of scores from two different teacher's classes">

In the class with low variance, it seems like the teacher strives to make sure all students have a firm understanding of the subject, but nobody is exemplary.

In the class with high variance, the teacher might focus more of their attention on certain students. This might enable some students to ace their tests, but other students get left behind.

If we only looked at statistics like mean, median, and mode, these nuances in the data wouldn't be represented.

In this project, you will try to find the best time to plan a trip to London by looking at a weather dataset containing over 39,000 pieces of weather-related data!

All of the weather data is stored in a variable named `london_data`. 

Print the first few rows of the dataset by calling `head(london_data)`.

Take a look at the browser to see the columns of this dataset. Here are two questions to ask yourself:
* How often were measurements taken?
* Which columns might be the most useful when thinking about planning a trip?

Comment out these print statements after looking through the dataframe `london_data`.

Let's also take a look at how many rows we have. Print `nrow(london_data)`.

Now that we've seen what the data looks like, let's dive into one of the more promising columns &mdash; `TemperatureC`. This column stores the temperature in Celsius.

To get a single column from a data frame, you can use this syntax:

```R
one_column <- london_data$column_name
```
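Here is a minimal sketch of the `$` syntax on a tiny, made-up data frame. The data frame `weather` and its values are hypothetical stand-ins and are not taken from `london_data`.

```r
# Hypothetical miniature data frame; the values are made up for illustration
weather <- data.frame(
  month = c("06", "06", "07", "07"),
  TemperatureC = c(15, 19, 18, 22)
)

# Pull a single column out of the data frame as a plain numeric vector
temperatures <- weather$TemperatureC

print(is.numeric(temperatures))
print(mean(temperatures))
print(sd(temperatures))
```

Because the result is an ordinary numeric vector, it can be passed straight to functions such as `mean()` and `sd()`.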

Create a variable named `temp` and set it equal to the `TemperatureC` column of `london_data`.

We can now calculate descriptive statistics about this column. To begin, find the average temperature in London in 2015. Store it in a variable named `average_temp`.

Calculate the variance of the temperature column and store the results in the variable `temperature_var`. Print the results.

Calculate the standard deviation of the temperature column and store it in a variable named `temperature_standard_deviation`. Print this variable.

How would the variance and standard deviation help you plan a trip?

The statistics we just calculated aren't very helpful when trying to plan a vacation since they describe the weather throughout an entire year.

If we could find a way to use the rows from only a certain month, that might help us find the best month to plan our trip.

Once again, print `head(london_data)` to see the first few rows of our data frame. Which column will help us get only the data points from January? In the browser you can scroll to the right to see more columns.

We want to filter by the `month` column! The following line of code will create a variable that holds the rows where `month` is `"06"`. These will be all of the rows from the month of June.

```r
june <- london_data %>%
  filter(month == "06")
```

Create this variable for June.


Create a variable named `july` that contains all of the data points from July. The code to do this should look very similar to your code that created the June variable. This time, we're interested in month `"07"`.

Calculate and print the mean temperature (`TemperatureC`) in London for both June and July using the `mean()` function.

What do these numbers tell you? If you wanted to visit London on the month that was, on average, cooler, which month would you pick? Look at the hint to see our thoughts!

Calculate and print the standard deviation of temperature in London for both June and July. Remember, the function you should use is `sd()`.

What do these numbers tell you? How might the standard deviation change your decision on when to visit London? Click on the hint to see our thoughts.

If you want to quickly see the mean and standard deviation of every month, use this block of code. 

```R
# Analyze by month
monthly_stats <- london_data %>%
    group_by(month) %>%
    summarize(mean = mean(TemperatureC),
              standard_deviation = sd(TemperatureC))

```
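If dplyr is not available, a similar per-month summary can be sketched in base R with `tapply()`. This is an alternative approach rather than the one used in this project, and the data frame `weather` below is made up for illustration.

```r
# Hypothetical miniature dataset; the values are made up for illustration
weather <- data.frame(
  month = c("06", "06", "06", "07", "07", "07"),
  TemperatureC = c(14, 16, 18, 15, 20, 25)
)

# Mean and standard deviation of temperature within each month
monthly_mean <- tapply(weather$TemperatureC, weather$month, mean)
monthly_sd <- tapply(weather$TemperatureC, weather$month, sd)

print(monthly_mean)
print(monthly_sd)
```

In this made-up data, July is warmer on average but also more variable than June.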

During which month would you most like to visit? If you wanted to pick the month with the least variable temperature, which one would you pick?


By looking at the mean and standard deviation of the temperature in London during each month of the year, we can get a sense of the best time to visit.

The spread of the data is important to consider if you are particularly sensitive to extreme days. For example, if you pick a month with a large standard deviation, you might have one day that is relatively cold while the following day is very hot.

Take some time to see if you can find more insights in this dataset. Here are some ideas we have for you:
* Look at columns other than `"TemperatureC"`. Can you find something interesting about the humidity or the air pressure? Can you find the rainiest month? London is notoriously rainy!
* Filter based on `"hour"`. Similar to how you filtered based on the month, are there certain hours that have higher variance than others?

Variance in Weather in R

In this project, we will investigate a dataset containing information about life expectancy in different countries.

We've imported a dataset containing the life expectancy in different countries. The data can be found in the variable named `data`.

To begin, let's get a sense of what this data looks like. Inspect the `head()` to see the first 6 rows of the dataset.

Look at the names of the columns. What other pieces of information does this dataset contain?

You may want to comment out this call to `head()` after looking at the data.

Let's isolate the column that contains the life expectancy and store it in a variable named `life_expectancy`. Normally, to get a single column from a data frame using dplyr, you would use this syntax:

```r
single_column <- data_frame_name %>%
  select(column_name)
```

Many of R's statistical functions, however, require a numeric vector, and `select()` returns a data frame even when it holds only one column. In order to use the function for calculating quantiles, we need to extract the column as a vector. We can do this by replacing the call to `select()` with the dplyr function `pull()`.

```r
single_column <- data_frame_name %>%
  pull(column_name)
```
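The difference between a one-column data frame and a vector can be illustrated in base R, without dplyr. In this sketch, single brackets play a role similar in spirit to `select()` and double brackets a role similar in spirit to `pull()`; the data frame `countries` and its values are hypothetical.

```r
# Hypothetical data frame; the values are made up for illustration
countries <- data.frame(
  Country = c("A", "B", "C"),
  life_expectancy = c(70.1, 75.4, 81.2)
)

# Single brackets keep the data frame structure
as_data_frame <- countries["life_expectancy"]

# Double brackets return the underlying numeric vector
as_vector <- countries[["life_expectancy"]]

print(is.data.frame(as_data_frame))
print(is.numeric(as_vector))
print(mean(as_vector))
```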

Make sure to pay attention to capitalization when using the column name!

We can now use R's statistical functions on that column! Let's use the `quantile()` function to find the quartiles of `life_expectancy`. Store the result in a variable named `life_expectancy_quartiles` and view the results.
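As a small sketch of how `quantile()` behaves, the example below uses a made-up vector rather than the life expectancy data. The `probs` argument is part of base R's `quantile()`; the variable names are our own.

```r
# Hypothetical values, made up for illustration
values <- c(4, 8, 15, 16, 23, 42)

# By default, quantile() reports the minimum, the three quartiles, and the maximum
print(quantile(values))

# Ask for only the three quartiles
quartiles <- quantile(values, probs = c(0.25, 0.5, 0.75))
print(quartiles)
```

The result is a named numeric vector, so each value is labeled with the percentage it corresponds to.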


Nice work! By looking at those three values you can get a sense of the spread of the data. For example, it seems like some of the data is fairly close together &mdash; a quarter of the data is between `72.5` years and `75.4` years.

Could you predict what the histogram might look like from those three numbers? Plot the histogram by using the following line of code. Does it look how you expected?

```r
hist(life_expectancy)
```


Let's take a moment to think about the meaning of these quartiles. If your country has a life expectancy of `70` years, does that fall in the first, second, third, or final quarter of the data?

Click on the hint to see the answer!

GDP is a measure of a country's wealth. Let's now use the GDP data to see if life expectancy is affected by this value. 

Let's split the data into two groups based on GDP. If we find the median GDP, we can create two datasets for "low GDP countries" and "high GDP countries."

To start, let's isolate the `GDP` column and store it in a variable named `gdp`. This should be similar to how you isolated the life expectancy column.


We now want to find the median GDP. You can use R's `median()` function, but since the median is also a quantile, we can call `quantile()` using `0.5` as the second argument. 

Store the median in a variable named `median_gdp`. View that variable to see the median.
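The sketch below checks, on a made-up vector, that `median()` and `quantile()` with `0.5` agree. The GDP figures are hypothetical and the variable names are our own.

```r
# Hypothetical GDP figures, made up for illustration
gdp_values <- c(1200, 3400, 560, 48000, 9100)

# Two ways to find the middle value
median_direct <- median(gdp_values)
median_from_quantile <- quantile(gdp_values, 0.5)

print(median_direct)
print(median_from_quantile)          # carries a "50%" label
print(unname(median_from_quantile))  # the same number without the label
```

The only visible difference is that `quantile()` attaches a name to its result, which `unname()` removes.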

Let's now grab all of the rows from our original dataset that have a GDP less than or equal to the median. The following code will do that for you:

```r
low_gdp <- data %>%
  filter(GDP <= median_gdp)
```

Do the same for all of the rows that have a GDP higher than the median. Store those rows in a variable named `high_gdp`. 

The line of code should look almost identical to the one above, but you should change the `<=` to `>`.


Now that we've split the data based on the `GDP`, let's grab the values from the `life_expectancy` column to more easily analyze the data. The code to grab the `life_expectancy` values from the low gdp countries is given below: 

```r
low_gdp <- data %>%
  filter(GDP <= median_gdp) %>%
  pull(life_expectancy)
```

Do the same for the high gdp countries, using `pull()` to grab the `life_expectancy` values.

Let's see how the life expectancy of each group compares to each other. 

Find the quartiles of `low_gdp`. Store the quartiles in a variable named `low_gdp_quartiles`. View the results.

Find the quartiles of the high GDP countries and store them in a variable named `high_gdp_quartiles`. This should look very similar to the last line of code you wrote. View the results.


By looking at the quantiles, you should get a sense of the spread and central tendency of these two datasets. But let's plot a histogram of each dataset to really compare them.

In the last code block, add these two lines of code:

```r
hist(low_gdp,col='red')
hist(high_gdp,col='blue')
```


We can now truly see the impact GDP has on life expectancy. 

Once again, consider a country that has a life expectancy of `70` years. If that country is in the top half of GDP countries, is it in the first, second, third, or fourth quarter of the data with respect to life expectancy? What if the country is in the bottom half of GDP countries? Check the hint to see our thoughts.


Life Expectancy By Country

In this lesson you will learn how to calculate and interpret the quartiles of a dataset.

A common way to communicate a high-level overview of a dataset is to find the values that split the data into four groups of equal size.

By doing this, we can then say whether a new datapoint falls in the first, second, third, or fourth quarter of the data.

<img src="https://content.codecademy.com/courses/statistics/quantiles/quartiles.svg" alt = "20 data points, with three lines splitting the data into 4 groups of 5.">

The values that split the data into fourths are the _quartiles_. 

Those values are called the first quartile (Q1), the second quartile (Q2), and the third quartile (Q3).

In the image above, Q1 is `10`, Q2 is `13`, and Q3 is `22`. Those three values split the data into four groups that each contain five datapoints.

In this lesson, you will learn to calculate the quartiles by hand, and by using base R functions.

Quartiles


We'll come back to the music dataset in a bit, but let's first practice on a small dataset.

Let's begin by finding the second quartile (Q2). Q2 happens to be exactly the <a href="https://www.codecademy.com/courses/learn-statistics-with-python/lessons/median/exercises/introduction">median</a>. Half of the data falls below Q2 and half of the data falls above Q2.

The first step in finding the quartiles of a dataset is to sort the data from smallest to largest. For example, below is an unsorted dataset:

```tex
c(8, 15, 4, -108, 16, 23, 42)
```

After sorting the dataset, it looks like this:

```tex
c(-108, 4, 8, 15, 16, 23, 42)
```

Now that the list is sorted, we can find Q2. In the example dataset above, Q2 (and the median) is `15` &mdash; there are three points below `15` and three points above `15`.

### Even Number of Datapoints

You might be wondering what happens if there is an even number of points in the dataset. For example, if we remove the `-108` from our dataset, it will now look like this:

```r
c(4, 8, 15, 16, 23, 42)
```

Q2 now falls somewhere between `15` and `16`. There are a couple of different strategies that you can use to calculate Q2 in this situation. One of the more common ways is to take the average of those two numbers. In this case, that would be `15.5`. 

Recall that you can find the average of two numbers by adding them together and dividing by two.
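
As a quick check, both cases above can be reproduced in R. This is a minimal sketch using base R's `sort()` and `median()`; the variable names are our own.

```r
# Odd number of datapoints: Q2 is the middle value after sorting
odd_dataset <- c(8, 15, 4, -108, 16, 23, 42)
sorted_odd <- sort(odd_dataset)
q2_odd <- median(odd_dataset)

# Even number of datapoints: median() averages the two middle values
even_dataset <- c(4, 8, 15, 16, 23, 42)
q2_even <- median(even_dataset)

print(sorted_odd)  # -108 4 8 15 16 23 42
print(q2_odd)      # 15
print(q2_even)     # 15.5
```

Note that `median()` sorts the data internally, so the explicit `sort()` call is only there to mirror the by-hand steps.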


The Second Quartile

Now that we've found Q2, we can use that value to help us find Q1 and Q3. Recall our demo dataset:

```r
c(-108, 4, 8, 15, 16, 23, 42)
```
In this example, Q2 is `15`. To find Q1, we take all of the data points smaller than Q2 and find the median of _those_ points. In this case, the points smaller than Q2 are:

```r
c(-108, 4, 8)
```
The median of that smaller dataset is `4`. That's Q1!

To find Q3, do the same process using the points that are larger than Q2. We have the following points:

```r
c(16, 23, 42)
```
The median of _those_ points is `23`. That's Q3! We now have three points that split the original dataset into four groups of equal size.
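
The by-hand steps above can be sketched in R. This is an illustration under a stated assumption: no other datapoint equals Q2, so strict comparisons are enough to leave Q2 out of both halves. The variable names are our own.

```r
dataset <- sort(c(8, 15, 4, -108, 16, 23, 42))
q2 <- median(dataset)

# Method one: leave Q2 out of both halves.
# Assumption: no other point equals Q2, so < and > split the data cleanly.
lower_half <- dataset[dataset < q2]  # -108 4 8
upper_half <- dataset[dataset > q2]  # 16 23 42

q1 <- median(lower_half)
q3 <- median(upper_half)

print(c(q1, q2, q3))  # 4 15 23
```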


Q1 and Q3

You just learned a commonly used method to calculate the quartiles of a dataset. However, there is another method that is equally accepted that results in different values! 

Note that there is no universally agreed upon method of calculating quartiles, and as a result, two different tools might report different results.

The second method includes Q2 when trying to calculate Q1 and Q3. Let's take a look at an example:

```r
c(-108, 4, 8, 15, 16, 23, 42)
```

Using the first method, we found Q1 to be `4`. When looking at all of the points below Q2, we excluded Q2. Using this second method, we _include_ Q2 in each half.

For example, when calculating Q1 using this new method, we would now find the median of this dataset:

```r
c(-108, 4, 8, 15)
```
Using this method, Q1 is `6`.
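
Here is a minimal sketch of the second method in R. It assumes a dataset with an odd number of points and no repeated values, so that Q2 is an actual datapoint that can be kept in both halves. The variable names are our own.

```r
dataset <- sort(c(-108, 4, 8, 15, 16, 23, 42))
q2 <- median(dataset)

# Method two: keep Q2 in both halves.
# Assumption: odd length and no repeated values, so <= and >= each pick up Q2 once.
lower_half <- dataset[dataset <= q2]  # -108 4 8 15
upper_half <- dataset[dataset >= q2]  # 15 16 23 42

q1 <- median(lower_half)
q3 <- median(upper_half)

print(c(q1, q3))  # 6 19.5
```

Compared with the first method, Q1 moves from `4` to `6` and Q3 moves from `23` to `19.5`, even though the dataset is the same.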

Method Two: Including Q2

We were able to find quartiles manually by looking at the dataset and finding the correct division points. But that gets much harder when the dataset starts to get bigger. Luckily, there is a function in base R that will find the quartiles for you.

The base R function that we'll be using is named `quantile()`. You can learn more about quantiles in our <a href="https://www.codecademy.com/courses/learn-r/lessons/quantiles-r/exercises/quantiles">quantiles lesson</a>, but for right now all you need to know is that a quartile is a specific kind of quantile.

The code below calculates the third quartile of the given dataset:

```r
dataset <- c(50, 10, 4, -3, 4, -20, 2)
third_quartile <- quantile(dataset, 0.75)
```

The `quantile()` function takes two parameters. The first is the dataset you're interested in. The second is a number between `0` and `1`. Since we calculated the third quartile, we used `0.75` &mdash; we want the point that splits the first 75% of the data from the rest.

For the second quartile, we'd use `0.5`. This will give you the point that 50% of the data is below and 50% is above.

Notice that the dataset doesn't need to be sorted for R's function to work!
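
As a small extension of the example above, `quantile()` also accepts a vector of probabilities, so all three quartiles can be requested in one call. The sketch below also checks that sorting the data first makes no difference; the variable names are our own.

```r
dataset <- c(50, 10, 4, -3, 4, -20, 2)

# Passing a vector of probabilities returns all three quartiles at once
quartiles <- quantile(dataset, c(0.25, 0.5, 0.75))
print(quartiles)  # 25%: -0.5, 50%: 4, 75%: 7

# Sorting first does not change the result
sorted_quartiles <- quantile(sort(dataset), c(0.25, 0.5, 0.75))
same_result <- isTRUE(all.equal(quartiles, sorted_quartiles))
print(same_result)  # TRUE
```

`quantile()` also has a `type` argument that selects among several calculation methods, which is one reason two tools can report different quartiles for the same data.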


Quartiles in R

Great work! You now know how to calculate the quartiles of any dataset by hand and with R. 

Quartiles are some of the most commonly used descriptive statistics. For example, you might see schools or universities think about quartiles when considering which students to accept. Businesses might compare their revenue to other companies by looking at quartiles.

In fact, quartiles are so commonly used that the three quartiles, along with the minimum and the maximum values of a dataset, are called the five-number summary of the dataset. These five numbers help you quickly get a sense of the range, centrality, and spread of the dataset.
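
One way to get these five numbers in R is to call `quantile()` without any probabilities. This is a sketch, not the only option; the variable names are our own.

```r
dataset <- c(50, 10, 4, -3, 4, -20, 2)

# With no probabilities given, quantile() returns the
# 0%, 25%, 50%, 75%, and 100% points of the data
five_numbers <- quantile(dataset)
print(five_numbers)  # 0%: -20, 25%: -0.5, 50%: 4, 75%: 7, 100%: 50
```

Base R also has a `fivenum()` function. It uses a slightly different convention for Q1 and Q3, so it can return different values for some datasets.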

Quartiles Review


In this lesson, you will learn how to calculate the interquartile range of a dataset.

One of the most common statistics to describe a dataset is the _range_. The range of a dataset is the difference between the maximum and minimum values. While this descriptive statistic is a good start, it is important to consider the impact outliers have on the results:

<img src="https://content.codecademy.com/courses/statistics/quantiles/outliers.svg" alt="A dataset with some outliers.">

In this image, most of the data is between `0` and `15`. However, there is one large negative outlier (`-20`) and one large positive outlier (`40`). This makes the range of the dataset `60` (the difference between `40` and `-20`). That's not very representative of the spread of the majority of the data!

The _interquartile range_ (IQR) is a descriptive statistic that tries to solve this problem. The IQR ignores the tails of the dataset, so you know the range around which your data is centered.

In this lesson, we'll teach you how to calculate the interquartile range and interpret it.
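
To see the effect of outliers on the range, here is a minimal sketch. The dataset is hypothetical: it is made up to resemble the image, with most values between `0` and `15` plus the two outliers.

```r
# Hypothetical data: mostly between 0 and 15, plus two outliers
dataset <- c(-20, 1, 3, 4, 6, 8, 9, 11, 14, 40)

# The range is the difference between the maximum and the minimum
data_range <- max(dataset) - min(dataset)
print(data_range)  # 60

# The same calculation without the two outliers
trimmed <- dataset[dataset > -20 & dataset < 40]
trimmed_range <- max(trimmed) - min(trimmed)
print(trimmed_range)  # 13
```

Base R's `range()` function returns the minimum and maximum themselves rather than their difference, which is why the sketch uses `max()` and `min()` directly.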


Range Review

The interquartile range is the difference between the third quartile (Q3) and the first quartile (Q1). If you need a refresher on quartiles, you can take a look at our <a href="https://www.codecademy.com/courses/quartiles-quantiles-and-interquartile-range-r/lessons/quartiles-r/exercises/quartiles">lesson</a>. 

For now, all you need to know is that the first quartile is the value that separates the first 25% of the data from the remaining 75%.

The third quartile is the opposite &mdash; it separates the first 75% of the data from the remaining 25%.

<img src="https://content.codecademy.com/courses/statistics/quantiles/interquartile.svg" alt="The interquartile range of the dataset is shown to be between Q3 and Q1.">

The interquartile range is the difference between these two values.
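
This definition can be sketched in R by finding Q1 and Q3 with `quantile()` and subtracting. The dataset below is hypothetical, and the variable names are our own.

```r
# Hypothetical data with one low outlier and one high outlier
dataset <- c(-20, 1, 3, 4, 6, 8, 9, 11, 14, 40)

# names = FALSE drops the "25%" / "75%" labels from the results
q1 <- quantile(dataset, 0.25, names = FALSE)
q3 <- quantile(dataset, 0.75, names = FALSE)

interquartile_range <- q3 - q1
print(c(q1, q3))            # 3.25 10.5
print(interquartile_range)  # 7.25
```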

In the last exercise, we calculated the IQR by finding the quartiles using R and finding the difference ourselves. The stats library has a function that can calculate the IQR all in one step.

The `IQR()` function takes a dataset as a parameter and returns the interquartile range.

```r
dataset <- c(4, 10, 38, 85, 193)
interquartile_range <- IQR(dataset)
```
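
As a sanity check, the sketch below compares `IQR()` with the difference between Q3 and Q1 computed by hand for the same dataset. The name `manual_iqr` is our own.

```r
dataset <- c(4, 10, 38, 85, 193)

# One step: IQR() from the stats package
interquartile_range <- IQR(dataset)

# Two steps: find Q3 and Q1, then subtract
manual_iqr <- quantile(dataset, 0.75, names = FALSE) -
  quantile(dataset, 0.25, names = FALSE)

print(interquartile_range)  # 75
print(isTRUE(all.equal(interquartile_range, manual_iqr)))  # TRUE
```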


IQR in R

Nice work! You can now calculate the interquartile range of a dataset using R. The main takeaway is that the IQR, like the range, is a statistic that describes spread; it focuses on the middle of the data.

However, unlike the range, the IQR is robust. A statistic is robust when outliers have little impact on it. For example, the IQRs of the two datasets below are identical, even though one has a massive outlier.

```r
dataset_one <- c(6, 9, 10, 45, 190, 200) # IQR is 144.5
dataset_two <- c(6, 9, 10, 45, 190, 20000000) # IQR is 144.5
```
By looking at the IQR instead of the range, you can get a better sense of the spread of the _middle_ of the data.
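
To make the contrast concrete, here is a short sketch that computes both the range and the IQR for the two datasets above. The variable names for the results are our own.

```r
dataset_one <- c(6, 9, 10, 45, 190, 200)
dataset_two <- c(6, 9, 10, 45, 190, 20000000)

# The range changes dramatically when the outlier is added
range_one <- max(dataset_one) - min(dataset_one)  # 194
range_two <- max(dataset_two) - min(dataset_two)  # 19999994

# The IQR does not change at all
iqr_one <- IQR(dataset_one)  # 144.5
iqr_two <- IQR(dataset_two)  # 144.5

print(c(range_one, range_two))
print(c(iqr_one, iqr_two))
```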

The interquartile range is displayed in a commonly-used graph &mdash; the box plot.

<img src="https://content.codecademy.com/courses/statistics/quantiles/boxplot.png" alt="A box plot">

In a box plot, the ends of the box are Q1 and Q3. So the length of the box is the IQR.
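
The sketch below checks this relationship numerically using `boxplot.stats()`, which returns the values that R's `boxplot()` function draws. The dataset is reused from the `IQR()` example, and the variable names are our own.

```r
dataset <- c(4, 10, 38, 85, 193)

# boxplot.stats()$stats holds five values: lower whisker, lower end of
# the box, median, upper end of the box, upper whisker
box_values <- boxplot.stats(dataset)$stats

box_length <- box_values[4] - box_values[2]
print(box_length)    # 75
print(IQR(dataset))  # 75

# boxplot(dataset) would draw the plot itself
```

One caveat: R draws the ends of the box at values called hinges. They match the quartiles from `quantile()` for this dataset, but they can differ slightly for others, so the box length will not always equal `IQR()` exactly.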


Interquartile Range

The IQR is a robust statistic because outliers have little effect on it.

The IQR is a robust statistic because we are more confident in the value of the IQR than the range.

The IQR is a robust statistic because its units are the same as other descriptive statistics, like the mean.

The IQR is a robust statistic because its value is close to the mean of the dataset.

There are several methods to calculate the value of the quantile. One way is to take the average of the two points that the quantile falls between.

The only accepted way to calculate the value of the quantile is to take the average of the two points that the quantile falls between.

You can’t calculate the value of a quantile if it falls between two points in the dataset.

Test your understanding of quantiles, quartiles, and the interquartile range.

Quantiles, Quartiles, and IQR

In this lesson, you'll learn how to calculate quantiles using R.

Quantiles are points that split a dataset into groups of equal size. For example, let's say you just took a test and wanted to know whether you're in the top 10% of the class. One way to determine this would be to split the data into ten groups with an equal number of datapoints in each group and see which group you fall into.

<img src="https://content.codecademy.com/courses/statistics/quantiles/deciles.svg" alt="Thirty students grades split into ten groups of three."> 

There are nine values that split the dataset into ten groups of equal size &mdash; each group has 3 different test scores in it. 

Those nine values that split the data are quantiles! Specifically, they are the 10-quantiles, or deciles.

You can find any number of quantiles. For example, if you split the dataset into 100 groups of equal size, the 99 values that split the data are the 100-quantiles, or percentiles.

The <a href="https://www.codecademy.com/courses/learn-r/lessons/quartiles-r/exercises/quartiles">quartiles</a> are some of the most commonly used quantiles. The quartiles split the data into four groups of equal size.

In this lesson, we'll show you how to calculate quantiles using R and discuss some of the most commonly used quantiles.

Quantiles

Base R has a function named `quantile()` that will quickly calculate the quantiles of a dataset for you.

`quantile()` takes two parameters. The first is the dataset that you are using. The second parameter is a single number or a vector of numbers between `0` and `1`. These numbers represent the places in the data where you want to split.

For example, if you only wanted the value that split the first 10% of the data apart from the remaining 90%, you could use this code:

```r
dataset <- c(5, 10, -20, 42, -9, 10)
ten_percent <- quantile(dataset, 0.10)
```

`ten_percent` now holds the value `-14.5`. This result _technically_ isn't a quantile, because it isn't splitting the dataset into groups of equal sizes &mdash; this value splits the data into one group with 10% of the data and another with 90%. 

However, it would still be useful if you were curious about whether a data point was in the bottom 10% of the dataset.
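
For instance, the sketch below checks whether a new data point falls in the bottom 10% of the dataset. The value of `new_point` is hypothetical, chosen only for illustration.

```r
dataset <- c(5, 10, -20, 42, -9, 10)

# names = FALSE drops the "10%" label so the result is a plain number
ten_percent <- quantile(dataset, 0.10, names = FALSE)
print(ten_percent)  # -14.5

# Hypothetical new data point
new_point <- -18
in_bottom_ten <- new_point < ten_percent
print(in_bottom_ten)  # TRUE
```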

Quantiles in R

In the last exercise, we found a single "quantile" &mdash; we split the first 23% of the data away from the remaining 77%.

However, quantiles are usually a set of values that split the data into groups of equal size. For example, if you wanted to get the 5-quantiles, or the four values that split the data into five groups of equal size, you could use this code:

```r
dataset <- c(5, 10, -20, 42, -9, 10)
five_quantiles <- quantile(dataset, c(0.2, 0.4, 0.6, 0.8))
```
Note that we had to do a little math in our head to make sure that the values `c(0.2, 0.4, 0.6, 0.8)` split the data into groups of equal size. Each group has 20% of the data.

<img src="https://content.codecademy.com/courses/statistics/quantiles/even.svg" alt="The data is split into 5 groups where each group has 4 datapoints.">

If we used the values `c(0.2, 0.4, 0.7, 0.8)`, the function would return the four values at those split points. However, those values wouldn't split the data into five equally sized groups. One group would only have 10% of the data and another group would have 30% of the data!

<img src="https://content.codecademy.com/courses/statistics/quantiles/uneven.svg" alt="The data is split into groups of uneven size. One group has 6 data points and one group only has 2.">
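
Rather than working out the split points by hand, you can let R generate evenly spaced ones. This is a minimal sketch; `groups` and `split_points` are names we made up for illustration.

```r
dataset <- c(5, 10, -20, 42, -9, 10)
groups <- 5

# Split points for k equal groups: 1/k, 2/k, ..., (k-1)/k
split_points <- seq_len(groups - 1) / groups
print(split_points)  # 0.2 0.4 0.6 0.8

five_quantiles <- quantile(dataset, split_points)
print(five_quantiles)  # 20%: -9, 40%: 5, 60%: 10, 80%: 10
```

Changing `groups` to `10` would produce the nine split points for deciles in the same way.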

Many Quantiles

One of the most common quantiles is the 2-quantile. This value splits the data into two groups of equal size. Half the data will be above this value, and half the data will be below it. This is also known as the <a href="https://www.codecademy.com/courses/mean-median-mode-r-stats/lessons/median-r/exercises/introduction">median</a>!

<img src="https://content.codecademy.com/courses/statistics/quantiles/median.svg" alt="Ten points are below the median and ten points are above the median.">

The 4-quantiles, or the <a href="https://www.codecademy.com/courses/quartiles-quantiles-and-interquartile-range-r/lessons/quartiles-r/exercises/quartiles">quartiles</a>, split the data into four groups of equal size. We found the quartiles in the previous exercise.

<img src="https://content.codecademy.com/courses/statistics/quantiles/quartiles.svg" alt="Quartiles split a dataset of 20 points into 4 groups with 5 points each">

Finally, the percentiles, or the values that split the data into 100 groups, are commonly used to compare new data points to the dataset. You might hear statements like "You are above the 80th percentile in height". This means that your height is above whatever value splits the first 80% of the data from the remaining 20%.
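
Both ideas can be checked with `quantile()`. In the sketch below, the heights (in inches) and `my_height` are hypothetical values chosen only for illustration.

```r
# Hypothetical heights in inches
heights <- c(60, 62, 63, 64, 65, 66, 67, 68, 70, 72, 75)

# The 2-quantile is the same value as the median
two_quantile <- quantile(heights, 0.5, names = FALSE)
print(two_quantile == median(heights))  # TRUE

# The 80th percentile splits the first 80% of the data from the rest
eightieth <- quantile(heights, 0.80, names = FALSE)
my_height <- 73
above_eightieth <- my_height > eightieth
print(above_eightieth)  # TRUE
```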


Common Quantiles

Nice work! Here are some of the major takeaways about quantiles:

* Quantiles are values that split a dataset into groups of equal size.
* If you have `n` quantiles, the dataset will be split into `n+1` groups of equal size.
* The median is a quantile. It is the only 2-quantile. Half the data falls below the median and half falls above the median.
* Quartiles and percentiles are other common quantiles. Quartiles split the data into 4 groups while percentiles split the data into 100 groups.
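
The second takeaway can be verified with a short sketch that counts how many points land between consecutive quartiles. The dataset of 20 points is hypothetical, and the variable names are our own.

```r
dataset <- 1:20  # hypothetical dataset of 20 points

# Three quartiles
quartiles <- quantile(dataset, c(0.25, 0.5, 0.75), names = FALSE)

# Assign every point to the group between consecutive split points
groups <- cut(dataset, breaks = c(-Inf, quartiles, Inf))
group_sizes <- as.vector(table(groups))

print(length(quartiles))  # 3
print(group_sizes)        # 5 5 5 5
```

Three split points produce four groups, and each group holds the same number of points.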

Quantiles Review

The mean of a subset of our population which is hopefully, but not necessarily, representative of the overall average.

The total average of all data from a dataset.

A randomly selected group of data points from our population.

How a sample acts when it’s had a bad day.

Local honey has no effect on allergies; any relationship between consuming local honey and allergic outbreaks is due to chance.

Local honey cures allergies; eating local honey will lower the number of allergic outbreaks.

Local honey causes allergies; eating local honey will raise the number of allergic outbreaks.

In a hypothesis test, a p-value is the probability that the null hypothesis is true.

In a hypothesis test, a p-value is a statistical value that is greater than 0.05.

A value, selected before a hypothesis test, that will tell us whether a test is significant or not.

A survey on preferred ice cream flavors not establishing a clear favorite when the majority of people prefer chocolate.

Practice what you've learned about hypothesis testing with R in this quiz!

### Why Learn Statistics with R? 

This course is a great introduction to both fundamental statistics concepts and the R programming language. R is used by professionals in the Data Analysis and Data Science fields as part of their daily work. 



### Take-Away Skills 

In this course, you will dive into the world of statistics using the popular language R. Starting with a handful of descriptive statistics like mean, median, mode, variance, and IQR, you will use R to better describe your datasets. You’ll then learn how to make hypotheses and test those hypotheses. By learning these skills in R, you will grow your ability to describe and analyze your data!


Learn how to implement statistical models and run hypothesis tests in R.