In this lesson, you will learn how to calculate and interpret the variance of a dataset.


Finding the mean, median, and mode of a dataset is a good way to start understanding the general shape of your data.

However, those three descriptive statistics only tell part of the story. Consider the two datasets below:

```R
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)
```

These two datasets have the same mean and median: both values happen to be `0`. If we only reported these two statistics, we would not be communicating any meaningful difference between these two datasets.
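As a quick check, here is a minimal sketch that uses base R's `mean()` and `median()` on the two vectors defined above. The variable names are only illustrative.

```R
dataset_one <- c(-4, -2, 0, 2, 4)
dataset_two <- c(-400, -200, 0, 200, 400)

# Both the mean and the median are 0 for each dataset
mean_one <- mean(dataset_one)
mean_two <- mean(dataset_two)
median_one <- median(dataset_one)
median_two <- median(dataset_two)

print(c(mean_one, mean_two, median_one, median_two))
```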

This is where *variance* comes into play. Variance is a descriptive statistic that describes how spread out the points in a dataset are.


Variance

Now that you have learned the importance of describing the spread of a dataset, let's figure out how to mathematically compute this number.

How would you attempt to capture the spread of the data in a single number? 

Let's start with our intuition &mdash; we want the variance of a dataset to be a large number if the data is spread out, and a small number if the data is close together.

<img src="https://content.codecademy.com/courses/statistics/variance/two_histograms.svg" alt = "Two histograms. One with a large spread and one with a smaller spread.">

A lot of people may initially consider using the range of the data. But that only considers two points in your entire dataset. Instead, we can include every point in our calculation by finding the difference between every data point and the mean. 

<img src="https://content.codecademy.com/courses/statistics/variance/difference.svg" alt="The difference between the mean and four different points.">

If the data is close together, then each data point will tend to be close to the mean, and the difference will be small. If the data is spread out, the difference between every data point and the mean will be larger.

Mathematically, we can write this comparison as 

```tex
\text{difference} = X - \mu
```
Where `X` is a single data point and the Greek letter `mu` is the mean.
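To make this concrete, here is a small sketch that applies the formula to every point at once. It reuses `dataset_one` from earlier in the lesson; the names `mu` and `differences` are only illustrative.

```R
dataset_one <- c(-4, -2, 0, 2, 4)

# mu: the mean of the dataset
mu <- mean(dataset_one)

# R subtracts the mean from every element of the vector
differences <- dataset_one - mu
print(differences)
```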


Distance From Mean

We now have five different values that describe how far away each point is from the mean. That seems to be a good start in describing the spread of the data. But the whole point of calculating variance was to get one number that describes the dataset. We don't want to report five values &mdash; we want to combine those into one descriptive statistic.

To do this, we'll take the average of those five numbers. By adding those numbers together and dividing by `5`, we'll end up with a single number that describes the average distance between our data points and the mean.

Note that we're not _quite_ done yet &mdash; our final answer is going to look a bit strange here. There's a small problem that we'll fix in the next exercise.


Average Distances

We're almost there! We have one small problem with our equation. Consider this very small dataset:

```R
c(-5, 5)
```
The mean of this dataset is `0`, so when we find the difference between each point and the mean we get `-5 - 0 = -5` and `5 - 0 = 5`.

When we take the average of `-5` and `5` to get the variance, we get `0`:

```tex
\frac{-5 + 5}{2} = 0
``` 

Now think about what would happen if the dataset were `c(-200, 200)`. We'd get the same result! That can't possibly be right &mdash; the dataset with `200` is much more spread out than the dataset with `5`, so the variance should be much larger!

The problem here is with negative numbers. Because one of our data points was `5` units below the mean and the other was `5` units above the mean, they canceled each other out! 
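You can see the cancellation for yourself with a short sketch. The helper `average_difference()` is a hypothetical name used only for this illustration, not a function from the lesson.

```R
small_spread <- c(-5, 5)
large_spread <- c(-200, 200)

# Average of the raw (signed) differences from the mean
average_difference <- function(x) {
  mean(x - mean(x))
}

# Both results are 0, even though the spreads are very different
print(average_difference(small_spread))
print(average_difference(large_spread))
```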

When calculating variance, we don't care whether a data point is above or below the mean; all we care about is how far away it is. To get rid of those pesky negative numbers, we'll square the difference between each data point and the mean.

Our equation for finding the difference between a data point and the mean now looks like this:

```tex
\text{difference} = (X - \mu)^2
```
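With the squared version of the formula, the two small datasets from this exercise no longer look identical. A minimal sketch, with illustrative variable names:

```R
small_spread <- c(-5, 5)
large_spread <- c(-200, 200)

# Squaring each difference stops points below the mean
# from canceling out points above it
squared_small <- (small_spread - mean(small_spread))^2
squared_large <- (large_spread - mean(large_spread))^2

print(squared_small)  # 25 25
print(squared_large)  # 40000 40000
```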


Square the Differences

Well done! You've calculated the variance of a data set. The full equation for the variance is as follows:

```tex
\sigma^2 = \frac{\sum_{i=1}^{N}{(X_i -\mu)^2}}{N}
```
Let's dissect this equation a bit. 
* Variance is usually represented by the symbol sigma squared. 
* We start by taking every point in the dataset &mdash; from point number `1` to point number `N` &mdash; and finding the difference between that point and the mean. 
* Next, we square each difference to make all differences positive.
* Finally, we average those squared differences by adding them together and dividing by `N`, the total number of points in the dataset.
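
As a worked example, applying the equation to `dataset_one` from the start of the lesson (`-4, -2, 0, 2, 4`, with a mean of `0`) gives:

```tex
\sigma^2 = \frac{(-4 - 0)^2 + (-2 - 0)^2 + (0 - 0)^2 + (2 - 0)^2 + (4 - 0)^2}{5} = \frac{16 + 4 + 0 + 4 + 16}{5} = \frac{40}{5} = 8
```

The same steps applied to `dataset_two` give `80000`, so the variance separates two datasets that the mean and median could not.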

All of this work can be done quickly using a function we have provided. The `variance()` function takes a vector of numbers as its argument and returns the variance of that dataset.

```R
dataset <- c(3, 5, -2, 49, 10)
var <- variance(dataset)
```
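
The lesson does not show how `variance()` is implemented. The sketch below is one assumed implementation that follows the equation above by dividing by `N`. Note that base R's built-in `var()` divides by `N - 1` instead, so it returns a larger value for the same data.

```R
# Assumed implementation of the lesson's variance() helper:
# the population variance, dividing by N
variance <- function(x) {
  mean((x - mean(x))^2)
}

dataset <- c(3, 5, -2, 49, 10)
dataset_variance <- variance(dataset)
print(dataset_variance)  # 338.8

# Base R's var() computes the sample variance, dividing by N - 1
print(var(dataset))      # 423.5
```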


Variance in R

Great work! In this lesson you've learned about variance and how to calculate it.

In the example used in this lesson, the importance of variance was highlighted by showing data from test scores in classes taught by two different teachers. What story does variance tell? What conclusions can we draw from this statistic?

<img src = "https://content.codecademy.com/courses/statistics/variance/teachers.png" alt = "The histogram of scores from two different teachers' classes">

In the class with low variance, it seems like the teacher strives to make sure all students have a firm understanding of the subject, but nobody is exemplary.

In the class with high variance, the teacher might focus more of their attention on certain students. This might enable some students to ace their tests, but other students get left behind.

If we only looked at statistics like mean, median, and mode, these nuances in the data wouldn't be represented.

Review

You will calculate the variance and standard deviation of datasets about student grades and the heights of OkCupid users and NBA players. In the project you'll find the best time to travel to London by looking at a weather dataset.



In this module, you will learn how to quantify the spread of the dataset by calculating the variance and standard deviation in R.

Learn R: Variance and Standard Deviation

The mean of a subset of our population which is hopefully, but not necessarily, representative of the overall average.

The total average of all data from a dataset.

A randomly selected group of data points from our population.

How a sample acts when it’s had a bad day.

Local honey has no effect on allergies; any relationship between consuming local honey and allergic outbreaks is due to chance.

Local honey cures allergies; eating local honey will lower the number of allergic outbreaks.

Local honey causes allergies; eating local honey will raise the number of allergic outbreaks.

In a hypothesis test, a p-value is the probability that the null hypothesis is true.

In a hypothesis test, a p-value is a statistical value that is greater than 0.05.

A value, selected before a hypothesis test, that will tell us whether a test is significant or not.

A survey on preferred ice cream flavors not establishing a clear favorite when the majority of people prefer chocolate.

A doctor’s test for malaria coming back positive for someone who doesn’t have malaria.

A way of quantifying the truth of a statement.

A test that will prove a scientific theorem, statistically.

A test where you compare three or more numeric data sets, hoping to find the variance.

Practice what you've learned about hypothesis testing with R in this quiz!

Hypothesis Testing with R

In this project we will be using statistical techniques to make statements and draw conclusions about a blood transfusion company's userbase.

Familiar wants to know if its most basic package, the Vein Pack, actually has a significant impact on subscribers. It would be a marketing goldmine if the company could show that subscribers to the Vein Pack live longer than the general public.

Lifespans of Vein Pack users are given in the variable `vein_lifespans`, which has been loaded into `notebook.Rmd`. View `vein_lifespans`.

You'd like to find out if the average lifespan of a Vein Pack subscriber is _significantly different_ from the average life expectancy of `71` years. 

Begin by finding the mean lifespan of users of the Vein Pack, and save the result to `vein_lifespans_mean`. View `vein_lifespans_mean`.

Now find the standard deviation of the lifespan of users of the Vein Pack. Save the result to `vein_lifespans_sd`, and view it.

Use a One Sample T-Test to compare `vein_lifespans` to the average life expectancy, `71`. Save the result into a variable called `vein_pack_test`, and view it.

Are the results significant? Check the p-value of `vein_pack_test`. If it's less than `0.05`, the standard threshold, you've got significance!

In order to differentiate Familiar's product lines, they would like you to compare the lifespan data for the Vein Pack with their more premium product, the Artery Pack.

Lifespans of Artery Pack users are given in the variable `artery_lifespans`, which has been loaded into `notebook.Rmd`. View `artery_lifespans`.

Before you run a Two Sample T-Test to compare the Vein Pack and the Artery Pack, you want to learn more about the data collected on users of the Artery Pack.

Begin by finding the mean lifespan of users of the Artery Pack, and save the result to `artery_lifespans_mean`. View `artery_lifespans_mean`.

Now find the standard deviation of the lifespan of users of the Artery Pack. Save the result to `artery_lifespans_sd`, and view it.

Use a Two Sample T-Test to compare `vein_lifespans` to `artery_lifespans`. Save the result into a variable called `package_comparison_results`, and view it.

Are the results significant? Check the p-value of `package_comparison_results`. If it's less than `0.05`, the standard threshold, you've got significance!

Blood Transfusion Analysis

This lesson covers the different types of hypothesis tests and the situations they are most appropriate for.

Say you work for a major social media website. Your boss comes to you with two questions:
* does the demographic of users on your site match the company's expectation?
* did the new interface update affect user engagement?

With terabytes of user data at your hands, you decide the best way to answer these questions is with statistical hypothesis tests!

_Statistical hypothesis testing_ is a process that allows you to evaluate if a change or difference seen in a dataset is "real", or if it’s just a result of random fluctuation in the data.

Hypothesis testing can be an integral component of any decision making process. It provides a framework for evaluating how confident one can be in making conclusions based on data. Some instances where this might come up include:
* a professor expects an exam average to be roughly 75%, and wants to know if the actual scores line up with this expectation. Was the test actually too easy or too hard?
* a product manager for a website wants to compare the time spent on different versions of a homepage. Does one version make users stay on the page significantly longer?

In this lesson, you will cover the fundamental concepts that will help you run and evaluate hypothesis tests:
* Sample and Population Mean
* P-Values
* Significance Level
* Type I and Type II Errors

You will then learn about three different hypothesis tests you can perform to answer the kinds of questions discussed above:
* One Sample T-Test
* Two Sample T-Test
* ANOVA (Analysis of Variance)

Let's get started!

Introduction

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure `10` trees and get an average height of `32` ft. On Tuesday, you measure `12` different trees and reach an average height of `35` ft. On Wednesday, you measure the remaining `11` trees in the park, whose average height is `31` ft. The average height for all `33` trees in your local park is `32.8` ft.

The collections of individual height measurements on Monday, Tuesday, and Wednesday are each called samples. A *sample* is a subset of the entire population (all the oak trees in the park). The mean of each sample is a *sample mean* and it is an estimate of the *population mean*.

Note: the sample means (`32` ft., `35` ft., and `31` ft.) were all close to the population mean (`32.8` ft.), but were all slightly different from the population mean and from each other.
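
The population mean quoted above can be recovered from the three sample means by weighting each one by the number of trees measured that day. A minimal sketch using base R's `weighted.mean()`, with illustrative variable names:

```r
sample_means <- c(32, 35, 31)  # average height (ft) measured each day
sample_sizes <- c(10, 12, 11)  # number of trees measured each day

# Weight each sample mean by its sample size
population_mean <- weighted.mean(sample_means, sample_sizes)
print(round(population_mean, 1))  # 32.8
```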

For a population, the mean is a constant value no matter how many times it's recalculated. But with a set of samples, the mean will depend on exactly which samples are selected. From a sample mean, we can then extrapolate the mean of the population as a whole. There are three main reasons we might use sampling:

- data on the entire population is not available
- data on the entire population is available, but it is so large that it is unfeasible to analyze
- meaningful answers to questions can be found faster with sampling

Sample Mean and Population Mean - I

In the previous exercise, the sample means you calculated closely approximated the population mean. This won't always be the case!

Consider a tailor of school uniforms at a school for students aged `11` to `13`. The tailor needs to know the average height of all the students in order to know which sizes to make the uniforms.

The tailor measures the heights of a random sample of `20` students out of the `300` in the school. The average height of the sample is `57.5` inches. Using this sample mean, the tailor makes uniforms that fit students of this height, some smaller, and some larger.

After delivering the uniforms, the tailor starts to receive some feedback &mdash; many of the uniforms are too small! They go back to take measurements on the rest of the students, collecting the following data:
* 11 year olds average height: `56.7` inches
* 12 year olds average height: `59` inches
* 13 year olds average height: `62.8` inches
* All students average height (population mean): `59.5` inches

The original sample mean was off from the population mean by `2` inches! How did this happen?

The random sample of `20` students was skewed to one direction of the total population. More `11` year olds were chosen in the sample than is representative of the whole school, bringing down the average height of the sample. This is called a _sampling error_, and occurs when a sample is not representative of the population it comes from. How do you get an average sample height that looks more like the average population height, and reduce the chance of a sampling error?

Selecting only `20` students for the sample allowed for the chance that only younger, shorter students were included. This is a natural consequence of the fact that a sample has less data than the population to which it belongs. If the sample selection is poor, then you will have a sample mean seriously skewed from the population mean.

There is one surefire way to mitigate the risk of having a skewed sample mean — take a larger set of samples! The sample mean of a larger sample set will more closely approximate the population mean, and reduce the chance of a sampling error.
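
The effect of sample size can be explored with a small simulation. This is only a sketch: the population below is randomly generated rather than the tailor's real measurements, and the helper name `sample_means_for()` is hypothetical.

```r
set.seed(42)

# Simulated population of 300 student heights (inches); illustrative only
population <- rnorm(300, mean = 59.5, sd = 3)

# Draw many random samples of a given size and record each sample mean
sample_means_for <- function(sample_size, repetitions = 1000) {
  replicate(repetitions, mean(sample(population, sample_size)))
}

means_small <- sample_means_for(20)
means_large <- sample_means_for(200)

# Sample means from larger samples cluster more tightly
# around the population mean
print(sd(means_small))
print(sd(means_large))
```

The spread of the sample means shrinks as the sample size grows, which is why a larger sample is less likely to land far from the population mean.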

Sample Mean and Population Mean - II

You begin the statistical hypothesis testing process by defining a _hypothesis_, or an assumption about your population that you want to test. A hypothesis can be written in words, but can also be explained in terms of the sample and population means you just learned about.

Say you are developing a website and want to compare the time spent on different versions of a homepage. You could run a hypothesis test to see if version A or B makes users stay on the page significantly longer. Your hypothesis might be:

`"The average time spent on homepage A is greater than the average time spent on homepage B."`

While this is a fine hypothesis to make, data analysts are often very hesitant people. They don't like to make bold claims without having data to back them up! Thus when constructing hypotheses for a hypothesis test, you want to formulate a null hypothesis. A _null hypothesis_ states that there is no difference between the populations you are comparing, and it implies that any difference seen in the sample data is due to sampling error. A null hypothesis for the same scenario is as follows:

`"The average time spent on homepage A is the same as the average time spent on homepage B."`

You could also restate this in terms of population mean:

`"The population mean of time spent on homepage A is the same as the population mean of time spent on homepage B."`

After collecting some sample data on how users interact with each homepage, you can then run a hypothesis test using the data collected to determine whether your null hypothesis can be rejected (i.e. whether the data show a significant difference in time spent on homepage A and B).

Hypothesis Formulation

Suppose you want to know if students who study history are more interested in volleyball than students who study chemistry. Before doing anything else to answer your original question, you come up with a null hypothesis: `"History and chemistry students are interested in volleyball at the same rates."`

To test this hypothesis, you need to design an experiment and collect data. You invite `100` history majors and `100` chemistry majors from your university to join an extracurricular volleyball team. After one week, `34` history majors sign up (`34%`), and `39` chemistry majors sign up (`39%`). More chemistry majors than history majors signed up, but is this a “real”, or significant difference? Can you conclude that students who study chemistry are more interested in volleyball than students who study history?

In your experiment, the `100` history and `100` chemistry majors at your university are samples of their respective populations (all history and chemistry majors). The sample means are the percentages of history majors (`34%`) and chemistry majors (`39%`) that signed up for the team, and the difference in sample means is `39%` - `34%` = `5%`. The population means are the percentages of history and chemistry majors worldwide that would sign up for an extracurricular volleyball team if given the chance.

You want to know if the difference you observed in these sample means (`5%`) reflects a difference in the population means, or if the difference was caused by sampling error, and the samples of students you chose do not represent the greater populations of history and chemistry students.

Restating the null hypothesis in terms of the population means yields the following:

`"The percentage of all history majors who would sign up for volleyball is the same as the percentage of all chemistry majors who would sign up for volleyball, and the observed difference in sample means is due to sampling error."`

This is the same as saying, “If you gave the same volleyball invitation to every history and chemistry major in the world, they would sign up at the same rate, and the sample of `200` students you selected are not representative of their populations.”

Designing an Experiment

When using automated processes to make decisions, you need to be aware of how this automation can lead to mistakes. Computer programs can be as fallible as the humans who design them. Because of this, there is a responsibility to understand what can go wrong and what can be done to contain these foreseeable problems.

In statistical hypothesis testing, there are two types of error. A _Type I error_ occurs when a hypothesis test finds a correlation between things that are not related. This error is sometimes called a "false positive" and occurs when the null hypothesis is rejected even though it is true.

For example, consider the history and chemistry major experiment from the previous exercise. Say you run a hypothesis test on the sample data you collected and conclude that there is a significant difference in interest in volleyball between history and chemistry majors. You have rejected the null hypothesis that there is no difference between the two populations of students. If, in reality, your results were due to the groups you happened to pick (sampling error), and there actually is no significant difference in interest in volleyball between history and chemistry majors in the greater population, you have become the victim of a false positive, or a Type I error.

The second kind of error, a _Type II error_, is failing to find a correlation between things that are actually related. This error is referred to as a "false negative" and occurs when the null hypothesis is not rejected even though it is false.

For example, with the history and chemistry student experiment, say that after you perform the hypothesis test, you conclude that there is no significant difference in interest in volleyball between history and chemistry majors. You did _not_ reject the null hypothesis. If there actually is a difference in the populations as a whole, and there is a significant difference in interest in volleyball between history and chemistry majors, your test has resulted in a false negative, or a Type II error.


Type I and Type II Errors

You know that a hypothesis test is used to determine the validity of a null hypothesis. Once again, the null hypothesis states that there is no actual difference between the two populations of data. But what result does a hypothesis test actually return, and how can you interpret it?

A hypothesis test returns a few numeric measures, most of which are out of the scope of this introductory lesson. Here we will focus on one: p-values. P-values help you decide whether your data give you enough evidence to reject the null hypothesis. In this context, a _p-value_ is the probability that, assuming the null hypothesis is true, you would see a difference in the sample means at least as large as the one in your data.

Consider the experiment on history and chemistry majors and their interest in volleyball from a previous exercise:
* Null Hypothesis: `"History and chemistry students are interested in volleyball at the same rates"`
* Experiment Sample Means: `34%` of history majors and `39%` of chemistry majors sign up for the volleyball class 

Assuming the null hypothesis is true, there is no actual difference in preference for volleyball between all history and chemistry majors, and any difference present in the experiment data is the result of sampling error. Imagine you run a hypothesis test on this experiment data and it returns a p-value of `0.04`. A p-value of `0.04` indicates that, if the null hypothesis were true, you could expect to see a difference of at least `5%` (calculated as `39%` - `34%` = `5%`) in the sample means only `4%` of the time.

Essentially, if you ran this same experiment `100` times, you would expect to see as large a difference in the sample means only `4` times given the assumption that there is no actual difference between the populations (i.e. they have the same mean).
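
Written as an equation, the statement above looks like the following. This is a sketch that assumes the test is two-sided (a difference in either direction counts), and the symbols for the two sample means are introduced here only for illustration.

```tex
p = P\left( \left| \bar{x}_{\text{chemistry}} - \bar{x}_{\text{history}} \right| \geq 5\% \mid \text{the null hypothesis is true} \right) = 0.04
```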

Seems like a really small probability, right? Are you thinking about rejecting the null hypothesis you originally stated?

P-Values

While a hypothesis test will return a p-value indicating how compatible your data are with the null hypothesis, it does not definitively claim whether you should reject the null hypothesis. To make this decision, you need to determine a threshold p-value for which all p-values below it will result in rejecting the null hypothesis. This threshold is known as the _significance level_.

A higher significance level is more likely to give a false positive, as it makes it "easier" to state that there is a difference in the populations of your data when such a difference might not actually exist. If you want to be very sure that the result is not due to sampling error, you should select a very small significance level.

It is important to choose the significance level before you perform a statistical hypothesis test. If you wait until after you receive a p-value from a test, you might pick a significance level such that you get the result you want to see. For instance, if someone is trying to publish the results of their scientific study in a journal, they might set a higher significance level that makes their results appear statistically significant. Choosing a significance level in advance helps keep everyone honest.

It is an industry standard to set a significance level of `0.05` or less, meaning that when the null hypothesis is true, there is a `5%` or smaller chance of getting a significant result purely because of sampling error.
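
The decision rule itself is a single comparison. The sketch below reuses the imagined p-value of `0.04` from the volleyball example; the variable names are illustrative.

```r
significance_level <- 0.05  # chosen before running the test
p_value <- 0.04             # the value imagined in the volleyball example

# Reject the null hypothesis only when the p-value is below the threshold
reject_null <- p_value < significance_level
print(reject_null)  # TRUE

# A stricter threshold makes the same result non-significant
reject_null_strict <- p_value < 0.01
print(reject_null_strict)  # FALSE
```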

Significance Level

Consider the fictional business BuyPie, which sends ingredients for pies to your household so that you can make them from scratch. Suppose that a product manager hypothesizes the average age of visitors to BuyPie.com is `30`. In the past hour, the website had `100` visitors and the average age was `31`. Are the visitors older than expected? Or is this just the result of chance (sampling error) and a small sample size? 

You can test this using a One Sample T-Test. A _One Sample T-Test_ compares a sample mean to a hypothetical population mean. It answers the question "How likely is a sample like this one if it came from a distribution with the desired mean?"

The first step is formulating a null hypothesis, which again is the hypothesis that there is no difference between the populations you are comparing. The second population in a One Sample T-Test is the hypothetical population you choose. The null hypothesis that this test examines can be phrased as follows: `"The set of samples belongs to a population with the target mean".` 

One result of a One Sample T-Test will be a _p-value_, which tells you whether or not you can reject this null hypothesis. If the p-value you receive is less than your significance level, normally `0.05`, you can reject the null hypothesis and state that there is a significant difference.

R has a function called `t.test()` in the `stats` package which can perform a One Sample T-Test for you.

For a One Sample T-Test, pass `t.test()` two arguments, a distribution of values and an expected mean:

```r
results <- t.test(sample_distribution, mu = expected_mean)
```
* `sample_distribution` is the sample of values that were collected
* `mu` is an argument indicating the desired mean of the hypothetical population
* `expected_mean` is the value of the desired mean

`t.test()` will return, among other information we will not cover here, a p-value. This tells you how likely you would be to see a sample mean at least this far from the specified mean if the sample really came from a distribution with that mean.

P-values give you an idea of how confident you can be in a result. Just because you don’t have enough data to detect a difference doesn’t mean that there isn’t one. Generally, the more samples you have, the smaller a difference you can detect. 
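
Here is a minimal runnable sketch of the call described above. The `ages` vector is made up for illustration and is not BuyPie data; `results$p.value` is the element of the returned object that holds the p-value.

```r
# Hypothetical ages of recent visitors; made up for illustration
ages <- c(32, 34, 29, 29, 22, 39, 38, 37, 38, 36, 30, 26, 22, 22)

# Null hypothesis: the sample comes from a population with mean age 30
results <- t.test(ages, mu = 30)

# Extract the p-value and compare it to the significance level
p_value <- results$p.value
print(p_value)
print(p_value < 0.05)
```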


One Sample T-Test

Suppose that last week, the average amount of time spent per visitor to a website was `25` minutes. This week, the average amount of time spent per visitor to a website was `29` minutes. Did the average time spent per visitor change (i.e. was there a statistically significant bump in user time on the site)? Or is this just part of natural fluctuations?

One way of testing whether this difference is significant is by using a Two Sample T-Test. A _Two Sample T-Test_ compares two sets of data, which are both approximately normally distributed. 

The null hypothesis, in this case, is that the two distributions have the same mean.

You can use R's `t.test()` function to perform a Two Sample T-Test, as shown below:

```r
results <- t.test(distribution_1, distribution_2)
```

When performing a Two Sample T-Test, `t.test()` takes two distributions as arguments and returns, among other information, a p-value. Remember, the p-value lets you know the probability of seeing a difference in the means at least this large if the null hypothesis were true and only chance (sampling error) were at work.
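
A small runnable sketch of a Two Sample T-Test follows. Both vectors are made up for illustration; they were chosen so that their means match the `25` and `29` minutes in the example above.

```r
# Hypothetical minutes spent on the site per visitor; made up for illustration
last_week <- c(24, 27, 22, 25, 26, 23, 28, 25)
this_week <- c(29, 31, 27, 30, 28, 32, 26, 29)

# Null hypothesis: both samples come from populations with the same mean.
# By default, t.test() runs Welch's version of the test,
# which does not assume the two groups have equal variances.
results <- t.test(last_week, this_week)

print(results$p.value)
```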

Two Sample T-Test

Suppose that you own a chain of stores that sell ants, called VeryAnts. There are three different locations: A, B, and C. You want to know if the average ant sales over the past year are significantly different between the three locations.

At first, it seems that you could perform T-tests between each pair of stores.

You know that the significance level is the probability that you incorrectly reject a true null hypothesis on each t-test. The more t-tests you perform, the more likely you are to get a false positive, a Type I error.

For a significance level of `0.05`, if the null hypothesis is true, then the probability of correctly obtaining a non-significant result is `1 - 0.05` = `0.95`. When you run another t-test, the probability of still getting a correct result is `0.95` * `0.95`, or `0.9025`. That means your probability of making at least one error is now close to `10%`! This error probability only gets bigger the more t-tests you do.
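
The arithmetic above extends to any number of tests. The sketch below assumes the tests are independent, as the multiplication above does, and uses a threshold of `0.05`. Comparing three stores pairwise takes three t-tests.

```r
significance_level <- 0.05
number_of_tests <- 1:3  # three stores means choose(3, 2) = 3 pairwise t-tests

# Probability of at least one false positive across independent tests
error_probability <- 1 - (1 - significance_level)^number_of_tests
print(round(error_probability, 4))  # 0.0500 0.0975 0.1426
```

With all three pairwise tests, the chance of at least one false positive is already above `14%`.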



Dangers of Multiple T-Tests

In the last exercise, you saw that the probability of making a Type I error got dangerously high as you performed more t-tests.

When comparing more than two numerical datasets, the best way to preserve a Type I error probability of `0.05` is to use ANOVA. _ANOVA (Analysis of Variance)_ tests the null hypothesis that all of the datasets you are considering have the same mean. If you reject the null hypothesis with ANOVA, you're saying that at least one of the sets has a different mean; however, it does not tell you which datasets are different.

You can use the `stats` package function `aov()` to perform ANOVA on multiple datasets. `aov()` takes the different datasets combined into a data frame as an argument. For example, if you were comparing scores on a video game between math majors, writing majors, and psychology majors, you could format the data in a data frame `df_scores` as follows:

|group|score|
|-----|-----|
|math major|88|
|math major|81|
|writing major|92|
|writing major|80|
|psychology major|94|
|psychology major|83|

You can then run an ANOVA test with this line:

```r
results <- aov(score ~ group, data = df_scores)
```
Note: `score ~ group` indicates the relationship you want to analyze (i.e. how each `group`, or major, relates to `score` on the video game).

To retrieve the p-value from the results of calling `aov()`, use the `summary()` function:

```r
summary(results)
```

The null hypothesis, in this case, is that all three populations have the same mean score on this video game. If you reject this null hypothesis (if the p-value is less than `0.05`), you can say you are reasonably confident that at least one pair of datasets is significantly different. After using only ANOVA, however, you can't make any conclusions on which two populations have a significant difference.
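
Putting the pieces together, here is a self-contained sketch that rebuilds the example table as a data frame and runs the test. Pulling the p-value out of the `"Pr(>F)"` column is one way to get it as a number; reading it from the printed `summary()` output works just as well.

```r
# Rebuild the example table as a data frame
df_scores <- data.frame(
  group = factor(c("math major", "math major", "writing major",
                   "writing major", "psychology major", "psychology major")),
  score = c(88, 81, 92, 80, 94, 83)
)

results <- aov(score ~ group, data = df_scores)

# summary() returns a list whose first element is the ANOVA table;
# the "Pr(>F)" column holds the p-value for the group term
anova_table <- summary(results)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```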

Let's look at an example of ANOVA in action.

ANOVA

Before you use numerical hypothesis tests, you need to be sure that the following things are true:

#### 1. The samples should each be normally distributed...ish

Data analysts in the real world often still perform hypothesis tests on datasets that aren't exactly normally distributed. What is more important is to recognize if there is some reason to believe that a normal distribution is especially unlikely. If your dataset is definitively not normal, the numerical hypothesis tests won't work as intended.

For example, imagine you have three datasets, each representing a day of traffic data in three different cities. Each dataset is independent, as traffic in one city should not impact traffic in another city. However, it is unlikely that each dataset is normally distributed. In fact, each dataset probably has two distinct peaks, one at the morning rush hour and one during the evening rush hour. The histogram of a day of traffic data might look something like this:

![histogram](https://content.codecademy.com/courses/learn-hypothesis-testing/lesson_ii/histogram_data_traffic.svg)

In this scenario, using a numerical hypothesis test would be inappropriate.

#### 2. The population standard deviations of the groups should be equal

For ANOVA and Two Sample T-Tests, using datasets with standard deviations that are significantly different from each other will often obscure the differences in group means.

To check for similarity between the standard deviations, it is normally sufficient to divide the two standard deviations and see if the ratio is "close enough" to 1. "Close enough" may differ in different contexts, but generally staying within `10%` should suffice.
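The ratio check described above could be sketched as follows. The two score vectors are hypothetical values chosen only to illustrate the calculation, and the `0.10` tolerance mirrors the rule of thumb in the text.

```r
# Hypothetical samples (illustration only)
scores_a <- c(72, 85, 90, 66, 78)
scores_b <- c(70, 88, 93, 64, 75)

# Ratio of the two sample standard deviations
sd_ratio <- sd(scores_a) / sd(scores_b)

# TRUE when the ratio is within 10% of 1
within_ten_percent <- abs(sd_ratio - 1) <= 0.10
```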

#### 3. The samples must be independent

When comparing two or more datasets, the values in one distribution should not affect the values in another distribution. In other words, knowing more about one distribution should not give you any information about any other distribution.

Here are some examples where it would seem the samples are not independent:

* the number of goals scored per soccer player before, during, and after undergoing a rigorous training regimen
* a group of patients' blood pressure levels before, during, and after the administration of a drug

It is important to understand your datasets before you begin conducting hypothesis tests on them so that you know you are choosing the right test.

Assumptions of Numerical Hypothesis Tests

Phew! Nobody said hypothesis testing is easy, but you made it to the end of the lesson. Congratulations! The world of hypothesis testing is vast. There is much more you can learn, and so many applications where you can use them.

Let's review what you've learned in this lesson:
* _Samples_ are subsets of an entire _population_, and the _sample mean_ can be used to approximate the _population mean_
* The _null hypothesis_ is an assumption that there is no difference between the populations you are comparing in a hypothesis test
* _Type I Errors_ occur when a hypothesis test finds a correlation between things that are not related, and _Type II Errors_ occur when a hypothesis test fails to find a correlation between things that are actually related
* _P-Values_ indicate the probability that, assuming the null hypothesis is true, such differences in the samples you are comparing would exist
* The _Significance Level_ is a threshold p-value for which all p-values below it will result in rejecting the null hypothesis
* _One Sample T-Tests_ indicate whether a dataset belongs to a distribution with a given mean 
* _Two Sample T-Tests_ indicate whether there is a significant difference between two datasets 
* _ANOVA (Analysis of Variance)_ allows you to detect whether there is a significant difference among multiple datasets, though it does not identify which ones differ
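To tie the review together, here is a rough sketch of one-sample and two-sample t-tests using base R's `t.test()`. The samples are simulated, and the names `sample_a`, `sample_b`, and `significance_level` are assumptions for illustration.

```r
# Simulated samples (hypothetical data)
set.seed(7)
sample_a <- rnorm(40, mean = 30, sd = 4)
sample_b <- rnorm(40, mean = 33, sd = 4)

# One Sample T-Test: is sample_a consistent with a population mean of 30?
one_sample_p <- t.test(sample_a, mu = 30)$p.value

# Two Sample T-Test: do sample_a and sample_b share the same population mean?
two_sample_p <- t.test(sample_a, sample_b)$p.value

# Reject the null hypothesis when the p-value falls below the significance level
significance_level <- 0.05
reject_two_sample_null <- two_sample_p < significance_level
```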

Learn about R-grammers' preferred integrated development environment (IDE), RStudio!

Introduction to the RStudio IDE

The only way to assign a variable name is with `<-` arrow syntax.

Variables store information in your program and associate that information with a name.

Once a variable has been created, you cannot change its value.

Variable names can start with letters, numbers, and a period or underscore character.

If `age` is greater than 18 and `registered` is TRUE, set the value of `can_vote` to TRUE. Otherwise, set it to FALSE.

If `age` is greater than 18 or `registered` is TRUE, set the value of `can_vote` to TRUE. Otherwise, set it to FALSE.

If `age` is greater than or equal to 18 and `registered` is TRUE, set the value of `can_vote` to TRUE. Otherwise, set it to FALSE.

If `age` is greater than or equal to 18 and `registered` is TRUE, set the value of `can_vote` to FALSE. Otherwise, set it to TRUE.

if (age >= 18 & registered == TRUE) {
  can_vote <- TRUE
} else {
  can_vote <- FALSE
}

cool_variable1 = 324.2
cool_variable2 = "Hola, mucho gusto."
cool_variable3 = FALSE
cool_variable4 = 0

Vectors can store items of different data types.

Vectors are created by calling the `c()` function and passing in arguments to add to the vector.

You can retrieve the length of a vector by calling the `length()` function and passing in the vector as an argument.

Test your understanding of R fundamentals with this quiz.


Introduction to R

An introduction to the R programming language and its unique syntax.

Ahoy! We R excited for you to start your learning adventure with a language built for data enthusiasts! The R community is made up of people passionate about the intersection of numbers, data, analysis, and code. In this lesson, we will introduce you to some basic R syntax and discuss how R  classifies data types so that it can mathematically process them in analysis.

But before you go any further, let's talk about how R is different from most programming languages. Unlike with other languages, most beginners who want to learn R do so because they want to analyze data. In this way, R is more of a tool to understand data than a programming language used to build software applications. Our approach to teaching you R will be similar to teaching you how to use a new tool.

What is the tool for? R is powerful for conducting statistics and other specialized data analysis. It was invented by scientists for statistical computing and a community of specialized packages has been built around the language. 

Some housekeeping: over the next exercises you'll be conducting all of your work inside R-notebook files. You'll see the **.Rmd** file format will allow you to write blocks of code and see their output as a webpage. You will see the blocks wrapped inside the following syntax:

```R
This is regular markdown text that will render as regular text.
` ` `{r}
7 + 7 # This is R code
` ` `
```

This format allows you to group blocks of code with common logic, see their output, and then render all of it together as an HTML file that can be viewed on a browser as a static webpage. The `.Rmd` format is the preferred standard in the data science industry. 

Note: This lesson assumes no prior coding knowledge and will introduce programming concepts as you need them in order to continue using R as a tool to conduct analysis. 

Image of female-identifying pirate juggling data science symbols such as math, graphs, and numbers.

Why R?

Let's start with the basic syntax for mathematical calculations in R. R performs addition, subtraction, multiplication, and division with `+`, `-`, `*`, and `/`:

```R
# Results in "500"
573 - 74 + 1

# Results in "50"
25 * 2

# Results in "2"
10 / 5
```

Mathematical operations in R follow the standard mathematical [order of operations](https://en.wikipedia.org/wiki/Order_of_operations). Let's write your first line of R code and calculate some basic math!
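A quick illustration of how the order of operations changes a result. The variable names are only for illustration.

```r
# Multiplication happens before addition
without_parens <- 2 + 3 * 4      # 14

# Parentheses are evaluated first
with_parens <- (2 + 3) * 4       # 20

# Division and multiplication run left to right
left_to_right <- 20 / 5 * 2      # 8
```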

Calculations

Ironically, the second thing we're going to do is show you how to tell R to ignore a part of your program. We promise it's very useful to know how to do this. Text written in a program but not run by the computer is called a _comment_. R interprets anything after a `#` as a comment.

Why would anyone want the computer to ignore a part of their file? Multiple reasons! Comments can: 

 * Provide context for why something is written the way it is:
```R
# This code will be used to count the number of times anyone tweets the word persnickety
persnickety_count <- 0
```

 * Help other people reading the code understand it faster: 
```R
# This code will calculate the likelihood that it will rain tomorrow
complicated_rain_calculation_for_tomorrow()
```

 * Ignore a line of code and see how a program will run without it:
```R
# useful_value <- old_sloppy_code()
useful_value <- new_clean_code()
```

Annotating or documenting your code can help other people read your program later! It could also help your future-self understand the code when you go back and read an old file trying to remember how it works. Documenting your code will help others reproduce it, and it will help you become a better programmer too. Note: In R notebooks, you're also allowed to add documentation using markdown text outside the code blocks.


Comments

Now that you know how to calculate basic math and add comments explaining your code, let's dive into how R "thinks about" different types of data. In R and in programming, _data types_ are the classifications we give to different kinds of information. In this lesson, we will explore the following R data types:
1. _Numeric_: Any number with or without a decimal point: `23`, `0.03`, and the numeric missing value `NA`.
2. _Character_: Any grouping of characters on your keyboard (letters, numbers, spaces, symbols, etc.) or text. Most strings are surrounded by single quotes: `' ... '` or double quotes `" ... "`, though we prefer single quotes. Sometimes you will hear this type referred to as "string."
3. _Logical_: This data type has only two possible values: either `TRUE` or `FALSE` (without quotes). It’s helpful to think of logical types or booleans as on and off switches or as the answers to a "yes" or "no" question.
4. _Vectors_: A list of related data that is all the same type.
5. _NA_: This data type represents the absence of a value. It is written with the keyword `NA` (without quotes), and it has its own significance in the context of the different types. That is, there is a numeric NA, a character NA, and a logical NA.

Let's get comfortable with checking the data type of the following:
```R
class(2) # numeric
class('hello') # character
class('23') # character
class(FALSE) # logical
class(NA) # logical
```
In the example above, notice that the third line is labeled a character type. Why? Because the characters `23` are in quotes, so it gets classified as a character. 
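The typed versions of `NA` can be checked the same way. This short sketch uses R's built-in constants `NA_character_` and `NA_real_`, which this lesson has not introduced, alongside vectors that contain a missing value.

```r
class(NA)             # logical (the default NA)
class(NA_character_)  # character
class(NA_real_)       # numeric

# Inside a vector, NA takes on the type of the other elements
class(c(1, NA, 3))    # numeric
class(c('a', NA))     # character
```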

Data Types

Now that you know how R classifies some of the basic information types, let's figure out how to store them. In programming, variables allow us to store information and associate that information with a name. In R, we _assign_ variables by using the assignment operator, an arrow sign (`<-`) made with a less-than sign and a dash.

```R
full_name <- "Natalia Rodríguez Nuñez"
```

In the example above, we store the string value "Natalia Rodríguez Nuñez" in a variable called `full_name`. Variable names can't contain spaces or symbols other than an underscore (`_`) or a period (`.`). They can't begin with a number or an underscore, but they can contain numbers after the first letter (e.g., `cool_variable_5` is OK).

It's no coincidence we call these creatures "variables". If we need to update a variable but perform the same logical process on it, we can change its value! For example, take the variable `message_string`:

```R
# Greeting
message_string <- "Hello there"
print(message_string)

# Farewell
message_string <- "Hasta la vista"
print(message_string)
```

Above, we create the variable `message_string`, assign a welcome message, and print the greeting. After we greet the user, we want to wish them goodbye. We then update `message_string` to a departure message and print that out.

Note: You can also use `=` instead of `<-` to assign a value, but R-tists (R programmers) prefer to do it with an arrow.
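A small sketch comparing the two assignment styles and showing a variable being updated. The variable names are made up for illustration.

```r
# Both forms store the same value
count_arrow <- 10
count_equals = 10
same_value <- identical(count_arrow, count_equals)  # TRUE

# A variable's value can be updated after it is created
count_arrow <- count_arrow + 5                      # now 15
```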




Variables

We mentioned Vectors when we introduced data types earlier. In R, vectors are a list-like structure that contain items of the same data type. 

Take a look here:
```R
spring_months <- c("March", "April", "May", "June")
```

In the example above, we created a new variable with the value of a new vector. We created this vector by separating four character strings with a comma and wrapping them inside `c()`.

A few things you should know how to do with vectors:
+ You can check the type of elements in a vector by using `typeof(vector_name)`
+ You can check the length of a vector by using `length(vector_name)`
+ You can access individual elements in the vector by using `[]` and placing the element position inside the brackets. For example, if we wanted to access the second element we would write: `vector_name[2]`. Note: In R, you start counting elements at position one, not zero.
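The three operations above can be tried on the `spring_months` vector. The `mixed` vector at the end is an extra illustration of what happens when you combine different types: R converts every element to a single common type.

```r
spring_months <- c("March", "April", "May", "June")

element_type <- typeof(spring_months)   # "character"
month_count <- length(spring_months)    # 4
second_month <- spring_months[2]        # "April" (positions start at 1)

# Mixing types forces everything into one type (here, character)
mixed <- c(1, "two", TRUE)
mixed_type <- typeof(mixed)             # "character"
```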





Vectors

In R, we will often perform a task based on a condition. For example, if we are analyzing data for the summer, then we will only want to look at data that falls in June, July, and August.

We can perform a task based on a condition using an if statement:

```R
if (TRUE) {
  print('This message will print!')
} 
```
Notice in the example above, we have an _if_ statement. The if statement is composed of:


+ The if keyword followed by a set of parentheses `()` which is followed by a code block, or block statement, indicated by a set of curly braces `{}`.
+ Inside the parentheses `()`, a condition is provided that evaluates to `TRUE` or `FALSE`.
+ If the condition evaluates to true, the code inside the curly braces `{}` runs, or executes.
+ If the condition evaluates to false, the code inside the block won't execute.

Knowing how to use if statements will help you introduce logic in your data analysis. There is also a way to add an _else statement_. An else statement must be paired with an if statement, and together they are referred to as an if...else statement.
```R
if (TRUE) {
   print("Go to sleep!")
} else {
   print("Wake up!")
}
```

In the example above, the else statement:
+ Uses the else keyword following the code block of an if statement.
+ Has a code block that is wrapped by a set of curly braces `{}`.
+ The code inside the else statement code block will execute when the if statement's condition evaluates to false.
These if...else statements allow us to automate solutions to yes-or-no questions, also known as binary decisions.

Conditionals

When writing conditional statements, sometimes we need to use different types of operators to compare values. These operators are called comparison operators.

Here is a list of some handy comparison operators and their syntax:

+ Less than: `<`
+ Greater than: `>`
+ Less than or equal to: `<=`
+ Greater than or equal to: `>=`
+ Is equal to: `==`
+ Is NOT equal to: `!=`

Comparison operators compare the value on the left with the value on the right. For instance:
```R
10 < 12 # Evaluates to TRUE
```
It can be helpful to think of comparison statements as questions. When the answer is "yes", the statement evaluates to true, and when the answer is "no", the statement evaluates to false. The code above would be asking: is 10 less than 12? Yes! So 10 < 12 evaluates to true.
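Here is a brief sketch of comparison operators, first on their own and then inside an if...else statement. The `age` and `status` variables are hypothetical.

```r
age <- 21

is_adult <- age >= 18       # TRUE
not_equal <- 10 != 12       # TRUE
same_text <- 'a' == 'a'     # TRUE

# The result of a comparison decides which block runs
if (age >= 18) {
  status <- 'adult'
} else {
  status <- 'minor'
}
```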



Comparison Operators

Working with conditionals means that we will be using logical, true or false values. In R, there are operators that work with logical values known as logical operators. We can use logical operators to add more sophisticated logic to our conditionals. There are three logical operators:

+ the AND operator (`&`)
+ the OR operator (`|`)
+ the NOT operator, otherwise known as the bang operator (`!`)

When we use the `&` operator, we are checking that two things are true:
```R
if (stopLight == 'green' & pedestrians == 0) {
  print('Go!')
} else {
  print('Stop')
}
```
When using the `&` operator, both conditions must evaluate to true for the entire condition to evaluate to true and execute. Otherwise, if either condition is false, the `&` condition will evaluate to false and the else block will execute.

If we only care about either condition being true, we can use the `|` operator:
```R
if (day == 'Saturday' | day == 'Sunday') {
  print('Enjoy the weekend!')
} else {
  print('Do some work.')
}
```
When using the `|` operator, only one of the conditions must evaluate to true for the overall statement to evaluate to true. In the code example above, if either `day == 'Saturday'` or `day == 'Sunday'` evaluates to true, the if's condition will evaluate to true and its code block will execute. Note that `|` always evaluates both conditions. R also has a short-circuit form, `||`, which skips the second condition when the first one evaluates to true.

The `!` NOT operator reverses, or negates, a logical value:
```R
excited <- TRUE
print(!excited) # Prints FALSE
```
Essentially, the `!` operator will either take a true value and pass back false, or it will take a false value and pass back true.

Logical operators are often used in conditional statements to add another layer of logic to our code.
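R has both element-wise operators (`&`, `|`) and short-circuit operators (`&&`, `||`). The sketch below contrasts them. The `tryCatch()` wrapper is only there so the example keeps running after the deliberate error, and all variable names are illustrative.

```r
# `&` and `|` compare element by element, so they work on whole vectors
and_result <- c(TRUE, FALSE) & c(TRUE, TRUE)    # TRUE FALSE
or_result <- c(TRUE, FALSE) | c(FALSE, FALSE)   # TRUE FALSE

# `|` evaluates both sides, so an error on the right-hand side is still raised
single_bar <- tryCatch(
  TRUE | stop('right-hand side was evaluated'),
  error = function(e) 'error raised'
)

# `||` stops as soon as the left-hand side is TRUE, so stop() never runs
double_bar <- TRUE || stop('right-hand side was evaluated')
```

For a single condition inside `if ()`, either form gives the same answer; the difference matters when the second condition is expensive or could fail.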

Logical Operators

Functions are actions we can perform. R provides a number of functions, and you've actually been using a few of them even though you maybe didn't realize!

We _call_, or use, these functions by stating the name of the function and following it with opening and closing parentheses, i.e., `functionName()`. In addition, between the parentheses, we usually pass in an _argument_, or a value that the function uses to conduct an action, i.e., `functionName(value)`.

Does that syntax look a little familiar? When we have used `print()`, we were calling the `print` function. When we made a vector, we actually used the combine function, `c()`! Let's see some more functions in action!

```R
sort(c(2,4,10,5,1)) # Outputs c(1,2,4,5,10)
length(c(2,4,10,5,1)) # Outputs 5
sum(5,15,10) # Outputs 30
```
Let's look at each of the lines above:
- On the first line, the `sort()` function is called with a parameter of the vector `c(2,4,10,5,1)`. The result is a sorted vector `c(1,2,4,5,10)` with the values in ascending order.
- On the second line, we called a function we've seen before: `length()` and it returned the value 5 because there were five items in the vector.
- On the third line, we called a function `sum()` which added up all of the arguments we passed to it.

Understanding how to call a function and what arguments it needs is a fundamental part of leveraging R as a tool to conduct analysis. Let's practice calling some useful functions!
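Many functions also accept optional, named arguments that change their behavior. As a small sketch, `sort()` takes `decreasing` and `sum()` takes `na.rm`; both are part of base R, while the variable names here are illustrative.

```r
values <- c(2, 4, 10, 5, 1)

# Named argument: sort from largest to smallest
sorted_desc <- sort(values, decreasing = TRUE)        # 10 5 4 2 1

# A missing value makes the sum NA unless it is removed
total_with_na <- sum(c(5, NA, 10))                    # NA
total_without_na <- sum(c(5, NA, 10), na.rm = TRUE)   # 15
```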

Calling a Function

R's popularity is also largely due to the many fantastic packages available in the language! A package is a bundle of code that makes coding certain tasks easier. There are all sorts of packages for all sorts of purposes, ranging from visualizing and cleaning data, to ordering pizza or posting a tweet. In fact, most R-grammers (R programmers) use packages when they code. This is why you might hear them differentiate packages from "Base R." Base R refers to the R language by itself and all that it can do without importing any packages.

Base R is very powerful, but most of the time, you'll want to import a package to make your life easier. Before you can import a package, you need to install it. You only need to run this command the first time you use a package; after that, there is no need to run it again:
```R
install.packages('package-name')
```
To import a package, you simply call `library()`:
```R
library(package-name)
```
You can look up documentation for different packages available in R at the [CRAN](https://cran.r-project.org/) (Comprehensive R Archive Network).

In this lesson, we coded in Base R but let's practice importing one of the most popular R packages: dplyr. Dplyr is a package used to clean, process, and organize data which you will use as you learn about R. 

Importing Packages

Congrats on finishing your first R lesson!

Here's a summary of some of the concepts you've learned:

+ R is a powerful statistical programming language with a large community of data enthusiasts.
+ You can calculate arithmetic with R and it will follow the normal order of operations.
+ Data types allow us to classify different pieces of information.
+ You can store that information inside of variables.
+ You can use conditional statements and operators to introduce logic to your code.
+ You can call a function in R by writing its name followed by `()` and passing in the correct arguments.
+ R programmers have published lots of useful packages that specialize in different tasks, which are all available for you to use in your programs after you install them.

We hope you're as excited as we are about the possibilities of analyzing data now that you know some of the basics of programming with R.




Introduction to R Syntax

In this project, you will learn how to use the basics of R syntax and operations to make calculations.



Istanbul is the largest city in Turkey and the fifth largest city in the world. It has experienced enormous growth over the past 50 years and is one of the world's 10 fastest growing metropolitan areas.  

While the program that we will write can be used with data from any city, we'll start by using data from Istanbul and saving our data to variables. Using variables will allow us to swap out the data in the future. 

The following chart is an abbreviated list of the population size by year in Istanbul. Take a moment to read over the data &mdash; you will need to refer back to this chart as you complete certain tasks. 

 <div class = "narrative-table-container">
<table>
<tr>
    <th>Year</th>
    <th>Population</th>

  </tr>
  <tr>
    <td>1927</td>
    <td>691000</td>

  </tr>
  <tr>
    <td>1950</td>
    <td>983000</td>
  </tr>
  <tr>
    <td>2000</td>
    <td>8831800</td>
  </tr>
  <tr>
    <td>2017</td>
    <td>15029231</td>
  </tr>
</table>
</div>

First, create the variable `city_name` and set it equal to `"Istanbul, Turkey"`.

The dataset starts with the population value for the year 1927 and ends with 2017.

Create the variable `pop_year_one`. In the chart, find the population value for 1927 and set it equal to the variable `pop_year_one`.

Next, create the variable `pop_year_two`. Find the population for 2017 and set its value equal to the variable `pop_year_two`.

Using the variables that we just created, we're going to write a script that allows us to calculate the _annual percentage growth rate_. The annual percentage growth rate is the percentage by which the population changes each year, on average, during a certain period.

First, create the variable `pop_change`. Calculate the difference in population between 2017 and 1927 and save the result to the variable `pop_change`. Feel free to print any of these variables if you want to check their values!

Before we calculate the annual percentage growth rate, we need to calculate the _percentage growth rate_. This is the percentage by which a population changes, but it doesn't account for the period of time during which the change takes place.

We can calculate percentage growth rate using the following formula:

```R
percentage_gr <- (((pop_present - pop_past)/pop_past) * 100)
```
Create the variable `percentage_gr`.

Using the variable `pop_change`, calculate the percentage growth rate between 1927 and 2017 and assign the result to the variable `percentage_gr`.

Now that we have the percentage growth rate, we can calculate the annual percentage growth. Create a variable for `annual_gr`.

To calculate the annual percentage growth, take the result of the variable `percentage_gr` and divide it by the number of years elapsed. Set the result equal to the variable `annual_gr`.
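To see the two formulas working together, here is a worked sketch with made-up numbers for a hypothetical town (not the Istanbul data), so you can check your own result against the same pattern.

```r
# Hypothetical town: population grows from 1000 to 1500 over 10 years
pop_past <- 1000
pop_present <- 1500
years_elapsed <- 10

# Percentage growth rate over the whole period
percentage_growth <- ((pop_present - pop_past) / pop_past) * 100   # 50

# Annual percentage growth: spread the total evenly across the years
annual_growth <- percentage_growth / years_elapsed                 # 5
```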

Print the `annual_gr` by using the `print()` function. 

Try using the same formula but changing the values for the years. You could pick a ten-year period of your liking and see how the population changed between the earliest year in the decade and the last. This is the beauty of code and reproducibility: you can change the value of your variables and compute the same equation.

You've coded the calculation from scratch! At the top of your notebook, we've included a function named `calculate_annual_growth` that prints a sentence explaining the change in population. With your new knowledge of calling functions, and your understanding of variables and arguments, call the function to print a summary.

The `calculate_annual_growth` function takes five arguments:
+ `year_one`
+ `year_two`
+ `pop_y1`
+ `pop_y2`
+ `city`


Pass in the correct values for each one. Remember, you already turned a few of them into variables! The others you can pass as values. Note: The argument `city` just corresponds to the city name as a string.
 

See the summary result that is printed as the result of the call!

Calculating Population Change Over Time with R

Learn about the statistical programming language R and why it is unique compared to other programming languages!

How is R Special?

Use your knowledge of data frames, readr, and dplyr to explore this dataset about cars from 1985.

What good is an analysis if we don't even have the tools to perform the said analysis? Some of the tools you will need for this analysis are the `readr` and `dplyr` tidyverse packages. Load the libraries at the top of the **notebook.Rmd** file so you can access the functions you will need later on.

The last tool we need is the data itself! The file **cars85.csv** stores the data that comes from the UCI Machine Learning Repository. Load the file into a dataframe called `cars` to get started.

It's always a good idea to inspect the data you load into R. It helps you to know what you are working with. Inspect `cars` with `head()` and `summary()`.

What kind of information do you have? What can you do with this information?

Each row in this dataframe is a single car, and each column stores some characteristic about that car. You want to get the best value for your collection, so you want to analyze as much as you can before buying. Doing so will help you make your choice easier!

After inspecting the dataframe, you notice something odd about the `normalized_losses` column. This column has a lot of entries that are question marks (`?`). This variable is not worth looking at since we don't have all the cars' expected losses.

Let's remove this column from the dataset. Select all columns from `cars` but `normalized_losses`. Save your new dataframe to `cars`.

Print the column names of `cars`. Are they clear and descriptive?

You know, `symboling` doesn't say anything to you at first glance. According to the UCI webpage, the `symboling` variable represents the car's risk factor. That variable name doesn't seem to go with the description. You should simplify this variable name to have it make more sense.

Update that column name in `cars` as follows:

 - `symboling` -> `risk_factor`

Print the column names of `cars` to confirm the names of the columns have changed.

Your car collection means a lot to you. You want each car to be of value to you. What better way to do that than to buy a car with a lot of miles-per-gallon on the highways? To determine this, first, suppose only cars exceeding 30 mpg on the highways interest you. You seek to measure how different each car's highway mpg is from your 30 mpg threshold. Create a variable called `mpg_threshold` with the value `30`.

Add a new column to `cars` called `mpg_diff_from_threshold`. This will measure how far each car's highway mpg is from 30 mpg. View the updated `cars` dataframe.

You'll add a car to your collection only if it gets more than 30 miles per gallon on the highways. Filter the rows of `cars` to find all the cars where `mpg_diff_from_threshold` is greater than `0`. Save this new dataframe to `mpg_exceeds_threshold` and view it.

Which cars have the highest miles per gallon on the highways? To find this, arrange the rows of `mpg_exceeds_threshold` by `mpg_diff_from_threshold` descending. Save this new dataframe as `mpg_exceeds_threshold`.

Now suppose you want your next car to have a large engine. Order the rows of `cars` by `engine_size` descending. Save the new data frame to `ordered_by_engine_size`. View `ordered_by_engine_size`.
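The steps used so far (dropping a column, renaming, adding a column, filtering, and arranging) can be chained with the pipe. The sketch below uses a small, made-up `bikes` data frame rather than the cars data, so every column name and value in it is an assumption for illustration.

```r
library(dplyr)

# Hypothetical data frame (not the cars85 dataset)
bikes <- data.frame(
  model = c('A', 'B', 'C', 'D'),
  code = c(1, 2, 3, 4),
  range_km = c(40, 65, 55, 80),
  notes = c('?', '?', 'ok', '?'),
  stringsAsFactors = FALSE
)
range_threshold <- 50

long_range <- bikes %>%
  select(-notes) %>%                                   # drop an unhelpful column
  rename(risk_code = code) %>%                         # new_name = old_name
  mutate(range_diff = range_km - range_threshold) %>%  # add a computed column
  filter(range_diff > 0) %>%                           # keep rows above the threshold
  arrange(desc(range_diff))                            # largest difference first
```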

There are a lot of makes of cars to choose from, but you may prefer one over the others. Which make do you prefer the most? Create a variable called `chosen_make` that contains the make you want to check. The hint below provides the list of makes to choose from.

Filter `cars` to only include rows where the `make` column is equal to `chosen_make`. Save the new dataframe to `chosen_make_details`.

Order the rows of `chosen_make_details` by `engine_size` descending and save the new dataframe to `chosen_make_details`. View `chosen_make_details`.

How large are the engines in each of the cars from that make that you chose? You can change the make stored in `chosen_make` to check out the engine sizes for other makes.

The process of buying a new car can cause a lot of stress - you don't want to buy a car you won't like! You've now seen how performing an analysis can ease some of the stress of making the decision of which car to buy. You also get to add a nice new car to your collection! Great work!

Explore the 1985 Cars Dataset

`content <- read_csv('content_inventory.csv')`

`content <- read_csv('content_inventory')`

`content <- from_csv('content_inventory.csv')`

`content <- from_csv('content_inventory')`

```r
clinic_visits %>% 
  filter(month == 'May')
```

```r
clinic_visits %>% 
  select(month == 'May')
```

```r
clinic_visits %>% 
  filter(month,'May')
```

```r
clinic_visits %>% 
  select(month,'May')
```

```r
inventory %>%
  mutate(remaining_inventory = initial_inventory - number_sold)
```

```r
inventory %>%
  transmute(remaining_inventory = initial_inventory - number_sold)
```

```r
inventory %>%
  mutate(remaining_inventory = inventory.initial_inventory - inventory.number_sold)
```

```r
inventory %>%
  transmute(remaining_inventory = inventory.initial_inventory - inventory.number_sold)
```

```r
photos %>%
  rename(num_likes = likes, num_comments = comments)
```

```r
photos %>%
  rename(likes = num_likes, comments = num_comments)
```

```r
photos %>%
  colnames(num_likes = likes, num_comments = comments)
```

```r
photos %>%
  colnames(likes = num_likes, comments = num_comments)
```

```r
grades %>%
  arrange(desc(unit_3))
```

```r
grades %>%
  order_by(desc(unit_3))
```

```r
attendance %>%
  transmute(student_name = student_name,
            total_absent_late = days_absent + days_late)
```

```r
attendance %>%
  mutate(student_name = student_name,
         total_absent_late = days_absent + days_late)
```

```r
attendance %>%
  transmute(total_absent_late = days_absent + days_late)
```

```r
attendance %>%
  mutate(total_absent_late = days_absent + days_late)
```

Practice what you've learned about manipulating data frames with dplyr in this multiple choice quiz!

Manipulating Data Frames in R

Learn the basics of loading, selecting, filtering and arranging data frames in R with dplyr.

Data lies at the heart of nearly every problem in the business world and society. Having the right tools to manipulate data and organize it in a meaningful way is integral to performing data analyses and discovering unique insights!

The dplyr package in R is designed to make data manipulation tasks simpler and more intuitive than working with base R functions only. Called a "grammar of data manipulation," dplyr provides functions that solve many challenges that arise when organizing tabular data (i.e., data in a table with rows and columns). Tabular data has a lot of the same functionality as tables from SQL or Excel, but dplyr adds the power of R. 

In addition to learning how to load data into R with the readr package, this lesson will introduce how to perform the following data manipulation tasks with dplyr:
* select columns of a table
* filter rows of a table
* arrange rows of a table in order

dplyr and readr are a part of the tidyverse, a collection of R packages designed for data science. In this and future lessons, you will use different packages of the tidyverse to more easily analyze and visualize data!

The tidyverse is a package itself, and it can be imported at the top of your file if you need to use any of the packages it contains.

In our lessons, however, we will explicitly import the packages within the tidyverse that we are using. To get started with readr and dplyr, you can import them at the top of your `.Rmd` R-markdown file or `.R` script:

```r
library(readr)
library(dplyr)
```


A data frame is an R object that stores tabular data in a table structure made up of rows and columns. You can think of a data frame as a spreadsheet or as a SQL table. While data frames can be created in R, they are usually imported with data from a CSV, an Excel spreadsheet, or a SQL query.

Data frames have rows and columns. Each column has a name and stores the values of one variable. Each row contains a set of values, one from each column. The data stored in a data frame can be of many different types: numeric, character, logical, or NA.

A data frame containing the address, age and name of students in a class could look like this:

<div class="narrative-table-container"> <div class="narrative-table-scroll">

|address|age|name|
|-|-|-|
|123 Main St.|34|John Smith|
|456 Maple Ave.|28|Jane Doe|
|789 Broadway|51|Joe Schmo|

</div>
</div>
<br>

As seen in the first row, the column names of this data frame are `address`, `age`, and `name`.
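
As a minimal sketch, the table above could be built by hand with base R's `data.frame()` function. The variable name `students` is an assumption made for this illustration:

```r
# Build the example table by hand; each named argument becomes one column.
students <- data.frame(
  address = c("123 Main St.", "456 Maple Ave.", "789 Broadway"),
  age = c(34, 28, 51),
  name = c("John Smith", "Jane Doe", "Joe Schmo"),
  stringsAsFactors = FALSE  # keep text columns as character vectors
)

colnames(students)  # the column names: address, age, name
nrow(students)      # the number of rows: 3
str(students)       # the type of each column
```

Each column holds a single type: `age` is numeric, while `address` and `name` are character.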

Note: when working with `dplyr`, you might see functions that take a data frame as an argument and output something called a tibble. Tibbles are modern versions of data frames in R, and they operate in essentially the same way. The terms tibble and data frame are often used interchangeably. Here on Codecademy we will use the term data frame!

What is a Data Frame?

When working with data frames, most of the time you will load in data from an existing data set. One of the most common formats for big datasets is the _CSV_.

_CSV (comma separated values)_ is a text-only spreadsheet format.  You can find CSVs in lots of places such as:
* online datasets from governments and companies (here's an example from <a href="https://catalog.data.gov/dataset?res_format=CSV" target="_blank" rel="noopener noreferrer">data.gov</a>)
* exported from Excel or Google Sheets
* exported from SQL

The first row of a CSV contains column headings. All subsequent rows contain values. Each column heading and each value is separated by a comma:

```
column1,column2,column3
value1,value2,value3
value4,value5,value6
```
That example CSV represents the following table:

<div class="narrative-table-container">

|column1|column2|column3|
|-|-|-|
|value1|value2|value3|
|value4|value5|value6|

</div>
<br>
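
To see the mapping from text to table, here is a small sketch that parses the example CSV from a string. It uses base R's `read.csv()` with its `text` argument as a stand-in, so that no file is needed. The variable names `csv_text` and `parsed` are assumptions for this illustration:

```r
# The example CSV held in a single string, one line per row.
csv_text <- "column1,column2,column3
value1,value2,value3
value4,value5,value6"

# The first line supplies the column headings; the rest become rows.
parsed <- read.csv(text = csv_text, stringsAsFactors = FALSE)

colnames(parsed)  # column1, column2, column3
dim(parsed)       # 2 rows, 3 columns
```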


CSVs

When you have data in a CSV, you can load it into a data frame in R using `readr`'s `read_csv()` function:

```r
df <- read_csv('my_csv_file.csv')
```
* In the example above, the `read_csv()` function is called
* The CSV file `my_csv_file.csv` is passed in as an argument
* A data frame containing the data from `my_csv_file.csv` is returned

You can also save data from a data frame to a CSV using `readr`'s `write_csv()` function:

```r
write_csv(df,'new_csv_file.csv')
```

In the example above, `write_csv()` takes two arguments:
* `df`, which represents a data frame object
* `new_csv_file.csv`, the name of the CSV file that will hold the data from the data frame

Because only a file name is given, this function will save the CSV file to your current working directory.
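
The two functions can be combined into a round trip. The sketch below is only an illustration: the data frame contents are made up, and it writes to a temporary file (via base R's `tempfile()`) rather than to the working directory:

```r
library(readr)

# Hypothetical data frame used only for this illustration.
df <- data.frame(name = c("Ada", "Grace"), age = c(36, 45))

# Write to a temporary file so the example leaves your working directory untouched.
path <- tempfile(fileext = ".csv")
write_csv(df, path)

# Read the file back in; read_csv() may print a message describing the column types.
df_loaded <- read_csv(path)

df_loaded
```

The loaded data frame should have the same column names and values as the one that was saved.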

Loading and Saving CSVs

When you load a new data frame from a CSV, you want to get an understanding of what the data looks like.

If the data frame is small, you can display it by typing its name `df`. If the data frame is larger, it can be helpful to inspect a few rows of the data frame without having to look at the rest of it.

The `head()` function returns the first 6 rows of a data frame. If you want to see more rows, you can pass an additional argument `n` to `head()`. For example, `head(df,8)` will show the first `8` rows.

The function `summary()` will return summary statistics such as mean, median, minimum and maximum for each numeric column while providing class and length information for non-numeric columns.
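
A short sketch of these inspection functions, using a made-up ten-row data frame (the column names `id` and `score` are assumptions for this example):

```r
# Hypothetical data frame with ten rows.
df <- data.frame(id = 1:10, score = seq(10, 100, by = 10))

first_six <- head(df)       # no n given, so the first 6 rows are returned
first_eight <- head(df, 8)  # the first 8 rows

nrow(first_six)
nrow(first_eight)

# Minimum, quartiles, median, mean, and maximum for each numeric column.
summary(df)
```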


Inspecting Data Frames

One of the most appealing aspects of dplyr is the ability to easily manipulate data frames. Each of the dplyr functions you will explore takes a data frame as its first argument.

The _pipe operator_, or `%>%`, helps increase the readability of data frame code by piping the value on its left into the first argument of the function that follows it. For example:

```r
df %>%
  head()
```

pipes the data frame `df` into the first argument of `head()`, becoming

```r
head(df)
```

The true power of pipes comes from the ability to link multiple function calls together. Once you learn some of dplyr's functions, we'll revisit pipes and see how they are so useful!

Note: the pipe operator `%>%` is *not* a part of base R. It comes from the `magrittr` package, but do not worry about loading magrittr in your code. Whenever you load dplyr, or the tidyverse as a whole, `%>%` will automatically be made available!
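
As a quick check of the equivalence described above, this sketch compares a piped call with the nested call it stands for. The data frame `df` here is a made-up example:

```r
library(dplyr)

# Hypothetical data frame used only for this illustration.
df <- data.frame(x = 1:10)

piped <- df %>% head(3)  # df becomes the first argument of head()
nested <- head(df, 3)    # the same call written without the pipe

identical(piped, nested)  # both forms give the same result

# Pipes can be linked: take the first 5 rows, then count them.
row_count <- df %>%
  head(5) %>%
  nrow()
```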

Piping

Suppose you have a data frame called `customers`, which contains the names, ages, and genders of your business's customers:

<div class='narrative-table-container'>

|name|age|gender|
|-|-|-|
|Rebecca Erikson|35|F|
|Thomas Roberson|28|M|
|Diane Ochoa|42|NA|

</div>
<br>

For your analysis, you only care about the age and gender of your customers, not their names. The data frame you want looks like this:

<div class='narrative-table-container'>

|age|gender|
|-|-|
|35|F|
|28|M|
|42|NA|

</div>
<br>
 
You can select the appropriate columns for your analysis using `dplyr`'s `select()` function:

```r
select(customers,age,gender)
```
* `select()` takes a data frame as its first argument
* all additional arguments are the desired columns to select
* `select()` returns a new data frame containing only the desired columns

But what about the pipe `%>%`, you ask? Great question. You can improve the readability of your code by using the pipe:

```r
customers %>%
  select(age,gender)
```

When using the pipe, you can read the code as: from the `customers` table, `select()` the `age` and `gender` columns. From now on we will use the pipe symbol where appropriate to simplify our code.
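
Here is a runnable sketch of the example above. It rebuilds the `customers` table by hand with `data.frame()`, which is an assumption about how the data might be created; the name `age_gender` is also just for illustration:

```r
library(dplyr)

# Rebuild the customers table shown above.
customers <- data.frame(
  name = c("Rebecca Erikson", "Thomas Roberson", "Diane Ochoa"),
  age = c(35, 28, 42),
  gender = c("F", "M", NA),
  stringsAsFactors = FALSE
)

# Keep only the age and gender columns.
age_gender <- customers %>%
  select(age, gender)

colnames(age_gender)  # age, gender
colnames(customers)   # the original data frame still has all three columns
```

Because `select()` returns a new data frame, `customers` itself is left unchanged unless you assign the result back to it.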

Selecting Columns

Sometimes, rather than specifying which columns you want to select from a data frame, it's easier to state which columns you do not want to select. `dplyr`'s `select()` function also enables you to do just that! Consider a `customers` data frame that contains biographical information for the customers of your business:

<div class="narrative-table-container">

|name|address|phone|age|
|-|-|-|-|
|Martha Jones|123 Main St.|234-567-8910|28|
|Rose Tyler|456 Maple Ave.|212-867-5309|22|
|Donna Noble|789 Broadway|949-123-4567|35|
|Amy Pond|98 West End Ave.|646-555-1234|29|
|Clara Oswald|54 Columbus Ave.|714-225-1957|31|

</div>
<br>

You are interested in analyzing where your customers live and how old they are. For your analysis, you do not care about the `name` and `phone` associated with a customer, only their `address` and `age`. To exclude the columns you do not need:

```r
customers %>%
  select(-name,-phone)
```
* the data frame `customers` is piped into `select()`
* the columns to remove, prepended with a `-`, are given as arguments
* a new data frame without the `name` and `phone` columns is returned
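
The steps above can be sketched as a runnable example. The `customers` table is rebuilt by hand here, and the name `address_age` is an assumption for this illustration:

```r
library(dplyr)

# Rebuild the customers table shown above.
customers <- data.frame(
  name = c("Martha Jones", "Rose Tyler", "Donna Noble", "Amy Pond", "Clara Oswald"),
  address = c("123 Main St.", "456 Maple Ave.", "789 Broadway",
              "98 West End Ave.", "54 Columbus Ave."),
  phone = c("234-567-8910", "212-867-5309", "949-123-4567",
            "646-555-1234", "714-225-1957"),
  age = c(28, 22, 35, 29, 31),
  stringsAsFactors = FALSE
)

# Drop name and phone; every other column is kept in its original order.
address_age <- customers %>%
  select(-name, -phone)

colnames(address_age)  # address, age
```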

Excluding Columns

Filtering Rows with Logic I

Filtering Rows with Logic II

Sometimes all the data you want is in your data frame, but it's all unorganized! Step in the handy dandy dplyr function `arrange()`! `arrange()` will sort the rows of a data frame in ascending order by the column provided as an argument.

For numeric columns, ascending order means from lower to higher numbers. For character columns, ascending order means alphabetical order from A to Z.

Let's look back at the `customers` data frame for your company:

<div class="narrative-table-container">
<div class="narrative-table-scroll">

|name|age|address|phone|
|-|-|-|-|
|Martha Jones|28|123 Main St.|234-567-8910|
|Rose Tyler|22|456 Maple Ave.|212-867-5309|
|Donna Noble|35|789 Broadway|949-123-4567|
|Amy Pond|29|98 West End Ave.|646-555-1234|
|Clara Oswald|31|54 Columbus Ave.|714-225-1957|

</div>
</div>
<br>

To arrange the customers in ascending order by name:

```r
customers %>%
  arrange(name)
```
* the `customers` data frame is piped into `arrange()`
* the column to order by, `name`, is given as an argument
* a new data frame is returned with rows in ascending order by `name`

`arrange()` can also order rows by descending order! To arrange the customers in descending order by age:

```r
customers %>%
  arrange(desc(age))
```
* the `customers` data frame is again piped into `arrange()`
* the column to order by, `age`, is given as an argument to `desc()`, which is then given as an argument to `arrange()`
* a new data frame is returned with rows in descending order by `age`

If multiple arguments are provided to `arrange()`, it will order the rows by the column given as the first argument and use the additional columns to break ties in the values of preceding columns.
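
To illustrate tie-breaking, the sketch below uses a smaller, made-up version of `customers` in which the ages have been changed so that two pairs of customers share an age. These values are hypothetical and differ from the table above:

```r
library(dplyr)

# Hypothetical data: ages altered so that ties occur.
customers <- data.frame(
  name = c("Martha Jones", "Rose Tyler", "Donna Noble", "Amy Pond"),
  age = c(28, 22, 28, 22),
  stringsAsFactors = FALSE
)

# Sort by age first; within each age, sort by name.
by_age_then_name <- customers %>%
  arrange(age, name)

by_age_then_name$name
# Amy Pond, Rose Tyler (both 22), then Donna Noble, Martha Jones (both 28)

# desc() can be applied to one column while the others stay ascending.
oldest_first <- customers %>%
  arrange(desc(age), name)

oldest_first$name
# Donna Noble, Martha Jones (both 28), then Amy Pond, Rose Tyler (both 22)
```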

Arranging Rows

There you have it! With the power of readr and dplyr in your hands, you can now:
* load data from a CSV into a data frame
* inspect the data frame with `head()` and `summary()`
* `select()` the columns you want to analyze
* `filter()` the rows with comparison and logical operators
* `arrange()` rows in ascending or descending order

You've also been exposed to the pipe `%>%`, a powerful tool for chaining function calls, as well as the general principles of data manipulation.

Now that you are well on your way to being a dplyr master, let's combine what you have learned together to perform an analysis and see the true power of the pipe!

Introduction to Data Frames in R

Learn the basics of modifying data frames in R with dplyr.

When working with data frames, you often need to modify the columns for your analysis at hand. With the help of the dplyr package, data frame modifications are easily performed.

In this lesson, you'll learn how to modify an existing data frame using dplyr. Some of the skills you'll learn include:
* adding columns to an existing data frame
* adding new columns and dropping existing columns from a data frame
* renaming columns

Sometimes you might want to add a new column to a data frame. This new column could be a calculation based on the data that you already have.

Suppose you own a hardware store called The Handy Woman and have a data frame containing inventory information:

<div class="narrative-table-container">

|product_id|product_description|cost_to_manufacture|price|
|--------------|---------------------------|----------------------------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 |
|2                  | 2 inch nail               | 0.10                            | 0.25 |
|3                 | hammer                    | 3.00                            | 5.50 |
|4                  | screwdriver            | 2.50                            | 3.00|

</div>
<br>

You can add a new column to the data frame using the `mutate()` function. `mutate()` takes a name-value pair as an argument. The name will be the name of the new column you are adding, and the value is an expression defining the values of the new column in terms of the existing columns. `mutate()` returns a new data frame with the added column.

Maybe you want to add a column to your inventory table with the amount of sales tax that is charged for each item. The following code multiplies each `price` by `0.075`, the sales tax rate in your state:

```r
df %>%
  mutate(sales_tax = price * 0.075)
```
Now the inventory table has a column called `sales_tax`, where the value is `0.075 * price` (shown here rounded to two decimal places):

<div class="narrative-table-container">

|product_id|product_description|cost_to_manufacture|price|sales_tax|
|--------------|---------------------------|----------------------------|--------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 | 0.06|
|2                  | 2 inch nail               | 0.10                            | 0.25 | 0.02|
|3                 | hammer                    | 3.00                            | 5.50 | 0.41|
|4                  | screwdriver            | 2.50                            | 3.00| 0.22|

</div>
<br>
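
A runnable sketch of this step follows. The inventory table is rebuilt by hand, and the names `inventory` and `with_tax` are assumptions made for this illustration (the lesson's own snippets call the data frame `df`):

```r
library(dplyr)

# Rebuild the inventory table shown above.
inventory <- data.frame(
  product_id = 1:4,
  product_description = c("3 inch screw", "2 inch nail", "hammer", "screwdriver"),
  cost_to_manufacture = c(0.50, 0.10, 3.00, 2.50),
  price = c(0.75, 0.25, 5.50, 3.00),
  stringsAsFactors = FALSE
)

# Add a sales_tax column computed from the existing price column.
with_tax <- inventory %>%
  mutate(sales_tax = price * 0.075)

with_tax$sales_tax  # unrounded values: 0.05625, 0.01875, 0.4125, 0.225

# mutate() returns a new data frame; the original is unchanged.
colnames(inventory)
```

To keep the new column, assign the result to a variable, as done with `with_tax` here.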


Adding a Column

Let's refer back to the inventory table for your store, The Handy Woman.

<div class="narrative-table-container">

|product_id|product_description|cost_to_manufacture|price|sales_tax|
|--------------|---------------------------|----------------------------|--|--|
|1| 3 inch screw            | 0.50                            | 0.75 |0.06|
|2                  | 2 inch nail               | 0.10         | 0.25 |0.02|
|3                 | hammer                    | 3.00            | 5.50 |0.41|
|4                  | screwdriver            | 2.50               | 3.00|0.22|

</div>
<br>

You want to add two more new columns to your table. One column will contain the profit made from selling each item (`price - cost_to_manufacture`), and the other will state whether the item is currently in stock (suppose every item is currently in stock).

`mutate()` can take multiple arguments to add any number of new columns to a data frame:

```r
df %>%
  mutate(profit = price - cost_to_manufacture,
         in_stock = TRUE)
```
* `mutate()` takes two arguments, defining new columns `profit` and `in_stock`
* `profit` is equal to `price` minus `cost_to_manufacture`
* `in_stock`, rather than being derived from values in existing columns, is given the value `TRUE` for all rows

The inventory table will now look like this:

<div class="narrative-table-container">

|product_id|product_description|cost_to_manufacture|price|sales_tax|profit|in_stock|
|--------------|---------------------------|----------------------------|-|-|-|--|
|1                  | 3 inch screw            | 0.50                            | 0.75 |0.06|0.25|TRUE|
|2                  | 2 inch nail               | 0.10                            | 0.25 |0.02|0.15|TRUE|
|3                 | hammer                    | 3.00                            | 5.50 |0.41|2.5|TRUE|
|4                  | screwdriver            | 2.50                            | 3.00|0.22|0.5|TRUE|

</div>
<br>
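
The same idea as a self-contained sketch. The `inventory` data frame is rebuilt by hand, and the `sales_tax` column is left out to keep the example short; the variable names are assumptions for this illustration:

```r
library(dplyr)

# Rebuild the inventory columns needed for this example.
inventory <- data.frame(
  product_id = 1:4,
  product_description = c("3 inch screw", "2 inch nail", "hammer", "screwdriver"),
  cost_to_manufacture = c(0.50, 0.10, 3.00, 2.50),
  price = c(0.75, 0.25, 5.50, 3.00),
  stringsAsFactors = FALSE
)

# Add two columns in a single mutate() call.
with_profit <- inventory %>%
  mutate(profit = price - cost_to_manufacture,
         in_stock = TRUE)  # a single value is repeated for every row

with_profit$profit    # 0.25, 0.15, 2.5, 0.5
with_profit$in_stock  # TRUE for all four rows
```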


Adding Multiple Columns

When creating new columns from a data frame, sometimes you are interested in only keeping the new columns you add, and removing the ones you do not need. dplyr's `transmute()` function will add new columns while dropping the existing columns that may no longer be useful for your analysis. Let's go back to the original inventory data frame for your store, The Handy Woman.

<div class="narrative-table-container">

|product_id|product_description|cost_to_manufacture|price|
|--------------|---------------------------|----------------------------|-|
|1                  | 3 inch screw            | 0.50                            | 0.75 |
|2                  | 2 inch nail               | 0.10                            | 0.25 |
|3                 | hammer                    | 3.00                            | 5.50 |
|4                  | screwdriver            | 2.50                            | 3.00|

</div>
<br>

Like `mutate()`, `transmute()` takes name-value pairs as arguments. The names will be the names of the new columns you are adding, and the values are expressions defining the values of the new columns. The difference, however, is that `transmute()` returns a data frame with _only_ the new columns.

To add `sales_tax` and `profit` columns while dropping all other columns from the data frame:

```r
df %>%
  transmute(sales_tax = price * 0.075,
            profit = price - cost_to_manufacture)
```

The resulting data frame will look like this:

<div class="narrative-table-container">

|sales_tax|profit|
|-|-|
|0.06|0.25|
|0.02|0.15|
|0.41|2.5|
|0.22|0.5|

</div>
<br>
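
A runnable sketch of the `transmute()` example follows, with the inventory table rebuilt by hand under the assumed name `inventory`. In recent dplyr releases `transmute()` is, as far as we know, marked as superseded in favor of `mutate()` with `.keep = "none"`, but it continues to work:

```r
library(dplyr)

# Rebuild the original inventory table.
inventory <- data.frame(
  product_id = 1:4,
  product_description = c("3 inch screw", "2 inch nail", "hammer", "screwdriver"),
  cost_to_manufacture = c(0.50, 0.10, 3.00, 2.50),
  price = c(0.75, 0.25, 5.50, 3.00),
  stringsAsFactors = FALSE
)

# Create the two new columns and drop everything else.
tax_and_profit <- inventory %>%
  transmute(sales_tax = price * 0.075,
            profit = price - cost_to_manufacture)

colnames(tax_and_profit)  # sales_tax, profit
```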


Transmute Columns

Modifying Data Frames in R

R is a widely used programming language that works well with data. It’s a great option for statistical analysis, and has an active development community that’s constantly releasing new packages, making R even easier to use. It’s built around a central data science concept: the data frame. If you’re interested in data science, analysis, and visualization, you’ll want to learn how to use R.

### Skills you'll gain
* Write code in R
* Organize, edit, and clean data
* Create data visualizations

Learn how to code and clean and manipulate data for analysis and visualization with the R programming language.