In this lesson, we will learn about sample data, sampling distributions, and the Central Limit Theorem.

In statistics, we often want to learn about a large population. Since collecting data for an entire population is often impossible, researchers may use a smaller sample of data to try to answer their questions.

To do this, a researcher might calculate a statistic such as mean or median for a sample of data. Then they can use that statistic as an estimate for the population value they really care about.

For example, suppose that a researcher wants to know the average weight of all Atlantic Salmon fish. It would be impossible to catch every single fish. Instead, the researcher might collect a sample of 50 fish off the coast of Nova Scotia and determine that the average weight of those fish is *x*. If the same researcher collected 50 new fish and took the new average weight, that average would likely be slightly different from the first sample average.

Over the course of this lesson, we will go over how we can extrapolate from sample data in order to describe our uncertainty about the statistics of the full population.

Sampling from a Population

Now that we've generated some random samples from a population using an applet, let's code this ourselves in Python. The `numpy.random` package has several functions that we could use to simulate random sampling. In this exercise, we'll use the function `np.random.choice()`, which generates a sample of some size from a given array. 

In the example code, we'll pretend that we're all-powerful and actually have a list of all the weights of Atlantic Salmon that currently exist.

In the example code to the right, we have done the following:
* Loaded in the weights of all salmon into a dataframe called `population`.
* Plotted the distribution of `population` and calculated the mean.
* Used the `np.random.choice()` function to generate a sample called `sample` of size 30 (the `samp_size` variable is equal to `30`).
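
As a rough sketch of the steps above, the block below builds a made-up population and draws one sample from it. The gamma-distributed weights, the seed, and the population size are assumptions for illustration only; they stand in for the real salmon data, which lives in a dataframe in the workspace.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical stand-in for the salmon weights (right-skewed, in lbs)
population = np.random.gamma(shape=2.0, scale=30.0, size=10_000)
population_mean = np.mean(population)

# Draw one random sample of 30 fish without replacement
samp_size = 30
sample = np.random.choice(population, samp_size, replace=False)
sample_mean = np.mean(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
```

Changing the seed and rerunning gives a different sample mean each time, which is the variation the rest of the lesson sets out to describe.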

Random Sampling in Python

As we saw in the last example, each time we sample from a population, we will get a slightly different sample mean. In order to understand how much variation we can expect in those sample means, we can do the following:

- Take a bunch of random samples of fish, each of the same size (50 fish in this example)
- Calculate the sample mean for each one
- Plot a histogram of all the sample means

This process gives us an estimate of the sampling distribution of the mean for a sample size of 50 fish.

The code to accomplish this is shown below:

```python
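import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# `population` is the dataframe of salmon weights loaded earlier in the lesson
salmon_population = population['Salmon_Weight']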

sample_size = 50
sample_means = []

# loop 500 times to get 500 random sample means
for i in range(500):
  # take a sample from the data:
  samp = np.random.choice(salmon_population, sample_size, replace = False)
  # calculate the mean of this sample:
  this_sample_mean = np.mean(samp)
  # append this sample mean to a list of sample means
  sample_means.append(this_sample_mean)

# plot all the sample means to show the sampling distribution
sns.histplot(sample_means, stat='density')
plt.title("Sampling Distribution of the Mean")
plt.show()
```

The distribution of the `sample_means` looks like this:

![This is a sampling distribution built from 500 sample means. The distribution is centered around x=60 and looks fairly symmetrical.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/sampling-distributions/sampling_dist.svg)

Note that we can look at a sampling distribution for any statistic. For example, we could estimate the sampling distribution of the maximum by calculating the maximum of each sample, rather than the mean (as shown above).

Sampling Distributions

So far, we've defined the term *sampling distribution* and shown how we can simulate an approximated sampling distribution for a few different statistics (mean, maximum, variance, etc.). The *Central Limit Theorem (CLT)* allows us to specifically describe the sampling distribution of the mean.

The CLT states that the sampling distribution of the mean is approximately normally distributed as long as the population is not too skewed or the sample size is large enough. Using a sample size of n > 30 is usually a good rule of thumb for most population distributions, although a heavily skewed population can call for a larger sample. If the distribution of the population is normal, the sample size can be smaller than that.

Let's take another look at the salmon weight to see how the CLT applies here. The first plot below shows the population distribution. The salmon weight is skewed right, meaning the tail of the distribution is longer on the right than on the left. 

![This graph shows the distribution of salmon weights across the entire population. The distribution is right-skewed as it ranges from 0 to almost 300 pounds.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/sampling-distributions/pop_distribution.svg)

Next, we've simulated a sampling distribution of the mean (using a sample size of 100) and superimposed a normal distribution on top of it. Note how the estimated sampling distribution follows the normal curve almost perfectly.

![This graph shows the sampling distribution of salmon weights across a sample size of 50. The sampling distribution is approximately normal, despite the population distribution being right-skewed, showcasing one of the key ideas behind the central limit theorem.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/sampling-distributions/normal_samp_distribution.svg)

Note that the CLT only applies to the sampling distribution of the mean and not other statistics like maximum, minimum, and variance!
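
One way to check this numerically, rather than by eye, is to compare the skewness of the population with the skewness of the simulated sample means. The sketch below is only an illustration: the gamma-distributed population and the `skewness` helper are assumptions, not part of the lesson's dataset or code.

```python
import numpy as np

rng = np.random.default_rng(1)

def skewness(values):
    """Average cubed z-score: near 0 for a symmetric distribution."""
    values = np.asarray(values)
    z = (values - values.mean()) / values.std()
    return np.mean(z ** 3)

# Hypothetical right-skewed population standing in for the salmon weights
population = rng.gamma(shape=2.0, scale=30.0, size=10_000)

# Simulate the sampling distribution of the mean for samples of 50 fish
sample_means = [
    rng.choice(population, 50, replace=False).mean() for _ in range(1000)
]

pop_skew = skewness(population)
means_skew = skewness(sample_means)
print(f"Skewness of population:   {pop_skew:.2f}")
print(f"Skewness of sample means: {means_skew:.2f}")
```

A skewness much closer to zero for the sample means than for the population is what we would expect if the sampling distribution is close to normal.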

Central Limit Theorem

Now that we've examined the CLT from a high level, let's get into the details.

The CLT not only establishes that the sampling distribution will be normally distributed, but it also allows us to describe that normal distribution quantitatively. Normal distributions are described by their mean `μ` (mu) and standard deviation `σ` (sigma).

Let's break this up:
- We take samples of size *n* from a population (that has a true population mean `μ` and standard deviation of `σ`) and calculate the sample mean `x` of each one.
- Given that *n* is sufficiently large (n > 30), the sampling distribution of the means will be approximately normally distributed with:
    - mean approximately equal to the population mean `μ`
    - standard deviation equal to the population standard deviation divided by the square root of the sample size. We can write this out as:
```tex
Sampling\ Distribution\ St.Dev = \frac{\sigma}{\sqrt{n}}
```

We'll focus on the first point in this exercise and the second point in the next exercise.

As an example of this, let's look again at our salmon fish population. Last exercise, we saw that the sampling distribution of the mean was normally distributed. In the plot below, we can see that the mean of the simulated sampling distribution is approximately equal to the population mean.

![This graph shows the distribution of salmon weights across the entire population. The mean is at about 60.7 pounds. The distribution is right-skewed as it ranges from 0 to almost 300 pounds.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/sampling-distributions/pop_mean.svg)
![This graph shows the sampling distribution of salmon weights across a sample size of 50. The mean is at about 60.8 pounds which is almost identical to the population mean. The sampling distribution is also approximately normal showcasing both key ideas behind the central limit theorem.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/sampling-distributions/mean_sampling_dist.svg)

In the workspace, we've simulated a sampling distribution of the mean using a sample size of 50.

CLT Continued

The second part of the Central Limit Theorem is:  

The sampling distribution of the mean is normally distributed, with standard deviation equal to the population standard deviation (often denoted by the Greek letter sigma, σ) divided by the square root of the sample size (often denoted as n):

```tex
\frac{\sigma}{\sqrt{n}}
```    
The standard deviation of a sampling distribution is also known as the *standard error of the estimate of the mean*. In many instances, we cannot know the population standard deviation, so we estimate the standard error using the sample standard deviation:

```tex
\frac{standard\ deviation\ of\ our\ sample}{\sqrt{\text{sample size}}}
```

Two important things to note about this formula are that:
- As sample size increases, the standard error will decrease.
- As the population standard deviation increases, so will the standard error.
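
To see both points in numbers, the sketch below compares the standard deviation of simulated sample means with `σ / sqrt(n)` for a few sample sizes. The population is a made-up gamma distribution, and the samples are drawn with replacement to keep the code short; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical right-skewed population standing in for the salmon weights
population = rng.gamma(shape=2.0, scale=30.0, size=100_000)
sigma = population.std()  # population standard deviation

results = {}
for n in (10, 50, 200):
    # 2000 samples of size n, one per row (drawn with replacement)
    samples = rng.choice(population, size=(2000, n))
    simulated_se = samples.mean(axis=1).std()
    theoretical_se = sigma / np.sqrt(n)
    results[n] = (simulated_se, theoretical_se)
    print(f"n={n:>3}: simulated SE={simulated_se:.2f}, "
          f"sigma/sqrt(n)={theoretical_se:.2f}")
```

The two columns should track each other closely, and both should shrink as `n` grows.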

Standard Error

According to the Central Limit Theorem, the mean of the sampling distribution of the mean is equal to the population mean. This is the case for some, but not all, sampling distributions. Remember, you can have a sampling distribution for any sample statistic, including:

  - mean
  - median
  - max / min
  - variance

Because the mean of the sampling distribution of the mean is equal to the mean of the population, we call the sample mean an *unbiased estimator* of the population mean. A statistic is called an unbiased estimator of a population parameter if the mean of the sampling distribution of the statistic is equal to the value of the statistic for the population.

The maximum is one example of a *biased estimator*, meaning that the mean of the sampling distribution of the maximum is not centered at the population maximum.
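
The difference between the two estimators can be sketched with a small simulation. The population below is hypothetical (a gamma distribution chosen only because it is right-skewed), and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical population standing in for the salmon weights
population = rng.gamma(shape=2.0, scale=30.0, size=10_000)

sample_size = 50
sample_means, sample_maxes = [], []
for _ in range(500):
    samp = rng.choice(population, sample_size, replace=False)
    sample_means.append(samp.mean())
    sample_maxes.append(samp.max())

# Bias = center of the sampling distribution minus the population value
mean_bias = np.mean(sample_means) - population.mean()
max_bias = np.mean(sample_maxes) - population.max()

print(f"Bias of the sample mean:    {mean_bias:.2f}")
print(f"Bias of the sample maximum: {max_bias:.2f}")
```

The bias of the mean should land near zero, while the bias of the maximum is negative: a sample's largest value can never exceed the population's largest value, so the sample maximum tends to underestimate it.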

Biased Estimators

Once we know the sampling distribution of the mean, we can also use it to estimate the probability of observing a particular range of sample means, given some information (either known or assumed) about the population. To do this, we can use the *Cumulative Distribution Function (CDF)* of the normal distribution.

Let's work through this with our salmon fish example. Let's say we are transporting the salmon and want to make sure the crate we carry the fish in will be strong enough to hold the weight. 

- Suppose we estimate that the salmon population has an average weight of 60 lbs with a standard deviation of 40 lbs. 
- We have a crate that supports 750 lbs, and we want to be able to transport 10 fish at a time. 
- We want to calculate the probability that the average weight of those 10 fish is less than or equal to 75 (750/10).

Using the CLT, we first estimate that the mean weight of 10 randomly sampled salmon from this population is normally distributed with mean = 60 and standard error = 40/10^.5. Then, we can use this probability distribution to calculate the probability that 10 randomly sampled fish will have a mean weight less than or equal to 75.

```python
from scipy import stats

x = 75
mean = 60
std_dev = 40
samp_size = 10
standard_error = std_dev / (samp_size**.5)
# remember that **.5 is raising to the power of one half, or taking the square root

stats.norm.cdf(x,mean,standard_error)
```

This returns 0.882, or a probability of 88.2% that the average weight of our sample of 10 fish will be less than or equal to 75. 
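
The same probability can be reached by standardizing first: converting 75 into a z-score (the number of standard errors above the mean) and looking that up in the standard normal CDF. The sketch below uses `statistics.NormalDist` from the Python standard library as a stand-in for `stats.norm.cdf`; that substitution is a choice made here, not the lesson's own approach.

```python
from math import sqrt
from statistics import NormalDist

x = 75
mean = 60
std_dev = 40
samp_size = 10
standard_error = std_dev / sqrt(samp_size)

# Standardize: how many standard errors is 75 above the mean?
z = (x - mean) / standard_error

# Probability from the standard normal CDF
prob_from_z = NormalDist().cdf(z)

# Same probability from the unstandardized normal distribution
prob_direct = NormalDist(mu=mean, sigma=standard_error).cdf(x)

print(f"z = {z:.3f}")
print(f"P(sample mean <= 75) = {prob_from_z:.3f}")
```

Both routes give the same answer because subtracting the mean and dividing by the standard error only rescales the normal curve.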


Calculating Probabilities

Let's recap what we've learned in this lesson:

- A sampling distribution is obtained by taking a random sample of a certain size multiple times, taking a sample statistic, and plotting the distribution of this sample statistic.
- The Central Limit Theorem establishes that, for a sufficiently large sample size, the sampling distribution of the mean will be approximately normally distributed (even if the original population was not normally distributed).
- A statistic is called an unbiased estimator of a population parameter if the mean of the sampling distribution of the statistic is equal to the value of the statistic for the population. The mean is an unbiased estimator.
- We can use the Standard Error of our sample mean distribution in order to calculate probabilities of obtaining a sample with a certain statistic using the CDF.


Review

Learn how to quantify the statistics of a randomly sampled experiment and visualize the distribution.

#### Probability
*Probability* is a branch of mathematics that allows us to quantify uncertainty. In our daily lives, we often use probability to make decisions, even without thinking about it! 

For example, many weather reports give a percent chance that it will rain. If we hear that there is an 80 percent chance of rain, we probably are not going to make many plans outside. However, if there is only a 5 percent chance of rain, we may feel comfortable planning a picnic.

In this article, we are going to build a foundation for understanding probability. To do this, we are going to explore a field of mathematics called *set theory*.

#### Set Theory 

*Set theory* is a branch of mathematics based around the concept of *sets*. In simple terms, a set is a collection of things. For example, we can use a set to represent items in a backpack. We might have:

```tex
\{Book, Paper, Folder, Hat, Pen, Snack\}
```

Notationally, mathematicians often represent sets with curly braces. Sets also follow two key rules: 

* Each element in a set is distinct.
* The elements in a set are in no particular order. 

Therefore, we can say: 

```tex
\{1, 2, 3, 4, 5\} = \{5, 3, 2, 4, 1\}
```

When defining a set, we often use a capital letter. For example:

```tex
A = \{1, 2, 3, 4, 5\}
```

Sets can also contain *subsets*. Set *A* is a subset of set *B* if all the elements in *A* exist within *B*. For example:

```tex
\begin{aligned}
A = \{1, 2, 3\} \\
B = \{1, 2, 3, 4, 5\} \\
\end{aligned}
```

Here, set *A* is a subset of *B* because all elements of *A* are contained within *B*.
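
Python's built-in `set` type follows the same rules, so the examples above can be checked directly. This is a small illustrative sketch; the variable names are made up here.

```python
# A set is a collection of distinct items
backpack = {"Book", "Paper", "Folder", "Hat", "Pen", "Snack"}

# Order does not matter, and duplicates collapse into one element
same_set = {1, 2, 3, 4, 5} == {5, 3, 2, 4, 1}
no_duplicates = {1, 1, 2, 3} == {1, 2, 3}

# Subsets: every element of A is also in B, but not the other way around
A = {1, 2, 3}
B = {1, 2, 3, 4, 5}
a_subset_of_b = A.issubset(B)  # equivalent to A <= B
b_subset_of_a = B.issubset(A)

print(same_set, no_duplicates, a_subset_of_b, b_subset_of_a)
```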

#### Experiments and Sample Spaces

In probability, an *experiment* is something that produces observation(s) with some level of uncertainty. A *sample point* is a single possible outcome of an experiment. Finally, a *sample space* is the set of all possible sample points for an experiment.  

For example, suppose that we run an experiment where we flip a coin twice and record whether each flip results in heads or tails. There are four sample points in this experiment: two heads (*HH*), tails and then heads (*TH*), heads and then tails (*HT*), or two tails (*TT*). We can write the full sample space for this experiment as follows:

```tex
S = \{HH, TT, HT, TH\}
```

Suppose we are interested in the probability of one specific outcome: *HH*. A specific outcome (or set of outcomes) is known as an *event* and is a subset of the sample space. Three events we might look at in this sample space are:

```tex
\begin{aligned}
\text{Getting Two Heads} \\
A = \{HH\} \\
\text{Getting Two Tails} \\
B = \{TT\} \\
\text{Getting Both a Heads and Tails}\\
C = \{HT, TH\}
\end{aligned}
```
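
For larger experiments, listing the sample space by hand gets tedious. As a sketch, `itertools.product` can enumerate it, and the events can be written as Python sets (the names below mirror the sets above and are otherwise illustrative):

```python
from itertools import product

# Every ordered pair of outcomes from two flips of a coin
sample_space = {"".join(flips) for flips in product("HT", repeat=2)}

A = {"HH"}        # getting two heads
B = {"TT"}        # getting two tails
C = {"HT", "TH"}  # getting both a heads and a tails

# Each event is a subset of the sample space
events_are_subsets = all(event <= sample_space for event in (A, B, C))

print(sorted(sample_space))
print(events_are_subsets)
```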

The frequentist definition of *probability* is as follows: If we run an experiment an infinite number of times, the probability of each event is the proportion of times it occurs. Unfortunately, we don't have the ability to flip two coins an infinite number of times, but we can estimate probabilities by choosing some other large number, such as 1000. Let's give it a try!

Okay, we have flipped two coins 1000 times. Wasn't that FUN? Here are each of the outcomes and the number of times we observed each one:

* *{HH}*: 252 
* *{TT}*: 247
* *{HT}*: 256
* *{TH}*: 245

To calculate the estimated probability of any one outcome, we use the following formula: 

```tex
P(Event) = \frac{\text{Number of Times Event Occurred}}{\text{Total Number of Trials}}
```

In this scenario, a *trial* is a single run of our experiment (two coin flips). So, the probability of two heads on two coin flips is approximately:

```tex
P(HH) =\frac{252}{1000} = .252
```

Based on these 1000 trials, we would estimate there is a 25.2 percent chance of getting two heads on two coin flips. This is great! However, if we do this same procedure over and over again, we may get slightly different results. For example, if we repeat the experiment another 1000 times, we might get two heads only 24.2 percent of the time. 
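
The same formula applies to every outcome in the table. A minimal sketch using the observed counts from above (the dictionary layout is just one convenient way to hold them):

```python
# Observed counts from the 1000 trials above
counts = {"HH": 252, "TT": 247, "HT": 256, "TH": 245}
total_trials = sum(counts.values())

# P(Event) = number of times the event occurred / total number of trials
estimated = {outcome: n / total_trials for outcome, n in counts.items()}

# An event with several outcomes: one heads and one tails, in either order
p_one_of_each = estimated["HT"] + estimated["TH"]

for outcome, p in estimated.items():
    print(f"P({outcome}) is approximately {p:.3f}")
print(f"P(HT or TH) is approximately {p_one_of_each:.3f}")
```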

If we want to feel confident that we are close to the true probability of a particular event, we can leverage the *law of large numbers*.

#### Law of Large Numbers 

We can't repeat our random experiment an infinite number of times (as much FUN as that would be!). However, we can still flip both coins a large number of times. As we flip both coins more and more, the observed proportion of times each event occurs will converge to its true probability. This is called the *law of large numbers*.

Let's observe the law of large numbers in real-time. We will use Python to simulate flipping both coins as many times as we want and watch the proportion of two heads converge to its true probability.

Let's walk through each part of the code below one step at a time. You do not need to worry about every line of code, but understanding the overall objective will help you build your understanding of probability.

<Assessment id="605219b8d47fa6000fa3272e" />


After setting `num_trials` to large numbers, we see that the proportion of trials resulting in two heads converges to 0.25. The horizontal line at *y=0.25* is completely covered after about one hundred thousand flips. By simulating a huge number of flips in Python, we have strong evidence that the true probability of seeing two heads on two separate coin flips is equal to 0.25.
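
For readers who want to try this outside the interactive walkthrough, here is a minimal sketch of such a simulation. It is not the lesson's own code: apart from `num_trials`, the variable names, the seed, and the use of NumPy's random generator are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
num_trials = 100_000

# Each trial flips two fair coins; 1 = heads, 0 = tails
flips = rng.integers(0, 2, size=(num_trials, 2))
two_heads = flips.sum(axis=1) == 2

# Running proportion of two-heads outcomes after each trial
running_proportion = np.cumsum(two_heads) / np.arange(1, num_trials + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"After {n:>7,} trials: {running_proportion[n - 1]:.4f}")
```

The early proportions can be far from 0.25, while the later ones should settle close to it.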

#### Review

We have covered:
* An introduction to probability
* An introduction to set theory
* Sample spaces and events
* The law of large numbers

This completes our brief introduction to what probability is and how we can represent it. Now, it's time to dive into ways we can calculate probability and expand on our knowledge.

Learn about what probability is and the language we use to represent it!

Probability, Set Theory, and the Law of Large Numbers

Rules of Probability

Learn about various probability rules through interactive applets and Python programming.

Probability is a way to quantify uncertainty. When we flip a fair coin, we say that there is a 50 percent chance (probability = 0.5) of it coming up tails. This means that as we flip more and more fair coins, the proportion that come up tails gets closer and closer to one half. Similarly, when we roll a six-sided die, we say there is a 1 in 6 chance of rolling a five.

What if we flip a coin in one hand and roll a die in the other at the same time? What is the probability that the coin comes up tails AND the die comes up as a five? Is there a way to quantify the probability that these two different events BOTH occur? In this lesson, we will walk through different rules of probability that help us quantify the probability of multiple random events.


Introduction

Let's dive into some key concepts we will use throughout this lesson: *union*, *intersection*, and *complement*. 

#### Union

The *union* of two sets encompasses any element that exists in either one or both of them. We can represent this visually as a *Venn diagram*. 

![A Venn diagram that shows two overlapping circles, one that represents event A and one that represents event B. Both of these circles are shaded blue.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/union-venndiagram.svg)

For example, let's say we have two sets, *A* and *B*. *A* represents rolling an odd number with a six-sided die (the set *{1, 3, 5}*). *B* represents rolling a number greater than two (the set *{3, 4, 5, 6}*). The union of these two sets would be everything in either set *A*, set *B*, or both: *{1, 3, 4, 5, 6}*. We can write the union of two events mathematically as *(A or B)*.

#### Intersection

The *intersection* of two sets encompasses any element that exists in both of the sets. Visually: 

![A Venn diagram that shows two overlapping circles, one that represents event A and one that represents event B. Only the overlap of the two circles is shaded.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/intersection-venndiagram.svg)

The intersection of the above sets (*A* represents rolling an odd number on a six-sided die and *B* represents rolling a number greater than two) includes any value that appears in both sets: *{3, 5}*. We can write the intersection of two events mathematically as *(A and B)*.

#### Complement

Lastly, the *complement* of a set consists of all possible outcomes outside of the set. Visually:

![A Venn diagram that shows a circle representing event A. Everything outside of event A is shaded blue.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/complement-venndiagram.svg)

Consider set *A* from the above example (rolling an odd number on a 6-sided die). The complement of this set would be rolling an even number: *{2, 4, 6}*. We can write the complement of set *A* as *A<sup>C</sup>*. One key feature of complements is that a set and its complement cover the entire sample space. In this die roll example, the set of even numbers and odd numbers would cover all possible rolls: *{1, 2, 3, 4, 5, 6}*.
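
These three operations map directly onto Python's set operators. A short sketch using the die-roll sets from this exercise (the variable names are illustrative):

```python
sample_space = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}     # rolling an odd number
B = {3, 4, 5, 6}  # rolling a number greater than two

union = A | B                    # in A, in B, or in both
intersection = A & B             # in both A and B
complement_A = sample_space - A  # everything in the sample space outside A

print(union)
print(intersection)
print(complement_A)
```

Note that Python has no built-in notion of a sample space, so the complement is written as a set difference from `sample_space`.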

Union, Intersection, and Complement

Imagine that we flip a fair coin 5 times and get 5 heads in a row. Does this affect the probability of getting heads on the next flip? Even though we may feel like it's time to see "tails", it is impossible for a past coin flip to impact a future one. The fact that previous coin flips do not affect future ones is called *independence*. Two events are *independent* if the occurrence of one event does not affect the probability of the other.  

Are there cases where previous events DO affect the outcome of the next event? Suppose we have a bag of five marbles: two marbles are blue and three marbles are red. If we take one marble out of the bag, what is the probability that the second marble we take out is blue?

![A diagram of the possible outcomes of pulling two marbles out of a bag when pulling them out without replacement.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/marble-diagram-1.svg)

This describes two events that are *dependent*. The probability of grabbing a blue marble in the second event *depends* on whether we take out a red or a blue marble in the first event. 

What if we had put back the first marble? Is the probability that we pick a blue marble second independent or dependent on what we pick out first? In this case, the events would be independent. 

![A diagram of the possible outcomes of pulling two marbles out of a bag when pulling them out with replacement.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/marble-diagram-2.svg)

Why do we care if events are independent or dependent? Knowing this helps us quantify the probability of events that depend on preexisting knowledge. This helps researchers understand and predict complex processes such as:

* Effectiveness of vaccines
* The weather on a particular day
* Betting odds for professional sports games

We will explore applications of this further in the lesson!

Independence and Dependence

Two events are considered *mutually exclusive* if they cannot occur at the same time. For example, consider a single coin flip: the events "tails" and "heads" are mutually exclusive because we cannot get both tails and heads on a single flip. 

We can visualize two mutually exclusive events as a pair of non-overlapping circles. They do not overlap because there is no outcome that belongs to both events:

![A Venn diagram that shows two non-overlapping circles, one that represents event A and one that represents event B. Nothing is shaded in the Venn diagram.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/mutually-exclusive-venndiagram.svg)

What about events that are not mutually exclusive? If event *A* is rolling an odd number and event *B* is rolling a number greater than two, these events are not mutually exclusive. They have an intersection of *{3, 5}*. Any events that have a non-empty intersection are not mutually exclusive.

Mutually Exclusive Events

Now, it's time to apply these concepts to calculate probabilities. 

Let's go back to one of our first examples: event *A* is rolling an odd number on a six-sided die and event *B* is rolling a number greater than two. What if we want to find the probability of one or both events occurring? This is the probability of the union of *A* and *B*:
```tex
P(A \text{ or } B)
```

We can visualize this calculation as follows: 

![This gif shows three sequential images of a Venn diagram that outline the formula for P(A or B). In the Venn Diagram, there are two overlapping circles: one that corresponds to event A and one that corresponds to event B. In the first image, the event A circle is shaded blue and P(A) is added to the formula. In the second image, the event B circle is shaded red and the formula now shows P(A) + P(B). In the final image, the overlap of event A and event B is shaded green and the formula now shows P(A) + P(B) - P(A and B).](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/addition-rule-independent-venndiagram.gif)

This animation gives a visual representation of the addition rule formula, which is:
```tex
P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)
```
We subtract the intersection of events *A* and *B* because it is included twice in the addition of *P(A)* and *P(B)*. 

What if the events are mutually exclusive? On a single die roll, if event *A* is that the roll is less than or equal to 2 and event *B* is that the roll is greater than or equal to 5, then events *A* and *B* cannot both happen.

![This gif shows two sequential images of a Venn diagram that outline the formula for P(A or B) for mutually exclusive events. In the Venn Diagram, there are two non-overlapping circles: one that corresponds to event A and one that corresponds to event B. In the first image, the event A circle is shaded blue and P(A) is added to the formula. In the second image, the event B circle is shaded red and the final formula now shows P(A) + P(B).](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/addition-rule-dependent-venndiagram.gif)

For mutually exclusive events, the addition rule formula is:
```tex
P(A \text{ or } B) = P(A) + P(B)
```

This is because the intersection is empty, so we don't need to remove any overlap between the two events.
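
Both versions of the addition rule can be checked on a single die roll by counting outcomes. The sketch below assumes a fair die, so every outcome is equally likely; the `prob` helper and the names `C` and `D` for the mutually exclusive pair are made up for illustration.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event on one fair die roll, found by counting."""
    return Fraction(len(event), len(sample_space))

# Overlapping events: odd roll, and roll greater than two
A = {1, 3, 5}
B = {3, 4, 5, 6}
p_union_direct = prob(A | B)
p_union_rule = prob(A) + prob(B) - prob(A & B)

# Mutually exclusive events: roll <= 2, and roll >= 5
C = {1, 2}
D = {5, 6}
p_exclusive_direct = prob(C | D)
p_exclusive_rule = prob(C) + prob(D)

print(p_union_direct, p_union_rule)
print(p_exclusive_direct, p_exclusive_rule)
```

Using `Fraction` keeps the arithmetic exact, so the direct count and the formula can be compared without rounding error.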

Addition Rule

If we want to calculate the probability that a pair of dependent events both occur, we need to define conditional probability. Using a bag of marbles as an example, let's remind ourselves of the definition of dependent events:

![A diagram of the possible outcomes of pulling two marbles out of a bag when pulling them out without replacement.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/marble-diagram-1.svg)

If we pick two marbles from a bag of five marbles without replacement, the probability that the second marble is red depends on the color of the first marble. We have a special name for this: *conditional probability*. In short, conditional probability measures the probability of one event occurring, given that another one has already occurred.

Notationally, we denote the word "given" with a vertical line. For example, if we want to represent the probability that we choose a red marble given the first marble is blue, we can write:
```tex 
P(\text{Red Second} \mid \text{Blue First})
```

From the above diagram, we know that:
```tex 
P(\text{Red Second} \mid \text{Blue First}) = \frac{3}{4}
```


What if we picked out two marbles with replacement? What does the conditional probability look like? Well, let's think about this. Regardless of which marble we pick out first, it will be put back into the bag. Therefore, the probability of picking out a red marble or a blue marble second is unaffected by the first outcome. 


Therefore, for independent events, we can say the following:

```tex 
\begin{aligned}
P(A \mid B) = P(A) \\
\text{and} \\
P(B \mid A) = P(B)
\end{aligned}
```
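
To make the contrast concrete, the sketch below lists every ordered pair of draws from the bag and counts outcomes, once without replacement and once with replacement. The marble labels and helper functions are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction
from itertools import permutations, product

marbles = ["B1", "B2", "R1", "R2", "R3"]  # two blue, three red

def p_red_second(draws):
    """P(Red Second), ignoring what came first."""
    red_second = [d for d in draws if d[1].startswith("R")]
    return Fraction(len(red_second), len(draws))

def p_red_second_given_blue_first(draws):
    """P(Red Second | Blue First), counted among blue-first draws only."""
    blue_first = [d for d in draws if d[0].startswith("B")]
    red_second = [d for d in blue_first if d[1].startswith("R")]
    return Fraction(len(red_second), len(blue_first))

without_replacement = list(permutations(marbles, 2))
with_replacement = list(product(marbles, repeat=2))

for label, draws in [("without", without_replacement),
                     ("with", with_replacement)]:
    print(f"{label} replacement: "
          f"P(Red 2nd) = {p_red_second(draws)}, "
          f"P(Red 2nd | Blue 1st) = {p_red_second_given_blue_first(draws)}")
```

Without replacement, the conditional probability differs from the unconditional one, so the events are dependent. With replacement, the two match, which is exactly the independence condition written above.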


Conditional Probability

We have looked at the addition rule, which describes the probability that one event OR another event (or both) occurs. What if we want to calculate the probability that two events happen simultaneously? For two events, *A* and *B*, this is *P(A and B)* or the probability of the intersection of *A* and *B*.

##### General Formula

The general formula for the probability that two events occur simultaneously is:
```tex
P(A \text{ and } B) = P(A) \cdot P(B \mid A)
```

However, for independent events, we can simplify this formula slightly.

##### Dependent Events

Let's go back to our bag of marbles example. We have five marbles: two are blue, and three are red. We pick two marbles without replacement. What if we want to know the probability of choosing a blue marble first AND a blue marble second?

Taking conditional probability into account, the multiplication rule for these two dependent events is:
```tex
\begin{aligned}
P(\text{Blue 1st and Blue 2nd}) = P(\text{Blue 1st}) \cdot P(\text{Blue 2nd} \mid \text{Blue 1st}) \\
P(\text{Blue 1st and Blue 2nd}) = \frac{2}{5} \cdot \frac{1}{4} \\
P(\text{Blue 1st and Blue 2nd}) = \frac{1}{10}
\end{aligned}
```
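
As a sanity check on this result, we can count outcomes directly instead of using the formula. The sketch below enumerates every ordered pair of draws without replacement; the marble labels are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

marbles = ["B1", "B2", "R1", "R2", "R3"]  # labels are illustrative
draws = list(permutations(marbles, 2))    # ordered pairs, no replacement

# Count the draws where both marbles are blue
both_blue = [d for d in draws if d[0][0] == "B" and d[1][0] == "B"]
p_by_counting = Fraction(len(both_blue), len(draws))

# Multiplication rule: P(Blue 1st) * P(Blue 2nd | Blue 1st)
p_by_rule = Fraction(2, 5) * Fraction(1, 4)

print(p_by_counting, p_by_rule)
```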

This is one potential outcome when picking two marbles out of the bag. One way to visualize all possible outcomes of a pair of events is a *tree diagram*.

Tree diagrams have the following properties:
* Each branch represents a specific set of events.
* The probabilities of the terminal branches (all possible sets of outcomes) sum to one.
* We multiply across branches (using the multiplication rule!) to calculate the probability that each branch (set of outcomes) will occur.

In the browser to the right, you will be able to play with one!

##### Independent Events

For two independent events, the multiplication rule becomes less complicated. The probability of two independent events occurring is:
 
```tex
P(A \text{ and } B) = P(A) \cdot P(B)
```
This is because the following is true for independent events:

```tex
P(B \mid A) = P(B)
```
Let's look at the simplest example: flipping a fair coin twice. Event *A* is that we get tails on the first flip, and event *B* is that we get tails on the second flip. *P(A) = P(B) = 0.5*, so according to our formula, the probability of getting tails on both flips would be:
```tex 
P(A \text{ and } B) = 0.5 \cdot 0.5 = 0.25
```

Visually on a tree diagram, we see:
![This visual shows a tree diagram that outlines all possible outcomes of flipping a fair coin two times. ](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/coin-tree-diagram.svg)

Open the diagram in a new window [here](https://static-assets.codecademy.com/skillpaths/master-stats-ii/intro-probability/coin-tree-diagram.svg) if you would like to zoom in for a better view.

Multiplication Rule

We have introduced conditional probability as a part of the multiplication rule for dependent events. However, let's go a bit more in-depth with it as it is a powerful probability tool that has real-world applications.

For this problem, we will follow along the tree diagram on the right.

Suppose that the following is true (this is shown in the first set of branches in the diagram):
* 20 percent of the population has strep throat.
* 80 percent of the population does not have strep throat.

Now suppose that we test a bunch of people for strep throat. The possible results of these tests are shown in the next set of branches:

* If a person has strep throat, there is an 85% chance their test will be positive and a 15% chance it will be negative. This is labeled as:
```tex
\begin{aligned}
P(+ \mid ST) = 0.85 \\
\text{and} \\
P(- \mid ST) = 0.15 \\
\end{aligned}
```
* If a person does not have strep throat, there is a 98% chance their test will be negative and a 2% chance it will be positive. This can be labeled as:
```tex
\begin{aligned}
P(- \mid \text{NO ST}) = 0.98 \\
\text{and} \\
P(+ \mid \text{NO ST}) = 0.02 \\
\end{aligned}
```

Finally, let's look at the four possible pairs of outcomes that form the terminal branches of our diagram:
```tex
\begin{aligned}
P(\text{ST and +}) = 0.17 \\
P(\text{ST and -}) = 0.03 \\
P(\text{NO ST and +}) = 0.016 \\
P(\text{NO ST and -}) = 0.784 \\
\end{aligned}
```

Together, these add up to one since they capture all potential outcomes after patients are tested. 
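The four terminal-branch probabilities come from multiplying along each path of the tree. A minimal sketch of that arithmetic, using only the numbers given above (the variable names are hypothetical):

```python
# Probabilities from the first set of branches
p_st = 0.20
p_no_st = 0.80

# Conditional probabilities from the second set of branches
p_pos_given_st = 0.85
p_neg_given_st = 0.15
p_pos_given_no_st = 0.02
p_neg_given_no_st = 0.98

# Multiplication rule: P(A and B) = P(A) * P(B | A), applied along each path
branches = {
    "ST and +": p_st * p_pos_given_st,
    "ST and -": p_st * p_neg_given_st,
    "NO ST and +": p_no_st * p_pos_given_no_st,
    "NO ST and -": p_no_st * p_neg_given_no_st,
}

for label, prob in branches.items():
    print(label, round(prob, 3))

# The four paths cover every possible outcome, so they sum to 1
total = sum(branches.values())
print(round(total, 3))
```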

It's great that we have all this information. However, we are missing something. If someone gets a positive result, what is the probability that they have strep throat? Notationally, we can write this probability as:

```tex
P(ST \mid +)
```

In the next exercise, we'll explore how we can use our tree diagram to calculate this probability.

This diagram outlines the possible outcomes for the events mentioned in the narrative.

With the narrative, you should have all the information necessary to do the next exercise's calculations!

Conditional Probability Continued

Imagine that you are a patient who has recently tested positive for strep throat. You may want to know the probability that you HAVE strep throat, given that you tested positive:

```tex
P(ST \mid +)
``` 

To calculate this probability, we will use something called *Bayes' theorem*, which states the following:

```tex
P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}
```


Using Bayes' theorem:
 
```tex
P(ST \mid +) = \frac{P(+ \mid ST) \cdot P(ST)}{P(+)}
``` 
We know:
```tex
P(+ \mid ST) = 0.85
```
We also know:
```tex
P(ST) = 0.20
```

What about *P(+)*? Is this something we know? Well, let's think about this. There are four possible outcomes:
* Having strep throat and testing positive
* Having strep throat and testing negative
* Not having strep throat and testing positive
* Not having strep throat and testing negative

To find *P(+)*, we only care about the two outcomes where a patient tests positive. Therefore, we can say:
```tex
\begin{aligned}
P(+) = P(\text{ST and +}) + P(\text{NO ST and +}) \\
P(+) = 0.17 + 0.016 \\
P(+) = 0.186 \\
\end{aligned}
```

Finally, if we plug all of these into the Bayes' theorem formula, we get:

```tex
P(ST \mid +) = \frac{0.85 \cdot 0.20}{0.186} = 0.914
```
There is a 91.4% chance that you actually have strep throat given that you tested positive. This is not obvious from the information outlined in our tree diagram, but with the power of Bayes' theorem, we were able to calculate it!
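The calculation above can be reproduced in a few lines of Python. This is a sketch using the numbers from the strep throat example; the variable names are hypothetical.

```python
# Known quantities from the tree diagram
p_st = 0.20               # P(ST)
p_pos_given_st = 0.85     # P(+ | ST)
p_pos_given_no_st = 0.02  # P(+ | NO ST)

# P(+) = P(ST and +) + P(NO ST and +)
p_pos = p_pos_given_st * p_st + p_pos_given_no_st * (1 - p_st)

# Bayes' theorem: P(ST | +) = P(+ | ST) * P(ST) / P(+)
p_st_given_pos = p_pos_given_st * p_st / p_pos

print(round(p_pos, 3))           # 0.186
print(round(p_st_given_pos, 3))  # 0.914
```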

Let's practice some more of this in the questions below.

This diagram outlines the possible outcomes for the events mentioned in the narrative.

With the narrative, you should have all the information necessary to do the exercise's calculations!

Bayes' Theorem

Congratulations, we have finished our exploration into the rules of probability! To recap, we have covered:

* The union and intersection of two events
* The complement of an event
* Independent and dependent events
* Mutually exclusive events
* How to calculate the union of two events using the addition rule
* What conditional probability is and how to calculate it
* How to calculate simultaneous events using the multiplication rule
* How to use tree diagrams to map out possible outcomes
* What Bayes' theorem is and how to calculate it
* Application of conditional probability and Bayes' theorem

Marvel at all that we have covered! For more practice, feel free to use the applet to the right and practice questions below. 

Learn how to describe random variables using probability distributions.

A *random variable* is, in its simplest form, a function. In probability, we often use random variables to represent random events. For example, we could use a random variable to represent the outcome of a die roll: any number between one and six. 

Random variables must be numeric, meaning they always take on a number rather than a characteristic or quality. If we want to use a random variable to represent an event with non-numeric outcomes, we can choose numbers to represent those outcomes. For example, we could represent a coin flip as a random variable by assigning "heads" a value of 1 and "tails" a value of 0.

In this lesson, we will use `random.choice(a, size = size, replace = True/False)` from the `numpy` library to simulate random variables in Python. In this method:

- `a` is a list or other object that has values we are sampling from
- `size` is a number that represents how many values to choose
- `replace` can be equal to `True` or `False`, and determines whether we keep a value in `a` after drawing it (`replace = True`) or remove it from the pool (`replace = False`). 


The following code simulates the outcome of rolling a fair die twice using `np.random.choice()`:
```python
import numpy as np

# 7 is not included in the range function
die_6 = range(1, 7)

rolls = np.random.choice(die_6, size = 2, replace = True)

print(rolls)
```
Output:
```
# [2 5]  (the values will vary from run to run)
```
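To see what the `replace` argument changes, the sketch below draws six values from the same die with and without replacement. Drawing without replacement is not how a real die behaves (it is closer to dealing cards), so this is only an illustration of the parameter.

```python
import numpy as np

die_6 = list(range(1, 7))

# With replacement: the same face can appear more than once
with_replacement = np.random.choice(die_6, size=6, replace=True)

# Without replacement: each value can be drawn at most once,
# so six draws return every face exactly once in a random order
without_replacement = np.random.choice(die_6, size=6, replace=False)

print(with_replacement)
print(without_replacement)
```

Note that with `replace = False`, `size` cannot be larger than the number of values in `a`.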

Random Variables

##### Discrete Random Variables

Random variables with a countable number of possible values are called _discrete random variables_. For example, rolling a regular 6-sided die would be considered a discrete random variable because the outcome options are limited to the numbers on the die. 

Discrete random variables are also common when observing counting events, such as how many people entered a store on a randomly selected day. In this case, the values are countable in that they are limited to whole numbers (you can't observe half of a person).


##### Continuous Random Variables

When the possible values of a random variable are uncountable, it is called a _continuous random variable_. These are generally measurement variables and are uncountable because measurements can always be more precise -- meters, centimeters, millimeters, etc. 

For example, the temperature in Los Angeles on a randomly chosen day is a continuous random variable. We can always be more precise about the temperature by expanding to another decimal place (96 degrees, 96.44 degrees, 96.437 degrees, etc.).

Discrete and Continuous Random Variables

A *probability mass function (PMF)* is a type of *probability distribution* that defines the probability of observing a particular value of a discrete random variable. For example, a PMF can be used to calculate the probability of rolling a three on a fair six-sided die.

There are certain kinds of random variables (and associated probability distributions) that are relevant for many different kinds of problems. These commonly used probability distributions have names and *parameters* that make them adaptable for different situations.

For example, suppose that we flip a fair coin some number of times and count the number of heads. The probability mass function that describes the likelihood of each possible outcome (e.g., 0 heads, 1 head, 2 heads, etc.) is called the _binomial distribution_. The parameters for the binomial distribution are:

- `n` for the number of trials (e.g., n = 10 if we flip a coin 10 times)
- `p` for the probability of success in each trial, meaning the probability of observing a particular outcome (e.g., p = 0.5 because the probability of observing heads on a fair coin flip is 0.5)

If we flip a fair coin 10 times, we say that the number of observed heads follows a `Binomial(n=10, p=0.5)` distribution. The graph below shows the probability mass function for this experiment. The heights of the bars represent the probability of observing each possible outcome as calculated by the PMF.

![A histogram with markers 0 to 10 along the x-axis and the heights of the bars at each marker represent the probability of observing the value of the marker from this distribution](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/binom_pmf_10_5.svg)
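The bar heights in a plot like this can be reproduced from the standard binomial formula, which multiplies the number of ways to choose `k` successes out of `n` trials by the probability of any one such sequence. The helper name `binomial_pmf` below is hypothetical and is written out only to show where the numbers come from.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    # comb(n, k) counts the sequences with k successes;
    # p**k * (1 - p)**(n - k) is the probability of each such sequence
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5
pmf_values = [binomial_pmf(k, n, p) for k in range(n + 1)]

# One line per bar in the plot: number of heads and its probability
for k, prob in enumerate(pmf_values):
    print(k, round(prob, 4))
```

The eleven probabilities sum to 1, and the largest value sits at 5 heads, which matches the peak of the plot.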


Probability Mass Functions

The `binom.pmf()` method from the `scipy.stats` library can be used to calculate the PMF of the binomial distribution at any value. This method takes 3 values:

- `x`: the value of interest
- `n`: the number of trials
- `p`: the probability of success

For example, suppose we flip a fair coin 10 times and count the number of heads. We can use the `binom.pmf()` function to calculate the probability of observing 6 heads as follows:

```python
# import necessary library
import scipy.stats as stats

# stats.binom.pmf(x, n, p)
print(stats.binom.pmf(6, 10, 0.5))
```
Output:
```
# 0.205078
```

Notice that two of the three values that go into the `stats.binom.pmf()` method are the parameters that define the binomial distribution: `n` represents the number of trials and `p` represents the probability of success.

Calculating Probabilities using Python

We have seen that we can calculate the probability of observing a specific value using a probability mass function. What if we want to find the probability of observing a range of values for a discrete random variable? One way we could do this is by adding up the probability of each value.

For example, let's say we flip a fair coin 5 times, and want to know the probability of getting between 1 and 3 heads. We can visualize this scenario with the probability mass function:

![GIF of highlighting selected bars on a histogram](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Binomial-Distribution-PMF-Probability-over-a-Range.gif)


We can calculate this using the following equation where *P(x)* is the probability of observing the number *x* successes (heads in this case):

```tex
\begin{aligned}
P(1\;to\;3\;heads) = P(1 \leq X \leq 3) \\
P(1\;to\;3\;heads) = P(X=1) + P(X=2) + P(X=3) \\
P(1\;to\;3\;heads) = 0.1562 + 0.3125 + 0.3125 \\
P(1\;to\;3\;heads) = 0.7812 
\end{aligned}
```
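The same sum can be checked with `binom.pmf()` from `scipy.stats`. This is a small sketch for the 5-flip example above; the variable names are hypothetical.

```python
import scipy.stats as stats

n, p = 5, 0.5

# P(X=1), P(X=2), and P(X=3) for 5 fair coin flips
probs = [stats.binom.pmf(k, n, p) for k in (1, 2, 3)]

# Adding them gives P(1 <= X <= 3), approximately 0.78125
p_1_to_3 = sum(probs)
print(p_1_to_3)
```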

Using the Probability Mass Function Over a Range

We can use the same `binom.pmf()` method from the `scipy.stats` library to calculate the probability of observing a range of values. As mentioned in a previous exercise, the `binom.pmf` method takes 3 values:
- `x`: the value of interest
- `n`: the number of trials
- `p`: the probability of success

For example, we can calculate the probability of observing between 2 and 4 heads from 10 coin flips as follows:

```python
import scipy.stats as stats

# calculating P(2-4 heads) = P(2 heads) + P(3 heads) + P(4 heads) for flipping a coin 10 times
print(stats.binom.pmf(2, n=10, p=.5) + stats.binom.pmf(3, n=10, p=.5) + stats.binom.pmf(4, n=10, p=.5))
```
Output:
```
# 0.366211
```

We can also calculate the probability of observing less than a certain value, let's say 3 heads, by adding up the probabilities of the values below it:

```python
import scipy.stats as stats

# calculating P(less than 3 heads) = P(0 heads) + P(1 head) + P(2 heads) for flipping a coin 10 times
print(stats.binom.pmf(0, n=10, p=.5) + stats.binom.pmf(1, n=10, p=.5) + stats.binom.pmf(2, n=10, p=.5))
```
Output:
```
# 0.0546875
```

Note that because our desired range is less than 3 heads, we do not include that value in the summation.

When there are many possible values of interest, this task of adding up probabilities can be difficult. If we want to know the probability of observing 8 or fewer heads from 10 coin flips, we need to add up the values from 0 to 8:

```python
import scipy.stats as stats

stats.binom.pmf(0, n = 10, p = 0.5) + stats.binom.pmf(1, n = 10, p = 0.5) + stats.binom.pmf(2, n = 10, p = 0.5) + stats.binom.pmf(3, n = 10, p = 0.5) + stats.binom.pmf(4, n = 10, p = 0.5) + stats.binom.pmf(5, n = 10, p = 0.5) + stats.binom.pmf(6, n = 10, p = 0.5) + stats.binom.pmf(7, n = 10, p = 0.5) + stats.binom.pmf(8, n = 10, p = 0.5)
```
Output:
```
# 0.98926
```

This involves a lot of repetitive code. Instead, we can also use the fact that the sum of the probabilities for all possible values is equal to 1:

```tex
\begin{aligned}
P(0\;to\;8\;heads) + P(9\;to\;10\;heads) = P(0\;to\;10\;heads) = 1 \\
P(0\;to\;8\;heads) = 1 - P(9\;to\;10\;heads)
\end{aligned}
```

Now instead of summing up 9 values for the probabilities between 0 and 8 heads, we can do 1 minus the sum of two values and get the same result:

```python
import scipy.stats as stats
# less than or equal to 8
1 - (stats.binom.pmf(9, n=10, p=.5) + stats.binom.pmf(10, n=10, p=.5))
```
Output:
```
# 0.98926
```
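Another way to cut down on repetition is to pass an array of values to `binom.pmf()`, which evaluates the PMF at each one, and then sum the result. The sketch below assumes `numpy` is available alongside `scipy`; the variable names are hypothetical.

```python
import numpy as np
import scipy.stats as stats

# All values of interest: 0, 1, ..., 8 heads
heads = np.arange(0, 9)

# binom.pmf evaluates the PMF at every value in the array
p_8_or_fewer = stats.binom.pmf(heads, n=10, p=0.5).sum()

print(p_8_or_fewer)  # approximately 0.98926
```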

Probability Mass Function Over a Range using Python

The _cumulative distribution function_ for a discrete random variable can be derived from the probability mass function. However, instead of the probability of observing a specific value, the cumulative distribution function gives the probability of observing a specific value OR LESS. 

As previously discussed, the probabilities for all possible values in a given probability distribution add up to 1. The value of a cumulative distribution function at a given value is equal to the sum of the probabilities of all values less than or equal to it, reaching 1 at the largest possible value.

Cumulative distribution functions are constantly increasing, so for two different numbers that the random variable could take on, the value of the function will always be greater for the larger number. Mathematically, this is represented as:

```tex
\text{If}\; x_1 < x_2, \;\text{then}\; CDF(x_1) < CDF(x_2)
```

We showed how the probability mass function can be used to calculate the probability of observing less than 3 heads out of 10 coin flips by adding up the probabilities of observing 0, 1, and 2 heads. The cumulative distribution function produces the same answer by evaluating the function at `CDF(X=2)`. In this case, using the CDF is simpler than the PMF because it requires one calculation rather than three.

The animation to the right shows the relationship between the probability mass function and the cumulative distribution function. The top plot is the PMF, while the bottom plot is the corresponding CDF. When looking at the graph of a CDF, each y-axis value is the sum of the probabilities less than or equal to it in the PMF.
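The relationship between the two functions can also be checked numerically: a running total of the PMF values gives the CDF. The sketch below uses `np.cumsum()` for the running total and compares it with `binom.cdf()`, the `scipy.stats` method for the binomial CDF. The variable names are hypothetical.

```python
import numpy as np
import scipy.stats as stats

# Every possible number of heads in 10 fair coin flips
values = np.arange(0, 11)

pmf = stats.binom.pmf(values, 10, 0.5)

# Running total of the PMF: entry k is P(X <= k)
cdf_from_pmf = np.cumsum(pmf)

# The CDF computed directly by scipy
cdf_direct = stats.binom.cdf(values, 10, 0.5)

print(np.allclose(cdf_from_pmf, cdf_direct))
print(cdf_from_pmf[2])   # P(X <= 2), approximately 0.0546875
print(cdf_from_pmf[-1])  # approximately 1 at the largest possible value
```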

Cumulative Distribution Function

We can use a cumulative distribution function to calculate the probability of a specific range by taking the difference between two values from the cumulative distribution function. For example, to find the probability of observing between 3 and 6 heads, we can take the probability of observing 6 or fewer heads and subtract the probability of observing 2 or fewer heads. What remains is the probability of observing between 3 and 6 heads.

The visual to the right demonstrates how this works. It is important to note that to include the lower bound in the range, the value being subtracted should be one less than the lower bound. In this example, we wanted to know the probability from 3 to 6, which includes 3. Mathematically, this looks like the following equation:
```tex
\begin{aligned}
P(3 \leq X \leq 6) = P(X \leq 6) - P(X < 3)\\
\text{or} \\
P(3 \leq X \leq 6) = P(X \leq 6) - P(X \leq 2)
\end{aligned}
```

Cumulative Distribution Function continued

We can use the `binom.cdf()` method from the `scipy.stats` library to calculate the cumulative distribution function. This method takes 3 values:
- `x`: the value of interest, looking for the probability of this value or less
- `n`: the number of trials
- `p`: the probability of success

Calculating the probability of observing 6 or fewer heads from 10 fair coin flips (0 to 6 heads) mathematically looks like the following:

```tex
P(6\; or\; fewer\; heads) = P(0\; to\; 6\; heads)
```

In Python, we use the following code:

```python
import scipy.stats as stats

print(stats.binom.cdf(6, 10, 0.5))
```
Output:
```
0.828125
```

Calculating the probability of observing between 4 and 8 heads from 10 fair coin flips can be thought of as subtracting the value of the cumulative distribution function at 3 from its value at 8:

```tex
P(4\;to\;8\;Heads) = P(0\;to\;8\;Heads) - P(0\;to\;3\;Heads)
```
In Python, we use the following code:

```python
import scipy.stats as stats

print(stats.binom.cdf(8, 10, 0.5) - stats.binom.cdf(3, 10, 0.5))
```
Output:
```
# 0.81738
```

To calculate the probability of observing more than 6 heads from 10 fair coin flips we subtract the value of the cumulative distribution function at 6 from 1. Mathematically, this looks like the following:

```tex
P(more\; than\; 6\; heads) = 1 - P(6\; or\; fewer\; heads)
```

Note that "more than 6 heads" does not include 6. In Python, we would calculate this probability using the following code:

```python
import scipy.stats as stats
print(1 - stats.binom.cdf(6, 10, 0.5))
```
Output:
```
# 0.171875
```
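As a side note, `scipy.stats` distributions also provide a survival function, `sf()`, which returns 1 minus the CDF directly. The sketch below compares the two approaches for the same question; the variable names are hypothetical.

```python
import scipy.stats as stats

# P(more than 6 heads) using 1 - CDF
p_more_than_6 = 1 - stats.binom.cdf(6, 10, 0.5)

# The same probability using the survival function, sf(x) = 1 - cdf(x)
p_more_than_6_sf = stats.binom.sf(6, 10, 0.5)

print(p_more_than_6)     # approximately 0.171875
print(p_more_than_6_sf)  # approximately 0.171875
```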

Using the Cumulative Distribution Function in Python

Similar to how discrete random variables relate to probability mass functions, continuous random variables relate to probability density functions. A probability density function defines the probability distribution of a continuous random variable and spans all possible values that the random variable can take on.

When graphed, a probability density function is a curve across all possible values the random variable can take on, and the total area under this curve adds up to 1.

The following image shows a probability density function. The highlighted area represents the probability of observing a value within the highlighted range.

![GIF or visual of the area under the curve highlighted and showing the calculated area under the curve](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Adding-Area.gif)

In a probability density function, we cannot calculate the probability at a single point. This is because the area of the curve underneath a single point is always zero. The gif below showcases this.

![GIF or visual of the highlighted area under the curve getting smaller and smaller until the area equals 0](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Normal-Distribution-Area-to-Zero.gif)

As we can see from the visual above, as the interval becomes smaller, the width of the area under the curve becomes smaller as well. When trying to evaluate the area under the curve at a specific point, the width of that area becomes 0, and therefore the probability equals 0.

We can calculate the area under the curve using the cumulative distribution function for the given probability distribution.

For example, heights fall under a type of probability distribution called a _normal distribution_. The parameters for the normal distribution are the mean and the standard deviation, and we use the form _Normal(mean, standard deviation)_ as shorthand. 

We know that women's heights have a mean of 167.64 cm with a standard deviation of 8 cm, which makes them fall under the Normal(167.64, 8) distribution. 

Let's say we want to know the probability that a randomly chosen woman is less than 158 cm tall. We can use the cumulative distribution function to calculate the area under the probability density function curve from 0 to 158 to find that probability.

![Image to show the area under the curve highlighted from 0 to 158 cm](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/norm_pdf_167_8_filled.svg)

We can calculate the area of the blue region in Python using the `norm.cdf()` method from the `scipy.stats` library. This method takes on 3 values:
- `x`: the value of interest
- `loc`: the mean of the probability distribution
- `scale`: the standard deviation of the probability distribution


```python
import scipy.stats as stats

# stats.norm.cdf(x, loc, scale)
print(stats.norm.cdf(158, 167.64, 8))
```
Output:
```
# 0.1141
```
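To connect this result with the earlier point that the probability at a single value is zero, the sketch below writes the normal CDF by hand using the error function from Python's `math` module, then measures the probability of smaller and smaller intervals around 158 cm. The helper name `normal_cdf` and the interval widths are assumptions chosen for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """P(X <= x) for a Normal(mean, sd) random variable."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

mean, sd = 167.64, 8

# Same quantity as stats.norm.cdf(158, 167.64, 8)
p_below_158 = normal_cdf(158, mean, sd)
print(round(p_below_158, 4))  # 0.1141

# Probability of a shrinking interval centered on 158 cm
for half_width in (1, 0.1, 0.01, 0.001):
    p_interval = (normal_cdf(158 + half_width, mean, sd)
                  - normal_cdf(158 - half_width, mean, sd))
    print(half_width, p_interval)
```

Each time the interval narrows by a factor of 10, its probability shrinks by roughly the same factor, heading toward 0 as the interval collapses to a single point.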

Probability Density Functions

We can take the difference between two overlapping ranges to calculate the probability that a random selection will be within a range of values for continuous distributions. This is essentially the same process as calculating the probability of a range of values for discrete distributions.

![Gif with two overlapping densities, subtract one out to find the difference, and therefore the probability in that range](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/Normal-PDF-Range.gif)

Let's say we wanted to calculate the probability of randomly observing a woman between 165 cm and 175 cm tall, assuming heights still follow the Normal(167.64, 8) distribution. We can calculate the probability of observing each of these values or less. The difference between these two probabilities will be the probability of randomly observing a woman in this range. This can be done in Python using the `norm.cdf()` method from the `scipy.stats` library. As mentioned before, this method takes on 3 values:
- `x`: the value of interest
- `loc`: the mean of the probability distribution
- `scale`: the standard deviation of the probability distribution

```python
import scipy.stats as stats
# P(165 < X < 175) = P(X < 175) - P(X < 165)
# stats.norm.cdf(x, loc, scale) - stats.norm.cdf(x, loc, scale)
print(stats.norm.cdf(175, 167.64, 8) - stats.norm.cdf(165, 167.64, 8))
```
Output:
```
# 0.4505
```

We can also calculate the probability of randomly observing a value or greater by subtracting the probability of observing less than the given value from 1. This is possible because we know that the total area under the curve is 1, so the probability of observing something greater than a value is 1 minus the probability of observing something less than the given value.

Let's say we wanted to calculate the probability of observing a woman taller than 172 centimeters, assuming heights still follow the Normal(167.64, 8) distribution. We can think of this as the opposite of observing a woman shorter than 172 centimeters. We can visualize it this way:

![Image showing how P(X > 172) = 1 - P(X < 172)](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/norm_pdf_167_8_filled2.svg)

We can use the following code to calculate the blue area by taking 1 minus the red area:

```python
import scipy.stats as stats

# P(X > 172) = 1 - P(X < 172)
# 1 - stats.norm.cdf(x, loc, scale)
print(1 - stats.norm.cdf(172, 167.64, 8))
```
Output:
```
# 0.2929
```

Probability Density Functions and Cumulative Distribution Function

Congrats! We have finished our introduction to probability distributions! To review, we have: 
* Learned about different types of random variables
* Calculated the probability of specific events using probability mass functions (discrete random variable)
* Calculated the probability of ranges using probability mass functions and cumulative distribution functions (discrete random variable)
* Calculated the probability of ranges using probability density functions and cumulative distribution functions (continuous random variable)

Introduction to Probability Distributions

Determine the number of defective products made at a factory on a given day. Apply concepts from the Poisson distribution, including random variables, the probability mass function, the cumulative distribution function, and expected values.

Create a variable called `lam` that represents the rate parameter of our distribution.

You know that the rate parameter of a Poisson distribution is equal to the expected value. So in our factory, the rate parameter would equal the expected number of defects on a given day. You are curious about how often we might observe the exact expected number of defects. 

Calculate and print the probability of observing exactly `lam` defects on a given day.

Our boss said that having 4 or fewer defects on a given day is an exceptionally good day. You are curious about how often that might happen. 

Calculate and print the probability of having one of these days.

On the other hand, our boss said that having more than 9 defects on any given day is considered a bad day. 

Calculate and print the probability of having one of these bad days.

You've familiarized yourself a little bit with how the Poisson distribution works in theory by calculating different probabilities. But let's look at what this might look like in practice.

Create a variable called `year_defects` that has 365 random values from the Poisson distribution.

Let's take a look at our new dataset. Print the first 20 values in this data set.

If we expect 7 defects on a given day, what is the total number of defects we would expect over 365 days?

Calculate and print this value to the output terminal.

Calculate and print the total sum of the data set `year_defects`. How does this compare to the total number of defects we expected over 365 days?

Calculate and print the average number of defects per day from our simulated dataset.

How does this compare to the expected average number of defects each day that we know from the given rate parameter of the Poisson distribution?

You're worried about what the highest number of defects in a single day might be because that would be a hectic day. 

Print the maximum value of `year_defects`.

Wow, it would probably be super busy if there were that many defects on a single day. Hopefully, it is a rare event! 

Calculate and print the probability of observing that maximum value or more from the Poisson(7) distribution.

Congratulations! At this point, you have now explored the Poisson distribution and even worked with some simulated data. We have a couple of extra tasks if you would like an extra challenge. Feel free to try them out or move onto the next topic!

Let's say we want to know how many defects in a given day would put us in the 90th percentile of the Poisson(7) distribution. One way we could calculate this is by using the following method:
```python
stats.poisson.ppf(percentile, lam)
```
`percentile` is equal to the desired percentile (a decimal between 0 and 1), and `lam` is the lambda parameter of the Poisson distribution. (`lambda` itself is a reserved keyword in Python, so it cannot be used as a variable name.) This function is essentially the inverse of the CDF.
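To illustrate how this inverse relationship works without giving away the factory answer, the sketch below uses a different, assumed rate of 10. The name `example_rate` is hypothetical and is not part of the project.

```python
import scipy.stats as stats

# Assumed rate, chosen only for illustration (not the factory's rate)
example_rate = 10

# Smallest count k whose cumulative probability reaches at least 0.90
k = stats.poisson.ppf(0.90, example_rate)
print(k)

# Checking against the CDF: P(X <= k) is at least 0.90,
# while P(X <= k - 1) is still below 0.90
print(stats.poisson.cdf(k, example_rate))
print(stats.poisson.cdf(k - 1, example_rate))
```

Because the Poisson distribution is discrete, the CDF at `k` usually lands somewhat above the requested percentile rather than exactly on it.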

Use this method to calculate and print the number of defects that would put us in the 90th percentile for a given day. In other words, on 90% of days, we will observe fewer defects than this number.

Now let's see what proportion of our simulated dataset `year_defects` is greater than or equal to the number we calculated in the previous step. 

By definition of a percentile, we would expect 1 - .90, or about 10% of days to be in this range.

To calculate this:
1. Count the number of values in the dataset that are greater than or equal to the 90th percentile value.
2. Divide this number by the length of the dataset.

Click the hint if you want to see an example calculation.

Detecting Product Defects with Probability

Random variables are the actual numbers you get when you run experiments.

Random variables are functions with numerical outcomes.

Even when a specific outcome is very likely, there is still the possibility that a random variable takes on a different value.

A random variable can be either discrete or continuous.

The probability mass function defines the probability that a random variable equals a specific value.

The probability mass function defines what value we will see in a random event.

The probability mass function defines the probability that a random variable equals a specific value or less.

The probability mass function only works for random variables that have a 50% chance of success, such as observing heads on a fair coin flip.

Cumulative distribution functions are only for discrete random variables.

The cumulative distribution function of a random variable defines the probability of observing a value less than or equal to a specific number.

Cumulative distribution functions calculate the probability of observing an exact number.

When graphed, cumulative distribution functions are always a smooth, continuous curve.

`n` is the number of trials, and `p` is the probability of success in each trial.

`n` is the number of trials, and `p` is the probability of success over all trials.

`n` is the number of successes we expect to see from all of the trials, and `p` is the probability of success in each trial.

`n` is the number of successes we expect to see from all of the trials, and `p` is the probability of success over all trials.

Assess your knowledge of probability distributions!

Probability Distributions

Let's explore more about probability distributions using the Poisson distribution as an example!


There are numerous probability distributions used to represent almost any random event. In the previous lesson, we learned about the binomial distribution to represent events like any number of coin flips as well as the normal distribution to represent events such as the height of a randomly selected woman.

The Poisson distribution is another common distribution, and it is used to describe the number of times a certain event occurs within a fixed time or space interval. For example, the Poisson distribution can be used to describe the number of cars that pass through a specific intersection between 4pm and 5pm on a given day. It can also be used to describe the number of calls received in an office between 1pm to 3pm on a certain day.

The Poisson distribution is defined by the rate parameter, symbolized by the Greek letter lambda, λ.

Lambda represents the expected value, or the average value, of the distribution. For example, if our expected number of customers between 1pm and 2pm is 7, then we would set the parameter for the Poisson distribution to be 7. The PMF for the Poisson(7) distribution is as follows:

![This plot shows the PMF of the Poisson(7) distribution. The plot is centered around 7 ](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/pois_7_pmf.svg)
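The values in this plot follow from the standard Poisson formula, which gives the probability of observing `k` events as λ^k · e^(−λ) / k!. The sketch below computes it by hand with Python's `math` module; the helper name `poisson_pmf` and the cutoff of 50 are assumptions made for illustration (the probabilities beyond 50 are negligibly small for λ = 7).

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when lam are expected."""
    return lam**k * exp(-lam) / factorial(k)

lam = 7
ks = range(0, 51)  # assumed cutoff; the tail beyond 50 is negligible
pmf_values = [poisson_pmf(k, lam) for k in ks]

# The probabilities sum to (almost exactly) 1
total = sum(pmf_values)

# The probability-weighted average of the outcomes recovers lambda
expected_value = sum(k * p for k, p in zip(ks, pmf_values))

print(round(total, 6))           # 1.0
print(round(expected_value, 6))  # 7.0
```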

Introduction to the Poisson Distribution

The Poisson distribution is a discrete probability distribution, so it can be described by a probability mass function and cumulative distribution function.

We can use the `poisson.pmf()` method in the `scipy.stats` library to evaluate the probability of observing a specific number given the parameter (expected value) of a distribution. For example, suppose that we expect it to rain 10 times in the next 30 days. The number of times it rains in the next 30 days is "Poisson distributed" with lambda = 10. We can calculate the probability of exactly 6 rain events as follows:

```python
import scipy.stats as stats
# expected value = 10, probability of observing 6
stats.poisson.pmf(6, 10)
```
Output:
```
0.06305545800345125
```

Like previous probability mass functions of discrete random variables, individual probabilities can be summed together to find the probability of observing a value in a range.

For example, if we expect it to rain 10 times in the next 30 days, the number of times it rains in the next 30 days is "Poisson distributed" with lambda = 10. We can calculate the probability of 12 to 14 rain events as follows:

```python
import scipy.stats as stats
# expected value = 10, probability of observing 12-14
stats.poisson.pmf(12, 10) + stats.poisson.pmf(13, 10) + stats.poisson.pmf(14, 10)
```
Output:
```
0.21976538076223123
```

Calculating Probabilities of Exact Values Using the Probability Mass Function

We can use the `poisson.cdf()` method in the `scipy.stats` library to evaluate the probability of observing a specific number or less given the expected value of a distribution. For example, if we wanted to calculate the probability of observing 6 or fewer rain events in the next 30 days when we expected 10, we could do the following: 

```python
import scipy.stats as stats
# expected value = 10, probability of observing 6 or less
stats.poisson.cdf(6, 10)
```
Output:
```
0.130141420882483
```

This means that there is roughly a 13% chance that there will be 6 or fewer rainfalls in the month in question.

We can also use this method to evaluate the probability of observing a specific number or more given the expected value of the distribution. For example, if we wanted to calculate the probability of observing 12 or more rain events in the next 30 days when we expected 10, we could do the following: 

```python
import scipy.stats as stats
# expected value = 10, probability of observing 12 or more
1 - stats.poisson.cdf(11, 10)
```
Output:
```
0.30322385369689386
```

This means that there is roughly a 30% chance that there will be 12 or more rain events in the month in question.

Note that we used 11 in the statement above even though 12 was the value given in the prompt. We wanted to calculate the probability of observing 12 or more rains, which includes 12. `stats.poisson.cdf(11, 10)` evaluates the probability of observing 11 or fewer rains, so `1 - stats.poisson.cdf(11, 10)` would equal the probability of observing 12 or more rains.
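
As an alternative sketch for "or more" questions, `scipy.stats` distributions also provide a survival function, `sf()`, which is defined as `1 - cdf`. Passing 11 again gives the probability of 12 or more rain events. The variable names here are only for illustration.

```python
import scipy.stats as stats

# expected value = 10, probability of observing 12 or more
# sf(11, 10) is P(X > 11), which for a count is the same as P(X >= 12)
p_sf = stats.poisson.sf(11, 10)
p_complement = 1 - stats.poisson.cdf(11, 10)
print(p_sf, p_complement)  # the two should agree up to rounding
```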

Summing individual probabilities over a wide range can be cumbersome. It is often easier to calculate the probability of a range using the cumulative distribution function (CDF) instead of the probability mass function. We can do this by taking the difference between the CDF of the larger endpoint and the CDF of one less than the smaller endpoint of the range. 

For example, while still expecting 10 rainfalls in the next 30 days, we could use the following code to calculate the probability of observing between 12 and 18 rainfall events:

```python
import scipy.stats as stats
# expected value = 10, probability of observing between 12 and 18
stats.poisson.cdf(18, 10) - stats.poisson.cdf(11, 10)
```
Output:
```
0.29603734909303947
```
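
To see why the endpoints are 18 and 11, the sketch below builds the CDF by hand as a running sum of PMF values, using only the standard library. The helper names `poisson_pmf` and `poisson_cdf` are made up for this illustration.

```python
import math

def poisson_pmf(k, lam):
    return lam**k * math.exp(-lam) / math.factorial(k)

def poisson_cdf(k, lam):
    """P(X <= k): sum of the PMF from 0 through k."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))

lam = 10
# CDF difference: everything up to 18, minus everything up to 11
via_cdf = poisson_cdf(18, lam) - poisson_cdf(11, lam)
# direct sum of the individual probabilities for 12 through 18 inclusive
via_pmf = sum(poisson_pmf(k, lam) for k in range(12, 19))
print(via_cdf, via_pmf)  # both are approximately 0.296
```

Subtracting the CDF at 11 removes the outcomes 0 through 11, which leaves exactly the outcomes 12 through 18.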

Calculating Probabilities of a Range using the Cumulative Distribution Function

Earlier, we mentioned that the parameter lambda (λ) is the _expected value_ (or average value) of the Poisson distribution. But what does this mean?

Let's put this into context: let's say we are salespeople, and after many weeks of work, we calculate our average to be 10 sales per week. If we take this value to be our expected value of a Poisson Distribution, the probability mass function will look as follows:

![This is a plot of the probability mass function for the Poisson distribution with the expected value equal to 10. The bar at 10 is colored red, and the rest of the bars are colored blue.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/pois_10_pmf_red.svg)

The tallest bar represents the value with the highest probability of occurring. In this case, the highlighted bar at 10 is the tallest (the bar at 9 is exactly as tall, because when λ is a whole number the values λ − 1 and λ are equally likely). This does not, however, mean that we will make exactly 10 sales in any given week. It means that, averaged over many weeks, we expect about 10 sales per week. 

Let's look at this another way. Let's take a sample of 1000 random values from the Poisson distribution with the expected value of 10. We can use the `poisson.rvs()` method in the `scipy.stats` library to generate random values:

```python
import scipy.stats as stats

# generate random variable
# stats.poisson.rvs(lambda, size = num_values)
rvs = stats.poisson.rvs(10, size = 1000)
```

The histogram of this sampling looks like the following:

![This plot is a histogram of 1000 random samples from the Poisson(10) distribution](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/pois_10_1000samp.svg)

We can see observations as low as 2 and as high as 20. The tallest bars are at 9 and 10. If we took the average of the 1000 random samples, we would get:

```python
print(rvs.mean())
```
Output:
```
10.009
```

This value is very close to 10, which is consistent with the expected value (or average) of the distribution being 10.

When we talk about the expected value, we mean the average over many observations. This relates to the Law of Large Numbers: the more observations we have, the more closely the sample tends to resemble the true population, and the closer the sample mean tends to get to the expected value. So even though the salesperson may make 3 sales one week, 16 the next, and 11 the week after, in the long run, after many weeks, the average would still be about 10.
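
The Law of Large Numbers can be illustrated with a short simulation. This sketch uses NumPy's random generator rather than `poisson.rvs()`; the seed and the sample sizes are arbitrary choices made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
draws = rng.poisson(lam=10, size=100_000)

# mean of the first n draws, for increasing n
running_means = {n: draws[:n].mean() for n in (10, 100, 1_000, 100_000)}
for n, m in running_means.items():
    print(n, m)
```

The means for small `n` can land noticeably away from 10, while the mean over all 100,000 draws should be very close to it.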

Expectation of the Poisson Distribution

Probability distributions also have calculable variances. Variances are a way of measuring the spread or dispersion of values and probabilities in the distribution. For the Poisson distribution, the variance is simply the value of lambda (λ), meaning that the expected value and variance are equal in Poisson distributions. 

We know that the Poisson distribution describes a discrete random variable that cannot be negative (a salesperson cannot make fewer than 0 sales, and a shop cannot have fewer than 0 customers). Because the values are bounded below at 0, as the expected value increases, the range of values we are likely to observe also widens.

The first plot below shows a Poisson distribution with lambda equal to three, and the second plot shows a Poisson distribution with lambda equal to fifteen. Notice that in the second plot, the spread of the distribution increases. Also note that the bars in the second plot are shorter, since the total probability of 1 is spread across more likely values. 

![This image shows a Poisson distribution with lambda equal to three.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/poisson_lambda_3.svg)

![This image shows a Poisson distribution with lambda equal to fifteen.](https://static-assets.codecademy.com/skillpaths/master-stats-ii/probability-distributions/poisson_lambda_15.svg)

As we can see, as the parameter lambda increases, the variance &mdash; or spread &mdash; of possible values increases as well.

We can calculate the variance of a sample using the `numpy.var()` method:

```python
import scipy.stats as stats
import numpy as np

rand_vars = stats.poisson.rvs(4, size = 1000)
print(np.var(rand_vars))
```
Output:
```
3.864559
```

Because this is calculated from a sample, it is possible that the variance might not equal EXACTLY lambda. However, we do expect it to be relatively close when the sample size is large, like in this example.

Another way to view the increase in possible values is to look at the range of a sample (the span between its minimum and maximum values). The following code draws 1000 random values from the Poisson distribution with lambda = 4 and then prints the minimum and maximum values observed using Python's built-in `min()` and `max()` functions:

```python
import scipy.stats as stats

rand_vars = stats.poisson.rvs(4, size = 1000)

print(min(rand_vars), max(rand_vars))
```
Output:
```
0 12
```

If we increase the value of lambda to 10, let's see how the minimum and maximum values change:
```python
import scipy.stats as stats

rand_vars = stats.poisson.rvs(10, size = 1000)

print(min(rand_vars), max(rand_vars))
```
Output:
```
1 22
```

These values are spread wider, indicating a larger variance.
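
As a rough check that the mean and the variance both track lambda, this sketch draws a large sample for each of a few lambda values and records both statistics. The lambda values, sample size, and seed are arbitrary, and NumPy's generator is used here in place of `poisson.rvs()`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

results = {}
for lam in (3, 10, 15):
    sample = rng.poisson(lam=lam, size=100_000)
    # store (sample mean, sample variance) for this lambda
    results[lam] = (sample.mean(), np.var(sample))
    print(lam, results[lam])
```

For each lambda, both printed numbers should land close to lambda itself.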

Spread of the Poisson Distribution

Other types of distributions have expected values and variances based on the given parameters, just like the Poisson distribution. Recall that the Binomial distribution has parameters `n`, representing the number of trials, and `p`, representing the probability of "success" (or the specific outcome we are looking for occurring) on each trial. 

Consider the following scenario: we flip a fair coin 10 times and count the number of heads we observe. How many heads would you expect to see? You might naturally think 5, and you would be right! What we are doing is calculating the expected value without even realizing it. We take the 10 coin flips and multiply it by the chance of getting heads, or one-half, getting the answer of 5 heads. And that is exactly the equation for the expected value of the binomial distribution:

```tex
Expected(\#\;of\;Heads) = E(X) = n \times p
```

Note that if we were counting the number of heads out of 5 fair coin flips, the expected value would be:

```tex
5 \times 0.5 = 2.5
```

It is ok for the expected value to be a fraction or have decimal values, though it would be impossible to observe 2.5 heads. 

Let's look at a different example. Let's say we forgot to study, and we are going to guess **B** on all 20 questions of a multiple-choice quiz. If we assume that every letter option (A, B, C, and D) has the same probability of being the right answer for each question, how many questions would we expect to get correct? `n` would equal 20, because there are 20 questions, and `p` would equal 0.25, because there is a 1 in 4 chance that **B** will be the right answer. Using the equation, we can calculate:

```tex
Expected(\#\;right\;answers) = E(X) = 20 \times 0.25 = 5
```
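
The shortcut *n × p* can be checked against the definition of expected value, which is the sum of each possible outcome times its probability. This sketch uses only the standard library; `binomial_pmf` and `binomial_expected_value` are helper names made up for the illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_expected_value(n, p):
    """E(X) from the definition: sum of k * P(X = k)."""
    return sum(k * binomial_pmf(k, n, p) for k in range(n + 1))

coin_flips = binomial_expected_value(10, 0.5)   # compare with 10 * 0.5
quiz_guess = binomial_expected_value(20, 0.25)  # compare with 20 * 0.25
print(coin_flips, quiz_guess)
```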

Expected Value of the Binomial Distribution

The variance of a binomial distribution measures how much the number of successes tends to vary around the expected value (from the previous exercise). In other words, it measures the spread of the distribution around its mean.

Variance for the Binomial distribution is similarly calculated to the expected value using the *n* (# of trials) and *p* (probability of success) parameters. Let's use the 10 fair coin flips example to try to understand how variance is calculated. Each coin flip has a certain probability of landing as heads or tails: 0.5 and 0.5, respectively. 

The variance of a single coin flip will be the probability that the success happens times the probability that it does not happen: *p·(1-p)*, or 0.5 x 0.5. Because we have *n = 10* independent coin flips, and the variances of independent flips add together, the variance of a single fair coin flip is multiplied by the number of flips. Thus we get the equation:

```tex
\begin{aligned}
Variance(\#\;of\;Heads) = Var(X) = n \times p \times (1-p) \\
Variance(\#\;of\;Heads) = 10 \times 0.5 \times (1 - 0.5) = 2.5
\end{aligned}
```

Let's consider our 20 multiple choice quiz again. The variance around getting an individual question correct would be *p·(1-p)*, or 0.25 x 0.75. We then multiply this variance for all 20 questions in the quiz and get:

```tex
Variance(\#\;of\;Correct\;Answers) = 20 \times 0.25 \times (1 - 0.25) = 3.75
```

We would expect to get 5 correct answers, with a variance of 3.75 (a standard deviation of about 1.94 questions) around that expected value.
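
The formula *n × p × (1 − p)* can also be checked against the definition of variance, the probability-weighted average of squared distances from the mean. This is a standard-library sketch; `binomial_pmf` and `binomial_variance` are helper names made up for the illustration.

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a binomial with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def binomial_variance(n, p):
    """Var(X) from the definition: sum of (k - mean)**2 * P(X = k)."""
    mean = n * p
    return sum((k - mean)**2 * binomial_pmf(k, n, p) for k in range(n + 1))

coin_var = binomial_variance(10, 0.5)   # compare with 10 * 0.5 * 0.5
quiz_var = binomial_variance(20, 0.25)  # compare with 20 * 0.25 * 0.75
print(coin_var, quiz_var)
print(math.sqrt(quiz_var))  # standard deviation, in units of questions
```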

Variance of the Binomial Distribution

There are several properties of expectation and variance that are consistent through all distributions:

### Properties of Expectation
1. The expected value of the sum of two random variables is the sum of their individual expected values (this holds whether or not the variables are independent):
```tex
E(X + Y) = E(X) + E(Y)
```
For example, if we wanted to count the total number of heads between 10 fair quarter flips and 6 fair nickel flips, the expected value combined would be 5 heads (from the quarters) and 3 heads (from the nickels) so 8 heads overall.

2. Multiplying a random variable by a constant *a* changes the expected value to be *a* times the expected value of the random variable:
```tex
E(aX) = aE(X)
```
For example, the expected number of heads from 10 fair coin flips is 5. If we wanted to calculate the number of heads from this event run 4 times (40 total coin flips), the expected value would now be 4 times the original expected value, or 20.

3. Adding a constant *a* to the distribution changes the expected value by the value *a*:
```tex
E(X + a) = E(X) + a
```
Let's say that a test was given and graded, and the average grade was 78 out of 100 points. If the teacher decided to curve the grade by adding 2 points to everyone's grade, the average would now be 80 points.

### Properties of Variance
1. Increasing the values in a distribution by a constant *a* does not change the variance:
```tex
Var(X + a) = Var(X)
```
This is because the variance of a constant is 0 (there is no range for a single number). Adding a constant to a random variable does not add any additional variance. Let's take the previous example with the teacher curving grades: though the expected value (or average) of the test changes from 78 to 80, the spread and dispersion (or variance) of the test scores stays the same.

2. Scaling the values of a random variable by a constant *a* scales the variance by the constant squared:
```tex
Var(aX) = a^2Var(X)
```

3. The variance of the sum of two random variables is the sum of the individual variances:
```tex
Var(X + Y) = Var(X) + Var (Y)
```
This principle ONLY holds if *X* and *Y* are independent random variables. Let's say that *X* equals 1 if we get heads on a single fair coin flip (and 0 otherwise), and *Y* equals 1 if we roll a 2 on a fair six-sided die (and 0 otherwise):
```tex
Var(X) = 0.5 * (1 - 0.5) = 0.25
```
```tex
Var(Y) = 0.167 * (1 - 0.167) = 0.139
```
```tex
Var(X + Y) = Var(X) + Var(Y) = 0.25 + 0.139 = 0.389
```
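
The three variance properties can be verified exactly for the coin and die example. The sketch below represents each distribution as a dictionary of value-to-probability pairs and uses exact fractions, so no rounding is involved. The helper functions `mean` and `variance`, and the constants 3 and 2 used for scaling and shifting, are choices made for this illustration.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    return sum(value * prob for value, prob in dist.items())

def variance(dist):
    mu = mean(dist)
    return sum((value - mu)**2 * prob for value, prob in dist.items())

# X = 1 for heads on a fair coin; Y = 1 for rolling a 2 on a fair die
X = {0: Fraction(1, 2), 1: Fraction(1, 2)}
Y = {0: Fraction(5, 6), 1: Fraction(1, 6)}

# distribution of X + Y, assuming X and Y are independent
# (joint probabilities are products of the individual probabilities)
sum_dist = {}
for (x, px), (y, py) in product(X.items(), Y.items()):
    sum_dist[x + y] = sum_dist.get(x + y, 0) + px * py

var_sum = variance(sum_dist)
var_scaled = variance({3 * x: p for x, p in X.items()})   # Var(3X)
var_shifted = variance({x + 2: p for x, p in X.items()})  # Var(X + 2)

print(var_sum, variance(X) + variance(Y))  # Var(X + Y) vs Var(X) + Var(Y)
print(var_scaled, 9 * variance(X))         # Var(3X) vs 3**2 * Var(X)
print(var_shifted, variance(X))            # Var(X + 2) vs Var(X)
```

The exact value of Var(X + Y) is 7/18, which rounds to the 0.389 shown above.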

Properties of Expectation and Variance

Congrats! We have finished our second lesson on probability distributions! Let's review some of the things we've learned: 
* The Poisson distribution and its parameter lambda (λ)
* How the probability mass function of the Poisson distribution changes with different values of lambda (λ)
* Calculating probabilities of specific values and ranges of values from the Poisson distribution
* Calculating probabilities of ranges using the cumulative distribution function of the Poisson distribution
* Generating random values from a distribution
* Principles of expectation and variance of various distributions
* Universal properties of expectation and variance


More on Probability Distributions

Take many random samples from a population, each of a different size.

Take many random samples from a population, each of the same size.

Plot the distribution of the sample statistics.

```tex
\begin{aligned}
mean = 4.25 \text{ inches} \\
std\_error = \frac{0.25}{\sqrt{100}} \text{ inches} \\
\end{aligned}
```



```tex
\begin{aligned}
mean = 4.25 \text{ inches} \\
std\_error = \frac{0.25}{100} \text{ inches} \\
\end{aligned}
```

```tex
\begin{aligned}
mean = \frac{4.25}{\sqrt{100}} \text{ inches} \\
std\_error = 0.25 \text{ inches} \\
\end{aligned}
```

We can't determine this since we don't know if the population distribution of hummingbird wingspans is normally distributed or skewed.

Mean: stays the same
Standard Error: decreases

Mean: stays the same
Standard Error: stays the same

Mean: increases
Standard Error: decreases

Mean: stays the same
Standard Error: increases

Assess your knowledge of sampling distributions!

Investigate sampling distributions of Spotify data!

You will be working with a dataset called **spotify_data.csv**. In **script.py**, use the pandas `read_csv()` function to load **spotify_data.csv** into a variable called `spotify_data`. 



Use the pandas `.head()` function to preview the `spotify_data`. If you need a reminder of how to use this function, click the hint below.

For this project, we are going to focus on the `tempo` variable. This column gives the beats per minute (bpm) of each song in **spotify_data.csv**. The other columns in our dataset are:
* `danceability`
* `energy`
* `instrumentalness`
* `liveness`
* `valences`

For now, we are going to ignore these other columns.

Create a variable called `song_tempos` that contains the `tempo` column data.

Let's investigate the helper functions we will use in the following sections. A file called **helper_functions.py** should be opened in the workspace for you. It contains three functions: `choose_statistic()`, `population_distribution()`, and `sampling_distribution()`. The code in these functions is similar to what we saw in the previous lesson, but let's explore these together.

`choose_statistic()` allows us to choose a statistic we want to calculate for our sampling and population distributions. It contains two parameters:
* `x`: An array of numbers
* `sample_stat_text`: A string that tells the function which statistic to calculate on `x`. It takes on three values: "Mean", "Minimum", or "Variance".

`population_distribution()` allows us to plot the population distribution of a dataframe with one function call. It takes the following parameter:
* `population_data`: the dataframe being passed into the function

`sampling_distribution()` allows us to plot a simulated sampling distribution of a statistic. The simulated sampling distribution is created by taking random samples of some size, calculating a particular statistic, and plotting a histogram of those sample statistics. It contains three parameters:
* `population_data`: the dataframe being sampled from
* `samp_size`: the size of each sample
* `stat`: the specific statistic being measured for each sample &mdash; either "Mean", "Minimum", or "Variance"

Read through these functions in **helper_functions.py** to familiarize yourself with them. Click the hint to see examples of `population_distribution()` and `sampling_distribution()` being used.

Now that our data is loaded into **script.py** and we have gone over the functions in **helper_functions.py** let's start our sampling distributions exploration. Make sure to write your code in **script.py**.


To start off, let's use the `population_distribution()` function to graph the distribution of `song_tempos`.

When you click run, you should see a graph with the following title:
```
Population Distribution
```

How would you describe this distribution?

Now let's plot the sampling distribution of the sample mean with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

Once you hit run, you should see a graph with the following title:
```
Sampling Distribution of the Mean
Mean of the Sample Means: {Mean of the Sample Means} 
Population Mean: {Population Mean}
```

Compare your sampling distribution of the sample means to the population mean. Is the sample mean an unbiased or biased estimator of the population mean? 

Now let's plot the sampling distribution of the sample minimum with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

Once you hit run, you should see a graph with the following title:
```
Sampling Distribution of the Minimum
Mean of the Sample Minimums: {Mean of the Sample Minimums}
Population Mean: {Population Mean}
```


Compare your sampling distribution of the sample minimums to the population minimum. Is the sample minimum an unbiased or biased estimator of the population minimum? 

Now let's plot the sampling distribution of the sample variance with sample sizes of 30 songs. To do this, use the `sampling_distribution()` helper function.

Once you hit run, you should see a graph with the following title:
```
Sampling Distribution of the Variance
Mean of the Sample Variances: {Mean of the Sample Variances}
Population Variance: {Population Variance}
```


Compare your sampling distribution of the sample variance to the population variance. Does the sample variance appear to be an unbiased or biased estimator of the population variance?

Click the hint for more information about sample variance.

Go to line 17 in **helper_functions.py**. You should see the following line of code:

```python
np.var(x)
```

Change this to:
```python
np.var(x, ddof=1)
```

Adding the `ddof=1` argument makes `np.var()` divide the sum of squared deviations by *n-1* instead of *n*, which is the sample variance formula. 

After changing this line of code, run **script.py**. Does the sample variance appear to be an unbiased or biased estimator of the population variance?
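
The effect of `ddof=1` can also be seen outside the helper functions. The sketch below uses a synthetic stand-in population (normally distributed values with a made-up mean and spread, not the Spotify data), draws many samples of size 30, and averages the two versions of the variance across samples. The seed, population, and number of samples are all assumptions for the illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# stand-in "population" (synthetic values, NOT the Spotify data)
population = rng.normal(loc=120, scale=25, size=10_000)
population_var = np.var(population)

# 5,000 random samples of size 30, one sample per row
samples = rng.choice(population, size=(5_000, 30))
mean_var_n = np.var(samples, axis=1).mean()                  # divides by n
mean_var_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()  # divides by n - 1

print(population_var, mean_var_n, mean_var_n_minus_1)
```

On average, dividing by *n* tends to come out below the population variance, while dividing by *n-1* tends to land close to it.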


We have a good sense of some sample statistics now that we've investigated sampling distributions. Let's take our analysis further by calculating probabilities. 

First, calculate the population mean and population standard deviation of `song_tempos`. Save these values in two separate variables called `population_mean` and `population_std`.

Use `population_mean` and `population_std` to calculate the standard error of the sampling distribution of the sample mean with a sample size of 30.

Save this value in a variable called `standard_error`.

You are afraid that your party will not be enjoyable if the average tempo of the songs you randomly select is less than 140bpm. 

Using `population_mean` and `standard_error` in a CDF, calculate the probability that the sample mean of 30 selected songs is less than 140bpm.

Remember to print your result into the output terminal. 

You know the party will be truly epic if the randomly sampled songs have an average tempo of greater than 150bpm. 

Using `population_mean` and `standard_error` in a CDF, calculate the probability that the sample mean of 30 selected songs is GREATER than 150bpm.

Remember to print your result into the output terminal.

Does this probability make you feel confident about the party?
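
For reference, here is a minimal sketch of the general pattern behind the last few steps: the standard error is the population standard deviation divided by the square root of the sample size, and the sampling distribution of the mean is treated as approximately normal. The numbers below are made-up placeholders, not values computed from **spotify_data.csv**, and the variable names are hypothetical. It uses `statistics.NormalDist` from the standard library; `stats.norm.cdf(x, loc, scale)` from `scipy.stats` would play the same role.

```python
import math
from statistics import NormalDist

# made-up placeholder values, NOT computed from spotify_data.csv
example_mean = 147.0  # hypothetical population mean tempo (bpm)
example_std = 24.0    # hypothetical population standard deviation (bpm)
samp_size = 30

# standard error of the sample mean for samples of this size
example_se = example_std / math.sqrt(samp_size)

# normal approximation to the sampling distribution of the sample mean
sampling_dist = NormalDist(mu=example_mean, sigma=example_se)

p_below_140 = sampling_dist.cdf(140)      # P(sample mean < 140)
p_above_150 = 1 - sampling_dist.cdf(150)  # P(sample mean > 150)
print(p_below_140, p_above_150)
```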



Awesome job! You are ready to throw an awesome party! If you want to do some more exploration of sampling distributions, here are some more opportunities:

* Add another sample statistic to the `choose_statistic()` function in **helper_functions.py** &mdash; such as median, mode, or maximum.
* Explore a different column of data from the **spotify_data.csv** dataset.
* Use the sampling distribution of the sample minimum to estimate the probability of observing a specific sample minimum. For example, from the plot, what is the chance of getting a sample minimum that is less than 130bpm?

Happy coding!

Sampling Distributions Dance Party!

In this unit, we will cover fundamental rules of probability including how to describe random events. We will cover topics such as set theory, conditional probability, joint probability, Bayes rule, probability distributions, and sampling distributions. These concepts are important in order to understand the likelihood of events, fit machine learning models, and perform hypothesis tests.

Learn about what probability is, the language we use to define it, and how we can quantify uncertainty!

Learn how to describe different types of random events.

Learn the fundamentals of probability and how to quantify and visualize uncertainty. 

Probability
