Learn about propensity scores and how to use them to find a causal treatment effect.

Flowchart showing the five steps of a propensity score analysis.

One of the main assumptions in causal inference is known as the assumption of _conditional exchangeability_. This assumption states that, so long as we account for confounders (the non-treatment, non-outcome variables), we would observe the same outcomes if the treatment and non-treatment groups were swapped. Conditional exchangeability is achieved via randomization since it balances both observed AND unobserved variables between treatment groups. However, conditional exchangeability can be difficult to achieve in non-randomized situations.

Propensity score methods are widely used in causal inference because they can help reach conditional exchangeability even when randomization is not possible. So what are propensity scores, and how can we apply propensity score methods to our own questions?

A _propensity score_ is essentially the probability of being in a particular treatment group given a set of observed variables. Typically we will think of propensity scores as the probability of being in the treatment group as opposed to the control group. In a sense, propensity scores summarize all the traits of an observation to a single score, which can be an advantage when there are lots of observed variables.

Propensity score analyses can be broken down into five ordered steps:
 1. Check initial overlap and balance.
 2. Model propensity scores.
 3. Use propensity scores to weight the dataset.
 4. Re-check overlap and balance.
 5. Estimate the treatment effect, or return to step two to improve the propensity score model.

What are Propensity Scores?

The first step in a propensity score analysis is to check how similar the treatment and control groups are at baseline, before using propensity score methods. There are two measures that are commonly used to describe the degree of similarity between treatment groups: _overlap_ and _balance_.

* _Overlap_ is the range of values of a variable that the treatment and control groups have in common.
 * _Overlap_ can also be thought of as the range of values of a variable where the probability of being in the treatment group is greater than 0 but less than 1.
 * We already know that overlap is an important assumption of causal inference!
* _Balance_ describes how similar the treatment and control groups are with respect to the entire distribution of each of the other variables.
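
As a rough illustration of the first definition of overlap, the shared range of a variable can be computed from each group's minimum and maximum. The helper name `overlap_range()` and the stress values below are made up for this sketch; they are not part of cobalt or the lesson's dataset.

```r
# Hypothetical stress scores for each group (illustrative values only)
stress_treated <- c(30, 35, 40, 45, 50)
stress_control <- c(20, 40, 60, 80)

# Range of values that both groups have in common
overlap_range <- function(x_treated, x_control) {
  lower <- max(min(x_treated), min(x_control))
  upper <- min(max(x_treated), max(x_control))
  if (lower > upper) {
    return(NULL) # the two groups share no values at all
  }
  c(lower = lower, upper = upper)
}

shared <- overlap_range(stress_treated, stress_control)
print(shared)
```

A wide shared range suggests that treated and control observations can be compared across most values of the variable, while a narrow or empty one is a warning sign.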

Balance is expressed as a statistic that summarizes the entire distribution of a variable. Two statistics are commonly used to measure balance.
1. _Standardized mean difference_ (SMD). The SMD of a variable in a sample is defined as the difference in the average value of the variable between groups divided by a standard deviation of the variable (commonly one pooled across both groups).
2. _Variance ratio_. The variance ratio of a variable in a sample is the variance of the variable in one treatment group divided by the variance of the variable in the other treatment group.

So how is "good" or "bad" balance defined? 
* An SMD close to zero indicates good balance. This means the average value (and thus the center of the distribution) of the variable is similar between the treatment and control groups. 
* A variance ratio close to one is another indicator of good balance. This means that the variability, or spread, of the variable is similar in both groups.
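
To make the two statistics concrete, here is a minimal base R sketch. It assumes one common convention for the SMD denominator (the square root of the average of the two group variances); the function names and the toy values are hypothetical and are not taken from cobalt or the lesson's dataset.

```r
# Hypothetical stress scores (illustrative values only)
treated <- c(38, 41, 40, 43, 39)
control <- c(50, 62, 47, 58, 55, 66)

# Standardized mean difference, assuming a pooled SD in the denominator
smd <- function(x_treated, x_control) {
  pooled_sd <- sqrt((var(x_treated) + var(x_control)) / 2)
  (mean(x_treated) - mean(x_control)) / pooled_sd
}

# Variance ratio: treated-group variance over control-group variance
variance_ratio <- function(x_treated, x_control) {
  var(x_treated) / var(x_control)
}

smd_value <- smd(treated, control)
vr_value <- variance_ratio(treated, control)
print(round(c(smd = smd_value, variance_ratio = vr_value), 3))
```

With these toy values the SMD is far from zero and the variance ratio is far from one, so both statistics would flag poor balance.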

Variable Overlap and Balance

Suppose we are interested in determining whether the practice of meditation increases the amount of sleep that university students get per night. To gather more information, we surveyed 250 students about their sleeping and meditation habits over the previous year. Note that students were not randomly assigned to treatment groups; students were simply asked about their actions. Their responses were recorded in a dataset with these variables:
 * `sleep` &mdash; average hours of sleep per night (outcome variable)
 * `meditate` &mdash; indicates whether or not the individual reports consistent use of meditation (treatment group variable)
 * `stress` &mdash; a self-reported measure of stress on a 1-to-100 scale, with 1 representing no stress and 100 representing extremely high stress
 * `graduate` &mdash; indicates whether or not a student is in a graduate program versus an undergraduate program (0 = undergraduate; 1 = graduate).

We can use the `bal.plot()` function from the R package, cobalt, to visually check if any variables are severely imbalanced and whether propensity score methods might be useful. The function takes a formula where the left side specifies the treatment group indicator and the right side includes a variable we want to view. To check the `stress` variable, the formula would be `meditate ~ stress`. Then we specify the dataset name and the variable of interest in quotes as additional arguments.

```r
# import library
library(cobalt)
# plot distributions for stress variable
bal.plot(
  x = meditate ~ stress, #formula
  data = sleep_data, #dataset
  var.name = "stress", #variable
  colors = c("#E69F00", "#009E73") #set fill colors
)
```
 
Note that we also set the optional argument `colors` to `c("#E69F00", "#009E73")` for better contrast.

![Density plot of the stress levels of the meditation and non-meditation groups. The meditation group is centered around a score of 40 with a narrow spread, while the non-meditation group is centered around 55 with a much wider spread of stress scores.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e3-stress.svg)

The distributions in the plot appear to differ pretty substantially between the treatment groups, potentially indicating poor balance. The meditation group (green) is centered around a score of 40 with a narrow spread, while the non-meditation group (orange) is centered around 55 with a much wider spread of stress scores.

Visual Check of Numeric Variables

Distribution plots are great for numeric variables, but we need a different type of plot for categorical variables. Fortunately, we can use the exact same `bal.plot()` function from cobalt with no need to specify the variable type. By updating the arguments for `x` and `var.name` to use `graduate`, we will get a bar plot to examine balance for the categorical variable `graduate`.

```r
# import library
library(cobalt)
# plot distributions for graduate variable
bal.plot(
  x = meditate ~ graduate, #new formula
  data = sleep_data, #dataset
  var.name = "graduate", #new variable
  colors = c("#E69F00", "#009E73") #set fill colors
)
```

![A bar plot of the proportion of undergraduate and graduate students split by treatment (meditation) status. There are more treated than control undergraduates, and there are more control than treated graduates.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e4-grad.svg)

From this plot, we see that the ratio of undergraduates to graduates is much larger for the meditation group (green) than for the non-meditation group (orange). 

Both plots so far suggest that there are differences between the treatment and control groups with respect to the `stress` and `graduate` variables. However, balance plots don't precisely quantify the degree of imbalance in the dataset. To get a more detailed picture, we can check balance numerically.

Visual Check of Categorical Variables

While visual assessments of balance are definitely helpful, we can also assess overlap and balance numerically using the standardized mean difference (SMD) and variance ratio for each variable.

Observing an SMD of exactly zero or a variance ratio of exactly one is pretty uncommon. Therefore, the following guidelines can be used to indicate good balance:
 - SMD between -0.1 and 0.1
 - Variance ratio between 0.5 and 2.0

The `bal.tab()` function from the cobalt package is a complement to the `bal.plot()` function that quantifies the balance of variables in a dataset. The `bal.tab()` function has similar arguments and syntax to the `bal.plot()` function. We need to update our formula to include both variables of interest. Then we can show SMDs for all variables and variance ratios for all continuous variables in the sleep dataset by specifying `binary = "std"` and `disp.v.ratio = TRUE`, respectively:

```r
# import library
library(cobalt)
# print table of SMDs and variance ratios
bal.tab(
  x = meditate ~ stress + graduate, #formula
  data = sleep_data, #dataset
  disp.v.ratio = TRUE, #display variance ratio
  binary = "std" #SMDs for binary variables
)
```

The output of the `bal.tab()` that follows shows that the `stress` variable has an SMD of -0.9132 and a variance ratio of 0.5461 between the treatment and control groups. The `graduate` variable has an SMD of -0.6548.

```
Balance Measures
            Type    Diff.Un   V.Ratio.Un
stress    Contin.   -0.9132       0.5461
graduate  Binary    -0.6548

Sample Sizes
        Control Treated
All        190      60
```

The SMDs clearly fall outside the range of -0.1 to 0.1, which suggests there is an imbalance between the treatment and control groups. The variance ratio for the `stress` variable is only just within the acceptable range. Time to put propensity score methods to the test to see if we can reduce this imbalance!

Checking Balance Numerically

Returning to our student sleep data, we are interested in the effect of meditation on sleep. It seems intuitive that people with high levels of stress might struggle with sleep AND might be less likely to engage in coping mechanisms such as meditation. In other words, stress may confound the relationship between meditation and sleep. We can use propensity scores to account for this.

Propensity scores reflect the probability of being in the treatment group, as opposed to the control group, given a set of characteristics. Because this probability corresponds to a binary outcome&mdash;either being in the treatment group or the control group&mdash;we can model the propensity scores using logistic regression. The outcome of the regression will predict whether or not an individual is in the treatment group. It will use potential confounding variables as predictors.

With regards to our sleep data, the propensity score should model the probability of practicing meditation based on the other characteristics in the data. Let's start with a propensity score model with the `meditate` variable as the outcome and the `stress` variable as a predictor.

The `glm()` function in R makes modeling propensity scores via logistic regression simple. By default, the `glm()` function fits a linear regression model, so we need to modify the `family` argument to specify that the treatment group variable is binary. To do this, we set the `family` argument to `"binomial"`:

```r
prop_model <- glm(
  formula = meditate ~ stress, #formula
  data = sleep_data, #dataset
  family = "binomial" #specify logistic regression
)
```

To get a sense of what the propensity scores produced from a logistic regression look like, let's take a look at a histogram of the propensity scores from `prop_model`.

![Histogram showing frequency of propensity scores ranging from 0.0 to 0.8. The distribution is hill-shaped but right-skewed and centered near 0.2.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e6-hist.svg)

The plot indicates that our model produced a lot of low probabilities of being treated for our observations. This makes sense given that the `bal.tab()` output in the last exercise showed us we had 190 students in the non-meditation group and only 60 in the meditation group.
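
For readers who want to see where such scores come from, the sketch below fits the same kind of model on simulated data and extracts the fitted probabilities with `predict()`. The data frame `toy_data` and its data-generating process are assumptions made for illustration; they stand in for `sleep_data`, which is not reproduced here.

```r
set.seed(42) # make the simulated data reproducible

# Simulated stand-in for the survey data (hypothetical, not sleep_data)
n <- 250
stress <- round(runif(n, min = 1, max = 100))
# Assumed process: higher stress lowers the chance of meditating
meditate <- rbinom(n, size = 1, prob = plogis(1 - 0.05 * stress))
toy_data <- data.frame(meditate, stress)

# Logistic regression propensity score model
toy_model <- glm(meditate ~ stress, data = toy_data, family = "binomial")

# Fitted probabilities of treatment are the estimated propensity scores
toy_scores <- predict(toy_model, type = "response")
print(summary(toy_scores))
# hist(toy_scores) would draw a histogram like the one discussed above
```

Setting `type = "response"` matters here: without it, `predict()` returns values on the log-odds scale rather than probabilities.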



Modeling Propensity Scores

Now that you know about the basics of propensity scores, let's talk about some possible applications. 

Propensity scores often show up in matching and stratification. However, we will focus on _propensity score weighting_ because it is a popular choice with strong performance.

Propensity score weighting transforms estimated propensity scores into _weights_ that emphasize or diminish certain observations in our dataset. One widely used type of propensity score weighting is known as inverse probability of treatment weighting (IPTW).

IPTW weights are calculated differently depending on whether we want to estimate the average treatment effect (ATE) or the average treatment effect on the treated (ATT). The ATE is the effect across the entire population, both the treated and control groups. The ATT is the effect among the treated group only.

The formulas for the treatment and control group weights for each estimand are below. `p` represents the propensity score for a particular individual.


|                     | ATE     | ATT     |
|:-------------------:|:-------:|:-------:|
|**Treatment Weight** | 1/p     | 1       |
|**Control Weight**   | 1/(1-p) | p/(1-p) |


It may not be immediately obvious, but these formulas tell us that IPTW makes the treatment and control groups appear more similar in two ways:
 - By giving MORE weight to individuals who look like those in the **opposite** treatment assignment.
 - By giving LESS weight to individuals who look like those in their own treatment assignment.

Check out the following table that shows some example weights for treatment and control observations using IPTW for the ATE.

| treatment group | propensity score | weight |
|:---------------:|:----------------:|:------:|
| treatment       | 0.1              | 10.0   |
| treatment       | 0.5              | 2.0    |
| treatment       | 0.9              | 1.1    |
| control         | 0.1              | 1.1    |
| control         | 0.5              | 2.0    |
| control         | 0.9              | 10.0   |

Note the pattern: 
* The treated individual with the low propensity score of 0.1 (looks more like a control) is given a high weight of 10. 
* The treated individual with the high propensity score of 0.9 (looks like the other treated individuals) is given a low weight of 1.1. 
* The weighting goes in the opposite direction for the controls.

The justification for this method is that someone who looks more like the individuals in the other treatment group is a better counterfactual for them, so we count these individuals as more important. This helps balance our treatment groups without discarding any observations.
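
The weight formulas can be written as a short base R function. This is a sketch of the arithmetic only; the function name `iptw_weight()` is hypothetical and is not part of any package.

```r
# Compute IPTW weights from propensity scores.
# p: propensity scores; treated: 1 for treatment, 0 for control
iptw_weight <- function(p, treated, estimand = c("ATE", "ATT")) {
  estimand <- match.arg(estimand)
  if (estimand == "ATE") {
    ifelse(treated == 1, 1 / p, 1 / (1 - p))
  } else {
    ifelse(treated == 1, 1, p / (1 - p))
  }
}

# Same propensity scores and groups as the example table
p <- c(0.1, 0.5, 0.9, 0.1, 0.5, 0.9)
treated <- c(1, 1, 1, 0, 0, 0)

ate_w <- iptw_weight(p, treated, "ATE")
att_w <- iptw_weight(p, treated, "ATT")

# Rounded ATE weights match the weight column of the example table
print(round(ate_w, 1))
print(round(att_w, 2))
```

Under the ATT formulas, every treated observation keeps a weight of 1 and only the controls are reweighted, so the control group is made to resemble the treated group.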

Propensity Score Weighting

If this seems like a lot of work, don't worry! The WeightIt package in R has functions to model the propensity scores and simultaneously perform propensity score weighting. We don't need to make a separate logistic regression or compute the weights manually using a formula.

IPTW can be performed in R with the `weightit()` function from the `WeightIt` package. There are several key arguments to this function that allow us to tweak how weighting is performed.
 - `formula`&mdash;represents the propensity score model to use.
 - `method`&mdash;determines the weighting method that will be used. While there are various options, we will use "ps" to perform IPTW using logistic regression.
 - `estimand`&mdash;specifies the desired treatment effect estimand: "ATE" for average treatment effect, "ATC" for average treatment effect on the controls, or "ATT" for average treatment effect on the treated.

To perform IPTW for the ATT on the student sleep data, we fill in these arguments accordingly. Remember that our propensity score model includes `meditate` as the outcome and only `stress` as the predictor.

```r
# import library
library(WeightIt)
# model propensity scores and IPTW weights
iptw_sleep <- weightit(
  formula = meditate ~ stress, #propensity model
  data = sleep_data, #dataset
  method = "ps", #use IPTW
  estimand = "ATT" #estimand
)
```

The `weightit()` function models the propensity scores and creates the IPTW weights in one step. We save these outputs in a `weightit` object we name `iptw_sleep`, which we will use in our next step.
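
To see roughly what happens in that one step, the two stages can be done by hand in base R. This is a sketch of the idea on simulated data, not WeightIt's actual implementation; `toy_data` and its data-generating process are assumptions made for illustration.

```r
set.seed(42) # make the simulated data reproducible

# Simulated stand-in for the survey data (hypothetical, not sleep_data)
n <- 250
stress <- runif(n, min = 1, max = 100)
meditate <- rbinom(n, size = 1, prob = plogis(1 - 0.04 * stress))
toy_data <- data.frame(meditate, stress)

# Stage 1: model the propensity scores with logistic regression
ps_model <- glm(meditate ~ stress, data = toy_data, family = "binomial")
p <- predict(ps_model, type = "response")

# Stage 2: ATT weights are 1 for treated and p / (1 - p) for controls
toy_data$att_weight <- ifelse(toy_data$meditate == 1, 1, p / (1 - p))

# Summarize the weights within each group
print(tapply(toy_data$att_weight, toy_data$meditate, summary))
```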

Performing IPTW in R

If propensity score weighting is successful, we expect the distribution of propensity scores in the treatment group to be similar to that of the control group. 

To check the overall balance of propensity scores, we can again use the `bal.plot()` function from the cobalt package. This time we pass the `weightit` object to the `x` argument and `"prop.score"` to the `var.name` argument, with no need for the `data` argument. Lastly, we set `which` equal to `"both"` so we can view the propensity scores before ("unadjusted") and after ("adjusted") weighting is performed.

```r
# import library
library(cobalt)

# create balance plot of propensity scores
bal.plot(
  x = iptw_sleep, #weightit object
  var.name = "prop.score", #propensity scores
  which = "both", #before and after
  colors = c("#E69F00", "#009E73") #sets fill colors
)
```

![Two density plots side by side showing the distributions of propensity scores for the treatment and control groups. The left plot is from before weighting. The two distributions are nearly the same height, but the control group is centered near 0.1 while the treatment group is centered near 0.3, so they have limited overlap. The right plot is from after weighting. The treatment distribution is the same as the left plot, but the control distribution is now shorter and centered closer to 0.3, so they overlap more than they did in the left plot.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e9-bal.svg)

The distributions of propensity scores look similar after IPTW, but we should check the balance of individual variables, too. The `love.plot()` function in cobalt visually checks the standardized mean differences (SMDs) between treatment groups for all variables before and after weighting. We can also show guidelines at ±0.1 SMD by setting `thresholds = c(m = 0.1)`.

```r
love.plot(
  x = iptw_sleep, #weightit object
  binary = "std", #use SMD
  thresholds = c(m = 0.1), #guidelines
  colors = c("#E69F00", "#009E73") #sets fill colors
)
```

![A Love plot that shows standardized mean differences across the x-axis and points for unadjusted and adjusted SMD's for stress and propensity scores. The unadjusted scores are far outside the guidelines of plus or minus 0.1. The adjusted scores are just outside these guidelines.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e9-love.svg)

Oh no! Based on this plot, it appears as if the propensity score weighting was unsuccessful: the SMDs between groups are outside the ±0.1 cutoffs for the `stress` variable and for the propensity scores themselves. Since there is still some residual imbalance between treatment groups, we should backtrack to step 2 and refine the propensity score model.
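
The adjusted SMDs in a Love plot are, in essence, SMDs computed with weighted group means. The base R sketch below shows that calculation on simulated data. It assumes the treated group's standard deviation as the denominator, a common choice when targeting the ATT; the function name `weighted_smd()` and the data are hypothetical, and cobalt's exact computation may differ.

```r
set.seed(42) # make the simulated data reproducible

# Simulated data (hypothetical): stress drives who meditates
n <- 2000
stress <- runif(n, min = 1, max = 100)
meditate <- rbinom(n, size = 1, prob = plogis(1 - 0.04 * stress))

# Propensity scores and ATT weights, computed by hand
ps <- predict(glm(meditate ~ stress, family = "binomial"), type = "response")
att_weight <- ifelse(meditate == 1, 1, ps / (1 - ps))

# SMD from (optionally weighted) group means, scaled by the treated-group SD
weighted_smd <- function(x, treated, w = rep(1, length(x))) {
  mean_t <- weighted.mean(x[treated == 1], w[treated == 1])
  mean_c <- weighted.mean(x[treated == 0], w[treated == 0])
  (mean_t - mean_c) / sd(x[treated == 1])
}

smd_before <- weighted_smd(stress, meditate)
smd_after <- weighted_smd(stress, meditate, att_weight)
print(round(c(before = smd_before, after = smd_after), 3))
```

In this simulation the propensity score model matches the process that generated the data, so weighting shrinks the SMD. With real data the model can be wrong, which is why the check is needed.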

Re-checking Overlap and Balance

As you may have noticed, propensity score methods are an iterative process: we check variable balance, model propensity scores, perform weighting, then check balance again. If imbalance still exists, we can change the propensity score model. A main assumption of propensity score weighting is that we've modeled the propensity scores correctly &mdash; a poor model could lead to biased estimates of the treatment effect!

Let's update our propensity score model from the student sleep data to see if imbalances between groups can be reduced further.

The initial propensity score model only included the `stress` variable as a predictor of meditation, but what happens if we add the `graduate` variable as a second predictor? We need to update the formula in the `weightit()` function:

```r
# import library
library(WeightIt)
# update weightit object
iptw_sleep_update <- weightit(
  #new formula
  formula = meditate ~ stress + graduate,
  data = sleep_data,
  estimand = "ATT",
  method = "ps"
)
```

We then re-check balance to see if the new propensity score model produces a better balance between groups.

```r
# import library
library(cobalt)
# create Love plot of updated model
love.plot(
  x = iptw_sleep_update, #updated model,
  binary = "std", #show SMD
  thresholds = c(m = 0.1), #guidelines
  colors = c("#E69F00", "#009E73") #fill colors
)
```

![A Love plot that shows standardized mean differences across the x-axis and points for unadjusted and adjusted SMD's for stress, graduate status, and propensity scores. The unadjusted scores are far outside the guidelines of plus or minus 0.1. The adjusted scores are now all within these guidelines.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/ps-e10-love.svg)

Success! The Love plot shows that this new propensity score model produces improved balance. The SMDs now fall between -0.1 and 0.1.

Note that it can take multiple tries to find a good propensity score model. If we needed to further refine our model, we might add more complex terms to the equation, such as polynomial terms or interactions.
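
As one possible way to write such a model, a formula can be extended with a squared term and an interaction using `update()`. The particular terms chosen here are only an example, not a recommendation for the sleep data.

```r
# Current propensity score formula
base_formula <- meditate ~ stress + graduate

# Add a squared stress term and a stress-by-graduate interaction
richer_formula <- update(base_formula, . ~ . + I(stress^2) + stress:graduate)
print(richer_formula)

# List the terms the richer model would include
term_labels <- attr(terms(richer_formula), "term.labels")
print(term_labels)
```

The resulting formula could then be passed to the `formula` argument in place of the simpler one, followed by another balance check.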

Refining the Propensity Score Model

Now that we have a good balance, we can proceed to the last step of a propensity score analysis: estimating the causal treatment effect. 

If we think back to the beginning of our lesson, the motivation for studying student sleep was to estimate the effect of meditation on average sleep in university students. So, to estimate the causal treatment effect of meditation, we need to fit a regression model for the outcome variable (hours of sleep) and incorporate the propensity score weights from our `weightit` model.

The final regression model should include hours of sleep as the outcome variable, use of meditation as the treatment group variable, and stress level and graduate status as the other predictor variables.

To use the propensity score weights from IPTW, we set the `weights` argument of the `glm()` function equal to the estimated IPTW weights. These are stored in our updated `weightit` model that we called `iptw_sleep_update`.


```r
outcome_mod_weight <- glm(
  #outcome model
  formula = sleep ~ meditate + stress + graduate,
  #dataset 
  data = sleep_data,
  #IPTW weights
  weights = iptw_sleep_update$weights 
)
```

To get the estimated treatment effect, we use the `coeftest()` function from the lmtest package. Weighting can cause our standard errors to be inaccurate. To get the best estimate of the treatment effect, we need a more _robust_ calculation of the standard errors, so we add the argument `vcov. = vcovHC` made available by the sandwich package. We won't cover this in detail here, but this adjustment means we are using a _heteroscedasticity-consistent estimation of the covariance matrix_ for estimates of the coefficients.

```r
# import library
library(lmtest)
library(sandwich)
# perform tests of regression coefficients
coeftest(
  outcome_mod_weight, #weighted outcome model
  vcov. = vcovHC #robust standard errors
)
```

As we can see in the following output, the coefficient for the meditation variable is 1.02. If we have met the assumptions of IPTW, we can conclude that a typical student who practiced meditation got about 1.02 additional hours of sleep per night because of meditation.

```
z test of coefficients:

              Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)   8.971964    0.669241  13.4062  < 2.2e-16 ***
meditate      1.024871    0.215333   4.7595  1.941e-06 ***
stress       -0.045191    0.013664  -3.3072  0.0009422 ***
graduate     -0.770913    0.280460  -2.7487  0.0059823 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 '' 1
```
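
To build intuition for why the weighted model helps, the sketch below simulates data where the true effect is known and compares a naive difference in means with a weighted regression estimate. Everything here is an assumption made for illustration: the data, the true effect of 1 hour, and the use of base R `lm()` without robust standard errors.

```r
set.seed(42) # make the simulated data reproducible

# Simulated data (hypothetical): stress affects both meditation and sleep
n <- 2000
stress <- runif(n, min = 1, max = 100)
meditate <- rbinom(n, size = 1, prob = plogis(1 - 0.04 * stress))
true_effect <- 1 # assumed truth: meditation adds exactly 1 hour of sleep
sleep <- 8 - 0.03 * stress + true_effect * meditate + rnorm(n, sd = 1)
toy_data <- data.frame(sleep, meditate, stress)

# ATT weights from a logistic regression propensity score model
ps <- predict(
  glm(meditate ~ stress, data = toy_data, family = "binomial"),
  type = "response"
)
toy_data$w <- ifelse(toy_data$meditate == 1, 1, ps / (1 - ps))

# Naive comparison ignores confounding by stress
naive_diff <- mean(toy_data$sleep[toy_data$meditate == 1]) -
  mean(toy_data$sleep[toy_data$meditate == 0])

# Weighted outcome model with the confounder as a covariate
weighted_fit <- lm(sleep ~ meditate + stress, data = toy_data, weights = w)
estimated_effect <- coef(weighted_fit)[["meditate"]]

print(round(c(naive = naive_diff, weighted = estimated_effect), 2))
```

The naive difference overstates the effect because meditators in this simulation are less stressed to begin with, while the weighted estimate lands near the assumed true value. This sketch reports only point estimates; for inference, robust standard errors like those above would still be needed.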


Estimating Causal Effects

Flowchart of the five steps for propensity score analysis along with the R packages that are useful at each step.

Congratulations! You have learned a lot about propensity score methods in this lesson and are quickly becoming a master of causation (not just correlation).

In this lesson, you learned:
  - There are five stages of using propensity scores in causal inference.
  - A propensity score is computed by predicting the probability of treatment from the other observed variables.
  - We use propensity scores in Inverse Probability of Treatment Weighting (IPTW) to create balance across observed variables.
  - We can check balance between treatment and control groups across variables using the cobalt package functions `bal.tab()`, `bal.plot()`, and `love.plot()`.
  - A poor propensity score model may lead to biased estimates of the treatment effect, so it is very important that we find the best model possible.
  - We get an estimate of the treatment effect by creating a regression model with IPTW weights.

Conclusion

Propensity Scores

Learn about matching and weighting methods for causal inference.

Matching and Weighting Methods

## Welcome

Welcome to Causal Inference with R! The goal of this course is to understand the conceptual foundations of causal thinking and learn how to use R to implement a few basic causal inference methods.

## Why Learn Causal Inference?

Most questions that ask why one thing causes another are answered through experiments, but experiments are often difficult or impossible to perform. We also have a tremendous amount of observational data available to us outside of experiments, but data science frequently focuses on using that data to answer predictive questions. Causal inference gives us methods to leverage observational data to answer causal questions without the use of experiments.

## What Will We Cover in This Course?

First, we will cover the conceptual foundations of causal inference. Then, we will cover specific methods using R, including:
 - Propensity score weighting
 - Regression discontinuity design
 - Instrumental variables
 - Difference-in-differences

After this course, you will be able to:
 - Explain the basis of causal thinking through the Potential Outcomes Framework
 - Identify and visualize situations where a causal inference method could be used
 - Apply several basic causal inference methods in R

## Prerequisites

This course requires prior knowledge of basic R programming and a basic understanding of linear and logistic regression.

## Learning is Social

Whatever you're working on, be sure to connect with the Codecademy community in the [forums](https://discuss.codecademy.com/). Remember to check in with the community regularly, including for things like asking for code reviews on your project work and providing code reviews to others in the [projects category](https://discuss.codecademy.com/c/project/1833), which can help to reinforce what you've learned.


Introduce yourself to what will be covered in Learn Causal Inference with R and why causal inference is important.

Introduction and Course Overview

Association is a relationship without a specified direction, while a causal relationship does have a specified direction.

Association is a linear relationship while causation implies a non-linear relationship.

A causal relationship does not specify the strength or pattern of the relationship, while an association does specify the strength or pattern of the relationship.

Because individuals were randomly assigned to one of the two groups, we can only observe one of the potential outcomes.

There are two potential outcomes for each individual: number of calories burned while exercising had they listened to music and number of calories burned while exercising had they NOT listened to music.

The counterfactual outcome for an individual in the music group is the number of calories burned while exercising without music.

The observed outcome for an individual in the music group is the number of calories burned while exercising with music.

Z is the variable that indicates whether or not an individual has been randomized to one of the treatment groups.

Y<sup>1</sup> is the potential outcome given that the individual was in the treatment group.

Y<sup>0</sup> is the potential outcome given that the individual was NOT in the treatment group.

Y is the outcome that was actually observed. 

`ITE` = Y<sup>1</sup> - Y<sup>0</sup>, `ATE` = 2 

We can never observe both potential outcomes, only the one that actually occurs.

When randomization is not possible due to ethical or logistical concerns, we cannot estimate causal effects.

Selection bias makes it impossible to estimate causal effects.

The observed outcome is always different from the counterfactual outcome, so we can never estimate the true treatment effect.

Randomization is usually not preferred because it introduces selection bias into an experiment.

If randomization is not possible, we have to estimate counterfactual outcomes in order to estimate causal effects.

Selection bias can occur when individuals self-select their treatment status and can lead to differences between treatment and control conditions.

Randomization may not always be possible due to ethical or logistical concerns.

`ATT` = (3 + 2 + 4) / 3 = 3, `ATC` = (5 + 3) / 2 = 4


`ATT` = (2 + 1 + 3) / 3 = 2, `ATC` = (4 + 2) / 2 = 3

`ATT` = (5 + 3 + 7) / 3 = 5, `ATC` = (4 + 2) / 2 = 3

Test your knowledge of the foundations and framework for causal inference.

Causal Inference Foundations

In 2004, the Journal of Applied Psychology published a study showing that height was strongly correlated with income even after controlling for sex, age, and weight. It was an amazing result that has been talked about almost constantly since. But how did the researchers figure this out? Clearly, they didn't randomly assign people to the "tall" and "not tall" groups. The answer lies in causal inference. 

In this example, height (tall and not tall) is the treatment condition and income is the outcome. 

To find the true causal effect of being tall, we would need to compare how much money someone made if they were tall and how much money that same person made if they were NOT tall. Since we cannot observe both of those things for the same person, we have to compare what really happened to a prediction of what would have happened if a tall person was short or if a short person was tall.

To find the true causal effect of being tall (the treatment), we would need to compare the incomes (outcomes) for what would have happened had the person been tall (the treatment was received) to the income (the outcome) for what would have happened had they been not tall (the treatment was NOT received). Since we can't observe both of these _potential outcomes_, we must compare our _observed outcomes_ to a prediction of their _counterfactuals_. 
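
A small table of made-up numbers can make these terms concrete. In the sketch below, both potential outcomes are written down for five hypothetical people, which is only possible because the data are invented; in a real study only one of the two columns is ever observed for each person.

```r
# Hypothetical incomes (in thousands) for five people; all values are made up
po <- data.frame(
  z  = c(1, 1, 1, 0, 0),      # 1 = tall (treated), 0 = not tall (control)
  y1 = c(62, 58, 71, 50, 49), # potential outcome if tall
  y0 = c(60, 55, 70, 46, 47)  # potential outcome if not tall
)

# Individual treatment effect: difference between the potential outcomes
po$ite <- po$y1 - po$y0

# Observed outcome is the potential outcome for the group actually received
po$y_observed <- ifelse(po$z == 1, po$y1, po$y0)

# Counterfactual outcome is the one we never get to see
po$y_counterfactual <- ifelse(po$z == 1, po$y0, po$y1)

# Average treatment effect across everyone
ate <- mean(po$ite)

print(po)
print(ate)
```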

## Randomization

In the case of our earners, we also had gender, age, and weight as additional traits. 

If our treatment and control groups look similar across the additional traits, we can use them as counterfactuals for each other. So, if our short and tall groups are similar in terms of gender, age, and weight, we can use them as counterfactuals of each other. 

Randomization is considered the best technique to accomplish this. We can think of randomization into one of two groups like the result of a coin toss: heads for the treatment group and tails for the control group. Since no specific trait determines which group an individual is assigned, we should end up with two fairly similar groups. 
* We use the control group as a counterfactual for the treatment group (what the outcome would have been had the tall people been short).
* We use the treatment group as a counterfactual for the control group (what the outcome would have been had the short people been tall).
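
The coin-toss idea can be checked with a quick simulation. The sketch below invents a population with two traits, assigns treatment at random, and compares the group averages; all of the values and variable names are hypothetical.

```r
set.seed(42) # make the simulated data reproducible

# Simulated population with two traits (hypothetical values)
n <- 10000
people <- data.frame(
  age = rnorm(n, mean = 40, sd = 10),
  weight = rnorm(n, mean = 75, sd = 12)
)

# Coin toss: each person has a 50% chance of being treated
people$treated <- rbinom(n, size = 1, prob = 0.5)

# Average of each trait within the control (0) and treatment (1) groups
group_means <- aggregate(cbind(age, weight) ~ treated, data = people, FUN = mean)
print(group_means)

# Gaps between the group averages
age_gap <- diff(group_means$age)
weight_gap <- diff(group_means$weight)
print(round(c(age_gap = age_gap, weight_gap = weight_gap), 2))
```

Because assignment ignores every trait, the gaps should be small relative to the spread of each trait, and they tend to shrink further as the sample grows.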

![two wooden figures, one tall and one short, but otherwise identical are standing next to each other.](https://static-assets.codecademy.com/Courses/causal-inference/conceptual-foundations/short-tall-same.png)

While randomization is the ideal technique, we frequently are unable to use it. For example, we cannot randomly assign a person's height, and traits like gender and height are inextricably related. As such, much of the field of causal inference is dedicated to establishing techniques to estimate causal effects for nonrandomized studies.

## When Randomization Isn't Possible

Randomized experiments are not always feasible because:

* **Ethics:** It may be unethical to randomly assign individuals into treatment groups. For example, randomly assigning people to drink contaminated water in order to study the impact of pollution on health would be horribly unethical.
* **Cost:** The time, expense, or resources required to run an experiment may be too high. For example, a business may want to study if certain factors influence individuals to become lifetime customers. An experiment that collects data over the course of a customer's lifetime would be too expensive and lengthy to be practical.
* **Logistics:** It may be impractical or even impossible to conduct a randomized study. For example, if we want to study the impact of college attendance on future earnings, we can't randomly assign whether or not people go to college.
* **Noncompliance:** Participants may not comply with their treatment assignment even when there was randomization. Suppose the treatment is a long-term commitment, like a job training program. Participants may drop out before completing the training, leaving a smaller treatment group that may differ from the control group in significant ways.

## Observational Studies and Natural Experiments

Since randomizing subjects into treatment and control conditions is not always possible, we often have to work with observational data. When doing an observational study, the researcher does not manipulate who gets the treatment. The outcomes are merely observed and recorded.

Examples of observational data include:
* Surveys of political views and voting behavior
* Weather measurements, such as observed temperature and humidity
* Demographics and health history collected in medical records

![hands on a keyboard with a stethoscope on table, intended to represent entering medical records](https://static-assets.codecademy.com/Courses/causal-inference/conceptual-foundations/hands-keyboard-medical.png)

Some observational studies take advantage of events that result in some randomization of treatment. These types of studies are called natural experiments, or quasi-experimental studies.

Examples of events that might introduce a natural experiment include the following:
 - A tax credit is given only to people who fall below a certain income limit.
 - A business expands into a neighboring state with a different minimum wage.
 - Forest fires break up similar regions into affected and unaffected areas.
 - Housing is made available based on a lottery system.

## Causal Inference Techniques

Through study design and statistical estimation, causal inference methods frequently aim to create situations that look like the product of randomization: two groups with similar characteristics other than the treatment they received. 

How does causal inference achieve this? Generally, possibilities include:

* Find individuals from both groups that match on other traits. Compare only these individuals and discard the data from any leftover individuals.
* When calculating the treatment effect, give more importance to the outcomes of treated individuals who look like control individuals and vice versa.
* Use the partial randomization of treatment assignment made available by natural experiments to identify comparable groups. 
* Find a group of individuals who look a lot like the group that received treatment and compare them.

These are broad generalizations &mdash; the field of causal inference is filled with innovative techniques and statistical algorithms that are constantly evolving to be more accurate and to cover even more situations where experimentation isn't possible.

Explore how causal inference helps us identify the effects of treatment.

Finding Causal Effects 

Learn about the fundamentals of causal inference.

A comic depicting two characters. Character 1: "Another huge study found no evidence that cell phones cause cancer. What was the W.H.O. thinking?" Character 2: "I think they just got it backward." Character 1: "Huh?" Character 2: "Well, take a look." A line chart of cancer incidence and cell phone users in the United States over time. The line for cancer rate increases to a high level first. Then the line for cell phone users begins to increase after. Character 1: "You're not... There are so many problems with that." Character 2: "Just to be safe, until I see more data I'm going to assume cancer causes cell phones."

As humans, we are hardwired to look for patterns and identify relationships between things we observe in the world around us. Our brains naturally tend to fill in details and come up with explanations for these relationships. Unfortunately, jumping to conclusions often leads us to see patterns in randomness and to blur the lines between association and causation.

Before we get some hands-on experience using causal inference methods, we need to build up some intuition about what exactly causal inference is, when causal inference is appropriate to use, and what causal inference can (and cannot) do. In this lesson, we will gradually introduce you to the key ideas, statistical frameworks, and required assumptions that are fundamental to causal inference.

Causal Inference: An Introduction

Chances are, you or someone you know is superstitious to some extent. Whether it’s wearing a lucky t-shirt to a sporting event or using a favorite pencil and eraser on exams, we believe in superstitions because we think our actions will lead to some desired&mdash;but usually unrelated&mdash;result. Superstitions are, in fact, extreme examples of assuming an associational relationship is actually causal in nature. 

One of the most important concepts in causal inference is the distinction between association and causation. Let’s formally define these two terms:

1. _Association_ is a general term to describe a relationship between variables. Association can describe the strength or pattern of a relationship, but it does not explain the mechanism behind the relationship. 
    
    One frequently used statistical measure of association is _correlation_. Correlation is typically used to describe the association between two variables with a linear pattern. The animation below shows what variables with different degrees of correlation look like.
    
<iframe src="https://static-assets.codecademy.com/Courses/causal-inference/conceptual-foundations/po-corr/index.html" width="350" height="350" frameBorder="0"></iframe>
    
2. _Causation_ describes not only the strength or pattern of a relationship but also the MECHANISM of a relationship. In a causal relationship, variable X CAUSES a change in variable Y; we know that X must happen before Y.



A line plot of swimming pool sales and forest fires over time, where both lines follow similar trends. The lines both rise in the summer months and fall in the colder months, meaning there are more swimming pool sales and more forest fires during warmer months.
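
The pattern in this plot can be reproduced with a short R sketch. All numbers below are simulated assumptions: temperature drives both series, and neither series affects the other.

```r
# Simulated monthly data for ten years (all values are made up).
set.seed(1)
months <- 1:120
temperature  <- 15 + 10 * sin(2 * pi * months / 12) + rnorm(120, sd = 2)
pool_sales   <- 50 + 3 * temperature + rnorm(120, sd = 5)
forest_fires <- 20 + 2 * temperature + rnorm(120, sd = 5)

# Strong association, even though pool sales do not cause fires.
raw_cor <- cor(pool_sales, forest_fires)

# Remove the linear contribution of temperature from each series,
# then correlate what is left over.
pool_resid  <- resid(lm(pool_sales ~ temperature))
fire_resid  <- resid(lm(forest_fires ~ temperature))
partial_cor <- cor(pool_resid, fire_resid)

print(c(raw = raw_cor, after_removing_temperature = partial_cor))
```

Once temperature is accounted for, most of the association should disappear, which is what we would expect when a shared cause, rather than a causal link, produces the correlation.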

Correlation is Not Causation

A depiction of the two possible realities. When the patient receives therapy animal services, Universe T is the factual outcome and Universe C is the counterfactual outcome. When the patient does not receive therapy animal services, Universe C is the factual outcome and Universe T is the counterfactual outcome.

The second important concept we must learn is _counterfactual thinking_. Counterfactual thinking is the process of asking, “What WOULD have happened if circumstances were different?” Let’s illustrate counterfactual thinking using the following example:

Dogs are often called “human’s best friend,” but did you know there may be a biological explanation behind this saying? Research has shown that just a few minutes of interacting with dogs or cats can reduce levels of cortisol, a hormone linked to stress that may lead to weight gain or a weakened immune system. Let’s say we’re interested in learning whether interacting with a trained therapy animal leads to decreased levels of cortisol in hospital patients. 

Approaching this using counterfactual thinking, we must consider what each patient’s cortisol level would be in two different “universes”: 
 - In one universe, a patient interacts with a therapy animal&mdash;we will call this “Universe T.” 
 - In the other universe, referred to as “Universe C,” the same patient does NOT interact with a therapy animal. 

The cortisol levels in these two parallel universes are called _potential outcomes_ because either could potentially be observed. But in reality, we can only observe one cortisol level for a particular patient at one specific moment in time.

Assume that a patient exists in Universe T and ACTUALLY interacted with a therapy animal. The cortisol level observed in this scenario is the _observed_ or _factual outcome_ because it is the outcome that was observed. The cortisol level that would have been observed if the patient existed in Universe C and did not interact with the therapy animal would be the _counterfactual outcome_. We can never actually observe the counterfactual outcome.

Using counterfactual thinking allows us to compare the exact same person at the exact same time under two different circumstances. Because the only difference between the two universes is that the patient received the treatment in Universe T and the control in Universe C, we could compare the cortisol levels from each universe to get an estimate of the effect of interacting with therapy animals.



Thinking Counterfactually

In order to generalize our understanding of the potential outcomes framework, we will now introduce some notation that will be used throughout the rest of this course.  

 - Z represents the treatment or exposure condition. When Z is binary, there are only two possible treatment conditions: treatment (Z = 1) or control (Z = 0).
    
    (Z could also be a continuous variable, such as medication dosage or number of days in the hospital, but our example assumes that there are only two treatment conditions for simplicity.)
 - Y represents the observed value of the outcome variable.

In the potential outcomes framework, we consider how the Y values we observe would change if the treatment were different:

 - When Z is binary, the two potential outcomes are represented with Y<sup>1</sup> and Y<sup>0</sup>:
   - Y<sup>1</sup> is the potential outcome under the treatment condition (Z = 1).
   - Y<sup>0</sup> is the potential outcome under the control condition (Z = 0).

In real life, an individual can only be in one treatment group at any specific point in time. Thus, we can never know the value of both potential outcomes for an individual. We will discuss how to deal with this problem in future exercises. For now, assume that we could know the values of both Y<sup>0</sup> and Y<sup>1</sup>.

Potential Outcomes Notation

If we knew both potential outcomes for every individual, we could use them to compute several different statistics that summarize the effect of the treatment:
 - The _individual treatment effect_  (ITE) is computed as Y<sup>1</sup> - Y<sup>0</sup>. This statistic directly compares the two potential outcomes for each individual.
 - The _Average Treatment Effect_ (ATE) is the average of all individual treatment effects, which can be calculated as the difference between the average of Y<sup>1</sup> and the average of Y<sup>0</sup>.
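
As a small illustration of these two definitions, the R sketch below uses made-up potential outcomes for five hypothetical patients. It pretends that both potential outcomes are known, which is never the case with real data.

```r
# Hypothetical cortisol levels for five patients (made-up values).
patients <- data.frame(
  y1 = c(10, 12, 9, 14, 11),   # potential outcome under treatment (Z = 1)
  y0 = c(15, 16, 15, 18, 16)   # potential outcome under control (Z = 0)
)

# Individual treatment effect: Y1 - Y0 for each patient.
patients$ite <- patients$y1 - patients$y0

# Average treatment effect, computed two equivalent ways.
ate     <- mean(patients$ite)
ate_alt <- mean(patients$y1) - mean(patients$y0)

print(patients)
print(c(ate = ate, ate_alt = ate_alt))
```

Both calculations give the same number because the average of the differences equals the difference of the averages.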

Hold up! You may be wondering, “How are we supposed to calculate the individual treatment effect or true ATE if we can never observe the counterfactual outcome?” This question gets at the fundamental problem of causal inference. Causal inference is essentially a missing data problem: since we can only observe the outcome that actually happened, we are always missing the counterfactual outcome.


A Missing Data Problem

Because we can never know both potential outcomes for an individual, we need to use a different method to estimate causal effects. The most accurate way to do this is to use _randomization_.

Randomization is a method of treatment group assignment that is essentially a coin flip to determine whether an individual receives the treatment or the control. This ensures that, for a large enough sample size, the treatment groups will be similar on average with respect to all factors EXCEPT for the treatment condition. While we still won’t know the counterfactual outcome for any individual, we can be reasonably confident that similar individuals received each treatment. This allows us to estimate the potential outcomes we weren’t able to observe.

When randomization is possible, we can estimate the ATE by taking the difference of the average observed outcome values in the treatment and control groups. In the example in the learning environment, the estimated ATE equals -5.4, which is very close to the true ATE of -5.8 that was calculated in the previous exercise.
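
The difference-in-means estimate can be sketched in R as follows. This is a simulation with assumed values (a true effect of about -5 and 5,000 patients), not the dataset from the learning environment.

```r
# Simulate both potential outcomes so the true ATE is known (values are assumptions).
set.seed(7)
n  <- 5000
y0 <- rnorm(n, mean = 20, sd = 4)     # potential outcome under control
y1 <- y0 - 5 + rnorm(n, sd = 1)       # potential outcome under treatment
true_ate <- mean(y1 - y0)

# Coin-flip assignment, then reveal only the outcome that matches the assignment.
z <- rbinom(n, size = 1, prob = 0.5)
y <- ifelse(z == 1, y1, y0)

# Estimated ATE: difference in average observed outcomes between groups.
est_ate <- mean(y[z == 1]) - mean(y[z == 0])

print(c(true_ate = true_ate, est_ate = est_ate))
```

The estimate uses only `z` and `y`, the quantities we would actually observe, yet it should land close to the true ATE because randomization makes the two groups comparable.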


Estimating the ATE

Often, randomization of treatment is unethical. For example, it would be unethical to force hospital patients who don’t like animals to receive therapy animal services. Randomized experiments can also be expensive and demand a lot of resources. It is often more practical to use data that is already available. In these scenarios, the treatment group is not random but is based on other factors, such as personal preference.

When randomization is not possible or plausible, several sources of bias can impact the accuracy of estimated causal effects. One source of bias is _selection bias_. Selection bias is bias that happens because of how individuals were put into the treatment or control groups. 

In terms of our therapy animal example, selection bias might arise if:
* individuals choose whether they want the therapy (dog-lovers might be over-represented in this case)
* individuals in the control group come from a different hospital
* individuals are only able to receive the therapy based on another variable, such as insurance coverage

If any of these variables that are associated with treatment assignment are also related to the outcome variable (cortisol level), they are considered _confounders_, or _confounding variables_. These variables may lead us to incorrect conclusions about the impact of the treatment on the outcome.

The following is a graphical representation of how the confounder X is associated with both the treatment Z and outcome Y:

![Two diagrams with boxes labeled with letters and arrows connecting the boxes. Diagram 1 shows no confounding because the Z box has an arrow pointing to the Y box. Diagram 2 shows confounding because there is an additional X box with arrows pointing to both the Z and Y boxes.](https://static-assets.codecademy.com/Courses/causal-inference/conceptual-foundations/po-ex7-narr.svg)

Confounders and Selection Bias

So how do we deal with confounders and estimate the ATE when the treatment assignment is not randomized? Let’s return to the therapy animal example once more. 

Suppose that instead of randomizing patients to receive therapy animal services, we allow the twelve hospital patients to CHOOSE whether or not they want therapy. Imagine that we also have data for a new confounding variable X that represents whether an individual does (X = 1) or does not (X = 0) have a diagnosis of anxiety disorder. Here, X is a confounder and impacts both the treatment and the outcome: patients who have anxiety might be more likely to choose to receive therapy animal services AND have higher cortisol levels generally. 

This confounding is problematic because it means there may be more people with anxiety in the treatment group than in the control group. More people with anxiety means the treatment group may have a higher average cortisol level compared to that of the control group before therapy animal services even occur!

When the treatment groups are unbalanced with respect to confounders, the treatment groups are not _exchangeable_: we would observe different outcomes if the treatment groups swapped treatment conditions.

To avoid making poor comparisons between potentially imbalanced treatment groups, we have to be able to assume _conditional exchangeability_:
 - _Conditional exchangeability_ means that the treatment groups are exchangeable if we take into account confounding variables.
 - This is also called _ignorability_ or _unconfoundedness_.

By taking anxiety diagnosis (variable X) into account, we avoid getting a biased estimate of the difference in cortisol levels between those who receive therapy animal services and those who do not.
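
One simple way to take X into account is to compare treated and untreated patients separately within each level of X and then average the results. The R sketch below does this on simulated data. The effect size, probabilities, and sample size are assumptions chosen for illustration, and this is only one of several possible adjustment strategies.

```r
# Simulated patients (all parameters are assumptions).
set.seed(11)
n <- 20000
x <- rbinom(n, 1, 0.4)                        # 1 = anxiety diagnosis
z <- rbinom(n, 1, ifelse(x == 1, 0.8, 0.3))   # anxious patients choose therapy more often
y <- 15 + 6 * x - 4 * z + rnorm(n, sd = 2)    # true effect of therapy is -4

# Naive comparison ignores the confounder.
naive <- mean(y[z == 1]) - mean(y[z == 0])

# Difference in means within each level of x.
effect_by_x <- sapply(c(0, 1), function(level) {
  mean(y[z == 1 & x == level]) - mean(y[z == 0 & x == level])
})

# Weight each stratum by its share of the sample to estimate the ATE.
adjusted <- sum(effect_by_x * c(mean(x == 0), mean(x == 1)))

print(c(naive = naive, adjusted = adjusted))
```

The naive estimate should understate the benefit of therapy because the treatment group contains more patients with anxiety, while the stratified estimate should fall close to the simulated effect of -4.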

Estimating the ATE with Confounders

The last exercise brought up one assumption used throughout causal inference: conditional exchangeability. In this lesson, we will learn a few more assumptions.

A second assumption made in causal inference is the _Stable Unit Treatment Value Assumption_ (SUTVA). The name is a mouthful, but it’s a pretty simple assumption that can be broken down into two components:
1. An individual’s treatment assignment doesn’t impact the outcome of other individuals. Using the example of the hospital patients, this would mean that one individual getting therapy animal services doesn’t impact the stress level of other individuals in the hospital.
2. The treatment (or control) is applied exactly the same way to all patients. For example, the patients receiving therapy animal services should receive therapy for the same amount of time and ideally from the same exact animal to ensure consistent treatment.

The next assumption we need to familiarize ourselves with is _overlap_. The assumption of overlap means that all subgroups of patients divided by their characteristics have a positive, non-zero probability of getting either treatment assignment. Overlap is also referred to as the _common support_ or _positivity_ assumption. 


A plot showing probability distributions for age in two situations. The left plot is labeled "Overlap" and shows treatment and control distributions that are nearly the same shape and location covering one another. The right plot is labeled "No Overlap" and shows the treatment distribution covers ages 0 to 38, while the control distribution covers ages 38 to 80.

Assumptions

So far, we’ve only focused on the ATE, but there are many other useful estimands we can use to summarize the average causal effect of some intervention or exposure. In many situations, it is not realistic to meet the assumptions of conditional exchangeability, SUTVA, AND overlap. Other causal estimands allow us to relax some of these assumptions.

The ATE summarizes the causal effect of treatment across ALL individuals, regardless of the treatment condition they were assigned. But we may only be interested in the causal effect of treatment in a particular group. In these cases, we can use a different estimand:
 - _Average Treatment Effect of the Treated_ (ATT) is the average of Y<sup>1</sup> - Y<sup>0</sup> for all individuals assigned to the treatment condition (Z = 1).
 - _Average Treatment Effect of the Control_ (ATC) is the average of Y<sup>1</sup> - Y<sup>0</sup> for all individuals assigned to the control condition (Z = 0).
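
Written with the potential outcomes notation from earlier in the lesson, the three estimands differ only in the group being averaged over. The expectation notation below is an addition for illustration and is not used elsewhere in this lesson.

```latex
\begin{aligned}
\text{ATE} &= E\left[\,Y^{1} - Y^{0}\,\right] \\
\text{ATT} &= E\left[\,Y^{1} - Y^{0} \mid Z = 1\,\right] \\
\text{ATC} &= E\left[\,Y^{1} - Y^{0} \mid Z = 0\,\right] \\
\text{ATE} &= P(Z = 1)\,\text{ATT} + P(Z = 0)\,\text{ATC}
\end{aligned}
```

The last line shows that the ATE is a weighted average of the ATT and ATC, with weights equal to the share of individuals in each treatment condition.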

There are other estimands we may encounter in causal methods, but the ATE, ATT, and ATC are three that we see in a variety of situations.


Other Causal Estimands

A diagram showing two boxes connected by an arrow that points from the left box to the right box. The left box is labeled "Identification" and states, "Determine the causal estimand. Consider whether causal assumptions have been met: 1 Conditional exchangeability, 2 SUTVA, 3 Overlap." The right box is labeled "Estimation" and states, "Carry out statistical modeling. Obtain an estimate of the treatment effect."

Now that you have some experience with the assumptions needed for causal inference as well as familiarity with a few causal estimands, we need to set up a structured way to apply what you’ve learned when approaching causal inference problems.

We can think of causal inference as a two-step process.
1. _Identification:_ During this stage we determine which causal estimand we will estimate based both on what we want to know and on what we are able to compute. We must also determine whether we will be able to meet the three assumptions in order to infer that the relationship is causal in nature.
2. _Estimation:_ Now that we've reasoned about what we can compute and that this measure will reflect a causal relationship between variables, we must fit the statistical model to compute the treatment effect. We used some simple computations throughout this lesson, but we will soon use more complicated methods like multiple linear regression to estimate our estimand.

Causal Inference Process

A comic depicting a conversation between two characters. Character 1 says, "I used to think correlation implies causation. Then I took a statistics class. Now I don't." Character 2 says, "Sounds like the class helped." Character 1 responds, "Well, maybe."

Congratulations! You've completed the Potential Outcomes Framework lesson and now have a solid conceptual foundation for causal inference!

In this lesson, you learned:
 - Causation is different from association in that it implies a relationship where a change in one variable leads to a change in another variable.
 - If we could view both potential outcomes, we could accurately compute the true effect of a treatment.
 - The fundamental problem of causal inference is that we only get to view the observed outcome and not its counterfactual.
 - When randomization is not available to produce estimates of treatment effects, we must use other strategies to predict counterfactuals and find our estimand of interest.
 - The three main assumptions for causal inference are:
   - Conditional exchangeability
   - SUTVA
   - Overlap
 - The causal inference process can be thought of as two steps: identification and estimation.


Review

Potential Outcomes Framework

Propensity scores reflect the probability of being in the treatment group based on other observed characteristics.

Propensity scores reflect the probability of being in the treatment group based on all observed AND unobserved variables.

Propensity scores describe the probability of the treatment having a positive effect.

Propensity scores can ONLY be modeled with one variable.

Misspecification of the propensity score model may lead to biased estimates of the treatment effect.

A misspecified propensity score model could lead to different sample sizes in the treatment and control groups.

If the propensity score model is misspecified, the initial balance and overlap of variables may not be adequate.

Estimate treatment effect or update propensity score model

Sample sizes of the treatment and control groups.

Poor balance because the standardized mean differences are not close to 0 and the variance ratios are different from 1.

Poor balance because the control group has a sample size of 120 while the treatment group only has 46.

Good balance because one standardized mean difference is positive while the other is negative.

Good balance because both of the variance ratios are positive.

```
Balance Measures
              Type Diff.Un V.Ratio.Un
variable_1 Contin.  0.2141     1.5202
variable_2 Contin. -0.4405     0.6732

Sample sizes
    Control Treated
All     120      46
```

`variable.1` and `variable.3` both exhibit good balance because the adjusted SMD for both variables is within the range of -0.1 to 0.1.

`variable.1` and `variable.2` because the unadjusted SMD for both variables is outside of the range of -0.1 to 0.1.

Only `variable.3` because the adjusted SMD is smaller than 0.1.

None of the variables have good balance because the unadjusted SMD and adjusted SMD are not equal.

`glm(outcome ~ treat + var1 + var2, data = test_data, weights = wt_iptw$weights)`

`glm(treat ~ var1 + var2, data = test_data, weights = wt_iptw$weights)`

`glm(outcome ~ treat + var1 + var2, data = test_data)`

```r
# import library
library(WeightIt)
# perform IPTW in weightit object
wt_iptw <- weightit(
  formula = treat ~ var1 + var2,
  data = test_data,
  estimand = "ATT",
  method = "ps"
)
```

The second code block will provide robust standard error estimates whereas the first will not.

The second code block uses stabilized IPTW weights whereas the first block does not.

The second block models the outcome variable using logistic regression whereas the first block models the outcome using linear regression.

Test your knowledge of matching and weighting methods.

Matching and Weighting Quiz

In this project, you will use inverse probability of treatment weighting (IPTW) to estimate the causal effect of cover crop usage on wheat crop yields.

Let's start by loading the dataset. We've provided a file named **farms.csv** in the workspace. Load this file into a dataframe named `farm_df`.

Take a look at the head of the dataframe `farm_df`. Make sure to click through the arrows in the header to see all available variables.

<h6>Outcome:</h6>

 - `total_yield`: represents the average total yield of wheat in bushels per acre.

<h6>Treatment:</h6>

 - `cover_10`: indicates whether at least 10% of farms in a county employ cover crops.

<h6>Predictors:</h6>

 - `region`: geographic region where the county is located.
 - `total_avg`: average total size of farms (thousands of acres).
 - `age_avg`: average age of the primary farm operator (years).
 - `experience_avg`: average number of years of farming experience of the primary farm operator (years).
 - `insurance_avg`: average area of land with crop insurance (thousands of acres).
 - `easement_p`: indicates the average percent of land under an easement in the county.
 - `conservation_till_avg`: average area of land that uses conservation tillage methods (acres).
 - `fertilizer_per_area`: average cost of fertilizers used per acre (hundreds of thousands of dollars).


First, let's assess overlap and balance visually for a couple of variables to compare counties with AT LEAST 10% of farms using cover crops and counties with LESS THAN 10% of farms using cover crops.

Create a balance plot for the `age_avg` variable. Do the treatment and control distributions appear to be centered in the same location and have similar spreads?

Now create a balance plot for the categorical variable `region`. Are the proportions of counties in the treatment versus control groups similar across the four regions?

So far, it looks like we have quite a bit of imbalance between the treatment groups to deal with. Let's assess balance numerically to quantify this imbalance more precisely. Create a balance table to show standardized mean differences (SMD) and variance ratios for all the predictor variables according to the treatment group. Check whether the balance measurements fall outside of the guidelines of ±0.1 for SMDs and between 0.5 and 2 for variance ratios.
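
To make the two balance measures concrete, here is a base R sketch that computes them by hand for one simulated covariate. The data and the helper functions `smd()` and `variance_ratio()` are hypothetical, and the SMD uses one common convention for the denominator (the square root of the average of the two group variances). Balance-table functions in R packages may use a different convention by default.

```r
# Simulated covariate for a treated and a control group (values are assumptions).
set.seed(3)
n     <- 2000
treat <- rbinom(n, 1, 0.3)
covariate <- rnorm(n,
                   mean = ifelse(treat == 1, 58, 56),
                   sd   = ifelse(treat == 1, 3, 3.5))

# Standardized mean difference: gap in means divided by a pooled standard deviation.
smd <- function(x, treat) {
  pooled_sd <- sqrt((var(x[treat == 1]) + var(x[treat == 0])) / 2)
  (mean(x[treat == 1]) - mean(x[treat == 0])) / pooled_sd
}

# Variance ratio: treated variance divided by control variance.
variance_ratio <- function(x, treat) {
  var(x[treat == 1]) / var(x[treat == 0])
}

cov_smd <- smd(covariate, treat)
cov_vr  <- variance_ratio(covariate, treat)

# Apply the guidelines: SMD within +/- 0.1 and variance ratio between 0.5 and 2.
balanced <- abs(cov_smd) <= 0.1 && cov_vr >= 0.5 && cov_vr <= 2
print(c(smd = cov_smd, variance_ratio = cov_vr, balanced = balanced))
```

In this simulated case the variance ratio is acceptable but the SMD is well outside ±0.1, so the covariate would be flagged as imbalanced.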

Now it's time to perform the inverse probability of treatment weighting (IPTW) procedure to see if we can minimize imbalance between the treatment groups before estimating the causal treatment effect.

To get the most out of this procedure, we have to carefully think about the form of the propensity score model. In other words, we should think about what variables in the dataset are predictive of cover crop use. Let's start with a propensity score model with a limited set of variables:

 - `region`: The ability to grow cover crops depends in part on regional variation in climate.
 - `total_avg`: Cover crops might be more feasible to implement on farms that are smaller.
 - `insurance_avg`: Many farmers have crop insurance, but certain states offer subsidies to offset insurance costs if farmers decide to use cover crops.
 - `fertilizer_per_area`: Cover crops require the use of fertilizer. Higher fertilizer costs might prevent some farmers from using cover crops.

Perform an IPTW procedure using region, average farm size, average area of land with crop insurance, and average cost of fertilizer per acre as predictors in the propensity score model. Specify in your code to calculate weights in order to estimate the average treatment effect (ATE). Save the output to a weighting object named `farm_iptw`.
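
For intuition about what an IPTW procedure computes, the base R sketch below builds ATE weights by hand on simulated data. It is not the project solution: the single confounder `farm_size`, its effect on treatment, and the helper `weighted_mean()` are assumptions for illustration, and a dedicated weighting package would normally handle these steps.

```r
# Simulated data with one confounder that affects treatment (values are assumptions).
set.seed(5)
n         <- 5000
farm_size <- rnorm(n)                              # hypothetical standardized covariate
treat     <- rbinom(n, 1, plogis(-0.5 - farm_size)) # smaller farms are treated more often

# Step 1: model the propensity score with logistic regression.
ps_model <- glm(treat ~ farm_size, family = binomial())
ps       <- fitted(ps_model)

# Step 2: ATE weights are 1 / ps for treated units and 1 / (1 - ps) for controls.
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Step 3: compare the covariate gap before and after weighting.
weighted_mean <- function(x, w) sum(w * x) / sum(w)
gap_before <- mean(farm_size[treat == 1]) - mean(farm_size[treat == 0])
gap_after  <- weighted_mean(farm_size[treat == 1], w[treat == 1]) -
              weighted_mean(farm_size[treat == 0], w[treat == 0])

print(c(gap_before = gap_before, gap_after = gap_after))
```

Units that were unlikely to receive the condition they actually received get larger weights, which pulls the weighted groups toward the same covariate distribution.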

We need to perform a check of the IPTW procedure to see whether we have achieved a better balance between groups. Let's look at the SMD values visually to see whether they now fall within our ±0.1 guidelines.

Create a Love plot showing the SMD between treatment groups for each variable in the propensity score model. Be sure to display the SMD for binary variables and the SMD threshold of ± 0.1.

The balance is pretty good, but let's see if we can improve it further by tweaking the propensity score model.

Re-run the IPTW procedure but with a new propensity score model. This time, remove the `fertilizer_per_area` variable from the model. Then add to the model:
 - the average age of the primary farm operator
 - the average experience of the primary farm operator
 - the average percent of land under easement
 - the average area of land using conservation tillage

Save the new model as `farm_iptw2`.

Let's find out whether our expanded propensity score model leads to a better balance in the weighted data.

Create a Love plot showing the SMD between treatment groups for each variable using the new propensity score model. Be sure to display the SMD for binary variables and the SMD threshold of ±0.1.

The Love plot shows SMD values closer to zero than the previous model! We can also inspect the distribution of propensity scores between treatment groups to see how well the IPTW procedure worked.

Generate a balance plot to display the distribution of propensity scores in each treatment group before AND after the weighting process. Do the distributions of the weighted propensity scores look closer to identical, and are they overlapping each other?

You're almost there &mdash; great job so far! The last step of the analysis is to fit the final outcome model to estimate the causal effect of cover crop usage on crop yields.

First, let's fit the outcome regression model with total crop yield as the outcome, cover crop usage as the treatment variable, and the other variables from the second propensity score model as the additional predictors. Remember to include the weights from the IPTW procedure in the regression model. Save this regression model to an object named `yield_mod`.

We're not quite done yet! Remember that when we use IPTW to estimate the causal treatment effect, we need to use a robust standard error estimate.

Estimate the regression parameters for the weighted regression model using the `coeftest()` function. Incorporate a robust standard error estimator from the sandwich package.

Take a look at the regression parameter for the treatment variable. How would you interpret this value? What do you think it says about the effect of cover crops on wheat yields? Write out your interpretation of the results.

Impact of Cover Crops on Wheat Crop Yields

Congratulations! 

You have just been hired by the local health officer to estimate the effect of having high cholesterol on blood pressure. You have a dataset from a hospital and are eager to apply your new knowledge of causal inference. 

The dataset includes the following variables on 100 patients:
 - High cholesterol &mdash; whether or not the patient has high cholesterol
 - Systolic blood pressure &mdash; measured in mmHg
 - Diastolic blood pressure &mdash; measured in mmHg
 - Age &mdash; the age of the patient in years
 - Smoking status &mdash; whether or not the patient smokes tobacco products

A primer on blood pressure:
 - Systolic blood pressure is the force the heart exerts on the walls of the arteries DURING each beat.
 - Diastolic blood pressure is the force the heart exerts on the walls of the arteries BETWEEN beats.
 - Normal blood pressure is generally defined as having a systolic pressure less than or equal to 120 mmHg and a diastolic pressure less than or equal to 80 mmHg.

Your first task is to estimate the effect of having high cholesterol on systolic blood pressure. 

Cholesterol is the treatment variable. Blood pressure is the outcome variable. Age and smoking status are _covariates_ (variables that are not the variables of interest).

There are many risk factors for high blood pressure, including obesity, diabetes, age, and family history. This is a problem because you will have trouble studying cholesterol (the treatment) independently from these other factors.

Because we cannot randomize who has high cholesterol, other variables will likely be imbalanced between patients with high cholesterol (treatment group) and patients without high cholesterol (control group). For example, the high cholesterol group might be older overall or include more smokers.

But you have some tools to address this problem. The first is randomization.

Randomization makes the estimation of causal effects easier. In randomized studies, the covariates in the treatment and control groups look similar to each other (e.g., there are about as many smokers in both groups), so treatment status is the only systematic difference between them. By contrast, in non-randomized studies, the covariates of the treatment and control groups may look very different (e.g., more smokers in the high cholesterol group).

You can balance the covariates with matching and weighting methods. 

One matching method is to first separate the data into subgroups, or _strata_ (_stratification_), based on some variable such as smoking. 

First, start by dividing the group into the smokers and non-smokers. Looking at just the smokers, compare the incidence of high cholesterol and high blood pressure. Now do the same for the non-smokers. This article dives more into the specifics, but that is the basic idea: divide the data into groups based on some variable and compare within that group.

Because the individuals _within the subgroups_ are similar to one another, you have satisfied the _conditional exchangeability assumption_ of causal inference: if treatment assignments were swapped, the outcomes would still be the same within each subgroup, on average.

Either categorical or numeric variables can be used to split the data into subgroups, but the process differs slightly depending on which type of variable is used.

## Stratification on Categorical Variables

Nicotine (the chemical in tobacco products) causes physical changes that cause both high blood pressure and high cholesterol. Because smoking tobacco is related to both high blood pressure (outcome) and high cholesterol (treatment), you may not be able to estimate the effect of high cholesterol on blood pressure if you do not take smoking into account. 

The following two plots show what stratifying the blood pressure data according to tobacco use looks like.

![Two bar charts showing counts of normal cholesterol and high cholesterol individuals. The left plot shows 48 people for normal and 52 people for high. The right plot shows the same two bars but each is further split by smoking status. There is a greater proportion of non-smokers in the normal cholesterol group than in the high cholesterol group.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/strat-bars.svg)

In the first plot, we see that the number of individuals with high cholesterol (n = 52) is approximately the same as the number who have normal cholesterol (n = 48). However, the second plot shows that the ratio of smokers to non-smokers in each group is quite different. Among individuals with normal cholesterol, 33 (69%) are non-smokers; among those with high cholesterol, only 22 (42%) are non-smokers.
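As a quick check, the percentages quoted above can be reproduced in R from the counts in the plot. This is only a sketch: the smoker counts (15 and 30) are inferred by subtracting the non-smoker counts from the group totals, and the object names are illustrative.

```r
# Counts from the stratified bar chart. Smoker counts are the group
# totals (48 and 52) minus the non-smoker counts (33 and 22).
counts <- matrix(
  c(33, 15,
    22, 30),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cholesterol = c("normal", "high"),
    smoking = c("non_smoker", "smoker")
  )
)

# Row proportions: the share of non-smokers and smokers
# within each cholesterol group
smoking_props <- prop.table(counts, margin = 1)
print(round(smoking_props, 2))
```

If the two rows of `smoking_props` were nearly identical, smoking status would already be balanced across the cholesterol groups.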

If we don't take smoking status into account when comparing the high and normal cholesterol groups, we would be comparing two very different groups. In other words, we would be violating the conditional exchangeability assumption of causal inference. This could lead to a biased estimate of the treatment effect because we would be ignoring the fact that smoking is related to blood pressure.

## Estimation of Causal Effects

By separating the observations by smoking status, we have stratified our data into two subgroups. When using stratification, causal effects are not estimated directly from the entire sample, but rather within each subgroup. We can view this process as three steps:

  1. Find the average treatment effect (ATE) for each subgroup by subtracting the control group average outcome from the treatment group average outcome within each subgroup.
  2. Multiply each subgroup ATE by the proportion of individuals in that subgroup.
  3. Add all weighted ATEs together.

The table below shows the calculation of the average blood pressure in the treatment and control groups for smokers and non-smokers.

<table class = "table table-condensed">
  <thead>
    <tr class="header">
      <th></th><th colspan="2"><b>Average Systolic Blood Pressure (mmHg)</b></th>
    </tr>
    <tr>
      <th></th><th><b>Non-smoker</b></th><th><b>Smoker</b></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>High Cholesterol</b></td><td>139</td><td>155</td>
    </tr>
    <tr>
      <td><b>Normal Cholesterol</b></td><td>110</td><td>117</td>
    </tr>
  </tbody>
</table>
 
Let's follow the three steps above to estimate the treatment effect of high cholesterol when we stratify on smoking status.

First, we calculate the average treatment effect within each subgroup:

```tex
 \text{ATE}_{Non-smokers} = 139 - 110 = 29 mmHg 
```

```tex
\text{ATE}_{Smokers} = 155 - 117 = 38 mmHg
```


Next, we multiply the subgroup ATEs by the proportions of each subgroup. In this case, out of 100 patients, 55 were non-smokers and 45 were smokers.



```tex
\textbf{Non-smokers}
\hspace{1cm}
29 * \frac{55}{100} = 15.95 mmHg
```

```tex
\textbf{Smokers}
\hspace{1cm}
38 * \frac{45}{100} = 17.1 mmHg
```


Finally, to estimate a causal effect for ALL patients, we sum the weighted subgroup ATEs.

```tex
\text{Weighted ATE} = 15.95 + 17.1 = 33.05 \approx 33 mmHg
```


The weighted ATE of 33 mmHg suggests that, when accounting for smoking status, individuals with high cholesterol have on average a systolic blood pressure 33 mmHg higher than that of individuals with normal cholesterol.
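The three steps can also be sketched in R using the averages from the table and the subgroup sizes given above. The object names here are illustrative and not part of any dataset.

```r
# Average systolic blood pressure (mmHg) from the table above
avg_bp <- matrix(
  c(139, 155,
    110, 117),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    cholesterol = c("high", "normal"),
    smoking = c("non_smoker", "smoker")
  )
)

# Number of patients in each smoking subgroup
n_stratum <- c(non_smoker = 55, smoker = 45)

# Step 1: ATE within each subgroup (treatment mean minus control mean)
ate_stratum <- avg_bp["high", ] - avg_bp["normal", ]

# Step 2: weight each subgroup ATE by the subgroup's share of the sample
stratum_share <- n_stratum / sum(n_stratum)
weighted_parts <- ate_stratum * stratum_share

# Step 3: add the weighted ATEs together
weighted_ate <- sum(weighted_parts)

print(ate_stratum)
print(weighted_ate)
```

Keeping the subgroup ATEs in a named vector makes it easy to see which stratum contributes most to the overall estimate.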

## Stratification on Numeric Variables

Stratification by numeric variables requires a little more effort. Unlike categorical variables, which have just a few subgroups, numeric variables can vary greatly. This means that without some kind of manipulation of the data, we could end up with a huge number of subgroups each containing very few observations.

Let's think about age first. If we have 100 people aged 20 to 80 (a 60-year span), we could have up to 61 different ages to work with, and many of those ages would be shared by only one or two people.

Rather than treating every unique age as its own subgroup, we could lump them together into ranges or _bins_. Let's start with three age groups: younger than 40, between 40 and 60, and older than 60.
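In R, one way to create bins like these is the base function `cut()`. The sketch below uses a made-up vector of ages and assumes ages are recorded in whole years, so a break at 39 separates "younger than 40" from "40 to 60".

```r
# Hypothetical ages in whole years (not from the course dataset)
age <- c(23, 39, 40, 55, 60, 61, 78)

# cut() builds right-closed intervals by default:
# (-Inf, 39], (39, 60], (60, Inf)
age_group <- cut(
  age,
  breaks = c(-Inf, 39, 60, Inf),
  labels = c("under 40", "40 to 60", "over 60")
)

# Number of observations that fall in each bin
print(table(age_group))
```

The choice of break points is a judgment call. Wider bins give larger subgroups but group together people who may differ more in age.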

![Scatter plot of systolic blood pressure versus age with points grouped by cholesterol status. Blood pressure and age show a positive relationship. High cholesterol points appear to be associated with higher blood pressure. two vertical lines split the graph into three age segments: 20 to 40, 40 to 60, and 60 to 80.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/strat-lines.gif)

Now that we have defined subgroups for age, we can estimate the average treatment effect of high cholesterol. We can follow the same three steps we did when stratifying on smoking status.

![The same scatter plot of blood pressure versus age but with additional horizontal lines marking average blood pressure for each cholesterol group. The gap between the groups increases with increasing age. For ages 20 to 40, high is 138 and normal is 116. For ages 40 to 60, high is 154 and normal is 121. For ages 60 to 80, high is 169 and normal is 127.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/strat-diffs.gif)

We look at each subgroup separately and subtract the average blood pressure of the control group from the average blood pressure of the treatment group to get each subgroup's ATE.


```tex
\textbf{Age < 40}
\hspace{1cm}
\text{ATE} = 138 - 116 = 22 mmHg
```



```tex
\textbf{40$\leq$ Age $\leq$ 60}
\hspace{1cm}
\text{ATE} = 154 - 121 = 33 mmHg
```



```tex
\textbf{Age > 60} \hspace{1cm}\text{ATE} = 169 - 127 = 42 mmHg
```


Then, we weight each subgroup's ATE by the proportion of observations in that subgroup.



```tex
\textbf{Age < 40}
\hspace{1cm}
22 * (14/100) = 3.08 mmHg
```


```tex
\textbf{40$\leq$ Age $\leq$ 60}
\hspace{1cm}
33 * (56/100) = 18.5 mmHg
```


```tex
\textbf{Age > 60}
\hspace{1cm}
42 * (30/100) = 12.6 mmHg
```


Finally, we add the weighted averages together.


```tex
\text{Weighted ATE} = 3.08 + 18.5 + 12.6 = 34.18 \approx 34.2 mmHg
```


The weighted average treatment effect of about 34.2 mmHg suggests that, when accounting for age, individuals with high cholesterol have on average a systolic blood pressure about 34.2 mmHg higher than that of individuals with normal cholesterol.
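When working from raw data rather than precomputed averages, the same three steps can be wrapped in a small helper. The function `stratified_ate()` and the toy data below are hypothetical; this is a minimal sketch that assumes a 0/1 treatment indicator and at least one treated and one control observation in every stratum.

```r
# Weighted ATE across strata: a sketch of the three-step procedure
stratified_ate <- function(outcome, treated, stratum) {
  rows_by_stratum <- split(seq_along(outcome), stratum)

  # Step 1: treatment mean minus control mean within each stratum
  ate <- sapply(rows_by_stratum, function(rows) {
    y <- outcome[rows]
    t <- treated[rows]
    mean(y[t == 1]) - mean(y[t == 0])
  })

  # Step 2: each stratum's share of all observations
  share <- lengths(rows_by_stratum) / length(outcome)

  # Step 3: sum of the weighted stratum ATEs
  sum(ate * share)
}

# Made-up data: 4 patients under 40 and 6 patients aged 40 to 60
toy <- data.frame(
  bp = c(130, 134, 110, 114, 150, 154, 120, 122, 124, 126),
  high_chol = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0),
  age_group = c(rep("under 40", 4), rep("40 to 60", 6))
)

toy_ate <- stratified_ate(toy$bp, toy$high_chol, toy$age_group)
print(toy_ate)
```

In the toy data the stratum ATEs are 20 and 29, so the weighted result is 20 * 0.4 + 29 * 0.6.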

## Limitations

You may notice that we got different estimates for the ATE depending on which variable we used to stratify the data, which leads to an important question: which estimate should we trust? 

The estimated ATE is just that: an estimate. We could refine our estimate of the ATE by stratifying on both smoking status AND age, which would require us to estimate an ATE within each of the six strata (two smoking groups times three age groups) and then combine them in a weighted average. However, stratifying on too many variables comes with its own limitations.

Stratification is useful primarily in _low-dimensional_ data, which is data where there are very few observed variables. Stratification on multiple variables can lead to very small subgroups, making estimation of causal effects less reliable. Further, stratification is not a particularly efficient use of data. If the overlap assumption is not met for a subgroup, we cannot estimate a causal effect using that data.

## Assumptions

The three main assumptions for causal inference apply for stratification.

  1. _Conditional exchangeability_: This assumption states that as long as we account for confounders, we would obtain the same outcomes if the groups swapped treatment assignments.
  2. _Overlap_: This assumption requires that individuals in each subgroup formed by the stratification process must have a positive probability of being in either treatment group. This assumption is necessary because we cannot calculate an average causal effect within a subgroup where 100% of the observations are in either the treatment or control group.
  3. _Stable unit treatment value assumption (SUTVA)_: SUTVA states that the potential outcomes of an individual are not affected by the treatment assignment of any other individuals in the same stratum and that the treatment is exactly the same across all individuals.
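Of the three assumptions, overlap is the one that can be checked directly from the data. The sketch below counts treated and control observations in each stratum and flags strata that lack one of the groups. The helper `check_overlap()` and the toy vectors are hypothetical, and the treatment is assumed to be coded 0/1.

```r
# Count treated and control observations per stratum and flag
# strata where one of the two groups is missing
check_overlap <- function(treated, stratum) {
  counts <- table(
    stratum = stratum,
    treated = factor(treated, levels = c(0, 1))
  )
  data.frame(
    stratum = rownames(counts),
    n_control = as.vector(counts[, "0"]),
    n_treated = as.vector(counts[, "1"]),
    overlap = as.vector(counts[, "0"] > 0 & counts[, "1"] > 0)
  )
}

# Made-up example: every smoker is in the treatment group
stratum <- c("non_smoker", "non_smoker", "non_smoker", "smoker", "smoker")
treated <- c(1, 0, 0, 1, 1)

overlap_table <- check_overlap(treated, stratum)
print(overlap_table)
```

A stratum with `overlap = FALSE` cannot contribute a within-stratum treatment effect, so it would need to be merged with another stratum or left out of the estimate.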

## Review

In this article, we learned that:
 - Stratification is a matching method for causal inference that is best used with low-dimensional data where treatment and control groups are unbalanced across variables.
 - Stratification rebalances observations by creating subgroups or "strata" based on variable values.
 - Causal effects can then be estimated by comparing treatment and control groups within these subgroups.
 - After estimating causal effects within strata, the overall average treatment effect may be estimated by taking a weighted average.


Learn about stratification for causal inference.

Introduction to Stratification

One of the challenges in using observational data for causal inference is that the treatment and the control groups are not similar at baseline (before there is any treatment). In other words, treatment aside, the two groups may differ from one another in ways that affect their outcomes. 

We know randomized experiments are the gold standard for creating balanced treatment and control groups because, in randomized experiments, the two groups only differ by treatment status. However, we can mimic this balancing effect when working with nonrandomized data. Two popular methods for mimicking randomization are matching and weighting. New and more complex variations of these techniques are constantly being developed and employed, but this article will introduce you to the fundamental ideas behind them.

## Unbalanced Observational Data

Let's consider an example of nonrandomized data. Imagine we're trying to estimate the causal effect of tablet ownership on weekly screentime. We don't have the funding to run an experiment where we give out tablets to participants, so we survey people to ask them whether they own a tablet and how many hours per week they use devices with screens.

Let's assume a binary treatment status: the value is 1 if the person has a tablet and is 0 otherwise. We notice that those who have tablets have higher weekly screentime on average.

We might already guess that this difference does not necessarily imply a causal effect of tablet ownership. Because this is an observational study, other factors could be involved. For example, suppose that the group that owns a tablet also has a higher number of social media accounts. Because social media use may also affect weekly screentime, we don't know whether the observed difference in screentime is due to social media use or tablet ownership (or something else entirely!).

Suppose the distribution of the number of social media accounts for each group is shown in the plot that follows.

![Bar plot of the frequency of the number of social media accounts split by tablet ownership. The distribution of accounts for the no-tablet group shows most observations with between 2 and 6 accounts with a peak at 4. The distribution of accounts for the tablet group has most observations between 5 and 9 accounts with a peak at 6.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/mw-obs.svg)

As we can see, the two distributions are different: the treatment group (tablet) appears to have more social media accounts than the control group (no tablet). These groups are different beyond just their treatment status!

In a randomized experiment, we would expect something more balanced like in the following plot.

![Bar plot of the frequency of the number of social media accounts split by tablet ownership. The distributions of accounts for the tablet and no-tablet groups are nearly identical centered on 5 accounts and spread in a bell shape between 0 and 10 accounts.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/mw-exp.svg)

Matching and weighting methods make the two groups distributionally similar in terms of _observed variables_, or the additional attributes that we can measure and record. In the example above, the number of social media accounts is an observed variable. We can use matching or weighting methods to make the distribution of social media accounts similar for the tablet and no-tablet groups.

## Matching

In matching, we match observations in the control group to observations in the treatment group based on the observed variables. Matching usually ensures that the two groups are similar in terms of those observed variables. Any unmatched observations are removed.

Let's consider a simple example where we match observations only on the social media accounts variable. We match individuals from the treatment group with individuals in the control group based on how many social media accounts they have. Then we discard any individuals without a match. 

We can imagine this as though we are drawing a line across our distribution that cuts every taller bar to the size of the shorter bar in the pair, as illustrated in the following diagram.

![The same bar plot of the frequency of the number of social media accounts split by tablet ownership. The distribution of tablet owners shows more social media accounts than that of the no-tablet group. A line is drawn in a bell shape cutting every taller bar of one group to match the shorter group of the other. The portions of tall bars above the line are faded.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/mw-match.svg)

The unmatched observations above our line, such as some people with tablets and 3 accounts or some people without tablets and 7 accounts, will be discarded. We are left with distributions of treatment groups that are centered on 5 accounts and have similar spreads in a bell shape. This new sample looks a lot like the hypothetical randomized experiment data, only with fewer observations in total. This is a very simple example using only one variable, but we can extend this concept to match across multiple variables, or even combinations of variables, to form groups that mimic what we might expect from randomization.
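The "cut every taller bar down to the shorter one" idea can be sketched in base R. The helper `exact_match()` and the survey data are hypothetical. For simplicity the sketch keeps the first rows in each group, whereas a real analysis would typically choose matches at random or use a dedicated matching package.

```r
# Exact matching on one variable: within each value of the matching
# variable, keep equal numbers of treated and control rows
exact_match <- function(df, treat_col, match_col) {
  matched_pieces <- lapply(split(df, df[[match_col]]), function(s) {
    treated <- s[s[[treat_col]] == 1, , drop = FALSE]
    control <- s[s[[treat_col]] == 0, , drop = FALSE]
    k <- min(nrow(treated), nrow(control)) # height of the shorter bar
    rbind(head(treated, k), head(control, k))
  })
  do.call(rbind, matched_pieces)
}

# Made-up survey: tablet ownership (1 = owns a tablet) and number of accounts
survey <- data.frame(
  tablet = c(1, 1, 1, 0, 0, 0, 0, 1),
  accounts = c(5, 5, 6, 5, 4, 4, 6, 7)
)

matched <- exact_match(survey, "tablet", "accounts")
print(table(accounts = matched$accounts, tablet = matched$tablet))
```

After matching, each number of accounts has the same count of tablet owners and non-owners, at the cost of dropping half of the original rows.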


## Weighting

One of the drawbacks of matching is that we discarded some of our observations, making our sample size smaller. Losing information like that, especially with a small study, may mean losing _power_, or the ability to detect an existing effect. This is where weighting has an advantage.

The idea behind weighting is similar to matching: to find control observations that are similar to the treatment group. However, instead of keeping or dropping observations, we can give them weights. We can think of a weight as a number that indicates how much we want an observation to be represented in our group. Let's look at the illustration again, this time with weighting employed.

![The same bar plot of the frequency of the number of social media accounts split by tablet ownership. The distribution of tablet owners shows more social media accounts than that of the no-tablet group. Arrows pointing up are on top of the tablet group bars for 3 and 4 accounts, which are both shorter than the bars for the no-tablet group at those numbers. Arrows pointing up are on top of the no-tablet group bars for 6, 7, and 8 accounts, which are all shorter than the bars for the tablet group at those numbers.](https://static-assets.codecademy.com/Courses/causal-inference/matching-and-weighting/mw-weight.svg)

Instead of discarding unmatched individuals, we give larger weights to the individuals that are fewer in frequency at each number of accounts. For example, we give more importance to people who have a tablet and only 4 accounts along with people who don't have a tablet and have 8 accounts. This is the same as giving more importance to the outcomes of individuals who look more like people in the opposite treatment group. By manipulating how much each individual's outcome counts for, we are able to make the two groups look more alike and still keep all of our observations!

We should note that, while not used here, we could have also _down-weighted_ the outcomes of those that already look a lot like their own treatment group. This means giving these outcomes smaller weights so that they have less impact on our computation of the treatment effect.

## Estimands

Because we use matching and weighting methods before estimating a treatment effect, these techniques are part of the identification phase of causal inference. We are using these methods to create groups that are balanced based on observed traits. Assuming the groups don't differ in some unobserved way, this means the only difference between them is the treatment. This way we can justify inferring a causal relationship between the treatment and the outcome variable.

The way we are able to compose these groups determines which estimand we're trying to compute. For example, if we have equal numbers of observations in each treatment group, and we match each observation to a counterpart in the other group, we may estimate the average treatment effect (ATE) because the groups stand as counterfactuals for each other.

Real data is often not so tidy, however. For this and other reasons, researchers frequently use the average treatment effect on the treated (ATT).

When estimating the ATT using matching:
1. All of the treated observations are kept.
2. Controls are matched to the treated individuals based on variable values.
3. Any unmatched controls are removed from the dataset.

Similarly, to estimate the ATT using weighting:
1. The treated observations are given a weight of 1.
2. The control group is made similar to the treatment group by up-weighting some controls (weights greater than 1) and down-weighting others (weights less than 1).
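One simple way to build ATT weights is shown below. It treats each distinct number of social media accounts as its own stratum and gives every control a weight equal to the number of treated divided by the number of controls in that stratum. This is a sketch under those assumptions, with a hypothetical helper `att_weights()` and made-up data; it is not necessarily how weights are constructed elsewhere in this course.

```r
# ATT weights within exact strata: treated observations get weight 1,
# controls get (number treated) / (number of controls) in their stratum
att_weights <- function(treated, stratum) {
  n_treated <- tapply(treated == 1, stratum, sum)
  n_control <- tapply(treated == 0, stratum, sum)
  control_weight <- n_treated / n_control
  unname(ifelse(treated == 1, 1, control_weight[as.character(stratum)]))
}

# Made-up data: 1 = owns a tablet, 0 = does not
tablet <- c(1, 1, 0, 1, 0, 0, 0)
accounts <- c(6, 6, 6, 4, 4, 4, 4)

w <- att_weights(tablet, accounts)
print(w)

# After weighting, the controls have the same average number of
# accounts as the treated group
treated_mean <- mean(accounts[tablet == 1])
control_mean <- weighted.mean(accounts[tablet == 0], w[tablet == 0])
print(c(treated = treated_mean, weighted_control = control_mean))
```

The single control with 6 accounts is up-weighted to 2, while each of the three controls with 4 accounts is down-weighted to one third. A stratum with treated observations but no controls would produce an infinite weight, which is another way of seeing why overlap matters.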

## Review

To recap, in randomized experiments, the distributions of variables in the treatment and control groups are similar on average. This is a major advantage of randomization.

We learned that matching and weighting methods are ways of manipulating treatment and control groups in an observational study to make them similar in terms of observed traits.

So, why bother with randomized experiments in the first place? Why not use observational studies all the time? After all, they're cheaper and easier to do. The reason is as follows: While matching and weighting create balance with respect to traits that can be observed, randomization creates balance in terms of ALL traits, observed and unobserved. Randomization may be the top choice for estimating causal effects, but matching and weighting methods are very helpful when that's not possible!


Introduction to Matching and Weighting Methods

Apply regression discontinuity design (RDD) to estimate the causal effect of a snow emergency system on average commuting time in a fictitious city.

First, let's import the dataset for this project. We've provided a file named **snow.csv** in the workspace. Load this file into a dataframe named `snow_df`.

Now that the dataset is imported and ready to work with, let's take a look at the first few rows of the dataframe `snow_df`. This dataset contains several variables:

<h6> Forcing Variable </h6>

* `snowfall`: The amount of snowfall on recorded date (inches)

<h6> Treatment Variable </h6>

* `emergency`: Whether or not a snow emergency was issued on the recorded date ("Emergency" vs. "No Emergency")

<h6> Outcome Variable </h6>

* `minutes`: Average commute time on recorded date (minutes)

<h6> Others </h6>

* `date`: Recorded date

Use the `ggplot2` package to create a scatter plot of the data with the forcing variable on the x-axis and the outcome variable on the y-axis. Use colors and shapes to distinguish between dates with a snow emergency and dates without a snow emergency.

Save the base plot as `scatter_base` and view the plot.

Let's add a little more detail to the plot. Add a vertical dashed line to `scatter_base` at the snowfall cutoff that determines emergency status. Save the updated plot to `scatter_cutpoint` and print it to the console to view the plot.

Does it look like this is a sharp RDD or a fuzzy RDD?

To verify that there is actually a discontinuity in the outcome at the cutpoint, add linear best fit lines to `scatter_cutpoint`. Separate best fit lines should be plotted for dates with a snow emergency and dates without a snow emergency. Save this plot as `scatter_fit` and print it to view the plot.

Does there appear to be a discontinuity in the outcome variable around the cutpoint?

Calculate the optimal IK bandwidth for the data and save the result to `snow_ik_bw`. Print `snow_ik_bw` to the console. Is the bandwidth large or small relative to the scale of the forcing variable?

Add solid vertical lines to `scatter_cutpoint` to plot the range of the optimal bandwidth stored in `snow_ik_bw`. Save the updated plot to `scatter_bw` and print it to the console to view the plot.

Does a lot of the data fall outside of the bandwidth? How do you think this could affect the treatment effect estimate?

Using the correct cutpoint and optimal bandwidth you just calculated, fit a local linear regression model using the `RDestimate()` function. Save the results to `snow_rdd`.

Print the local linear regression results stored in `snow_rdd`. Take note of the estimate of the LATE across different bandwidths. Are the values similar?

Print the number of observations in each of the three bandwidths provided in `snow_rdd`. How does sample size change across the optimal bandwidth, half the bandwidth, and double the bandwidth?

Print the standard errors of the LATE estimates at each bandwidth provided in `snow_rdd`. How did the standard errors change as the sample size increased?

Great job! Using the optimal bandwidth of 2.63 inches, we calculated a local average treatment effect (LATE) of -11.04. This means that in this dataset, the use of the snow emergency system resulted in an average decrease in commute time of 11.04 minutes.

Given what we know about interpreting the LATE, think about where this causal effect is generalizable &mdash; do you think the bandwidth was wide enough to cover many different snowfall events?

Effect of Emergency Weather Systems on Transit Times

Learn about Regression Discontinuity Design (RDD) and when it's useful.

Research suggests that individuals put more into retirement accounts or pension plans if their employer matches a portion of their contributions. 

Imagine that a country's legislature passes a law that requires employers with at least 300 employees to provide a retirement contribution matching program. Using tax data, lawmakers compile a dataset with the following information from each of 200 different companies:

 - `size`: Number of employees
 - `group`: Contribution Matching Program Group ("No Program" vs. "Program")
 - `contribution`: Average monthly employee contributions (in dollars)

Number of employees &mdash; acting as the forcing variable with a cutoff at 300 employees &mdash; dictates whether or not a company has a contribution matching program.

We want to assess whether the policy caused an increase in average monthly retirement contributions by employees. This example is a perfect candidate for a new technique, called regression discontinuity design.

RDD works by only focusing on the points near the cutoff. Companies with close to 300 employees are probably very similar to one another. So the ONLY difference, on average, between companies with 299 employees and those with 301 employees should be whether they have the contribution matching program. This means we should have a treatment and control group that look a lot like those of a randomized experiment!

Scatter plot of retirement contribution increasing as number of employees increases with a vertical line at x=300 employees. Points left of the line do not have a contribution program, while points right of the line do have a contribution program.

The Case for RDD

We need a little more vocabulary before we can dive into more details about Regression Discontinuity Design.

Sometimes treatment group assignment is dictated by one continuous variable known as a _forcing variable_ (the company size in our example). The forcing variable, also referred to as a _rating variable_ or _running variable_, has a cutoff value such that:
 - Individuals with a value smaller than the cutoff are in one treatment group.
 - Individuals with a value larger than the cutoff are in the other treatment group.

The treatment group is perfectly predicted by the forcing variable. In this scenario, we cannot rely on other causal inference techniques such as matching or weighting methods because there is not a consistent mixture of treatment and controls across different values of the forcing variable.


Scatter plot of an outcome variable for values of a forcing variable from 30 to 90. Points below a forcing value of 60 are circles for control, and points above 60 are triangles for treatment.

Forcing Variables in RDD

The forcing variable cutpoint can either be exact or not exact:

 - If the cutpoint is exact, the probability of treatment changes from zero to one at the cutpoint. In other words, all observations on one side of the cutpoint are in the treatment group (and actually received the treatment) and all observations on the other side of the cutpoint are in the control group (and didn't receive the treatment). This is known as a _sharp design_.

![Scatter plot of an outcome variable for values of a forcing variable from 30 to 90. Points below a forcing value of 60 are circles for control, and points above 60 are triangles for treatment, indicating a sharp design.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/example-sharp.svg)

 - If the cutpoint is NOT exact, the probability of treatment doesn't jump from zero to one at the cutpoint. In other words, there are individuals in each treatment group on BOTH sides of the cutpoint. This is known as a _fuzzy design_.

![The same scatter plot of an outcome variable for values of a forcing variable from 30 to 90. Most circles are below 60, and most triangles are above 60, but some are in the opposite group, indicating a fuzzy design.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/example-fuzzy.svg)

While this distinction may seem minor, it is actually incredibly important to recognize when to use sharp RDD as opposed to fuzzy RDD. Not only do the two approaches make different assumptions about the data, but they also require slightly different statistical methods.
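Besides looking at a plot, we could check the design type directly by comparing each observation's treatment status with what the cutoff alone would predict. The helper `classify_design()` and the example vectors below are hypothetical, and the sketch assumes treatment is assigned at or above the cutoff.

```r
# Label a design "sharp" when treatment status is perfectly
# predicted by the forcing variable, and "fuzzy" otherwise
classify_design <- function(forcing, treated, cutoff) {
  assigned <- forcing >= cutoff # assumption: treated at or above the cutoff
  if (all(treated == assigned)) "sharp" else "fuzzy"
}

# Made-up forcing variable with a cutoff at 60
forcing <- c(35, 48, 59, 60, 72, 88)

# Everyone follows the assignment rule
sharp_example <- classify_design(forcing, c(0, 0, 0, 1, 1, 1), cutoff = 60)

# One observation below the cutoff is treated anyway
fuzzy_example <- classify_design(forcing, c(0, 1, 0, 1, 1, 1), cutoff = 60)

print(c(sharp_example, fuzzy_example))
```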

We will focus on sharp RDD in this lesson, even though perfect compliance with the treatment assignment is not always realistic.

Sharp or Fuzzy?

We saw that our employee contribution example requires a sharp regression discontinuity design: all companies with at least 300 employees have a contribution matching program, and all companies with fewer than 300 employees do NOT have a contribution matching program. Before we decide which companies are similar enough to compare, we must consider some assumptions.

In order to get valid estimates of the treatment effect using a sharp RDD approach, several assumptions have to be met.

1. The treatment variable impacts the outcome, but not any of the other variables.
2. The treatment assignment happens only at ONE cutpoint value of the forcing variable.
3. Treatment assignment is independent of the potential outcomes within a narrow interval around the cutpoint.
4. Counterfactual outcomes can be modeled within the interval around the cutpoint.

Defining RDD Assumptions

A scatter plot of our data allows us to check certain RDD conditions visually. We can see whether we have a sharp or fuzzy cutoff. We can also use the plot to check for a _discontinuity_ &mdash; a sudden change in the outcome variable &mdash; at the cutoff.

To create this scatter plot with the contribution matching dataset, we can use the `ggplot2` package in R:

```r
library(ggplot2) #load ggplot2 package
```

Our scatter plot should have the number of employees on the x-axis and the contribution amount on the y-axis. The points for treatment and control groups should be different colors and shapes, so we can easily tell the two groups apart. Finally, we add code for a dashed vertical line at the cutoff of 300 employees.

```r
# create a scatterplot with treatment groups
ggplot(
  data = cont_data,
  aes(
    x = size, # forcing variable
    y = contribution, # outcome variable
    color = group, # sets point color by treatment group
    shape = group # sets point shape by treatment group
  )) +
  geom_point() + 
  geom_vline(xintercept = 300, linetype = "dashed") #add line at 300
```

This plot clearly shows that we have a sharp RDD, not a fuzzy one. The dashed line at 300 employees separates the two groups into companies that offer a matching program (at least 300) and companies that do NOT offer a matching program (fewer than 300).

![Scatter plot of contributions against company size with cutoff of 300 separating "no program" group from "program" group.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/ex2-ndata-plot.svg)

We can also check to make sure that there is actually a discontinuity in average contributions based on whether or not companies have at least 300 employees. To do this, we can add a separate best fit line for each treatment group using the `geom_smooth()` function. If we had saved our first plot as `rdd_scatter`, we can add the code for the lines as follows:

```r
# add best fit lines for each group to scatter plot
rdd_scatter +
  geom_smooth(
    aes(group = group), #plot separate lines for each group
    method = "lm" #use linear regression
)
```

![The same scatter plot of contributions against company size with added regression lines for each program group that shows a jump in contribution amount at the cutoff line of 300.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/rdd-smooth-plot.svg)

There is an obvious jump in the average contributions at the cutoff point, which means there is a discontinuity present. If there were no discontinuity present, we might see something like this:

![A scatter plot of contributions against company size with added regression lines for each program group that connect like a single line rather than jumping at the cutoff line.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/no-jump.svg)

Note that there is no jump in the outcome variable here. The lines connect smoothly.
  


Visual Check

In RDD, we know we need to look at points near the cutoff to find treatment and control groups that are similar. But how do we know how close to look?

The _bandwidth_ describes the distance on either side of the cutoff we should use to reduce our dataset. Any points that are more than one bandwidth above or below the cutoff are discarded. Choosing the bandwidth can have a serious impact on the results of an RDD analysis:

 - A wider bandwidth keeps more of the original dataset, so we have more information to estimate the treatment effect with. However, the treatment groups might be too different on confounding variables, which could decrease accuracy.
 - A narrower bandwidth retains less of the original dataset, so treatment groups will be more alike. However, the smaller sample size means less information to estimate the treatment effect.

We could select the bandwidth based on what we BELIEVE is best. However, an algorithm that optimizes the bandwidth mathematically may be a better choice. A popular choice&mdash;which we will use&mdash;is the Imbens-Kalyanaraman (IK) algorithm.

The R package `rdd` contains all of the tools needed to calculate the optimal bandwidth and carry out an RDD analysis. To calculate the IK bandwidth using `rdd`, we will use the `IKbandwidth()` function, which requires three arguments:

 - `X`: the forcing variable
 - `Y`: the outcome variable
 - `cutpoint`: the cutoff value to use.

To calculate the IK bandwidth for the contribution matching dataset, we would use the following code:

```r
library(rdd)

# calculate IK bandwidth
cont_ik_bw <- IKbandwidth(
  X = cont_data$size, # forcing variable
  Y = cont_data$contribution, # outcome variable
  cutpoint = cont_cutpoint # cutpoint
)

# print the IK bandwidth to the console
cont_ik_bw
[1] 13.26322
```

The reduced dataset used in our RDD analysis will include only the companies within 13.26 employees of the cutoff (300 &#xB1; 13.26), which means companies that have between 287 and 313 employees. Companies in this range are likely to be similar on other variables that may impact employee contributions, such as average salary or insurance costs.
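To make the windowing step concrete, here is a small base R sketch that keeps only the sizes within one bandwidth of the cutoff. The `size` vector is a hypothetical stand-in for the forcing variable, and the bandwidth is the value reported above.

```r
cutoff <- 300
bandwidth <- 13.26322 # IK bandwidth reported above

# Hypothetical company sizes (whole numbers of employees)
size <- 250:350

# Keep only sizes no more than one bandwidth away from the cutoff
in_window <- abs(size - cutoff) <= bandwidth

window_range <- range(size[in_window])
window_count <- sum(in_window)

window_range # 287 313
window_count # 27
```

Because employee counts are whole numbers, the window of 300 &#xB1; 13.26 covers sizes 287 through 313.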

To illustrate the bandwidth visually, we can add bandwidth lines to the scatterplot. We can use `geom_vline()` to add reference lines at the cutpoint &#xB1; the bandwidth to our scatter plot `rdd_scatter` from earlier:

```r
rdd_scatter +
  geom_vline(xintercept = 300 + c(-cont_ik_bw, cont_ik_bw)) # add lines to indicate the bandwidth
```

![A scatter plot of contributions against company size with added vertical lines around 287 and 313 to show the slice of the data that is relevant to our study.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/cont_plot6.svg)

This plot shows us just how narrow the optimal bandwidth is for the contribution program dataset.

Choosing a Bandwidth

The use of a bandwidth impacts the type of causal estimand we can calculate in a regression discontinuity design analysis. Because the RDD approach uses a subset of the full dataset, we can only estimate the _local average treatment effect_ (LATE). The LATE is the average treatment effect among the subset of data that falls within the range of the bandwidth.

To estimate the LATE in RDD, a regression model that allows for different slopes on each side of the cutpoint is fit. The regression model is then used to get a predicted value of the outcome variable for each treatment group at the cutpoint. The difference between the predicted outcome values of the treatment and control groups is an estimate of the LATE.
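The idea can be sketched by hand in base R. The example below uses simulated data with a known jump of 50 at the cutoff, and all variable names are hypothetical. It weights every point in the window equally, whereas dedicated RDD functions may weight points by their distance from the cutpoint, so treat it as an illustration of the logic rather than a replacement for a package.

```r
set.seed(42)
n <- 2000
size <- sample(100:500, n, replace = TRUE)
treated <- as.numeric(size >= 300)

# Simulated outcome: linear trend in size plus a true jump of 50 at the cutoff
contribution <- 100 + 0.2 * size + 50 * treated + rnorm(n, sd = 10)
sim_data <- data.frame(size, treated, contribution)

cutoff <- 300
bw <- 50 # bandwidth chosen arbitrarily for this sketch

# Center the forcing variable so the cutoff sits at zero, then keep the window
sim_data$centered <- sim_data$size - cutoff
window <- sim_data[abs(sim_data$centered) <= bw, ]

# The interaction term allows a different slope on each side of the cutoff
fit <- lm(contribution ~ treated * centered, data = window)

# Predicted outcome for each group exactly at the cutoff (centered = 0)
at_cutoff <- data.frame(treated = c(1, 0), centered = 0)
pred <- unname(predict(fit, newdata = at_cutoff))

late_from_predictions <- pred[1] - pred[2]
late_from_coefficient <- unname(coef(fit)["treated"])

late_from_predictions
late_from_coefficient
```

Because the forcing variable is centered at the cutoff, the coefficient on `treated` equals the difference between the two predicted values, which is the estimated jump.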

We can use the `RDestimate()` function from the `rdd` package as follows to fit the local linear regression model for the contribution matching data:

```r
cont_rdd <- RDestimate(
  formula = contribution ~ size, #outcome regression model
  data = cont_data, #dataset
  cutpoint = 300, #cutpoint
  bw = cont_ik_bw #bandwidth
)
```

The `RDestimate()` function fits the local linear regression model at the provided bandwidth, but also at half of the bandwidth and twice the bandwidth. If the estimate of the LATE is roughly the same across bandwidths, we can be more confident that the estimate is not sensitive to our choice of bandwidth. We see all three estimates when we print the results.
```r
Call:
RDestimate(formula = contribution ~ size, data = cont_data,
cutpoint = 300, bw = cont_ik_bw)

Coefficients:
    LATE    Half-BW  Double-BW
   90.60     110.67   71.62
```

The model output shows us that the LATE is 90.60, meaning that in this dataset, we can conclude that employer-sponsored retirement contribution matching programs led to an increase in average monthly contributions of \$90.60. However, we see that the estimate changes based on the bandwidth, ranging from \$110.67 at half of the bandwidth to \$71.62 at twice the bandwidth.

Estimating the Causal Treatment Effect

As we've seen, the advantages of regression discontinuity design are that RDD:

 - Is a simple method to understand and implement.
 - Avoids using a complicated regression model for the entire dataset &mdash; the local regression model is simple.
 - Is useful in cases where there is no overlap on a confounding variable, which may prevent us from using stratification or propensity score analysis.

However, there are several drawbacks to RDD inherent to the method:

 - Smaller bandwidths make RDD assumptions more plausible BUT also reduce the sample size.
 - The local average treatment effect (LATE) is not an easily interpretable estimand. We calculated the effect of the contribution matching program only among the companies with close to 300 employees. How confident can we be that this effect would be the same in much smaller or larger companies?

Let's explore the tradeoffs with an example. Say we run `RDestimate()` on the contribution data again, but set the bandwidth to 100 instead of the IK optimal bandwidth of 13. We save the results to `rdd_100` and print the output:

```
Call:
RDestimate(formula = contribution ~ size, data = cont_data,
cutpoint = 300, bw = 100)
 
Coefficients:
    LATE    Half-BW  Double-BW
   56.71      59.90      53.54
```

We can also get information on the number of observations included and the standard error of the LATE for each bandwidth with the code that follows.

```r
rdd_100$obs #number of observations
# Output
[1] 113  68 178

rdd_100$se #standard errors
# Output
[1] 6.647 9.077 5.269
```
Let's consider just the half-bandwidth (50) and the double-bandwidth (200). Using the half-bandwidth, we analyze only companies with 250-350 employees, so we may believe these companies are very similar to one another. But this leaves only 68 companies in our sample and a standard error of 9.077 for the LATE. We may ask:
* Can we trust a LATE with a higher standard error?
* Do these findings apply to companies outside the range of 250-350 employees?

At the double-bandwidth, we analyze companies with 100-500 employees, so our sample size is much larger at 178 companies and our standard error is reduced to 5.269. But how confident are we that companies with 100 employees are similar enough to compare to companies with 500 employees?
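We can reproduce this kind of comparison by hand. The base R sketch below uses simulated data and a hypothetical helper, `fit_at_bandwidth()`. It fits the same local linear model at three bandwidths and collects the sample size, estimate, and standard error for each. The standard errors here are the classical ones from `lm()`, which may differ from those a dedicated RDD package reports.

```r
set.seed(7)
n <- 1000
size <- sample(50:550, n, replace = TRUE)
treated <- as.numeric(size >= 300)

# Simulated outcome with a true jump of 50 at the cutoff
contribution <- 100 + 0.2 * size + 50 * treated + rnorm(n, sd = 20)
sim_data <- data.frame(centered = size - 300, treated, contribution)

# Fit the local linear model within one bandwidth of the cutoff
fit_at_bandwidth <- function(bw, data) {
  window <- data[abs(data$centered) <= bw, ]
  fit <- lm(contribution ~ treated * centered, data = window)
  coefs <- summary(fit)$coefficients
  data.frame(
    bandwidth = bw,
    n_obs = nrow(window),
    estimate = coefs["treated", "Estimate"],
    std_error = coefs["treated", "Std. Error"]
  )
}

# Half, original, and double bandwidth
results <- do.call(rbind, lapply(c(50, 100, 200), fit_at_bandwidth, data = sim_data))
results
```

In this simulation the trend really is linear, so widening the bandwidth only adds precision. With real data, a wider window can also add bias if companies far from the cutoff differ in ways the model does not capture.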

Regression discontinuity design is a useful method to keep in our causal inference toolbox, but we must be aware of the tradeoffs throughout the process.

Advantages and Disadvantages of RDD

In this lesson, we showed that the implementation of an employer-sponsored retirement matching program led to an increase in average monthly employee contributions.

We learned a lot about regression discontinuity design along the way, including:

 - RDD is a method used when the treatment assignment is determined by a continuous forcing variable at a specific cutoff point.
 - An RDD is known as _sharp_ if the cutoff is exact and _fuzzy_ if the cutoff is not exact.
 - Individuals within a narrow window on either side of the cutoff are assumed to be similar to each other, except for the treatment group assignment.
 - Local linear regression can be used to determine the local average treatment effect (LATE) among the individuals in this narrow window.
 - Disadvantages of RDD include reduced sample size and potential lack of generalizability of the LATE.

Regression Discontinuity Design

Learn about instrumental variables for causal inference.

Suppose that a city is interested in increasing the per-capita rate of recycling among its citizens while decreasing the city’s operating costs.

To encourage this, the city allows individuals to discontinue curbside recycling pickup and instead opt into a rebate program. The city wants to evaluate whether or not this rebate program increases the amount of recycling in the city.
 
The city compiles the following variables in a dataset to evaluate the success of the rebate program:  
 
`recycled`: amount recycled (kg/person).  
`rebate`: participation in rebate program (curbside vs. rebate).  
`distance`: distance from recycling center (&#8804; 5 miles vs. > 5 miles). An individual who lives within five miles of a recycling center might be more likely to opt into the rebate program than someone who lives more than five miles from a center. However, distance to a recycling center should not directly impact the amount of waste each person recycles.

We are going to use Instrumental Variables to answer the city's question, but before we dive into that, we need to establish some more tools.

The first is **conditional exchangeability.**

One key assumption made in causal inference methods like weighting or stratification is _conditional exchangeability_. This assumption states that there are no unmeasured confounding variables that have a causal effect on both the treatment assignment and the outcome.

Randomization of the treatment assignment ensures that both measured and unmeasured confounding variables are evenly balanced between treatment groups. In a non-randomized setting, balance is NOT guaranteed. The assumption of conditional exchangeability cannot be tested or verified &mdash; in most cases, the best we can do without randomization is to identify and measure as many potential confounding variables as possible.

Instrumental variable (IV) estimation is a causal inference technique that helps us estimate the causal effect of the treatment even in the presence of unmeasured confounding variables. In this lesson, we will learn about the assumptions and potential applications of IV estimation.

Diagram of causal relationships made of three boxes and arrows between them. The treatment box has an arrow pointing to the outcome box. The  "measured and unmeasured confounders" box points to both the treatment and the outcome boxes.

Introduction to Instrumental Variables

In observational studies, when randomization is not possible, balance of measured and unmeasured confounding variables is not guaranteed. Without taking appropriate measures, the causal estimate of the effect of a treatment on an outcome of interest will be biased.

Instrumental variable (IV) estimation is one causal inference method that uses _instruments_ to help reduce bias from both measured AND unmeasured confounding variables.

If you started this lesson hoping to learn about pianos and guitars, you may be disappointed. In IV estimation, an _instrument_ (or _instrumental variable_) is a variable that is causally related to an outcome variable ONLY through another variable &mdash; typically the treatment variable of interest.

An instrumental variable would be depicted in a causal diagram as follows:

![IV diagram where Treatment, Outcome, Confounders (both measured and unmeasured), and Instruments are all represented. Treatment feeds the Outcome. Instruments feeds the treatment. The Confounders feed both the Outcome and the Treatment.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/iv-e2-narr.svg)

In this diagram, the arrows signify the presence and direction of a causal relationship. For example, there is no arrow directly from the instrument to the outcome because the instrument only impacts the outcome through its causal relationship with the treatment.

Not That Kind of Instrument

When treatment group assignments cannot be randomized due to ethical or practical reasons, the best we can do is to encourage compliance. However, encouragement does not guarantee compliance.

Compliance with the treatment assignment can be influenced by many factors, only some of which may be measurable. To account for unmeasured confounders of treatment compliance and the outcome, we could use IV estimation with treatment assignment as the instrument:

![IV diagram where Treatment Compliance, Treatment Assignment, Outcome, and Confounders (both measured and unmeasured) are represented. Treatment Assignment feeds Treatment Compliance. Treatment Compliance feeds the Outcome, and the Confounders feed both Treatment Compliance and the Outcome variables.](https://static-assets.codecademy.com/Courses/causal-inference/rdd-and-iv/iv-e3-narr.svg)

When the instrument AND treatment are both binary variables, we can define four types of "compliers":

  1. _Always takers_: take the treatment regardless of treatment assignment.
  2. _Never takers_: never take the treatment regardless of treatment assignment.
  3. _Compliers_: take the treatment they are assigned.
  4. _Defiers_: take the opposite of the treatment they are assigned.

In the context of the recycling program example, the four types of compliers would be defined as follows:

<table class = "table table-condensed">
  <thead>
    <tr class="header">
      <th></th><th colspan="2"><b>Instrument Value</b></th>
    </tr>
    <tr>
      <th></th><th><b>&#8804; 5 miles</b></th><th><b>&#062; 5 miles</b></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><b>Always takers</b></td> <td>Rebate</td><td>Rebate</td>
    </tr>
    <tr>
      <td><b>Never takers</b></td> <td>Curbside</td><td>Curbside</td>
    </tr>
    <tr>
      <td><b>Compliers</b></td><td>Rebate</td><td>Curbside</td>
    </tr>
    <tr>
      <td><b>Defiers</b></td><td>Curbside</td><td>Rebate</td>
    </tr>
  </tbody>
</table>
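The table can be expressed as a small lookup in R. The sketch below assumes we somehow knew what each person would choose at both instrument values, which is never the case with real data. The `potential` data frame and the `classify_complier()` helper are hypothetical.

```r
# Hypothetical potential treatments: what each person WOULD choose
# if they lived near (<= 5 miles) or far (> 5 miles) from a center
potential <- data.frame(
  choice_if_near = c("Rebate", "Curbside", "Rebate", "Curbside"),
  choice_if_far = c("Rebate", "Curbside", "Curbside", "Rebate")
)

# Map each pair of potential choices to a complier type
classify_complier <- function(choice_if_near, choice_if_far) {
  ifelse(choice_if_near == "Rebate" & choice_if_far == "Rebate", "Always taker",
  ifelse(choice_if_near == "Curbside" & choice_if_far == "Curbside", "Never taker",
  ifelse(choice_if_near == "Rebate" & choice_if_far == "Curbside", "Complier",
         "Defier")))
}

potential$type <- classify_complier(potential$choice_if_near, potential$choice_if_far)
potential
```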

The issue of compliance is related to the fundamental problem of causal inference, which states that we can never observe both potential outcomes. In an IV estimation, we can never observe both potential treatments received for an individual. Thus, we cannot know with certainty what kind of complier an individual is.

Fortunately, we can get around this issue by making additional assumptions about our sample.

Assignment vs. Compliance

Compliance with the treatment assignment influences which causal estimand we calculate:

 - When compliance is perfect, the treatment assignment and treatment received can be used interchangeably to get an accurate estimate of the causal effect.
 - When compliance is NOT perfect, we cannot estimate the average treatment effect (ATE) because treatment assignment and treatment compliance are NOT interchangeable. Instead, the average effect of treatment assignment can be thought of as the effect of the _intention to treat_, or ITT effect.

IV estimation allows us to approximate the ATE by estimating a local average treatment effect (LATE) just among the compliers. This is known as the _compliers' average causal effect_ (CACE). To estimate the CACE, we must make four assumptions about the sample:

 - _Relevance_: the instrument has a causal effect on the treatment received.
 - _Exclusion_: the instrument affects the outcome ONLY indirectly through the treatment received.
 - _Exchangeability_: there are no confounders that influence both the instrument AND the outcome.
 - _Monotonicity_: there are no "defiers" in the sample.

The relevance assumption implies that there are at least some "compliers" in the sample. For "never-takers" and "always-takers," the treatment received doesn't depend on the value of the assigned treatment, so a sample made up only of these subgroups would show no effect of the instrument on the treatment. And in order to attribute the instrument's effect on the outcome to the "compliers," we must also assume that there are no "defiers" in the sample.
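With a binary instrument and binary treatment, one common way to see how these assumptions produce a CACE estimate is the ratio of two intention-to-treat effects (often called the Wald estimator, a term this lesson does not use). The sketch below uses simulated data with a known treatment effect of 30; the sample contains compliers, always takers, and never takers, but no defiers. All names and numbers are hypothetical.

```r
set.seed(123)
n <- 5000

# Binary instrument: 1 = lives near a recycling center
near <- rbinom(n, 1, 0.5)

# Complier types (no defiers, so monotonicity holds by construction)
type <- sample(c("complier", "always", "never"), n,
               replace = TRUE, prob = c(0.5, 0.2, 0.3))

# Treatment received depends on the instrument only for compliers
rebate <- ifelse(type == "always", 1, ifelse(type == "never", 0, near))

# Outcome: true treatment effect of 30; always takers recycle more at baseline
recycled <- 120 + 30 * rebate + 15 * (type == "always") + rnorm(n, sd = 10)

# Effect of the instrument on the outcome (ITT effect)
itt_outcome <- mean(recycled[near == 1]) - mean(recycled[near == 0])

# Effect of the instrument on the treatment received (relevance)
itt_treatment <- mean(rebate[near == 1]) - mean(rebate[near == 0])

# Ratio of the two: estimated effect among compliers
cace_hat <- itt_outcome / itt_treatment
cace_hat
```

The denominator estimates the share of compliers in the sample. If it were close to zero, the ratio would be unstable, which is why relevance matters.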

Assumptions of IV Estimation

In a standard ordinary least squares (OLS) regression model, the outcome is predicted directly from the treatment received (along with any measured confounding variables we choose to include). With the treatment as the only predictor, such an OLS model would take the form:

<b>Outcome = &#946;<sub>0</sub> + &#946;<sub>1</sub> * Treatment Received</b>.

In OLS regression, we would use &#946;<sub>1</sub> as an estimate of the treatment effect. To fit the OLS regression model using the recycling data, we would use the following code:

```r
lm(recycled ~ rebate, #outcome ~ treatment
   data = recycle_df #dataset
)

# Output:

Call:
lm(formula = recycled ~ rebate, data = recycle_df)

Coefficients:
(Intercept)   rebate
  126.00       38.04
```

The OLS estimate suggests that participation in the rebate program leads to an average increase in recycling of 38.04 kilograms/person. However, this estimate is likely biased because OLS regression does not account for unmeasured confounding variables or imperfect compliance.

In IV estimation, we account for unmeasured confounding variables and imperfect compliance via _two-stage least squares_ (2SLS) regression. This type of regression predicts the outcome in two separate steps:

 - In the first stage, treatment received is predicted by the instrument (treatment assignment): 
   
    <b> Predicted Treatment Received = &#945;<sub>0</sub> + &#945;<sub>1</sub> * Treatment Assignment</b>

 - In the second stage, the outcome is predicted as a function of the predicted treatment received from the first stage:

    <b> Outcome = &#946;<sub>0</sub> + &#946;<sub>1</sub> * Predicted Treatment Received</b>

&#946;<sub>1</sub> from the second stage of the 2SLS regression model is used as the estimate of the CACE.

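In this lesson, we focus on 2SLS regression with a continuous outcome, binary instrument, and binary treatment to keep things simple. Even though the treatment is binary, both stages of 2SLS use linear (least squares) regression, as the name suggests: the first stage is a linear probability model for the treatment received, and the second stage is a linear regression for the outcome.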
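To see the two stages in action, the sketch below fits them by hand with `lm()` on simulated data where the true treatment effect is 30 and an unmeasured confounder pushes the OLS estimate upward. The data-generating process and all variable names are assumptions made for illustration; this is not the lesson's recycling dataset.

```r
set.seed(2024)
n <- 20000

near_center <- rbinom(n, 1, 0.5) # binary instrument
unmeasured <- rnorm(n)           # confounder we pretend not to observe

# Treatment received depends on the instrument AND the confounder
rebate <- as.numeric(0.8 * near_center + 0.5 * unmeasured + rnorm(n) > 0.6)

# Outcome: true effect of 30, plus a direct effect of the confounder
recycled <- 125 + 30 * rebate + 8 * unmeasured + rnorm(n, sd = 10)
sim_df <- data.frame(recycled, rebate, near_center)

# OLS ignores the confounder and is biased
ols_est <- unname(coef(lm(recycled ~ rebate, data = sim_df))["rebate"])

# Stage 1: predict treatment received from the instrument
stage1 <- lm(rebate ~ near_center, data = sim_df)
sim_df$rebate_hat <- fitted(stage1)

# Stage 2: regress the outcome on the predicted treatment
stage2 <- lm(recycled ~ rebate_hat, data = sim_df)
iv_est <- unname(coef(stage2)["rebate_hat"])

ols_est # noticeably above 30 in this simulation
iv_est  # close to 30
```

The point estimate from the second stage is the 2SLS estimate, but the standard errors printed by `summary(stage2)` should not be used as they are, because they ignore that `rebate_hat` is itself estimated.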

Two-Stage Least Squares Regression

Performing 2SLS in R is easy if we use the `ivreg()` function from the `AER` package.

The key difference in syntax between `ivreg()` and other regression functions is that the `formula` argument of the `ivreg()` function must include the instrument. If we wanted to perform 2SLS regression with variables `outcome` as the outcome, `treatment` as the treatment, and `instrument` as the instrument, the model formula would be `outcome ~ treatment | instrument`.

To fit the 2SLS regression using the recycling data, we would use the following code:

```r
# import library
library(AER)

# run 2SLS regression
iv_mod <- ivreg(
  #outcome ~ treatment | instrument
  formula = recycled ~ rebate | distance, 
  data = recycle_df
  )
```

To view the coefficients and standard errors, we can use `summary(iv_mod)$coefficients`, which gives the following output (you may need to make this section of the screen wider to view the full table):

```
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 129.36463  0.8683141 148.98368 0.000000e+00
rebate       31.25452  1.4629239  21.36442 5.118885e-68
```

The results of 2SLS regression show that the estimate of the effect of the rebate program is 31.25, meaning participation in the rebate program led to an average increase in recycling of 31.25 kilograms/person. This only applies to compliers: those individuals who participated in the rebate program because they lived within 5 miles of a recycling center, but who would not have participated otherwise.

You may be wondering why we couldn't just fit the two separate regression models described in the previous exercise using `lm()` or `glm()` functions. The `ivreg()` function is preferred because it automatically corrects standard errors to account for the fact that the second stage regression model uses predicted values of the treatment. 

If we use incorrect standard errors, we could make incorrect conclusions about the treatment effect:

* Lower standard errors correspond with more precise treatment effect estimates and a greater likelihood that the treatment coefficient will be found to be significantly different from zero. 
* Higher standard errors correspond with less precise treatment effect estimates and a lesser likelihood that the treatment coefficient will be found to be significantly different from zero.
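The sketch below illustrates where the correction comes from, using simulated data and the textbook 2SLS formula under homoskedastic errors. The naive second-stage standard error is based on residuals computed from the predicted treatment, while the corrected version uses residuals computed from the actual treatment. This is a hand calculation under stated assumptions, not a description of any package's internals, and all variable names are hypothetical.

```r
set.seed(11)
n <- 2000
instrument <- rbinom(n, 1, 0.5)
unmeasured <- rnorm(n)
treatment <- as.numeric(0.8 * instrument + 0.5 * unmeasured + rnorm(n) > 0.6)
outcome <- 125 + 30 * treatment + 8 * unmeasured + rnorm(n, sd = 10)

# Manual two-stage fit
stage1 <- lm(treatment ~ instrument)
treatment_hat <- fitted(stage1)
stage2 <- lm(outcome ~ treatment_hat)
b <- unname(coef(stage2))

# Naive standard error: taken straight from the second-stage fit
naive_se <- summary(stage2)$coefficients["treatment_hat", "Std. Error"]

# Corrected residuals use the ACTUAL treatment with the 2SLS coefficients
resid_corrected <- outcome - (b[1] + b[2] * treatment)
sigma2_corrected <- sum(resid_corrected^2) / (n - 2)

# The design matrix still uses the PREDICTED treatment
X_hat <- cbind(1, treatment_hat)
corrected_se <- sqrt(sigma2_corrected * solve(crossprod(X_hat))[2, 2])

naive_se
corrected_se
```

The two values differ only through the residual variance, so which one is larger depends on the data. Using a function designed for 2SLS avoids having to make this adjustment by hand.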

IV Estimation in R

While IV estimation can be effective in certain circumstances, it also has limitations. One limitation is that the causal estimand (CACE) may not be generalizable. The CACE only describes the treatment effect among compliers, so it may not apply to always takers, never takers, or the population as a whole.

Another limitation is that it is difficult to find a suitable instrument that has a strong relationship with the treatment. If an instrument is only weakly related to the treatment, 2SLS regression will produce inaccurate estimates of the CACE.

To illustrate this, suppose that instead of using distance as an instrument for participation in the rebate program, we used another variable, `children`. The variable indicates whether or not an individual has children. Having children wouldn't directly cause a change in recycling, but individuals with children might be less likely to participate in the rebate program. The rebate program requires individuals to take the time to drop off recycling &mdash; time that people with children might not have.

Performing the 2SLS regression again with the `ivreg()` function and the `children` variable as an instrument highlights the effect of using a weak instrument:

```r
iv_mod_weak <- ivreg(
  formula = recycled ~ rebate | children, #new weak instrument
  data = recycle_df
  )
```

Using `summary(iv_mod_weak)$coefficients` we can view just the coefficient table from the results summary.

```
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 126.10140   8.930242 14.120715 5.477652e-37
rebate       37.84692  18.018181  2.100485 3.631487e-02
```

The estimate of 37.85 is neither accurate nor precise, as highlighted by the large standard error. This estimate is similar to the estimate from OLS regression, demonstrating that this is a weak instrument.
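One way to screen for a weak instrument before running 2SLS is to look at the first-stage regression of the treatment on the instrument. A widely used rule of thumb, which is not part of this lesson, treats a first-stage F-statistic below about 10 as a warning sign. The sketch below compares a strong and a weak instrument on simulated data; the `first_stage_f()` helper and all variables are hypothetical.

```r
set.seed(99)
n <- 2000
strong_iv <- rbinom(n, 1, 0.5)
weak_iv <- rbinom(n, 1, 0.5)

# Treatment depends strongly on strong_iv and only barely on weak_iv
p_treat <- plogis(-1 + 2 * strong_iv + 0.05 * weak_iv)
rebate <- rbinom(n, 1, p_treat)

# F-statistic from the first-stage regression of treatment on instrument
first_stage_f <- function(treatment, instrument) {
  fit <- lm(treatment ~ instrument)
  unname(summary(fit)$fstatistic["value"])
}

f_strong <- first_stage_f(rebate, strong_iv)
f_weak <- first_stage_f(rebate, weak_iv)

f_strong # large: the instrument clearly predicts treatment
f_weak   # expected to be much smaller
```

A small first-stage F-statistic suggests the instrument explains little of the variation in treatment, so the resulting 2SLS estimate is likely to be imprecise.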

Interpretation and Considerations of IV Estimation

Congratulations on finishing this lesson on instrumental variable estimation! We covered a lot of topics:

 - Instrumental variable (IV) estimation is a causal inference technique that can be used to estimate a causal treatment effect even in the presence of unobserved confounding variables.
 - IV estimation can be used in non-randomized studies when compliance with the assigned or encouraged treatment is not perfect.
 - An instrument is a variable that is related to an outcome of interest ONLY through the treatment variable.
 - The four assumptions of IV estimation are relevance, exclusion, exchangeability, and monotonicity.
 - IV estimation is performed via two-stage least squares (2SLS) regression.
 - The `ivreg()` function in the `AER` package performs 2SLS regression and automatically provides corrected standard errors of the treatment effect.

Instrumental Variables

In RDD, the forcing variable is NOT related to the outcome variable of interest.

In RDD, a forcing variable is a continuous variable with a cutpoint that determines the treatment group assignment.

When a forcing variable cutpoint is exact, the RDD is known as a sharp RDD. When the cutpoint is not exact, the RDD is known as a fuzzy RDD.

RDD may be used for causal inference when the treatment is assigned to observations that fall above or below a specific cutpoint.

Individuals in different treatment groups with forcing variable values close to the cutpoint will be more similar to one another with respect to confounding variables than individuals further from the cutpoint.

Individuals with forcing variable values close to the cutpoint are used in a sharp RDD analysis while individuals with forcing variable values far from the cutpoint are used in a fuzzy RDD analysis.

Individuals with forcing variable values close to the cutpoint have a low probability of being in the treatment group whereas individuals with forcing variable values far from the cutpoint have a high probability of being in the treatment group.

There can be several cutpoints that determine the treatment group assignment.

Test your knowledge on Regression Discontinuity Design (RDD) and Instrumental Variables (IV).

RDD and IV Quiz

### Why Learn Causal Inference in R?

You hear it all the time: "correlation is not causation." By taking this course, you'll find out what causation really is. Find out why things happen using causal methods, such as matching and weighting, instrumental variables, and difference in differences. 


### Take-Away Skills 
In this course, you will learn the conceptual foundations of causal inference and how to work with data to understand why things happen. In addition to the basic foundations, you will learn how to isolate variables, apply different techniques to deal with unruly datasets, and interpret the results of your analysis.


Learn how to use causal inference to figure out how different variables influence your results.