In a standard ordinary least squares (OLS) regression model, the outcome is predicted directly from the treatment assignment and other confounding variables. Such an OLS model would take the form:

**Outcome = β_{0} + β_{1} * Treatment Assignment**

In OLS regression, we would use β_{1} as an estimate of the treatment effect. To fit the OLS regression model using the recycling data, we would use the following code:

    lm(
      recycled ~ rebate,  # outcome ~ treatment
      data = recycle_df   # dataset
    )

    # Output:
    # Call:
    # lm(formula = recycled ~ rebate, data = recycle_df)
    #
    # Coefficients:
    # (Intercept)       rebate
    #      126.00        38.04

The OLS estimate suggests that participation in the rebate program leads to an average increase in recycling of 38.04 kilograms/person. However, this estimate is biased because OLS regression does not account for bias introduced by unmeasured confounding variables or imperfect compliance.

In IV estimation, we account for unmeasured confounding variables and imperfect compliance via *two-stage least squares* (2SLS) regression. This type of regression predicts the outcome in two separate steps:

In the first stage, treatment received is predicted by the instrument (treatment assignment):

**Predicted Treatment Received = α_{0} + α_{1} * Treatment Assignment**

In the second stage, the outcome is predicted as a function of the predicted treatment received from the first stage:

**Outcome = β_{0} + β_{1} * Predicted Treatment Received**

β_{1} from the second stage of the 2SLS regression model is used as the estimate of the CACE.
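The two stages can be sketched by hand with two calls to `lm()`. The example below is only an illustration under stated assumptions: the data are simulated (not the lesson's recycling data), and the names `sim_df`, `assigned`, `received`, `motivation`, and `outcome` are hypothetical. It uses a linear first stage, which is how standard 2SLS computes the estimate; with a single binary instrument and no covariates, the first-stage predictions are simply the share of people treated in each assignment group.

```r
# Simulated example (hypothetical data, true treatment effect set to 20)
set.seed(42)
n <- 5000

assigned   <- rbinom(n, 1, 0.5)  # instrument: random treatment assignment
motivation <- rnorm(n)           # unmeasured confounder (not used in any model)

# Imperfect compliance: only assigned people can receive treatment,
# and more motivated people are more likely to take it up
received <- assigned * rbinom(n, 1, plogis(0.5 + motivation))

# Outcome depends on treatment received and on the unmeasured confounder
outcome <- 100 + 20 * received + 10 * motivation + rnorm(n, sd = 5)

sim_df <- data.frame(assigned, received, outcome)

# Stage 1: predict treatment received from treatment assignment
stage1 <- lm(received ~ assigned, data = sim_df)
sim_df$received_hat <- fitted(stage1)

# Stage 2: predict the outcome from the predicted treatment received
stage2 <- lm(outcome ~ received_hat, data = sim_df)
cace_2sls <- unname(coef(stage2)["received_hat"])

# Cross-check: difference in mean outcome between assignment groups,
# divided by the difference in the share who received treatment
itt_effect <- mean(sim_df$outcome[sim_df$assigned == 1]) -
  mean(sim_df$outcome[sim_df$assigned == 0])
compliance_diff <- mean(sim_df$received[sim_df$assigned == 1]) -
  mean(sim_df$received[sim_df$assigned == 0])
cace_ratio <- itt_effect / compliance_diff

cat("2SLS estimate:", cace_2sls, "\n")
cat("Ratio estimate:", cace_ratio, "\n")
```

The coefficient from this manual approach matches what a dedicated 2SLS routine would report, but the standard errors printed by the second `lm()` call are not valid because they ignore the uncertainty in the first stage. In practice, a purpose-built function such as `ivreg()` from the AER package is the safer choice for inference.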

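In this lesson, we focus on 2SLS regression with a continuous outcome, binary instrument, and binary treatment to keep things simple. When this is the case, the first stage can be thought of as a logistic regression and the second stage as a linear regression. Note that standard 2SLS software fits both stages with linear regression; with a single binary instrument and no other covariates, a logistic and a linear first stage produce the same predicted values.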
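For this simple setting (one binary instrument, no other covariates), the second-stage slope has a closed form that may help build intuition. The symbols here are assumptions introduced for illustration: Z is treatment assignment, D is treatment received, Y is the outcome, and a bar denotes a sample mean within an assignment group.

```latex
\hat{\beta}_{1}
  = \frac{\bar{Y}_{Z=1} - \bar{Y}_{Z=0}}
         {\bar{D}_{Z=1} - \bar{D}_{Z=0}}
```

The numerator is the effect of being assigned to treatment on the outcome, and the denominator is the effect of assignment on actually receiving treatment, so the estimate scales the assignment effect up by the compliance rate difference.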

### Instructions

**1.**

An online retailer is interested in studying the effect of offering video streaming services on the average amount of money that users spend on the platform. The retailer gives its paying members access to exclusive video content at no additional expense. The online retailer then starts an email campaign for a random subset of users to encourage them to use the streaming services.

The retailer tracks the amount of money spent by its users in a year and compiles the following variables into a dataframe named `video_df`:

- `email`: whether or not an individual received the email campaign (received email vs. did not receive email).
- `streaming`: whether or not an individual used the streaming services (used service vs. did not use service).
- `spending`: the amount of money spent by an individual in a year (dollars).

There are many factors that could influence whether or not an individual on the platform uses the video streaming services. For example, some individuals might have more free time to watch videos. Other individuals might not be interested in the content on the platform. The online retailer has no way to capture this information.

Receiving the email does not directly impact how much someone spends — it only impacts the likelihood of using the streaming platform. Thus, the instrument is whether or not the individual was emailed.

A file named **video_data.csv** has been provided, which contains data collected by the online retailer. Import this file as a dataframe and name it `video_df`.

**2.**

Uncomment and run the code to examine the first few rows of `video_df`.