In a standard ordinary least squares (OLS) regression model, the outcome is predicted directly from the treatment assignment and other confounding variables. Such an OLS model would take the form:

Outcome = β0 + β1 * Treatment Assignment.

In OLS regression, we would use β1 as an estimate of the treatment effect. To fit the OLS regression model using the recycling data, we would use the following code:

lm(recycled ~ rebate, #outcome ~ treatment data = recycle_df #dataset ) # Output: Call: lm(formula = recycled ~ rebate, data = recycle_df) Coefficients: (Intercept) rebate 126.00 38.04

The OLS estimate suggests that participation in the rebate program leads to an average increase in recycling of 38.04 kilograms/person. However, this estimate is biased because OLS regression does not control bias introduced by unmeasured confounding variables or imperfect compliance.

In IV estimation, we account for unmeasured confounding variables and imperfect compliance via two-stage least squares (2SLS) regression. This type of regression predicts the outcome in two separate steps:

  • In the first stage, treatment received is predicted by the instrument (treatment assignment):

    Predicted Treatment Received = α0 + α1 * Treatment Assignment

  • In the second stage, the outcome is predicted as a function of the predicted treatment received from the first stage:

    Outcome = β0 + β1 * Predicted Treatment Received

β1 from the second stage of the 2SLS regression model is used as the estimate of the CACE.

In this lesson, we focus on 2SLS regression with a continuous outcome, binary instrument, and binary treatment to keep things simple. When this is the case, the first stage uses logistic regression, while the second stage uses linear regression.



An online retailer is interested in studying the effect of offering video streaming services on the average amount of money that users spend on the platform. The retailer gives its paying members access to exclusive video content at no additional expense. The online retailer then starts an email campaign for a random subset of users to encourage them to use the streaming services.

The retailer tracks the amount of money spent by its users in a year and compiles the following variables into a dataframe named video_df:

  • email: whether or not an individual received the email campaign (received email vs. did not receive email).
  • streaming: whether or not an individual used the streaming services (used service vs. did not use service).
  • spending: the amount of money spent by an individual in a year (dollars)

There are many factors that could influence whether or not an individual on the platform uses the video streaming services. For example, some individuals might have more free time to watch videos. Other individuals might not be interested in the content on the platform. The online retailer has no way to capture this information.

Receiving the email does not directly impact how much someone spends — it only impacts the likelihood of using the streaming platform. Thus, the instrument is whether or not the individual was emailed.

A file named video_data.csv has been provided which contains data collected by the online retailer. Import this file as a dataframe and name it video_df.


Uncomment and run the code to examine the first few rows of video_df.

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?