
In RDD, we know we need to look at points near the cutoff to find treatment and control groups that are similar. But how do we know how close to look?

The bandwidth describes the distance on either side of the cutoff we should use to reduce our dataset. Any points that are more than one bandwidth above or below the cutoff are discarded. Choosing the bandwidth can have a serious impact on the results of an RDD analysis:

• A wider bandwidth keeps more of the original dataset, so we have more information to estimate the treatment effect with. However, the treatment and control groups may differ more on confounding variables, which can bias the estimate.
• A narrower bandwidth retains less of the original dataset, so the treatment and control groups will be more alike. However, the smaller sample size means less information to estimate the treatment effect, so the estimate is less precise.
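
The sample-size side of this tradeoff can be seen in a quick simulation. The sketch below uses made-up data; the variable names `size` and `contribution`, the cutoff of 300, and the three candidate bandwidths are all assumptions chosen for illustration. It counts how many observations survive at each bandwidth.

```r
# Simulated data for illustration only; this is not the lesson's dataset
set.seed(42)
n <- 1000
sim_data <- data.frame(size = round(runif(n, min = 100, max = 500)))
sim_data$contribution <- 50 + 0.05 * sim_data$size +
  10 * (sim_data$size >= 300) + rnorm(n, sd = 5)

cutoff <- 300
bandwidths <- c(10, 50, 150)  # hypothetical candidate bandwidths

# Count the observations that lie within each bandwidth of the cutoff
n_kept <- sapply(bandwidths, function(bw) {
  sum(abs(sim_data$size - cutoff) <= bw)
})

data.frame(bandwidth = bandwidths, n_kept = n_kept)
```

The wider the bandwidth, the more rows are kept, but the kept rows also span a wider range of the forcing variable.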

We could select the bandwidth based on what we BELIEVE is best. However, an algorithm that optimizes the bandwidth mathematically may be a better choice. A popular choice—which we will use—is the Imbens-Kalyanaraman (IK) algorithm.

The R package `rdd` contains all of the tools needed to calculate the optimal bandwidth and carry out an RDD analysis. To calculate the IK bandwidth using `rdd`, we will use the `IKbandwidth()` function, to which we will pass three arguments:

• `X`: the forcing variable
• `Y`: the outcome variable
• `cutpoint`: the cutoff value to use.

To calculate the IK bandwidth for the contribution matching dataset, we would use the following code:

```r
library(rdd)

# calculate IK bandwidth
cont_ik_bw <- IKbandwidth(
  X = cont_data$size,          # forcing variable
  Y = cont_data$contribution,  # outcome variable
  cutpoint = cont_cutpoint     # cutpoint
)

# print the IK bandwidth to the console
cont_ik_bw
[1] 13.26322
```

The reduced dataset used in our RDD analysis will include only the companies within one bandwidth of the cutoff: 300 ± 13.26, or roughly 287 to 313 employees. Companies in this range are likely to be similar on other variables that may impact employee contributions, such as average salary or insurance costs.
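
The window arithmetic and the resulting filter can be sketched in a few lines of base R. The cutoff and bandwidth are the values reported above; the `example_data` data frame is a hypothetical stand-in, since the real `cont_data` is not reproduced here.

```r
cutoff <- 300
bw <- 13.26322  # IK bandwidth reported above

# Lower and upper edges of the window around the cutoff
window <- cutoff + c(-bw, bw)
window

# Hypothetical company sizes, for illustration only
example_data <- data.frame(size = c(250, 286, 287, 300, 313, 314, 350))

# Keep only rows that are no more than one bandwidth from the cutoff
reduced <- example_data[abs(example_data$size - cutoff) <= bw, , drop = FALSE]
reduced$size
```

Because the window runs from about 286.74 to 313.26, a company with exactly 286 or 314 employees falls just outside it.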

To illustrate the bandwidth visually, we can use `geom_vline()` to add reference lines at the cutpoint ± the bandwidth to our scatter plot `rdd_scatter` from earlier:

```r
rdd_scatter +
  geom_vline(xintercept = 300 + c(-cont_ik_bw, cont_ik_bw)) # add lines to indicate the bandwidth
```

This plot shows us just how narrow the optimal bandwidth is for the contribution program dataset.
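
For readers without `rdd_scatter` at hand, here is a self-contained sketch of the same idea on simulated data. The data, the object name `sim_scatter`, and the bandwidth of 25 are assumptions for illustration, not values from the lesson.

```r
library(ggplot2)

# Simulated data for illustration only
set.seed(1)
sim_data <- data.frame(size = runif(200, min = 200, max = 400))
sim_data$contribution <- 50 + 10 * (sim_data$size >= 300) + rnorm(200, sd = 4)

cutoff <- 300
bw <- 25  # assumed bandwidth, chosen only for this example

sim_scatter <- ggplot(sim_data, aes(x = size, y = contribution)) +
  geom_point() +
  # dashed line at the cutoff
  geom_vline(xintercept = cutoff, linetype = "dashed") +
  # solid lines one bandwidth below and above the cutoff
  geom_vline(xintercept = cutoff + c(-bw, bw), linetype = "solid")

sim_scatter
```

Passing a length-two vector to `xintercept` draws both bandwidth lines with a single `geom_vline()` call.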

### Instructions

1. The dataset `air_data` has been loaded for you in `notebook.Rmd`. Calculate the Imbens-Kalyanaraman (IK) optimal bandwidth using the `IKbandwidth()` function. Save the result to `air_ik_bw`.

2. Print `air_ik_bw`. Think about whether the bandwidth seems large or small in relation to the scale of the forcing variable.

3. A scatter plot of AQI (`aqi`) against power plant output (`watts`) with a dashed line at 600 megawatts has been created for you and saved as `air_scatter`. Modify `air_scatter` to add solid vertical lines one bandwidth below and one bandwidth above the cutoff. Save the result to `air_scatter2`.

4. Print `air_scatter2`. Do you think the two device groups within this bandwidth will be similar enough to compare? What might be the tradeoff if the bandwidth is too narrow?