Learn

We’ve broadcast a dictionary to the worker nodes, and everything went well! Now we’d like to know how many “East” versus “West” entries there are. We could try to track the counts with a couple of ordinary variables, but that approach runs into serialization and overhead issues once datasets get really big. Thankfully, Spark has another type of shared variable that solves this problem: accumulator variables.

Accumulator variables are shared variables that tasks on the worker nodes can only add to, while the driver reads the final value. They’re primarily used as counters or sums; conceptually, they’re similar to the sum and count functions in NumPy.

Let’s see how we can implement accumulator variables by counting the number of distinct regions. Since this will be a new dataset, let’s create an RDD first:

region = ['East', 'East', 'West', 'South', 'West', 'East', 'East', 'West', 'North']
rdd = spark.sparkContext.parallelize(region)

We’ll start off by initializing the accumulator variables at zero:

east = spark.sparkContext.accumulator(0)
west = spark.sparkContext.accumulator(0)

Let’s create a function that increments the matching accumulator by one whenever Spark encounters ‘East’ or ‘West’:

def countCoasts(r):
    if 'East' in r:
        east.add(1)
    elif 'West' in r:
        west.add(1)

Next, we’ll run the function against each element in the RDD with the foreach() action.

rdd.foreach(lambda x: countCoasts(x))
print(east) # output: 4
print(west) # output: 3

This seems like a simple concept, but accumulator variables can be very powerful in the right situation. They can keep track of the inputs and outputs of each Spark task, for example by tallying how many records flow into and out of each transformation. Instead of counting east- and west-coast entries, we could count the number of NULL values or the number of records remaining after each transformation, which makes it possible to spot data loss as it happens.
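For example, here’s a minimal sketch of using an accumulator to count missing values while an RDD is processed. The (name, score) record layout and the sample data are made up for illustration:

missing_scores = spark.sparkContext.accumulator(0)

def check_for_null(record):
    # record is assumed to be a (name, score) tuple; adjust for your data
    name, score = record
    if score is None:
        missing_scores.add(1)

records = spark.sparkContext.parallelize([('Ana', 1490), ('Ben', None), ('Cal', 1530)])
records.foreach(check_for_null)
print(missing_scores) # output: 1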

This doesn’t mean you should attach accumulator variables to everything, though. It’s best to avoid updating accumulators inside transformations. Whenever Spark has to re-execute a task (after an exception, for example), any accumulator updates made in that task’s transformations are applied again, inflating the count. Spark only guarantees that accumulator updates made inside actions are applied exactly once.

Accumulators can be great as debugging or summary tools, but they’re not infallible when used in transformations.
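To see why, consider this sketch (assuming the RDD is not cached): the map() lineage is recomputed for every action, so an accumulator updated inside the transformation ends up double-counting:

seen = spark.sparkContext.accumulator(0)

def tag(x):
    seen.add(1)  # update inside a transformation -- not guaranteed to run exactly once
    return x

mapped = spark.sparkContext.parallelize([1, 2, 3]).map(tag)
mapped.count()  # first action: seen is now 3
mapped.count()  # lineage recomputed: seen is now 6
print(seen)     # output: 6, even though there are only 3 elements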

Instructions

1.

Use an accumulator variable to count the number of students who have scored over 1500 on their SATs.

First, create the accumulator variable that starts at 0 and name it sat_1500. Confirm sat_1500 is an accumulator variable by running the provided code in the same code cell.
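If you get stuck, this step can be as short as one line, mirroring the east and west accumulators above:

sat_1500 = spark.sparkContext.accumulator(0)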

2.

Next, create a function called count_high_sat_score() that we can call to increment our accumulator whenever it encounters a score over 1500.

Note: Your code will only be tested to see if you’ve created a function called count_high_sat_score(). To confirm your function works correctly, view the hint and solution, or proceed to the next step to test the output after the function is applied.
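One possible sketch, assuming each element of rdd_broadcast is a tuple whose SAT score sits at index 1 (adjust the index to match your data):

def count_high_sat_score(x):
    # x[1] is assumed to hold the SAT score
    if x[1] > 1500:
        sat_1500.add(1)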

3.

Call count_high_sat_score() in an action that will apply the function to each element in rdd_broadcast. Confirm count_high_sat_score() and its application worked correctly by running the provided code in the same code cell.

Did your accumulator variable count correctly?

Note: You can reset your accumulator count by running the code from the first task again. If your count still isn’t correct, view the hint and solution code for the previous task to make sure count_high_sat_score() was written correctly in the last step.
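For reference, a sketch of what the final call might look like, again assuming the tuple layout described above:

rdd_broadcast.foreach(lambda x: count_high_sat_score(x))
print(sat_1500)  # the number of students scoring over 1500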
