Learn

You’ve broadcast a dictionary to your nodes, and everything went well! Now you’re curious how many east coast versus west coast entries there are. We could try to track the counts with a couple of ordinary variables, but we’d run into serialization and overhead issues as datasets get really big. Thankfully, Spark has another type of shared variable that solves this issue: accumulator variables.

Unlike broadcast variables, accumulator variables can be updated, and they are primarily used as counters or sums. Conceptually, they’re similar to the sum and count functions in NumPy.

Let’s see how we can implement accumulator variables by counting how many east and west coast entries appear in a dataset of regions. Since this will be another new dataset, let’s create an RDD first:

region = ['East', 'East', 'West', 'South', 'West', 'East', 'East', 'West', 'North']
rdd = spark.sparkContext.parallelize(region)

We’ll start off by initializing the accumulator variables to zero:

east = spark.sparkContext.accumulator(0)
west = spark.sparkContext.accumulator(0)

Let’s create a function to increment each accumulator by one whenever Spark encounters ‘East’ or ‘West’:

def countCoasts(r):
    if 'East' in r:
        east.add(1)
    elif 'West' in r:
        west.add(1)

We’ll run the function we created against each element in the RDD using the foreach() action.

rdd.foreach(lambda x: countCoasts(x))
print(east) # output: 4
print(west) # output: 3
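
Printing the accumulator object shows its current total. If you need the raw number on the driver, the accumulator’s value attribute exposes it:

print(east.value) # output: 4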

This seems like a simple concept, but accumulator variables can be very powerful in the right situation. They can keep track of the inputs and outputs of each Spark task by aggregating the size of each subsequent transformation. Instead of counting the number of east or west coast states, we could count the number of NULL values or the resulting size of each transformation, which is an important way to monitor for data loss.
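
For example, here is a minimal sketch of that idea. The records list and the check_for_nulls() function are illustrative, not part of the lesson’s dataset:

records = [('East', 1200), ('West', None), ('South', 980), (None, 1500)]
records_rdd = spark.sparkContext.parallelize(records)

null_count = spark.sparkContext.accumulator(0)

def check_for_nulls(record):
    # increment the accumulator whenever a record contains a missing value
    if None in record:
        null_count.add(1)

records_rdd.foreach(check_for_nulls)
print(null_count) # output: 2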

This doesn’t mean you should add accumulator variables to everything, though. It’s best to avoid using accumulators inside transformations. If a task fails and is re-executed (or a speculative copy of it runs), any accumulator updates made in a transformation can be applied more than once, inflating the count. For accumulator updates made inside actions, Spark guarantees that each task’s update is applied exactly once.

Accumulators can be great as debugging or summary tools, but they’re not infallible when used in transformations.
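
To make the caveat concrete, here is a sketch of the risky pattern versus the safer one. The tag_region() helper is illustrative, not part of the lesson:

# Risky: map() is a transformation, so this work is lazy and its tasks
# may be re-executed; the accumulator could then be incremented more
# than once for the same element.
def tag_region(r):
    if 'East' in r:
        east.add(1)
    return r

tagged = rdd.map(tag_region)
tagged.count() # the accumulator only changes once an action runs

# Safer: foreach() is an action, so Spark applies each task's
# accumulator updates exactly once.
rdd.foreach(countCoasts)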

Instructions

1.

Let’s use an accumulator variable to count the number of students who have scored over 1500 on their SATs. First, create the accumulator variable that starts at 0 and name it sat_1500.

2.

Let’s create a function called count_high_sat_score() that we can call to increment our accumulator whenever it encounters a score of over 1500.

3.

Now we can tie together what we developed in the past two checkpoints by calling count_high_sat_score() in an action that applies the function to each element of rdd_broadcast.

4.

Let’s see how many students scored over 1500 on their SATs by printing the accumulator variable that our function has been incrementing! (A sketch covering all four checkpoints follows these instructions.)
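
If you want to check the shape of a solution, here is a minimal sketch of all four checkpoints. It assumes each element of rdd_broadcast is a (name, score) pair with the SAT score in the second position; the exact structure in the exercise may differ.

# Checkpoint 1: an accumulator starting at 0
sat_1500 = spark.sparkContext.accumulator(0)

# Checkpoint 2: increment the accumulator for scores over 1500
def count_high_sat_score(element):
    if element[1] > 1500: # assumes the score is the second field
        sat_1500.add(1)

# Checkpoint 3: apply the function to each element with an action
rdd_broadcast.foreach(count_high_sat_score)

# Checkpoint 4: print the result
print(sat_1500)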
