Learn

By now we’ve talked endlessly about the benefits of distributing our data across multiple nodes and allowing for parallel processing, but what happens when we don’t want our data to be distributed?

Imagine having an RDD containing two-letter state abbreviations. Maybe we run into a situation where we want the region instead of the state. Currently, our RDD is partitioned in the Spark cluster, and we don’t know which nodes contain which states.

In this situation, it makes more sense to send the conversion information to all nodes because it’s very likely that each node will contain multiple distinct states. Rather than repeatedly shuffling partitioned conversion information between nodes, we can give each node a complete copy. Spark calls this kind of shared, read-only data a broadcast variable. Let’s see how we can implement one to convert the abbreviations!

Let’s start off by creating our new RDD and its conversion dictionary:

states = ['FL', 'NY', 'TX', 'CA']
rdd = spark.sparkContext.parallelize(states)
region = {"NY": "East", "CA": "West", "TX": "South", "FL": "South"}

We can then broadcast our region dictionary and apply the conversion to each element in the RDD with our map function:

broadcast_var = spark.sparkContext.broadcast(region)
result = rdd.map(lambda x: broadcast_var.value[x]).take(4)
# output: ['South', 'East', 'South', 'West']
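One caveat with the lookup above: indexing with `broadcast_var.value[x]` raises a `KeyError` if the RDD contains an abbreviation that is missing from the dictionary. The sketch below shows one possible way to guard against that. The helper name `to_region` and the `"Unknown"` default are hypothetical, chosen only for illustration, and the helper is checked here against a plain dictionary so it runs without a Spark cluster.

```python
def to_region(abbrev, lookup, default="Unknown"):
    """Return the region for a state abbreviation, or `default` if it is missing."""
    return lookup.get(abbrev, default)

region = {"NY": "East", "CA": "West", "TX": "South", "FL": "South"}

# Local check with a plain dict; "WA" is deliberately not in the dictionary.
local_result = [to_region(s, region) for s in ["FL", "NY", "WA"]]
print(local_result)  # ['South', 'East', 'Unknown']

# With the broadcast variable created earlier, the same helper could be used as:
# rdd.map(lambda x: to_region(x, broadcast_var.value)).take(4)
```

Because `broadcast_var.value` is an ordinary Python dictionary on each node, any dictionary method such as `.get()` works on it directly.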

Broadcast variables are Spark’s efficient method of sharing read-only data amongst its nodes (they are one kind of shared variable). They improve performance by reducing data transfer overhead, because each node keeps a cached copy of the required object instead of receiving it again with every task. However, we should avoid broadcasting large amounts of data: the whole object has to be serialized, sent over the network, and held in memory on every node.
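To get a rough feel for how much data a broadcast would move, we can pickle the object locally and measure it. This is only a sketch: it assumes the broadcast value is pickled in roughly this form before being sent, and the helper `pickled_size_bytes` and the threshold `MAX_BROADCAST_BYTES` are hypothetical names and values, not Spark settings.

```python
import pickle

region = {"NY": "East", "CA": "West", "TX": "South", "FL": "South"}

def pickled_size_bytes(obj):
    """Approximate the payload size by pickling the object locally."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

size = pickled_size_bytes(region)
print(f"region dictionary: about {size} bytes when pickled")

# Hypothetical cutoff for illustration only; pick a limit that suits your cluster.
MAX_BROADCAST_BYTES = 10 * 1024 * 1024
safe_to_broadcast = size < MAX_BROADCAST_BYTES
print("small enough to broadcast:", safe_to_broadcast)
```

A four-entry dictionary like `region` is tiny, so it is a comfortable candidate for broadcasting. A lookup table with millions of entries would be a different story.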

Instructions

1.

Instead of leaving the states as two-letter abbreviations, let’s convert them to their full names. The dictionary states has been provided for you in notebook.ipynb. This dictionary contains the names and abbreviations for four states.

First, broadcast the states dictionary to the Spark cluster. Save this object as broadcastStates.

2.

Now we can reference broadcastStates to map the two-letter abbreviations to their full names. Let’s save the transformed RDD as rdd_broadcast.

3.

Let’s double check our resulting RDD by printing its elements.
