By now we’ve talked endlessly about the benefits of distributing our data across multiple nodes and allowing for parallel processing, but what happens when we don’t want our data to be distributed?
Imagine having an RDD containing two-letter state abbreviations. Maybe we run into a situation where we want the region instead of the state. Currently, our RDD is partitioned in the Spark cluster, and we don’t know which nodes contain which states.
In this situation, it makes more sense to send the conversion information to all nodes, because each node will very likely contain multiple distinct states. Instead of repeatedly shuffling the partitioned conversion information between nodes, we can give every node its own complete copy up front. These shared, read-only copies are what Spark calls broadcast variables. Let's see how we can implement them to convert the abbreviations!
Let’s start off by creating our new RDD and its conversion dictionary:
states = ['FL', 'NY', 'TX', 'CA']
rdd = spark.sparkContext.parallelize(states)
region = {"NY": "East", "CA": "West", "TX": "South", "FL": "South"}
We can then broadcast our region dictionary and apply the conversion to each element in the RDD with our map function:
broadcast_var = spark.sparkContext.broadcast(region)
result = rdd.map(lambda x: broadcast_var.value[x]).take(4)
# Output: ['South', 'East', 'South', 'West']
Broadcast variables are Spark's efficient method of sharing variables among its nodes (they are also known as shared variables). They improve performance by reducing data-transfer overhead, because each node already holds a cached copy of the required object. However, we should never broadcast large amounts of data, since the object would be too costly to serialize and send over the network.
Instructions
Instead of leaving the states as two-letter abbreviations, let's convert them to their full names. The dictionary states has been provided for you in notebook.ipynb. This dictionary contains the names and abbreviations for four states.
First, broadcast the states dictionary to the Spark cluster. Save this object as broadcastStates.
Now we can reference broadcastStates to map the two-letter abbreviations to their full names. Let's save the transformed RDD as rdd_broadcast.
Let's double-check our resulting RDD by printing its elements.