Learn

Imagine having an RDD containing two-letter state abbreviations.

# list of states states = ['FL', 'NY', 'TX', 'CA', 'NY', 'NY', 'FL', 'TX'] # convert to RDD states_rdd = spark.sparkContext.parallelize(states)

However, we want the region instead of the state. Regions are groupings of states based on their geographic location, such as “East” or “South”. Currently, our RDD is partitioned in the Spark cluster, and we don’t know which nodes contain data on which states.

In this situation, we need to send the conversion information to all nodes because it’s very likely that each node will contain multiple distinct states. We can provide each node with information on which states belong in each region. This information that is made available to all nodes is what Spark calls broadcast variables. Let’s see how we can implement them to convert the abbreviations!

Let’s start off by creating a conversion dictionary called region that matches each state to its region:

# dictionary of regions region = {"NY":"East", "CA":"West", "TX":"South", "FL":"South"}

We can then broadcast our region dictionary and apply the conversion to each element in the RDD with our map function:

# broadcast region dictionary to nodes broadcast_var = spark.sparkContext.broadcast(region) # map regions to states result = states_rdd.map(lambda x: broadcast_var.value[x]) # view first four results result.take(4) # output : [‘South’, ‘East’, ‘South’, ‘West’]

This is Spark’s efficient method of sharing variables amongst its nodes (also known as shared variables). They ultimately improve performance by decreasing the amount of data transfer overhead because each node already has a cached copy of the required object. However, it should be noted that we would never want to broadcast large amounts of data because the size would be too much to serialize and send through the network.

Instructions

1.

Instead of leaving the states as two-letter abbreviations, let’s convert them to their full names. The dictionary states has been provided for you in the notebook.ipynb. This dictionary contains the names and abbreviations for four states.

First, broadcast the states dictionary to Spark Cluster. Save this object as broadcastStates. Confirm you have created a broadcast variable by running the provided code in the same code cell.

2.

Now reference broadcastStates to map the two-letter abbreviations to their full names. Save the transformed RDD as rdd_broadcast. Confirm you did the transformation correctly by running the provided code in the same code cell.

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?