Learn

You may have noticed that the transformation executed rather quickly! That’s because it didn’t execute at all. Unlike transformations in Pandas, which we call eager, transformations in Spark are lazy in that they are not performed until an action is called.
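
To see this for yourself, here is a minimal sketch (assuming a SparkSession named spark, as in the example later in this lesson):

# Assumes a SparkSession named `spark`
rdd = spark.sparkContext.parallelize(range(1000000))
mapped = rdd.map(lambda x: x + 1)  # returns instantly: nothing runs yet
print(mapped.count())  # count() is an action, so the map finally executes
# output: 1000000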

So, why are Spark transformations lazy? Simply put, Spark queues up transformations so that, when an action is finally called, it can optimize the whole chain and cut overhead. Let's say you wanted to apply a map and a filter to your RDD:

# input RDD = [1, 2, None, 4, 5]
rdd.map(lambda x: x + 1).filter(lambda x: x is not None)

Spark might instead start by loading only the non-null values into memory and perform the map last. This saves memory and time because less data is loaded and the lambda is mapped over fewer elements.
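
Conceptually, the optimized pipeline behaves like this hand-reordered version (a sketch of the idea, not literally what Spark generates):

# filter first, so the map touches fewer elements
rdd.filter(lambda x: x is not None).map(lambda x: x + 1)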

Let’s go over a few familiar actions.

Starting with the basics, how do we view our RDD? There are a few options available to us, but the two we’ll focus on are collect() and take().

We use collect() if we want to return the entire transformed RDD:

rdd.map(lambda x: x+1).collect()

The collect() function can be dangerous with large RDDs, since it pulls every element back to the driver, so if we only want n elements instead, we can use take(n):

rdd.map(lambda x: x+1).take(n)
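
For example, on a small RDD (again assuming the SparkSession spark from this lesson):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x + 1).collect())  # output: [2, 3, 4, 5, 6]
print(rdd.map(lambda x: x + 1).take(2))    # output: [2, 3]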

There are also quite a few actions that summarize RDDs, and reduce() is one of the most important. If you've never used this function before, don't worry: the structure is almost identical to map(), except the lambda takes two arguments. Let's say we want to add up all the values in the RDD; we can use reduce() to calculate the sum:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
add_one = rdd.map(lambda x: x + 1)  # new RDD = [2, 3, 4, 5, 6]
print(add_one.reduce(lambda x, y: x + y))
# output: 20

reduce() is powerful because it lets us apply arbitrary operations to an RDD, freeing us from hunting for library functions that might not exist. However, it certainly has limitations, which we'll dive into in the next exercise.
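
For instance, the same pattern handles any associative, commutative operation, such as a product or a maximum (a sketch reusing the rdd defined above):

print(rdd.reduce(lambda x, y: x * y))              # product of [1, 2, 3, 4, 5]: 120
print(rdd.reduce(lambda x, y: x if x > y else y))  # maximum: 5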

The key thing about actions is that, like transformations, they take an RDD as input, but they always return a value instead of a new RDD.
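
You can see the difference directly, again reusing rdd from above: a transformation hands back another RDD, while an action hands back a plain Python value:

print(type(rdd.map(lambda x: x + 1)))  # an RDD subclass (PipelinedRDD)
print(type(rdd.count()))               # <class 'int'>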

Instructions

1.

The code in the previous exercise didn’t actually execute since Spark transformations are lazy. Let’s execute the transformation by applying an action to view 5 elements in the transformed RDD and save this result as rdd_action.

2.

Let's double-check our resulting RDD by printing its elements.

3.

Let's combine what we've learned in the transformation and action lessons to calculate the students' average grade. Print the resulting class average. (One possible sketch for all three steps appears below.)
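
A possible solution sketch for all three steps (the names rdd and grades_rdd are assumptions about the exercise workspace; your variable names may differ):

# 1. take(5) is an action, so the lazy transformation finally runs
rdd_action = rdd.map(lambda x: x + 1).take(5)

# 2. take() returns a plain Python list, so we can print it directly
print(rdd_action)

# 3. A transformation (map) plus actions (reduce, count) for the average;
#    assumes grades_rdd holds (student, grade) pairs
grades = grades_rdd.map(lambda pair: pair[1])
print(grades.reduce(lambda x, y: x + y) / grades.count())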
