Learn

You may have noticed that transformations execute rather quickly! That’s because they didn’t execute at all. Spark executes transformations only when an action is called to return a value. This delay is why we call Spark transformations lazy. We call the transformations we do in pandas eager because they execute immediately.

So, why are Spark transformations lazy? Spark queues up the transformations so that, once an action is called, it can plan the whole chain at once and reduce overhead. Let’s say that we wanted to apply a map and filter to our RDD:

rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1).filter(lambda x: x>3)

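Instead of running map over the whole RDD, storing the result, and then running filter, Spark can pipeline the two transformations and apply them to each element in a single pass. This saves memory and time because the intermediate RDD is never materialized. Reordering steps, such as filtering before mapping, is an optimization Spark applies to DataFrame queries, where it can inspect the expressions; it cannot look inside the Python lambdas passed to an RDD, and here swapping the two steps would change the result.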
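Lazy evaluation can be illustrated without a cluster. The sketch below is plain Python, not Spark: the built-in map and filter return lazy iterators, which behave loosely like queued transformations, and list() stands in for an action. The names add_one, calls, and pipeline are made up for this illustration.

```python
calls = []  # records every time the mapping function actually runs

def add_one(x):
    calls.append(x)
    return x + 1

# Building the pipeline: as with Spark transformations, nothing runs yet.
pipeline = filter(lambda x: x > 3, map(add_one, [1, 2, 3, 4, 5]))
calls_before = list(calls)  # snapshot taken before anything is consumed

# Consuming the pipeline plays the role of an action.
result = list(pipeline)

print(calls_before)  # []
print(calls)         # [1, 2, 3, 4, 5]
print(result)        # [4, 5, 6]
```

The empty snapshot shows that defining the pipeline did no work; the mapping function only ran once the result was requested.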

In the last exercise, Spark executed our transformations only when the action collect() was called to return the entire contents of the new RDD as a list. We generally don’t want to use collect() to pull large amounts of data into memory, so we can use take(n) to view the first n elements of a large RDD.

# input RDD [1,2,3,4,5]
rdd.take(3)
[1, 2, 3]
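As a rough analogy for why take(n) is cheaper than collect(), the plain-Python sketch below pulls only the first three items from a lazy sequence of one million. It is not Spark code: itertools.islice stands in for take(), and the names tag, seen, and lazy_data are invented for the example. Spark's take() works partition by partition, so it may process more than n elements, but the idea of stopping early is similar.

```python
from itertools import islice

seen = []  # elements that were actually processed

def tag(x):
    seen.append(x)
    return x

# A lazy sequence over one million numbers; nothing is processed yet.
lazy_data = map(tag, range(1, 1_000_001))

# Pull only the first three elements, similar in spirit to rdd.take(3).
first_three = list(islice(lazy_data, 3))

print(first_three)  # [1, 2, 3]
print(seen)         # [1, 2, 3]
```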

We can use the action reduce() to combine the elements of our RDD into a single value by repeatedly applying a two-argument function. For example, say we want to add up all the values in the RDD. We can use reduce() with a lambda that adds two elements at a time until only the total remains.

# input RDD [1,2,3,4,5]
rdd.reduce(lambda x,y: x+y)
15

reduce() is powerful because it lets us apply many arbitrary operations to an RDD, which frees us from searching for library functions that might not exist. However, it has limitations, which we’ll dive into in the next exercise.
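To make the "arbitrary operations" point concrete, here is a small sketch using Python's functools.reduce as a local stand-in for the RDD action. One caveat: functools.reduce folds strictly left to right on one machine, while Spark's reduce() combines values within and across partitions, which is why Spark expects the function to be commutative and associative. The three lambdas below all meet that requirement.

```python
from functools import reduce

values = [1, 2, 3, 4, 5]

# The same action with three different two-argument functions.
total = reduce(lambda x, y: x + y, values)                # sum
largest = reduce(lambda x, y: x if x > y else y, values)  # maximum
product = reduce(lambda x, y: x * y, values)              # product

print(total, largest, product)  # 15 5 120
```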

The key thing about actions is that, like transformations, they take an RDD as input, but they will always output a value instead of a new RDD.

Instructions

1.

We have provided the code to create the RDD called rdd_transformation in the first code cell. Execute the transformation by applying take() to view the first five elements of rdd_transformation.

2.

Let’s calculate the average grade for rdd_transformation. The grades are stored in the third element of each tuple.

First, use a transformation and action together to get the sum of the grades in rdd_transformation and save the result as sum_gpa. View the results by running the provided code in the same code cell.

3.

The code rdd_transformation.count() will give a count of the number of tuples in rdd_transformation. Divide sum_gpa by rdd_transformation.count() to get the average grade.
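The shape of steps 2 and 3 can be sketched in plain Python before trying it on the RDD. The records below are hypothetical; the only detail taken from the exercise is that the grade is the third element of each tuple. Here map and functools.reduce stand in for the RDD transformation and action, and len() stands in for count().

```python
from functools import reduce

# Hypothetical records: (name, course, grade). Only the position of the
# grade (third element) comes from the exercise text.
records = [
    ("student_a", "math", 3.0),
    ("student_b", "math", 4.0),
    ("student_c", "art", 2.0),
    ("student_d", "art", 3.0),
]

# Transformation step: keep only the grade from each tuple.
grades = map(lambda row: row[2], records)

# Action step: fold the grades into one total.
sum_gpa = reduce(lambda x, y: x + y, grades)

# Divide by the number of records to get the average.
average = sum_gpa / len(records)

print(sum_gpa, average)  # 12.0 3.0
```

On an RDD, the corresponding calls would be map() followed by reduce(), with count() in place of len().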
