Learn

Many of the Spark functions we use on RDDs are similar to those we regularly use in Python. We can also use lambda expressions within RDD functions. A lambda lets us apply a simple operation to an object in a single line without defining a separate named function. Check out the following example of a lambda expression that adds the number 1 to its input.

add_one = lambda x: x+1 # apply x+1 to x
print(add_one(10)) # this will output 11
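For comparison, here is a minimal plain-Python sketch that places the lambda next to an equivalent named function. The name `add_one_def` is hypothetical and used only for illustration.

```python
# A lambda and an equivalent named function.
# add_one_def is a hypothetical name, not part of the lesson.
add_one = lambda x: x + 1

def add_one_def(x):
    """Return x plus 1, the same operation as the lambda above."""
    return x + 1

# Both callables give the same result for the same input.
print(add_one(10))      # 11
print(add_one_def(10))  # 11
```

The lambda form is convenient when the operation is short and only needed once, for example as an argument to another function.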

Let’s introduce a couple of PySpark functions that we may already be familiar with:

map() applies an operation to each element of the RDD, so it’s often constructed with a lambda expression. This map example adds 1 to each element in our RDD:

rdd = spark.sparkContext.parallelize([1,2,3,4,5])
rdd.map(lambda x: x+1)
# output RDD [2,3,4,5,6]
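Since RDD functions resemble their Python counterparts, the same operation can be sketched with Python's built-in map() on an ordinary list. This is a plain-Python analogue that needs no Spark session; the variable names `numbers` and `mapped` are assumptions made for this illustration.

```python
# Plain-Python analogue of rdd.map(lambda x: x + 1); no Spark required.
numbers = [1, 2, 3, 4, 5]

# map() applies the lambda to every element; list() materializes the result.
mapped = list(map(lambda x: x + 1, numbers))

print(mapped)   # [2, 3, 4, 5, 6]
print(numbers)  # [1, 2, 3, 4, 5]  (the input is left unchanged)
```

As with the RDD version, the result is a new collection and the input is not modified.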

If our RDD contains tuples, we can map the lambda expression to the elements with a specific index value. The following code maps the lambda expression to just the first element of each tuple but keeps the others in the output:

# input RDD [(1,2,3),(4,5,6),(7,8,9)]
rdd.map(lambda x: (x[0]+1, x[1], x[2]))
# output RDD [(2,2,3),(5,5,6),(8,8,9)]
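When tuples have many elements, listing every index by hand gets tedious. One possible variation, sketched here in plain Python on a list of tuples, uses slicing to keep everything after the first element. The names `rows` and `bump_first` are hypothetical; the same lambda should also work as an argument to an RDD's map().

```python
# Change only the first element of each tuple and keep the rest via slicing.
# rows and bump_first are hypothetical names for this illustration.
rows = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

# (x[0] + 1,) is a one-element tuple; x[1:] is the remainder of the tuple.
bump_first = lambda x: (x[0] + 1,) + x[1:]

result = [bump_first(row) for row in rows]
print(result)  # [(2, 2, 3), (5, 5, 6), (8, 8, 9)]
```

Writing out each index, as in the lesson's example, is more explicit; slicing is shorter when the tuple length is large or varies.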

filter() allows us to remove or keep data conditionally. If we want to remove all missing values (None in Python) from the following RDD, we can use a lambda expression in our filter:

# input RDD [1,2,None,4,5]
rdd.filter(lambda x: x is not None)
# output RDD [1,2,4,5]
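The condition `x is not None` is worth a closer look, because a looser truthiness test would also drop legitimate values such as 0. The plain-Python sketch below uses the built-in filter() on a list to show the difference; the sample data and variable names are assumptions made for illustration.

```python
# Compare an explicit None check with a truthiness check.
# The sample values and names here are hypothetical.
values = [0, 1, None, 4, 5]

# Keeps every element that is not None, including 0.
kept = list(filter(lambda x: x is not None, values))

# Keeps only truthy elements, so 0 is dropped along with None.
truthy_only = list(filter(lambda x: x, values))

print(kept)         # [0, 1, 4, 5]
print(truthy_only)  # [1, 4, 5]
```

If zero is a valid value in the data, the explicit `is not None` check is the safer choice.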

You may have noticed that each function took an RDD as input and returned an RDD as output. In Spark, functions with this behavior are called transformations. You can find more transformations in the official Spark documentation.

We have one final note about transformations: a transformation on its own does not display anything. To view the contents of an RDD, we need a special function like collect(), which returns the data stored in the RDD as a Python list. So to view the new RDD in the previous example, we would run the following:

rdd.filter(lambda x: x is not None).collect()
[1,2,4,5]
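Spark evaluates transformations lazily, and a function such as collect() is what triggers the computation. Python's built-in map() behaves in a loosely similar way, which the sketch below uses as an analogy rather than as a model of Spark's internals. The helper `add_one_logged` and the `calls` list are hypothetical names introduced only to observe when the work happens.

```python
# Analogy only: Python's map() is lazy, loosely like an RDD transformation.
calls = []  # records every value the function is actually applied to

def add_one_logged(x):
    """Add 1 to x and record that the function was called."""
    calls.append(x)
    return x + 1

lazy = map(add_one_logged, [1, 2, 3])  # nothing has been computed yet
calls_before = len(calls)

materialized = list(lazy)  # forces evaluation, loosely like collect()
calls_after = len(calls)

print(calls_before)  # 0
print(materialized)  # [2, 3, 4]
print(calls_after)   # 3
```

The function runs only when the results are requested, which is why an RDD's contents are not visible until a function like collect() is called.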

Let’s try working with some transformations!

Instructions

1.

You may have noticed that the grades in student_rdd are in decimals. Let’s convert the decimal grades into whole numbers (multiply by 100).

Use the map() function with a lambda and include the other three variables in the new RDD. Save this RDD as rdd_transformation. Confirm your transformation worked by running the provided code in the same code cell.

Note: Remember to run the first code cell in the notebook before running your solution.

2.

Filter rdd_transformation to just those rows with grades above 80 and save the new RDD as rdd_filtered. Confirm your transformation worked by running the provided code in the same code cell.
