RDDs may seem more complicated than DataFrames, but we can also manipulate RDDs using Spark transformations. Transformations are functions that take an RDD as input and will always output a new RDD back. Thankfully, many RDD transformations function identically to DataFrame functions!
Before we get into transformations, we should revisit Lambda expressions because they’re often called within transformations. Lambdas allow us to apply a simple operation to an object without needing to define it as a function. This improves readability by condensing what could be a few lines of code into a single line. Check out the following example of a lambda expression that adds the number 1 to its input.
add_one = lambda x: x+1 # apply x+1 to x print(add_one(10)) # this will output 11
With our quick review of Lambdas complete, we can introduce a few familiar functions that you may have encountered in Python:
.map() applies an operation to each element of the RDD, so it’s often constructed with a lambda expression. This map example adds 1 to each element in our RDD:
# input RDD = [1,2,3,4,5] rdd.map(lambda x: x+1) # output RDD = [2,3,4,5,6]
.groupby() takes its parameter as a grouping category to bucket data accordingly. Let’s check out how we can group the RDD by the first element in the tuples:
# input RDD = [(1,2), (3,4) (1,5)] rdd.groupby(lambda x: x) # output RDD = [(1,(2,5)), (3,4)]
.filter() allows us to conditionally remove or keep data. If we want to remove all NULL values in the RDD we can code it like this:
# input RDD = [1,2,NULL,4,5] rdd.filter(lambda x: x is not None) # output RDD = [1,2,4,5]
There are quite a few PySpark transformations, but thankfully, they’re well documented in the official Spark documentation.
You must be excited to try and map what we’ve learned to some practical examples, and you’re almost there! But let me leave you with one final quirk about transformations: you may notice that your code won’t actually output an RDD. There’s an interesting reason why that happens, but we’ll explore further in the next exercise!
You may have noticed that the grades are in decimals in our RDD, but we want them as percentages. Let’s convert the grades into whole numbers by using the
map() function with a
lambda. Save this RDD as