reduce() function we used previously is a powerful aggregation tool, but there are limitations to the operations it can apply to RDDs. Namely,
reduce() must be commutative and associative due to the nature of parallelized computation.
You’ve probably heard of both those terms back in elementary math class, and they probably make sense to you in that context. However, what do they mean in Spark?
Well, it all ties back to the fact that Spark operates in parallel — tasks that have commutative and associative properties allow for parallelization. The commutative property allows for all parallel tasks to execute and conclude without waiting for another task to complete. The associative property allows Spark to partition and distribute our data to multiple nodes because the result will stay the same no matter how tasks are grouped.
Let’s try to break that down a bit further with math! No matter how you switch up or break down summations, they’ll always have the same result thanks to the commutative and associative properties:
However, this is not the case with division:
The flowchart represents one of the possible ways that our list was partitioned into three nodes and ultimately summed. No matter how our data was partitioned or which summations were completed first, the answer will be 15.
This shows that the commutative and associative properties enable parallel processing because it gives us two very important concepts: the output doesn’t depend on the order in which tasks complete (commutative) nor does it depend on how the data is grouped (associative).
Only operations that are both commutative and associative can be applied with
reduce(), but let’s see this in practice. A couple of code blocks that will sum and divide a dataset of varying partitions have been provided for you in the notebook.ipynb.
With the very handy transformation
glom() we can print out how our data is partitioned and the resulting summation and division of the partitions.
Let’s first run the summations - what do you notice about the result of the summation as the number of partitions grows?
Now let’s first run the divisions - what do you notice about the result of the division as the number of partitions grows?