Apache Spark is a framework that allows us to work with big data. But how do we tell Spark what to do with our data? In this lesson, we’ll get familiar with using PySpark (the Python API for Spark) to load and transform our data in the form of RDDs — resilient distributed datasets.
RDDs are the foundational data structures of Spark. Newer Spark structures like DataFrames are built on top of RDDs. While DataFrames are more commonly used in industry, RDDs are not deprecated and are still called for in certain circumstances. For example, RDDs are useful for processing unstructured data, such as text or images, that doesn’t fit neatly into the tabular structure of a DataFrame.
So what exactly is an RDD? According to our friends at Apache, the formal definition of an RDD is “a fault-tolerant collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.” Those are some complicated words! Let’s break down the three key properties of RDDs together:
- Fault-tolerant or resilient: Spark tracks the lineage of operations that produced each partition, so lost data can be recomputed in the event of failure
- Partitioned or distributed: datasets are split up across the nodes in a cluster
- Operated on in parallel: tasks run on different chunks of the data at the same time, as far as the cluster’s available cores allow
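To make these three properties concrete, here is a minimal plain-Python sketch. It is not PySpark and not how Spark is implemented: `ToyRDD`, `partition`, and `recompute_partition` are hypothetical names invented for illustration, and a local thread pool stands in for the nodes of a cluster. The toy models resilience as recomputation from a recorded list of steps.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size."""
    size, extra = divmod(len(data), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


class ToyRDD:
    """Hypothetical teaching class; not part of PySpark."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # distributed: data lives in chunks
        self.lineage = lineage  # resilient: the recorded steps to rebuild any chunk

    def map(self, fn):
        # Record the step instead of running it; the source chunks are untouched.
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def recompute_partition(self, index):
        # Rebuild one chunk from its source data by replaying the recorded steps.
        chunk = self.source_partitions[index]
        for fn in self.lineage:
            chunk = [fn(x) for x in chunk]
        return chunk

    def collect(self):
        # Parallel: each chunk is handled by its own task (threads stand in for nodes).
        indexes = range(len(self.source_partitions))
        with ThreadPoolExecutor() as pool:
            parts = list(pool.map(self.recompute_partition, indexes))
        return [x for part in parts for x in part]


numbers = ToyRDD(partition(list(range(10)), 3))
squares = numbers.map(lambda x: x * x)

print(numbers.source_partitions)       # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(squares.collect())               # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
print(squares.recompute_partition(1))  # [16, 25, 36]
```

The last line simulates losing the middle chunk: nothing was backed up, yet the chunk can be rebuilt because the object remembers which steps produced it.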
Now that we have a bit more context as to what RDDs are, let’s learn how to create one with PySpark in the next exercise!