A PySpark SQL DataFrame is a distributed collection of data with a specific row and column structure. Under the hood, DataFrames are built on top of RDDs. Like pandas, PySpark SQL DataFrames allow a developer to analyze data more easily than by writing functions directly on underlying data.
DataFrames can be created manually from RDDs using rdd.toDF(["names", "of", "columns"]). In the example below, we create a DataFrame from a manually constructed RDD and name its columns article_title and view_count.
# Create an RDD from a list
hrly_views_rdd = spark.sparkContext.parallelize([
    ["Betty_White", 288886],
    ["Main_Page", 139564],
    ["New_Year's_Day", 7892],
    ["ABBA", 8154]
])

# Convert RDD to DataFrame
hrly_views_df = hrly_views_rdd\
    .toDF(["article_title", "view_count"])
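If you want explicit control over column types rather than relying on inference, the same DataFrame can also be built with spark.createDataFrame and a schema. The following is a minimal sketch, not part of the original example; the types in the schema are assumptions about this data.

from pyspark.sql.types import StructType, StructField, StringType, LongType

# Assumed schema: article_title as a string, view_count as a 64-bit integer
hrly_views_schema = StructType([
    StructField("article_title", StringType(), True),
    StructField("view_count", LongType(), True)
])

# Equivalent to the toDF() call above, but with explicit column types
hrly_views_df = spark.createDataFrame(hrly_views_rdd, hrly_views_schema)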
Let’s take a look at our new DataFrame. We can use the DataFrame.show(n_rows) method to print the first n_rows of a Spark DataFrame. It can also be helpful to pass truncate=False to ensure all columns are visible.
hrly_views_df.show(4, truncate=False)
+--------------+-----------+
| article_title| view_count|
+--------------+-----------+
|   Betty_White|     288886|
|     Main_Page|     139564|
|New_Year's_Day|       7892|
|          ABBA|       8154|
+--------------+-----------+
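Because toDF() only assigns column names, the column types are inferred from the data. If you want to verify what Spark inferred, you can print the schema with DataFrame.printSchema(); with the lists above, you would typically see article_title inferred as a string and view_count as a long, though inference depends on your data.

# Inspect the inferred schema (typical output shown; types may vary with other data)
hrly_views_df.printSchema()
# root
#  |-- article_title: string (nullable = true)
#  |-- view_count: long (nullable = true)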
Great! Now that this data is loaded in as a DataFrame, we can access the underlying RDD with DataFrame.rdd. You likely won’t need the underlying data often, but it can be helpful to keep in mind that a DataFrame is a structure built on top of an RDD. When we check the type of hrly_views_df_rdd, we can see that it’s an RDD!
# Access DataFrame's underlying RDD
hrly_views_df_rdd = hrly_views_df.rdd

# Check object type
print(type(hrly_views_df_rdd))
# <class 'pyspark.rdd.RDD'>
Instructions
Because we learned about SparkSession in the first exercise, all remaining exercises in this lesson will include the code to create a SparkSession named spark for you to use.
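If you are working outside of this lesson’s environment, a SparkSession is typically created with the builder pattern. This is a minimal sketch of the kind of setup those exercises include; the application name here is just a placeholder.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession named `spark`; the app name is a placeholder
spark = SparkSession.builder\
    .appName("learn_spark_sql")\
    .getOrCreate()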
Using the RDD sample_page_views, create a DataFrame named sample_page_views_df with columns named language_code, title, date, and count.
Show the first few rows of the DataFrame in the notebook.
Access the RDD underlying sample_page_views_df and save it as sample_page_views_rdd_restored.
Try running sample_page_views_rdd_restored.collect(). You may notice that the restored data is not exactly identical to the original RDD! Although the data is the same, when we converted it to a DataFrame, PySpark automatically wrapped each record in a Row object. Behind the scenes, Row objects allow Spark to perform calculations more efficiently over large distributed datasets.