Learn

A PySpark SQL DataFrame is a distributed collection of data with a specific row and column structure. Under the hood, DataFrames are built on top of RDDs. Like pandas DataFrames, PySpark SQL DataFrames let a developer analyze data more easily than by writing functions directly on the underlying data.

DataFrames can be created manually from RDDs using rdd.toDF(["names", "of", "columns"]). In the example below, we create a DataFrame from a manually constructed RDD and name its columns article_title and view_count.

# Create an RDD from a list
hrly_views_rdd = spark.sparkContext.parallelize([
    ["Betty_White", 288886],
    ["Main_Page", 139564],
    ["New_Year's_Day", 7892],
    ["ABBA", 8154]
])

# Convert RDD to DataFrame
hrly_views_df = hrly_views_rdd\
    .toDF(["article_title", "view_count"])

Let’s take a look at our new DataFrame. We can use the DataFrame.show(n_rows) method to print the first n_rows of a Spark DataFrame. It can also be helpful to pass truncate=False so that long cell values are printed in full; by default, show() cuts off strings longer than 20 characters.

hrly_views_df.show(4, truncate=False)
+--------------+-----------+
| article_title| view_count|
+--------------+-----------+
|   Betty_White|     288886|
|     Main_Page|     139564|
|New_Year's_Day|       7892|
|          ABBA|       8154|
+--------------+-----------+

Great! Now that this data is loaded in as a DataFrame, we can access the underlying RDD with DataFrame.rdd. You likely won’t need the underlying data often, but it can be helpful to keep in mind that a DataFrame is a structure built on top of an RDD. When we check the type of hrly_views_df_rdd, we can see that it’s an RDD!

# Access DataFrame's underlying RDD
hrly_views_df_rdd = hrly_views_df.rdd

# Check object type
print(type(hrly_views_df_rdd))
# <class 'pyspark.rdd.RDD'>

Instructions

1.

Because we learned about SparkSession in the first exercise, all remaining exercises in this lesson will include the code to create a SparkSession named spark for you to use. Be sure to run these cells!

Using the RDD sample_page_views, create a DataFrame named sample_page_views_df with columns named language_code, title, date, and count.

In the same code cell, add code to show the first five rows of the DataFrame. Set truncate=False so that long cell values are printed in full.

2.

Access the RDD underlying sample_page_views_df and save it as sample_page_views_rdd_restored. In the same code cell, run sample_page_views_rdd_restored.collect() to view the restored RDD.

Note: You may notice that the restored RDD is not identical to the original RDD! Although the data is the same, when we converted the data to a DataFrame, PySpark automatically wrapped each record in a Row object. Behind the scenes, rows allow for more efficient calculations over large distributed data.
