Learn

Once you’ve done some analysis, the next step is often saving the transformed data back to disk for others to use. In this final topic, we’re going to cover how to efficiently save PySpark DataFrames.

Similar to the SparkSession.read property we used to load data, Spark offers a DataFrame.write property for saving it. Let’s perform a slight modification to our original Wikipedia views dataset and save it to disk. The code below uses .select() to keep all columns except the monthly_count column (recall that earlier we discovered this column only contains zeros) and stores the result as hrly_views_slim_df.

Because Spark runs all operations in parallel, it’s typical to write DataFrames to a directory of files rather than a single CSV file. In the example below, Spark splits the underlying dataset and writes multiple CSV files to ./cleaned/csv/views_2022_01_01_000000/. We can also use the mode argument of the .csv() method to overwrite any existing data in the target directory.

hrly_views_slim_df = hrly_views_df\
    .select(['language_code', 'article_title', 'hourly_count'])

hrly_views_slim_df\
    .write.csv('./cleaned/csv/views_2022_01_01_000000/', mode="overwrite")

Using spark.read.csv(), we can read the data back from disk and check whether it looks the same as the DataFrame we saved.

# Read DataFrame back from disk
hrly_views_df_restored = spark.read\
    .csv('./cleaned/csv/views_2022_01_01_000000/')

hrly_views_df_restored.printSchema()
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)

Close, but not quite! It looks like these files didn’t retain information about column headers or datatypes. Unfortunately, there’s no way for a CSV file to retain information about its schema. Each time we read it, we’ll need to tell Spark exactly how to process it.
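For example, here’s a minimal sketch (assuming the same directory path and SparkSession as above) of one way to restore column names and types by passing an explicit schema to the CSV reader:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Schema matching the three columns we saved earlier
views_schema = StructType([
    StructField('language_code', StringType(), True),
    StructField('article_title', StringType(), True),
    StructField('hourly_count', IntegerType(), True),
])

# Supplying the schema up front restores names and types on read
hrly_views_df_restored = spark.read\
    .schema(views_schema)\
    .csv('./cleaned/csv/views_2022_01_01_000000/')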

Luckily, there is a file format called “Parquet” that’s specially designed for big data and solves this problem, among many others. Parquet offers efficient data compression, is faster to analyze than CSV, and preserves information about a dataset’s schema. Let’s try saving this data as Parquet and reading it back instead.

# Write DataFrame to Parquet
hrly_views_slim_df\
    .write.parquet('./cleaned/parquet/views_2022_01_01_000000/', mode="overwrite")

# Read Parquet as DataFrame
hrly_views_df_restored = spark.read\
    .parquet('./cleaned/parquet/views_2022_01_01_000000/')

# Check DataFrame's schema
hrly_views_df_restored.printSchema()
root
 |-- language_code: string (nullable = true)
 |-- article_title: string (nullable = true)
 |-- hourly_count: integer (nullable = true)

Great, now anyone who wants to query this data can do so with the much more efficient Parquet data format!

Instructions

1.

Create a new DataFrame from wiki_uniq_df with only two columns, domain and uniq_human_visitors. Name this new DataFrame uniq_human_visitors_df.

2.

Now that we’ve modified a DataFrame, let’s persist the results. Save uniq_human_visitors_df to a local directory, ./results/csv/uniq_human_visitors/.

3.

Parquet is PySpark’s preferred data format, and saving our results in this format could expedite future analysis. Let’s persist the results as Parquet files, too. Save uniq_human_visitors_df to a local directory, ./results/pq/uniq_human_visitors/. (A combined sketch of all three steps appears after these instructions.)

You can check the status of these files by clicking on the Jupyter icon in the upper left and navigating through the file tree.
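If you’d like a starting point, here’s a minimal sketch of all three steps. It assumes wiki_uniq_df and the spark session already exist in the exercise environment; mode="overwrite" is an optional addition so the code can be re-run safely.

# 1. Keep only the domain and uniq_human_visitors columns
uniq_human_visitors_df = wiki_uniq_df\
    .select(['domain', 'uniq_human_visitors'])

# 2. Save the result as a directory of CSV files
uniq_human_visitors_df\
    .write.csv('./results/csv/uniq_human_visitors/', mode="overwrite")

# 3. Save the result as a directory of Parquet files
uniq_human_visitors_df\
    .write.parquet('./results/pq/uniq_human_visitors/', mode="overwrite")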
