The Spark ecosystem is expansive, but the skills covered here should help you as you branch out and run your own analyses. In this lesson you've learned:
- How to construct Spark DataFrames from raw Python data and from existing Spark RDDs (first sketch below).
- How to read data from disk into Spark DataFrames and write it back out, including an introduction to file formats optimized for big-data workloads (second sketch below).
- How to perform data exploration and cleaning on distributed data (third sketch below).
- How the PySpark SQL API's DataFrames let you analyze distributed data more easily than working directly with RDDs (fourth sketch below).
- How to use the PySpark SQL API to query your datasets with standard SQL (fifth sketch below).
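The first sketch shows both construction paths. The column names and sample rows are hypothetical placeholders, not the lesson's dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lesson-recap").getOrCreate()

# From raw Python data: a list of tuples plus a list of column names.
rows = [("Alice", 34), ("Bob", 29)]
df = spark.createDataFrame(rows, ["name", "age"])

# From an existing RDD: parallelize the raw data, then convert it.
rdd = spark.sparkContext.parallelize(rows)
df_from_rdd = rdd.toDF(["name", "age"])

df.show()
df_from_rdd.show()
```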
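The second sketch covers disk I/O, continuing with the `spark` session from above. The file paths are placeholders, and Parquet stands in as one columnar format commonly used for big-data workloads:

```python
# Read a CSV file from disk, inferring column types from the data.
df = spark.read.csv("data.csv", header=True, inferSchema=True)

# Write it back out as Parquet, a columnar format well suited to
# big-data workloads; overwrite any previous output at this path.
df.write.mode("overwrite").parquet("data.parquet")

# Parquet files carry their own schema, so reading back needs no options.
df2 = spark.read.parquet("data.parquet")
```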
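The third sketch is a minimal exploration-and-cleaning pass, reusing the hypothetical `df` and its `name` and `age` columns from the sketches above:

```python
df.printSchema()                     # column names and types
df.describe().show()                 # summary statistics for numeric columns
df.select("name").distinct().show()  # distinct values in a single column

# Cleaning: drop rows containing nulls, then filter out implausible values.
cleaned = df.dropna().filter(df.age > 0)
cleaned.show()
```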
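The fourth sketch contrasts the two APIs on the same aggregation; again the column names are hypothetical:

```python
from pyspark.sql import functions as F

# DataFrame API: the average age per name is one readable expression.
df.groupBy("name").agg(F.avg("age").alias("avg_age")).show()

# The equivalent RDD code needs manual key-value plumbing to track
# running sums and counts before it can compute the same averages.
averages = (
    df.rdd
    .map(lambda row: (row["name"], (row["age"], 1)))
    .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
    .mapValues(lambda s: s[0] / s[1])
    .collect()
)
print(averages)
```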
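The fifth sketch registers the DataFrame as a temporary view so it can be queried with standard SQL; the view name `people` is an assumption for illustration:

```python
# Register the DataFrame under a name that SQL queries can reference.
df.createOrReplaceTempView("people")

spark.sql("""
    SELECT name, AVG(age) AS avg_age
    FROM people
    GROUP BY name
    ORDER BY avg_age DESC
""").show()
```

Note that `spark.sql` returns an ordinary DataFrame, so SQL queries and DataFrame operations can be mixed freely in the same analysis.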
Instructions
The workspace has been loaded with the dataset we've been working with for the past few exercises. Feel free to query or modify the data here; you can also access the dataset in its raw form from Wikipedia's Archives.