While we can directly analyze data using Spark’s Resilient Distributed Datasets (RDDs), we may not always want to perform complicated analysis directly on RDDs. Luckily, Spark offers a module called Spark SQL that can make common data analysis tasks simpler and faster. In this lesson, we’ll introduce Spark SQL and demonstrate how it can be a powerful tool for accelerating the analysis of distributed datasets.
The name Spark SQL is an umbrella term, as there are several ways to interact with data when using this module. We’ll cover two of these methods using the PySpark API:
First, we’ll learn the basics of inspecting and querying data in a Spark DataFrame.
Then, we’ll perform these same operations using standard SQL directly in our PySpark code.
Before using either method, we must start a `SparkSession`, the entry point to Spark SQL. The session is a wrapper around a `SparkContext` and contains all the metadata required to start working with distributed data.
The code below uses `SparkSession.builder` to set configuration parameters and create a new session. In the following example, we set one configuration parameter (`spark.app.name`) and call the `.getOrCreate()` method to initialize the new `SparkSession`:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.app.name', 'learning_spark_sql')\
    .getOrCreate()
```
We can access the `SparkContext` for a session through its `sparkContext` attribute:

```python
print(spark.sparkContext)
# <SparkContext master=local[*] appName=learning_spark_sql>
```
From here, we can use the `SparkSession` to create DataFrames, read external files, register tables, and run SQL queries over saved data. When we're done with our analysis, we can clear the Spark cache and terminate the session by calling `.stop()` on the session object (e.g., `spark.stop()`). Now that we're familiar with the basics of `SparkSession`, the next step is to begin using Spark SQL to interact with data!
How to Use Your Jupyter Notebook:
- You can run a cell in the Notebook to the right by placing your cursor in the cell and clicking the Run button or pressing the Shift + Enter keys.
- When you are ready to evaluate the code in your Notebook, press the Save button at the top of the Notebook or the Ctrl/⌘ + s keys before clicking the Test Work button at the bottom. Be sure to save your solution code in the cell marked `## YOUR SOLUTION HERE ##` or it will not be evaluated.
- When you are ready to move on, click the button to proceed to the next exercise.
This notebook contains a `SparkSession` with a few configuration options passed to the builder. You can refer to the SparkSession configuration documentation for a set of common options passed to the builder.
Try running the notebook and looking at some of the configuration options used. What do you think the `spark.app.name` option does? Try stopping the session and restarting it with a new `spark.app.name`. Did it have the expected effect?