Learn

While we can directly analyze data using Spark’s Resilient Distributed Datasets (RDDs), we may not always want to perform complicated analysis directly on RDDs. Luckily, Spark offers a module called Spark SQL that can make common data analysis tasks simpler and faster. In this lesson, we’ll introduce Spark SQL and demonstrate how it can be a powerful tool for accelerating the analysis of distributed datasets.

The name Spark SQL is an umbrella term, as there are several ways to interact with data when using this module. We’ll cover two of these methods using the PySpark API:

  • First, we’ll learn the basics of inspecting and querying data in a Spark DataFrame.

  • Then, we’ll perform these same operations using standard SQL directly in our PySpark code.

Before using either method, we must start a SparkSession, the entry point to Spark SQL. The session is a wrapper around a SparkContext and contains all the metadata required to start working with distributed data.

The code below uses SparkSession.builder to set configuration parameters and create a new session. Here, we set a single configuration parameter (spark.app.name) and call the .getOrCreate() method to initialize the new SparkSession.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.app.name', 'learning_spark_sql')\
    .getOrCreate()

We can access the SparkContext for a session with SparkSession.sparkContext.

print(spark.sparkContext) # <SparkContext master=local[*] appName=learning_spark_sql>

From here, we can use the SparkSession to create DataFrames, read external files, register tables, and run SQL queries over saved data. When we’re done with our analysis, we can clear the Spark cache and terminate the session with SparkSession.stop(). Now that we’re familiar with the basics of SparkSession, the next step is to begin using Spark SQL to interact with data!
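As a rough sketch of that workflow, we might create a small DataFrame, register it as a temporary view, query it with SQL, and then stop the session. The sample rows, column names, and the people view below are made up purely for illustration and are not part of the lesson's data.

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .config('spark.app.name', 'learning_spark_sql')\
    .getOrCreate()

# Create a small DataFrame from an in-memory list of rows
df = spark.createDataFrame(
    [('Alice', 34), ('Bob', 45)],
    ['name', 'age']
)

# Inspect the DataFrame using the DataFrame API
df.show()

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE age > 40').show()

# Clear the Spark cache and end the session when the analysis is finished
spark.stop()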


How to Use Your Jupyter Notebook:

  • You can run a cell in the Notebook to the right by placing your cursor in the cell and clicking the Run button or pressing the Shift+Enter/Return keys.
  • When you are ready to evaluate the code in your Notebook, click the Save button at the top of the Notebook (or press the Command+S keys) before clicking the Test Work button at the bottom. Be sure to save your solution code in the cell marked ## YOUR SOLUTION HERE ## or it will not be evaluated.
  • When you are ready to move on, click Next.

[Screenshot: the buttons at the top of the Jupyter Notebook interface, with Save and Run highlighted]

Instructions

1.

This notebook contains a SparkSession with a few configuration options passed to the builder. You can refer to the SparkSession configuration documentation for a set of common options passed to the builder.

Try running the notebook and looking at some of the configuration options used. What do you think the spark.app.name option does? Try stopping the session and restarting it with a new spark.app.name. Did it have the expected effect?
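If you get stuck, a minimal sketch of that experiment looks like the following; the replacement app name used here is just an example value, not part of the exercise.

# Stop the existing session, then rebuild it with a different app name
spark.stop()

spark = SparkSession.builder\
    .config('spark.app.name', 'my_renamed_app')\
    .getOrCreate()

# The new appName should appear on the underlying SparkContext
print(spark.sparkContext)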
