PySpark SQL
Lesson 1 of 1
1. While we can directly analyze data using Spark’s Resilient Distributed Datasets (RDDs), we may not always want to perform complicated analysis directly on RDDs. Luckily, Spark offers a module called PySpark SQL…
2. A PySpark SQL DataFrame is a distributed collection of data with a specific row and column structure. Under the hood, DataFrames are built on top of RDDs. Like pandas, PySpark SQL DataFrames allow … A small DataFrame sketch follows this outline.
3. In this exercise, we’ll learn how to pull in larger datasets from external sources. To start, we’ll be using a dataset from Wikipedia that counts views of all articles by hour. For demonstration’s … A data-loading sketch follows this outline.
4. In this exercise, we’re going to start to analyze our pageview data and learn how Spark can help with data exploration. Like pandas, Spark DataFrames offer a series of operations for cleaning, inspecting… An inspection sketch follows this outline.
5. It’s time to start performing some analysis: this is where PySpark SQL really shines. PySpark SQL DataFrames have a variety of built-in methods that can help with analyzing data. Let’s get into a few… A query-method sketch follows this outline.
6. PySpark DataFrames’ query methods are an improvement over performing analysis directly on RDDs. However, working with DataFrame methods still requires some practice, and the code can become quite verbose… A SQL-query sketch follows this outline.
7. Once you’ve done some analysis, the next step is often saving the transformed data back to disk for others to use. In this final topic, we’re going to cover how to efficiently save PySpark DataFrames… A write-to-disk sketch follows this outline.
8. The Spark ecosystem can be quite expansive, but the skills you’ve gained from this lesson should help you as you begin to branch out and run your own analyses. In this lesson you’ve learned:
   - How…
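The sketches below illustrate the topics in this outline. First, for exercise 2, a minimal sketch of what a PySpark SQL DataFrame looks like; the app name and the toy pageview columns (language_code, article_title, view_count) are illustrative assumptions, not the lesson’s actual data.

```python
from pyspark.sql import SparkSession

# Every PySpark SQL program starts from a SparkSession
spark = SparkSession.builder.appName("pyspark-sql-demo").getOrCreate()

# A tiny DataFrame with an explicit row-and-column structure
# (column names and values are made up for illustration)
pageviews_df = spark.createDataFrame(
    [("en", "Statue_of_Liberty", 352),
     ("en", "Eiffel_Tower", 517),
     ("de", "Eiffel_Tower", 201)],
    schema=["language_code", "article_title", "view_count"],
)

pageviews_df.show()              # tabular preview, similar to pandas
print(pageviews_df.rdd.take(1))  # the RDD of Row objects underneath
```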
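For exercise 3, a minimal data-loading sketch, assuming a hypothetical CSV copy of the hourly pageviews data at ./data/pageviews.csv; the lesson’s actual file path and format may differ.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-sql-demo").getOrCreate()

# Read an external CSV file into a DataFrame:
# header uses the first row as column names,
# inferSchema guesses column types instead of reading everything as strings.
pageviews_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("./data/pageviews.csv")  # hypothetical path
)
```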
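For exercise 4, an inspection sketch, assuming the pageviews_df DataFrame from the data-loading sketch; these are some of the methods PySpark DataFrames share with pandas.

```python
pageviews_df.printSchema()      # column names and inferred types
pageviews_df.describe().show()  # summary statistics per column
print(pageviews_df.count())     # number of rows
pageviews_df.dropna().show(5)   # drop rows with missing values, preview 5 rows
```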
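For exercise 5, a query-method sketch that filters, aggregates, and sorts, again assuming the illustrative language_code, article_title, and view_count columns.

```python
from pyspark.sql.functions import col, desc, sum as sum_

top_articles = (
    pageviews_df
    .filter(col("language_code") == "en")          # keep English pageviews only
    .groupBy("article_title")
    .agg(sum_("view_count").alias("total_views"))  # total views per article
    .orderBy(desc("total_views"))                  # most-viewed first
)
top_articles.show(10)
```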
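For exercise 6, a SQL-query sketch: registering pageviews_df as a temporary view lets spark.sql run plain SQL against it. The view name pageviews is arbitrary, and the query mirrors the method-chain version above.

```python
# Register the DataFrame as a temporary SQL view
pageviews_df.createOrReplaceTempView("pageviews")

top_articles = spark.sql("""
    SELECT article_title,
           SUM(view_count) AS total_views
    FROM pageviews
    WHERE language_code = 'en'
    GROUP BY article_title
    ORDER BY total_views DESC
    LIMIT 10
""")
top_articles.show()
```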
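For exercise 7, a write-to-disk sketch using the DataFrame writer API on the top_articles result; the output paths are hypothetical. Parquet is shown because it stores the schema alongside the data, which usually makes it the more efficient choice.

```python
# Parquet keeps the schema and compresses well
top_articles.write.mode("overwrite").parquet("./output/top_articles_parquet")

# CSV is handy when other tools need plain text
top_articles.write.mode("overwrite").option("header", True).csv("./output/top_articles_csv")
```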
