[Beta] Introduction to Big Data with PySpark


Why Big Data with PySpark?

This course introduces the core concepts behind big data through a practical, hands-on approach using PySpark. Big data is everywhere: it touches data science, data engineering, and machine learning, and it is becoming central to marketing, strategy, and research. The course covers the applications and implications of big data in finance, social media, health, and medicine. PySpark makes it easy to start analyzing big data, putting its potential within reach of anyone who knows Python.

Take-Away Skills

In this course, you will learn how to handle big data with PySpark. Beyond managing the data itself, you will learn the conceptual underpinnings that make working with big data possible.

  1. Learn how big data is defined, how it is stored and processed, and what ethical considerations to keep in mind.
  2. Learn one way that Spark handles big data: through Resilient Distributed Datasets (RDDs).
  3. Learn how PySpark lets you run SQL-like queries on large datasets.
  4. Combine everything you've learned about PySpark to work with a large dataset!

What you'll create

Portfolio projects that showcase your new skills

How you'll master it

Stress-test your knowledge with quizzes that help commit syntax to memory

"I know from first-hand experience that you can go in knowing zero, nothing, and just get a grasp on everything as you go and start building right away."

Madelyn, Pinterest

Course Description

Learn how to work with big data using PySpark!


Earn a certificate of completion
5 hours to complete in total

5 articles, 1 quiz

1 lesson, 1 project, 1 quiz

1 project, 2 articles