Spark RDDs with PySpark
Learn one way that Spark handles big data -- through Resilient Distributed Datasets (RDDs).
Key Concepts
Review core concepts you need to learn to master this subject
Spark Overview
Spark is a framework designed to process large amounts of data. Originally created to build data pipelines for Machine Learning (ML) workloads, Spark is capable of querying, transforming, and analyzing Big Data on a variety of data systems.
RDDs with PySpark
Lesson 1 of 1
1. Apache Spark is a framework that allows us to work with big data. But how do we tell Spark what to do with our data? In this lesson, we’ll get familiar with using PySpark (the Python API for Spark)…
2. The entry point to Spark is called a SparkSession. There are many possible configurations for a SparkSession, but for now, we will simply start a new session and save it as spark: from pyspark… (a minimal SparkSession sketch appears after this list)
3. RDDs may seem more complicated than DataFrames, but we can also manipulate RDDs using Spark transformations. Transformations are functions that take an RDD as input and will always output a new RDD…
4. You may have noticed that the transformation executed rather quickly! That’s because it didn’t execute at all. Unlike transformations in Pandas, which we call eager, transformations in Spark are lazy… (a transformation-and-action sketch appears after this list)
5. The reduce() function we used previously is a powerful aggregation tool, but there are limitations to the operations it can apply to RDDs. Namely, reduce() must be commutative and associative… (a reduce() sketch appears after this list)
6. By now we’ve talked endlessly about the benefits of distributing our data across multiple nodes and allowing for parallel processing, but what happens when we don’t want our data to be distributed?… (a broadcast sketch appears after this list)
7. You’ve broadcasted a dictionary over to your nodes, and everything went well! You’re now curious as to how many east versus west coast entries there are. We could attempt to create a couple variables… (an accumulator sketch appears after this list)
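The following sketches illustrate the ideas outlined above; they are minimal examples written against the standard PySpark API, not the lesson’s exact exercises, and all sample data in them is made up for illustration.

A minimal sketch of starting a SparkSession, assuming a standard PySpark installation and default configuration:

```python
from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession: the entry point to Spark
spark = SparkSession.builder.getOrCreate()

# The SparkContext attached to the session is what creates RDDs
sc = spark.sparkContext
```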
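A sketch of a transformation followed by an action, showing lazy evaluation; the list of numbers and the map() logic are illustrative assumptions:

```python
# parallelize() turns a local Python list into a distributed RDD
rdd = sc.parallelize([1, 2, 3, 4, 5])

# map() is a transformation: it returns a new RDD immediately because
# Spark only records the work to be done instead of performing it
doubled = rdd.map(lambda x: x * 2)

# collect() is an action: only now does Spark actually run the map()
print(doubled.collect())  # [2, 4, 6, 8, 10]
```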
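A sketch of why the function passed to reduce() must be commutative and associative; the dataset here is an assumption:

```python
numbers = sc.parallelize([1, 2, 3, 4])

# Addition is commutative and associative, so the result is the same
# no matter how the partitions are combined
print(numbers.reduce(lambda a, b: a + b))  # 10

# Subtraction is neither, so reducing with it could give different
# results depending on how the data happens to be partitioned
```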
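A sketch of broadcasting a small lookup dictionary to the worker nodes; the state-to-coast mapping is a made-up example, not the lesson’s data:

```python
# sc.broadcast() ships one read-only copy of a small object to each node
coasts = {'NY': 'East', 'CA': 'West', 'FL': 'East', 'WA': 'West'}
broadcast_coasts = sc.broadcast(coasts)

states = sc.parallelize(['NY', 'CA', 'WA', 'FL'])

# Workers read the broadcast value through .value instead of having the
# dictionary shipped along with every task
regions = states.map(lambda code: broadcast_coasts.value[code])
print(regions.collect())  # ['East', 'West', 'West', 'East']
```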
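One way to count east versus west coast entries is with accumulators; this is a hedged sketch, since the lesson’s own approach is cut off above, and the regions RDD is carried over from the broadcast sketch:

```python
# Accumulators are shared counters that worker tasks can only add to
east = sc.accumulator(0)
west = sc.accumulator(0)

def tally(region):
    if region == 'East':
        east.add(1)
    else:
        west.add(1)

# foreach() is an action, so the counting actually runs on the cluster
regions.foreach(tally)

# Only the driver program reads the totals back
print(east.value, west.value)  # 2 2
```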
How you'll master it
Stress-test your knowledge with quizzes that help commit syntax to memory