Learn

The entry point to Spark is called a SparkSession. There are many possible configurations for a SparkSession, but for now, we will simply start a new session and save it as spark:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

We can use Spark with data stored on a distributed file system or just on our local machine. Without additional configurations, Spark defaults to local with the number of partitions set to the number of CPU cores on our local machine (often, this is four).
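If we want to be explicit about where Spark runs, we can pass a master URL to the builder. Here is a minimal sketch of that idea; the thread count of 2 is just an illustration, not a required setting:

from pyspark.sql import SparkSession

# "local[2]" runs Spark locally with 2 worker threads; "local[*]" uses every available core
spark = SparkSession.builder.master("local[2]").getOrCreate()
print(spark.sparkContext.defaultParallelism)  # default number of partitions (2 in this sketch)
spark.stop()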

The sparkContext within a SparkSession is the connection to the cluster and gives us the ability to create and transform RDDs. We can create an RDD from data saved locally using the parallelize() function. We can add an argument to specify the number of partitions; the general recommendation is 2-4 partitions per CPU core in the cluster. Otherwise, Spark defaults to the total number of CPU cores.

# default setting
rdd_par = spark.sparkContext.parallelize(dataset_name)
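If we want to control the partitioning ourselves, parallelize() also accepts the number of partitions as a second argument. A small sketch, reusing the placeholder dataset_name from above (the value 8 is just an illustration):

# divided over 8 partitions
rdd_par = spark.sparkContext.parallelize(dataset_name, 8)
rdd_par.getNumPartitions() # output: 8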

If we are working with an external dataset, or possibly a large dataset stored on a distributed file system, we can use textFile() to create an RDD. Spark’s default is to partition the text file in 128 MB blocks, but we can also add an argument to set the number of partitions within the function.

# with partition argument of 10
rdd_txt = spark.sparkContext.textFile("file_name.txt", 10)

We can verify the number of partitions in rdd_txt using the following line:

rdd_txt.getNumPartitions() # output: 10

Finally, we need to know how to end our SparkSession when we are finished with our work:

spark.stop()

Now that we know how to get started with PySpark, let’s introduce the dataset we’ll be working with throughout this lesson and set it up as an RDD!


How to Use Your Jupyter Notebook:

  • You can run a cell in the Notebook to the right by placing your cursor in the cell and clicking the Run button or pressing Shift+Enter/Return.
  • When you are ready to evaluate the code in your Notebook, click the Save button at the top of the Notebook or press Command+S before clicking the Test Work button at the bottom. Be sure to save your solution code in the cell marked ## YOUR SOLUTION HERE ## or it will not be evaluated.
  • When you are ready to move on, click Next.

[Screenshot: buttons at the top of the Jupyter Notebook interface, with Save and Run highlighted]

Instructions

1.

We’ll be working with data about students applying for college. We usually use PySpark for extremely large datasets, but it’s easier to see how functions work when we start with a smaller example.

In the Jupyter notebook named notebook.ipynb, we give you a list of tuples called student_data. Each tuple contains a name, an SAT score out of 1600, a GPA out of 100% (expressed as a decimal), and a state.

Start a SparkSession and assign it the name spark. Confirm your session is connected by running the provided code to print spark from the same code cell.

Reminder: You will need to run the previous code cells in the notebook before running your solution code.

2.

Now that we have our connection to a Spark cluster, we can create an RDD using the student data. Change the list student_data into an RDD called student_rdd that is divided over 5 partitions. Confirm the contents of student_rdd are correct by running the provided code in the same code cell.

3.

View the number of partitions for student_rdd to confirm that it is 5.
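If you get stuck, here is a minimal sketch of what the solution cells might look like. The sample student_data list below is hypothetical; in the notebook, the real list is provided for you, so you would run the provided cells instead of redefining it.

from pyspark.sql import SparkSession

# hypothetical sample; in the notebook, student_data is already defined
student_data = [("Ava", 1340, 0.92, "CA"), ("Liam", 1250, 0.87, "NY")]

## YOUR SOLUTION HERE ##
spark = SparkSession.builder.getOrCreate()
print(spark)                              # checkpoint 1: confirm the session is connected

student_rdd = spark.sparkContext.parallelize(student_data, 5)
print(student_rdd.collect())              # checkpoint 2: confirm the contents

print(student_rdd.getNumPartitions())     # checkpoint 3: expected output: 5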
