The entry point to Spark is called a SparkSession. There are many possible configurations for a SparkSession, but for now, we will simply start a new session and save it as spark:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
We can use Spark with data stored on a distributed file system or just on our local machine. Without additional configuration, Spark runs in local mode, and the default number of partitions equals the number of CPU cores on our local machine (often, this is four).
The sparkContext within a SparkSession is the connection to the cluster and gives us the ability to create and transform RDDs. We can create an RDD from data saved locally using the parallelize() function. We can add an argument to specify the number of partitions; the general recommendation is 2-4 partitions per CPU core in the cluster. Otherwise, Spark defaults to the total number of CPU cores.
# default setting
rdd_par = spark.sparkContext.parallelize(dataset_name)
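To build intuition for what dividing data into partitions means, here is a simplified pure-Python sketch that cuts a list into contiguous, roughly equal slices. The helper name `split_into_partitions` and the sample data are hypothetical, and this only illustrates the idea; it is not Spark's actual implementation, which also handles serialization and distribution across executors.

```python
def split_into_partitions(data, num_partitions):
    """Cut `data` into `num_partitions` contiguous, roughly equal slices."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    n = len(data)
    partitions = []
    for i in range(num_partitions):
        # Integer arithmetic spreads any remainder across the later slices
        start = i * n // num_partitions
        end = (i + 1) * n // num_partitions
        partitions.append(data[start:end])
    return partitions


sample = list(range(10))  # hypothetical data
parts = split_into_partitions(sample, 3)
print(parts)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8, 9]]
```

Every element lands in exactly one slice, which is the property that lets each partition be processed independently.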
If we are working with an external dataset, or possibly a large dataset stored on a distributed file system, we can use textFile() to create an RDD. By default, Spark creates one partition for each block of the file (blocks are 128 MB by default in HDFS), but we can also pass a second argument to textFile() to request a minimum number of partitions.
# with partition argument of 10
rdd_txt = spark.sparkContext.textFile("file_name.txt", 10)
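As a rough worked example of the block-based default, the sketch below estimates how many partitions a file would get if it were split into 128 MB blocks. The function name and the file sizes are hypothetical, and the real count depends on the file system and input format settings, so treat this as back-of-the-envelope arithmetic only.

```python
import math

BLOCK_SIZE_MB = 128  # block size taken from the lesson text; an assumption for other file systems


def estimate_partitions(file_size_mb, min_partitions=1):
    """Estimate the partition count: one per block, but never below the requested minimum."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return max(blocks, min_partitions, 1)


print(estimate_partitions(1024))     # 8
print(estimate_partitions(300))      # 3
print(estimate_partitions(300, 10))  # 10
```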
We can verify the number of partitions in rdd_txt using the following line:
rdd_txt.getNumPartitions() # output: 10
Finally, we need to know how to end our SparkSession when we are finished with our work. We do this by calling the session's stop() method:
spark.stop()
Now that we know how to get started with PySpark, let’s introduce the dataset we’ll be working with throughout this lesson and set it up as an RDD!
How to Use Your Jupyter Notebook:
- You can run a cell in the Notebook by placing your cursor in the cell and clicking the Run button.
- When you are ready to evaluate the code in your Notebook, press the Save button at the top of the Notebook before clicking the Test Work button at the bottom. Be sure to save your solution code in the cell marked ## YOUR SOLUTION HERE ## or it will not be evaluated.
We’ll be working with data about students applying for college. We usually use PySpark for extremely large datasets, but it’s easier to see how functions work when we start with a smaller example.
In the Jupyter notebook named notebook.ipynb, we give you a list of tuples called student_data. Each tuple contains a name, an SAT score out of 1600, a GPA given as a decimal fraction of 100%, and a state.
Start a SparkSession and assign it the name spark. Confirm your session is connected by running the provided code to print spark from the same code cell.
Reminder: You will need to run the previous code cells in the notebook before running your solution code.
Now that we have our connection to a Spark cluster, we can create an RDD using the student data. Change the list student_data into an RDD called student_rdd that is divided over 5 partitions. Confirm the contents of student_rdd are correct by running the provided code in the same code cell.
View the number of partitions for student_rdd to confirm that it is 5.