Install PySpark Off-Platform
Installation overview
Let’s get ready to supercharge our data analysis! Apache Spark is a big data processing framework we can use with Python code by installing PySpark. In general, PySpark installation has the following requirements:
- A relatively current version of Python (as of writing this article, Python 3.6 or newer)
- A Java installation of version 8 or later
- The PySpark framework
Let’s start by making sure we have the underlying programming languages installed.
Python setup
First, let’s tackle the “Py” in “PySpark” (i.e., we need to install Python). Python will serve as the language in which we talk to the Spark framework to process our data and is one of several languages that Spark can work with (others include Scala, R, and Java). In most cases, we should be installing the latest version of Python unless we know that a package or environment has other requirements.
Check Python version
To check that we have Python installed (and the version), we can use the command line.
- Mac: Open a Terminal and enter the following command:
python3 --version
- Windows: Open a Command Prompt and enter the following command:
python --version
If Python is installed, the output will show the word Python followed by a version number. For example, the output may read Python 3.8.6.
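We can also confirm the version from within Python itself. Here is a minimal sketch using the standard library's sys module:
# Print the running interpreter's version string, e.g., 3.8.6
import sys
print(sys.version)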
Install or update Python
If we don’t have Python on our computer, or if we have a version older than 3.6, we can follow the instructions in Codecademy’s article on installing Python.
Java setup
Next, we will take a look at a key foundation for the “Spark” part of “PySpark”. Spark, as a framework, is written in the Scala programming language and runs on Java Virtual Machine (JVM). Thus, we must make sure our computer has Java installed.
Check Java version
Similar to Python, we can check our version of Java via the command line.
- Mac: Open a Terminal and enter the following command:
java -version
- Windows: Open a Command Prompt and enter the following command:
java -version
If Java is installed, the output will show the version in a format similar to java version "1.8.0_333". Note that Java 8 reports its version as 1.8, so this output indicates a Java 8 installation.
Install or update Java
If we don’t have Java on our computer, or if our version is earlier than version 8, we will need to refer to the Java Installation Manual to download and install the correct version to our computer.
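Spark locates Java through the JAVA_HOME environment variable, so if PySpark later complains that it can't find Java, we may need to set that variable ourselves. As a sketch:
- Mac (uses the built-in java_home utility):
export JAVA_HOME=$(/usr/libexec/java_home)
- Windows (the path shown is an example; use the actual install location):
setx JAVA_HOME "C:\Program Files\Java\jdk1.8.0_333"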
PySpark setup
Now that we have our language foundation set up, let’s start installing PySpark. There are several ways in which we can install PySpark. We recommend using one of the two package management systems but have included a link to manual installation instructions for more advanced learners.
Using a package management system
We recommend using one of the following two options to make installation easier. Package management systems ensure that the Spark framework is installed correctly for our hardware, as well as install the following Python package dependencies:
| Package | Minimum Supported Version |
| --- | --- |
| pandas | 0.23.2 |
| NumPy | 1.7 |
| pyarrow | 1.0.0 |
| Py4J | 0.10.9.3 |
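If some of these packages are already installed, we can check their versions ahead of time. For example, using pip:
pip show pandas numpy pyarrow py4j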
Install through conda
Conda is an open-source package management system for Python packages that was created by Anaconda. Conda provides PySpark through its conda-forge channel, making it a natural choice for anyone with the conda version of Python installed. The main drawback is that conda-forge, being community-driven, doesn't always carry the latest version of PySpark.
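Optionally, we can first create a dedicated conda environment so PySpark and its dependencies stay isolated from other projects. The environment name and Python version below are just examples:
conda create -n pyspark-env python=3.10
conda activate pyspark-env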
If using conda to install PySpark, we run the following command in a Terminal or Command Prompt:
conda install -c conda-forge pyspark
Install through PyPI
The Python Package Index (PyPI) is a large collection of Python packages and frameworks. PySpark is published on PyPI and kept up to date with each new release of Spark.
To install from PyPI, run the following command in a Terminal or Command Prompt:
pip install pyspark
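The PySpark documentation also describes optional extras that pull in additional dependencies; for example, at the time of writing, the Spark SQL extras can be installed with:
pip install "pyspark[sql]"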
Manual installation
While we do not recommend this method for newer learners, more advanced learners may want to navigate to the Apache Foundation downloads page to perform a manual installation. The Apache Foundation manages and distributes the Spark framework, so we can be certain that we will have access to the most up-to-date version (and even some beta versions) if we want to be on the cutting edge. This approach requires us to build ("make") Spark manually, place various packages in specific locations, and install any other dependencies ourselves.
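If we go this route, we also need to point our shell at the unpacked Spark directory. A common sketch on Mac (the path below is a placeholder for wherever Spark was extracted):
# Placeholder path; point this at the unpacked Spark directory
export SPARK_HOME=/path/to/spark
export PATH="$SPARK_HOME/bin:$PATH"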
Testing the installation
Let’s take one last step to check that our installation was successful. In a Terminal or Command Prompt, type one of the following commands:
- Conda installation:
conda list pyspark
- PyPI installation:
pip show pyspark
If the installation was successful, the output will show the version of PySpark we have just installed. If we want to use PySpark within a Jupyter notebook but don’t have Jupyter installed yet, we can find instructions in Codecademy’s Jupyter Notebook installation guide.
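For an end-to-end check, we can also start a small local Spark session and run a trivial job. This is a minimal sketch; the app name, column names, and data are arbitrary:
from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session
spark = SparkSession.builder.appName("InstallTest").getOrCreate()

# Build a tiny DataFrame and display it
df = spark.createDataFrame([("spark", 1), ("pyspark", 2)], ["word", "count"])
df.show()

# Shut the session down when finished
spark.stop()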
Congratulations on installing PySpark!