Install PySpark Off-Platform

Install PySpark on your computer so you can analyze big data off-platform

Installation overview

Let’s get ready to supercharge our data analysis! Apache Spark is a big data processing framework, and PySpark is its Python interface; installing PySpark lets us drive Spark from Python code. In general, PySpark installation has the following requirements:

  1. A relatively current version of Python (as of writing this article, Python 3.6 or newer)
  2. A Java installation of version 8 or later
  3. The PySpark framework

Let’s start by making sure we have the underlying programming languages installed.

Python setup

First, let’s tackle the “Py” in “PySpark” (i.e., we need to install Python). Python will serve as the language in which we talk to the Spark framework to process our data and is one of several languages that Spark can work with (others include Scala, R, and Java). In most cases, we should be installing the latest version of Python unless we know that a package or environment has other requirements.

Check Python version

To check that we have Python installed (and the version), we can use the command line.

  • Mac: Open a Terminal and enter the command python3 --version
  • Windows: Open a Command Prompt and enter the command python --version

If Python is installed, the output will show Python followed by a version number. For example, the output may read Python 3.8.6.
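We can also confirm the version from inside Python itself. Here is a minimal sketch using only the standard library:

import sys

# Print the running interpreter's version, e.g., "3.8.6 (default, ...)"
print(sys.version)

# Fail loudly if the interpreter is older than PySpark's minimum
assert sys.version_info >= (3, 6), "PySpark requires Python 3.6 or newer"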

Install or update Python

If we don’t have Python on our computer, or if we have a version older than 3.6, we can follow the instructions in Codecademy’s article on installing Python.

Java setup

Next, we will take a look at a key foundation for the “Spark” part of “PySpark”. Spark, as a framework, is written in the Scala programming language and runs on Java Virtual Machine (JVM). Thus, we must make sure our computer has Java installed.

Check Java version

Similar to Python, we can check our version of Java via the command line.

  • Mac: Open a Terminal and enter the command java -version
  • Windows: Open a Command Prompt and enter the command java -version

If Java is installed, the output will show the version in a format similar to java version "1.8.0_333".
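We can run the same check from within Python if we prefer. A small sketch (assuming Python 3.7+ for capture_output; note that java -version writes its report to stderr rather than stdout):

import subprocess

# `java -version` prints to stderr, not stdout
result = subprocess.run(["java", "-version"], capture_output=True, text=True)
print(result.stderr.splitlines()[0])  # e.g., java version "1.8.0_333"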

Install or update Java

If we don’t have Java on our computer, or if our version is earlier than version 8, we will need to refer to the Java Installation Manual to download and install the correct version to our computer.

PySpark setup

Now that we have our language foundation set up, let’s start installing PySpark. There are several ways in which we can install PySpark. We recommend using one of the two package management systems but have included a link to manual installation instructions for more advanced learners.

Using a package management system

We recommend using one of the following two options to make installation easier. Package management systems ensure that the Spark framework is installed correctly for our hardware, as well as install the following Python package dependencies:

Package    Minimum Supported Version
pandas     0.23.2
NumPy      1.7
pyarrow    1.0.0
Py4J       0.10.9.3
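Once installation finishes, we can confirm these dependencies are present with a quick sketch using the standard library (importlib.metadata requires Python 3.8+):

from importlib import metadata

# Print the installed version of each PySpark dependency
for pkg in ("pandas", "numpy", "pyarrow", "py4j"):
    print(pkg, metadata.version(pkg))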

Install through conda

Conda is an open-source package management system, created by Anaconda, that can install Python packages (among others). Conda provides PySpark through its community-maintained conda-forge channel, making it a natural choice for anyone who already uses the conda distribution of Python. The main drawback is that conda-forge may lag slightly behind the latest PySpark release, since the channel is community-driven.

If using conda to install PySpark, we run the following command in a Terminal or Command Prompt:

conda install -c conda-forge pyspark

Install through PyPI

The Python Package Index (PyPI) is the official repository of Python packages and frameworks. PySpark is published there and updated with each new release of Spark.

This method can be done by running the following command in a Terminal or Command Prompt:

pip install pyspark
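PySpark also defines optional extras on PyPI; for example, installing with the sql extra pulls in the pandas and pyarrow dependencies used by Spark SQL:

pip install "pyspark[sql]"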

Manual installation

While we do not recommend this method for newer learners, more advanced learners may want to navigate to the Apache Foundation downloads page to perform a manual installation. The Apache Foundation manages and distributes the Spark framework, so we can be certain of access to the most up-to-date version (and even some beta versions) if we want to stay on the cutting edge. This approach requires us to build (“make”) Spark manually, place various packages in specific locations, and install any other dependencies ourselves.
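With a manual installation, Python also needs to be told where Spark lives. One common approach uses the findspark package (installed separately via pip); here is a sketch, assuming Spark was unpacked to /opt/spark:

import os

# Assumption: Spark was unpacked to /opt/spark; adjust to the actual location
os.environ["SPARK_HOME"] = "/opt/spark"

import findspark  # installed separately: pip install findspark
findspark.init()  # adds Spark's Python libraries to sys.path

import pyspark
print(pyspark.__version__)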

Testing the installation

Let’s take one last step to check that our installation was successful. In a Terminal or Command Prompt, type one of the following commands:

  • Conda installation: conda list pyspark
  • PyPI installation: pip show pyspark

If the installation was successful, the output will show the version of PySpark we have just installed. If we want to use PySpark within a Jupyter notebook but don’t have Jupyter installed yet, we can find instructions in Codecademy’s Jupyter Notebook installation guide.
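For an end-to-end check, we can start a local SparkSession and run a tiny job. If this prints a two-row table and a version number, PySpark is working:

from pyspark.sql import SparkSession

# Start Spark locally, using all available cores
spark = SparkSession.builder.master("local[*]").appName("InstallTest").getOrCreate()

# Build and display a tiny DataFrame as a smoke test
df = spark.createDataFrame([(1, "spark"), (2, "works")], ["id", "word"])
df.show()

print(spark.version)  # the Spark version backing our PySpark install
spark.stop()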

Congratulations on installing PySpark!

Author

Codecademy Team
