PySpark is the Python API for Apache Spark (often shortened to Spark) and consists of several modules. Spark is an analytics engine for large-scale distributed data processing and machine learning. Because Spark distributes work across a cluster and keeps intermediate data in memory, Spark applications can process very large datasets far faster than traditional single-machine Python applications. Speedups of up to about 100 times are commonly cited, though the actual gain depends on the workload and the cluster. PySpark is widely used in the data science and machine learning communities because it combines Python's extensive ecosystem of data science libraries with Spark's ability to process large datasets efficiently.


Some of the primary features of PySpark include:

  • Distributed processing.
  • Support for many cluster managers.
  • Support for ANSI SQL.
  • In-memory computation.


PySpark requires Java 8 or later to be installed; the exact Java versions supported depend on the Spark release. Installation details can be found on the Apache Spark website.

