Learn

In this exercise, we’ll learn how to pull in larger datasets from external sources. To start, we’ll be using a dataset from Wikipedia that counts views of all articles by hour. For demonstration’s sake, we’ll use the first hour of 2022. Let’s take a look at the code we might use to read a CSV of this data from a location on disk.

print(type(spark.read))
# <class 'pyspark.sql.readwriter.DataFrameReader'>

# Read CSV to DataFrame
hrly_views_df = spark.read \
    .option('header', True) \
    .option('delimiter', ' ') \
    .option('inferSchema', True) \
    .csv("./data/views_2022_01_01_000000.csv")

There are a few things going on in this code, so let's go through them one at a time:

This code uses the SparkSession.read property to create a new DataFrameReader.

The DataFrameReader has an .option("option_name", "option_value") method that can be used to instruct Spark exactly how to read a file. In this case, we used the following options (an equivalent keyword-argument form is sketched after this list):

  • .option('header', True) — Indicates the file already contains a header row. By default, Spark assumes there is no header.

  • .option('delimiter', ' ') — Indicates each column is separated by a space (' '). By default, Spark assumes CSV columns are separated by commas.

  • .option('inferSchema', True) — Instructs Spark to sample a subset of rows before determining each column’s type. By default, Spark will treat all CSV columns as strings.
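As noted above, these same options can be passed as keyword arguments to .csv() directly, which is equivalent to chaining .option() calls. A minimal sketch, assuming the same file as before (note that the delimiter keyword is named sep):

# Equivalent read using keyword arguments instead of chained .option() calls
hrly_views_df = spark.read.csv(
    "./data/views_2022_01_01_000000.csv",
    header=True,        # first row is a header
    sep=' ',            # columns are space-separated
    inferSchema=True    # sample rows to determine column types
)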

The DataFrameReader also has a .csv("path") method, which loads a CSV file and returns the result as a DataFrame. There are a few quick ways of checking that our data has been read in properly. The most direct is calling DataFrame.show().

# Display first 5 rows of DataFrame
hrly_views_df.show(5, truncate=False)

+-------------+---------------------------+------------+-------------+
|language_code|article_title              |hourly_count|monthly_count|
+-------------+---------------------------+------------+-------------+
|en           |Cividade_de_Terroso        |2           |0            |
|en           |Peel_Session_(Autechre_EP) |2           |0            |
|en           |Young_Street_Bridge        |1           |0            |
|en           |Troy,_Alabama              |1           |0            |
|en           |Charlotte_Johnson_Wahl     |10          |0            |
+-------------+---------------------------+------------+-------------+
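Another quick check is printing the schema that Spark inferred; assuming the same DataFrame as above:

# Print the inferred column names and types
hrly_views_df.printSchema()

With inferSchema enabled, the count columns should come back as integers rather than strings.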

Looks good! In this exercise, we used a DataFrameReader to pull a CSV from disk into our local Spark environment. However, Spark can read a wide variety of file formats (a couple of examples are sketched below); you can refer to the PySpark documentation to explore all available DataFrameReader options and file formats. In the following exercise, we'll start to analyze the contents of this file.
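For illustration, the reader exposes analogous methods for other common formats. A minimal sketch, with hypothetical file paths:

# JSON and Parquet use the same DataFrameReader interface
json_df = spark.read.json('./data/example.json')           # hypothetical path
parquet_df = spark.read.parquet('./data/example.parquet')  # hypothetical path

Parquet files store their schema alongside the data, so no inferSchema option is needed there.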

Instructions

1.

The file wiki_uniq_march_2022.csv contains the estimated count of unique visitors to each Wikipedia domain on March 1st, 2022. The file has the following layout:

  • Site/Project Name (string)
  • Estimated Human Visitors (int)
  • Estimated Bot Visitors (int)
  • Total Traffic (int)

You can read more about how Wikipedia estimates these values here.

First, let’s load the data from ./data/wiki_uniq_march_2022.csv as a DataFrame named wiki_uniq_df and display the first 10 rows in the notebook. For the moment, do not add any options when reading the file.
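One way to do this, assuming spark is the active SparkSession:

# Read the CSV with all default options (no header, every column read as a string)
wiki_uniq_df = spark.read.csv('./data/wiki_uniq_march_2022.csv')

# Display the first 10 rows
wiki_uniq_df.show(10, truncate=False)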

2.

We've read the file, but the result doesn't quite look right! This file has a header row. Pass the DataFrameReader the option that treats the first row of the file as a header. Name this DataFrame wiki_uniq_w_header_df.
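A sketch of one possible solution:

# Re-read the file, treating the first row as a header
wiki_uniq_w_header_df = spark.read \
    .option('header', True) \
    .csv('./data/wiki_uniq_march_2022.csv')

wiki_uniq_w_header_df.show(10, truncate=False)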

3.

This result is better, but we haven’t specified the types for the DataFrame yet. Read the data in again, this time passing an option to the DataFrameReader that will tell Spark to sample rows to determine the file schema. Name this DataFrame wiki_uniq_w_schema_df.
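One possible solution, keeping the header option from the previous step:

# Re-read the file, inferring column types from a sample of rows
wiki_uniq_w_schema_df = spark.read \
    .option('header', True) \
    .option('inferSchema', True) \
    .csv('./data/wiki_uniq_march_2022.csv')

# Confirm the inferred types
wiki_uniq_w_schema_df.printSchema()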
