Learn

In this exercise, we’re going to start to analyze our pageview data and learn how Spark can help with data exploration. Like Pandas, Spark DataFrames offer a series of operations for cleaning, inspecting, and transforming data. Earlier in the lesson, we mentioned that all DataFrames have a schema that defines their structure, columns, and datatypes. We can use DataFrame.printSchema() to show a DataFrame’s schema.

    # Display DataFrame schema
    hrly_views_df.printSchema()

    root
     |-- language_code: string (nullable = true)
     |-- article_title: string (nullable = true)
     |-- hourly_count: integer (nullable = true)
     |-- monthly_count: integer (nullable = true)

We can then use DataFrame.describe() to see a high-level summary of the data by column. The result of DataFrame.describe() is itself a DataFrame, so we call .show() on it to display it in our notebook.

    hrly_views_df_desc = hrly_views_df.describe()
    hrly_views_df_desc.show(truncate=False)

    +-------+-------------+-------------+------------+-------------+
    |summary|language_code|article_title|hourly_count|monthly_count|
    +-------+-------------+-------------+------------+-------------+
    |  count|      4654091|      4654091|     4654091|      4654091|
    |   mean|         null|         null|     4.52417|          0.0|
    | stddev|         null|         null|   182.92502|          0.0|
    |    min|           aa|            -|           1|            0|
    |    max|       zu.m.d|            -|      288886|            0|
    +-------+-------------+-------------+------------+-------------+

From this summary, we can see a few interesting facts.

  • About 4.65 million unique pages were viewed this hour (the count row gives the number of rows in each column).
  • The most viewed page had almost 289,000 views, while the mean was just over 4.5 views per page.
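
To make the summary rows concrete, here is a rough plain-Python sketch of the five statistics describe() reports, applied to the five sample rows that show(5) displays later in this lesson. It is an illustration rather than Spark’s implementation: the helper name describe_column is hypothetical, and it assumes that the stddev row is the sample standard deviation and that string columns get a null mean and stddev with min and max chosen by string ordering, as the output above suggests.

```python
import statistics


def describe_column(values):
    """Return count, mean, stddev, min, and max for one column's values.

    Assumes there are no nulls in `values` and at least two entries.
    """
    numeric = all(isinstance(v, (int, float)) for v in values)
    return {
        "count": len(values),
        # mean and stddev only make sense for numeric columns
        "mean": statistics.mean(values) if numeric else None,
        # statistics.stdev is the sample (n - 1) standard deviation
        "stddev": statistics.stdev(values) if numeric else None,
        # for strings, min and max follow string ordering
        "min": min(values),
        "max": max(values),
    }


# Values taken from the five sample rows shown in this lesson
hourly_count = [2, 2, 1, 1, 10]
article_title = [
    "Cividade_de_Terroso",
    "Peel_Session_(Autechre_EP)",
    "Young_Street_Bridge",
    "Troy,_Alabama",
    "Charlotte_Johnson_Wahl",
]

count_summary = describe_column(hourly_count)
title_summary = describe_column(article_title)
print(count_summary)
print(title_summary)
```

Even in this tiny sample, one large value (10) pulls the mean above what most rows contain, which is the same pattern the full summary shows on a much larger scale.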

The column monthly_count has a minimum and a maximum of 0, so it contains only zeros, most likely because this data was taken from the first hour of the month. Since it carries no meaningful information, we can drop it with DataFrame.drop(), which accepts one or more column names: DataFrame.drop("columns", "to", "drop").

    # Drop `monthly_count` and display new DataFrame
    hrly_views_df = hrly_views_df.drop('monthly_count')
    hrly_views_df.show(5)

    +-------------+---------------------------+------------+
    |language_code|article_title              |hourly_count|
    +-------------+---------------------------+------------+
    |en           |Cividade_de_Terroso        |           2|
    |en           |Peel_Session_(Autechre_EP) |           2|
    |en           |Young_Street_Bridge        |           1|
    |en           |Troy,_Alabama              |           1|
    |en           |Charlotte_Johnson_Wahl     |          10|
    +-------------+---------------------------+------------+
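
One thing to watch for when dropping columns: as far as we know, DataFrame.drop() quietly ignores names that are not in the DataFrame, so a typo leaves the column in place without raising an error. The sketch below shows one possible guard. drop_checked is a hypothetical helper, not part of PySpark, and it relies only on the DataFrame’s columns attribute and drop() method.

```python
def drop_checked(df, *names):
    """Drop the named columns, raising if any name is missing from df.

    Hypothetical helper: the check is an exact, case-sensitive match
    against df.columns, which may be stricter than Spark's own lookup.
    """
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    return df.drop(*names)
```

With the DataFrame from this exercise, drop_checked(hrly_views_df, 'monthly_count') would behave like the plain drop() call, while a misspelled name would raise a ValueError instead of passing silently.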

The data is starting to look pretty good, but let’s make one more adjustment. The column article_title is a bit misleading: it seems this data contains articles, files, image pages, and Wikipedia metadata pages. We can replace this misleading header with a better name using DataFrame.withColumnRenamed().

    hrly_views_df = hrly_views_df \
        .withColumnRenamed('article_title', 'page_title')

Now when we call .printSchema() we see that the schema reflects the updates we’ve made to the DataFrame.

    root
     |-- language_code: string (nullable = true)
     |-- page_title: string (nullable = true)
     |-- hourly_count: integer (nullable = true)

You may have noticed that Spark marked every column as nullable = true. Intuitively, we know that page_title shouldn’t be null, but when the DataFrameReader reads a CSV, it assigns nullable = true to all columns. This is fine for now, but in some scenarios, you may wish to explicitly define a file’s schema. If interested, you can refer to [PySpark’s documentation on defining a file’s schema](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.schema.html).
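
As a rough sketch of what an explicit schema could look like for this dataset, the snippet below describes the four original columns with a DDL-style string and passes it to the reader through the schema parameter of DataFrameReader.csv(). The function name read_hourly_views, the path argument, and any reader options (such as the delimiter) are assumptions for illustration; this lesson does not show how the file was actually loaded.

```python
# Column names and types copied from the printSchema() output in this lesson
HOURLY_VIEWS_SCHEMA = (
    "language_code STRING, "
    "article_title STRING, "
    "hourly_count INT, "
    "monthly_count INT"
)


def read_hourly_views(spark, path, **options):
    """Read the pageview CSV with a fixed schema instead of a default one.

    `spark` is an existing SparkSession. `path` and `options` (for example
    sep or header) are placeholders that depend on how the file is stored.
    """
    return spark.read.csv(path, schema=HOURLY_VIEWS_SCHEMA, **options)
```

Columns declared this way are still nullable. To state that a column should not be null, the schema can instead be built from StructType and StructField in pyspark.sql.types, where StructField takes a nullable argument. Whether a file reader preserves nullable=False may depend on the Spark version and data source, so it is worth confirming with printSchema() after loading.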

Instructions

1.

The code to read in the Wikipedia unique data has already been written. Let’s find out if the number and types of columns in the DataFrame look correct. Print the schema of the DataFrame.

2.

Let’s summarize this data. Save a high-level summary of the DataFrame to a new DataFrame named uniq_counts_df_desc and display it in the notebook. From this summary, can you determine the mean number of total visitors per site?

3.

Let’s assume our analysis is focused on only the uniq_human_visitors. Write code to drop total_visitor_count and uniq_bot_visitors. Save the result to a DataFrame named uniq_counts_human_df.

4.

Finally, let’s rename the column uniq_human_visitors to something a bit more descriptive. Rename uniq_human_visitors to unique_site_visitors. Save the new DataFrame as uniq_counts_final_df.
