Learn

It’s time to start performing some analysis; this is where PySpark SQL really shines. PySpark SQL DataFrames have a variety of built-in methods that can help with analyzing data. Let’s get into a few examples!

Imagine we’d like to filter our data to pages from a specific Wikipedia language_code (e.g., “kw.m”). This site is not very active, so it’s easy to display all of this hour’s data for demonstration purposes. We can do so with the code below:

hrly_views_df\
    .filter(hrly_views_df.language_code == "kw.m")\
    .show(truncate=False)
+-------------+-----------------------+------------+-------------+
|language_code|article_title          |hourly_count|monthly_count|
+-------------+-----------------------+------------+-------------+
|kw.m         |Bresel_Diabarth_Spayn  |1           |0            |
|kw.m         |Can_an_Pescador_Kernûak|1           |0            |
|kw.m         |Ferdinand_Magellan     |1           |0            |
|kw.m         |Justė_Arlauskaitė      |16          |0            |
|kw.m         |Lithouani              |2           |0            |
|kw.m         |Nolwenn_Leroy          |1           |0            |
|kw.m         |Ohio                   |1           |0            |
|kw.m         |Taywan                 |1           |0            |
+-------------+-----------------------+------------+-------------+

This code uses the DataFrame.filter() method to select the relevant rows; it is analogous to a SQL “WHERE” clause. In this case, the condition checks the language_code column for the value “kw.m”. What if we want to remove the monthly_count column and display the data ordered by hourly_count? To do so, we could use the following:

hrly_views_df\
    .filter('language_code == "kw.m"')\
    .select(['language_code', 'article_title', 'hourly_count'])\
    .orderBy('hourly_count', ascending=False)\
    .show(5, truncate=False)
+-------------+-----------------------+------------+
|language_code|article_title          |hourly_count|
+-------------+-----------------------+------------+
|kw.m         |Justė_Arlauskaitė      |16          |
|kw.m         |Lithouani              |2           |
|kw.m         |Bresel_Diabarth_Spayn  |1           |
|kw.m         |Can_an_Pescador_Kernûak|1           |
|kw.m         |Nolwenn_Leroy          |1           |
+-------------+-----------------------+------------+
  • DataFrame.select() is used to choose which columns to return in our result. You can think of DataFrame.select(["A", "B", "C"]) as analogous to SELECT A, B, C FROM DataFrame in SQL.

  • DataFrame.orderBy() is analogous to SQL’s ORDER BY. We use .orderBy('hourly_count', ascending=False) to specify the sort column and order logic. This would be analogous to ORDER BY hourly_count DESC in SQL.
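To make these semantics concrete, the filter → select → orderBy chain can be mirrored in plain Python. This is only an analogy on a hypothetical list of dicts (the rows below are invented sample values, not the real dataset), not actual PySpark code:

```python
# Pure-Python analogy for .filter() / .select() / .orderBy()
# (hypothetical sample rows standing in for the DataFrame).
rows = [
    {"language_code": "kw.m", "article_title": "Lithouani", "hourly_count": 2, "monthly_count": 0},
    {"language_code": "en.m", "article_title": "Ohio", "hourly_count": 50, "monthly_count": 0},
    {"language_code": "kw.m", "article_title": "Justė_Arlauskaitė", "hourly_count": 16, "monthly_count": 0},
]

# .filter('language_code == "kw.m"') keeps only matching rows
filtered = [r for r in rows if r["language_code"] == "kw.m"]

# .select([...]) keeps only the named columns (monthly_count is dropped)
selected = [
    {k: r[k] for k in ("language_code", "article_title", "hourly_count")}
    for r in filtered
]

# .orderBy('hourly_count', ascending=False) sorts highest-first
ordered = sorted(selected, key=lambda r: r["hourly_count"], reverse=True)

print([r["article_title"] for r in ordered])
# → ['Justė_Arlauskaitė', 'Lithouani']
```

The key point the analogy shows: each DataFrame method returns a new, transformed result, so the calls chain together in the same order you would reason about the query.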

What if we’d like to select the sum of hourly_count by language_code? This could help us answer questions like “Which sites were most active this hour?” We can do that with the following:

hrly_views_df\
    .select(['language_code', 'hourly_count'])\
    .groupBy('language_code')\
    .sum()\
    .orderBy('sum(hourly_count)', ascending=False)\
    .show(5, truncate=False)
+-------------+-----------------+
|language_code|sum(hourly_count)|
+-------------+-----------------+
|en.m         |8095763          |
|en           |2693185          |
|de.m         |1313505          |
|es.m         |963835           |
|ru.m         |927583           |
+-------------+-----------------+

This code uses DataFrame.groupBy('language_code').sum() to calculate the sum of all numeric columns grouped by language_code. The .groupBy(field) and .sum() methods are analogous to SQL’s GROUP BY and SUM, respectively. The code then orders the results with .orderBy(), referring to the constructed column by its generated name, 'sum(hourly_count)'.
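The grouped sum above can be sketched in plain Python to show exactly what happens per group. This is an analogy using invented (language_code, hourly_count) pairs, not the real pageview data or PySpark itself:

```python
from collections import defaultdict

# Pure-Python analogy for .groupBy('language_code').sum() on hourly_count
# (hypothetical sample pairs, not the real dataset).
rows = [("en.m", 5), ("en", 3), ("en.m", 7), ("de.m", 4), ("en", 1)]

# groupBy + sum: accumulate a running total per language_code
totals = defaultdict(int)
for language_code, hourly_count in rows:
    totals[language_code] += hourly_count

# orderBy('sum(hourly_count)', ascending=False): rank groups by their total
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
# → [('en.m', 12), ('en', 4), ('de.m', 4)]
```

As in the PySpark version, the grouping key survives as a column while every other selected column is replaced by its aggregate.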

There are many different ways to use the DataFrame methods to query your data. However, if you’re familiar with SQL, you may prefer to use standard SQL statements. In the next section, we’ll explore how you can use standard SQL to explore data with PySpark.

Instructions

1.

For this exercise, a slightly modified version of the dataset you’ve been working with has been created (‘./data/wiki_uniq_march_2022_w_site_type.csv’). The dataset now has two additional columns: language_code and site_type. Let’s find out how many domains there are for the "ar" language code. Filter the dataset to the rows for the language code "ar". Save your result as a DataFrame named ar_site_visitors and display it in the notebook.

2.

Now let’s find out which domain was visited the most. Filter the dataset to the rows for the language code "ar" and the columns "domain" and "uniq_human_visitors". Save your result as a DataFrame named ar_visitors_slim and display it in the notebook.

3.

Calculate the sum of all uniq_human_visitors grouped by site_type and ordered from highest to lowest by sum of visitors. Save your result as a DataFrame named top_visitors_site_type and display it in the notebook.
