Learn

PySpark DataFrame’s query methods are an improvement on performing analysis directly on RDDs. However, working with DataFrame methods still requires some practice, and the code can become quite verbose. Luckily, we can analyze data in Spark with standard SQL through the SparkSession.sql() method. This exercise will closely mirror the previous one, and we’ll answer the same questions from that exercise using standard SQL.

Before querying a DataFrame with SQL in Spark, it must be saved to the SparkSession’s catalog. The following code saves the DataFrame as a local temporary view in memory. As long as the current SparkSession is active, we can use SparkSession.sql() to query it.

hrly_views_df.createOrReplaceTempView('hourly_counts')

Each of the three sections of SQL below performs the same function as the DataFrame query methods described in the previous exercise. With the query below, we can filter our data to pages from a specific Wikipedia language_code (e.g., “kw.m”) using a WHERE clause.

query = """SELECT * FROM hourly_counts WHERE language_code = 'kw.m'""" spark.sql(query).show(truncate=False)
+-------------+-----------------------+------------+-------------+ |language_code|article_title |hourly_count|monthly_count| +-------------+-----------------------+------------+-------------+ |kw.m |Bresel_Diabarth_Spayn | 1| 0| |kw.m |Can_an_Pescador_Kernûak| 1| 0| |kw.m |Ferdinand_Magellan | 1| 0| |kw.m |Justė_Arlauskaitė | 16| 0| |kw.m |Lithouani | 2| 0| |kw.m |Nolwenn_Leroy | 1| 0| |kw.m |Ohio | 1| 0| |kw.m |Taywan | 1| 0| +-------------+-----------------------+------------+-------------+

In the query below, we display all pages with “kw.m” as their language_code ordered by the hourly_count using an ORDER BY clause.

query = """SELECT language_code, article_title, hourly_count FROM hourly_counts WHERE language_code = 'kw.m' ORDER BY hourly_count DESC""" spark.sql(query).show(truncate=False)
+-------------+-----------------------+------------+-------------------+ |language_code|article_title |hourly_count|total_monthly_count| +-------------+-----------------------+------------+-------------------+ |kw.m |Justė_Arlauskaitė | 16| 0| |kw.m |Lithouani | 2| 0| |kw.m |Bresel_Diabarth_Spayn | 1| 0| |kw.m |Can_an_Pescador_Kernûak| 1| 0| |kw.m |Nolwenn_Leroy | 1| 0| +-------------+-----------------------+------------+-------------------+

Finally, we select the sum of hourly_count by language_code over the entire DataFrame using a SQL statement with GROUP BY, SUM, and ORDER BY.

query = """SELECT language_code, SUM(hourly_count) as sum_hourly_count FROM hourly_counts GROUP BY language_code ORDER BY sum_hourly_count DESC""" spark.sql(query).show(5, truncate=False)
+-------------+-----------------+ |language_code|sum(hourly_count)| +-------------+-----------------+ |en.m |8095763 | |en |2693185 | |de.m |1313505 | |es.m |963835 | |ru.m |927583 | +-------------+-----------------+

Although querying data with SQL and DataFrame methods may look quite different, behind the scenes, Spark SQL translates everything to the same internal code. This means that as a developer, you can focus more on writing code for analysis in your preferred style rather than low-level execution details.

Instructions

1.

Save the DataFrame to the current SparkSession as a temporary view named uniq_visitors_march

2.

Write the SQL that filters the dataset to the rows for the language_code “ar”. Save your SQL string as a variable named ar_site_visitors_qry, run that query and display the resulting DataFrame in the notebook.

3.

Filter the dataset to the rows for the language_code “ar” and the columns “domain” and “uniq_human_visitors”. Save your SQL string as a variable named ar_site_visitors_slim_qry, run that query and display the resulting DataFrame in the notebook.

4.

Calculate the sum of all uniq_human_visitors grouped by site_type and ordered from highest to lowest by sum of visitors. Save your SQL string as a variable named site_top_type_qry, run that query and display the resulting DataFrame in the notebook.

Take this course for free

Mini Info Outline Icon
By signing up for Codecademy, you agree to Codecademy's Terms of Service & Privacy Policy.

Or sign up using:

Already have an account?