Intro to Amazon EMR

Introduction to using Amazon EMR clusters to process data.

Introduction

Amazon EMR is a managed cluster platform for running big data frameworks. It is designed to use frameworks such as Apache Spark to process data at petabyte scale, and it lets you develop applications in EMR Studio with tools such as EMR Notebooks. If you’re interested in learning about Amazon EMR, this tutorial will get you started on the platform.

Note: Amazon EMR is not a free service. For pricing details, visit https://aws.amazon.com/emr/pricing/.

Setting Up Amazon EMR

There are a few steps you need to complete to start using Amazon EMR.

  1. First, you’ll need an AWS account. If you don’t have one, sign up here.

    Note: while AWS has a free tier, Amazon EMR is not a free service.

  2. When setting up AWS, it is best practice to create an administrative user for daily tasks like working with Amazon EMR, so you aren’t doing everything with the root account.
  3. If you want to connect to Amazon EMR clusters over SSH, you’ll need to create a key pair to authenticate. Instructions are here.

    Note: This last step isn’t necessary for this tutorial.

Setting Up an Application

To process data with Amazon EMR, you need to load the data and the instructions to process it. For this tutorial, you’ll be storing these in an Amazon S3 bucket. To create the bucket, sign into the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/. Open the navigation pane to the left and click on “Buckets”.

Amazon S3 Menu with "buckets" highlighted

This will take you to the list of buckets available in your AWS account. On this screen, click on the button labeled “Create Bucket”.

Amazon S3 bucket list with "create bucket" highlighted

This will take you to a page of options for your new bucket. You’ll be prompted for a region and a name. Choose a region close to you and name the bucket something like “[your name]-my-emr-test.” You can accept the defaults for the other settings on this page. At the bottom of the page, click “Create Bucket.” You’ll be taken back to the list of buckets, with your new bucket listed.

You now have a location for your data and your application.

Note: If nothing happens when you click “Create Bucket”, there may be a name conflict. Scroll up and see if there’s a warning next to the name textbox. If there is, alter the name to make it unique.
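If you’d rather script bucket creation, the AWS SDK for Python (boto3) can do the same thing. Below is a minimal sketch, assuming your AWS credentials are already configured; the bucket name and region are placeholders, so substitute your own:

import boto3

# Placeholders: use your own unique bucket name and a region close to you
bucket_name = "your-name-my-emr-test"
region = "us-east-2"

s3 = boto3.client("s3", region_name=region)

# Outside us-east-1, S3 requires an explicit LocationConstraint
s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": region},
)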

To upload files to your bucket, click on its name in the bucket list. You’ll be taken to a home page for your bucket, and there will be an “Upload” button to the right. You’ll click on this to upload your files.

Amazon S3 bucket home page with "upload" highlighted

Before you do this, you’ll need data and an application to access it. For data, we’ll load a CSV dataset. To get one for this tutorial, go to Kaggle and download a dataset of Google Play Store Apps. (If you don’t have a Kaggle account, you’ll have to register; it’s free.) Once you download the archive.zip file, unzip it; it contains three files, and the one you’ll be using is googleplaystore.csv. Once you have this file locally, you can upload it to your bucket.
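If you’d like to unpack the archive with Python instead of your operating system’s tools, the standard library’s zipfile module works; a quick sketch, assuming archive.zip is in your working directory:

import zipfile

# Extract every file in the Kaggle archive, including googleplaystore.csv
with zipfile.ZipFile("archive.zip") as archive:
    archive.extractall(".")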

Clicking “Upload” on your bucket’s home screen will take you to an upload page where you can add files and upload them. Click on “Add Files” and select googleplaystore.csv from where you saved it.

Amazon S3 upload page with "add files" highlighted

Once you add the file, scroll to the bottom of the page and click “Upload”. The CSV file will be uploaded to your bucket.
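The upload can also be scripted with boto3 if you prefer; a minimal sketch, with the bucket name as a placeholder:

import boto3

s3 = boto3.client("s3")

# Upload the local CSV into the bucket created earlier
s3.upload_file("googleplaystore.csv", "your-name-my-emr-test", "googleplaystore.csv")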

Next, we need an application: something the cluster can use to process the data. For this, we’ll use a Python script built on PySpark. Below is a sample script to use for this tutorial:

import argparse

from pyspark.sql import SparkSession


def get_top_ten_categories(source, output):
    """
    Process Google Play Store data and return the top 10 categories by average rating.
    source: The URI of the Play Store dataset, 's3://MY-BUCKET-NAME/googleplaystore.csv'
    output: The URI where output is written, 's3://MY-BUCKET-NAME/output'
    """
    with SparkSession.builder.appName("Get Top Ten Categories").getOrCreate() as spark:
        # Load the CSV data, treating the first row as a header
        if source is not None:
            google_df = spark.read.option("header", "true").csv(source)

        # Register the DataFrame as an in-memory view we can query with SQL
        google_df.createOrReplaceTempView("google_play_store")

        # Query for the ten categories with the highest average rating,
        # skipping rows with missing or invalid ratings
        top_ten = spark.sql("""SELECT Category, AVG(Rating) AS avg_rating
            FROM google_play_store
            WHERE NOT Rating = "NaN" AND Rating <= 5
            GROUP BY Category
            ORDER BY avg_rating DESC LIMIT 10""")

        # Write the result as CSV with a header row, overwriting any previous output
        top_ten.write.option("header", "true").mode("overwrite").csv(output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--source')
    parser.add_argument('--output')
    args = parser.parse_args()

    get_top_ten_categories(args.source, args.output)

Save the above script as google_top_ten.py and then upload it to your bucket like you did with the googleplaystore.csv file.
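If you have PySpark installed locally (pip install pyspark), you can optionally smoke-test the script before relying on the cluster. A sketch, assuming google_top_ten.py and googleplaystore.csv are in your current directory:

# Run the job locally; results are written to a local ./output folder
from google_top_ten import get_top_ten_categories

get_top_ten_categories("googleplaystore.csv", "output")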

Once both files are in place, you’re ready to launch your cluster.

Launching a Cluster

To launch a cluster, you need to be signed into the AWS Management Console. Then open the Amazon EMR console at https://console.aws.amazon.com/emr. On the menu to the left, choose “Clusters” if it isn’t already selected. This will open a page listing your EMR clusters, and to the right will be a button labeled “Create Cluster.”

Amazon EMR menu with "clusters" highlighted

Amazon EMR cluster list with "create cluster" highlighted

Click on “Create Cluster.” This will open the Create Cluster page. Most of the options will be filled with default values that we can leave as is for this tutorial.

  • For the “Cluster name” enter a unique cluster name to identify your cluster.
  • Under “Application Bundle” make sure you select the option “Spark” because we want Apache Spark installed on the cluster. You won’t be able to change the application selection once the cluster is created.
  • Under “Cluster Logs”, change the path to the bucket you created earlier and add /logs to the path to create a separate folder for the cluster logs. The path should look something like this: s3://MY-BUCKET-NAME/logs.
  • Under “Amazon EMR Service Role” select “Create a service role.” Accept the defaults for “Virtual Private Cloud” and “Subnet” and under “Security Group” select the default.
  • Under “EC2 instance profile for Amazon EMR” select “Create an instance profile” and set access to “all S3 buckets in this account”.

Once these options are filled out, you can click on the “Create cluster” button on the right. After a few moments, you will be taken to the information page of your new cluster. To the right is a box labeled “Status and time” that tells you the status of your cluster: “Starting,” “Running,” or “Waiting.”
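The console is the easiest way to launch a first cluster, but the same launch can be scripted with boto3. Below is a minimal sketch, assuming the default EMR service role and instance profile already exist in your account; the release label and instance types are illustrative choices, not requirements:

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="my-emr-test-cluster",            # placeholder cluster name
    ReleaseLabel="emr-7.0.0",              # assumed release; pick a current one
    Applications=[{"Name": "Spark"}],      # matches the "Spark" application bundle
    LogUri="s3://MY-BUCKET-NAME/logs",     # the log path configured above
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Keep the cluster alive after startup so we can submit steps to it
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    ServiceRole="EMR_DefaultRole",         # assumes this default role exists
    JobFlowRole="EMR_EC2_DefaultRole",     # assumes this instance profile exists
)

cluster_id = response["JobFlowId"]
print(cluster_id)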

Wait for the cluster to finish starting before continuing. You can click on the “refresh” button to update the current status.

Amazon EMR cluster status with "refresh" highlighted

Once the status is “Waiting,” the cluster is ready for a job.
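If you launched the cluster with boto3, you can wait programmatically instead: boto3 provides a cluster_running waiter that polls until the cluster reaches the “Running” or “Waiting” state. A sketch, with the cluster ID as a placeholder:

import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"  # placeholder: your cluster's ID

# Blocks, polling periodically, until the cluster is RUNNING or WAITING
waiter = emr.get_waiter("cluster_running")
waiter.wait(ClusterId=cluster_id)

# Confirm the current state
state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
print(state)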

Submitting Work to the Cluster

Now you can submit your work to the EMR cluster. You do this by submitting a step: a set of actions you want the cluster to perform. To do this, navigate to the Amazon EMR console and select “Clusters” on the menu to the left. You should see your new cluster in the list of clusters. To confirm that its status is “Waiting,” click on the expand button to the left of your cluster.

Amazon EMR cluster list with expand button highlighted

This will expand the information about your cluster and the steps attached to it. It will also reveal an “Add Step” button.

Amazon EMR cluster list expanded with "add step" button highlighted

Click on “Add step.” This will take you to the add step page. To create our step, set the following:

  • Change the type to “Spark application.”
  • Enter a name to identify the step, such as “Google Play Store Script”
  • Leave “Deploy Mode” as the default.
  • Enter the “Application Location.” This is the URI of the Python script you saved earlier, something like s3://MY-BUCKET-NAME/google_top_ten.py.
  • Leave “Spark submit options” blank.
  • Under “Arguments” we’ll enter the command line arguments for our Python script, using your bucket name in place of “MY-BUCKET-NAME”:
--source "s3://MY-BUCKET-NAME/googleplaystore.csv"
--output "s3://MY-BUCKET-NAME/output"
  • Under “Action if step fails” accept the default, “Continue.”

After configuring the step, click on the “Add step” button at the bottom of the page. This will take you back to the page for your cluster with the “Steps” tab selected. Your new step will be listed with a status of “Pending.” The step should progress from “Pending” to “Running” to “Completed”. You can use the refresh button to update the status.
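For reference, the same step can be submitted from code. Below is a hedged sketch using boto3’s add_job_flow_steps, which runs spark-submit on the cluster through EMR’s command-runner.jar; the cluster ID and bucket name are placeholders:

import boto3

emr = boto3.client("emr")

response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",         # placeholder: your cluster's ID
    Steps=[
        {
            "Name": "Google Play Store Script",
            "ActionOnFailure": "CONTINUE",   # same as "Continue" in the console
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://MY-BUCKET-NAME/google_top_ten.py",
                    "--source", "s3://MY-BUCKET-NAME/googleplaystore.csv",
                    "--output", "s3://MY-BUCKET-NAME/output",
                ],
            },
        }
    ],
)

print(response["StepIds"])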

After the step is completed, you can download the results. To do this, go to the Amazon S3 console at https://console.aws.amazon.com/s3/, then select the bucket you created earlier. There should be a new folder there named output.

Click on the output folder, and you should see the results of your step: a small file named _SUCCESS, along with a CSV file whose name begins with part-. This CSV file is the result of our PySpark query. To look at it, check the box next to it and click on the “Download” button. Once you save it locally, you can open it. It should look something like this:

Category,avg_rating
EVENTS,4.435555555555557
EDUCATION,4.389032258064517
ART_AND_DESIGN,4.358064516129031
BOOKS_AND_REFERENCE,4.346067415730338
PERSONALIZATION,4.335987261146501
PARENTING,4.300000000000001
GAME,4.2863263445761195
BEAUTY,4.278571428571428
HEALTH_AND_FITNESS,4.2773648648648654
SHOPPING,4.259663865546221
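If you’d rather fetch the results programmatically, a boto3 sketch like the following would find and download the part- file; the bucket name is a placeholder:

import boto3

s3 = boto3.client("s3")
bucket = "your-name-my-emr-test"  # placeholder: your bucket name

# List everything under output/ and download the part- CSV file
listing = s3.list_objects_v2(Bucket=bucket, Prefix="output/")
for obj in listing.get("Contents", []):
    if "part-" in obj["Key"]:
        s3.download_file(bucket, obj["Key"], "results.csv")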

Congratulations, you’ve just processed data using an Amazon EMR cluster.

Clean Up

You shouldn’t leave resources you’re no longer using in your AWS account, so after you’ve finished with your EMR cluster, you need to clean up after yourself.

First, you need to terminate your cluster.

Note: Terminated clusters cannot be restarted. However, Amazon EMR retains the cluster metadata for two months, and you may use that to clone a new cluster with the same settings.

To terminate the cluster, return to the Amazon EMR console at https://console.aws.amazon.com/emr. Choose “Clusters” on the menu to the left, select the checkbox next to your cluster, and click on the “Terminate” button above the cluster list. Then confirm you want to terminate the cluster.

Amazon EMR cluster list with "terminate" button highlighted

Next, delete the S3 bucket and the contents used for this tutorial. Go to the Amazon S3 console at https://console.aws.amazon.com/s3/ and select the bucket you created earlier. Then click on “Empty.” Confirm that you want to empty the bucket.

Note: You cannot delete an S3 bucket that isn’t empty.

After emptying the bucket, you can select it and click on “Delete”. Confirm that you want to delete the bucket.

Amazon S3 bucket list with "empty" and "delete" buttons highlighted

After those steps, you shouldn’t have any resources left to charge your AWS account.
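For completeness, the same cleanup can be scripted with boto3; a sketch, again with the cluster ID and bucket name as placeholders:

import boto3

# Terminate the cluster (irreversible, as noted above)
emr = boto3.client("emr")
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])  # placeholder cluster ID

# Empty the bucket, then delete it; S3 won't delete a non-empty bucket
s3 = boto3.resource("s3")
bucket = s3.Bucket("your-name-my-emr-test")  # placeholder bucket name
bucket.objects.all().delete()
bucket.delete()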

Conclusion

In this tutorial, we’ve covered prepping data and an application (script) for processing by a cluster. We’ve covered using the Amazon EMR console, creating a cluster, and creating and running a step on the cluster. We also covered cleaning up after yourself to avoid unnecessary charges to your AWS account. You should now have a foundation to start using Amazon EMR for data processing.

If you’re interested in continuing to learn about big data and Apache Spark, check out our course “Introduction to Big Data with PySpark” or the article “What is Spark?”. You can delve into AWS by looking at “What is AWS?” and “Big Data Storage and Computing”.

Author

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
