What is AWS EMR? A Complete Beginner’s Guide
What is AWS EMR?
AWS EMR (Elastic MapReduce) is a managed cluster platform for running big data frameworks such as Apache Spark to process data at petabyte scale. It also lets you develop applications in EMR Studio using tools such as EMR Notebooks.
Key features:
- Scalable clusters: You can easily spin up, resize, or terminate clusters of virtual servers (EC2 instances).
- Integrated tools: Supports popular big data tools like Hadoop, Spark, Hive, HBase, Presto, etc.
- Pay-as-you-go: You pay only for the compute and storage you use.
Note: AWS EMR is not a free service. For pricing details, you can visit the official website.
Now that we’ve got a brief overview of AWS EMR, let’s learn how to set it up.
How to set up AWS EMR
There are a few steps that you need to complete to start using AWS EMR:
- You’ll need an AWS account. If you don’t have one, sign up on the official website.
- When setting up AWS, it is best practice to set up an administrative user for daily tasks like working with AWS EMR, so you aren’t doing everything with a root account.
- If you want to connect to AWS EMR clusters over SSH, you'll need to create an EC2 key pair to authenticate. The AWS documentation has instructions for creating one in the console, or you can script it, as shown below.
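If you prefer to script the key pair step, here is a minimal sketch using boto3 (the AWS SDK for Python). The key pair name is a placeholder, and the sketch assumes your AWS credentials and default region are already configured locally:

import boto3

ec2 = boto3.client("ec2")

# "my-emr-key" is a placeholder name; pick anything unique in your account.
response = ec2.create_key_pair(KeyName="my-emr-key")

# Save the private key material so you can SSH into cluster nodes later.
with open("my-emr-key.pem", "w") as key_file:
    key_file.write(response["KeyMaterial"])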
Next, let’s see how to set up an AWS application.
Setting up an AWS application
To process data with AWS EMR, you need to load the data and the instructions to process it. For this tutorial, you’ll store these in an Amazon S3 bucket. To create the bucket, sign into the AWS Management Console and launch the Amazon S3 console. Open the navigation pane to the left and click on “Buckets”.
This will take you to the list of buckets available in your AWS account. On this screen, click on the “Create Bucket” button:
This will take you to a page of options for your new bucket. You’ll be prompted for a region and a name. Choose a region close to you and name the bucket something like “[your name]-my-emr-test.” You can accept the defaults for other settings on this page. On the bottom of the page, click on “Create Bucket.” You’ll be taken back to the list of buckets with your new bucket listed.
You now have a location for your data and your application.
Note: If nothing happens when you click “Create Bucket”, there may be a name conflict. Scroll up and see if there’s a warning next to the name textbox. If there is, alter the name to make it unique.
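If you'd rather create the bucket from code instead of the console, here is a minimal boto3 sketch. The bucket name and region are placeholders; note that, as a quirk of the S3 API, the CreateBucketConfiguration argument is omitted when the region is us-east-1:

import boto3

s3 = boto3.client("s3", region_name="us-west-2")  # placeholder region

# Bucket names are globally unique, so adjust this if you get a name conflict.
bucket_name = "your-name-my-emr-test"

s3.create_bucket(
    Bucket=bucket_name,
    CreateBucketConfiguration={"LocationConstraint": "us-west-2"},
)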
To upload files to your new bucket, click on its name in the bucket list. You’ll be taken to a home page for your bucket, and there will be an “Upload” button to the right. You’ll click on this to upload your files.
Before you do this, you’ll need data and an application to access it. For data, you’ll load a CSV dataset. To get one for this tutorial, go to Kaggle and download the Google Play Store Apps dataset. Once you download the archive.zip file, unzip it into three files. The one you’ll be using is googleplaystore.csv. Once you have this file locally, you can upload it to your bucket.
Clicking “Upload” on your bucket’s home screen will take you to an upload page, in which you can add files and upload them. Click on “Add Files” and select googleplaystore.csv from where you saved it.
Once you add the file, scroll to the bottom of the page and click “Upload”. The CSV file will then be uploaded to your bucket.
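The same upload can also be done programmatically. Here is a minimal boto3 sketch, assuming the CSV sits in your working directory and MY-BUCKET-NAME is replaced with your bucket's name:

import boto3

s3 = boto3.client("s3")

# Upload the local CSV to the root of the bucket under the same key name.
s3.upload_file("googleplaystore.csv", "MY-BUCKET-NAME", "googleplaystore.csv")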
Next, you need an application: something the cluster can use to process the data. For this, you’ll use a Python script written with PySpark. Below is a sample script to use for this tutorial:
import argparse

from pyspark.sql import SparkSession


def get_top_ten_categories(source, output):
    """Process Google Play Store data and return the top 10 categories by average rating.

    source: The URI of the Play Store dataset, 's3://MY-BUCKET-NAME/googleplaystore.csv'
    output: The URI where output is written, 's3://MY-BUCKET-NAME/output'
    """
    with SparkSession.builder.appName("Get Top Ten Categories").getOrCreate() as spark:
        # Load CSV data
        if source is not None:
            google_df = spark.read.option("header", "true").csv(source)

            # Create an in-memory DataFrame to query
            google_df.createOrReplaceTempView("google_play_store")

            # Create a DataFrame of our query
            top_ten = spark.sql("""
                SELECT Category, AVG(Rating) AS avg_rating
                FROM google_play_store
                WHERE NOT Rating = "NaN" AND Rating <= 5
                GROUP BY Category
                ORDER BY avg_rating DESC LIMIT 10
            """)

            # Write output
            top_ten.write.option("header", "true").mode("overwrite").csv(output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--source')
    parser.add_argument('--output')
    args = parser.parse_args()

    get_top_ten_categories(args.source, args.output)
Save the above script as google_top_ten.py and then upload it to your bucket like you did with the googleplaystore.csv file.
Once both files are in place, you’re ready to launch your cluster.
Launching a cluster in AWS EMR
To launch a cluster, you need to be signed into the AWS Management console. Then, you need to open the Amazon EMR Console. On the menu to the left in the Amazon EMR Console, you want to choose “Clusters” if it hasn’t been selected already. This will open a page listing your EMR clusters, and to the right will be a button, “Create Cluster.”
Click on “Create Cluster.” This will open a Create Cluster page. Most of the options will be filled with default values that you can leave as is for this tutorial.
- For the “Cluster name” enter a unique cluster name to identify your cluster.
- Under “Application Bundle”, make sure you select the option “Spark”, because you want Apache Spark installed on the cluster. You won’t be able to change the application selection once the cluster is created.
- Under “Cluster Logs”, change the path to the bucket you created earlier and add /logs to the path to create a separate folder for the cluster logs. The path should look something like this: s3://MY-BUCKET-NAME/logs.
- Under “Amazon EMR Service Role”, select “Create a service role.” Accept the defaults for “Virtual Private Cloud” and “Subnet”, and under “Security Group”, select the default.
- Under “EC2 instance profile for Amazon EMR”, select “Create an instance profile” and set access to “All S3 buckets in this account”.
Once these options are filled out, you can click on the “Create cluster” button on the right. After a few moments, you will be taken to the information page of your new cluster. To the right is a box labeled “Status and time” which will tell you the status of your cluster, either “Starting,” “Running” or “Waiting”.
Wait for the cluster to finish starting before continuing. You can use the “Refresh” button to update the current status.
Once the status is “Waiting”, the cluster is ready for a job.
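The console is the easiest way to follow along, but for completeness, here is a rough sketch of launching an equivalent Spark cluster with boto3's run_job_flow. The release label, instance types, and role names below are assumptions based on common EMR defaults, so check what your account actually uses:

import boto3

emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="my-emr-test-cluster",
    ReleaseLabel="emr-7.1.0",            # assumed release label; use a current one
    LogUri="s3://MY-BUCKET-NAME/logs",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",   # assumed instance type
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster in "Waiting"
    },
    ServiceRole="EMR_DefaultRole",            # assumed default role names
    JobFlowRole="EMR_EC2_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])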
Submitting work to the EMR cluster
Now you can submit work to the EMR cluster. You do this by submitting a step: a set of actions you want the cluster to perform. To do this, navigate to the Amazon EMR Console and select “Clusters” on the menu to the left. You should see a list of clusters with your new cluster listed. To confirm the status is listed as “Waiting”, click on the expand button to the left of your cluster.
This will expand the information about your cluster and the steps attached to it. It will also reveal an “Add Step” button.
Click on “Add step.” This will take you to the add step page. To create the step, set the following:
- Change the type to “Spark application.”
- Enter a name to identify the step, such as “Google Play Store Script”
- Leave “Deploy Mode” as the default.
- Enter “Application Location.” This is the URI of the Python script you saved earlier. Something like s3://MY-BUCKET-NAME/google_top_ten.py.
- Leave “Spark submit options” blank.
- Under “Arguments”, enter the command line arguments for the Python script, using your bucket name in place of “MY-BUCKET-NAME”:
--source "s3://MY-BUCKET-NAME/googleplaystore.csv"
--output "s3://MY-BUCKET-NAME/output"
- Under “Action if step fails” accept the default, “Continue”
After configuring the step, click on the “Add step” button at the bottom of the page. This will take you back to the page for your cluster with the “Steps” tab selected. Your new step will be listed with a status of “Pending”. The step should progress from “Pending” to “Running” to “Completed”. You can use the refresh button to update the status.
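Steps can also be submitted without the console. Here is a minimal boto3 sketch, assuming your cluster ID (visible on the cluster's detail page, starting with "j-") and bucket name are substituted in:

import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # replace with your cluster ID
    Steps=[
        {
            "Name": "Google Play Store Script",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar lets EMR run spark-submit as a step.
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://MY-BUCKET-NAME/google_top_ten.py",
                    "--source", "s3://MY-BUCKET-NAME/googleplaystore.csv",
                    "--output", "s3://MY-BUCKET-NAME/output",
                ],
            },
        }
    ],
)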
After the step is completed, you can download the results. To do this, go to the Amazon S3 Console, then select the bucket you created earlier. There should be a new folder there named output.
Click on the output folder, and you should see the results of your step. There should be a small file named _SUCCESS, and there should also be a CSV file that begins with part-. This CSV file is the result of our PySpark query. To look at it, check the box next to it and click on the “Download” button. Once you save it locally, you can open it up. It should look something like this:
Category,avg_rating
EVENTS,4.435555555555557
EDUCATION,4.389032258064517
ART_AND_DESIGN,4.358064516129031
BOOKS_AND_REFERENCE,4.346067415730338
PERSONALIZATION,4.335987261146501
PARENTING,4.300000000000001
GAME,4.2863263445761195
BEAUTY,4.278571428571428
HEALTH_AND_FITNESS,4.2773648648648654
SHOPPING,4.259663865546221
Congratulations, you’ve just processed data using an AWS EMR cluster.
Lastly, we need to clean up the EMR cluster. Let’s learn how to do it.
Cleaning up the EMR cluster
You should not leave any resources you’re no longer using in your AWS account, so after you’ve used your EMR Cluster, you need to clean up after yourself.
First, you need to terminate your cluster.
Note: Terminated clusters cannot be restarted. However, AWS EMR retains the cluster metadata for two months, and you may use that to clone a new cluster with the same settings.
To terminate the cluster, return to the Amazon EMR Console. Choose “Clusters” from the menu to the left, and then select the checkbox next to your cluster. Then click on the “Terminate” button above the cluster list. Then, confirm you want to terminate the cluster.
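Termination can also be scripted. A minimal boto3 sketch, with the cluster ID as a placeholder:

import boto3

emr = boto3.client("emr")

# Terminating is irreversible; double-check the cluster ID first.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])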
Next, delete the S3 bucket and the contents used for this tutorial. Go to the Amazon S3 Console and select the bucket you created earlier. Then click on “Empty.” Confirm that you want to empty the bucket.
Note: You cannot delete an S3 bucket that isn’t empty.
After emptying the bucket, you can select it and click on “Delete”. Confirm that you wish to delete the bucket.
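If you prefer to clean up the bucket from code, here is a rough boto3 sketch that empties and then deletes it. It assumes the bucket is unversioned; for a versioned bucket you would delete the object versions as well:

import boto3

bucket = boto3.resource("s3").Bucket("MY-BUCKET-NAME")

# Delete every object first; S3 refuses to delete a non-empty bucket.
bucket.objects.all().delete()

# Now the bucket itself can be removed.
bucket.delete()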
After those steps, you shouldn’t have any resources left to charge your AWS account.
Now, we’ll discuss the advantages and disadvantages of AWS EMR.
Advantages and disadvantages of AWS EMR
AWS EMR offers several advantages, including:
- Cost-effective pricing: AWS EMR offers a pay-as-you-go pricing model, so you pay only for the compute and storage you use. Additionally, the option to use Spot Instances can significantly reduce costs for large-scale processing jobs.
- Highly scalable infrastructure: AWS EMR can automatically scale your cluster size up or down depending on workload demands. This elasticity ensures optimal performance for both small and large data processing needs without manual intervention.
- Fully managed service: AWS takes care of cluster provisioning, configuration, monitoring, and optimization. This managed approach reduces the operational burden on teams and speeds up deployment and development cycles.
- Deep AWS integration: AWS EMR integrates seamlessly with core AWS services such as S3, RDS, DynamoDB, CloudWatch, and Redshift. This connectivity streamlines data movement, enhances monitoring, and simplifies building end-to-end data workflows.
However, AWS EMR has some disadvantages as well:
- Steep learning curve for beginners: Despite being managed, AWS EMR requires a solid understanding of big data tools and AWS services. For newcomers, the initial setup and configuration can be complex and overwhelming.
- Startup latency: Cluster initialization, especially with custom bootstrap scripts, can be time-consuming. This makes AWS EMR less ideal for quick, ad-hoc jobs that require immediate execution.
- AWS lock-in: AWS EMR is deeply tied to the AWS ecosystem, making it challenging to migrate workflows to other cloud platforms. This vendor lock-in can be a limitation for hybrid or multi-cloud strategies.
- Unpredictable costs if unmanaged: While AWS EMR can be affordable, poorly managed clusters or long-running jobs can lead to unexpectedly high bills. It’s essential to implement cost monitoring and lifecycle management practices.
Next, let’s go through some best practices for using AWS EMR.
Best practices for using AWS EMR
Applying these best practices will help you make the most of AWS EMR:
- Choose the right instance types: Select instance types based on workload requirements. Use memory-optimized instances (e.g., r5, r6) for Spark or Hive, and compute-optimized instances (e.g., c5, c6) for CPU-intensive tasks. Consider mixing instance types in core and task nodes based on performance needs.
- Use spot instances wisely: Leverage Spot Instances for task nodes to reduce costs, but always use On-Demand or Reserved Instances for master and critical core nodes to maintain cluster stability. Use EMR’s instance fleet or instance group feature to define fallback options in case Spot capacity isn’t available.
- Enable auto-scaling: Configure automatic scaling based on metrics like YARN memory or HDFS usage. This ensures your cluster adjusts to changing workloads without manual intervention, improving cost efficiency and performance. (A sketch of enabling managed scaling with the SDK follows this list.)
- Store data in Amazon S3 instead of HDFS: Use Amazon S3 with EMRFS (EMR File System) for storage instead of HDFS. This decouples storage from compute, allows easier data sharing between clusters, and ensures persistence after the cluster is terminated.
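As an example of the auto-scaling point above, EMR managed scaling can be attached to a running cluster with boto3. The cluster ID and capacity limits below are placeholder assumptions:

import boto3

emr = boto3.client("emr")

# Let EMR scale the cluster between 2 and 10 instances based on workload.
emr.put_managed_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",  # replace with your cluster ID
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 10,
        }
    },
)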
Following these best practices will ensure effective usage of AWS EMR.
Conclusion
In this tutorial, we had a detailed discussion on AWS EMR, covering setup, launching an EMR cluster, submitting work to the cluster, and cleaning it up after use. We also went through its advantages and disadvantages, and some best practices for using it efficiently.
AWS EMR is a powerful, scalable, and cost-effective solution for processing large-scale data workloads in the cloud. By leveraging open-source big data frameworks and integrating seamlessly with the broader AWS ecosystem, EMR enables organizations to perform complex data analysis, ETL jobs, and machine learning at scale without the overhead of managing physical infrastructure.
If you’re interested in continuing to learn about big data and Apache Spark, you can check out our Introduction to Big Data with PySpark course on Codecademy.
Frequently asked questions
1. What is the difference between EC2 and AWS EMR?
- Amazon EC2 (Elastic Compute Cloud) is a general-purpose virtual server hosting service that enables you to run any software, including big data tools, on scalable virtual machines.
- AWS EMR (Elastic MapReduce) is a managed service built on top of EC2 that simplifies running big data frameworks like Apache Spark, Hadoop, and Hive. EMR manages provisioning, configuring, and tuning EC2 instances specifically for distributed data processing.
2. Is AWS EMR an ETL tool?
Not exactly. AWS EMR is not an ETL tool by itself, but it can be used to perform ETL (Extract, Transform, Load) tasks using frameworks like Spark or Hive. You write the ETL logic, and EMR provides the infrastructure to run it at scale.
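For instance, a typical ETL step running on EMR might look like the following PySpark sketch, which reads raw CSV data from S3, applies a simple transformation, and writes the result back to S3 as Parquet. The bucket paths and the "Rating" column are hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Simple ETL").getOrCreate()

# Extract: read raw CSV data from S3.
raw = spark.read.option("header", "true").csv("s3://MY-BUCKET-NAME/raw/")

# Transform: keep only rows with a non-null "Rating" column (hypothetical schema).
cleaned = raw.filter(col("Rating").isNotNull())

# Load: write the cleaned data back to S3 in a columnar format.
cleaned.write.mode("overwrite").parquet("s3://MY-BUCKET-NAME/cleaned/")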
3. What is the difference between AWS EMR and Redshift?
- AWS EMR is a big data processing service that runs tools like Spark, Hive, and Hadoop to handle raw or unstructured data stored in S3 or HDFS. It’s ideal for tasks like ETL, data transformation, and machine learning.
- Amazon Redshift, on the other hand, is a data warehouse designed for structured data and high-performance analytics using SQL. It stores data in a columnar format and is best suited for BI and reporting.
4. Is AWS EMR serverless?
Partially. AWS offers EMR Serverless, which allows you to run Spark and Hive workloads without managing clusters. You just submit your jobs, and EMR Serverless automatically provisions and scales resources for you.
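As a rough sketch of what that looks like with boto3 (the application ID and IAM role ARN here are placeholders you would create beforehand):

import boto3

serverless = boto3.client("emr-serverless")

# Submit a Spark job to an existing EMR Serverless application.
serverless.start_job_run(
    applicationId="00f0000000000000",  # placeholder application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessRole",  # placeholder role
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://MY-BUCKET-NAME/google_top_ten.py",
            "entryPointArguments": [
                "--source", "s3://MY-BUCKET-NAME/googleplaystore.csv",
                "--output", "s3://MY-BUCKET-NAME/output",
            ],
        }
    },
)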
5. What is the difference between AWS EMR and Spark?
- Apache Spark is a popular open-source distributed data processing engine.
- AWS EMR is a managed service that can run Spark, along with other big data frameworks like Hadoop, Hive, and Presto.