
Big Data Storage and Computing


Big Data Challenges

Every single day, over 2.5 quintillion bytes of data are created. That’s 2.5 with 18 zeroes after it! From transactional sales records to Internet of Things (IoT) devices, data sources are growing rapidly in both size and velocity. When thinking about this massive scale, we might wonder: where are all of these data stored? And how do we get enough computing power to process them?

Traditionally, we can view a basic dataset as a table in Excel or an equivalent application. These standard tools require pulling the entire dataset into memory on a single machine. When a table grows very large, it exceeds the random access memory (RAM) available for computation, and the program either crashes or takes far too long to run, making analysis impractical. Thus, we need to find alternative ways to store and process this big data!
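To make the memory constraint concrete, here is a minimal pandas sketch (the file name and column are hypothetical) that contrasts an all-at-once load with streaming the file in chunks. Chunking keeps memory bounded, but it only delays the problem once a single machine’s disk and CPU are no longer enough.

```python
import pandas as pd

# Loading the entire file at once can exhaust a single machine's RAM:
# df = pd.read_csv("sales.csv")   # may crash for a very large file

# Streaming fixed-size chunks keeps memory usage bounded.
total = 0.0
for chunk in pd.read_csv("sales.csv", chunksize=1_000_000):
    total += chunk["amount"].sum()   # aggregate each chunk, then combine

print(f"Total sales: {total}")
```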

Big Data Storage

A popular solution for big datasets is a distributed file system on a network of hardware called a cluster. A cluster is a group of several machines called nodes, with a cluster manager node and multiple worker nodes.

[Illustration: the structure of a cluster. The cluster manager has computing power and sends commands to three worker nodes; the worker nodes have both storage and computing power.]

The cluster manager allocates resources and sends commands to the worker nodes, which store the data. Data saved on the worker nodes are replicated multiple times for fault tolerance, so the complete dataset remains accessible even if one of the worker nodes goes offline. This type of file storage system also scales out easily: whenever more capacity is needed, additional worker nodes can be added.

Hadoop Distributed File System

One commonly used framework for a cluster system is the Hadoop Distributed File System (HDFS), part of the set of tools distributed by Apache. HDFS was designed to store vast amounts of data to be processed with another framework called MapReduce. However, implementing a distributed file system like this requires a specific hardware configuration that can be a costly barrier to entry for many companies. For this reason, cloud-hosted HDFS is a popular alternative: Microsoft Azure and Amazon Web Services (AWS) offer managed HDFS services, allowing companies to outsource the system’s setup and hardware management for a recurring cost.

Because HDFS solutions both store and process data on each worker node, they ensure that we have enough computing power to tackle our data problems. As data grow, more nodes can be added for additional storage and computing power. This makes scaling straightforward, but it can become expensive as the number of nodes increases, since storage and compute must be scaled together even when only one of them is needed.
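As a rough sketch of what reading from HDFS can look like in code, the snippet below uses pyarrow’s Hadoop filesystem client. The namenode host, port, and paths are hypothetical, and it assumes a running cluster plus a configured Hadoop client (libhdfs) on the machine.

```python
from pyarrow import fs

# Connect to the cluster's namenode (host and port are placeholders).
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# List the files stored across the worker nodes under a dataset directory.
for info in hdfs.get_file_info(fs.FileSelector("/data/sales", recursive=True)):
    print(info.path, info.size)

# Open one file; HDFS transparently handles block placement and replication.
with hdfs.open_input_file("/data/sales/part-00000.csv") as f:
    first_bytes = f.read(1024)
```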

Object Storage

Another type of distributed file system is growing quickly in popularity because it separates storage from computing power. Object storage handles only storage, so we can bring any kind of computing power or framework to the data. Cloud providers like Microsoft Azure, Amazon Web Services (AWS), and Google Cloud host object storage services where we can store any kind of file or dataset.

These storage layers have an advantage over HDFS in that they have a low barrier to entry and are very flexible. Users can store files in a variety of formats, from CSVs and Parquet to open table formats such as Delta Lake and Apache Iceberg that provide better performance and reliability. The separation also allows us to grow storage and computing power independently of each other, meeting needs more efficiently.
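To illustrate the storage/compute split, the sketch below reads a Parquet dataset directly from an object store into a local analysis tool. The bucket and column names are hypothetical, and it assumes the pandas, pyarrow, and s3fs packages along with valid cloud credentials.

```python
import pandas as pd

# The data live in an object storage bucket (placeholder name); any compute
# engine -- a laptop, a Spark cluster, a warehouse -- can read the same files.
df = pd.read_parquet("s3://example-analytics-bucket/events/2024/")

# Run whatever analysis the local compute resources can handle.
print(df.groupby("event_type").size())
```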

Big Data Computing

Clusters can be composed of any kind of computing power resource. Traditionally, clusters would be a collection of server racks that a person or organization could connect to directly. In modern systems, these machines are typically virtual machines (VMs) that are “spun up” in one of the cloud providers’ environments. This kind of approach has many benefits, namely that it is much more cost-efficient and more scalable than dealing with physical machines.

So how exactly does a big data computing system work? The foundational method for collecting and analyzing data across the nodes is called MapReduce. MapReduce is a framework composed of two actions: map and reduce. The map function transforms the data on each node into key-value pairs. The reduce function then aggregates the values that share a key (for example, by summing them) and returns the result as output.

To see MapReduce in action, imagine counting shapes in a large dataset: the data are split across the worker nodes, each node maps a count for every shape it holds, the counts are shuffled so that all counts for the same shape land together, and a reduce step collapses them to just three totals that are returned as the result (a code sketch of these stages follows below). MapReduce speeds up the processing of big data by having each worker node perform these operations on its own chunk of the dataset, so all workers stay busy rather than waiting on another process to finish.
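Here is a minimal, single-machine Python sketch of those same stages (map, shuffle, reduce); the shapes and chunking are illustrative, and in a real cluster each chunk would live on a different worker node.

```python
from collections import defaultdict

chunks = [
    ["circle", "square", "circle"],    # chunk held by worker node 1
    ["triangle", "circle", "square"],  # chunk held by worker node 2
    ["square", "square", "triangle"],  # chunk held by worker node 3
]

# Map: each worker emits (key, value) pairs -- here, (shape, 1).
mapped = [(shape, 1) for chunk in chunks for shape in chunk]

# Shuffle: group the pairs by key so each shape's counts end up together.
grouped = defaultdict(list)
for shape, count in mapped:
    grouped[shape].append(count)

# Reduce: apply the analytical function (a sum) to each key's values.
results = {shape: sum(counts) for shape, counts in grouped.items()}
print(results)  # {'circle': 3, 'square': 4, 'triangle': 2}
```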

MapReduce was the standard for big data processing for years, but it eventually struggled to keep up with how quickly data were growing and changing. Apache Spark emerged as a better alternative: its main benefit is the ability to process data in each node’s memory instead of reading and writing to disk between steps, as MapReduce does. This provides much better performance and unlocks new capabilities for working with big data.
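For comparison, the same kind of count is a short job in Spark. This is a hedged sketch: it assumes the pyspark package is installed, and the input path and column name are placeholders.

```python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; on a cluster, this coordinates the workers.
spark = SparkSession.builder.appName("shape-count").getOrCreate()

# Read a dataset from distributed storage (the path is a placeholder).
df = spark.read.parquet("s3a://example-analytics-bucket/shapes/")

# The groupBy/count runs in parallel across the workers, largely in memory.
df.groupBy("shape").count().show()

spark.stop()
```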

In order to get value out of big data, we need to utilize the best strategies for storing our data and providing computing power for our analysis. With these in place, we can be ready to scale and grow our analyses!

Codecademy Team

The Codecademy Team, composed of experienced educators and tech experts, is dedicated to making tech skills accessible to all. We empower learners worldwide with expert-reviewed content that develops and enhances the technical skills needed to advance and succeed in their careers.
