Big Data Storage and Computing

Codecademy Team
Learn about the challenges of storing and analyzing big data

Big Data Challenges

Every single day, over 2.5 quintillion bytes of data are created. That’s 2.5 × 10^18 bytes, or 25 followed by 17 zeroes! From transactional sales data to Internet of Things (IoT) devices, data sources grow in both size and velocity at a rapid rate. When thinking about the massive scale of data, we might wonder: where are all of these data stored? And how do we get enough computing power to process them?

Traditionally, we can view a basic dataset as a table in Excel or an equivalent application. These standard solutions require that we pull an entire dataset into memory on a single processing machine. When a data table becomes very large, it can exceed the random access memory (RAM) available for computation, and the application will either crash or take too long to process it, making analysis impractical. Thus, we need to find alternative ways to store and process this big data!
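To get a feel for when a table stops fitting in memory, we can do a rough back-of-the-envelope estimate. The sketch below assumes a purely numeric table with 8 bytes per value and ignores any overhead added by the application; the function names and the 16 GiB machine are hypothetical and chosen only for illustration.

```python
def estimated_table_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Rough in-memory size of a purely numeric table (no overhead counted)."""
    return n_rows * n_cols * bytes_per_value


def fits_in_ram(n_rows: int, n_cols: int, ram_bytes: int, bytes_per_value: int = 8) -> bool:
    """True if the estimated table size is no larger than the available RAM."""
    return estimated_table_bytes(n_rows, n_cols, bytes_per_value) <= ram_bytes


GIB = 1024 ** 3
ram = 16 * GIB  # hypothetical single machine with 16 GiB of RAM

# 1 million rows x 20 columns is about 160 MB, which fits comfortably.
print(fits_in_ram(1_000_000, 20, ram))
# 5 billion rows x 20 columns is about 800 GB, far more than one machine's RAM.
print(fits_in_ram(5_000_000_000, 20, ram))
```

Real tables with text columns or application overhead will usually need more memory than this estimate suggests, so the limit is reached even sooner in practice.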

Big Data Storage

A popular solution for big datasets is a distributed file system on a network of hardware called a cluster. A cluster is a group of several machines called nodes, with a cluster manager node and multiple worker nodes.

Illustration showing the structure of a cluster. The cluster manager has computing power and sends commands to three worker nodes. The worker nodes have both storage and computing power.

The cluster manager manages resources and sends commands to the worker nodes that store the data. Data saved on worker nodes are replicated multiple times for fault tolerance. This allows access to the complete dataset even in the event that one of the worker nodes goes offline. This type of file storage system also scales horizontally: when we need more capacity, we can add more worker nodes to the cluster.
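The idea of replication for fault tolerance can be illustrated with a small simulation. This is only a sketch: the round-robin placement rule, the node names, and the replication factor of 2 are assumptions made for the example, not how any particular distributed file system chooses replica locations.

```python
def place_blocks(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = {nodes[(i + r) % len(nodes)] for r in range(replication)}
    return placement


def readable_blocks(placement, offline):
    """Return the blocks that still have at least one replica on a live node."""
    down = set(offline)
    return {block for block, holders in placement.items() if holders - down}


blocks = ["block-0", "block-1", "block-2", "block-3"]
nodes = ["worker-a", "worker-b", "worker-c"]
placement = place_blocks(blocks, nodes, replication=2)

# With one worker offline, every block is still readable from its other replica.
print(readable_blocks(placement, offline=["worker-a"]) == set(blocks))
# With two workers offline, blocks whose replicas were both on those workers are lost.
print(sorted(readable_blocks(placement, offline=["worker-a", "worker-b"])))
```

With two copies of each block, the dataset survives the loss of any single worker but not necessarily the loss of two, which is why a higher replication factor buys more fault tolerance at the cost of more storage.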

Hadoop Distributed File System

One commonly used framework for a cluster system is the Hadoop Distributed File System (HDFS), which is part of Apache Hadoop, a set of open-source tools maintained by the Apache Software Foundation. HDFS was designed to store vast amounts of data to be processed using another framework called MapReduce. However, implementing a distributed file system like this requires a specific hardware configuration that can be a costly barrier to entry for many companies. For this reason, cloud-hosted HDFS is a popular alternative. Microsoft Azure and Amazon Web Services (AWS) offer cloud-based HDFS solutions, allowing companies to outsource a system’s setup and hardware management in exchange for a recurring service fee.
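To make the MapReduce idea more concrete, here is a minimal single-machine sketch of its map, shuffle, and reduce steps using a word count. It uses plain Python rather than the Hadoop API, and the function names and sample text are hypothetical; in a real cluster, each chunk would be a block of a file processed on the worker node that stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block of a file stored on a different worker node.
chunks = ["big data big clusters", "data nodes store data"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each chunk is mapped independently, the map step can run on many worker nodes at once, and only the much smaller grouped results need to move across the network.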

Because HDFS solutions both store and process data on each worker node, computing power grows together with storage. When data grow in size, we can increase the number of nodes to add more storage and computing power. This is advantageous for scaling but can become expensive as the number of nodes increases, because storage and computing power cannot be added separately.
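A quick calculation shows why coupling storage and computing power on each node can get expensive. The per-node figures below (10 TB of storage and 16 cores) and the workload are made-up assumptions for illustration only.

```python
import math


def nodes_needed(storage_tb, cpu_cores, node_storage_tb=10, node_cores=16):
    """Nodes required when each node supplies a fixed bundle of storage and cores."""
    for_storage = math.ceil(storage_tb / node_storage_tb)
    for_compute = math.ceil(cpu_cores / node_cores)
    # The larger of the two requirements decides the cluster size.
    return max(for_storage, for_compute)


# Hypothetical storage-heavy workload: 500 TB of data, but only 64 cores needed.
n = nodes_needed(storage_tb=500, cpu_cores=64)
unused_cores = n * 16 - 64

print(n)             # node count, driven entirely by the storage requirement
print(unused_cores)  # cores that are paid for but not needed by this workload
```

In this sketch the storage requirement alone forces a 50-node cluster, so most of the computing power that comes with those nodes sits idle.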

Object Storage

Another type of distributed storage is growing quickly in popularity because it separates storage from computing power. Object storage is used only for storage, so we can run any kind of computing power or framework on top of our data. Cloud providers like Microsoft Azure, Amazon Web Services (AWS), and Google Cloud host object storage layers, where we can store any kind of file and dataset.

These storage layers have an advantage over HDFS in that they have a low barrier to entry and are very flexible. Users can store any kind of file in a variety of formats, from CSV and Parquet files to open-source table formats such as Delta Lake and Apache Iceberg, which provide better performance and reliability. This separation also allows us to grow either storage or computing power independently of the other to meet needs more efficiently.
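The separation of storage from computing can be sketched with a toy object store. The `ObjectStore` class below is a hypothetical in-memory stand-in, not any cloud provider's API, and the keys and file contents are invented for the example. The point is that the storage layer only saves and returns bytes, while independent pieces of computing code decide what to do with them.

```python
class ObjectStore:
    """Toy stand-in for a cloud object storage layer: a flat key -> bytes mapping."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def list_keys(self, prefix: str = ""):
        return sorted(k for k in self._objects if k.startswith(prefix))


store = ObjectStore()
store.put("sales/2024/01.csv", b"region,amount\neast,10\nwest,15\n")
store.put("sales/2024/02.csv", b"region,amount\neast,7\nwest,3\n")
store.put("images/logo.png", b"\x89PNG...")  # any kind of file can be stored


def total_sales(store, prefix):
    """One 'compute engine': sum the amount column of every CSV object under a prefix."""
    total = 0
    for key in store.list_keys(prefix):
        lines = store.get(key).decode().splitlines()[1:]  # skip the header row
        total += sum(int(line.split(",")[1]) for line in lines)
    return total


def count_objects(store, prefix=""):
    """A second, unrelated 'engine' that reads the same storage layer."""
    return len(store.list_keys(prefix))


print(total_sales(store, "sales/"))
print(count_objects(store))
```

Neither function lives inside the store, so we could swap in a different processing framework, or run more copies of it, without touching how the data are stored.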