
Introduction to Big Data


Big Data Definition

Big Data is a term that refers to any dataset that is too large or complex to be processed with standard computing resources. What counts as “big” is relative and will differ for each person, organization, or system.

Big Data 3 V’s

Big Data can be categorized by what are known as the 3 V’s:

1) Volume - Big Data has a volume too large for the available computing resources to handle.
2) Velocity - Big Data is continually growing in size.
3) Variety - Big Data comes in a variety of different formats and types.

Big Data and RAM

Big Data analysis is limited by the amount of Random Access Memory (RAM) available on the computing resources being used. Many Big Data systems use a computing cluster to increase the total amount of available RAM.
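For example, the following minimal sketch (assuming the third-party psutil package and a hypothetical file named sales_records.csv) estimates whether a dataset is likely to fit in the RAM available on a single machine:

```python
import os
import psutil  # third-party package for querying system memory

DATASET_PATH = "sales_records.csv"  # hypothetical file used for illustration

dataset_bytes = os.path.getsize(DATASET_PATH)          # size of the dataset on disk
available_ram = psutil.virtual_memory().available      # RAM currently available

if dataset_bytes > available_ram:
    print("Dataset is larger than available RAM -- consider a computing cluster.")
else:
    print("Dataset should fit in memory on this machine.")
```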

HDFS Overview

One paradigm for Big Data storage is called the Hadoop Distributed File System (HDFS). In this paradigm, the system has a cluster of computing resources that is large enough to store and process the data. This cluster consists of a manager node, which sends commands to the worker nodes that do the data processing.
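The snippet below is a toy, single-machine illustration of this idea (not actual HDFS code): a file is split into fixed-size blocks, and a manager assigns each block to a worker node. The node names and block size are made up for illustration; real HDFS blocks default to 128 MB.

```python
BLOCK_SIZE = 16                                  # bytes per block (toy value)
WORKERS = ["worker-1", "worker-2", "worker-3"]   # hypothetical worker node names

data = b"example bytes standing in for a very large file on the cluster"

# Split the data into blocks and assign each block to a worker round-robin.
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
placement = {f"block-{n}": WORKERS[n % len(WORKERS)] for n in range(len(blocks))}

for block_id, node in placement.items():
    print(block_id, "->", node)
```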

MapReduce Overview

MapReduce is a framework that can be used to process large datasets stored in HDFS. MapReduce consists of two main functions, map and reduce, which can perform complex operations over a distributed system.
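As a rough sketch of the idea, the classic word-count example below runs the map, shuffle, and reduce steps in plain Python on a single machine (no Hadoop involved); the sample documents are made up for illustration:

```python
from collections import defaultdict

documents = ["big data is big", "data systems process big data"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # e.g. {'big': 3, 'data': 3, 'is': 1, ...}
```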

MapReduce Process Overview

MapReduce works by sending commands from the manager node down to the numerous worker nodes, which process subsets of data in parallel. This speeds up processing when compared to traditional data processing frameworks.
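The hedged sketch below mimics this pattern on one machine using Python's multiprocessing module: a manager process splits a stand-in dataset into chunks, worker processes count words in parallel, and the partial results are combined at the end.

```python
from multiprocessing import Pool

def count_words(chunk):
    # Each worker counts the words in its subset of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    lines = ["big data is big"] * 1000                          # stand-in dataset
    chunks = [lines[i:i + 250] for i in range(0, len(lines), 250)]

    with Pool(processes=4) as pool:                             # four "worker" processes
        partial_counts = pool.map(count_words, chunks)          # map step, in parallel

    total = sum(partial_counts)                                 # final reduce step
    print(total)  # 4000 words in total
```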