Big data is a term that refers to data that are too large and complex to be processed by normal computing capabilities. Describing data as “big” is relative to our modern computing power.
Big Data can be characterized by what are known as the 3 Vs:
Big data analysis is limited by the amount of Random Access Memory (RAM) that the available computing resources have. Many big data systems will use a computing cluster to increase the amount of total RAM.
One system for big data storage is called Hadoop Distributed File Storage (HDFS). In this system, a cluster of computing resources stores the data. This cluster consists of a manager node, which sends commands to the worker nodes that house the data.
MapReduce is a framework that can be used to process large datasets stored in a Hadoop Distributed File System (HDFS) cluster. MapReduce consists of two main functions, map and reduce, which can perform complex operations over a distributed system.
MapReduce works by sending commands from the manager node down to the numerous worker nodes, which process subsets of data in parallel. This speeds up processing when compared to traditional data processing frameworks.