What is Big Data?
Data is Everywhere
We generate data from all kinds of activities, whether a transaction at a local store, a website we visit, or even the location of our cell phone. At the time of this writing, an average of 500 million tweets are written on Twitter each day. Imagine being the person at Twitter who has to analyze this data! Data of this size is often referred to as big data.
What exactly is big data? In general, big data is any data that is too big for a typical modern computer to process and analyze. This means, however, that the definition of big data is relative to the amount of computing power we have available. For example:
- Most current personal computers have between 8 and 32 GB of random access memory (RAM) available for data processing. That means, from the perspective of a personal computer, any dataset larger than 10-20 GB might be too large to process.
- A large enterprise can take advantage of larger computing resources (e.g., a warehouse of servers or the cloud), so 100+ GB might be the upper limit for the size of data the enterprise can handle.
- Datasets measured in terabytes are perhaps the largest being worked with at this time (1 TB = 1000 GB).
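To make the relative nature of this definition concrete, here is a minimal Python sketch. It assumes a deliberately simple rule: a dataset counts as big data in a given environment when it is larger than the capacity available to process it. The function names and the capacity figures (16 GB and 100 GB) are illustrative assumptions loosely based on the examples above, not fixed thresholds.

```python
from typing import Dict

GB_PER_TB = 1000  # decimal convention used in the text: 1 TB = 1000 GB


def is_big_data(dataset_gb: float, capacity_gb: float) -> bool:
    """Return True when the dataset exceeds the capacity available to process it.

    Simplifying assumption: "too big" means "larger than the capacity".
    """
    if dataset_gb < 0 or capacity_gb <= 0:
        raise ValueError("dataset size must be >= 0 and capacity must be > 0")
    return dataset_gb > capacity_gb


# Hypothetical capacities in GB, chosen only for illustration.
CAPACITIES_GB = {
    "personal computer": 16,
    "large enterprise": 100,
}


def classify(dataset_gb: float) -> Dict[str, bool]:
    """Map each environment to whether the dataset counts as big data there."""
    return {
        name: is_big_data(dataset_gb, capacity)
        for name, capacity in CAPACITIES_GB.items()
    }


if __name__ == "__main__":
    # Try a small dataset, a mid-sized one, and a 2 TB one.
    for size_gb in (5, 50, 2 * GB_PER_TB):
        print(f"{size_gb} GB -> {classify(size_gb)}")
```

Under these assumed capacities, the same 50 GB dataset is big data for the personal computer but not for the enterprise, which is the sense in which the definition depends on available computing power.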
In the following applet, try adjusting the slider to see what qualifies as big data as we increase our modern computing power.
Big data has not always been a concern in data analysis. For most of history, we have been able to handle the amount of data we collect. Before computers, scientists would perform calculations by hand on data recorded for a research sample. With the invention of computers, we were able to process data more quickly and were generally able to keep up with the amount of data we had available.
In more recent history, however, sources of data have continued to grow and are outpacing the growth of computing power. In the mid-2000s, after the massive growth of the internet, many analysts in the industry were struggling to handle their own data. Roger Mougalas coined the term “big data” when referring to a dataset that was unmanageable with the business intelligence tools of the time.