I have seen more than my share of MR conference presentations on big data over the last three or four years, and it’s hard not to conclude that we still don’t have a clue. Sure, there have been some really good presentations on the use of non-survey data (what we might call “other data”), but most of them fall well short of both the reality and the promise of big data.
This is the first of three planned posts on big data and it asks the simple question: what is big data? The definition heard most often is the 3Vs of volume, velocity, and variety (now morphed to seven Vs at last count). That is a neat summary of the challenges posed by big data, but it’s hardly a definition. Some folks in the Berkeley School of Information asked 40 different thought leaders for their definitions and got 40 different responses. But they did produce this cool word cloud.
In the same vein, two British academics reviewed the definitions of big data most often used by various players in the big data ecosystem of IT consultants, hardware manufacturers, software developers, and service providers. They noted that most definitions touch on three primary attributes: size, complexity, and tools. They also suggest a definition that, while not especially elegant, seems to hit the key points:
Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce, and machine learning.
Put another way, we might simply say that:
Big data is a term that describes datasets so large and complex that they cannot be processed or analyzed with conventional software systems.
We might further elaborate on that by noting three principal sources:
- Transaction data
- Social media data
- Data from the Internet of Things
This is the world of terabytes, petabytes, exabytes, and zettabytes. MR is still very much stuck in a world of gigabytes. As I write this I am surrounded by 5.5TB of storage with another TB in the cloud, but I don’t confuse any of this with big data. I have been known to fill the 64GB SD card in my Nikon over the course of a two-week vacation, but that’s not big data either.
Big data is Walmart capturing over a million transactions per hour and uploading them to a database in excess of 3PB, or the Weather Channel gathering 20TB of data from sensors all around the world each and every day. The amount of data being generated every minute in transactions, on social media, and by interconnected smart devices boggles the mind. We simply are not operating in that league.
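To put those figures side by side, here is a rough back-of-the-envelope sketch in Python. It assumes decimal units (1TB = 1,000GB), treats "over a million" and "in excess of 3PB" as lower bounds, and uses variable names made up purely for illustration.

```python
# Decimal (SI) storage units, expressed in bytes. This is an assumption:
# the post does not say whether its figures are decimal or binary units.
GB = 10**9
TB = 10**12
PB = 10**15

# Figures quoted in the post (lower bounds where the post says "over").
weather_bytes_per_day = 20 * TB
walmart_db_bytes = 3 * PB
walmart_tx_per_hour = 1_000_000

# The author's own storage, also quoted in the post.
sd_card_bytes = 64 * GB
local_storage_bytes = 5.5 * TB

# How many 64GB SD cards would one day of sensor data fill?
sd_cards_per_day = weather_bytes_per_day / sd_card_bytes

# How many hours of that feed would fill 5.5TB of local storage?
hours_to_fill_local_storage = local_storage_bytes / weather_bytes_per_day * 24

# How many days of that feed add up to a 3PB database?
days_to_reach_walmart_db = walmart_db_bytes / weather_bytes_per_day

# A million transactions per hour, expressed per second.
walmart_tx_per_second = walmart_tx_per_hour / 3600

print(f"SD cards filled per day:       {sd_cards_per_day:.1f}")
print(f"Hours to fill local storage:   {hours_to_fill_local_storage:.1f}")
print(f"Days of feed to reach 3PB:     {days_to_reach_walmart_db:.0f}")
print(f"Transactions per second:       {walmart_tx_per_second:.0f}")
```

On these assumptions, a single day of the sensor feed would fill a few hundred vacation-sized SD cards, and my entire 5.5TB would last only a few hours.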
And then there is the issue of tools. Most of the software we routinely use grinds to a halt with big data. It’s just not built to process files at the petabyte scale. There is more to processing big data than learning R. There is a whole suite of tools, virtually all of which rely on massively parallel processing, that are well beyond what most of us are even thinking about.
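For readers who have not met the idea, the following is a toy sketch of the map-and-reduce pattern that sits behind many of those parallel tools. It is not how any particular product works: the function names are hypothetical, the data is two made-up sentences, and a small thread pool on one machine stands in for what would really be a cluster of many machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count words in one chunk, independently of every other chunk."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(lines, n_chunks=4):
    """Split the input, count each chunk in parallel, then merge the partial counts."""
    # Each chunk is a stand-in for one block of a file too large for one machine.
    chunks = [lines[i::n_chunks] for i in range(n_chunks)]
    # Threads here are an illustrative substitute for separate worker machines.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge_counts, partials, Counter())


sample = [
    "big data is not survey data",
    "found data is not big data",
]
totals = word_count(sample)
print(totals.most_common(3))
```

The point of the design is that no worker ever needs to see the whole dataset. Each one handles its own chunk, and only the small partial results are brought together at the end, which is what lets the approach scale to files that no single machine could hold.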
So let’s get real. MR is doing some interesting and worthwhile things with what has been described as “found data,” but let’s not dress it up as big data. It’s not. And even if it were, we’ve still not grasped the importance of the analytic shift required to really exploit big data’s potential. More on that in my next post.