Big Data: Part 3
October 19, 2015
This post is the third and last, at least for now, in my series about MR's struggles with big data. Its theme is simple: big data is hard.
For starters, the quality of the data is not what we are accustomed to. More often than not the data were collected for some other purpose, and the scant attention paid to the accuracy of individual items, their overall completeness, their consistency over time, their full documentation, and even their meaning poses serious challenges to reuse. Readers familiar with the Total Survey Error (TSE) model will recognize that big data is vulnerable to all of the same deficiencies as surveys: gaps in coverage, missing data, poor measurement, and so on. The key difference is that survey researchers, at least in theory, design and control the data-making process in a way that users of big data do not. For users of big data the first step is data munging, often a long and difficult process with uncertain results.
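To make that concrete, here is a minimal, hypothetical sketch of the kind of first-pass munging a reused dataset can require. The field names, the mixed formats, and the assumption that -1 is an undocumented "missing" code are all invented for illustration, not taken from any real dataset.

```python
# Hypothetical first-pass munging of a reused transaction file (pandas).
import pandas as pd

raw = pd.DataFrame({
    "store": ["001", "1", "002", "002", None],
    "visit_date": ["2015-09-01", "09/02/2015", "2015-09-02", "2015-09-02", "2015-09-03"],
    "amount": [19.99, -1, 42.50, 42.50, 7.25],
})

clean = raw.copy()
clean["store"] = clean["store"].str.lstrip("0").str.zfill(3)      # inconsistent key formats
clean["visit_date"] = clean["visit_date"].apply(pd.to_datetime)   # mixed date formats
clean["amount"] = clean["amount"].replace(-1, float("nan"))       # undocumented sentinel value
clean = clean.drop_duplicates()                                   # accidental duplicate rows
clean = clean.dropna(subset=["store"])                            # records with no usable key
print(clean)
```

Even this toy example embeds judgment calls (is -1 really missing? are the duplicate rows really duplicates?) that nobody documented when the data were created.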
Then there is the technology. We have all heard about the transition from a world where data are scarce and expensive to one where they are plentiful and cheap, but the reality is that taking big data seriously requires a significant investment in people and technology. There is more to big data than hiring a data scientist. The AAPOR Report on Big Data has a nice summary of the skills and technology needed to do big data. While the report does not put a price tag on the investment, it is likely well beyond what all but the very largest market research companies can afford.
Much of the value of big data lies in the potential to merge multiple data sets together (e.g., customer transaction data with social media data or Internet of Things data), but that, too, can be an expensive and difficult process. The heart of this merging process is the ETL (extract, transform, load) code that specifies what data are extracted from the source databases, how they are edited and transformed for consistency, and how they are then merged into the output database, typically some type of data warehouse. Take a moment and consider the difficulty of specifying all of those rules.
If you have ever written editing specs for a survey dataset then you have some inkling of the difficulty. Now consider that in a data merge from multiple sources you can have the same variable with different coding; the same variable name used to measure different things; differing rules for determining when an item is legitimately missing and when it is not; detailed rules for matching a record from one data source with a record from another; different entities (customers, products, stores, GPS coordinates, tweets) that need to be resolved; and so on. This is difficult, tedious, unglamorous, and error-prone work. Get it wrong, and you have a mess.
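To give a feel for how quickly these rules pile up, here is a minimal, hypothetical ETL sketch that merges two invented sources. The column names, coding schemes, and matching rule are all assumptions made for illustration; a real pipeline would involve many more sources and far more rules.

```python
# Hypothetical ETL sketch: two sources describe the same customers with
# different keys and different codings for the same variable. Every name,
# coding scheme, and matching rule below is invented for illustration.
import pandas as pd

transactions = pd.DataFrame({              # source A: gender coded 1/2, integer IDs
    "cust_id": [101, 102, 103],
    "gender": [1, 2, 2],
    "spend": [54.20, 12.99, 87.00],
})
profiles = pd.DataFrame({                  # source B: gender coded "M"/"F", string IDs
    "customer": ["101", "102", "104"],
    "gender": ["M", "M", "F"],
    "segment": ["loyal", "lapsed", "new"],
})

# Transform: harmonize the same variable coded two different ways.
transactions["gender"] = transactions["gender"].map({1: "male", 2: "female"})
profiles["gender"] = profiles["gender"].map({"M": "male", "F": "female"})

# Resolve entities: the "same" key has a different name and type in each source.
profiles["cust_id"] = pd.to_numeric(profiles["customer"])

# Load: merge into one table; the join rule decides what counts as a match and
# leaves genuinely unmatched records (cust_id 103 and 104) with missing fields.
warehouse = transactions.merge(
    profiles[["cust_id", "gender", "segment"]],
    on="cust_id", how="outer", suffixes=("_txn", "_profile"),
)

# An editing rule still has to say what happens when the two sources disagree.
conflict = (
    warehouse["gender_txn"].notna()
    & warehouse["gender_profile"].notna()
    & (warehouse["gender_txn"] != warehouse["gender_profile"])
)
print(warehouse)
print("records with conflicting gender codes:", int(conflict.sum()))
```

Multiply those few lines by dozens of sources, thousands of variables, and matching rules that are never as clean as an exact ID join, and the scale of the specification problem becomes clear.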
To sum up this and the previous two posts, I worry that big data is a much bigger deal than most of us realize. We may fancy ourselves pioneers in this space, but it's not clear to me that we understand just how hard this is going to be. For all of the talk about paradigm shifts and disruption, this is the real deal, if for no other reason than it is the right methodology (if the word applies here) to support the other big disruption: a shift in focus from attitudes and opinions to behavior.
Back in 2013 I put up a post about a big data event put on by ESOMAR at which John Deighton from Harvard gave the most compelling description I had heard of the threat big data poses to traditional MR. The reaction in the room struck me as more whistling past the graveyard than taking the challenge for what it is. Two years later things don't feel that much different. We are still in a state of denial. We had better get cracking.