I stole the title of this post from Simon Chadwick's editorial in the November/December issue of Research World. It reminded me that I, like many young people, began my career as something of an idealist. My first two jobs were with nonprofits and then in 1984 I joined NORC at the University of Chicago, whose tagline was and still is, "Social Science Research in the Public Interest." I spent 11 years of my life there and learned an enormous amount before moving in 1995 to what I still self-mockingly refer to as "the dark side."
I was reminded of this while reading Simon's editorial and was especially struck by this sentence:
Polls are where research loses its soul; commercial MR is where it forgets it has one; and social research is where we find it.
This post is the third and last, at least for now, in my series about MR’s struggles with big data. Its theme is simple: big data is hard.
For starters, the quality of the data is not what we are accustomed to. More often than not the data were collected for some other purpose, and the accuracy of individual items, their overall completeness, their consistency over time, their documentation, and even their meaning all pose serious challenges to their reuse. Readers familiar with the Total Survey Error (TSE) model will recognize that big data is vulnerable to all of the same deficiencies as surveys: gaps in coverage, missing data, poor measurement, etc. The key difference is that survey researchers, at least in theory, design and control the data-making process in a way that users of big data do not. For users of big data the first step is data munging, often a long and very difficult process with uncertain results.
Then there is the technology. We all have heard about the transition from a world where data are scarce and expensive to one where they are plentiful and cheap, but the reality is that taking big data seriously requires a significant investment in people and technology. There is more to big data than hiring a data scientist. The AAPOR Report on Big Data has a nice summary of the skills and technology needed to do big data. While the report does not put a price tag on the investment, it likely is well beyond what all but the very large market research companies can afford.
Much of the value of big data lies in the potential to merge multiple data sets together (e.g. customer transaction data with social media data or Internet of Things data), but that, too, can be an expensive and difficult process. At the heart of this merging process are bits of computer code called ETLs (short for extract, transform, load) that specify what data are extracted from the source databases, how they are edited and transformed for consistency, and how they are then loaded into the output database, typically some type of data warehouse. Take a moment and consider the difficulty of specifying all of those rules.
If you have ever written editing specs for a survey dataset then you have some inkling of the difficulty. Now consider that in a data merge from multiple sources you can have the same variable with different coding; the same variable name used to measure different things; differing rules for determining when an item is legitimately missing and when it is not; detailed rules for matching a record from one data source with a record from another; different entities (customers, products, stores, GPS coordinates, tweets) that need to be resolved; and so on. This is difficult, tedious, unglamorous, and error-prone work. Get it wrong, and you have a mess.
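To make those rules concrete, here is a minimal sketch of an ETL step in Python. Everything in it is an assumption made up for illustration: the two sources (`crm_rows`, `web_rows`), their field names, the gender codes, and the missing-value conventions. It is not how any particular ETL tool works; it only shows how quickly the rules pile up even for two tiny sources.

```python
# Hypothetical extracts from two sources. Field names, codes, and
# missing-value conventions are invented for this illustration.
crm_rows = [
    {"cust_id": "001", "gender": "M", "spend": "120.50"},
    {"cust_id": "002", "gender": "F", "spend": ""},       # blank means missing
    {"cust_id": "003", "gender": "F", "spend": "75.00"},
]
web_rows = [
    {"customer": 1, "gender": 1, "visits": 14},   # 1 = male, 2 = female
    {"customer": 2, "gender": 2, "visits": -9},   # -9 means missing
    {"customer": 4, "gender": 2, "visits": 3},
]

# Same variable, different coding in each source.
GENDER_CRM = {"M": "male", "F": "female"}
GENDER_WEB = {1: "male", 2: "female"}


def transform_crm(row):
    """Normalize a CRM record: integer key, common gender code, None for missing."""
    spend = float(row["spend"]) if row["spend"] != "" else None
    return {
        "customer_id": int(row["cust_id"]),
        "gender": GENDER_CRM.get(row["gender"]),
        "spend": spend,
    }


def transform_web(row):
    """Normalize a web record: common gender code, None for the -9 missing flag."""
    visits = row["visits"] if row["visits"] >= 0 else None
    return {
        "customer_id": row["customer"],
        "gender": GENDER_WEB.get(row["gender"]),
        "visits": visits,
    }


def merge(crm, web):
    """Match records on customer_id and keep track of which sources contributed."""
    merged = {}
    for rec in map(transform_crm, crm):
        merged[rec["customer_id"]] = {**rec, "visits": None, "sources": ["crm"]}
    for rec in map(transform_web, web):
        existing = merged.get(rec["customer_id"])
        if existing is None:
            merged[rec["customer_id"]] = {**rec, "spend": None, "sources": ["web"]}
        else:
            if existing["gender"] != rec["gender"]:
                existing["gender"] = None  # sources disagree; leave unresolved
            existing["visits"] = rec["visits"]
            existing["sources"].append("web")
    return [merged[key] for key in sorted(merged)]


warehouse = merge(crm_rows, web_rows)
for row in warehouse:
    print(row)
```

Even this toy version has to decide how to align keys stored as "001" and 1, how to reconcile two gender codings, what counts as missing in each source, and what to do when the sources disagree. Real merges multiply those decisions across hundreds of variables.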
To sum up this and the previous two posts, I worry that big data is a much bigger deal than most of us realize. We may fancy ourselves as pioneering in this space but it’s not clear to me that we understand just how hard this is going to be. For all of the talk about paradigm shifts and disruption, this is the real deal if for no other reason than it is the right methodology (if the word applies here) to support the other big disruption, a shift away from focusing on attitudes and opinions to focusing on behavior.
Back in 2013 I put up a post about a big data event put on by ESOMAR in which John Deighton from Harvard gave a talk that was the most compelling description I had heard of the threat of big data to traditional MR. The reaction in the room struck me as more whistling past the graveyard than taking the challenge for what it is. Two years later things don’t feel that much different. We are still in a state of denial. We had better get cracking.
This second post in my series about MR’s ongoing struggle with big data is focused on our stubborn resistance to the analytic techniques that are an essential part of the big data paradigm. It’s hard to talk about those analytic challenges without referring to Chris Anderson’s 2008 Wired editorial, “The end of theory: The data deluge makes the scientific method obsolete.”
Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
This is a pretty good statement of the data science perspective and its faith in machine learning, that is, the use of algorithms capable of finding patterns in data unguided by a set of analytic assumptions about the relationships among data items. To paraphrase Vasant Dhar, we are used to asking the question, “Do these data fit this model?” The data scientist asks the question, “What model fits these data?”
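The difference between those two questions can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not anyone's actual workflow: the data are a made-up curved relationship, the three candidate models (a constant, a straight line, and a two-nearest-neighbor average) are arbitrary choices, and held-out mean squared error stands in for whatever criterion a real analysis would use.

```python
def fit_mean(xs, ys):
    """Constant model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Straight line fit by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def fit_knn(xs, ys, k=2):
    """Algorithmic model: average the k nearest training points."""
    pairs = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a curved relationship (y = x squared).
xs = [i / 2 for i in range(21)]
ys = [x * x for x in xs]
train_x, train_y = xs[0::2], ys[0::2]   # every other point for fitting
test_x, test_y = xs[1::2], ys[1::2]     # the rest held out for checking

# "Do these data fit this model?" Commit to a line and see how well it does.
line_error = mse(fit_line(train_x, train_y), test_x, test_y)

# "What model fits these data?" Let held-out error choose among candidates.
candidates = {"mean": fit_mean, "line": fit_line, "knn": fit_knn}
errors = {
    name: mse(fit(train_x, train_y), test_x, test_y)
    for name, fit in candidates.items()
}
best_name = min(errors, key=errors.get)

print("line only:", line_error)
print("search over candidates:", best_name, errors[best_name])
```

On this toy data the nearest-neighbor candidate has the lowest held-out error, because nothing told the straight line that the relationship was curved. The search found a better predictor without a hypothesis, but it also says nothing about why the pattern exists.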
The research design and analytic approaches that are at the core of what we do developed at a time when data were scarce and expensive, when the analytic tools at our disposal were weak and underpowered. The combination of big data and rapidly expanding computing technology has changed that calculus.
So have the goals of MR. More than ever our clients look to us to predict consumer behavior, something we have often struggled with. We need better models. The promise of data science is precisely that: more data and better tools lead to better models.
All of this is anathema to many of us in the social sciences. But there also is a longstanding argument within the statistical profession about the value of algorithmic analysis methods. For example, in 2001 the distinguished statistician Leo Breiman described two cultures within the statistical profession.
One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. . .If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
One can find similar arguments from statisticians going back to the 1960s.
There are dangers, of course, and arguments about correlation versus causality and endogeneity need to be taken seriously. (Check out Tyler Vigen’s spurious correlation website for some entertaining examples.) But any serious data scientist will be quick to note that doing this kind of analysis requires more than good math skills, massive computing power, and a library of machine learning algorithms. Domain knowledge and critical judgment are essential. Or, as Nate Silver reminds us, “Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise.”