This second post in my series about MR’s ongoing struggle with big data is focused on our stubborn resistance to the analytic techniques that are an essential part of the big data paradigm. It’s hard to talk about those analytic challenges without referring to Chris Anderson’s 2008 Wired editorial, “The end of theory: The data deluge makes the scientific method obsolete.”
Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
This is a pretty good statement of the data science perspective and its faith in machine learning--the use of algorithms capable of finding patterns in data unguided by a set of analytic assumptions about the relationship among data items. To paraphrase Vasant Dhar, we are used to asking the question, “Do these data fit this model?” The data scientist asks the question, “What model fits these data?”
The research design and analytic approaches that are at the core of what we do developed at a time when data were scarce and expensive, when the analytic tools at our disposal were weak and under powered. The combination of big data and rapidly expanding computing technology has changed that calculus.
So have the goals of MR. More than ever our clients look to us to predict consumer behavior, something we have often struggled with. We need better models. The promise of data science is precisely that: more data and better tools leads to better models.
All of this is anathema to many of us in the social sciences. But there also is a longstanding argument within the statistical profession about the value of algorithmic analysis methods. For example, in 2001 the distinguished statistician Leo Breiman described two cultures within the statistical profession.
One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. . .If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
One can find similar arguments from statisticians going back to the 1960s.
There are dangers, of course, and arguments about correlation versus causality and endogeneity need to be taken seriously. (Check out Tyler Vigen’s spurious correlation website for some entertaining examples.) But any serious data scientist will be quick to note that doing this kind of analysis requires more than good math skills, massive computing power, and a library of machine learning algorithms. Domain knowledge and critical judgment are essential. Or, as Nate Silver reminds us, “Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise”