Previous month:
August 2015
Next month:
December 2015

Posts from October 2015

Big Data: Part 3

This post is the third and last, at least for now, in my series about MR’s struggles with big data. Its theme is simple: big data is hard.

For starters, the quality of the data is not what we are accustomed to. More often than not the data were collected for some other purpose, and the accuracy of individual items, their overall completeness, their consistency over time, their documentation, and even their meaning all pose serious challenges to reuse. Readers familiar with the Total Survey Error (TSE) model will recognize that big data is vulnerable to all of the same deficiencies as surveys—gaps in coverage, missing data, poor measurement, etc. The key difference is that survey researchers, at least in theory, design and control the data-making process in a way that users of big data do not. For users of big data the first step is data munging, often a long and very difficult process with uncertain results.
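Anyone who has inherited a found dataset will recognize the kinds of checks munging involves. Here is a minimal sketch of the sort of audit that usually comes first; the field names, codes, and records are all invented for illustration:

```python
# Toy "munging" audit over records from a hypothetical transaction feed.
# All field names and valid codes here are invented for illustration.
records = [
    {"id": 1, "region": "NE", "spend": "42.50"},
    {"id": 2, "region": "ne", "spend": None},        # inconsistent coding, missing value
    {"id": 3, "region": "Northeast", "spend": "x"},  # same meaning, different label; bad entry
]

VALID_REGIONS = {"NE", "SE", "NW", "SW"}  # assumed codebook

issues = []
for r in records:
    # Check the spend field: present and numeric?
    if r["spend"] is None:
        issues.append((r["id"], "missing spend"))
    else:
        try:
            float(r["spend"])
        except ValueError:
            issues.append((r["id"], "non-numeric spend"))
    # Check the region field against the assumed codebook.
    if r["region"] not in VALID_REGIONS:
        issues.append((r["id"], "unrecognized region code"))

print(issues)
```

In practice each flagged issue then triggers a judgment call: recode, impute, or drop.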

Then there is the technology. We all have heard about the transition from a world where data are scarce and expensive to one where they are plentiful and cheap, but the reality is that taking big data seriously requires a significant investment in people and technology. There is more to big data than hiring a data scientist. The AAPOR Report on Big Data has a nice summary of the skills and technology needed to do big data. While the report does not put a price tag on the investment, it likely is well beyond what all but the very largest market research companies can afford.

Much of the value of big data lies in the potential to merge multiple data sets together (e.g. customer transaction data with social media data or Internet of Things data), but that, too, can be an expensive and difficult process. The heart of this merging process is bits of computer code called ETLs (extract, transform, load) that specify what data are extracted from the source databases, how they are edited and transformed for consistency, and how they are merged into the output database, typically some type of data warehouse. Take a moment and consider the difficulty of specifying all of those rules.
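To make the idea concrete, here is an ETL in miniature; the two sources, field names, and transformation rules are all invented for illustration:

```python
# Toy ETL: extract rows from two hypothetical sources with different
# schemas, transform them to a common one, load into a "warehouse" (a dict).
source_a = [{"cust": "C1", "amount_usd": 10.0}]
source_b = [{"customer_id": "C1", "amount_cents": 550}]

def transform_a(row):
    # Source A already records dollars; just rename the key.
    return {"customer": row["cust"], "amount": row["amount_usd"]}

def transform_b(row):
    # Source B records cents under a different key; convert and rename.
    return {"customer": row["customer_id"], "amount": row["amount_cents"] / 100}

warehouse = {}
for row in [transform_a(r) for r in source_a] + [transform_b(r) for r in source_b]:
    warehouse.setdefault(row["customer"], 0.0)
    warehouse[row["customer"]] += row["amount"]

print(warehouse)  # {'C1': 15.5}
```

Even this two-field toy needed a unit-conversion rule and a key-renaming rule; a real warehouse feed needs thousands of them.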

If you have ever written editing specs for a survey dataset then you have some inkling of the difficulty. Now consider that in a data merge from multiple sources you can have the same variable with different coding; the same variable name used to measure different things; differing rules for determining when an item is legitimately missing and when it is not; detailed rules for matching a record from one data source with a record from another; different entities (customers, products, stores, GPS coordinates, tweets) that need to be resolved; and so on. This is difficult, tedious, unglamorous, and error-prone work. Get it wrong, and you have a mess.
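One of those rules, resolving the same entity across sources, can be sketched with a deliberately simple normalization rule; the names and the rule are invented, and real entity resolution is far messier:

```python
# Toy entity resolution: the same company appears under different
# spellings in two hypothetical sources; a normalization rule decides
# which records refer to the same entity.
crm = ["Acme Corp.", "Beta LLC"]
social = ["acme corp", "Gamma Inc"]

def normalize(name):
    # Lowercase, strip trailing periods and whitespace.
    return name.lower().rstrip(".").strip()

matches = [(a, b) for a in crm for b in social if normalize(a) == normalize(b)]
print(matches)  # [('Acme Corp.', 'acme corp')]
```

Every false match pollutes the merged record, and every missed match splits one entity into two, which is why this work is so error-prone.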

To sum up this and the previous two posts, I worry that big data is a much bigger deal than most of us realize. We may fancy ourselves as pioneering in this space but it’s not clear to me that we understand just how hard this is going to be. For all of the talk about paradigm shifts and disruption, this is the real deal if for no other reason than it is the right methodology (if the word applies here) to support the other big disruption, a shift away from focusing on attitudes and opinions to focusing on behavior.

Back in 2013 I put up a post about a big data event put on by ESOMAR in which John Deighton from Harvard gave a talk that was the most compelling description I had heard of the threat of big data to traditional MR. The reaction in the room struck me as more whistling past the graveyard than taking the challenge for what it is. Two years later things don’t feel that much different. We are still in a state of denial. We had better get cracking.

Big Data: Part 2

This second post in my series about MR’s ongoing struggle with big data is focused on our stubborn resistance to the analytic techniques that are an essential part of the big data paradigm. It’s hard to talk about those analytic challenges without referring to Chris Anderson’s 2008 Wired editorial, “The end of theory: The data deluge makes the scientific method obsolete.”

Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

This is a pretty good statement of the data science perspective and its faith in machine learning: the use of algorithms capable of finding patterns in data unguided by a set of analytic assumptions about the relationship among data items. To paraphrase Vasant Dhar, we are used to asking the question, “Do these data fit this model?” The data scientist asks the question, “What model fits these data?”
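The contrast can be caricatured in a few lines. This toy sketch scores two invented candidate models against the data and lets the fit decide, rather than committing to one functional form up front:

```python
# "What model fits these data?" in miniature: score candidate model
# forms by squared error and let the data pick. Data and candidates
# are invented for illustration.
xs = [0, 1, 2, 3, 4]
ys = [0, 1, 4, 9, 16]  # secretly quadratic

candidates = {
    "linear":    lambda x: 4 * x - 2,  # a plausible straight-line guess
    "quadratic": lambda x: x * x,
}

def sse(f):
    # Sum of squared errors of model f over the data.
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys))

best = min(candidates, key=lambda name: sse(candidates[name]))
print(best)  # quadratic
```

Real machine learning searches far larger model spaces with far more data, but the inversion of the question is the same.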

The research design and analytic approaches that are at the core of what we do developed at a time when data were scarce and expensive, when the analytic tools at our disposal were weak and underpowered. The combination of big data and rapidly expanding computing technology has changed that calculus.

So have the goals of MR. More than ever our clients look to us to predict consumer behavior, something we have often struggled with. We need better models. The promise of data science is precisely that: more data and better tools lead to better models.

All of this is anathema to many of us in the social sciences. But there also is a longstanding argument within the statistical profession about the value of algorithmic analysis methods. For example, in 2001 the distinguished statistician Leo Breiman described two cultures within the statistical profession.

One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. . . . If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

One can find similar arguments from statisticians going back to the 1960s.

There are dangers, of course, and arguments about correlation versus causality and endogeneity need to be taken seriously. (Check out Tyler Vigen’s spurious correlation website for some entertaining examples.) But any serious data scientist will be quick to note that doing this kind of analysis requires more than good math skills, massive computing power, and a library of machine learning algorithms. Domain knowledge and critical judgment are essential. Or, as Nate Silver reminds us, “Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise.”
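Vigen’s examples are easy to reproduce in principle: search enough unrelated series and strong correlations appear by chance alone. A toy demonstration with purely random data:

```python
import random

# Generate 200 short, completely unrelated random series, then search
# all pairs for the strongest correlation. With this many comparisons,
# a strong "relationship" turns up by chance alone.
random.seed(0)
series = [[random.gauss(0, 1) for _ in range(10)] for _ in range(200)]

def corr(a, b):
    # Pearson correlation, computed from scratch.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

best = max(
    abs(corr(series[i], series[j]))
    for i in range(len(series))
    for j in range(i + 1, len(series))
)
print(best)  # strong, despite no real relationship anywhere
```

The point is not that correlation is useless, but that unguided pattern search needs correction for multiple comparisons, which is exactly where domain knowledge and judgment come in.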

Big Data: Part 1

I have seen more than my share of MR conference presentations on big data over the last three or four years and it’s hard not to conclude that we still don’t have a clue. Sure, there have been some really good presentations on the use of non-survey data—what we might call “other data”—but most of it falls well short of both the reality and the promise of big data.

This is the first of three planned posts on big data and it asks the simple question: what is big data? The most often heard definition is the 3Vs (now morphed to 7 at last count). But while a neat summary of the challenges posed by big data, that’s hardly a definition. Some folks at the Berkeley School of Information asked 40 different thought leaders for their definitions and got 40 different responses. But they did produce this cool word cloud.


In the same vein, two British academics reviewed the definitions of big data most often used by various players in the big data ecosystem of IT consultants, hardware manufacturers, software developers, and service providers. They noted that most definitions touch on three primary attributes: size, complexity, and tools. They also suggest a definition that, while not especially elegant, seems to hit the key points:

Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce, and machine learning.

Put another way, we might simply say that:

 Big data is a term that describes datasets so large and complex that they cannot be processed or analyzed with conventional software systems.

We might further elaborate on that by noting three principal sources:

  1. Transaction data
  2. Social media data
  3. Data from the Internet of Things

This is the world of terabytes, petabytes, exabytes, and zettabytes. MR is still very much stuck in a world of gigabytes. As I write this I am surrounded by 5.5TB of storage with another TB in the cloud, but I don’t confuse any of this with big data. I have been known to fill the 64GB SD card in my Nikon over the course of a two-week vacation, but that’s not big data either.  

Big data is Walmart capturing over a million transactions per hour and uploading them to a database in excess of 3PB, or the Weather Channel gathering 20TB of data from sensors all around the world each and every day. The amount of data being generated every minute in transactions, on social media, and by interconnected smart devices boggles the mind. We simply are not operating in that league.
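A back-of-envelope calculation shows how quickly that compounds. Assuming, purely for illustration, a 1KB record per transaction:

```python
# Back-of-envelope scale check for "a million transactions per hour".
# The 1KB-per-record size is an assumption for illustration only.
tx_per_hour = 1_000_000
bytes_per_tx = 1_000  # assumed record size

per_day_gb = tx_per_hour * 24 * bytes_per_tx / 1e9
per_year_tb = per_day_gb * 365 / 1e3
print(per_day_gb, per_year_tb)  # ~24 GB a day, ~8.8 TB a year
```

And that is one retailer’s transaction stream alone, before any social media or sensor data is added.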

And then there is the issue of tools. Most of the software we routinely use grinds to a halt with big data. It’s just not built to process files at the petabyte scale. There is more to processing big data than learning R. There is a whole suite of tools, virtually all of which rely on massively parallel processing, that are well beyond what most of us are even thinking about.
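The MapReduce pattern underlying many of those tools can be sketched in miniature. This is a single-machine toy on invented data; real frameworks distribute the map, shuffle, and reduce steps across clusters of machines:

```python
from collections import defaultdict

# MapReduce word count in miniature, on two tiny invented "documents".
docs = ["big data is big", "data is everywhere"]

# Map: emit a (word, 1) pair for every word in every document.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: aggregate each group independently (hence parallelizable).
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The reason the pattern matters is that each reduce runs on an independent group, so the work can be spread over thousands of machines with no shared state.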

So let’s get real. MR is doing some interesting and worthwhile things with what has been described as “found data,” but let’s not dress it up as big data. It’s not. And even if it were, we’ve still not grasped the importance of the analytic shift required to really exploit big data’s potential. More on that in my next post.

ESOMAR Congress 2015: All behavior all the time

OK, so that’s a bit of an exaggeration and at a conference with as many concurrent sessions as we have here in Dublin there is a strong element of self-selection. Nonetheless, I’m coming away from these three days with an even stronger sense than I had coming in that the future of research is about behavior, passive data collection and the somewhat misnamed big data (more on that some other time). Over the years the industry has worked terms like “paradigm shift” and “disruption” to death, so much so that the terms have lost all meaning. But this strikes me as the real deal.

All of which is not to say that there is not an awful lot of work to do and concerns to be addressed. The papers presented here are the vanguard, mostly case studies to show that a focus on behavior can work. But there is a lot of research and experimentation to be done in building out an infrastructure of method and practice that lays out how to do this work well. More often than not the theoretical underpinnings are not as solid (or at least as well expressed) as they need to be. Will we do the serious work to understand the science, or will it be like so many imagined disruptions before it, just pop science? As with surveys, cognitive psychology would seem to be key, but the foundation of hundreds if not thousands of experiments that gradually built the survey research canon of best practices simply is not there. More than ever before the mood of MR is just do it. It has always seemed to me that the history of innovation in the research industry has been for MR to be the first mover and for the academics to come in and clean it up. We are finally seeing that with online and with mobile. Here’s hoping the academics catch the behavioral data train more quickly.

As exciting as some of this might be, there are some unsettling aspects to it. Privacy is obviously at the top of the list and there is a creep factor in some of what might be possible. I’m not sure I want my purchase decisions manipulated by “hidden forces” set up by some smartass marketer. I’ve got my own hidden forces, thank you. For years now we have struggled with the word “respondent.” Is “data subject” next?

The goal of market, opinion, and social research has always been to understand people better, learn what makes humans tick. Exposing people to 30-minute surveys seems pretty tame if the next phase is to treat them like rats in a Skinner box. Let’s keep exploring but stay true to our basic values.