Big Data: Part 3

This post is the third and last, at least for now, on my series about MR’s struggles with big data. Its theme is simple: big data is hard.

For starters, the quality of the data is not what we are accustomed to. More often than not the data were collected for some other purpose and the attention paid to the accuracy of individual items, their overall completeness, their consistency over time, their full documentation, and even their meaning pose serious challenges to their reuse. Readers familiar with the Total Survey Error (TSE) model will recognize that big data is vulnerable to all of the same deficiencies as surveys—gaps in coverage, missing data, poor measurement, etc. The key difference is that survey researchers, at least in theory, design and control the data making process in a way that users of big data do not. For users of big data the first step is data munging, often a long and very difficult process with uncertain results.

Then there is the technology. We all have heard about the transition from a world where data are scarce and expensive to one where they are plentiful and cheap, but the reality is that taking big data seriously requires a significant investment in people and technology. There is more to big data than hiring a data scientist. The AAPOR Report on Big Data has a nice summary of the skills and technology needed to do big data. While the report does not put a price tag on the investment, it likely is well beyond what all but the very large market research companies can afford.

Much of the value of big data lies in the potential to merge multiple data sets together (e.g. customer transaction data with social media data or Internet of Things data), but that, too, can be an expensive and difficult process. The heart of this merging process are bits of computer code called ETLs that specify what data are extracted from the source databases, how they are edited and transformed for consistency, and then merged to the output database, typically some type of data warehouse. Take a moment and consider the difficulty of specifying all of those rules. Bigdatamiracle

If you have ever written editing specs for a survey dataset then you have some inkling of the difficulty. Now consider that in a data merge from multiple sources you can have the same variable with different coding; the same variable name used to measure different things; differing rules for determining when a item is legitimately missing and when it is not; detailed rules for matching a record from one data source with a record from another; different entities (customers, products, stores, GPS coordinates, tweets) that need to resolved; and so on. This is difficult, tedious, unglamorous, and error-prone work. Get it wrong, and you have a mess.

To sum up this and the previous two posts, I worry that big data is a much bigger deal than most of us realize. We may fancy ourselves as pioneering in this space but it’s not clear to me that we understand just how hard this is going to be. For all of the talk about paradigm shifts and disruption, this is the real deal if for no other reason than it is the right methodology (if the word applies here) to support the other big disruption, a shift away from focusing on attitudes and opinions to focusing on behavior.

Back in 2013 I put up a post about a big data event put on by ESOMAR in which John Deighton from Harvard gave a talk that was the most compelling description I had heard of the threat of big data to traditional MR. The reaction in the room struck me as more whistling by the graveyard than taking the challenge for what it is. Two years later things don’t feel that much different. We are still in a state of denial. We had better get cracking. 

Big Data: Part 2

This second post in my series about MR’s ongoing struggle with big data is focused on our stubborn resistance to the analytic techniques that are an essential part of the big data paradigm. It’s hard to talk about those analytic challenges without referring to Chris Anderson’s 2008 Wired editorial, “The end of theory: The data deluge makes the scientific method obsolete.”

Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. . . Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

This is a pretty good statement of the data science perspective and its faith in machine learning--the use of algorithms capable of finding patterns in data unguided by a set of analytic assumptions about the relationship among data items. To paraphrase Vasant Dhar, we are used to asking the question, “Do these data fit this model?” The data scientist asks the question, “What model fits these data?”

The research design and analytic approaches that are at the core of what we do developed at a time when data were scarce and expensive, when the analytic tools at our disposal were weak and under powered. The combination of big data and rapidly expanding computing technology has changed that calculus.

So have the goals of MR. More than ever our clients look to us to predict consumer behavior, something we have often struggled with.  Gartner Predict We need better models. The promise of data science is precisely that: more data and better tools leads to better models.

All of this is anathema to many of us in the social sciences. But there also is a longstanding argument within the statistical profession about the value of algorithmic analysis methods. For example, in 2001 the distinguished statistician Leo Breiman described two cultures within the statistical profession.

One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.  . .If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

One can find similar arguments from statisticians going back to the 1960s.

There are dangers, of course, and arguments about correlation versus causality and endogeneity need to be taken seriously. (Check out Tyler Vigen’s spurious correlation website for some entertaining examples.) But any serious data scientist will be quick to note that doing this kind of analysis requires more than good math skills, massive computing power, and a library of machine learning algorithms. Domain knowledge and critical judgment are essential. Or, as Nate Silver reminds us, “Data-driven predictions can succeed—and they can fail. It is when we deny our role in the process that the odds of failure rise”

Big Data: Part 1

I have seen more than my share of MR conference presentations on big data over the last three or four years and it’s hard not to conclude that we still don’t have a clue. Sure, there have been some really good presentations on the use of non-survey data—what we might call “other data”—but most of it falls well short of both the reality and the promise of big data.

This is the first of three planned posts on big data and it asks the simple question: what is big data? The most often heard definition is the 3Vs (now morphed to 7 at last count). But while a neat summary of the challenges posed by big data, that’s hardly a definition. Some folks in the Berkeley School of Information asked 40 different thought leaders for their definitions and got 40 different responses. But they did produce this cool word cloud.


In the same vein, two British academics reviewed the definitions of big data most often used by various players in the big data ecosystem of IT consultants, hardware manufacturers, software developers, and service providers. They noted that most definitions touch on three primary attributes: size, complexity, and tools. They also suggest a definition that, while not especially elegant, seems to hit the key points:

Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce, and machine learning.

Put another way, we might simply say that:

 Big data is a term that describes datasets so large and complex that they cannot be processed or analyzed with conventional software systems.

We might further elaborate on that by noting three principal sources:

  1. Transaction data
  2. Social media data
  3. Data from the Internet of Things

This is the world of terabytes, petabytes, exabytes, and zettabytes. MR is still very much stuck in a world of gigabytes. As I write this I am surrounded by 5.5TB of storage with another TB in the cloud, but I don’t confuse any of this with big data. I have been known to fill the 64GB SD card in my Nikon over the course of a two-week vacation, but that’s not big data either.  

Big data is Walmart capturing over a million transactiDataNeverSleeps_2.0_v2ons per hour and uploading them to a database in excess of 3PB or the Weather Channel gathering 20TB of data from sensors all around the world each and every day. The amount of data being generated every minute in transactions, on social media, and by interconnected smart devices boggles the mind. We simply are not operating in that league.

And then there is the issue of tools. Most of the software we routinely use grinds to a halt with big data. It’s just not built to process files at the petabyte scale. There is more to processing big data than learning R. There is a whole suite of tools, virtually all of which relay on massive parallel processing, that are well beyond what most of us are even thinking about.

So let’s get real. MR is doing some interesting and worthwhile things with what has been described as “found data,” but let’s not dress it up as big data. It’s not. And even if it were, we’ve still not grasped the importance of the analytic shift required to really exploit big data’s potential. More on that in my next post.

"Just trust us."

While skimming an article in the current issue of Businessweek on the Weather Channel’s emerging use of big data for things other than predicting the weather, I saw the graphic at the right and immediately thought of the causality versus correlation debate.  Feat_weatherchart42b_630They have gotten hold of tons of data from Walmart and P&G, which they are putting together with local weather data. Then they make predictions like these.

I was intrigued enough to look at the text more closely and found this from an interview with Vikram Somaya, who seems to sit on top of their big data operations:

“We can tell you that on a January morning in Miami, if a set of weather conditions occurs, people will buy a certain brand of raspberry,” he says. Not just any fruit. Raspberries. When advertisers ask for an explanation—why raspberries?—Somaya can’t always provide a clear answer. “A lot of times we have to tell them to just trust us.

This from a company whose core business is predicting the weather. The world is changing faster than I have imagined.

"Is it legal?" is not enough

I just posted a link to this Computerworld article on my Twitter feed, but I think it's so important that I have decided to mention it here as well. The article describes the dangers brands are beginning to face with over aggressive big data and data mining practices. The key point is that it's not just about what is legal, but that consumers are can also be sensitive to what they view as privacy violations and over aggressive marketing.

These are extra legal areas where codes of conduct developed by industry and trade associations have traditionally protected both research agencies and their clients from public backlash. It has become fashionable in some quarters to argue that these quaint notions are holding back market research and providing an opening for new entrants to realign the competitive balance in the industry. This is a good reminder that respect for consumers never goes out of fashion.