Failure to replicate

The not-so-big news last week was the NYT article with the intriguing title, “Many Psychology Findings Not as Strong as Claimed, Study Says,” a rehash of this article in Science. In case you missed it, the bottom line is that findings from roughly two-thirds of studies in peer-reviewed journals could not be replicated.

This should not surprise us. Way back in 1987 a group of British researchers compared results published on various cancer-related topics and “found 56 topics in which the results of a case control study were in conflict with the results from other studies of the same relationship.” And I expect all of us can think of a few things we do that used to be healthy but now are not. And vice versa.

What might all of this mean for MR? 

On more than one occasion Ray Poynter has warned us about publication bias and the fact that results from just one study do not always point to truth. The former is cited as one potential culprit in the Science article, and the latter ought to be obvious to anyone calling him or herself a researcher. But there is something even more troubling at the heart of the problem, something that MR has in spades: questionable sampling practices.

Psychologists are notorious for their use of relatively small convenience samples and the belief that randomizing sample members across treatments cures all. I did not look at all 100 studies in the Science article, but the handful I looked at all collected new samples ranging in size from 60 to 220. Students and people in the street were popular choices. If this is indicative of all 100, I am shocked that only 60 didn’t replicate. Then again, I drew a very small convenience sample.
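To put those sample sizes in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes a simple random sample and a proportion near 50%, neither of which describes the studies above (they were experiments on convenience samples), so treat it as a best-case illustration of how much noise samples of 60 or 220 carry. The function name `margin_of_error` and the comparison size of 1,000 are my own, not anything from the Science article.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion.

    Assumes a simple random sample of size n, which a convenience
    sample is not, so this is an optimistic lower bound on the noise.
    """
    return z * math.sqrt(p * (1 - p) / n)

# 60 and 220 are the sample sizes mentioned above; 1000 is an
# illustrative comparison point, not a figure from the article.
for n in (60, 220, 1000):
    print(f"n = {n:>4}: +/- {margin_of_error(n) * 100:.1f} points")
```

Even under these generous assumptions, a sample of 60 leaves a margin of error several times wider than a sample of 1,000 does, and that is before any bias from how the sample was recruited.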

For the most part, MR avoids small samples, except in qualitative research where we generally are smart enough to characterize the results as “directional” at best but seldom representative. Good for us. But we are knee-deep in convenience samples these days. A little less certainty in what we claim they represent wouldn’t hurt.

Representivity ain't what it used to be

I am on my way back from ESOMAR APAC in Singapore where I gave a short presentation with the title, “What you need to know about online panels.” Part of the presentation was about the evolution of a set of widely accepted QA practices that, while standard in the US and much of Europe, are sometimes missing in countries where online is still a maturing methodology. The other part was about the challenge of creating representative online samples, especially in countries with relatively low Internet penetration. How can you learn anything meaningful about the broader market in a country like India with only about 20% Internet penetration using an online panel that has only a tiny fraction of that 20%?

At the same time I have tried to keep an eye on what has been happening at the AAPOR conference in Florida this past week and am delighted to see the amount of attention that online nonprobability sampling is getting from academics and what the Europeans like to call “social policy” researchers. Their business is all about getting precise estimates out of representative samples, something that thus far has mostly eluded online researchers despite online's increasing dominance in the research industry as a whole.

The conversation in Singapore was much different; solving this problem seems less important there. The cynical view is that there is little impetus to do better because clients aren’t willing to spend the money it takes to get more accurate data. The more generous view is that market researchers paint with a much broader brush. Trends over time and large differences in estimates are more important than really precise numbers; research outcomes are an important part of the decision-making process, but not the only part.

That said, it still seems to me that we have a responsibility to understand just how soft our numbers might be, where the biases are, and what all of that implies for how results are used. The obvious danger is in ascribing a precision to our results that is simply disconnected from reality. There already is way too much of that and not just with online. Social media, mobile, big data, text analytics, neuroscience: all of it is being oversold. And the thing is, when you talk to people one on one, they know it.

I subscribe to the idea that our future is one in which data will be plentiful and cheap. It also will almost always be imperfect, every bit as imperfect as online today. The most important skill for market researchers to develop is how to learn from imperfect data, a task that starts by recognizing those imperfections and then figuring out how to deal with them rather than pretending they don’t exist.

Online Sampling Again

Last week two posts on the GreenBook Blog, one by Scott Weinberg and a response by Ron Sellers, bemoaned the quality of online research and especially its sampling. And who can blame them? All of us, including me, have been known to go a little Howard Beale on this issue from time to time. We all know the familiar villains: evil suppliers, dumb buyers, margin-obsessed managers, tight-fisted clients, and so on. I was reminded of a quote from an ancient text (circa 1958) that a friend sent me a few months back:

Samples are like medicine. They can be harmful when they are taken carelessly or without adequate knowledge of their effects. We may use their results with confidence if the applications are made with due restraint. It is foolish to avoid or discard them because someone else has misused them and suffered the predictable consequences of his folly. Every good sample should have a proper label with instructions about its use.

Every trained researcher has an idealized notion of what constitutes a good quality sample. Every experienced researcher understands that the real world imposes constraints, that anyone who says you can have it all (fast, cheap, and high quality) is selling snake oil. So we make tradeoffs and that’s ok, as long as we have “adequate knowledge of their effects.”

It helps to be an informed buyer, and ESOMAR has for about ten years now offered some version of their 28 Questions to Help Buyers of Online Samples. It’s an excellent resource, often overlooked. More recently, ESOMAR has teamed up with the Global Research Business Network to develop the soon-to-be-released ESOMAR/GRBN Guideline on Online Sample Quality (here I disclose that I was part of the project team). In the meantime, there is an early draft here. Its most prominent feature, IMO, is its insistence on transparency so that sample buyers are at least informed about exactly what they are getting and in as much detail as they can stand.

Granted, that’s not the same as “a proper label with instructions about its use.” That still is a job left to the researcher. If you can’t or won’t do that, well, shame on you. Getting depressed about it is not an option. Or, as another of the ancients has said, “You're either part of the problem or part of the solution.”

Accuracy of US election polls

Nate Silver does a nice job this morning of summarizing the accuracy of and bias in the 2012 results of the 23 most prolific polling firms. I’ve copied his table below. Before we look at it we need to remember that there is more involved in these numbers than different sampling methods. The target population for most of these polls is likely voters and polling firms all have a secret sauce for filtering those folks into their surveys. Some of the error probably can be sourced to that step.


But to get back to the table, the first thing that struck me was the consistent Republican bias.  The second was the especially poor performance by two of the most respected electoral polling brands, Mason-Dixon and Gallup.  But my guess is that readers of this blog are going to look first at how the polls did by methodology.  In that regard there is some good news for Internet methodologies, although we probably should not make too much of it.

  As far back as the US elections of 2000 Harris Interactive showed that with the right adjustments online panels could perform as well as RDD.  When the AAPOR Task Force on Online Panels (which I chaired) reviewed the broader literature on online panels we concluded this about their performance in electoral polling:

A number of publications have compared the accuracy of final pre-election polls forecasting election outcomes (Abate, 1998; Snell et al, 1999; Harris Interactive, 2004, 2008; Stirton and Robertson, 2005; Taylor, Bremer, Overmeyer, Sigeel, and Terhanian, 2001; Twyman, 2008; Vavreck and Rivers, 2008). In general, these publications document excellent accuracy of online nonprobability sample polls (with some notable exceptions), some instances of better accuracy in probability sample polls, and some instances of lower accuracy than probability sample polls. (POQ 74:4, p. 743)

So there is an old news aspect to Nate’s analysis and one would hope that by 2012 the debate has moved on from the research parlor trick of predicting election outcomes to addressing the broader and more complicated problem of accurately measuring a larger set of attributes than the relatively straightforward question of whether people are going to vote for Candidate A or Candidate B. In Nate’s table there are nine firms with an average error of 2 points or less and four of the nine use an Internet methodology of some sort. I say “of some sort” because as best I can determine there are three methodologies at play. Two of the four (Google and Angus Reid) draw their samples to match population demographics (primarily age and gender). IPSOS, on the other hand, tries to calibrate its samples using a combination of demographic, behavioral and attitudinal measures drawn from a variety of what it believes to be “high quality sources.” (YouGov, which is further down the list, does something similar.) RAND uses a probability-based method to recruit its panel. So there are a variety of methodologies at play in these numbers.
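For readers who have not seen the mechanics, here is a minimal Python sketch of the simplest form of demographic adjustment, post-stratification on a single variable. It is not how any of the firms above actually do it; their methods are proprietary and far more elaborate. The function name `poststratified_mean`, the age cells, the population shares, and the candidate-support figures are all made up for illustration.

```python
from collections import Counter

def poststratified_mean(sample, population_shares):
    """Weighted mean of a 0/1 outcome after post-stratification.

    sample: list of (cell, value) pairs, one per respondent.
    population_shares: dict mapping each cell to its share of the
    target population (shares should sum to 1).
    """
    n = len(sample)
    counts = Counter(cell for cell, _ in sample)
    # Weight = population share / sample share, so over-represented
    # cells are weighted down and under-represented cells weighted up.
    weights = {cell: population_shares[cell] / (counts[cell] / n)
               for cell in counts}
    total_weight = sum(weights[cell] for cell, _ in sample)
    return sum(weights[cell] * value for cell, value in sample) / total_weight

# Hypothetical panel: 70 respondents under 45 (42 back Candidate A)
# and 30 respondents 45 or older (12 back Candidate A).
sample = ([("under_45", 1)] * 42 + [("under_45", 0)] * 28 +
          [("45_plus", 1)] * 12 + [("45_plus", 0)] * 18)
population_shares = {"under_45": 0.5, "45_plus": 0.5}  # assumed, not real data

unweighted = sum(value for _, value in sample) / len(sample)
weighted = poststratified_mean(sample, population_shares)
print(f"unweighted: {unweighted:.2f}, weighted: {weighted:.2f}")
```

In this made-up case the raw panel overstates support for Candidate A because it skews young, and the weighting pulls the estimate back toward the population. The adjustment only helps, of course, if the variables used for weighting are the ones driving the bias.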

Back in 2007, Humphrey Taylor argued that the key to generating accurate estimates from online panels is understanding their biases and how to correct them.  I tried to echo that point in a post about #twittersurvey a few weeks back.  Ray Poynter commented on that post.

My feeling is that the breakthrough we need is more insight into when the reactions to a message or question are broadly homogeneous, and when it is heterogeneous . . . When most people think the same thing, the sample structure tends not to matter very much. . .However, when views, attitudes, beliefs differ we need to balance the sample, which means knowing something about the population. This is where Twitter and even online access panels create dangers.

 I think Ray has said it pretty well.

Representativiteit is dood, lang leve representativiteit!

I'm in Amsterdam where for the last two days I've attended an ESOMAR conference that began as a panels conference in 2005, morphed into an online conference in 2009 and became 3D (a reference to a broader set of themes for digital data collection) in 2011. This conference has a longstanding reputation for exploring the leading edge of research methods and this one has been no different. There have been some really interesting papers and I will try to comment on a few of them in the days ahead.

But as an overriding theme it seemed to me that mobile has elbowed its way to the front of the pack and, in the process, has become as much art as science. People are doing some very clever things with mobile, so clever that sometimes it takes on the character of technology for technology's sake. Occasionally it even becomes a solution in search of a problem. This is not necessarily a bad thing; experimentation is what moves us forward. But at some point we need to connect all of this back to the principles of sampling and the basic requirement of all research that it deliver some insight about a target population, however defined. Much of the so-called NewMR has come unmoored from that basic principle and the businesses that are our clients are arguing about whether they should be data driven at all or simply rely on "gut."

At the same time we've just seen this fascinating story unfold in the US elections that has been as much about data versus gut as Obama versus Romney. The polls have told a consistent story for months but there has been a steady chorus of "experts" who have dismissed them as biased or simply missing the real story of the election. An especially focused if downright silly framing of the argument by Washington Post columnist (and former George W. Bush advisor) Michael Gerson dismissed the application of science to predict electoral behavior of the US population as "trivial."

So today, regardless of their political preferences, researchers should take both pleasure and perhaps two lessons from the election results. The first is that we are at our best when we put the science we know to work for our clients and do them a major disservice when we let them believe that representivity is not important or magically achieved. Shiny new methods attract business but solid research is what retains it. The second is that while the election results were forecast by the application of scientific sampling, the election itself was won with big data. The vaunted Obama ground game was as much about identifying who to get to the polls as it was about actually getting them there.