
Posts from February 2006

Separating the Good from the Bad

It seems like every week I hear about a new Web panel.  One recent favorite has a name that makes you wonder.

Our Sampling folks work hard to stay on top of this and to sort out the contenders from the pretenders.  CASRO has been threatening to develop some standards to help us sort the good from the bad, but ESOMAR is working harder to get out in front of the issue.  To spare us from reading through their entire code they've come up with 25 questions to ask a Web panel vendor.  Here they are verbatim:

  1. Is it an actively managed panel (nurtured community) or just a database?
  2. "Truthfully" how large is it?
  3. What is the percentage of "active" members and how are they defined?
  4. Where are the respondents sourced from and how are they recruited?
  5. Have members clearly opted-in? If so, was this double opt-in?
  6. What exactly have they been asked to opt-in to?
  7. What do panel members get in return for participating?
  8. Is the panel used solely for market research?
  9. Is there a Privacy Policy in place? If so, what does it state?
  10. What research industry standards are complied with?
  11. Is the panel compliant with all regional, national and local laws with respect to privacy, data protection and children e.g. EU Safe Harbour, and COPPA in the US?
  12. What basic socio-demographic profile information, usership, interests data, etc. is kept on members?
  13. How often is it updated?
  14. In what other ways can users be profiled (e.g. source of data)?
  15. What is the (minimum and typical) turn-around time from initial request to first deployment of the emails to activate a study?
  16. What are likely response rates and how is response rate calculated?
  17. Are or can panel members who have recently participated in a survey on the same subject be excluded from a new sample?
  18. Is a response rate (over and above screening) guaranteed?
  19. How often are individual members contacted for market research or anything else in a given time period?
  20. How is the sample selection process for a particular survey undertaken?
  21. Can samples be deployed as batches/replicates, by time zones, geography, etc? If so, how is this controlled?
  22. Is the sample randomized before deployment
  23. Can the time of sample deployment be controlled and, if so, how?
  24. Can panel members be directed to specific sites for the survey questionnaire to be undertaken?
  25. What guarantees are there to guard against bad data i.e. respondent cheating or not concentrating/caring in their responses (e.g. click happy)?

Still More on Shirking

It seems that everyone is getting concerned about Web panel members satisficing.  Someone recently sent me a presentation that Burke has been giving that has gotten at least one of our clients concerned.  Here is my response:

I think there is an element of alarmism that may not be justified.

I'll begin by pointing out that Jon Krosnick first started talking about satisficing way back in 1991, before anyone had even thought about Web surveys or online access panels. His work suggests that there is some level of satisficing in every survey mode, and the right question to ask is whether it is any worse with Web surveys using access panels than with, say, CATI using RDD (what many think of as the gold standard).

Jon has actually done some work on this question and I saw him present some of it at this year's AAPOR conference. His work compared seven panels and an RDD study. He designed a set of six experiments that varied response order and made minor wording changes intended to measure satisficing. For example, he asked this question:

Which of the following do you think is the most important problem facing the United States today?

  1. The government budget deficit
  2. Drugs
  3. Unemployment
  4. Crime and violence

Half of each panel got this version and the other half got a version that reversed Deficit and Crime. According to satisficing theory, position should affect the likelihood of an answer being chosen; in this case, Deficit should be chosen more often when listed first than when it is listed last.  So his outcome measure was the difference between the percent choosing Deficit when it is at the top vs. when it is at the bottom. The RDD study produced a 1% differential and the panels produced a range from 1.1% (SSI) to 12.9% (Greenfield). Across all six experiments he found an average difference of .8% for RDD, three panels at 1% or less (SSI, Harris, and GoZing), Survey Direct at 1.9%, SPSS at 3.2% and Greenfield at 5.5%. His conclusion: "Overall, Web survey responses have about the same robustness to changes in question order as the telephone survey. . . " although "Greenfield panelists seem more sensitive than any of the others."
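The differential itself is simple arithmetic.  Here is a rough sketch of the computation with made-up counts (not data from the actual study):

```python
# Sketch of the response-order differential used to detect primacy effects.
# The counts below are hypothetical, not the study's actual data.

def pct_choosing(target, answers):
    """Percent of respondents choosing `target`."""
    return 100.0 * sum(a == target for a in answers) / len(answers)

# Half-sample A saw "Deficit" listed first; half-sample B saw it listed last.
first_half = ["Deficit"] * 26 + ["Crime"] * 40 + ["Drugs"] * 18 + ["Unemployment"] * 16
last_half  = ["Deficit"] * 22 + ["Crime"] * 44 + ["Drugs"] * 18 + ["Unemployment"] * 16

differential = pct_choosing("Deficit", first_half) - pct_choosing("Deficit", last_half)
print(round(differential, 1))  # prints 4.0 for these counts; near zero suggests little primacy
```

A differential near zero, as in the RDD study and the better panels, is the reassuring outcome.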

His colleagues at the Stanford Institute for the Quantitative Study of Society also reported on comparisons of a broad set of measures including demographics, life style issues, general attitudes and beliefs, and technology use. In several instances they compared these same panels and RDD against some established benchmarks such as Census data. Their succinctly stated conclusion: "Remarkable comparability of results."

My point is that while there may be some satisficing going on with Web surveys, there is little evidence that it is seriously affecting results. Clearly, some panels are not as good as others and we need to sort them out, but based on research I saw presented in April at ESOMAR's Worldwide Panel Research Conference the thing we need to worry most about is the number of surveys a panelist is completing. It seems clear that people who do lots and lots of surveys respond differently from those who do few, but we can typically select these people out in advance by working with our panel providers.

I also would tend to disagree with some of the Burke suggestions for dealing with satisficing. The classic measure (advocated by Krosnick among others) is to compute non-differentiation of responses within gridded items, where studies have shown the strongest enticement to satisficing (straightlining). The level of non-substantive response (missing data) is another. And some people like survey length, but that is an inherently dirty measure given varying connection speeds and ISP traffic volumes.
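A minimal sketch of a non-differentiation check, assuming the simplest possible index (the standard deviation of a respondent's answers across the rows of one grid, with zero meaning pure straightlining); the respondent data and threshold are illustrative:

```python
# Sketch of a simple non-differentiation (straightlining) check on grid items.
# Index used: population standard deviation of each respondent's answers
# across one grid. Zero means identical answers on every row.
from statistics import pstdev

def nondifferentiation_flags(grid_responses, threshold=0.0):
    """Flag respondents whose within-grid SD is at or below `threshold`."""
    return [pstdev(rows) <= threshold for rows in grid_responses]

respondents = [
    [4, 4, 4, 4, 4],  # straightliner: same answer on every grid row
    [2, 5, 3, 4, 1],  # well-differentiated responses
    [3, 3, 3, 3, 4],  # mostly flat, but not pure straightlining
]
print(nondifferentiation_flags(respondents))  # prints [True, False, False]
```

In practice one would look at this index across all of a respondent's grids, and possibly loosen the threshold, before treating anyone as a satisficer.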

Multiple Response Questions on the Web

It seems almost second nature to us to randomize response options for multiple response questions in Web surveys.   Why do we do that?  Well, I hope the answer is that we know there are primacy effects, that is, Rs give greater attention to the items at the top of the response list than they do to items further down.  As a result, those items at the top are more likely to be selected.  So we randomize to give every option a reasonably equal probability of being at the top of the list.  Of course, Rs still tend away from "deep cognitive processing" of the entire list and, in all likelihood, Rs select fewer responses than they might if they seriously considered every response in the list.

We try to avoid this problem on the telephone by using a technique called "forced choice" in which we read every response in the list as a yes/no question.  Not only does this help with the primacy issue but it also results in more responses being selected.

It turns out that this forced choice technique works equally well on the Web.  Instead of a multiple response or check-all-that-apply format, you put the responses into a grid with a yes or no required on each one.  The technique works against primacy and typically results in a larger number of responses being selected.  It is superior to randomization and ought to be our standard.

It also helps with programming efficiency, especially when the answers to the multiple response question are used to drive future questions.  While it is relatively easy to do the initial randomization, retaining that order through future questions is very labor intensive.
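One way to see why retaining the order need not be painful: if the shuffle is derived deterministically from the respondent id, any later question can regenerate the identical order on demand.  A sketch under that assumption (item and id names are invented for illustration):

```python
# Sketch: a stable per-respondent randomization that follow-up questions
# can reproduce without storing the shuffled order anywhere.
import random

ITEMS = ["Brand A", "Brand B", "Brand C", "Brand D"]

def randomized_order(respondent_id, items=ITEMS):
    """Derive a repeatable shuffle by seeding the RNG on the respondent id."""
    rng = random.Random(respondent_id)  # same id -> same order every time
    order = list(items)
    rng.shuffle(order)
    return order

first_pass = randomized_order("R-1001")  # order shown in the initial question
follow_up = randomized_order("R-1001")   # a later question reuses the same order
assert first_pass == follow_up
```

With the forced choice grid, of course, none of this bookkeeping is needed in the first place, which is part of its appeal.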

Scales Can Be Problematic in Mixed Mode Studies

This interesting problem showed up in my office this morning.

A while back we did a large Web study with physicians that used an extensive battery of seven point scales.  More recently, we completed a telephone study using the same battery with a similar population, but we are seeing somewhat different results.  Mean scores on the scale questions are sometimes higher than they were on the Web.  What's going on?

There are a few potential explanations but the one that probably explains most of what we are seeing here comes down to how respondents process a scale when they hear it read to them as opposed to seeing it displayed, in this case, on a Web page. 

Tarnai and Dillman (1992) were probably the first to report a tendency for telephone respondents to choose extreme values in scales more often than mail survey respondents, who tend to use more of the entire scale.  This may result in different means (higher or lower depending on the question and scale direction) across modes. More recently, Steiger, Keil and Gaertner (2005) saw evidence of this in a Telephone/Web comparison of satisfaction scores.  They did not see the effect on all of the items they considered, but they saw enough to suggest that Web respondents distribute themselves more across the entire scale than telephone respondents do.  When you stop to think about it, this is not all that surprising.  Visualizing the scale in your head versus seeing it displayed on paper or on a computer monitor might produce subtle differences in the value you select.

Unfortunately, the issue gets a bit more complex if you introduce variation in how scales are displayed on the Web.  Tourangeau, Couper, and Conrad (2004) first reported that including non-substantive answer categories (such as Don't Know, Refused, Not Applicable, etc.) on the far right of a horizontal scale display or the bottom of a vertical display can cause the center of the distribution to shift visually.  For example, in a seven point scale where the respondent sees seven radio buttons across the screen the visual center is the fourth radio button from the left.  Adding, for example, two non-substantive codes on the far right means that there now are nine radio buttons displayed and the visual center is the fifth button from the left.  So choosing from the visual middle of the scale can produce a slight elevation in the overall mean.  Baker, Conrad, Couper, and Tourangeau (2004) replicated this result and showed that it can be mitigated by such things as not displaying the non-substantive answer categories, displaying them but separating them from the substantive codes with a vertical line, or labeling all points in the scale.  Simply labeling the midpoint of the scale also may help.

More unfortunately still, when we did some experimental comparisons between phone and Web (Speizer, Schneider, Wiitala, and Baker (2004)) the effect reversed: where there were significant differences on satisfaction items across modes, the tendency was for Web respondents to use more of the top boxes than telephone respondents.  My untested hypothesis there is that the display (non-substantive answer categories on the right) mitigated the effect.  But I have yet to prove that.

To sum up, it should not surprise us that hearing the scale read and then interpreting it in one's head can sometimes lead to subtle differences compared to seeing the scale displayed on paper or on the Web.  The research record leads us to expect that we may see differences in means, and those differences could be in either direction, depending on the question and how it is displayed on the Web.