Wednesday, July 10, 2013

You Can Sometimes Trust Research Done on Mechanical Turk, But It Depends on the Research Question

Dan Kahan has an interesting post on some of the validity problems with research conducted on Mechanical Turk (MTurk). I think I largely agree with his main point, which is that the evolution of the marketplace has been such that it's become less useful for conducting certain kinds of research. However, I do worry there's a potential baby/bathwater problem if researchers decide that "unrepresentative" or "experiment-savvy" means a useless subject pool (e.g., Andrew Gelman titled his blog post about Kahan's article "Don't Trust the Turk").

I haven't done MTurk research in several years, but the external validity issue raised by the blog post is something I thought about quite a bit when I was running experiments on the platform. I wrote a section about external validity in my ExpEcon paper with Richard Zeckhauser and Dave Rand. The key portion is excerpted below (the source code and data for that paper are available here):
Representativeness 
People who choose to participate in social science experiments represent a small segment of the population. The same is true of people who work online. Just as the university students who make up the subjects in most physical laboratory experiments are highly selected compared to the U.S. population, so too are subjects in online experiments, although along different demographic dimensions.

The demographics of MTurk are in flux, but surveys have found that U.S.-based workers are more likely to be younger and female, while non-U.S. workers are overwhelmingly from India and are more likely to be male (Ipeirotis, 2010). However, even if subjects "look like" some population of interest in terms of observable characteristics, some degree of self-selection of participation is unavoidable. As in the physical laboratory, and in almost all empirical social science, issues related to selection and "realism" exist online, but these issues do not undermine the usefulness of such research (Falk, 2009).

Estimates of changes versus estimates of levels 
Quantitative research in the social sciences generally takes one of two forms: it is either trying to estimate a level or a change. For "levels" research (for example, what is the infant mortality in the United States? Did the economy expand last quarter? How many people support candidate X?), only a representative sample can guarantee a credible answer. For example, if we disproportionately surveyed young people, we could not assess X's overall popularity.

For "changes" research (for example, does mercury cause autism? Do angry individuals take more risks? Do wage reductions reduce output?), the critical concern is the sign of the change's effect; the precise magnitude of the effect is often secondary. Once a phenomenon has been identified, "changes" research might make "levels" research desirable to estimate magnitudes for the specific populations of interest. These two kinds of empirical research often use similar methods and even the same data sources, but one suffers greatly when subject pools are unrepresentative, the other much less so.

Laboratory investigations are particularly helpful in "changes" research that seeks to identify phenomena or to elucidate causal mechanisms. Before we even have a well-formed theory to test, we may want to run experiments simply to collect more data on phenomena. This kind of research requires an iterative process of generating hypotheses, testing them, examining the data and then discarding hypotheses. More tests then follow and so on. Because the search space is often large, numerous cycles are needed, which gives the online laboratory an advantage due to its low costs and speedy accretion of subjects.
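
To make the excerpt's levels-versus-changes distinction concrete, here is a small simulation sketch. It is not from the paper; every number in it is made up for illustration. It assumes a population that is 30% "young", a subject pool that is 80% "young", baseline support for candidate X that differs by age group, and a randomized treatment that raises support in both groups. The important (and hypothetical) assumption is that the treatment effect has the same sign in every group. If it did not, an unrepresentative pool could get the sign wrong too.

```python
import random

# All rates below are invented for illustration, not taken from any data.
BASE_SUPPORT = {"young": 0.70, "old": 0.40}      # support without treatment
TREATMENT_EFFECT = {"young": 0.10, "old": 0.05}  # assumed same sign in both groups


def simulate(share_young, n, seed):
    """Run a randomized experiment on a pool with the given share of young
    subjects. Returns (estimated level, estimated change), where the level is
    mean support in the control arm and the change is treated minus control."""
    rng = random.Random(seed)
    control, treated = [], []
    for _ in range(n):
        group = "young" if rng.random() < share_young else "old"
        is_treated = rng.random() < 0.5  # coin-flip assignment to treatment
        p = BASE_SUPPORT[group] + (TREATMENT_EFFECT[group] if is_treated else 0.0)
        outcome = 1 if rng.random() < p else 0
        (treated if is_treated else control).append(outcome)
    level = sum(control) / len(control)
    change = sum(treated) / len(treated) - level
    return level, change


if __name__ == "__main__":
    # True population level is 0.3 * 0.70 + 0.7 * 0.40 = 0.49;
    # the skewed pool's expected level is 0.8 * 0.70 + 0.2 * 0.40 = 0.64.
    for label, share in [("representative", 0.30), ("skewed pool", 0.80)]:
        level, change = simulate(share, 200_000, seed=1)
        print(f"{label:>15}: level = {level:.3f}, change = {change:+.3f}")
```

Under these assumptions the skewed pool badly overstates the level of support, while the estimated change keeps the correct sign in both pools. Its magnitude still differs between the two, which is why a follow-up "levels"-style study on the population of interest can be worthwhile once an effect has been found.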