Tuesday, May 22, 2012

Data openness by private firms

The New York Times has a story today about social scientists working with company data and being unable or unwilling to make it public. The story begins:
When scientists publish their research, they also make the underlying data available so the results can be verified by other scientists.
I think the first sentence is probably more a description of how we'd like the world to be than how it actually is right now, especially in the social sciences. The main so-what of the story is that private companies are collecting enormous amounts of high-quality data that let you do fascinating social science, but companies are understandably reluctant to make these data public, primarily for privacy reasons (and probably also because they are afraid of giving up some competitive advantage).

I think the options for any organization that does or might do research are:

1) Do research for business purposes. Make neither the findings nor the data public.
2) Do research for business purposes. Make the findings but not the full data public.
3) Do research for business purposes. Make the findings and data public.  
4) Do research. Make findings and data public.

Most companies probably aren't interested in (4), and this is probably academia's biggest comparative advantage. Barring (4), I think that from a social perspective, privacy issues aside, the best outcomes in order are (3) > (2) > (1). I can understand (1) in some cases, but at least in the kinds of companies I'm familiar with, the advantages of keeping everything secret probably aren't that great.

The advantages of (2) or (3) over (1):  

a) If you're a software company and you release a feature that works, it will probably get copied anyway, regardless of whether you publish a paper, so you might as well get the thought leadership credit for coming up with the idea in the first place. This paper is/was the basis for Google's secret sauce---posting it to the InfoLab servers back in 1999 didn't doom the company and probably did a lot to increase the perception that they were doing something smarter (even though there were antecedents of this idea going back many years---including in Economics, by my academic grandfather).

b) If you give outside academics access and let them publish, you can get them to work on your problems for free (the Netflix Prize is an obvious example). You can also recruit those academics to come work for you, or at least get their grad students to come work for you.

c) If you let your internal researchers publish, you can get them to work at reduced cost or get researchers you otherwise wouldn't be able to attract (see Scott Stern's paper on scientists "paying" to do science).

On (2) versus (3), I think there is a real dilemma: openness and privacy concerns are in tension. Furthermore, just releasing more aggregated or somehow obfuscated versions of the data is not risk-free: there's actually an emerging literature in Computer Science on how to release data in ways that are guaranteed to still have the right privacy properties (UPenn professor Aaron Roth recently taught a course on the topic). The fact that smart people are working on it is exciting, since they might figure out provably risk-free ways to release data publicly, but it's also evidence that this isn't a trivially easy problem---seemingly innocuous data disclosures could let someone unravel the obfuscation.
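To make the risk concrete, here is a minimal Python sketch of how two exact aggregates can reveal one person's value (a "differencing" attack), followed by one standard idea from that literature: adding Laplace noise calibrated to how much any single person can change the answer. Everything in it is an assumption for illustration. The records, the function names, and the parameters `epsilon` and `max_spend` are hypothetical, and this is not a description of what any particular company or course does.

```python
import random

# Hypothetical toy records: (user_id, monthly_spend). Not real data.
records = [("u1", 120), ("u2", 75), ("u3", 310), ("u4", 42)]


def total_spend(rows):
    """An aggregate statistic a firm might consider safe to publish."""
    return sum(spend for _, spend in rows)


def differencing_attack(rows, target_id):
    """Recover one user's value by subtracting two exact aggregates."""
    with_target = total_spend(rows)
    without_target = total_spend([r for r in rows if r[0] != target_id])
    return with_target - without_target


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_total(rows, epsilon, max_spend, rng):
    """Sum with Laplace noise, assuming each user contributes at most max_spend.

    Values are clipped to [0, max_spend] so that adding or removing one
    user changes the true sum by at most max_spend.
    """
    clipped = sum(min(max(spend, 0), max_spend) for _, spend in rows)
    return clipped + laplace_noise(max_spend / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    print("exact difference:", differencing_attack(records, "u3"))
    others = [r for r in records if r[0] != "u3"]
    noisy_diff = noisy_total(records, 1.0, 500, rng) - noisy_total(others, 1.0, 500, rng)
    print("noisy difference:", noisy_diff)
```

The exact difference returns the target's spend directly, while the noisy difference is no longer a reliable estimate of it. The cost is accuracy: smaller values of `epsilon` mean more noise, which is one way to see the tension between openness and privacy.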

As a coda, I have a personal anecdote to share about this story. One of the people discussed in the article is Bernardo Huberman: 
The chairman of the conference panel — Bernardo A. Huberman, a physicist who directs the social computing group at HP Labs here — responded angrily. In the future, he said, the conference should not accept papers from authors who did not make their data public. He was greeted by applause from the audience. 
When I was a grad student, I taught a course to Harvard sophomore economics majors called "Online Labor" (syllabus).  I assigned some of Huberman's papers on motivation. I emailed him to ask for the data from one of his papers. He wrote back: 
Dear Dr. Horton:
Thank you for your interest in my work and I certainly feel pleased when I learn that you liked my paper enough to assign it to your class.
As to your request, let me talk with the person who now handles the youtube data (we lately used it to uncover the persistence paradox) and I'll get back to you.
Incidentally if you are interested in the role that attention and status (its marker) play among people I could send you a paper that reports on a experiment (as opposed to observational data) that elucidates it quite cleanly across cultures.

I got the data within days---I can state that he privately practices what he preaches publicly.

Update: I incorrectly stated that Aaron Roth was a professor at CMU---he did his PhD at CMU. He's a professor at UPenn. Apologies.