The CRU emails and the secret life of bugs

At Scholars and Rogues, Brian Angliss draws a nice link between the paper I wrote with Gina Venolia at Microsoft Research and the “Climategate” emails.

He argues that we cannot draw any strong conclusions from a fraction of a scientific unit’s emails taken out of context. We get much more out of inquiries that probe the details behind these emails and that are grounded in a better, more comprehensive data collection, including input from the people involved. There have been several of these inquiries already, all of them exonerating the climate scientists.

This kind of inquiry is what we did for each of the bug records in our case study. As I summarize here, we found that a deeper analysis of the interactions behind bug records provides a much-needed context to understand important aspects of their stories. Their electronic traces paint an incomplete (and often misleading) picture, and it’s risky to use them on their own. Overall I don’t think this is a particularly surprising finding; it simply was a necessary study in our field, considering the recent trend to mine data out of context.

I guess that some people feel suspicious about official inquiries performed by strangers and coming out empty-handed; it feeds their conspiracy paranoia. But objectively speaking, we can’t just read a bunch of emails and expect to come even close to the level of understanding of people who get to examine all the evidence and talk to the people behind it.

(Thanks to Steve for tipping off Brian about our paper!)

About Jorge Aranda

I'm currently a Postdoctoral Fellow at the SEGAL and CHISEL labs in the Department of Computer Science of the University of Victoria.
This entry was posted in Academia.

3 Responses to The CRU emails and the secret life of bugs

  1. And thank you for commenting. Good luck on the defense!

  2. Neil says:

    Interesting discussion that was started. Let me see if I understand the central conceit: “there is more information (context) that is not available to you if you constrain yourself to level 1/2 sources”. In other words, if I just look at bug records, I miss things like motives for abandoning a bug for 6 months.

    How does this square with data mining of personal information? I think we were talking earlier about Netflix. Given sufficient information (large N), you can start to get very specific details about things beyond the dataset.

    In the CRU case, I would argue that there is a difference between quote-mining “hide the decline” and using the emails in their entirety to infer patterns. For example, Person A dislikes Person B, Person C and Person D agree, etc.

    The problem in the CRU case is that the available corpus may be deliberately corrupted to filter out inconvenient truths (as in, Person A and Person B made friends eventually). But if you give me a reasonably complete set of data, I argue that it is perfectly reasonable to draw level 3 inferences from it. In fact, that is exactly what many disciplines do: history, sociology, etc. In history, you almost never get access to level 3/4 information: you must construct a plausible (defensible?) narrative that fits the facts you *do* know.

    Finally, (to take a po-mo perspective you probably agree with) the idea that there is some objective truth about the bug data, or the CRU emails, is flawed. This is what the deniers understand much better than the scientists involved: that the ‘reality’ of what happened is very much subjective.

    • Jorge Aranda says:

      “But if you give me a reasonably complete set of data, I argue that it is perfectly reasonable to draw level 3 inferences from it.”

      I agree. It seems as if the CRU case is more of an attempted level 3 analysis with very incomplete (and cherry-picked) level 1 data, whereas the East Anglia inquiries would be a level 4 analysis. There also seems to be some malice involved in the original analysis. So the link to our paper is not that straightforward.

      But just to clarify: history is *all* about doing analysis at levels 3/4. The distinction is that at lower levels the analyst doesn’t go through the data herself: she delegates the task to a machine instead, and the machine won’t pick up some knowledge that would be obvious to the analyst.

      As to your final point—sure, objectivity is flawed, etc, etc. And yes, deniers exploit this. But that doesn’t mean that interpretations of the sentence “hide the decline” as “obfuscate the truth” and as “handle the data properly” are equally valid.
