The thorny and the obvious

This discussion between Laurent Bossavit and Steve McConnell makes for very interesting reading: Bossavit critiques McConnell’s Making Software chapter on differences in programming productivity (original in French here), arguing that the studies it cites do not establish as a fact that some programmers are an order of magnitude better than others; McConnell responds, nobly and patiently, justifying the citations and the order-of-magnitude claim that they support.

Bossavit’s critique seems slightly tinged with indignation at discovering how scientific sausages are made. (Incidentally, Bruno Latour, whom he discusses at some length throughout his piece, is a prime exponent of such sausage making.) Bossavit goes back to some of the studies cited by McConnell and finds that they are not controlled laboratory experiments, or that their sample size is fairly small, or that the participants were debugging instead of programming proper, or that they suffer from some other problem. He therefore finds McConnell’s litany of citations suspect: none of them conclusively establishes that some programmers are an order of magnitude better than others, yet taken all together they form an intimidating wall of academic texts that encourages the reader to take McConnell’s summary, erroneously, for a fact. In his reply, McConnell convincingly shows that the scientific evidence for an order-of-magnitude difference in individual programming productivity is far more solid than Bossavit makes it out to be. Not conclusive, perhaps, but as strong as it gets in our field to date.

However, even with the issue of McConnell’s chapter settled, Bossavit’s critique, taken more generally as a discussion of software development research (which was, I believe, Bossavit’s intent), still offers several important observations that were overlooked in the subsequent discussion.

The first is our tendency to protect some of our questionable claims with a layer of citations. Whole subfields of software research have sprouted from such clever gardening; by the time they wither, their creators will have long secured themselves, having achieved tenure and respectability years before. It is truly a pain, on some occasions, to dig through the list of references of an initially exciting paper, only to find that it rests on the flimsiest empirical support. Even if in this case Bossavit’s criticism was unwarranted, it holds for many other academic papers in our area.

On the other hand, paired with this tendency to offer questionable citations is the tendency to demand them in the literature we read and review, even to support fairly obvious statements. In a sense, we’ve become rather lazy, preferring an (Author, Year) string over the (slight) effort of considering whether a claim makes sense through simple argumentation or experience. We demand statistical significance, rather than clarity of thought.

This is the case with the whole productivity issue. I know that there are people who are at least an order of magnitude better programmers than others; I have seen them, and I suspect most other software developers or researchers have, too. It’s just part of the difficulty of the task and of the variety of human nature. I also know runners, jugglers, writers, cooks, managers, and scientists, who, by any sensible criteria, are far beyond the abilities of some of their peers. We don’t really need a series of double-blind controlled experiments with thousands of participants around the globe to establish this; our resources are better spent otherwise, and in the case of programmer productivity the sources that McConnell offers, methodologically weak as they might be, are more than enough to convince ourselves that there are no surprises here, and move on.

The real question, the thorny issue, is the nebulousness of our constructs—in the case in point, the development productivity construct. Bossavit gets to this near the end of his critique, but his arguments appear to have been ignored in the subsequent debate. What is productivity? For starters, Bossavit reminds us (and again, there’s no need to demand citations or studies here; our experience confirms the statement) that some people have a net negative productivity. Initial measurement efforts are naive: lines of code have long been discredited as an accurate indicator of anything. Other programming-centric measures (such as function points) risk missing the essence of productivity: as Greg Wilson likes to say, “a week of work saves an hour of thought,” and such hours of thought are not amenable to straightforward measurement, as they tend to produce very little code. And what about the more subtle components of productivity? Perhaps you, like me, have worked with someone who may not be particularly skilled, technically speaking, but who has some other attribute (charisma, empathy, drive, a sense of purpose) that amplifies the productivity of the team many times over. How can we include such considerations in our productivity construct, seeing as they are enmeshed in our understanding of it?

The problem is that, for as long as a construct is as weakly built as that of development productivity, any experimentation we carry out is bound to be unsatisfactory. We know that some people are more productive than others; whether they are exactly five, ten, or twenty-seven times more productive is not a question we can settle at this point (or, perhaps, ever—and by the way, I’m not sure this is really where we want to go as a society, but that’s a different topic). If we can spare research efforts to explore productivity in more detail, I suggest we aim them at settling these theoretical and conceptual issues first, rather than at more careful and methodical experimentation.

About Jorge Aranda

I'm currently a Postdoctoral Fellow at the SEGAL and CHISEL labs in the Department of Computer Science of the University of Victoria.

23 Responses to The thorny and the obvious

  1. Greg Wilson says:

    1. You say, “…our tendency to protect some of our questionable claims with a layer of citations.” Can you provide examples?

    2. It’s worth pointing out that Bossavit didn’t actually bother to read some of the things McConnell cited, and in at least one case (DeMarco and Lister) he assumed McConnell was citing one work when in fact he was citing another.

    • Jorge Aranda says:

      1. Yes, several. Listing them here won’t make me many new friends, but as one fairly typical example of what I mean, consider the Intro section of the Damas et al. ICSE 2009 paper on process modeling.

      2. Yes, Bossavit’s critique was loaded with gaffes. Not only did he not read some of McConnell’s references, but he mistook an IST journal paper for an executive report, and interpreted as weaknesses some actual strengths of the DeMarco and Lister study. Still, despite all these major flaws, my point is that his construct concerns are relevant.

    • Laurent Bossavit says:

      “Didn’t bother” is incorrect; “wasn’t able to find at reasonable effort” is the case.

      I’m pretty sure that, to an overwhelming majority of readers of “Making Software”, tracking down articles that are paywalled in different places (IEEE, Springer, etc.) and books that are out of print is a pretty substantial effort. People who do make the effort, even in part, are paying the book and those who worked on it a rare homage: taking its message to heart.

      Greg, your opinion in late 2009, when you commented on John Cook’s “Why programmers are not paid in proportion to their productivity”, was that of the studies McConnell cited “the only two that would pass muster today are Valett & McGarry and DeMarco & Lister (and even the latter would probably be bounced if submitted today)”.

      What new evidence has changed your mind on these citations?

      As a co-editor of the book, the quality of these citations is partly your responsibility. In at least one case (Mills 1983), Steve referring to it as one of “many studies” sent me on a wild goose chase for empirical evidence, ultimately turning up nothing but a statement of personal opinion. At least one reader – me – experienced that as a serious defect, one that undermines the book’s overall message.

      Are you OK with that situation?

      • Greg Wilson says:

        > “Didn’t bother” is incorrect; “wasn’t able to find at reasonable effort” is the case.

        We clearly disagree on what constitutes “reasonable effort” when you’re going to imply that people have been trying to pull a fast one.

        > Greg, your opinion in late 2009, when you commented on John Cook’s “Why programmers are not paid in proportion to their productivity”, was that of the studies McConnell cited “the only two that would pass muster today are Valett & McGarry and DeMarco & Lister (and even the latter would probably be bounced if submitted today)”. What new evidence has changed your mind on these citations?

        What changed was my understanding of what constitutes evidence—putting it another way, I’m a little humbler than I used to be. Five years ago, I sneered at anything that wasn’t a double-blind randomized trial followed by a t-test; working with Jorge and others, I learned that there are many other ways to discover things. They may not be statistical, but they are just as rigorous, and just as revealing.

  2. And now for a totally unrelated New York Times article. Your mention of sausage-making reminded me of it.

  3. Lorin Hochstein says:

    Along these lines, there was a fascinating article in the New Yorker a couple of weeks ago entitled The truth wears off. It describes how false theories come to be believed by scientific communities because a promising experiment shows support for an interesting theory, followed by publication bias in subsequent studies.

    • Jorge Aranda says:

      Lorin,

      Yes, it’s an interesting article. Incidentally, the author didn’t go into this, but as I read it I thought this is partly a consequence of chasing statistical significance, rather than theoretical soundness…

  4. Neil says:

    Isn’t it contradictory to accept McConnell’s 10x argument, and on the other hand say we don’t actually know what we mean by ‘productivity’?

    I’ll confess to not having read the literature on the topic, but it does seem full of self-reports and anecdote. Moreover, a lot seems very old for this field (mostly from the 1980s).

    As an aside, what other human endeavours do we see a factor of 10 difference in reasonably similar professionals? It seems enormous to me. Does a litigator win 10x more cases than another? I did a marathon in 4 hours, and the winners aren’t even twice as fast as slow, untrained me. Why would we expect two Microsoft programmers to be so different in (let’s say) how fast they fix the same bug?

    It seems to me that this argument is really saying that people experienced with the project, tools, and team are more useful than newcomers. Has any study controlled for that?

    • Jorge Aranda says:

      Neil,

      Well, my only quibble with McConnell on this would be that he takes all of these reports about huge differences in productivity and summarizes them as a “10x” difference, which is an unwarranted level of precision. I can’t accept that there is a 10x difference between best and worst, because we don’t really know how to evaluate productivity differences precisely. But I can easily accept that there is a very large difference in productivity between them. In this post, perhaps also with inappropriate precision, I called that an “order of magnitude” difference, to give an idea that the difference is very probably quite large. I don’t see a contradiction in this: we do know what productivity is, roughly (desirable work achieved per unit of time); it’s just that we don’t know how to operationalize the construct.

      You ask “what other human endeavours do we see a factor of 10 difference in reasonably similar professionals?” But I think part of the problem is the realization that we’re not dealing with reasonably similar professionals: there’s variety of experience, training, motivation, domain knowledge, architectural sense, etc. Beyond a general tendency to discuss geeky films and books, software professionals do not seem a homogeneous group to me. There are no entry barriers to the profession, no filters. I don’t know if there are litigators that win ten times more cases than others; I’m sure there are many who would win ten times more cases than I would, if I were allowed to represent others in court.

      Similarly, I’d still be unable to finish a marathon: my knees wouldn’t allow it. So in the business of Finishing Marathons, so far you’re effectively an infinite number of times more productive than I am. Enjoy it! (In the business of Placing First—a professional runner’s ultimate goal—we’re both abysmal, though.)

      • Jorge, I appreciate the balanced criticism. The way I use the phrase, “10x” is intended not to be precise. I don’t say “11.7x.” I use a round number (1 significant digit) that, not coincidentally, is the same as the “order of magnitude” that you like.

        Most of the studies I cite equate “productivity” with time. I.e., all the subjects get the same task, and the observed differences are the difference from best time to worst time. In the studies the range goes from about 4x to about 25x — I summarize this with the round number 10x.
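
        As a purely illustrative sketch of that arithmetic (the numbers below are made up, not figures from any of the studies): the “x-factor” is just the worst completion time divided by the best, rounded to one significant digit.

          # Made-up completion times (hours) for the same task -- illustrative only,
          # not data from any of the studies discussed above.
          times = [3.5, 5.0, 8.0, 12.0, 26.0, 38.0]

          best, worst = min(times), max(times)
          ratio = worst / best  # the "x-factor": slowest time divided by fastest time

          print(f"best {best}h, worst {worst}h, ratio {ratio:.1f}x")
          # Here the ratio is about 10.9x; rounded to one significant digit, "10x".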

      • Jorge Aranda says:

        Thanks for dropping by, Steve. I figured your use of “10x” is not meant to be taken precisely. My (very minor) concern is that this lack of precision is lost when people read and then repeat that some programmers are “ten times” more productive than others.

  5. John Cook says:

    The objection that studies did not distinguish debugging from programming proper is particularly academic. What exactly is programming proper? If it means typing new lines of code into an editor, then programming proper may consume 30 minutes of a typical programmer’s day.

    • Laurent Bossavit says:

      Possibly; however, debugging is particularly problematic, as programmers know from experience that it can be an especially open-ended type of task.

      When you start chasing a bug, you rarely can tell if it’s going to take you a minute, an hour, a day, or a week to track down. As such, any study that focuses on, or includes, substantial time spent debugging is going to overstate variability in times-to-complete.

      There is the additional problem that “programming” and “debugging” are categories we recognize in daily work, categories that seem to us to carve nature at its joints, but which may turn out to be very hard to isolate artificially in lab settings. (Some programmers like to write unit tests as they code. Does that count as programming or as testing? Ditto refactoring?)

      For some, programming may consist of staring at the ceiling; for others, it may instead consist of getting together with a peer and throwing ideas back and forth. (Studies focusing on individual productivity may artificially place this kind of person at a disadvantage, which will tend to overstate variability.)

  6. Laurent Bossavit says:

    Jorge: “McConnell convincingly shows that the scientific evidence for an order-of-magnitude difference in individual programming productivity is far more solid than Bossavit makes it out to be”

    I’m sincerely puzzled by your saying this, and Greg apparently agreeing. I find McConnell’s defense of his original citations dismissive and heavily biased. (And I keep doing more and more homework on this, hence the glacial pace of my responses.)

    The difficulty here is that “Curtis 1981” requires only two words, but these two words weigh heavily by contributing to the “wall of citations” effect you mention.

    By contrast, to make a proper argument critiquing each of these two words I need not just to track down and obtain (at a few tens of dollars each) all of these references, but also to write a few hundred words of text.

    Take this Curtis 1981. On closer reading you find that it’s an indirect citation, referencing a study conducted two years earlier (“Sheppard et al.”). It suffers from at least one thing that McConnell himself thinks is a problem with another study I posted: it was never intended to *directly* study performance variability, but in fact is a subset of data collected in the *pretest* of an experiment on the impact of structured control flow and program size on debugging performance.

    Other citations of McConnell’s are not just indirect, but incestuous: on closer reading something that is cited as a “study” which “confirms” the original 1968 Sackman conclusions in fact turns out to be an unrelated work, with only a tangential reference to… the original 1968 experiment.

    When I pointed this out, specifically for the Boehm and Papaccio 1988 reference, McConnell’s response was “I will acknowledge that this wasn’t the clearest citation for the underlying research I meant to refer to”. (Contrast: when my goofs are pointed out, what I say is “I goofed” – I don’t sugarcoat it.)

    However, this isn’t the only instance of incestuous citation in the original “wall of cites”. (Homework for you: can you find at least one other?)

    • Jorge Aranda says:

      Laurent, I don’t know why you’re puzzled. McConnell did show that the evidence he cited was stronger than you claimed; there’s no way around it. That evidence may still not convince you, but that’s a different matter.

      Perhaps you need to ask yourself two questions. First, what would it take to convince you of statements like this? That is, what kind of studies are valid for you, and are your standards realistic for the software research community? And second, what alternative explanations are there for the all-too-commonly reported observation that there is a very large difference in productivity between people that develop software? Only by offering a compelling and sound alternative explanation can you hope to present a challenge to the claim (shown by everyday experience and seemingly confirmed by peer-reviewed research) that some developers are much more productive than others.

      • Laurent Bossavit says:

        Jorge: “McConnell did show that the evidence he cited was stronger than you claimed.”

        McConnell lists a total of 8 references in the “wall of cites”. My bibliographic work shows that these 8 refer to 4 (at best 5) distinct data sets; only double-counting some of them brings the citation count to 8. I believe that I did show that the evidence he cited is weaker than he claimed. The only way around that is to refute the double-counting; Steve has strenuously refused to do so.

        Do you, or do you not, believe that there is double-counting of some data? That is an empirically verifiable claim.

        “what kind of studies are valid for you”

        This isn’t so much about what studies I find valid: the paper published in 1979 (“Sheppard et al.”) seems OK enough. My beef in this instance is with improper reporting of the results found, especially when this reporting overstates the conclusions.

        In particular, when you cite something as providing empirical support for a claim, the rules of engagement dictate that this something should have an abstract that summarizes the claim, a description of the methodology, and so on. And that you shouldn’t count the same dataset twice.

        “are your standards realistic for the software research community”

        We should accept standards that are realistic, but also strong enough for research in general – the fact that this is software is no reason to lower our standards and accept sloppy bibliographical work.

        My standard is simply: when you cite something that is supposed to provide empirical support for a claim, do your homework and make sure that the reader who bothers to track down your source can answer some basic questions about the research: how was the data collected, what was the sample size, what was the operational definition of the construct investigated, and ideally what threats to validity were noted.

        (Note for instance that my standard places no constraints on statistical tooling, nor does it require quantitative research, vs qualitative. Anything goes, as long as the interested reader can check your facts if they are so inclined. This standard is basically a “maxim of helpfulness”.)

        “what alternative explanations are there for the all-too-commonly reported observation that there is a very large difference in productivity between people that develop software”

        Oh, that’s an easy one. We’d all like to believe we are a “10x” developer, so we’re inclined to believe the claim. Confirmation bias can easily account for the rest. (What seems “obvious” often turns out not to be so obvious, and in some cases to be false.)

        It’s more complicated than that, of course, and the popularity of the claim is overdetermined. In addition to ego-flattery, there is the underlying ideological position that “great developers are born not made”. That lets some kinds of people off the hook for managing the training and cultivation of great developers. (I acknowledge the ideological underpinning of the opposite claim, by the way: if all developers are basically the same then the “throw a bunch of people at the problem” approach has some chance to work.)

        I’m quite prepared to believe that some people in this business are exceptionally gifted, others exceptionally unsuited – in fact I believe that if you take a large enough sample you can get the “x-factor” to come out to nearly any number you care to name.

        But such positive outliers are too rare: we cannot base a reasonable strategy for improving our management of programming efforts solely on the hope that we will find such outliers. What matters is the shape of the bulk of the curve. The “10x” claim offers us absolutely no insight on this.
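
        To make that sample-size point concrete, here is a minimal sketch. It assumes, purely for illustration, that everyone’s completion times come from one and the same lognormal distribution (i.e. no “real” 10x developers at all), and it shows that the observed max/min ratio still grows as the sample gets larger.

          import random
          import statistics

          random.seed(1)

          def observed_x_factor(sample_size, trials=200):
              """Median max/min ratio of completion times when everyone's times are
              drawn from the *same* lognormal distribution (no 'real' 10x developers)."""
              ratios = []
              for _ in range(trials):
                  # mu=0.0, sigma=0.5 are made-up parameters, chosen only for illustration
                  times = [random.lognormvariate(0.0, 0.5) for _ in range(sample_size)]
                  ratios.append(max(times) / min(times))
              return statistics.median(ratios)

          for n in (5, 10, 50, 200, 1000):
              print(f"n = {n:5d}: typical observed ratio ~ {observed_x_factor(n):.1f}x")

        With these made-up parameters the typical ratio climbs from roughly 3x at n = 5 to well above 20x at n = 1000, without any underlying difference between individuals.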

      • Jorge Aranda says:

        Laurent, it’s strange that you keep pushing this point. I see no way you can “win” it. You go through McConnell’s references and claim that, for instance, one is an executive report when it is in fact an IST journal paper, or that another one (the DeMarco and Lister one) is vulnerable precisely for the reasons that actually make it strong. McConnell corrects these errors and misunderstandings of yours, and fills up the gaps in the literature that you did not check. Objectively, he showed that the evidence he cited was stronger than you claimed. Arguing otherwise is pointless.

        As for your alternative explanation (ego flattery followed by confirmation bias), I’m afraid it’s too simplistic to be of any use. I don’t think of myself as a particularly talented programmer. Removing myself from the equation, I’ve seen people that are much, much better developers than others. Plenty of others have too, again, leaving themselves out of the equation. Though stars may be rare, very good developers (and very bad developers) aren’t. Why is that hard to accept?

      • Laurent Bossavit says:

        (It seems that you can’t nest replies past a certain point, so I have to reply here re. “strange that you keep pushing this”.)

        Do you or do you not agree that Steve is double-counting some data sets? Just a yes or no will do.

        Every time a scientist refuses to answer a straight, factual question, a little puppy dies. 🙂

      • Jorge Aranda says:

        Honestly, I don’t know (and I don’t care: I have better things to do than audit unsurprising data sets that point to the obvious). You say that he did; I have no reason to distrust you on that. I can tell you that it’s irrelevant to my points, though, and that I’m not sure we’re having a productive discussion here.

      • Laurent Bossavit says:

        Abuse of citations is kind of a big deal to some people in the scientific community; see for instance here: http://www.bmj.com/content/339/bmj.b2680.full

        The underlying issue is information cascades: http://en.wikipedia.org/wiki/Information_cascade

        On the substantive question, the issue isn’t with the fact that there is a large variability in observed performance, but with what the source of variability is: the person, or something else.

        The “10x” claim is very close to the “rock star programmer” or “superprogrammer” notion, the idea that good programmers are “born not made”. To immediately jump to the conclusion that the variability we observe in “everyday experience” is a validation of some or all of these ideas seems dangerous to me.

        IMHO we are ignoring the idea that a bunch of other things could explain the perception of variability. After DeMarco and Lister we could believe that the environment trumps the intrinsic abilities of the person, for instance. They observed that the poorest performers came from workplaces where they were often interrupted; their conclusion was that if you took the same programmer from a low-productivity environment to a high-productivity environment, you’d see a factor of 2.6 improvement.

        Other factors of variability are the matching of task to person, of team to person, and so on. Someone who seems to be working out poorly in one context may turn out brilliantly in another for all I know; to assume that they won’t is nothing other than the Fundamental Attribution Error, a well-known bias in judgement.

        There’s the issue of who you’re looking at to establish this claim of “everyday observable” variability, and whether that could create bias. For instance, if you look at startup founders of the Zuckerberg type, it’s easy to say after the fact that he’s a “10x” type; the interesting question is whether you could have predicted it before he became famous, and how.

        It’s also possible that software development is an inherently variable activity, with significant variations in output expected even in the same individual faced with the same kind of task. (For instance the well-known “second system effect” cited by Brooks, which we have no reason to suppose won’t apply to individuals.)

        Obviously some of the everyday observed variability has to do with the fact that software isn’t a well-defined “profession” with barriers to entry. People get in who perhaps have the enthusiasm to work in software but not the abilities. But even granting that, we should be asking whether these abilities can be trained, and if so how fast, and so on. And “weed out obvious negative performers” is clearly a very different kind of advice from “only hire superprogrammers”.

        That’s not an exhaustive list of the ways we could be fooling ourselves into seeing something “obvious” that in fact isn’t there, just the first few off the top of my head.

      • Jorge Aranda says:

        I never suggested that good programmers are “born not made”; merely that hugely variable performance exists and is observed frequently. You seem to agree. There are plenty of extrinsic and intrinsic factors that may cause this; you point to some—good. I’m ready to leave this discussion behind, and I hope you are, too!

  7. Pingback: Measuring programmer productivity is futile. | Semantic Werks
