Dalke Scientific Software: More science. Less time.
/home/writings/diary/archive/2008/11/11/open_source_is_not_peer_review

Open source != peer review

I gave two presentations at the German Conference on Chemoinformatics (GCC) in Goslar, Germany. One was an update of my EuroSciPy presentation, Python Tools in Computational Chemistry (and Biology). I included more on the history of Python and why I think it became widely used in cheminformatics. At the end I gave some ideas of what I want for the future. I'll elaborate more on that in another posting here.

The second presentation was about some of the difficulties I've seen in doing open source cheminformatics development. I tried a different presentation style: black background with one or only a few words on the slide in white font. It required more practice, but came out pretty nicely, I think. I started by writing down everything I wanted to say. I'll post that text here soon.

Open Source != peer review

One of the slides is titled "Open Source != peer review". I'm breaking that out because it's something I want feedback on, or at least opposing arguments. Here's the short version of that part:

Some argue that doing good computation-based science requires open source. The argument is that scientists need to review the source code in order to verify that it works correctly. How, they argue, can you review someone else's paper if you can't review the source code used to make that paper?

I like open source. (My talk goes into the philosophical differences between "open source" and "free software.") I think there should be support for peer review. But I don't understand why the ability to see the source code, in order to review it for scientific quality, requires the right to redistribute the source code to others.


I gave CHARMm as an example. It's a molecular dynamics program from the Karplus group at Harvard. The academic license costs $600 and all licensees get the source code. People can review the source code, modify it, and, I believe, even send modifications to others who have a CHARMm license. But it's not open source.

What additional peer review could be done on CHARMm if it was distributed with an open source license? Please bear in mind that the time needed to get up to speed on CHARMm is quite large, and using it for interesting science likely requires good hardware, so $600 isn't that much. The license fees go to supporting CHARMm development, so there are scientific benefits to having the fee.

Yes, I know that free software allows selling the software so it's possible to charge the $600 even for open source code. My talk goes into the difficulties of actually doing that.

My point here is to get feedback about why the right to redistribute software is a requirement for effective peer review, and to tie it down to specific examples. Mine is just one; feel free to use your own.

Clean room development

Once you've done that, please also explain how to solve this problem. Suppose I review someone else's source code, either as part of the peer-review process for a publication or because I want to verify that code I got really does work.

Several months later I write a program which has similar functionality and my implementation looks very much like the code in the first program. Perhaps I deliberately structured it after the first program, or perhaps that form just makes sense. Perhaps I forgot that I had even seen the code in the first program. Perhaps many things.

The author of the first package finds out that my code is similar. Suspiciously similar. Was there a copyright violation? A license violation? What should I do? Change my license? Rewrite the source? Apologize profusely? Claim there was no violation?

Further complications: what if the original source code was submitted as part of a peer-reviewed article and I was a reviewer? If that paper was rejected, so that the original source code was never actually published, then what are the license terms of the software? E.g., it might be "BSD upon publication" but it was never published.

(The peer review system has long figured out how to handle the problem of ideas transferring from rejected papers to reviewers, but this essay is about copyright and licenses, not ideas.)

What if there are multiple sections of my code, each vaguely like code in other projects? If I want to be safe and generous I could change my license to match all of them, but even free software licenses can be incompatible. What if I had actually recommended that code structure to others at a conference, but there's no paper trail showing the history and we forgot?

The industry solution to this is clean room design. One person or team reviews the code and describes how things work through a specification document. That specification does not contain copyrighted material. Another person or team, who never saw the original source code, takes the specification and implements it.

There's no way we can do full clean-room development in this field. That would mean some people only read code and others only write code, which rather defeats the point of doing code review.

If we encourage peer review of source code, which I think we should do, then how do we deal with this issue? How can I be sure that if I review someone's program then in the future they won't accuse me of taking their code and using it in violation of its license agreement? Or what if I did directly take 20 lines of code, figuring it was too small to count? What recourse do they have?

Most of my publicly available software is under the BSD license. Hence, if someone uses my software in violation of the license, the fix is to add a simple copyright statement to the source code. The violator need make no other changes.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me

Copyright © 2001-2020 Andrew Dalke Scientific AB