Molecular Coding
In this interview, recorded 9 June 2011, I talked with Igor Filippov. He’s the main author of OSRA, the Optical Structure Recognition Application. It extracts chemical structure information from printed material, and is used in a number of chemistry applications. Igor works at the NCI/CADD group of the Chemical Biology Laboratory at the National Cancer Institute.
AD: Hello, my name is Andrew Dalke. Welcome to episode four of "Molecular Coding."
AD: In the summer of 2011 I went to the ICCS Conference in The Netherlands. That's the International Conference on Chemical Structures, which meets every three years. It was both enjoyable and informative.
AD: Many people mentioned that they used OSRA to extract chemical structure information from printed material. The OSRA author, Igor Filippov, was also at ICCS. We got a few minutes during a break to talk more about OSRA.
AD: This interview took place in Noordwijkerhout, The Netherlands on 9 June 2011.
*music*
AD: I'm here at ICCS with Igor Filippov. He's the main author of OSRA, the Optical ...
IF: Structure Recognition Application.
AD: It's the graphics program that's been used actually in a number of presentations here, plus of course you had your poster. Could you tell me a bit more about it?
IF: OSRA is a project that started about four years ago in 2007. Basically it's a utility to convert images of chemical structures such as found in articles and patents into SMILES, SD file, or pretty much anything that Open Babel can produce, [like] InChI and InChIKeys.
AD: How did you get started with it? Was it something your group started to work on or was it something that you were interested in?
IF: It was just a hobby project at first. I got interested in it then it got developed and people started using it and it got more useful to others. It's actually quite a nice feeling to have something fun to work on that is being used by other people too.
AD: When you say that it started out as a hobby project, were you doing this as a side part of work, or was it something totally different? You saw chemicals and you thought "I would like to process that"?
IF: Pretty much like that, yes. I heard about the general idea of software like that and I thought to myself "it's impossible to do this, it's too complicated."
AD: That's what I think.
IF: Then I tried it. I tried different vectorization algorithms, I tried OCR programs, and then I tried combining them together, because basically that's what it is. You take an image, you vectorize and get your bonds. You do OCR on the atomic labels and you've got your atoms. You combine them together and produce the molfile.
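The combining step Igor describes can be illustrated with a minimal Python sketch. This is not OSRA's code: the function name, the input shapes, and the `snap` and `label_radius` thresholds are all assumptions made for illustration. Bond segments are taken as already vectorized and labels as already recognized; endpoints that nearly coincide become one atom, and a vertex with no nearby label defaults to carbon.

```python
from math import hypot


def assemble_molecule(segments, labels, snap=0.2, label_radius=0.5):
    """Combine vectorized bond segments and OCR labels into atoms and bonds.

    segments: list of ((x1, y1), (x2, y2)) bond candidates from vectorization.
    labels:   list of (symbol, x, y) atomic labels from OCR.
    All names and thresholds here are hypothetical, not OSRA's own.
    """
    atoms = []  # each entry is [x, y, symbol]
    bonds = []  # each entry is (atom_index, atom_index)

    def atom_index(x, y):
        # Reuse an existing atom when this endpoint lies within snap distance.
        for i, (ax, ay, _symbol) in enumerate(atoms):
            if hypot(ax - x, ay - y) <= snap:
                return i
        # An unlabeled vertex in a structure drawing is a carbon by default.
        atoms.append([x, y, "C"])
        return len(atoms) - 1

    for (x1, y1), (x2, y2) in segments:
        bonds.append((atom_index(x1, y1), atom_index(x2, y2)))

    for symbol, lx, ly in labels:
        if not atoms:
            continue
        # Attach each label to the nearest bond endpoint, if it is close enough.
        nearest = min(atoms, key=lambda a: hypot(a[0] - lx, a[1] - ly))
        if hypot(nearest[0] - lx, nearest[1] - ly) <= label_radius:
            nearest[2] = symbol

    return [a[2] for a in atoms], bonds


# Two bonds in a row with an "O" label just past the last endpoint.
symbols, bonds = assemble_molecule(
    segments=[((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (2.0, 0.0))],
    labels=[("O", 2.3, 0.0)],
)
print(symbols, bonds)
```

Writing the result out as a molfile would be a separate step, which OSRA delegates to Open Babel.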
AD: I saw in the list of programs that OSRA uses that you have a dozen or so different other tools.
IF: I prefer not to reinvent the wheel and so I try to reuse what was done before me. OSRA has a very small codebase for such a project because it's basically a glue between various libraries that existed before. There are two different OCR engines for label recognition - JOCR and ocrad. There's a vectorization library - potrace - which does the raster to vector conversion. There's the graphics processing library - GraphicsMagick. There's Open Babel of course for generating the output molecular structure. There's a couple other additional technical libraries that OSRA is using.
AD: You use the two OCR readers, which I'm surprised that you're using not one but two different ones, and that's to work out the text. What you added to this then was the recognition of what double bonds are and do you do chirality and wedges and all that sort of chemistry?
IF: Yes. Well, I should say that none of those libraries were particularly developed for chemical structure recognition, so they're good at what they do but they're not 100% aligned with my project, so there's a lot of things I had to modify, pre-process, to make it fit. For example, the output of potrace - the vectorization algorithm - it's not like you get bonds directly like a single vector. You have an assembly of vectors that will constitute a single bond but you still have to recognize that vectors 1, 2, 3, 4 it's actually a single bond and that vectors 5, 6, 7, 8 it's actually a part of a double bond somewhere else. Yeah, you have to do some pre- and post-processing from the output of this program.
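One part of that post-processing, deciding that two traced lines belong to the same double bond, can be sketched in Python. This is a simplified illustration under stated assumptions, not the method OSRA uses: segments are assumed to be straight lines given by their endpoints, and the `max_gap` and `angle_tol` thresholds and all function names are hypothetical. Two segments are paired when they are nearly parallel and their midpoints lie close together.

```python
from math import atan2, hypot, pi


def direction(segment):
    """Angle of a segment in [0, pi), ignoring which end comes first."""
    (x1, y1), (x2, y2) = segment
    return atan2(y2 - y1, x2 - x1) % pi


def midpoint(segment):
    """Center point of a segment."""
    (x1, y1), (x2, y2) = segment
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def find_double_bonds(segments, max_gap=0.3, angle_tol=0.1):
    """Return index pairs of segments that look like the two lines of a double bond."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            diff = abs(direction(segments[i]) - direction(segments[j]))
            # Angles wrap around at pi, so 0.01 and pi - 0.01 are nearly parallel.
            parallel = min(diff, pi - diff) <= angle_tol
            (ax, ay), (bx, by) = midpoint(segments[i]), midpoint(segments[j])
            close = hypot(ax - bx, ay - by) <= max_gap
            if parallel and close:
                pairs.append((i, j))
    return pairs


segments = [
    ((0.0, 0.0), (1.0, 0.0)),    # first line of a double bond
    ((0.0, 0.2), (1.0, 0.2)),    # second line, parallel and close by
    ((1.0, 0.0), (1.5, 0.87)),   # a neighbouring single bond at an angle
]
print(find_double_bonds(segments))
```

The sketch does not merge the several short vectors that make up one line, which is the other half of the problem Igor mentions.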
IF: Speaking of OCR, none of the existing OCR engines, especially open source OCR engines, are very good at recognizing single characters. Usually the focus is on recognizing the whole text. There you can do a lot of things with dictionary-based corrections where if you have a word partially recognized, the engine can correct itself, just having a huge vocabulary. You cannot do this, or not very easily do this, with single character atomic labels. I'm using two OCR engines because I feel that their combined strengths lead to a better recognition rate. As a matter of fact, optionally there are two more OCR engines you can compile in, so it can be up to four different OCR engines there.
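The interview does not say how OSRA reconciles the answers from its OCR engines. One simple possibility is a majority vote, sketched here in Python. Everything in this sketch is an assumption: the engines are represented only by their guesses (a character, or `None` when an engine gives up), and `ALLOWED` is a hypothetical whitelist of characters that can appear in an atomic label.

```python
from collections import Counter

# Hypothetical whitelist: letters and digits that may appear in atomic labels.
ALLOWED = set("BCFHINOPRSXclr0123456789")


def vote(guesses, allowed=ALLOWED):
    """Pick the most common valid guess; ties go to the earliest engine."""
    valid = [g for g in guesses if g in allowed]
    if not valid:
        return None  # no engine produced a usable character
    counts = Counter(valid)
    best = max(counts.values())
    for g in valid:
        if counts[g] == best:
            return g


print(vote(["O", "0", "O"]))   # two engines agree on the letter O
print(vote([None, "N"]))       # only one engine produced an answer
print(vote(["?", None]))       # nothing usable
```

A whitelist stands in, very roughly, for the dictionary correction that full-text OCR relies on: it cannot fix a wrong guess, but it can discard characters that make no sense in a structure drawing.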
AD: Nice.
AD: How does this handle superscript and subscript when you're doing isotope labeling or side groups?
IF: Poorly. It does try to recognize subscripts, especially on Markush labels R1, R2 and so on. If the scan quality is good and the characters are sufficiently large then it can recognize it, but the smaller the character the less chance it will get recognized correctly.
AD: It sounds like there are several different validation sets. There's various patent office data sets and things like that. How easy has it been to get the validation data that you need?
IF: Not easy at all. Having the validation set is essential. Otherwise you cannot benchmark your performance, you cannot move forward if you don't know how much better your version X is doing than X-1, for example. Initially I had a set of very diverse structures from the web, from articles, for which I had to draw the SD file by hand myself. Recently, with the help of John Kinney of DuPont and Steve Boyer from IBM I acquired this huge data set from USPTO of 6,000 molecules. That was absolutely essential to get OSRA to produce better and better results. Originally it was a Complex Work Unit initiative at the USPTO. They have people who redraw the structures in a molecular editor and save molfiles, so you have both image and corresponding molfiles.
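The kind of version-to-version comparison Igor describes can be sketched as follows. This is a minimal Python illustration, not OSRA's benchmark: the image names and structures are made-up toy data, and the sketch assumes that both the reference and the predicted structures have already been converted to the same canonical string form (for example canonical SMILES or InChI from a toolkit such as Open Babel), so that plain string equality is a fair test.

```python
def recognition_rate(predicted, reference):
    """Fraction of reference structures that were reproduced exactly.

    predicted, reference: dicts mapping an image name to a canonical
    structure string. A missing prediction counts as a failure.
    """
    hits = sum(1 for name, ref in reference.items() if predicted.get(name) == ref)
    return hits / len(reference)


# Toy data, invented for illustration only.
reference = {"img1": "CCO", "img2": "c1ccccc1", "img3": "CC(=O)O", "img4": "CN"}
old_version = {"img1": "CCO", "img2": "C1CCCCC1", "img3": "CC(=O)O"}
new_version = {"img1": "CCO", "img2": "c1ccccc1", "img3": "CC(=O)O", "img4": "CO"}

old_rate = recognition_rate(old_version, reference)
new_rate = recognition_rate(new_version, reference)
print(f"old: {old_rate:.2f}  new: {new_rate:.2f}")
```

Exact matching is a strict measure; a structure with a single wrong atom scores the same as a complete miss.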
AD: When someone at DuPont, AstraZeneca - the people in the consortium - are working on their work, how much of what they do feeds back into what you do? You get test data from them. What about code changes and improvements in the algorithms?
IF: There were some suggestions and recommendations, and maybe not directly code input but John recommended some improvements in the algorithm to recognize tables in the text. If you have a table it will throw off the recognition algorithm because it's also a linear graphic and looks kind of similar to a molecule. John made some suggestions. He coded something for himself. While I didn't take his code directly, I was absolutely using his recommendations and improved my recognition engine.
IF: Right now there is another guy who's working on the code itself, Dmitry Katsubo, from the European Patent Office. He made tremendous input, especially since he's more on the formal programming side. He did a completely new compilation system where you don't have to muck around with makefiles by hand. It's regular autotools generated "configure; make; make install".
AD: And that includes checking if all the tools exist and optionally compiling them in?
IF: Yep. It's much easier now to produce [a program] out of source code.
AD: I saw several people here using their iPods to take a picture of the screen during the presentation. If somebody wanted to write a tool that was to sit inside the iPod, take a picture, process it ... has someone done that?
IF: Yes.
AD: How hard is it to do that?
IF: It wasn't really my project. There's a company, I believe the name is Eidogen-Sertanty, they have a tool that works exactly like that. You take a picture with the iPhone and get a structure back which you can edit in their own editor on the iPhone or iPad. I believe OSRA is running on a remote server so the processing is not done on the phone itself. They load the image, process it, and get the SD file from the server.
AD: Because then you would have to distribute all those different libraries; download them and add them on the iPod.
IF: Well it's possible. I think the main problem is not to compile the code on iPhone or iPod. The problem is the performance is probably not quite there yet. It might take some seconds to process an image. My feeling is that it would be much more efficient to have the processing done on a big server than on a small iPhone. So far. Maybe next year they will have better processors on the iPhone.
AD: I saw on your CHANGELOG that you spent some time now optimizing the code to make it faster.
IF: Yes. And also Dmitry was very helpful in that. I think we've done quite a good job compared to a version of a couple of years ago. We improved the performance by a factor of three or four. Some of the main changes were code refactoring, making the code more lean and efficient. Also, I changed from ImageMagick to GraphicsMagick which is compatible but much faster. Most of the improvement came from this small change.
AD: Where's most of the time being spent?
IF: There are two factors where OSRA is taking its time. First of all, page segmentation -
AD: Sorry, what is page segmentation?
IF: Page segmentation. If you have a document where you have text and molecular structures all mixed together on the same page, you somehow have to extract the structure out of the rest of the page because, for the OSRA project at least, you don't care about the text. We want to process only the structure. This process is called page segmentation. It's fairly time consuming. On the other hand I believe the OSRA algorithm is quite efficient, in that it can very often guess correctly that this is the text and we are not interested in processing that, and so we are not spending time on blocks of text or some photographs. Often there are some pictures of mice, for example, in documents.
AD: Do you ever get a mouse identified as a compound?
IF: Sometimes. It happened quite frequently actually in the past. Now it's getting better I believe.
AD: I also saw that the vendor sites mention that they work with OSRA. You support plugins for Symyx Draw, and ChemBioDraw, and BKChem. How many people are actually using your code in the world?
IF: I can only track the direct downloads. I guess we had one or two thousand downloads. We have two different distribution sites: SourceForge and our NCI/CADD web site so you have to combine them. If somebody wants to use it and they don't tell me, I won't necessarily know that they want to use it.
AD: The up-side of open source is that you're using so many open source projects. The downside is you don't get quite the same feedback from people using it. I like going to conferences like [ICCS]. People come up to me and say "oh, thank you for this project" or "I like using that tool."
IF: Yes, exactly. It's a nice feeling to hear that it's useful. People sometimes email me with questions or say "hey, it's a good tool," they're using it. From places quite unexpected such as the International Union of Crystallography, there's a university in Australia where they're using OSRA. It's a good feeling to know that something you are working on is being useful for others.
AD: What would be the best way to support the project? Would it be developers, or test data, or people with image processing experience?
IF: All of the above. The test data is absolutely essential and it's very hard to produce. It's hard to validate hundreds - I'm not even talking about thousands - of structures by hand, and it's necessary to have it and from as many diverse sources as possible. We have good test sets from USPTO and Japanese patent office. It would be nice to have some similar tests from WIPO, from EPO, from Chinese patents.
AD: Have you downloaded the images that are in Wikipedia? They have the structure and a link to the PubChem id, and many of those structures in Wikipedia are actually drawn by hand.
IF: I have not done this.
AD: I just learned this a day or two ago about people doing stuff that way. Finding good images or finding correlations between, say, CAS id and SMILES by going through Wikipedia to look up PubChem to get the actual data they want.
IF: I have not done this. This is an interesting idea, yes.
AD: You're working at NCI/CADD. What do they do and how do they support your work?
IF: Nowadays it's officially part of my responsibility. Before it was more a hobby project. It's not that I'm spending full time working on OSRA. I have other responsibilities there as well, and other projects. I'm working under the direction of Marc Nicklaus. He was very supportive. He was very appreciative of the project.
AD: How is this project funded?
IF: It's funded along with the rest of the CADD group by NCI.
AD: You were telling me you came into this field as a physicist. How did you get involved in doing image recognition of structures?
IF: I was doing my PhD at The Ohio State University. There I got acquainted with Jan Labanowski, the maintainer of the Computational Chemistry List (CCL). I was working on CCL helping him administer and maintain it for a while. That's how I got connected with the world of cheminformatics. After graduation I joined Marc Nicklaus's group, and that's how I got myself into all of this area.
AD: Were you doing software development since you were young?
IF: Pretty much. In physics, for my PhD thesis, it was basically C++ and Mathematica because it was building theoretical models and doing calculations. I got involved with programming since I was 12-13. First it was BASIC and Pascal in school, then it was C and C++ and Perl. Now I'm interested in Python. It seems like a very interesting approach. Then a little bit of a lot of things.
AD: Thank you very much for your time. It was interesting hearing more about OSRA.
IF: Thank you.
*music*
AD: Thank you for listening to Molecular Coding. This podcast and transcript are distributed under the Creative Commons Attribution-Share-Alike 3.0 Unported license. The theme music was composed and performed by Andreas Steffen. I'm Andrew Dalke.
Saturday, November 10, 2012
Igor Filippov and OSRA