MCS background

[ previous | newer ] /home/writings/diary/archive/2012/05/12/mcs_background

MCS background

MCS background
fmcs - find the MCS of a set of compounds
Finding the MCSes for the ChEBI ontology

For the last couple of months I've been working on a new MCS implementation, named fmcs. MCS is short for "Maxium Common Substructure", and is the cheminformatics term for the maximum common subgraph isomorphism problem. I'll be posting a set of essays about it, so I'll start with a background piece.

I thought about going into the details of what that means, but the detailed description was too boring. Here's the short version: given two structures, find the largest substructure which is common to both. These can be connected or disconnected substructures; I'm only concerned with connected substructures. Even then there are still variations; does "maximum" mean to maximize the number of atoms in the subgraph, the number of bonds, the number of cycles, or perhaps maximize some more physical property?

There are also variations in how you define atom and bond equivalency. Two atoms might match if and only if they are of the same element type, or you can be more lenient and say that any halogen matches any other halogen. You can say that bonds only match bonds of the same bond type, or that aromatic bonds are allowed to match single or double bonds, or that ring bonds are only allowed to match other ring bonds.

You can also make requirements on overall structure properties, like if a ring bond is part of the MCS then it must also be a ring in the MCS.

That said, the most common MCS is to say that atoms are the same if and only if the element numbers are the same, and bonds are the same if they have the same bond type (single, double, triple, and aromatic types never match each other.) Here's an example of that MCS between acesulfame and saccharin:

(Image thanks to Daylight's 'depictmatch.cgi.')

Yo dawg, I heard you like NP-hard problems ...

The MCS problem is NP-hard. As the number of atoms increases, the run-time should tend to increase exponentially. An advantage to chemical graphs is that atoms have a limited valence; few atoms have more than 4 bonds. This makes the MCS problem a more tractable than the general case. An advantage to working in small-molecule chemistry is that N is usually in the 10s and rarely ever over 100.

So let's make the MCS problem harder (although not in the theoretical sense; it's still NP hard). Instead of finding the MCS between two compounds, find the MCS of two or more compounds. This is sometimes called the "Mulitple MCS problem."

This need often comes up during clustering, where you have have an algorithm which grouped structures together based on various properties, including experimental results. A question might be, is there a structural explanation for this grouping? One way to answer it is to look at the MCS and see if it's a significant part of the structures.

The usual solutions

The multiple MCS problem is not new. People have developed algorithms to find it for several decades. The 2002 review paper by Raymond and Willett, "Maximum common subgraph isomorphism algorithms for the matching of chemical structures", JCAMD 16: 521b lists a variety of solutions, and says "the clique-based approach has been the most prevalent technique involving MCS-based chemical structure manipulation." They mostly discuss the pairwise MCS, but there is some mention of the multiple MCS as well.

A more recent implementation paper is "MultiMCS: a fast algorithm for the maximum common substructure problem on multiple molecules" by Hariharan, Janakiraman, Nilakantan, Singh, Varghese, Landrum, and Schuffenhauer, J Chem Inf Model. 2011 Apr 25;51(4):788-806. Epub 2011 Mar 29. They use the clique-based approach to find all maximal common substructures between pairs of structures, and use those to find the maximum common substructure of the set.

I myself have worked on the multiple MCS problem before, when I was a Bioreason employee in the late 1990s. There I used a back-tracking solution based on "Backtrack Search Algorithms and the Maximal Common Subgraph Problem" by McGregor, Software-Practice and Experience, vol. 12, 23-34 (1982), and also informed by "An algorithm for the multiple common subgraph problem", Bayada, Simpson, Johnson, and Laurenco, J Chem Inf Model. 1992 v32, 680-685. It too enumerated maximal pair-wise solutions to find the MCS for the entire set. We used it for a clustering algorithm based on substructure similarity.

"Maximal" common subgraphs?

You need to be a bit careful when doing a pairwise MCS. It's easy to think that the MCS for the entire set must be a substructure of the MCS for any two pairs, but it's almost as easy to come up with a counter-example. Consider: CCCSSNNN CCCPPPSS NNNPPPOSS. The MCS between the first two structures is "CCC", between the first and third is "NNN", and between the second and third is "PPP" while the MCS between all three is "SS".

Instead, find all the "maximal" common subgraphs between a pair. These are common subgraphs where it's impossible to add a bond to the subgraph and still be common in both graphs. Any MCS must be part of at least one of the maximal subgraphs between a pair of structures. Once you have a maximal common substruture, find the maximal common substructures between that and the next molecule. Keep applying the MCS to successive pairs until done.

One of the nice advantages of this approach is that you get a partial answer early. If you've gotten to the last pairing and ended up with a common substructure for the entire set containing 6 atoms and 6 bonds, then you can quickly reject any future maximal common substructures which are smaller than that. This is called "branch and bound." The algorithm explores different branches (in this case, different maximal common substructure pairings) and sets a bound which can be used to "prune" the branch, that is exclude further searching along a branch.

That was then, this is now

That was the general approach I did in the late 1990s. It took a lot of mental strain and concentration for about 6 weeks, but at the end, it did work. I even released the implementation as PyMCS, under a Python-like license. You can see remnants of the documentation thanks to Archive.org, but no one seems to have access to a copy of it any more.

While it worked, there were some things I didn't like about it. Molecules with lots of local symmetry cause problems, because there are a large number of ways to get maximal common substructure mappings out of it. There are, after all, 12 ways to map a benzene ring to itself. Also, my implementation used threads, where each thread generated the maximimal common substructures for a pair. These days I would use a generator.

Instead, I came up with an alternate solution, based out of ideas I've been thinking of regarding subgraph enumeration. Enumerate all of the subgraphs of a reference structure, convert each subgraph into a subgraph isomorphism query, and test if the subgraph exists in the other structures. The largest such match is the MCS.

There can be exponentially many subgraphs in a structure, but there are ways to make it faster. Again, use a branch and bound technique. If a subgraph doesn't exist in all of the structures then there's no way that any containing subgraph can exist in all structures. So if you use a growth-based method to enumerate the subgraphs, you can test each subgraph and reject the ones which don't work, as well as any further growth in that direction.

Another optimization, also available in RASCAL and MultiMCS, is to check the size of the remaining search space. If the current size, plus the maximum number of atoms or bonds remaining, isn't better than the current best match, then there's no need to keep on searching.

This ends up finding all of the common substructures, and not just the maximal ones. There's a few advantage to this approach. You only need to find one substructure match for each structure, not all of them. Also, subgraph isomorphism tests are well-optimized core parts of a cheminformatics toolkit, so that's work I don't have to do.

Everything old is new again

I thought that this was a new algorithm. I was mostly wrong. It turns out that Takahashi, Satoh, and Sasaki published this idea in "Recognition of largest common structural fragment among a variety of chemical structures" Anal. Sci. 1987, 3, 23b

I didn't read this paper until this evening. It's really neat to see the similarities between their ideas and mine, as well as the differences. They omit a pre-processing step which removes obvious bond mismatches, and they prefer checking all subgraphs of size N before checking subgraphs of size N+1. I'll need to mull it over, and dig up the enumeration method of Varkony, Shiloach, and Smith in "Computer-assisted examination of chemical compounds for structural similarities", J. Chem. Inf. Comput. Sci. 1979, 19 (2), 104bI can make a more comprehensive comparison. Grrr! And that's only available from behind the ACS paywall.

In that paper they report that the MCS between acesulfame and saccharin, in "level 1" (which corresponds to my "topology" option) took their FORTRAN 77 program 21.15 CPU seconds on a Data General ECLIPSE-MV/6000. My code on my desktop takes 0.010 seconds, or 2000 times faster. With 25 years between us, it's meaningless to make any comparison of these other than the normal exclamation that computers have gotten a lot faster. BTW, my PyMCS code from the late 1990s, using larger structures, took about 10 seconds on average.

Curiously, Raymond and Willett categorize this as an "ad hoc procedure." "This group of algorithms represents a diverse set of methods that have typically been designed specifically to fulfill an immediate need for a particular application without much regard to general or wide-scale usage requirements." Based on my reading of the Takahashi algorithm, it is applicable to general and wide-scale usage requirements. I wonder how they drew that conclusion.

Comments

Do you know what Raymond and Willett commentes as they did? Have you implemented the enumeration method of Varkony? Do you know the binary language of moisture ... ooh, sorry, got distracted. Feel free to leave a comment or ask questions about MCS algorithms and implementations.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me