Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2012/05/13/fmcs

fmcs - find the MCS of a set of compounds

  1. MCS background
  2. fmcs - find the MCS of a set of compounds
  3. Finding the MCSes for the ChEBI ontology

My new MCS program is named "fmcs." It uses RDKit for the cheminformatics and is is available under the BSD 2-clause license. Many thanks to Roche for funding the project and making it available under a free license. Their hope and mine is that fmcs will be included someday as part of RDKit. For now, you can download it from the Bitbucket site. My hope is that people will fund me to continue developing the program.

Show me an example

I'll start with a set of 3669 compounds which I know have a benzotriazole core. I know this because that data set come from doing a substructure search. If you want to try this yourself, the SD file is part of the fmcs distribution.

% fmcs sample_files/benzotriazole.sdf --verbose
Loaded 3669 structures from sample_files/benzotriazole.sdf     
[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search)
Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)
The core result is the SMARTS pattern on the second line, which represents the common substructure. The first and third line show text sent to stderr when the "--verbose" flag is used. In this case, it took 2.76 seconds to load the 3669 structures, 2.66 seconds to remove atom-bond-atom patterns which aren't in all of the structures, 0.42 seconds to select the reference structure, and only 0.06 seconds to do the actual MCS enumeration algorithm. (It's so fast for this case because the preprocessing stage was enough to identify the MCS.)

Most people can't look at a SMARTS like "[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2" and easily figure out what it means, so I'll use Karen Schomburg's excellent SMARTSviewer to render it graphically:

as well as RDKit's image renderer to show the match to the first 10 structures of that data file.

Alternate atom and bond comparisons

ignore atom and bond types

By default, the subgraph matches atoms by element and bonds by bond type. There are other options. The "--compare topology" option lets any atom match any other atom, and any bond match any other bond.

% fmcs sample_files/ar_clustered_3D_MM_3.sdf
[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6](-[#6]-[#6]-[#6]-[#6])-[#6] 14 atoms 13 bonds (complete search)
% fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare topology
[*]~1~[*]~[*]~[*]~[*]~2~[*]~[*]~[*]~3~[*](~[*]~1~2)~[*]~[*]~[*]~1~[*]~[*]~[*]~[*]~1~3 17 atoms 20 bonds (complete search)
[c-2ec28d18-74736162:~/cvses/fmcs] dalke% 
Here is a graphical comparison of those two patterns:



That's very interesting. It looks like there's a common substructure, based on the topology, but not when using on the default match parameters. Why is this?

It's a matter of bond perception. Some of the rings found by the topology search are aromatic in some structures but not in others, and some of the bonds are double in one and single in others. I know I gave PubChem a SMARTS pattern to find those structures. I wonder how it ended up like it did.

Match atom types but ignore bond types

The "--compare topology" flag is a shortcut for "--atom-compare any --bond-compare any". You can chose different options for comparing atom types and bond type. For example, if you want to compare atoms by element and ignore bond types then you can do "--atom-compare elements --bond-compare any".

Coming up with a good example of when to use it is left to the student as an exercise. (Let me know the answer when you figure it out, so I can include it here.)

Bond modifiers

Chemists like rings. That's because rings are important. A problem with MCS is that sometimes it matches a long chain of atoms with a strange routing through a ring system. Chemists don't always like that.

If you're one of those people, use the "--ring-matches-ring-only" flag. This make ring atoms match only ring atoms and chain atoms only match chain atoms.

Even that might be too generous. Sometimes chemists really want to see a complete ring in the MCS, and not just a partial one. That is, if the MCS includes a bond which maps to a ring bond in the original structures, then the MCS bond must also be in a ring.

If that's what you want, use the "--complete-rings-only" option.

When I figure out good uses cases for these, I'll update the primary documentation. (I'm told that they are important, but I don't have a good example for you to understand why.)

All that, and more!

I'm still working on the documentation for the newest features, added during the last couple of weeks. You can hijack the "isotopes" atom matcher to define your own atom classes, and specify those classes through a tag in the SD file. Instead of showing the SMARTS as the result, you can see the fragment as SMILES, or you can save the results to an SD file.

Comments

Do you have any comments or questions?


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB