Dalke Scientific Software: More science. Less time. Products
[ previous | newer ]     /home/writings/diary/archive/2012/05/18/nonplanar_compounds

Topologically non-planar compounds

John MacCuish, from Mesa Analytics, pointed out that the MCS problem takes polynomial time if the graphs are planar. He writes: "Are there graphs in your likely sets, that are not planar? I have never seen a non-planar small molecule drug for example, but they may be out there. Tests for planarity are also in P. Of course it doesn't mean that solutions in P will be faster than the usual methods since N is small an the overhead for planar MCS may be large, such that N may need to be large for the planar method to beat the non-planar heuristics."

This leads to a couple of questions. One is, what is the shape of the run-time of my MCS algorithm? I don't actually know. Some tests take a very long time, but that's not enough to establish that it's polynomial for real-world compounds or exponential. I'm not going to research this now.

Another is, are there real-world small molecules which are non-planar, in the topological sense, and not the chemistry sense where all of the non-hydrogen atoms line on or near a plane?

Previous work in topologically non-planar compounds

A quick literature seach finds Synthesis of the first topologically non-planar molecule, which says:

We report here the synthesis and characterization of the tris-ether 2,5,14-trioxahexacyclo-[5.5.2.1.24,10.O4,17.O10,17]-heptadecane 3; this topologically unique (graph theory) molecule is prepared via a novel intramolecular rearrangement of either of two isomeric propellane spiro-epoxides, 1 and 2.

There's also Topological stereochemistry. 9. Synthesis and cutting "in-half" of a molecular Möbius strip by Walba, Homan, Richards, and Haltiwanger in New. J. Chem., 1993, 17, 661-681.

Quoting from the above link to Modern Physical Organic Chemistry by Ansyln and Dougherty:

We mention briefly here another topological issue that has fascinated chemists. For the overwhelming majority of organic molecules, we can draw a two-dimensional representation with no bonds crossing each other. ... It may seem surprising, but most molecules have planar graphs.

Recent efforts have produced chemical structures that successfully realize many interesting and novel topologies. A landmark was certainly the synthesis of a trefoil knot using Sauvage's Cu+/phenanthroline templating strategy.... Vögtle and co-workers have described an "all organic" approach to amide-containing trefoil knots, and have been able to separate the two enantiomeric knots using chiral chromatography. Another seminal advance in the field was the synthesis and characterization of a 'Möbius strip' molecule..."

So it's well-established that there are topologically non-planar structures. But are they in compound databases which I can access?

Searching PubChem for topologially non-planar compounds

Take a look at Scaffold Topologies II: Analysis of Chemical Databases by Wester, Pollock, Coutsias, Allu, Muresan and Oprea in PMC 2010 January 15., Published in final edited form as: J Chem Inf Model 2008 July; 48(7): 1311-1324.

Only 12 nonplanar and 2,099 spiro node topologies (all of which are planar) are present in the merged database. 9 of the nonplanar topologies are found only in PubChem and the total number of molecules represented by such topologies in the merged database is a mere 44, agreeing with Walba's assessment concerning the rarity of chemicals with nonplanar graphs.

This establishes that in 2008 there were no more than 44 topologically non-planar structures in PubChem. Can I find them?

I don't think any of the database search engines support this capability, so I need to write a program.

For that I need a method for planarity testing. Various sources say that the linear time algorithm is "widely regarded as being quite complex", but that one of the linear time planarity algorithms is part of Sage

Sage is a comprehensive mathematical software system, which uses Python. There's a pre-built binary distribution for my OS, which I downloaded. It includes Python, the IPython shell, and everything else it needs, so it really is a system and not a set of Python modules.

With Sage it's easy to make a graph and test for planarity.

% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: d = {1: [2,3], 2: [3], 3: [4, 6], 4: [5]}
sage: g = Graph(d)
sage: g.is_planar()
True
sage: 
sage: k3_3 = {1: [2,3,4,5,6], 2: [3,4,5,6], 3: [4, 5, 6]}
sage: Graph(k3_3).is_planar()
False
sage: 

The dictionary I use as input to "Graph" contains an upper-triangle connection matrix. So to test if a molecule is topologically planar, I just need to convert its connectivity information into a dictionary of the right form, turn the dictionary into a Graph, and test the graph for is_planar().

Roadblock! My Python 2.6 modules and Sage's 2.7 Python don't mix

The prebuilt binaries include its own Python distribution, and when you use "sage", it replaces the PYTHONPATH with its own settings. This means when I run sage I don't have access to the cheminformatics tools I've already set up for my system:

% sage
----------------------------------------------------------------------
| Sage Version 5.0, Release Date: 2012-05-14                         |
| Type notebook() for the GUI, and license() for information.        |
----------------------------------------------------------------------
sage: import rdkit
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/<ipython console> in <module>()

ImportError: No module named rdkit
sage: from openeye.oechem import *
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)

/Users/dalke/<ipython console> in <module>()

ImportError: No module named openeye.oechem
sage: os.environ["PYTHONPATH"]
'/Users/dalke/ftps/sage/local/lib/python'
I tried to force it to include the right path, and that failed.
% printenv PYTHONPATH
/Users/dalke/ftps/openeye//wrappers/v2011.Oct.1/python/:/Users/dalke/envs/RDKit_2011_12-svn
% sage
   ...
sage: import sys
sage: sys.path.insert(0, "/Users/dalke/envs/RDKit_2011_12-svn")
sage: import os
sage: os.environ["DYLD_LIBRARY_PATH"]
'/Users/dalke/ftps/sage/local/lib:/Users/dalke/ftps/sage/local/lib/R/lib::/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib:/Users/dalke/ftps/sage/local/lib/R/lib'
sage: from rdkit import Chem
Fatal Python error: Interpreter not initialized (version mismatch?)

------------------------------------------------------------------------
Unhandled SIGABRT: An abort() occurred in Sage.
This probably occurred because a *compiled* component of Sage has a bug
in it and is not properly wrapped with sig_on(), sig_off(). You might
want to run Sage under gdb with 'sage -gdb' to debug this.
Sage will now terminate.
------------------------------------------------------------------------
/Users/dalke/ftps/sage/spkg/bin/sage: line 312: 56741 Abort trap              sage-ipython "$@" -i
What happened here is that Sage ships with Python 2.7, while the locally installed cheminformatics toolkits use Python 2.6.

I decided to use another technique. I would have sage call out to another program to handle the cheminformatics. For each structure it will output a line containing the identifier, the SMILES, and the needed upper-triangle dictionary data structure. One line of output will look like:

('15', 'OC1C2(C(C3C(C4(C(CC3)CC(=O)CC4)C)CC2)CC1)C', {0: [1], 1: [2, 19],
 2: [3, 20, 17], 3: [4, 18], 4: [5, 9], 5: [6, 16], 6: [7, 15, 14],
 7: [8, 10], 8: [9], 10: [11], 11: [12, 13], 13: [14], 16: [17], 18: [19]})
All the sage code needs to do is read those lines, extract the data, pass the graph into Sage for analysis, and print out those which are non-planar.

Even this wasn't as easy as I thought. The PYTHONPATH environment variable persists in spawned processes, which the python subprocess I started uses the wrong PYTHONPATH. I ended up setting the environment variables myself - including PYTHONHOME - before it would work correctly:

import sys
import os
env = os.environ.copy()
env["DYLD_LIBRARY_PATH"] = "/Users/dalke/ftps/openeye//wrappers/libs:/Users/dalke/envs/RDKit_2011_12-svn/lib"
env["PYTHONPATH"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["RDBASE"] = "/Users/dalke/envs/RDKit_2011_12-svn"
env["PYTHONHOME"] = "/System/Library/Frameworks/Python.framework/Versions/2.6"

import subprocess

# Import the "Graph" constructor
from sage.all import *

# Use the version of Python for which RDKit was built to use
p = subprocess.Popen(["/usr/bin/python2.6", "pubchem_connectivity.py"],
                     env = env,
                     stdout = subprocess.PIPE)
                 
for i, line in enumerate(p.stdout):
    id, smiles, d = eval(line)
    G = Graph(d)
    if not G.is_planar():
        print "======= Found one!!!"
        print id, smiles
    if i % 100 == 0:
        sys.stderr.write("%d ...\r" % (i,))
        sys.stderr.flush()

Now I just need the "pubchem_connectivity.py" program to generate the connectivity information.

Call another program to generate the connectivity information

I've been using RDKit these days, because the funding I got for the MCS project said I needed to use RDKit. I usually use OEChem, which (among other things) has much faster structure parsers than RDKit. Since I'm going to read tens of millions of structures, I wanted a way to reduce the number of structures to process.

Now, if a graph has no cycles then it's always planar. Even if it has one, two, or three cycles, it's still impossible for it to be non-planar. The number of cycles in a single component graph is E-V+1, which is the number of edges (bonds), minus the number of vertices (atoms), plus 1. So a simple test is to exclude molecules where the number of bonds is less than three greater than the number of atoms.

Mind you, this is an exclusion test which rejects graphs that cannot be planar. There are plenty of molecules with dozens or rings which are topologically planar. Also, PubChem has plenty of records with multiple components, which means I may miss a few cases. But mind you, my goal is to find some non-planar structures in PubChem, not find all of them.

I had previously converted my local copy of PubChem to a set of compressed SMILES files. A few months ago I wrote a set of Ragel definitions for SMILES, which includes a demonstration of how to use count the number of atoms and bonds in a SMILES file. I can use that unmodified as a co-process to do high-speed counting for me.

import glob
import subprocess
import gzip
import sys
from collections import defaultdict
from rdkit import Chem

# Start a co-process to count the number of atoms and bonds in a SMILES string
p = subprocess.Popen(["/Users/dalke/opensmiles-ragel/smiles_counts"],
                     stdin = subprocess.PIPE,
                     stdout = subprocess.PIPE)

filenames = glob.glob("/Users/dalke/databases/pubchem/*.smi.gz")
for i, filename in enumerate(filenames):
    msg = "Processing %d/%d\n" % (i+1, len(filenames))
    sys.stderr.write(msg)
    sys.stderr.flush()

    for line in gzip.open(filename):
        # Read a line from the data file
        smiles, id = line.split()

        # Send it to the co-process
        p.stdin.write(smiles + "\n")
        p.stdin.flush()

        # Get the counts (looks like "atoms: 34 bonds: 38")
        line = p.stdout.readline()
        _, atoms, _, bonds = line.split()
        num_atoms = int(atoms)
        num_bonds = int(bonds)

        # Skip records which cannot be non-planar
        # (Note: assumes single component structures!)
        if num_bonds < num_atoms + 3:
            continue

        # Extract the topology into upper-triangle dictionary form
        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue
        d = defaultdict(list)
        for bond in mol.GetBonds():
            b1 = bond.GetBeginAtomIdx()
            b2 = bond.GetEndAtomIdx()
            if b1 < b2:
                d[b1].append(b2)
            else:
                d[b2].append(b1)

        # print to stdout so the Sage program can read it
        data = (id, smiles, dict(d))
        print repr(data)
BTW, if you haven't been paying attention, I have one process which tests if a SMILES string has enough bonds in it, another to convert structure information into a simple topology, and a third program to do the graph planarity test. "Bailing-wire and chewing gum!" to repeat an old phrase.

The structures!

It took many hours for my computer to chug along (while I slept). I ended up with 224 of the non-planar SMILES in the subset of 28.5 million PubChem structures I have on my machine. (In other words, bear in mind that this is not a complete search!)

Here are a few structures to show you what they look like:

That first structure, silicon nitride, is a ceramic, and which cannot be expressed in SMILES. (That is "[C]" is an equally poor SMILES representation of graphite as it is for diamond.) The others look like progressively more realistic chemical representations.

The non-planar structures were a bit too verbose to show as a SMILES file. Instead, here's the full list of identifiers, which you can easily use to get the structures yourself (or follow the hyperlink to look at the non-planar depictions).

390566, 498002, 636755, 3084099, 4868274, 5104674, 6712449, 10019039, 10882690, 10895017, 10898366, 10994362, 11027681, 11126005, 11131973, 11187418, 11340053, 11350583, 11360285, 11384163, 11407796, 11672382, 14381365, 14381430, 14381432, 14381435, 16132679, 16132681, 16132995, 16132999, 16133150, 16133259, 16133262, 16133397, 16133412, 16133413, ee16133414, 16133878, 16145580, 16146229, 16146230, 16148442, 16148609, 16148632, 16148888, 16148900, 16149007, 16149114, 16149222, 16149238, 16149361, 16149362, 16149482, 16149579, 16149602, 16149827, 16149958, 16150054, 16150193, 16150399, 16150419, 16150654, 16150658, 16150833, 16150994, 16151169, 16151337, 16151360, 16151567, 16151729, 16151960, 16152362, 16152566, 16152608, 16152699, 16152729, 16153098, 16153207, 16153649, 16154275, 16154327, 16154453, 16154584, 16154971, 16155013, 16155014, 16155015, 16155069, 16155075, 16155076, 16155130, 16155140, 16155152, 16155193, 16155194, 16155197, 16155399, 16155442, 16155443, 16155456, 16155607, 16155626, 16155630, 16155631, 16155632, 16155633, 16155652, 16155784, 16155884, 16155916, 16155917, 16155920, 16155972, 16156042, 16156080, 16156082, 16156145, 16156198, 16156260, 16156261, 16156264, 16156265, 16156312, 16156411, 16156413, 16156417, 16156432, 16156495, 16156539, 16156544, 16156545, 16156609, 16156610, 16156613, 16157557, 16214951, 17749011, 21597602, 21597607, 21597610, 21597611, 21770498, 22294696, 22835058, 22835161, 22835262, 22835624, 22835636, 22835637, 23327291, 23584643, 23726086, 23727886, 23955822, 24764125, 24770227, 24770228, 24770229, 24770290, 24770291, 24871221, 24940071, 25200029, 44239114, 44303783, 44303799, 44303821, 44382489, 44397933, 44397934, 44566282, 44566284, 44575206, 44575207, 44575208, 44592641, 44592642, 44592645, 44592646, 44592647, 44592648, 44592945, 44593582, 44606373, 44606374, 46882313, 46882314, 46882315, 46882316, 46882317, 46882318, 46882319, 46882320, 46882321, 46882322, 46882323, 46882324, 46882325, 46882326, 46882327, 46882328, 46891923, 46895835, 49799159, 49799160, 49873810, 50897242, 50900298, 50900299, 50900300, 50919058, 51004304, 51026319, 52945815, 52952313, 52952314, 52952315, 52952316, 52952317, 53468167, 53468168, 53468169, 53468170, 53468171

Still, having some SMILES on-hand is nice so here are some of the shorter SMILES strings. At the least, you can use these as test-cases for an MCS search engine, and perhaps force a linear-time planar-graph-only method to fail :)

O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 390566
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 498002
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 636755
[Si]123N4[Si]56N1[Si]4(N25)N36 3084099
C123C45C16C24C6C7C(C35)CC(C(C7)C)C 4868274
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 5104674
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 6712449
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 10019039
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 10882690
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)CO 10895017
O1c2c3cccc2Cc4c5c(ccc4)Cc6c7c(ccc6)Cc8c(c(ccc8)C3)OCCOCCN(CCOCCO5)CCOCCOc9c(cccc9)OCCOCCN(CCOCC1)CCOCCO7 10898366
O1C2C34C5C(C(C2)OC(=O)C)(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)COC 10994362
C123C45c6c(cccc6)C17c8c(cccc8)C2(c9c(cccc9)C3(c1c4cccc1)c1c7cccc1)c1c5cccc1 11027681
OC1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 11131973
O1c2c3c(ccc2)OB4Oc5c6c(ccc5)OB1Oc7c(c(ccc7)O4)N63 11187418
C123C45C6C7C8C19C1C(C4C4C%10C22C%11C(C5C5C%12C33C(C6C5)C8CC5C9C(C2C(C35)CC%12%11)CC1%10)C4)C7 11340053
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)O)O)C4C1)CC)(C(C2)O)COC 11350583
C123C4C5C1C6C5(C4C26)C=CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC=C3 11360285
C123C4C5C1C6C5(C4C26)C=CC#CC#CC=CC78C9C1C7C2C1(C9C82)C=CC#CC#CC=C3 11384163
O1C2C34C5C(CN(C3C(C5OC)c6c(cc(c(c6)OC)O)C4C1)CC)(C(C2)O)COC 11407796
O(c1cc2c(cc1OC)C34C56C27c8c(cc(c(c8)OC)OC)C5(c9c3cc(c(c9)OC)OC)c1c(cc(c(c1)OC)OC)C6(c1c7cc(c(c1)OC)OC)c1c4cc(c(c1)OC)OC)C 11672382
O(C12NC(=O)C3C4C56C1C7C8C59C15C6(C3C3C1C(C9C43)C#N)C2C7C5C8=O)C(=O)C 14381365
O1C2C3C45C67C2C8C3C9C42C6(C8C9=O)C3C4C7C(C5C4C2C3C#N)C1=O 14381430
O1C2C3C45C67C2C8C3C9C42C63C8C9OC(=O)C4C2C2C5C(C7C2C34)C1=O 14381432
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O)Br 14381435
O1c2c3c4c5c6c7c3c(cc8c7c(cc6Oc9cc(ccc9)OCCOCCOc3cc1ccc3)C(=O)N(C8=O)C(CCCCCC)C)Oc1cc(ccc1)OCCOCCOc1cc(ccc1)Oc5cc1c4c(c2)C(=O)N(C1=O)C(CCCCCC)C 16214951
BrC12C34C56C7(C8C9C5C5C3C9C1C8C(=O)OC1(C2C2C4C(C6C2C71)OC5=O)Br)Br 21597602
BrC12C34C56C7C8C9C3C7C(=O)OC3C4C4C1C(=O)C(C5(C8C(C29)C(=O)OCC)Br)C4C63 21597607
BrC12C34C56C7C8C9C5C5C3C9C1C8C(=O)OC1C2C2C4C(C6C2C71)OC5=O 21597610
O1C2C3C45C67C2C8C3C9C42C6(C8C9O)C3C4C7C(C5C4C2C3C(=O)O)C1 21597611
O1C23OC(OC4C25C6C7(C1=O)C(C3OC(OC5C(=C)C4CC6)(C)C)C(CCC7)(C)C)(C)C 21770498
ClC1=CC2OC3C1C4OC5C2C4C3C5 22294696
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 23327291
O1c2cc3c4cc2OCCOCCOCCOc5c(cc6c(c5)c(c7c(c6C)cc8c(c7)OCCOCCOCCOc9cc2c(cc9OCCOCCOCCO8)C4(c4c(cccc4)C32C)C)C)OCCOCCOCC1 23584643
O1C2(OCC34C5C2(CCC3OC(=O)C67C4C(OC51)CC(C6)C(=C)C7=O)C)C 23727886
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 23955822
O1C2C3C45C6C1OC(C6(CCC4OC(=O)C37CC(C2)C(=C)C7=O)C)OC5 24764125
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)C6=C2CC(=C(C6)OC)OC 24871221
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303783
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CN(C6Cc7c3c1c(cc7)O)CC4 44303799
O1C2C34C56C(C(C2(C=C5)OC)C(O)(CCCC)C)CCN(C6Cc7c3c1c(cc7)O)CC4 44303821
S(=O)(=O)(OCC12C3C4C(C1)CC(C3)CC4C2)N 44382489
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(c(c6)OC)OC 44592945
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 49873810
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cccc6 50897242
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 50919058
O1C23C4C5(C16OCC7C4(CC(C2(OC(=O)C3(CC5C8C6CC=C9C8(C(=O)C=CC9)C)O)C)OC7=O)C)O 51004304
O=C1N2C34N5CCC(C3CCC4(C1)C=CC5)c6c2cc(cc6)OC 51026319

Feel free to leave a comment.


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2020 Andrew Dalke Scientific AB