Dealing with SSSR

[ previous | newer ] /home/writings/diary/archive/2011/06/04/dealing_with_sssr

Dealing with SSSR

This delves into the depths of how different toolkits use the SSSR (Smallest Set of Smallest Rings) algorithm. The end result is that I will be removing support for bits 257-262 of the ChemFP Substruct key because I cannot reliably implement "number of aromatic rings" and "number of hetero-aromatic rings" across multiple toolkits.

Travel note: I will be at the 9^th ICCS conference, leaving tomorrow. Hope to see some of my readers there!

ESSSR

Section 2 of the PubChem fingerprint definition starts:

Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set - These bits test for the presence or count of the described chemical ring system. An ESSSR ring is any ring which does not share three consecutive atoms with any other ring in the chemical structure. For example, naphthalene has three ESSSR rings (two phenyl fragments and the 10-membered envelope), while biphenyl will yield a count of only two ESSSR rings.

Looking at this now, I realized something I missed the previous many times I looked - naphthalene has only two cycles, which means any SSSR algorithm should find two rings. (Otherwise it's not the smallest set.) How come the documentation says there are three ESSSR rings?

SSSR lovers and haters

The CDK implementation of the ESSSR dependent bits uses its SSSR algorithm. For example, bit 115 is set for ">=1 any ring of size 3." The corresponding CDK code is:

            ringSet = finder.findSSSR();
...
        public int countAnyRing(int size) {
            int c = 0;
            for (IAtomContainer ring : ringSet.atomContainers()) {
                if (ring.getAtomCount() == size)
                    c++;
            }
            return c;
        }
...
        b = 115;
        if (cr.countAnyRing(3) >= 1) fp[b >> 3] |= MASK[b % 8];

OpenEye somewhat famously published Smallest Set of Smallest Rings (SSSR) Considered Harmful wherein they point out "SSSR is not an invariant subset of all possible rings" and:

We believe that it is a great service to our customers that we do not include any SSSR functionality in OEChem.

As a consequence, they changed the meaning of the SMARTS "R" definition to indicate the number of ring bonds an atom has rather than the number of SSSR rings that it's in. "[R1]" under Daylight is roughly equivalent to "[R2]" under OpenEye.

Don't use SSSR when SMARTS is close enough

My goal for the ChemFP substructure keys is to develop a useful and especially cross-platform substructure filter definition closely based on the PubChem keys. They don't need to be identical to PubChem. OpenEye doesn't have SSSR support so I decided it would be easier to implement those ring definitions as SMARTS patterns. For bit 115 that's "*~1~*~*1". All of the fixed-sized rings can be done this way.

SMARTS isn't good enough; counting aromatic rings

SMARTS like this are very portable, but there are limits to the types of patterns that SMARTS can handle. Consider these bits:

255     >= 1 aromatic ring
256     >= 1 hetero-aromatic ring
257     >= 2 aromatic rings
258     >= 2 hetero-aromatic rings
259     >= 3 aromatic rings
260     >= 3 hetero-aromatic rings
261     >= 4 aromatic rings
262     >= 4 hetero-aromatic rings

The SMARTS for 255 and 226 are [aR] and [aR;#!6] but there are no SMARTS patterns for the other. It must be done through some other means.

Here's the CDK implementation for bit 257 (2 or more aromatic rings):

        public int countAromaticRing() {
            int c = 0;
            for (IAtomContainer ring : ringSet.atomContainers()) {
                if (isAromaticRing(ring)) c++;
            }
            return c;
        }
...
        b = 257;
        if (cr.countAromaticRing() >= 2) fp[b >> 3] |= MASK[b % 8];

See how it uses the already computed SSSR?

I went ahead an implemented something like it for OpenBabel, RDKit, and Indigo. Here's how to do it:

def openbabel_count_aromatic_rings(mol):
    count = 0
    for ring in mol.GetSSSR():
        # Note: the OB implementation is wrong. It assumes that if all
        # atoms in the ring are aromatic then the ring itself must be
        # aromatic. This is not necessarily true.
        if ring.IsAromatic():
            count += 1
    return count

def rdkit_count_aromatic_rings(mol):
    count = 0
    for ring in mol.GetRingInfo().BondRings():
        if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring):
            count += 1
    return count

def indigo_count_aromatic_rings(mol):
    count = 0
    mol.aromatize() # XXX missing from chemfp!
    for ring in mol.iterateSSSR():
        # bond-order == 4 means "aromatic"; all rings bonds must be aromatic
        if all(bond.bondOrder() == 4 for bond in ring.iterateBonds()):
            count += 1
    return count

(And yes, I know I could have done "count += all(....)".)

Cross-toolkit comparison of the aromatic ring counts

How toolkit-dependent is this definition? I've got a bunch of PubChem data. Here's a driver script using the molecule readers in chemfp. (Do note that the API will change. I'm going to swap the order of molecule and title in the reader iterator.)

from itertools import izip
from chemfp import indigo, openbabel, rdkit

def cross_compare(filename):
    num_zero = num_structures = num_no_consensus = num_mismatches = 0
    toolkit_is_unique = dict.fromkeys( ("indigo", "rdkit", "openbabel"), 0)

    for ((ob_title, ob_mol), (rd_title, rd_mol), (in_title, in_mol)) in (
                              izip(openbabel.read_structures(filename),
                                   rdkit.read_structures(filename),
                                   indigo.read_structures(filename)) ):
        num_structures += 1
        assert ob_title == rd_title == in_title, (ob_title, rd_title, in_title)
        counts = [openbabel_count_aromatic_rings(ob_mol),
                  rdkit_count_aromatic_rings(rd_mol),
                  indigo_count_aromatic_rings(in_mol)]
        num_unique = len(set(counts))
        if num_unique == 1:
            if counts[0] == 0:
                num_zero += 1
            # All the same; not interesting
            #print "%s %s %s %s      EQUAL" % tuple([ob_title] + counts)
            continue
        elif num_unique == 2:
            if counts[0] == counts[1]:
                name = "indigo"
            elif counts[0] == counts[2]:
                name = "rdkit"
            else:
                name = "openbabel"
            toolkit_is_unique[name] += 1
            #print "%s %s %s %s %s" % tuple([ob_title] + counts + [name])
        else:
            print "%s %s %s %s      ALL" % tuple([ob_title] + counts)
            num_no_consensus += 1
        num_mismatches += 1

    print filename
    print "structures: %d  not-relevant: %d  no consensus %d" % (
        num_structures, num_zero, num_no_consensus)
    print "  openbabel: %d rdkit: %d indigo %d" % (toolkit_is_unique["openbabel"],
         toolkit_is_unique["rdkit"], toolkit_is_unique["indigo"])

for filename in (
    "/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz"):
    "/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz",
    cross_compare(filename)

Here's the results using a few different structure files.

/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz
structures: 21054  not-relevant: 29  no consensus 1
  openbabel: 113 rdkit: 1 indigo 1008

/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz
structures: 23563  not-relevant: 334  no consensus 17
  openbabel: 198 rdkit: 4 indigo 2284

/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz
structures: 24025  not-relevant: 478  no consensus 16
  openbabel: 234 rdkit: 3 indigo 2305

/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz
structures: 11595  not-relevant: 301  no consensus 43
  openbabel: 1347 rdkit: 3 indigo 1198

/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz
structures: 24405  not-relevant: 340  no consensus 7
  openbabel: 159 rdkit: 1 indigo 2914

/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz
structures: 24763  not-relevant: 162  no consensus 1
  openbabel: 17 rdkit: 4 indigo 1597

Indigo noticibly disagrees with the consensus. If we just consider OpenBabel and RDKit then there's an toolkit bias of about 1.5-2%.

How many aromatic rings are obvious even without SSSR?

However! How many of these are trivial? That is, most of these are likely independent rings, which of course won't have a problem. What we need to do is compare the number of toolkit differences to the number of times where there could be a difference.

While OpenEye doesn't have SSSR, they do have a "count number of aromatic ring systems" function, and we can use that to compare the number of aromatic rings found by RDKit to the number of aromatic ring systems found by OpenEye.

from itertools import izip

from chemfp import openeye, rdkit

from openeye.oechem import (OEDetermineAromaticRingSystems, OECreateCanSmiString,
                            OEThrow, oeosstream)
# Disable printing OpenEye's warnings
oes = oeosstream()
OEThrow.SetOutputStream(oes)

def rdkit_count_aromatic_rings(mol):
    count = 0
    for ring in mol.GetRingInfo().BondRings():
        if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring):
            count += 1
    return count

def openeye_min_aromatic_rings(mol):
    num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol)
    return num_aromatic_systems

def ring_compare(filename):
    num_zero = num_lt = num_eq = num_gt = 0
    
    for ((oe_title, oe_mol), (rd_title, rd_mol)) in izip(
                  openeye.read_structures(filename),
                  rdkit.read_structures(filename)):
        assert oe_title == rd_title
        num_oe = openeye_min_aromatic_rings(oe_mol)
        num_rd = rdkit_count_aromatic_rings(rd_mol)

        if num_oe == num_rd == 0:
            num_zero += 1
        elif num_oe == num_rd:
            num_eq += 1
        elif num_oe < num_rd:
            num_gt += 1
        else:
            num_lt += 1
            print "LESS?!", oe_title, OECreateCanSmiString(oe_mol)

    print filename, "lt:", num_lt, "equal:", num_eq, "greater:", num_gt


for filename in (
    "/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz",
    "/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz"):
    ring_compare(filename)

In theory RDKit will always find at least the number of aromatic ring systems which OEChem finds, but they have different definitions of what "aromaticity" means, so there are a few cases otherwise.

Toolkit ring counts vs. complex ring systems

Here are the numbers I get:

Compound_001000001_001025000.sdf.gz lt:  1 equal: 16834 greater: 4190 zero:  29
Compound_004000001_004025000.sdf.gz lt:  9 equal: 17424 greater: 5793 zero: 337
Compound_005000001_005025000.sdf.gz lt: 20 equal: 18042 greater: 5485 zero: 478
Compound_006000001_006025000.sdf.gz lt:  2 equal:  9387 greater: 1880 zero: 326
Compound_008000001_008025000.sdf.gz lt: 15 equal: 18558 greater: 5486 zero: 346
Compound_009000001_009025000.sdf.gz lt:  2 equal: 20565 greater: 4034 zero: 162

In other words, only 4190+5793+5485+1880+5486+4034 = 26868 compounds can have different aromatic ring counts because 102488 of the compounds have easily identified rings.

Now go back to the previous numbers. There were 2,068 times where OpenBabel did not agree with the consensus. In other words, 8% of the times where OpenBabel could differ from RDKit, were different. For Indigo it's fully 50% of the time,

BTW, some of the "LESS?!" cases are:

LESS?! 1007243 c1cc(ccc1CSc2c-3ncnc3c(=S)[nH][nH]2)I
LESS?! 4001753 CCOC(=O)c1c(c-2c(n3c(cc2n1)C(=C(C3C)C)C)CC(C)C)C
LESS?! 4002932 c1ccccccccc[cH+]cccccccc1
LESS?! 4003686 COc1ccc(cc1)Nc2c(cc-3ccc(=O)cc3o2)C(=O)Nc4nc(cs4)c5ccc6c(c5)CCCC6
LESS?! 5004733 CCOc1ccc(cc1)n2c(c-3c(nnc3nc2SCC(=O)Nc4cccc(c4)OC)C)N
LESS?! 5015979 Cc1c-2c(=O)cc(nc2n([nH]1)C3CCS(=O)(=O)C3)CCl
LESS?! 6015037 c1cc(ccc1C=Cc2nc(=O)c-3c[nH][nH]c3n2)Cl
LESS?! 8009267 c1ccc(cc1)n2c-3nc(nc(=O)c3c[nH]2)CN=[N+]=[N-]
LESS?! 9017342 Cc1ccc(cc-2c(cc(c12)C=NNC(=O)N)C)C(C)C

I haven't tried to characterize them.

Counting hetero-aromatics rings

The hetero-aromatics are a bit harder to calculate. I need to find aromatic rings which also contain a non-carbon. Here are the implementations for OpenBabel, RDkit and Indigo.

def openbabel_count_hetero_rings(mol):
    count = 0
    for ring in mol.GetSSSR():
        if (ring.IsAromatic() and
               any(mol.GetAtom(atom_id).GetAtomicNum() != 6 for atom_id in ring._path)):
            count += 1
        
    return count

The hardest was RDKit, which doesn't have a "ring" as an independent concept. It has a RingInfo, but that gives you all the atoms in all the rings, or all the bonds in all the rings, as lists of lists of either atom or bond indicies. These are parallel lists, so I can extract the atoms and bonds which correspond to a given ring by doing a zip.

def rdkit_count_hetero_rings(mol):
    count = 0
    ring_info = mol.GetRingInfo()
    ring_atom_info = ring_info.AtomRings()
    ring_bond_info = ring_info.BondRings()
    for (ring_atoms, ring_bonds) in zip(ring_atom_info, ring_bond_info):
        if not all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring_bonds):
            continue
        if any(mol.GetAtomWithIdx(atomIdx).GetAtomicNum() != 6 for atomIdx in ring_atoms):
            count += 1
    return count

And lastly, Indigo:

def indigo_count_hetero_rings(mol):
    count = 0
    for ring in mol.iterateSSSR():
        # bond-order == 4 means "aromatic"; all rings bonds must be aromatic
        if not all(bond.bondOrder() == 4 for bond in ring.iterateBonds()):
            continue
        if any(atom.atomicNumber() != 6 for atom in ring.iterateAtoms()):
            count += 1
    return count

With that in place the change to the consensus SSSR search program is simply:

    #    counts = [openbabel_count_aromatic_rings(ob_mol),
    #              rdkit_count_aromatic_rings(rd_mol),
    #              indigo_count_aromatic_rings(in_mol)]
        counts = [openbabel_count_hetero_rings(ob_mol),
                  rdkit_count_hetero_rings(rd_mol),
                  indigo_count_hetero_rings(in_mol)]

which gives output of

/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz
structures: 21054  not-relevant: 10169  no consensus 1
  openbabel: 113 rdkit: 0 indigo 1006

/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz
structures: 23563  not-relevant: 9313  no consensus 15
  openbabel: 193 rdkit: 2 indigo 2273

/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz
structures: 24025  not-relevant: 10326  no consensus 16
  openbabel: 232 rdkit: 2 indigo 2295

/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz
structures: 11595  not-relevant: 5478  no consensus 24
  openbabel: 1365 rdkit: 0 indigo 1069

/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz
structures: 24405  not-relevant: 12192  no consensus 2
  openbabel: 164 rdkit: 0 indigo 2902

/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz
structures: 24763  not-relevant: 14897  no consensus 0
  openbabel: 15 rdkit: 0 indigo 1541

Looking at the totals, only 67030 of the 129405 structure had an hetero-aromatic ring, and there were 2082 times when the OpenBabel did not agree with the consensus, by which I'll interpret to mean that there's a 3% difference in the counts between RDKit and OpenBabel.

Again you see that RDKit was closest to the consensus. Here's some cases where all three differed. I have not analyzed why:

  CID     OpenBabel   RDKit   Indigo 
4001129      3         7        4
4009493      2         6        3
5022220      6         9        7
6005944      2         1        0
6007096      3         2        1
6007644      2         1        0
6008549      2         4        3

How many hetero-aromatic rings are obvious even without SSSR?

Again, to judge how much of a problem this is we need to figure out how many of these hetero-aromatic ring counts were obvious, that is, where the SSSR implementation is not relevant.

OpenEye's OEDetermineAromaticRingSystems() returns two components. The first is the total number of system, the second is a lookup table from the atom index to the component number that it's in. (Component 0 is reserved for those atoms not in a ring system.)

If I iterate over the non-carbon aromatic atoms, and get the component numbers for each one, and turn them into a set, then the size of the set is the number of ring systems with at least one hetero-aromatic atom. That's confusing to read, and I'm not sure the code is any cleaner to someone who hasn't used the toolkit before, but here it is:

def openeye_min_hetero_rings(mol):
    num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol)

    # Atoms which match [a;!#6]
    hetero_atoms = mol.GetAtoms(OEAndAtom(OEIsAromaticAtom(),
                                          OENotAtom(OEIsCarbon())))

    atom_components = set(parts[atom.GetIdx()] for atom in hetero_atoms)
    return len(atom_components)

and the change to the OEChem/RDKit comparison code is:

        #num_oe = openeye_min_aromatic_rings(oe_mol)
        #num_rd = rdkit_count_aromatic_rings(rd_mol)
        num_oe = openeye_min_hetero_rings(oe_mol)
        num_rd = rdkit_count_hetero_rings(rd_mol)

and the manually cleaned up output is:

Compound_001000001_001025000.sdf.gz lt:  1 equal:  9502 greater: 1318 zero: 10233
Compound_004000001_004025000.sdf.gz lt: 11 equal: 12919 greater: 1201 zero:  9432
Compound_005000001_005025000.sdf.gz lt: 27 equal: 12307 greater: 1214 zero: 10477
Compound_006000001_006025000.sdf.gz lt:  4 equal:  4652 greater:  409 zero:  6530
Compound_008000001_008025000.sdf.gz lt: 15 equal: 10585 greater: 1498 zero: 12307
Compound_009000001_009025000.sdf.gz lt:  0 equal:  9013 greater:  842 zero: 14908

which works out to only 6482 structures which can have a difference.

Toolkit hetero-aromatic ring counts vs. complex hetero-aromatic ring systems

Since there were 2082 cases where OpenBabel gave a different answer than RDKit, that means that about 1/3 of the times where OpenBabel could differ from RDKit, were different.

Unfortunately, it's not as simple as that. Indigo reports 11,000 differences from the consensus, and 11,000 > 6,482. Why is this? Almost certainly because of differences in aromaticity or differences in SSSR determination. If multiple SSSRs are possible, then it's possible that one choice of SSSRs ends up with one hetero-aromatic ring, while another choice gives two such rings.

Someone could go through and figure out why they are different. That person isn't me. Perhaps it's you?

Why do people use SSSR and its variants?

Gilleain Torrance on the CDK mailing list pointed out Counterexamples in Chemical Ring Perception J. Chem. Inf. Comput. Sci., 2004, 44 (2), pp 323-331 DOI: 10.1021/ci030405d (Berger et al.) It clearly points out problems in the different SSSR algorithms.

Why do the existing toolkits use SSSR? I don't know. History? Tradition?

What do people use SSSR for in the existing toolkits? Again, I don't know. Have they evalauted the stability of the results due to atom input order?

What is ESSSR? That's the algorithm PubChem says they use for those bits. The Berger paper doesn't mention ESSSR but it does mention "Extended Set of Smallest Rings (ESSR)" and says it:

... was introduced by Downs et al. [17] as an approach to design an optimal ring set for retrieval purposes. ESSR by definition is limited to planar graphs.

Perhaps that is the ESSR used at PubChem, but not only are some PubChem structure non-planar, but the Berger paper shows examples where ESSR is not correct. It does say that an ESSR variation "ESSR' ... is indeed a useful definition for a chemically relevant 'extended set of smallest rings' that could be generated reasonably efficiently..."

No (E)SSSR will be used in the ChemFP substructure fingerprint

I am so not going to implement that alogorithm for each toolkit, for the sole purposes of computing a couple of extra bits. It's not portable and it appears to be highly variable. I would need a lot of testing to convince myself that something as simple as the atom order doesn't affect the hetero-atom ring counts.

Therefore, I'm going to leave those bits as 0 in ChemFP's implementation, unless I can find something appropriate.

Comments?

Do you use SSSR? How do you handle these problems? Do you have a better suggestion for me than to ignore those bits?

If so, feel free to leave a comment or email me directly.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me