Dealing with SSSR
This delves into the depths of how different toolkits use the SSSR (Smallest Set of Smallest Rings) algorithm. The end result is that I will be removing support for bits 257-262 of the ChemFP Substruct key because I cannot reliably implement "number of aromatic rings" and "number of hetero-aromatic rings" across multiple toolkits.
Travel note: I will be at the 9th ICCS conference, leaving tomorrow. Hope to see some of my readers there!
ESSSR
Section 2 of the PubChem fingerprint definition starts:
Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set - These bits test for the presence or count of the described chemical ring system. An ESSSR ring is any ring which does not share three consecutive atoms with any other ring in the chemical structure. For example, naphthalene has three ESSSR rings (two phenyl fragments and the 10-membered envelope), while biphenyl will yield a count of only two ESSSR rings.Looking at this now, I realized something I missed the previous many times I looked - naphthalene has only two cycles, which means any SSSR algorithm should find two rings. (Otherwise it's not the smallest set.) How come the documentation says there are three ESSSR rings?
SSSR lovers and haters
The CDK implementation of the ESSSR dependent bits uses its SSSR algorithm. For example, bit 115 is set for ">=1 any ring of size 3." The corresponding CDK code is:
ringSet = finder.findSSSR(); ... public int countAnyRing(int size) { int c = 0; for (IAtomContainer ring : ringSet.atomContainers()) { if (ring.getAtomCount() == size) c++; } return c; } ... b = 115; if (cr.countAnyRing(3) >= 1) fp[b >> 3] |= MASK[b % 8];OpenEye somewhat famously published Smallest Set of Smallest Rings (SSSR) Considered Harmful wherein they point out "SSSR is not an invariant subset of all possible rings" and:
We believe that it is a great service to our customers that we do not include any SSSR functionality in OEChem.As a consequence, they changed the meaning of the SMARTS "R" definition to indicate the number of ring bonds an atom has rather than the number of SSSR rings that it's in. "[R1]" under Daylight is roughly equivalent to "[R2]" under OpenEye.
Don't use SSSR when SMARTS is close enough
My goal for the ChemFP substructure keys is to develop a useful and especially cross-platform substructure filter definition closely based on the PubChem keys. They don't need to be identical to PubChem. OpenEye doesn't have SSSR support so I decided it would be easier to implement those ring definitions as SMARTS patterns. For bit 115 that's "*~1~*~*1". All of the fixed-sized rings can be done this way.
SMARTS isn't good enough; counting aromatic rings
SMARTS like this are very portable, but there are limits to the types of patterns that SMARTS can handle. Consider these bits:
255 >= 1 aromatic ring 256 >= 1 hetero-aromatic ring 257 >= 2 aromatic rings 258 >= 2 hetero-aromatic rings 259 >= 3 aromatic rings 260 >= 3 hetero-aromatic rings 261 >= 4 aromatic rings 262 >= 4 hetero-aromatic ringsThe SMARTS for 255 and 226 are [aR] and [aR;#!6] but there are no SMARTS patterns for the other. It must be done through some other means.
Here's the CDK implementation for bit 257 (2 or more aromatic rings):
public int countAromaticRing() { int c = 0; for (IAtomContainer ring : ringSet.atomContainers()) { if (isAromaticRing(ring)) c++; } return c; } ... b = 257; if (cr.countAromaticRing() >= 2) fp[b >> 3] |= MASK[b % 8];See how it uses the already computed SSSR?
I went ahead an implemented something like it for OpenBabel, RDKit, and Indigo. Here's how to do it:
def openbabel_count_aromatic_rings(mol): count = 0 for ring in mol.GetSSSR(): # Note: the OB implementation is wrong. It assumes that if all # atoms in the ring are aromatic then the ring itself must be # aromatic. This is not necessarily true. if ring.IsAromatic(): count += 1 return count
def rdkit_count_aromatic_rings(mol): count = 0 for ring in mol.GetRingInfo().BondRings(): if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring): count += 1 return count
def indigo_count_aromatic_rings(mol): count = 0 mol.aromatize() # XXX missing from chemfp! for ring in mol.iterateSSSR(): # bond-order == 4 means "aromatic"; all rings bonds must be aromatic if all(bond.bondOrder() == 4 for bond in ring.iterateBonds()): count += 1 return count(And yes, I know I could have done "count += all(....)".)
Cross-toolkit comparison of the aromatic ring counts
How toolkit-dependent is this definition? I've got a bunch of PubChem data. Here's a driver script using the molecule readers in chemfp. (Do note that the API will change. I'm going to swap the order of molecule and title in the reader iterator.)
from itertools import izip from chemfp import indigo, openbabel, rdkit def cross_compare(filename): num_zero = num_structures = num_no_consensus = num_mismatches = 0 toolkit_is_unique = dict.fromkeys( ("indigo", "rdkit", "openbabel"), 0) for ((ob_title, ob_mol), (rd_title, rd_mol), (in_title, in_mol)) in ( izip(openbabel.read_structures(filename), rdkit.read_structures(filename), indigo.read_structures(filename)) ): num_structures += 1 assert ob_title == rd_title == in_title, (ob_title, rd_title, in_title) counts = [openbabel_count_aromatic_rings(ob_mol), rdkit_count_aromatic_rings(rd_mol), indigo_count_aromatic_rings(in_mol)] num_unique = len(set(counts)) if num_unique == 1: if counts[0] == 0: num_zero += 1 # All the same; not interesting #print "%s %s %s %s EQUAL" % tuple([ob_title] + counts) continue elif num_unique == 2: if counts[0] == counts[1]: name = "indigo" elif counts[0] == counts[2]: name = "rdkit" else: name = "openbabel" toolkit_is_unique[name] += 1 #print "%s %s %s %s %s" % tuple([ob_title] + counts + [name]) else: print "%s %s %s %s ALL" % tuple([ob_title] + counts) num_no_consensus += 1 num_mismatches += 1 print filename print "structures: %d not-relevant: %d no consensus %d" % ( num_structures, num_zero, num_no_consensus) print " openbabel: %d rdkit: %d indigo %d" % (toolkit_is_unique["openbabel"], toolkit_is_unique["rdkit"], toolkit_is_unique["indigo"]) for filename in ( "/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz"): "/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz", cross_compare(filename)Here's the results using a few different structure files.
/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz structures: 21054 not-relevant: 29 no consensus 1 openbabel: 113 rdkit: 1 indigo 1008 /Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz structures: 23563 not-relevant: 334 no consensus 17 openbabel: 198 rdkit: 4 indigo 2284 /Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz structures: 24025 not-relevant: 478 no consensus 16 openbabel: 234 rdkit: 3 indigo 2305 /Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz structures: 11595 not-relevant: 301 no consensus 43 openbabel: 1347 rdkit: 3 indigo 1198 /Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz structures: 24405 not-relevant: 340 no consensus 7 openbabel: 159 rdkit: 1 indigo 2914 /Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz structures: 24763 not-relevant: 162 no consensus 1 openbabel: 17 rdkit: 4 indigo 1597Indigo noticibly disagrees with the consensus. If we just consider OpenBabel and RDKit then there's an toolkit bias of about 1.5-2%.
How many aromatic rings are obvious even without SSSR?
However! How many of these are trivial? That is, most of these are likely independent rings, which of course won't have a problem. What we need to do is compare the number of toolkit differences to the number of times where there could be a difference.
While OpenEye doesn't have SSSR, they do have a "count number of aromatic ring systems" function, and we can use that to compare the number of aromatic rings found by RDKit to the number of aromatic ring systems found by OpenEye.
from itertools import izip from chemfp import openeye, rdkit from openeye.oechem import (OEDetermineAromaticRingSystems, OECreateCanSmiString, OEThrow, oeosstream) # Disable printing OpenEye's warnings oes = oeosstream() OEThrow.SetOutputStream(oes) def rdkit_count_aromatic_rings(mol): count = 0 for ring in mol.GetRingInfo().BondRings(): if all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring): count += 1 return count def openeye_min_aromatic_rings(mol): num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol) return num_aromatic_systems def ring_compare(filename): num_zero = num_lt = num_eq = num_gt = 0 for ((oe_title, oe_mol), (rd_title, rd_mol)) in izip( openeye.read_structures(filename), rdkit.read_structures(filename)): assert oe_title == rd_title num_oe = openeye_min_aromatic_rings(oe_mol) num_rd = rdkit_count_aromatic_rings(rd_mol) if num_oe == num_rd == 0: num_zero += 1 elif num_oe == num_rd: num_eq += 1 elif num_oe < num_rd: num_gt += 1 else: num_lt += 1 print "LESS?!", oe_title, OECreateCanSmiString(oe_mol) print filename, "lt:", num_lt, "equal:", num_eq, "greater:", num_gt for filename in ( "/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz", "/Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz"): ring_compare(filename)In theory RDKit will always find at least the number of aromatic ring systems which OEChem finds, but they have different definitions of what "aromaticity" means, so there are a few cases otherwise.
Toolkit ring counts vs. complex ring systems
Here are the numbers I get:
Compound_001000001_001025000.sdf.gz lt: 1 equal: 16834 greater: 4190 zero: 29 Compound_004000001_004025000.sdf.gz lt: 9 equal: 17424 greater: 5793 zero: 337 Compound_005000001_005025000.sdf.gz lt: 20 equal: 18042 greater: 5485 zero: 478 Compound_006000001_006025000.sdf.gz lt: 2 equal: 9387 greater: 1880 zero: 326 Compound_008000001_008025000.sdf.gz lt: 15 equal: 18558 greater: 5486 zero: 346 Compound_009000001_009025000.sdf.gz lt: 2 equal: 20565 greater: 4034 zero: 162In other words, only 4190+5793+5485+1880+5486+4034 = 26868 compounds can have different aromatic ring counts because 102488 of the compounds have easily identified rings.
Now go back to the previous numbers. There were 2,068 times where OpenBabel did not agree with the consensus. In other words, 8% of the times where OpenBabel could differ from RDKit, were different. For Indigo it's fully 50% of the time,
BTW, some of the "LESS?!" cases are:
LESS?! 1007243 c1cc(ccc1CSc2c-3ncnc3c(=S)[nH][nH]2)I LESS?! 4001753 CCOC(=O)c1c(c-2c(n3c(cc2n1)C(=C(C3C)C)C)CC(C)C)C LESS?! 4002932 c1ccccccccc[cH+]cccccccc1 LESS?! 4003686 COc1ccc(cc1)Nc2c(cc-3ccc(=O)cc3o2)C(=O)Nc4nc(cs4)c5ccc6c(c5)CCCC6 LESS?! 5004733 CCOc1ccc(cc1)n2c(c-3c(nnc3nc2SCC(=O)Nc4cccc(c4)OC)C)N LESS?! 5015979 Cc1c-2c(=O)cc(nc2n([nH]1)C3CCS(=O)(=O)C3)CCl LESS?! 6015037 c1cc(ccc1C=Cc2nc(=O)c-3c[nH][nH]c3n2)Cl LESS?! 8009267 c1ccc(cc1)n2c-3nc(nc(=O)c3c[nH]2)CN=[N+]=[N-] LESS?! 9017342 Cc1ccc(cc-2c(cc(c12)C=NNC(=O)N)C)C(C)CI haven't tried to characterize them.
Counting hetero-aromatics rings
The hetero-aromatics are a bit harder to calculate. I need to find aromatic rings which also contain a non-carbon. Here are the implementations for OpenBabel, RDkit and Indigo.
def openbabel_count_hetero_rings(mol): count = 0 for ring in mol.GetSSSR(): if (ring.IsAromatic() and any(mol.GetAtom(atom_id).GetAtomicNum() != 6 for atom_id in ring._path)): count += 1 return countThe hardest was RDKit, which doesn't have a "ring" as an independent concept. It has a RingInfo, but that gives you all the atoms in all the rings, or all the bonds in all the rings, as lists of lists of either atom or bond indicies. These are parallel lists, so I can extract the atoms and bonds which correspond to a given ring by doing a zip.
def rdkit_count_hetero_rings(mol): count = 0 ring_info = mol.GetRingInfo() ring_atom_info = ring_info.AtomRings() ring_bond_info = ring_info.BondRings() for (ring_atoms, ring_bonds) in zip(ring_atom_info, ring_bond_info): if not all(mol.GetBondWithIdx(bondIdx).GetIsAromatic() for bondIdx in ring_bonds): continue if any(mol.GetAtomWithIdx(atomIdx).GetAtomicNum() != 6 for atomIdx in ring_atoms): count += 1 return countAnd lastly, Indigo:
def indigo_count_hetero_rings(mol): count = 0 for ring in mol.iterateSSSR(): # bond-order == 4 means "aromatic"; all rings bonds must be aromatic if not all(bond.bondOrder() == 4 for bond in ring.iterateBonds()): continue if any(atom.atomicNumber() != 6 for atom in ring.iterateAtoms()): count += 1 return count
With that in place the change to the consensus SSSR search program is simply:
# counts = [openbabel_count_aromatic_rings(ob_mol), # rdkit_count_aromatic_rings(rd_mol), # indigo_count_aromatic_rings(in_mol)] counts = [openbabel_count_hetero_rings(ob_mol), rdkit_count_hetero_rings(rd_mol), indigo_count_hetero_rings(in_mol)]which gives output of
/Users/dalke/databases/pubchem/Compound_001000001_001025000.sdf.gz structures: 21054 not-relevant: 10169 no consensus 1 openbabel: 113 rdkit: 0 indigo 1006 /Users/dalke/databases/pubchem/Compound_004000001_004025000.sdf.gz structures: 23563 not-relevant: 9313 no consensus 15 openbabel: 193 rdkit: 2 indigo 2273 /Users/dalke/databases/pubchem/Compound_005000001_005025000.sdf.gz structures: 24025 not-relevant: 10326 no consensus 16 openbabel: 232 rdkit: 2 indigo 2295 /Users/dalke/databases/pubchem/Compound_006000001_006025000.sdf.gz structures: 11595 not-relevant: 5478 no consensus 24 openbabel: 1365 rdkit: 0 indigo 1069 /Users/dalke/databases/pubchem/Compound_008000001_008025000.sdf.gz structures: 24405 not-relevant: 12192 no consensus 2 openbabel: 164 rdkit: 0 indigo 2902 /Users/dalke/databases/pubchem/Compound_009000001_009025000.sdf.gz structures: 24763 not-relevant: 14897 no consensus 0 openbabel: 15 rdkit: 0 indigo 1541Looking at the totals, only 67030 of the 129405 structure had an hetero-aromatic ring, and there were 2082 times when the OpenBabel did not agree with the consensus, by which I'll interpret to mean that there's a 3% difference in the counts between RDKit and OpenBabel.
Again you see that RDKit was closest to the consensus. Here's some cases where all three differed. I have not analyzed why:
CID OpenBabel RDKit Indigo 4001129 3 7 4 4009493 2 6 3 5022220 6 9 7 6005944 2 1 0 6007096 3 2 1 6007644 2 1 0 6008549 2 4 3
How many hetero-aromatic rings are obvious even without SSSR?
Again, to judge how much of a problem this is we need to figure out how many of these hetero-aromatic ring counts were obvious, that is, where the SSSR implementation is not relevant.
OpenEye's OEDetermineAromaticRingSystems() returns two components. The first is the total number of system, the second is a lookup table from the atom index to the component number that it's in. (Component 0 is reserved for those atoms not in a ring system.)
If I iterate over the non-carbon aromatic atoms, and get the component numbers for each one, and turn them into a set, then the size of the set is the number of ring systems with at least one hetero-aromatic atom. That's confusing to read, and I'm not sure the code is any cleaner to someone who hasn't used the toolkit before, but here it is:
def openeye_min_hetero_rings(mol): num_aromatic_systems, parts = OEDetermineAromaticRingSystems(mol) # Atoms which match [a;!#6] hetero_atoms = mol.GetAtoms(OEAndAtom(OEIsAromaticAtom(), OENotAtom(OEIsCarbon()))) atom_components = set(parts[atom.GetIdx()] for atom in hetero_atoms) return len(atom_components)and the change to the OEChem/RDKit comparison code is:
#num_oe = openeye_min_aromatic_rings(oe_mol) #num_rd = rdkit_count_aromatic_rings(rd_mol) num_oe = openeye_min_hetero_rings(oe_mol) num_rd = rdkit_count_hetero_rings(rd_mol)and the manually cleaned up output is:
Compound_001000001_001025000.sdf.gz lt: 1 equal: 9502 greater: 1318 zero: 10233 Compound_004000001_004025000.sdf.gz lt: 11 equal: 12919 greater: 1201 zero: 9432 Compound_005000001_005025000.sdf.gz lt: 27 equal: 12307 greater: 1214 zero: 10477 Compound_006000001_006025000.sdf.gz lt: 4 equal: 4652 greater: 409 zero: 6530 Compound_008000001_008025000.sdf.gz lt: 15 equal: 10585 greater: 1498 zero: 12307 Compound_009000001_009025000.sdf.gz lt: 0 equal: 9013 greater: 842 zero: 14908which works out to only 6482 structures which can have a difference.
Toolkit hetero-aromatic ring counts vs. complex hetero-aromatic ring systems
Since there were 2082 cases where OpenBabel gave a different answer than RDKit, that means that about 1/3 of the times where OpenBabel could differ from RDKit, were different.
Unfortunately, it's not as simple as that. Indigo reports 11,000 differences from the consensus, and 11,000 > 6,482. Why is this? Almost certainly because of differences in aromaticity or differences in SSSR determination. If multiple SSSRs are possible, then it's possible that one choice of SSSRs ends up with one hetero-aromatic ring, while another choice gives two such rings.
Someone could go through and figure out why they are different. That person isn't me. Perhaps it's you?
Why do people use SSSR and its variants?
Gilleain Torrance on the CDK mailing list pointed out Counterexamples in Chemical Ring Perception J. Chem. Inf. Comput. Sci., 2004, 44 (2), pp 323-331 DOI: 10.1021/ci030405d (Berger et al.) It clearly points out problems in the different SSSR algorithms.
Why do the existing toolkits use SSSR? I don't know. History? Tradition?
What do people use SSSR for in the existing toolkits? Again, I don't know. Have they evalauted the stability of the results due to atom input order?
What is ESSSR? That's the algorithm PubChem says they use for those bits. The Berger paper doesn't mention ESSSR but it does mention "Extended Set of Smallest Rings (ESSR)" and says it:
... was introduced by Downs et al. [17] as an approach to design an optimal ring set for retrieval purposes. ESSR by definition is limited to planar graphs.Perhaps that is the ESSR used at PubChem, but not only are some PubChem structure non-planar, but the Berger paper shows examples where ESSR is not correct. It does say that an ESSR variation "ESSR' ... is indeed a useful definition for a chemically relevant 'extended set of smallest rings' that could be generated reasonably efficiently..."
No (E)SSSR will be used in the ChemFP substructure fingerprint
I am so not going to implement that alogorithm for each toolkit, for the sole purposes of computing a couple of extra bits. It's not portable and it appears to be highly variable. I would need a lot of testing to convince myself that something as simple as the atom order doesn't affect the hetero-atom ring counts.
Therefore, I'm going to leave those bits as 0 in ChemFP's implementation, unless I can find something appropriate.
Comments?
Do you use SSSR? How do you handle these problems? Do you have a better suggestion for me than to ignore those bits?
If so, feel free to leave a comment or email me directly.
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB