Implementing the CACTVS/PubChem substructure keys

[ previous | newer ] /home/writings/diary/archive/2011/01/20/implementing_cactvs_keys

Implementing the CACTVS/PubChem substructure keys

If you are doing a substructure search against a large chemical compound database then you probably don't want to do brute-force subgraph isomorphism searches. In other words, don't do a SMARTS match over every compound in the database - it's slow. You can speed things up by precomputing certain indicies.

As a simple example, if the query contains a triple bond then there's no need to search the targets which don't have a triple bond since you know it's not going to match. This is a simple screen which removes obvious mismatches before doing the full match.

I'll make it a bit more selective and include a filter for "contains a ring." When a query comes in, see if it has a triple bond and see if it has a ring. There are four cases:

   Has triple   Has      then do the full
     bond?     ring?      full search on
   ==========  =====    ===================
      no        no      the entire data set
     yes        no      those targets with a triple bond
      no       yes      those targets with a ring
     yes       yes      those targets with a triple bond and a ring

Now add a filter for "has at least one chlorine" and another for "contains a ring of size 6" and another for "contains two rings of size 6" and another for "carbon bonded to at least two other carbons", and so on. It's not hard to come up with a short list of reasonable candidates.

Each test is called a "feature. " When a query comes in, find all of the features which are present in the query. Each feature is mapped to a set of structures which contain that feature, so get the intersection of those sets to find the targets which contain all the features present in the query. If the query contains no features then you'll just have to search all the targets.

(There's another even earlier filter step which can be done; if the query is exactly one of the filters then you've already got the answer.)

This algorithm is an example of the well-understood inverted index. I leave the details of how to deal with the indicies efficiently to another essay.

By the way, there are many types of substructure keys. The example I just gave is also known as a feature key. Another common approach is a hash fingerprint, the details of which I'm not going to get into now.

Available substructure keys: MACCS

The 2-feature key I defined was simple. Coming up with a longer list of effective substructure filter keys is harder. The filters are both query and data set dependent, and it's hard to say that adding, say, the 257th feature is all that more effective than leaving it out.

A long term goal of mine is to experiment and see if subgraph enumeration and data mining algorithms can be used to improve the choice of keys, both for a large data set like PubChem's or a smaller, specialized one focusing on, say, estrogen-like structures.

In order to "improve" on something I need a baseline. There are a number of substructure filter definitions. The absolute most common are the MACCS key fingerprints. MACCS is short for "Molecular ACCess System" and was MDL's chemistry database system started back in the late 1970s. Just about every edition of a cheminformatics journal has a paper which uses the MACCS keys, though more often as a basis for similarity comparisons and not for substructure filtering.

MACCS keys come in 166 bit and 960 bit forms, but most people use the smaller ones. Strangely though, there seems to be no primary reference for the key definitions. Greg Landrum implemented the MACCS keys in RDKit. Where did he get them from? He wrote:

unfortunately this information isn't exactly easy to come by. There's a paper from MDL that describes a *reoptimization* of the keys: 1. Durant, J.L., Leland, B.A., Henry, D.R. & Nourse, J.G. Reoptimization of MDL Keys for Use in Drug Discovery. Journal of Chemical Information and Computer Sciences 42, 1273-1280 (2002).
...
I'm not sure that the 166 "public" keys were ever really published in the open literature. I think many people just derived their definitions from the data that one could find in the help from Isis/Base.

(That's DOI: 10.1021/ci010132r for those of you with access. Please note that they reoptimized for similarity, and not for substructure filtering effectiveness.)

If you know of the primary reference describing the bits, let me know.

Greg's implementation is as a set of SMARTS patterns. He cross-tested them against two other packages, though not with results generated by MDL software. A nice thing about about SMARTS is they are relatively portable. Excepting a few differences in aromaticity perception and complexities in chirality and stereochemistry, they should work in most toolkits.

As a result, and because RDKit 1) the definitions are available as a data file and not code and 2) RDKit is a BSD-licensed project, they were quickly taken up by both OpenBabel and CDK. (Note: there were some serious bugs in the MACCS key definitions in OpenBabel 2.3.0. Make sure you get something more recent than that.)

OpenEye's GraphSim Tk supports MACCS keys, though again, make sure you have something newer than 1.7.4.3 because there's an error in one of the bits. (Nothing serious - the last bit was always set to 1.) Pipeline Pilot and other packages also implement the MACCS keys.

So far I've looked at three MACCS key implementations, and two had bugs. For all I know, the others do as well. Someday I'll do a cross comparison and let you know.

I decided against using the MACCS keys as the basis for my comparison. Why? For one, take a look at at Greg's cautions in his cross-comparison:

Most of the differences have to do with aromaticity

There's a discrepancy sometimes because the current RDKit definitions do not require multiple matches to be distinct. e.g. the SMILES C(=O)CC(=O) can match the (hypothetical) key O=CC twice in my definition. It's not clear to me what the correct behavior is.

Some keys are not fully defined in the MDL documentation

Two keys, 125 and 166, have to be done outside of SMARTS.

Key 1 (ISOTOPE) isn't defined

The first one will always be a problem as different toolkits use different methods to perceive aromaticity. The last three mean that of the 166 bits only 163 are potentially usable.

Since I'll be working with PubChem-sized data sets, I decided I want something with more bits, with better documentation, and which has already been used for such large data sets. Rememer, the 166 bit keys were defined a long time ago when the data sets and memory space were much smaller.

PubChem/CACTVS substructure keys

Why not go to the source? PubChem documented the PUBCHEM_CACTVS_SUBSKEYS and I wrote about it a few years ago.

These 881 bits of feature key goodness come from Xemistry's CACTVS toolkit. Most specifically, Wolf-Dietrich Ihlenfeldt, the main author, says they are:

property E_SCREEN in version 1.0, with non-default parameter "extended" set to 2

Just reading the PubChem documentation makes you think "I can implement that." Almost all of the feature definitions can be mapped to a SMARTS, or a SMARTS plus a minimum count. I've just done this, and found a number of gotchas on the way. Think of the rest of this essay as advice for future implementors.

I haven't released my implementation yet. Wait a few weeks. The C++/OEChem examples you see come from my code.

When I was close to finishing I had problems in my validation. I couldn't match the PubChem results. I asked Evan Bolton over at PubChem for help. He pointed out that there are some public implementations that I might want to look at for guidance. There's a PubChemFingerprinter in CDK, and it's based on code from the NIH Chemical Genomics Center. It looks like the NIH/CGC people in turn asked Evan for help some 4 or 5 years ago in getting things working.

However, I didn't find this code until yesterday, so I think it's also interesting to compare the three implementations.

Section 1 - Hierarchic Element Counts

Implementing these is simple. If there are 2 or more lithium atoms then set bit 5. (The first bit is bit 0.) If there are 32 or more carbons then set bit 13. And so on.

You can define almost all of these as SMARTS patterns, except the hydrogens bits. But you probably shouldn't. It's easy and fast to make a table of all element counts. Here's my C++ code:

  int count_table[OEElemNo::MAXELEM] = {0};
     ...
  for (OEIter<OEAtomBase> atom = mol.GetAtoms(); atom; ++atom) {
    unsigned int eleno = atom->GetAtomicNum();
    if (eleno >= OEElemNo::MAXELEM) {
      continue;
    }
    count_table[eleno] += 1;
    count_table[OEElemNo::H] += atom->GetImplicitHCount();
  }

In retrospect I should have had a hard coded limit of 92 for the maximum element number, since the PubChem keys don't go beyond uranium. While you might think I don't need to check for elements against OEElemNo::MAXELEM, OEChem does allow larger values in the data structure, through at least two different ways:

>>> mol = OEGraphMol()
>>> mol.NewAtom(200).GetAtomicNum()
200
>>> mol.Clear()
>>> OEParseSmiles(mol, "[#199]")
True
>>> [atom.GetAtomicNum() for atom in mol.GetAtoms()]
[199]
>>>

The equivalent CDK fingerprinter is similar:

int[] counts = new int[120];

public CountElements(IAtomContainer m) {
    for (int i = 0; i < m.getAtomCount(); i++)
        ++counts[PeriodicTable.getAtomicNumber(m.getAtom(i).getSymbol())];
}
public int getCount(int atno) {
    return counts[atno];
}
public int getCount(String symb) {
    return counts[PeriodicTable.getAtomicNumber(symb)];
}

though they should use m.getAtomicNum() instead of going through the symbol. (I sent email to Rajarshi, who wrote that code.)

Implicit hydrogens

But wait! What about implicit hydrogens? In the normal SMILES chemistry model, the SMILES "C" creates a single atom in the molecular graph, where the atom has 4 implicit hydrogens. This is because about 1/2 of the atoms in a typical molecule are hydrogens, so including them is a memory space concern.

CDK implements implicit hydrogens so what's going on?

That's where you have to look at the test cases. src/test/org/openscience/cdk/fingerprint/PubchemFingerprinterTest.java says:

CDKHydrogenAdder adder = CDKHydrogenAdder.getInstance(DefaultChemObjectBuilder.getInstance());
IMolecule mol1 = parser.parseSmiles("c1ccccc1CCc1ccccc1");
AtomContainerManipulator.percieveAtomTypesAndConfigureAtoms(mol1);
adder.addImplicitHydrogens(mol1);
AtomContainerManipulator.convertImplicitToExplicitHydrogens(mol1);

so the CDK PubChem fingerprinter requires explicit hydrogens in order to work. This is an implementation choice since, as you can see in my code, an alternate implementation doesn't need this requirement. I've also mentioned this to Rajarshi.

Setting a bit

Once you have the atom count table you can set the section 1 flags easily. My code uses lines like:

ATOM_COUNT(13, 32, OEElemNo::C);  // 13 >= 32 C

but that's because I'm using a macro. The actual code is more like:

if (count_table[OEElemNo::C] >= 32) {
  fp[13/8] |= 1 << (13 % 8);
}

This is the C way of saying "set bit 5 on the 2nd byte" using bit operations.

The CDK code is:

b = 13;
if (ce.getCount("C") >= 32) fp[b >> 3] |= MASK[b % 8];

which is uses a different set of bit operations. (MASK is a lookup table equivalent to 1<<(8-bit_index), meaning that my code has a different endianness.) In both cases the compiler should figure out that 13/8 and 13%8 are 1 and 5, respectively, and do so at compile time. The Java code does take a hit going through a element-symbol-to-offset table. I personally would have done ce.getCount(6), and preferably with a compile-time name to get the 6, but the overall performance difference is going to be small.

Section 2: Rings in a canonic Extended Smallest Set of Smallest Rings (ESSSR) ring set

CDK and my code have very different implementations. OpenEye does not implement SSSR. In fact they are notoriously against SSSR. Therefore I can't easily implement anything close to the CACTVS definition.

Instead, I translated each case to a roughly equivalent SMARTS pattern. "A ring of size 5" is "*1****1". Since I don't have an easy way to define "saturated" and "unsaturated" as a SMARTS, I decided to use "aromatic" and "nonaromatic" instead. As SMARTS these look like "a1aaaaa1" and "A1AAAAA1" for 6-membered aromatic and nonaromatic rings. The carbon-only, nitrogen-containing and heteroatom-containing definitions for 6-membered aromatic rings are: c1ccccc1, n1aaaaa1 and [a;!#6]1aaaaa1.

CDK has an SSSR implementation. While likely not exactly the same as ESSR it's much closer than my ad hoc substitution. It does

    SSSRFinder finder = new SSSRFinder(m);
    ringSet = finder.findSSSR();
...
public int countAnyRing(int size) {
    int c = 0;
    for (IAtomContainer ring : ringSet.atomContainers()) {
        if (ring.getAtomCount() == size)
            c++;
    }
    return c;

and the code to set the bit is the simple and fast:

b = 115;
if (cr.countAnyRing(3) >= 1) fp[b >> 3] |= MASK[b % 8];

Saturated and unsaturated

A question I had from the PubChem definition is "how do they define saturated"? And the answer is a direct translation from of the chemistry. To figure out the implementation, start where a bit is set:

b = 172;
if (cr.countSaturatedOrAromaticCarbonOnlyRing(5) >= 5) fp[b >> 3] |= MASK[b % 8];

and check out the implementation:

public int countSaturatedOrAromaticCarbonOnlyRing(int size) {
    int c = 0;
    for (IAtomContainer ring : ringSet.atomContainers()) {
        if (ring.getAtomCount() == size
                && isCarbonOnlyRing(ring)
                && (isRingSaturated(ring) || isAromaticRing(ring)))
            c++;
    }
    return c;
}

private boolean isRingSaturated(IAtomContainer ring) {
    for (IBond ringBond : ring.bonds()) {
        if (ringBond.getOrder() != IBond.Order.SINGLE) return false;
    }
    return true;
}
private boolean isRingUnsaturated(IAtomContainer ring) {
    return !isRingSaturated(ring);
}

In other words, "only contains single bonds" means unsaturated while everything else is saturated. In SMARTS these translate to "*-1-*-*-*-*-*1" and "*1***:,=,#**1", respectively. (The ":,=,#" means "match an aromatic bond or a double bond or a triple bond. You don't often see booleans in SMARTS bonds expressions. I'll use them again later on.)

Those SMARTS would *almost* work, except that the test is for "isRingSaturated(ring) || isAromaticRing(ring)". And that, dear reader, isn't possible. Even if it were possible, sticking in a "contains nitrogen" is another complication.

So I'm going to stick with my ad hoc solution.

How does my OEChem code get the ring counts if it doesn't have SSSR? Through code which looks like this:

// Bits 144 151 158 165 172: saturated or aromatic carbon-only ring size 5
static OESubSearch aromatic_carbon_ring_size_5("c1cccc1");
  ...
  m = aromatic_carbon_ring_size_5.Match(mol, true);
  if (m) {
    /* 144 >= 1 saturated or aromatic carbon-only ring size 5 */
    SET_BIT(144);
    if (++m) {
      /* 151 >= 2 saturated or aromatic carbon-only ring size 5 */
      SET_BIT(151);
      if (++m) {
        /* 158 >= 3 saturated or aromatic carbon-only ring size 5 */
        SET_BIT(158);
        if (++m) {
          /* 165 >= 4 saturated or aromatic carbon-only ring size 5 */
          SET_BIT(165);
          if (++m) {
            /* 172 >= 5 saturated or aromatic carbon-only ring size 5 */
            SET_BIT(172);
          }
        }
      }
    }
  }

where I again use a macro to set the correct bit number.

Now, my SMARTS matches are going to give wrong answers for some ring systems. Consider if I have two 6-membered rings sharing a bond. Then SSSR finds only two rings, and ignores the 10-membered ring which is the outside of those two rings. My code finds the two rings plus the 10-membered ring.

This may or may not be a good thing for substructure matching. It's a bad thing if you want good fidelity to CACTVS. Or good if your fealty is to OpenEye. :)

Overall ring counts

To finish off section 2, bits 255-262 are overall ring counts. Bits 255 and 256 correspond to the SMARTS [a] and [a;!#6] so are simple. The others I don't support because there's just no simple way with OEChem to do that. For example, there might be two rings of size 50. Do I just keep testing for ever larger SMARTS ring definitions until I've reached the number of atoms?

While with SSSR it's simple. Here's the CDK implementation.

public int countAromaticRing() {
    int c = 0;
    for (IAtomContainer ring : ringSet.atomContainers()) {
        if (isAromaticRing(ring)) c++;
    }
    return c;
}
...
b = 257;
if (cr.countAromaticRing() >= 2) fp[b >> 3] |= MASK[b % 8];

Section 3: Simple atom pairs

The definitions for this section look simple, and they are, but they required me to not overthink. The documentation says

Section 3: Simple atom pairs - These bits test for the presence of 
patterns of bonded atom pairs, regardless of bond order 
or count.

Bit Position	Bit Substructure
263	Li-H
264	Li-Li
265	Li-B
266	Li-C

and I first translated those as SMARTS queries. After all, "[Li]-C" is a perfectly good SMARTS. But that's not what the definition means. It says "regardless of bond order", which means the SMARTS is more like "[Li]~C".

Even that's going to be wrong because in SMARTS the "C" only matches non-aromatic carbons. The test should really be "[Li]~[c,C]" or "[Li]~[#6]". Those are two different ways to match either aromatic or non-aromatic carbons.

CDK uses SMARTS

This is exactly what the CDK does

public int countSubstructure(String smarts) throws CDKException {
    sqt.setSmarts(smarts);
    boolean status = sqt.matches(mol);
    if (status) {
        return sqt.getUniqueMatchingAtoms().size();
    } else return 0;
}
 ...

b = 266;
if (cs.countSubstructure("[Li]~[#6]") > 0) fp[b >> 3] |= MASK[b % 8];

There's a slight performance hit because it will search for all matches when it only needs at least one, and another performance hit because it has to reparse the SMARTS pattern each time, unless the back-end implements a cache. (I couldn't find evidence of such a cache.)

Implicit hydrogens

If there are implicit hydrogens then there will be a problem because implicit hydrogens don't match explicit atoms in a SMARTS query. I pointed out earlier that the hydrogen count terms work only because the interface API requires that all molecules have explicit hydrogens. But the CDK code for section 3 correctly works with both explicit and implicit hydrogens:

b = 263;
if (cs.countSubstructure("[Li&!H0]") > 0) fp[b >> 3] |= MASK[b % 8];

Jump up to the top of PubchemFingerprinter.java and you'll see Rajarshi's comment:

Some SMARTS patterns have been modified from the original code, since they were based on explicit H matching. As a result, we replace the explicit H's with a query of the #N&!H0 where N is the atomic number.

The "!H0" in SMARTS says "match if this atom has at least one implicit hydrogens or is bonded to at least one hydrogen." Which solves this problem nicely. (A long time ago H matched only implicit hydrogens, which is why Daylight's SMARTS documentation has the footnote "Semantics of [H] changed in v4.5".)

Aromatic As and Se (and B and Si too!)

I think there's a bug in the CDK patterns which is traceable all the way back to its first NCGC incarnation. Consider:

b = 295;
if (cs.countSubstructure("[#6]~[As]") > 0) fp[b >> 3] |= MASK[b % 8];
b = 296;
if (cs.countSubstructure("[#6]~[Se]") > 0) fp[b >> 3] |= MASK[b % 8];

SMILES allows aromatic arsenic and aromatic selenium, which are matched by the SMARTS [as] and [se]. The above only matches the non-aromatic forms. They should be [#33] and [#34].

This of course assumes that CACTVS supports aromatic forms of those elements. I would need to search PubChem for examples, get the corresponding bit, and check.

(Updated on 21 Jan 2011: the structures 136132 and 22967695 set the bits for "aromatic ring" and for arsenic and selenium, respectively, which means that CACTVS does support those forms. In addition, 10899214 sets the bits showing it contains aromatic boron. This last is not supported by Daylight but is supported by OpenEye and a few other toolkits. This means the CDKs SMARTS pattern for bit 265, [Li]~[B], should be [Li]~[#5] for maximum portability. There's also a few aromatic silicons, like 136138 so the CDK patterns using [Si] should use [#14] to be portable.)

This also goes to show that there hasn't been enough validation testing in either of the two code bases. But let me tell you, validating this is hard. I spent about 2 days to implement everything and another 2 days to test it, and I expect to spend another couple of days in testing before I'm happy with the results.

I did tell Rajarshi about this, and I sent him a copy of my validation data set. I'll include those tests with my upcoming code release.

My code goes once through the list of bonds

The CDK works and is easy to understand. I decided to be a bit more clever, with some thought towards better performance. This is usually not a good thing, especially for a first implementation.

What I did was go through the list of bonds and look at the element numbers of the atoms on either side. "Eleno" is the smallest element number which isn't 1 and "other_eleno" is the other element. Then I can go through the list of bonds once, and quickly figure out which bit to set.

The code to go through the list of bonds and order the element numbers correctly is:

  for (OEIter<OEBondBase> bond=mol.GetBonds(); bond; ++bond) {
    OEAtomBase *atom = bond->GetBgn();
    OEAtomBase *other_atom = bond->GetEnd();
    unsigned int eleno = atom->GetAtomicNum();
    unsigned int other_eleno = other_atom->GetAtomicNum();

    if (eleno == 1 || ((other_eleno != 1) && (other_eleno < eleno))) {
      int tmp;
      tmp = eleno;
      eleno = other_eleno;
      other_eleno = tmp;
    }

and here's how I set the bits:

    switch (eleno) {
    case OEElemNo::Li:
      switch (other_eleno) {
      case OEElemNo::H: SET_BIT(263); break; // 263 Li-H
      case OEElemNo::Li: SET_BIT(264); break; // 264 Li-Li
      case OEElemNo::B: SET_BIT(265); break; // 265 Li-B
      case OEElemNo::C: SET_BIT(266); break; // 266 Li-C
      case OEElemNo::O: SET_BIT(267); break; // 267 Li-O
      case OEElemNo::F: SET_BIT(268); break; // 268 Li-F
      case OEElemNo::P: SET_BIT(269); break; // 269 Li-P
      case OEElemNo::S: SET_BIT(270); break; // 270 Li-S
      case OEElemNo::Cl: SET_BIT(271); break; // 271 Li-Cl
      }
      break;
    case OEElemNo::B:
      switch (other_eleno) {
      case OEElemNo::H: SET_BIT(272); break; // 272 B-H
      case OEElemNo::B: SET_BIT(273); break; // 273 B-B
      case OEElemNo::C: SET_BIT(274); break; // 274 B-C
        ...

Implicit hydrogens

But what about the implicit hydrogens? After all, there are no data structure bonds which connect an atom to an implicit hydrogen. What I can do is go through the list of atoms and if the atom is an Li, B, C, etc. and the atom has an implicit hydrogen then set the right bits.

I already had code in section 1 which goes through the atoms, so I modified it to also handle these bits in section 3. What I showed earlier wasn't complete. Here's the full code:

  for (OEIter<OEAtomBase> atom = mol.GetAtoms(); atom; ++atom) {
    unsigned int eleno = atom->GetAtomicNum();
    if (eleno >= OEElemNo::MAXELEM) {
      continue;
    }
    count_table[eleno] += 1;
    num_hydrogens = atom->GetImplicitHCount();
    if (num_hydrogens) {
      count_table[OEElemNo::H] += num_hydrogens;
      // Handle some of the bits for section 3
      switch (eleno) {
      case OEElemNo::Li: SET_BIT(263); break; // 263 Li-H
      case OEElemNo::B: SET_BIT(272); break; // 272 B-H
      case OEElemNo::C: SET_BIT(283); break; // 283 C-H
      case OEElemNo::N: SET_BIT(299); break; // 299 N-H
      case OEElemNo::O: SET_BIT(308); break; // 308 O-H
      case OEElemNo::Al: SET_BIT(318); break; // 318 Al-H
      case OEElemNo::Si: SET_BIT(320); break; // 320 Si-H
      case OEElemNo::P: SET_BIT(323); break; // 323 P-H
      case OEElemNo::As: SET_BIT(325); break; // 323 As-H
      }
    }

Section 4: Simple atom nearest neighbors

The previous section look at single connections to a specific atom. This section looks at multiple bonding patterns to a specific atom.

These bits test for the presence of atom nearest neighbor patterns,
regardless of bond order (denoted by "~") or count, but where bond
aromaticity (denoted by ":") is significant.

Bit Position	Bit Substructure
327	C(~Br)(~C)
328	C(~Br)(~C)(~C)
329	C(~Br)(~H)
330	C(~Br)(:C)
 ...

CDK uses SMARTS patterns

The CDK code is easy to understand and very much like its implementation of the previous section. Here's an example:

b = 327;
if (cs.countSubstructure("[#6](~Br)(~[#6])") > 0) fp[b >> 3] |= MASK[b % 8];
b = 328;
if (cs.countSubstructure("[#6](~Br)(~[#6])(~[#6])") > 0) fp[b >> 3] |= MASK[b % 8];
b = 329;
if (cs.countSubstructure("[#6&!H0]~[Br]") > 0) fp[b >> 3] |= MASK[b % 8];

It uses "~" in the SMARTS to mean "any bond", and it has to be careful with the "C" and turn it into a "[#6]" so it matches both aromatic and non-aromatic carbons.

Here again you can see how Rajarshi uses the SMARTS "!H0" to check if an atom matches a hydrogen. The original NCGC implementation requires explicit hydrogens with:

b = 263; if (cs.countSubstructure("[Li]~[H]") > 0) fp[b>>3] |= MASK[b%8];

I go through the list of atoms

Again, I wrote my code with a premature view towards performance. I say this because during testing I found several bugs in my implementation and it would have been better to get valid code working first, use that to generate a lot of test cases, and optimize knowing that the test cases will help identify flaws.

What I did was notice that there are two things to track: "number of bonds to atoms of each element number" (including aromatic bonds) and "number of aromatic bonds to atoms of each element number." If I have those two lists then my bit tests are almost simple.

Since English is an imprecise language for this, I'll show you the code which makes those two lists:

  for (OEIter<OEAtomBase> atom=mol.GetAtoms(); atom; ++atom) {
    unsigned int eleno = atom->GetAtomicNum();
    if (eleno != OEElemNo::C && eleno != OEElemNo::N && eleno != OEElemNo::O &&
        eleno != OEElemNo::P && eleno != OEElemNo::S && eleno != OEElemNo::Si) {
      continue;
    }
    int bond_to[70] = {0};
    int aromatic_to[70] = {0};
    unsigned int other_eleno;

    bond_to[1] = atom->GetImplicitHCount();

    for (OEIter<OEBondBase> bond=atom->GetBonds(); bond; ++bond) {
      other_eleno = bond->GetNbr(atom)->GetAtomicNum();
      // the heaviest one I worry about is iodine
      if (other_eleno > 60) {
        continue;
      }
      if (bond->IsAromatic()) {
        aromatic_to[other_eleno]++;
        bond_to[other_eleno]++;
      } else {
        bond_to[other_eleno]++;
      }
    }

With that in place the tests become:

    switch (eleno) {
    case OEElemNo::C:
      if (bond_to[OEElemNo::Br] && bond_to[OEElemNo::C]) {
        SET_BIT(327);   // 327 C(~Br)(~C)
      }
      if (bond_to[OEElemNo::Br] && bond_to[OEElemNo::C]>=2) {
        SET_BIT(328);   // 328 C(~Br)(~C)(~C)
      }
      ...
      if (bond_to[OEElemNo::C]>=2) {
        SET_BIT(332);   // 332 C(~C)(~C)
      }
      ...
      if (bond_to[OEElemNo::C]>=2 && aromatic_to[OEElemNo::C]) {
        // >=2 because the :C adds one
        SET_BIT(355);   // 355 C(~C)(:C)
      }
      ...
    case OEElemNo::N:
      if (bond_to[OEElemNo::C]>=2) {
        SET_BIT(390);   // 390 N(~C)(~C)
      }
      if (bond_to[OEElemNo::C]>=3) {
        SET_BIT(391);   // 391 N(~C)(~C)(~C)
      }
    ..

The tricky part was things like bit 355. The -C adds one to bond_to and the :c adds one to both bond_to and aromatic_to. As a result, I have to check for "at least 2 carbon bonds" and "at least 1 aromatic carbon bond."

I missed up several times, which is why I had to write a lot of extra tests. Remember, when possible you should have at least one test for each possible branch in the code. In bit 355 there are 3 different possible cases. And when you write the tests, be sure to have them on the boundaries so that simple mistakes (like writing ">" instead of ">=") can be identified.

Section 5: Detailed atom neighborhoods

The last two sections looked at the environment of a given atom, and this is a more complex extension of the same idea.

These bits test for the presence of detailed atom neighborhood
patterns, regardless of count, but where bond orders are specific,
bond aromaticity matches both single and double bonds, and where "-",
"=", and "#" matches a single bond, double bond, and triple bond
order, respectively.

Bit Position	Bit Substructure
416	C=C
417	C#C
418	C=N
419	C#N
420	C=O
421	C=S
422	N=N
423	N=O
424	N=P
425	P=O
426	P=P
427	C(#C)(-C)
428	C(#C)(-H)

It took me some time to figure out section, but in the end it's a simple translation to SMARTS. That's exactly what the NCGC and CDK implementations do:

b = 416;
if (cs.countSubstructure("[#6]=,:[#6]") > 0) fp[b >> 3] |= MASK[b % 8];
b = 417;
if (cs.countSubstructure("[#6]#[#6]") > 0) fp[b >> 3] |= MASK[b % 8];
...
b = 427;
if (cs.countSubstructure("[#6](#[#6])(-,:[#6])") > 0) fp[b >> 3] |= MASK[b % 8];
b = 428;
if (cs.countSubstructure("[#6H](#[#6])") > 0) fp[b >> 3] |= MASK[b % 8];

It's simple. It works. (Well, see below for a problem in one of the SMARTS.) And it's the same thing I do except that I compile the SMARTS patterns once during static initialization:

static OESubSearch bit_416("[c,C]=,:[c,C]");  // C=C
static OESubSearch bit_417("C#C");  // C#C
  ...
static OESubSearch bit_427("C#[c,C]-,:[c,C]");  // C(#C)(-C)
static OESubSearch bit_428("[C;!H0]#C");  // C(#C)(-H)

and OEChem has an option to check if a match exists, rather than returning a list of all matches like the CDK implementation. This gives the same results but with better performance.

  if (bit_416.SingleMatch(mol)) {
    SET_BIT(416);   // C=C
  }
  if (bit_417.SingleMatch(mol)) {
    SET_BIT(417);   // C#C
  }

Implicit hydrogens and aromaticity

If you are very keen-eyed and detail oriented you would see that in bit 428 I use "[C;!H0]" and CDK uses "[#6;H]". I think both of our patterns contain errors.

Even though the C is triple bonded, I think there's an slight chance that might be in an aromatic ring. I need to use "#6" instead. (I tend to use the "[c,C]" form because it doesn't require me to think about the atomic number all the time. This is more a concern with [As,as].)

So I changed my code.

While the CDK code should have "[#6;!H0]" because as it stands it matches "one and only one hydrogen" not "at least one hydrogen." I'll let Rajarshi know.

And this, dear reader, is one example of why you should read someone else's code, and one example why open source is a Good Thing.

Ideally we would need test cases for all of these, cross-matched with the CACTVS output. That will take time. Writing tests isn't fun for most people. But paying people helps. (Hint, hint - I'm a consultant!)

Idea for faster performance

I test each pattern, one by one. Each pattern has to test all N atoms, so there are 45*N atom checks. I should be able to speed this up using OESubSearch.AtomMatch, which matches the pattern using only the specified atoms. Then my code would look like:

  for (OEIter<OEAtomBase> atom = mol.GetAtoms(); atom; ++atom) {
    switch (atom->GetAtomicNum()) {
    case OEElemNo::C:
      if (bit_416.AtomMatch(atom)) {
        SET_BIT(416);
      }
      if (bit_417.AtomMatch(atom)) {
        SET_BIT(417);
      }
    ...
  }

This would do only 32*N atom matches in the worst case of all carbons. Not much difference I know. Also, I'm slightly put off by the documentation:

The AtomMatch method is the effective combination of sequential calls to AddConstraint followed by SingleMatch.

I translate this to mean AtomMatch isn't thread-safe. In any case, I can probably use a similar technique from before using careful bond tracking. It's not as obvious here on how to do that, but perhaps it's just a simple case of missing the obvious.

Section 6: Simple SMARTS patterns

This section title lies. Well, it isn't a serious lie. It says it contains SMARTS patterns but it's as interpeted in the context:

These bits test for the presence of simple SMARTS patterns, regardless
of count, but where bond orders are specific and bond aromaticity
matches both single and double bonds.


Bit Position	Bit Substructure
460	C-C-C#C
461	O-C-C=N
462	O-C-C=O
463	N:C-S-[#1]
   ...

The question is, how do you convert these into SMARTS?

Both the anonymous NCGC developer and I ended up with essentially the same solution. Here's the CDK implementation of those bits:

b = 460;
if (cs.countSubstructure("[#6]-,:[#6]-,:[#6]#[#6]") > 0) fp[b >> 3] |= MASK[b % 8];
b = 461;
if (cs.countSubstructure("[#8]-,:[#6]-,:[#6]=,:[#7]") > 0) fp[b >> 3] |= MASK[b % 8];
b = 462;
if (cs.countSubstructure("[#8]-,:[#6]-,:[#6]=,:[#8]") > 0) fp[b >> 3] |= MASK[b % 8];
b = 463;
if (cs.countSubstructure("[#7]:[#6]-,:[#16]-[#1]") > 0) fp[b >> 3] |= MASK[b % 8];

Where the original definition had a "-", the SMARTS has a "-,:". This is because of the "bond aromaticity matches both single and double bonds." But if you do that then you have to allow both aromatic and non-aromatic atoms at either end, as with [#6].

Similarly, where the "simple SMARTS" had a "=", the real SMARTS has "=,:", and again, both atoms must allow aromatic and non-aromatic forms.

For reference, here's my corresponding implementation:

static OESubSearch bit_460("[#6]-,:[#6]-,:[#6]#C"); // C-C-C#C
static OESubSearch bit_461("[#8]-,:[#6]-,:[#6]=,:[#7]"); // O-C-C=N
static OESubSearch bit_462("[#8]-,:[#6]-,:[#6]=,:[#8]"); // O-C-C=O
static OESubSearch bit_463("n:c-,:[#16;!H0]"); // N:C-S-[#1]
 ...
  if (bit_460.SingleMatch(mol)) {
    SET_BIT(460);   // C-C-C#C
  }
  if (bit_461.SingleMatch(mol)) {
    SET_BIT(461);   // 461 O-C-C=N
  }
...

My SMARTS are slightly different because I noticed that a term like "N:C" can be expressed directly as "n:c" and not "[#7]:[#6]". Both are correct.

The CDK code also uses [#1], which is the SMARTS way to write "explicit hydrogen atom". My code uses the "!H0" trick so it work with both implicit and explicit hydrogens. I sent email to Rajarshi about this.

Section 7: Complex SMARTS patterns

This was the last, and most annoying nut to crack. The definition says:

These bits test for the presence of complex SMARTS patterns,
regardless of count, but where bond orders and bond aromaticity are
specific.

Bit Position	Bit Substructure
713	Cc1ccc(C)cc1
714	Cc1ccc(O)cc1
715	Cc1ccc(S)cc1
 ...
776	CC1CCC(C)CC1
 ...

I took that to mean these are real SMARTS definitions, and I can use them directly.

I did that and compared my results to CACTVS results from PubChem. To my annoyance, I didn't get the same matches, and what I saw made little sense.

Using the definition above as my example, CACTVS would find matches to both bit 713 and bit 776. I looked at the structure and that would only be possible if the SMARTS matches the same substructure. Which can't be because in SMARTS "c" and "C" will never match the same atoms.

That's where Evan's pointer to other open source code proved very helpful. The CDK uses the original NCGC SMARTS translations like this:

b = 713;
if (cs.countSubstructure("[#6]c1ccc([#6])cc1") > 0) fp[b >> 3] |= MASK[b % 8];
...
b = 776;
if (cs.countSubstructure("[#6][#6]1[#6][#6][#6]([#6])[#6][#6]1") > 0) fp[b >> 3] |= MASK[b % 8];

In other words, if the documentation says "C" then the SMARTS should be "[#6]", but leave "c" as "c". Ditto for the other atoms which have aromatic forms.

A few search-and-replaces later and I had:

static OESubSearch bit_713("[#6]c1ccc([#6])cc1");
static OESubSearch bit_714("[#6]c1ccc([#8])cc1");
static OESubSearch bit_715("[#6]c1ccc([#16])cc1");
 ...
static OESubSearch bit_776("[#6][#6]1[#6][#6][#6]([#6])[#6][#6]1");
 ...
  if (bit_713.SingleMatch(mol)) {
    SET_BIT(713);   // 713 c1ccc(C)cc1
  }
  if (bit_714.SingleMatch(mol)) {
    SET_BIT(714);   // 714 Cc1ccc(O)cc1
  }
...

and my failing tests started passing. Yay! Thanks once again to CDK and NCGC for making their source code available for the world to read!

Idea for faster performance

Neither CDK nor I take advantage of an obvious possible performance improvement. In order to match "[#6]c1ccc([#16])cc1" the query must have at least 7 carbons and a sulpher. The counts have already been computed so it's easy to say something like:

if (atom_counts[OEElemNo::C] >= 7 && atom_counts[OEElemNo::S] >= 1 &&
       bit_713.SingleMatch(mol)) {
   SET_BIT(713);   // 713 c1ccc(C)cc1
}

Even better, in my code I've already tested that there's at least one 6-membered ring, so I can also test for that feature. It isn't so easy with ESSSR since that algorithm might have give different results with a bridged ring.

I haven't done this because I have no idea if I'll need the performance. Remember, it will take time to implement and provide a comprehensive set of test cases, and right now that time isn't worth it to me.

But I will release this library soon, and if it's worth it to you then you're free to make the changes. If you do, please contribute them back to me, along with tests and performance numbers.

Comments or criticism?

Or do you just want to praise my writing talents? Email me or leave comments.

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me