
Parsing SMILES, compression, Ragel

SMILES

Some classes of molecular structures can be represented using graphs from graph theory, with atoms as nodes and bonds as edges. These happen to include effectively all of the molecules used to make drugs and pesticides, and those used in biological systems. (There are rough edges, like tautomers, and some of the no-go areas are metals, which have a more complex bonding system that isn't representable as bonds between pairs of atoms.)
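
To make that concrete, here is ethanol ("CCO") written out by hand as a tiny graph, nothing more than nodes labeled with elements and edges labeled with bond orders:

# Ethanol, SMILES "CCO", as an explicit graph: atom indices map to
# element symbols, and each bond is (atom1, atom2, bond order).
atoms = {0: "C", 1: "C", 2: "O"}
bonds = [(0, 1, 1), (1, 2, 1)]   # two single bonds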

Tokenizing SMILES with a state diagram

I talked some about SMILES before but didn't get into the details, only a bit about the tokenization step. Years ago I developed the SMILES tokenizer for FROWNS. I'm rather proud of it. It's pretty simple to understand, especially compared to the sets of if/switch statements used in most of the other tokenizers I've seen (for example, CDK's).

BTW, if you're going to use it you should make a few extensions, based on OpenEye SMILES.

A tokenizer is one stage in the parsing process. Usually it isn't the only stage. My version strengthens the tokenizer so it knows which tokens are allowed after a given token. This is usually the parser's job, along with checking that parentheses are balanced and that ring closures match.
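
Here's a minimal sketch of that idea in Python (my own illustration, not the FROWNS code): a regex-driven tokenizer over a deliberately tiny token set, where each token class carries the set of token classes allowed to follow it.

import re

# Token classes and their patterns, checked in order.  The token set is
# deliberately small: organic-subset atoms, bonds, branches, ring closures.
TOKEN_PATTERNS = [
    ("atom",    re.compile(r"Cl|Br|[BCNOSPFI]")),
    ("bond",    re.compile(r"[-=#]")),
    ("open",    re.compile(r"\(")),
    ("close",   re.compile(r"\)")),
    ("closure", re.compile(r"[0-9]")),
]

# Which token classes may come next, given the class just seen.
ALLOWED = {
    "start":   {"atom"},
    "atom":    {"atom", "bond", "open", "close", "closure", "end"},
    "bond":    {"atom", "closure"},
    "open":    {"atom", "bond"},
    "close":   {"atom", "bond", "open", "close", "closure", "end"},
    "closure": {"atom", "bond", "open", "close", "closure", "end"},
}

def tokenize(smiles):
    state, pos, tokens = "start", 0, []
    while pos < len(smiles):
        for name, pat in TOKEN_PATTERNS:
            m = pat.match(smiles, pos)
            if m and name in ALLOWED[state]:
                tokens.append((name, m.group()))
                state, pos = name, m.end()
                break
        else:
            raise ValueError("unexpected text at %d: %r" % (pos, smiles[pos:]))
    if "end" not in ALLOWED[state]:
        raise ValueError("SMILES ends where another token is required")
    return tokens

print(tokenize("CC(=O)C1CCCCC1"))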

I used the FROWNS tokenizer idea for one of my clients. They have a web form with a text entry box that accepts a SMILES, an SD file, or a set of identifiers. Their code autodetects the input format. I used my tokenizer to see if the input was sufficiently SMILES-like. It ran on the browser client side, so I translated the code by hand into Javascript.

Parser generators and Ragel

I've long been interested in parser generation tools for things like this. I like the idea that I can write something very succinctly and let the machine make it work, and hopefully make it work fast. My SMILES parser, which uses regular expressions to drive a state machine, isn't particularly fast. Fast enough for most cases, though.

There are a large number of parser generation tools out there. Most are oriented towards context-free grammars, which are a bit more complex than I need. It turns out most of what I deal with uses regular grammars. There are tools for that class of problems as well. One I've had an eye on is Ragel.

In the last few years there have been two cases where I wanted a SMILES parser which was more low-level. To understand what low-level means in this context I'll describe the problems.

Unique chemical graph enumeration

When Mick Kappler was at Daylight (some years ago) he did some work on the structure/graph enumeration problem. How many chemically reasonable structures are possible with N (carbon) atoms? Some details on his poster.

One of the steps in generating a structure is ensuring that it isn't a duplicate. The simplest way is to keep a set of "seen" structures. If the new structure isn't in that set, and its canonical form isn't in that set either, then it is a new structure, so add the canonical SMILES (and possibly the non-canonical SMILES) to the set.
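
In code the bookkeeping is just a set membership test. A minimal sketch, assuming a canonicalize() function supplied by some toolkit (it is not defined here):

seen = set()

def is_new_structure(smiles, canonicalize):
    # Cheap test first: have we seen this exact string?
    if smiles in seen:
        return False
    canonical = canonicalize(smiles)
    if canonical in seen:
        return False
    # Remember both forms so the cheap string test catches repeats early.
    seen.add(smiles)
    seen.add(canonical)
    return True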

The number of monocyclic unsaturated structures he generated (including duplicates) goes something like 3**(n-3.25) and the number of unique ones also has exponential growth. At some point you can't keep all of those SMILES in memory.

Compressing SMILES

I got to wondering how to make better use of memory storage. Dave Weininger and Roger Sayle (then at Metaphorics) did some internal work in the late 1990s to compress SMILES strings using LZW. I don't recall the details now and I wasn't too clear on it then because it wasn't until a couple years later that I learned about how compression works, from the great book "Managing Gigabytes."

My memory is they only used a bit or two per letter. A nice thing about LZW compression is that there's a direct mapping from a given bit pattern to a token, which makes it straightforward to do a text search of a compressed data stream. There's a bit of trickiness because the algorithm has to deal with variable bit-sized tokens, and there might be multiple ways to encode the SMILES given an input dictionary. But Roger and Dave are smart and clever people.
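
I don't know the details of their implementation, but the flavor of LZW is easy to show. This minimal encoder (my own sketch, character-based and without the bit packing) builds the dictionary as it goes and emits one integer code per dictionary entry:

def lzw_encode(text):
    # Start with one code per distinct character; grow the dictionary as
    # longer substrings are seen.  Returns integer codes plus the codebook.
    codebook = {ch: i for i, ch in enumerate(sorted(set(text)))}
    w, out = "", []
    for ch in text:
        if w + ch in codebook:
            w += ch
        else:
            out.append(codebook[w])
            codebook[w + ch] = len(codebook)
            w = ch
    if w:
        out.append(codebook[w])
    return out, codebook

# codes 0, 0, 1, 2, 4 decode to "C", "C", "O", "CC", "OC"
print(lzw_encode("CCOCCOC"))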

(Another neat thing about compressing the data is that modern computers are usually waiting on I/O. It can be faster to read compressed data off the disk and uncompress it in memory than to read uncompressed data directly. I remember Dave not believing this at first and having to test it to prove it to himself.)

Consider the input SMILES "CCOC" and a dictionary containing encodings for "C", "CC", "CO" and "OC". Depending on the bit size for each token the encoder may choose to encode that string as "C" "CO" "C" or as "CC" "OC". As I recall from Mick, the compression code he inherited would try the various combinations to get better compression, at the expense of higher compression time. I don't recall if he ended up compressing the input.
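
Here's a toy version of that choice, with made-up bit costs per dictionary entry; a real encoder weighs actual code lengths, but the search over segmentations has the same shape:

from functools import lru_cache

# Hypothetical dictionary with made-up bit costs per entry.
DICT_BITS = {"C": 3, "CC": 4, "CO": 4, "OC": 4}

def cheapest_encoding(s):
    @lru_cache(None)
    def best(i):
        # Return (total bits, tokens) for the cheapest way to encode s[i:].
        if i == len(s):
            return 0, ()
        candidates = []
        for tok, bits in DICT_BITS.items():
            if s.startswith(tok, i):
                cost, rest = best(i + len(tok))
                candidates.append((cost + bits, (tok,) + rest))
        if not candidates:
            raise ValueError("cannot encode %r" % s[i:])
        return min(candidates)
    return best(0)

print(cheapest_encoding("CCOC"))   # (8, ('CC', 'OC')) with these costs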

Arithmetic coding

One of the limitations of LZW is the quantization of the number of bits used per symbol. This is also its strength, depending on the application. Another approach is arithmetic coding, which is also really cool. When I was in 8th or 9th grade I learned how, in a physically perfect world, all information could be stored in a single stick. Take the information, convert it to a number, put a decimal point in front of it, and cut a stick that's exactly that long.

For example, the text "Hello!" in base 256, encoded as ASCII and displayed in hex is "0x48656c6c6f21", or in base 10 as 79600447942433. Cut a stick which is exactly 0.79600447942433 meters long. To read information off of the stick, determine its length exactly (remember, this is a physically perfect world) and do the back transformation.
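
The round trip is a couple of lines of Python, with the caveat that you have to know how long the number is, which is part of the "knowing when to stop" problem mentioned below:

from fractions import Fraction

text = b"Hello!"
n = int.from_bytes(text, "big")           # 0x48656c6c6f21 = 79600447942433
stick = Fraction(n, 10 ** len(str(n)))    # "cut a stick 0.79600447942433 m long"

# Reading it back: you need to know the number of digits (and bytes).
recovered = int(stick * 10 ** len(str(n)))
print(recovered.to_bytes(len(text), "big"))   # b'Hello!'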

Arithmetic coding is exactly that principle, with some tricks to handle things like knowing when to stop, and generating the minimum number of bits needed to get close enough to the stopping condition for a given number.
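
Here is the interval-narrowing idea in miniature, using exact fractions and an explicit end-of-string symbol instead of the bit-level tricks a real coder needs. (This is a generic illustration with a made-up three-symbol model, not the coder I wrote for the SMILES work.)

from fractions import Fraction

# Cumulative probability ranges for a toy model: "." marks end-of-string.
RANGES = {
    "C": (Fraction(0),       Fraction(8, 10)),
    "O": (Fraction(8, 10),   Fraction(95, 100)),
    ".": (Fraction(95, 100), Fraction(1)),
}

def encode(symbols):
    low, high = Fraction(0), Fraction(1)
    for sym in symbols:
        width = high - low
        sym_low, sym_high = RANGES[sym]
        low, high = low + width * sym_low, low + width * sym_high
    # Any number inside [low, high) identifies the whole message.
    return (low + high) / 2

def decode(value):
    out = []
    while True:
        for sym, (sym_low, sym_high) in RANGES.items():
            if sym_low <= value < sym_high:
                break
        if sym == ".":
            return "".join(out)
        out.append(sym)
        # Rescale the value into the symbol's sub-interval and repeat.
        value = (value - sym_low) / (sym_high - sym_low)

x = encode("CCOC.")
print(x, decode(x))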

Another limitation of Dave and Roger's approach, in my opinion, is that they didn't look at what they were compressing. They treated SMILES as a text string. The more modern approach is to do statistical modeling. They probably used a 0 order approach to find the most common tokens in some training set. SMILES doesn't use all of the printable characters, so they could use "Q" for, say, "CC(" and "!" for "O)" if those proved common enough.

Modeling the SMILES grammar

It's a 0 order model because it only looks at the statistics of each token, without regard for the tokens around it. "C" might occur 80% of the time and "O" the other 20%. A first order model looks at how often one token follows another. For example, the pair "CC" might occur three times more often than "CO", so if there's a "C" then there's a 25% chance that the next character is an "O", instead of the 20% predicted by the 0 order model. This means the encoder can use slightly fewer bits in the output.
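
Collecting those statistics is just counting. For example, with a made-up three-string training set:

from collections import Counter

def order0(smiles_list):
    # Count individual characters, ignoring context.
    counts = Counter()
    for s in smiles_list:
        counts.update(s)
    return counts

def order1(smiles_list):
    # Count (previous character, next character) pairs.
    counts = Counter()
    for s in smiles_list:
        counts.update(zip(s, s[1:]))
    return counts

training = ["CCO", "CC(C)C", "CCCC"]
print(order0(training))
print(order1(training))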

Encoding tools like gzip support higher order encoders, but at some point it's probably not worth going higher unless you have very repetitive strings. Still, computers are fast and memory is cheap these days, so go ahead and use "-9" when you gzip something. These systems are adaptive: they adjust the encoding frequencies to better fit the data as it comes in. The encoder and decoder use the same adaptation algorithm so everything stays in sync.

The better you can model the input, the better the compression. General purpose algorithms know nothing about SMILES strings, so I figured I could probably make something which compresses better than gzip. For Mick's case the SMILES are very simple: carbon atoms, '(' branches ')', one or two ring closures, no bond terms, nothing inside of '[' brackets ']'. By construction I know the first character must be a 'C' so there's no reason to actually encode it. If a '(' is in the SMILES then there must be a ')' before the end.

What I wanted to do was take his input SMILES strings (around 34 bytes each), compress them down to 64 bits (8 bytes) or less, then insert each one as a 64 bit number into a Judy1 bitset from the Judy library, along with some special code to handle the few SMILES which don't compress down that far. I thought of using a hash like MD5 or SHA-1, but those are 128 and 160 bits, and folding either down to 64 bits means that by the birthday paradox I would expect duplicates after about 2**32 compounds. Mick generated about 2% of that number of compounds.
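
The bookkeeping I had in mind looks roughly like this, with an ordinary Python set standing in for the Judy1 bitset and compress_to_bits() standing in for the coder (neither the real Judy binding nor my coder is shown):

seen_small = set()       # stands in for the Judy1 word-sized bitset
seen_overflow = set()    # the few SMILES that don't fit in 64 bits

def add_structure(smiles, compress_to_bits):
    bits = compress_to_bits(smiles)   # e.g. "011010...", ending with the coder's stop symbol
    if len(bits) <= 64:
        # The coding is self-terminating, so zero-padding to 64 bits is unambiguous.
        key = int(bits.ljust(64, "0"), 2)
        if key in seen_small:
            return False
        seen_small.add(key)
    else:
        if smiles in seen_overflow:
            return False
        seen_overflow.add(smiles)
    return True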

I want to compress short strings, one SMILES at a time. This is not the normal use case for gzip, which compresses larger blocks of text, and it's well outside what bzip2 does with its Burrows-Wheeler block transform. gzip uses zlib for the compression. I got and tried out the zlib source code. Its output starts with a signature, which wastes some bits, and while I can prepopulate some of the frequency information there's a high startup cost which makes it infeasible to use for every token.

Hence I couldn't compare directly. What I ended up doing was taking my test set, compressing it with gzip/bzip2, dividing by the number of input compounds, and using that as an estimate of how well those compress. Were I to use the gzip idea I would train on a test set to generate statistics for the higher order models and figure out some way to restart the output anew for each token.
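
The estimate itself is one compress-and-divide step. Here's the kind of thing I mean, using Python's gzip and bz2 modules and a placeholder file name:

import bz2, gzip

def bits_per_compound(path):
    # Compress the whole file, then average over the number of SMILES
    # (assuming one SMILES per line).
    data = open(path, "rb").read()
    n = data.count(b"\n")
    return {
        "gzip":  8.0 * len(gzip.compress(data)) / n,
        "bzip2": 8.0 * len(bz2.compress(data)) / n,
    }

print(bits_per_compound("test.smi"))   # "test.smi" is a placeholder name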

For my token compressor I used a simplified SMILES grammar and my own (dog slow) implementation of an arithmetic coder. I've got the code but it's undocumented and in development mode. Still, if you're interested then here it is.

SMILES tokenization compression results

Here's the email summarizing my conclusions.

	From: dalke@...
	Subject: more on compression
	Date: October 6, 2004 10:06:42 AM GMT+02:00

Following up on the conversation with Tudor and Mick during Friday
lunch, I experimented some with the combination of arithmetic coding
and better modeling of SMILES strings.  (That is, don't assume they
are just characters but take into account that the first character must
be an atom, as well as the character after a '('. )

I tried two models.  The first uses this state diagram for a SMILES
string.  The numbers in parentheses are the statistics generated from
the test set Mick gave me.

START:
  'C' -> expect_any (20043)
expect_any:
  None -> END (20043)
  '(' -> expect_atom (92148)
  ')' -> expect_any (92148)
  '1' -> expect_any (39454)
  '2' -> expect_any (34766)
  '3' -> expect_any (10156)
  '4' -> expect_any (388)
  '5' -> expect_any (38)
  'C' -> expect_any (268353)
expect_atom:
  'C' -> expect_any (92148)
END:

I then compressed that same data set using the above coding.  Here's a
summary of what I found

20,043 strings, 669,685 bytes (649,642 without newline)
13517 with 64 bits or less
6526 with more than 64 bits
largest needs 105 bits, 16 need >= 100  bits
1,243,386 bits total = 62.0 bits per compound (avg)
163,260 bytes (1,306,080 bits) quantized to 8 bits = 65.2 bits/cmpd
compressed size = 23% uncompressed size (at bit level)
compressed size = 24% uncompressed size (at quantized byte level)


I'm looking at the number < 64 bits because one of the Judy tree data
structures is optimized for word-sized boolean arrays.  This says 2/3
of the compounds could be stored in that very compact manner.

The "quantized to 8 bits" is because some of the codings didn't take
up the full byte.  That's the number rounded up to the nearest byte
size.


I next considered a better model of SMILES strings.  The data set I
got has at most 4 embedded (branches).  That's
  CCC1(CC(C(C(C)(C)C)C(C)(C)C)C2CC1C)C2

When inside of a '(' there cannot be the end of string, so there's no
need to include that possibility.  Of course balanced parens are not
regular -- they need at least a context free grammar -- but if the
maximum depth is known beforehand then there is a regular grammar.

So I tried with this model of the SMILES strings

START:
  'C' -> expect_any (20043)
expect_any:
  None -> END (20043)
  '(' -> expect_atom_close1 (76129)
  '1' -> expect_any (33090)
  '2' -> expect_any (27112)
  '3' -> expect_any (8025)
  '4' -> expect_any (305)
  '5' -> expect_any (29)
  'C' -> expect_any (214609)
expect_any_close1:
  '(' -> expect_atom_close2 (15454)
  ')' -> expect_any (76129)
  '1' -> expect_any_close1 (6055)
  '2' -> expect_any_close1 (7309)
  '3' -> expect_any_close1 (2039)
  '4' -> expect_any_close1 (81)
  '5' -> expect_any_close1 (9)
  'C' -> expect_any_close1 (51073)
expect_any_close2:
  '(' -> expect_atom_close3 (560)
  ')' -> expect_any_close1 (15454)
  '1' -> expect_any_close2 (309)
  '2' -> expect_any_close2 (343)
  '3' -> expect_any_close2 (92)
  '4' -> expect_any_close2 (2)
  '5' -> expect_any_close2 (0)
  'C' -> expect_any_close2 (2649)
expect_any_close3:
  '(' -> expect_atom_close4 (5)
  ')' -> expect_any_close2 (560)
  '1' -> expect_any_close3 (0)
  '2' -> expect_any_close3 (2)
  '3' -> expect_any_close3 (0)
  '4' -> expect_any_close3 (0)
  '5' -> expect_any_close3 (0)
  'C' -> expect_any_close3 (22)
expect_any_close4:
  ')' -> expect_any_close3 (5)
  '1' -> expect_any_close4 (0)
  '2' -> expect_any_close4 (0)
  '3' -> expect_any_close4 (0)
  '4' -> expect_any_close4 (0)
  '5' -> expect_any_close4 (0)
  'C' -> expect_any_close4 (0)
expect_atom:
  'C' -> expect_any (0)
expect_atom_close1:
  'C' -> expect_any_close1 (76129)
expect_atom_close2:
  'C' -> expect_any_close2 (15454)
expect_atom_close3:
  'C' -> expect_any_close3 (560)
expect_atom_close4:
  'C' -> expect_any_close4 (5)
END:

You'll note that this is only good enough for coding the given data
set.  Other compounds are likely to use an edge in the state diagram
which is listed as 0 probability.  That's easy enough to fix, but it
means the numbers I'm about to give are on the low side.

On the other hand, I can do even better modeling.  For example, for
some of your data you'll know how many atoms are expected, so I can
remove the
   'expect_any' -> END
probabilities entirely.  I also know that if the previous atom used a
ring closure number then the atom bonded to it (the next one in the
SMILES string, barring branches) cannot use the same ring closure
number.  (Ie, 'C1C1' is not allowed, nor is C1(CC)C1.)

These could be described with a state diagram, with great difficulty,
but it's probably better to augment these with a bit of code.

Okay, so here are the numbers

20,043 strings, 669,685 bytes (649,642 without newline)
19219 with 64 bits or less
824 with more than 64 bits
largest needs 115 bits, 6 need >= 100  bits
1,046,046 bits total = 52.2 bits per compound (avg)
139,548 bytes (1,116,384 bits) quantized to 8 bits = 55.7 bits/cmpd
compressed size = 19.5% uncompressed size (at bit level)
compressed size = 21%   uncompressed size (at quantized byte level)

As you can see, there's a 15% reduction in average size, and nearly
all the coded strings fit in a 64 bit word.

The impressive thing is that the gzip'ed size is 122,295 bytes and
bzip2 uses only 112,487 bytes.  Both general purpose compressors beat
my best effort even when I have a good model of the SMILES string.

I can conjecture on why this is true.  Despite the detail of my model
it's still only a first order model.  There is higher order
information in the system that I do not capture.  For example, the
odds of "CCC" given "CC" is not the same as the odds of "CCCC" given
"CCC".

This can be fixed in the usual way of having a 2nd order model to
depend on this first order one, and a 3rd to depend on the second.
Though I see more advanced systems use Markov modeling.


By the way, I experimented with using zlib, the compression library
used by gzip, as a way to get this higher order model information
without having to do extra work.  You can provide text to it to make
the coding dictionary so it can be pre-trained on the input data.

It didn't work out.  The compressed strings were not well compressed.
Why?  I suspect it's something to do with the stream oriented nature
of the zlib compression while these SMILES strings are rather small.
It may take a few characters for the higher order predictors to kick
in.  I also tried to remove the obvious control bits it produced, so
as to only work with data bits.  But I may not have done it correctly.

I have not looked into using the library behind bzip2.

This is all the work I'm going to put into this for now.  The code for
this is all in Python.  Available if desired.

Lingos for similarity searching on SMILES strings

Roger (now at OpenEye) and Andrew Grant (AstraZeneca) presented a poster at CUP '06 and a talk at CUP '07 on doing similarity searches of two SMILES strings based solely on lexical similarity between the two SMILES. For details see "Lingos, Finite State Machines, and Fast Similarity Searching", J. A. Grant, J. A. Haigh, B. T. Pickup, A. Nicholls, and R. A. Sayle, J. Chem. Inf. Model 46(5) (2006) p1912-1918.

I have the paper but haven't read it yet. At the poster presentation I noted that because of its direct dependence on SMILES encoding rules, 'C' and 'I' are given the same weight while 'Br' is treated differently, as is '[Se]'. I wondered if there could be some way of normalizing the SMILES at the syntax level. For example, assign a single character to each element (convert "H" into chr(1), "C" into chr(6), ...) and strip out any atomic numbers and chirality information.
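
A rough sketch of what I mean, using a cut-down element table and regex (this illustrates the idea; it is not the prototype from CUP and not a full SMILES grammar):

import re

# Abbreviated element table and tokenizer -- just enough to show the idea.
ATOMIC_NUMBER = {"H": 1, "B": 5, "C": 6, "N": 7, "O": 8, "F": 9, "P": 15,
                 "S": 16, "Cl": 17, "Br": 35, "Se": 34, "I": 53}

token_pat = re.compile(r"\[[^]]*\]|Cl|Br|[BCNOPSFI]|.")

def normalize(smiles):
    out = []
    for tok in token_pat.findall(smiles):
        if tok.startswith("["):
            # Keep only the element symbol from a bracket atom; drop
            # isotope, charge, chirality, and H counts.
            m = re.match(r"\[\d*([A-Z][a-z]?|[a-z])", tok)
            symbol = m.group(1).capitalize() if m else "?"
            out.append(chr(ATOMIC_NUMBER.get(symbol, 0)))
        elif tok in ATOMIC_NUMBER:
            out.append(chr(ATOMIC_NUMBER[tok]))
        else:
            # Bonds, branches, ring closures (and, in this cut-down version,
            # lowercase aromatic atoms) pass through unchanged.
            out.append(tok)
    return "".join(out)

# "C", "Br" and "[Se]" now each contribute exactly one character.
print(repr(normalize("CC(=O)Br")), repr(normalize("[Se]1CC1")))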

I prototyped a version at CUP using my Python tokenizer to show it's feasible, but the performance isn't anywhere near what OpenEye wants. There's also the obvious problem that finding a good "normalization" is a lot of work for perhaps only a little improvement, and it's better to get something out early to learn whether tuning is even needed.

Using compression to compare similarity

Last week was the 4th Joint Sheffield Conference on Chemoinformatics. One of the speakers was Jonathan Hirst from the University of Nottingham. He spoke on Interpretable Correlation Descriptors and Compression-based Similarity: New Open Source Software. (Hmm, he also has a paper in J Chem Inf Mod, which I should read. Good thing Chalmers has a good library.)

He used straight gzip (not even zlib, so there's extra overhead) and bzip2, and commented that he needed to generate N copies of the input to get better compression statistics. That's probably because of the header information and to let the higher order models kick in, though he didn't mention it. When I went up to talk with him afterwards he asked if I had published my results. Not being an academic, I hadn't. Maybe I should. But this article is probably just as informative to the world; J Chem Inf Mod, for example, is subscriber-only.
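
For reference, the usual way to turn a compressor into a similarity measure is the normalized compression distance. A few lines with bz2 show the shape of it (this is the generic NCD formula, not necessarily what his software computes):

import bz2

def compressed_size(s):
    return len(bz2.compress(s.encode("utf8")))

def ncd(x, y):
    # Normalized compression distance: how much does knowing one string
    # help compress the other?
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

print(ncd("CCO", "CCN"), ncd("CCO", "c1ccccc1O"))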

Kolmogorov-Chaitin algorithmic complexity

He started his talk with a reference to "Kolmogorov-Chaitin algorithmic complexity ... defined as the size of the program (that you would feed to a universal Turing machine) required to print the sequence and then terminate." [Quoting his zippotron page]. The tricky thing about that definition, which I didn't mention to him when we chatted, is that it's highly dependent on the choice of Turing machine. SMILES strings don't cover all of symbol space so there's no need for a general purpose TM. Quoting (thanks be to Google) Delahaye and Zenil in "On the Kolmogorov-Chaitin Complexity for short sequences":

A drawback to Kolmogorov-Chaitin complexity (K) is that it is uncomputable in general, and that limits its range of applicability. Moreover when strings are short, the dependence of K on a particular universal Turing machine U can be arbitrary. In practice one can approximate it by computable compression methods. However, such compression methods do not provide a good approximation for short sequences -- shorter for example than typical compiler lengths. In this paper we will suggest an empirical approach to overcome this difficulty and to obtain a stable definition of the Kolmogorov-Chaitin complexity for short sequences.
(I quote this only for the background text. The paper itself does not appear to be relevant to this topic.)

SMILES strings are definitely "short sequences". Also note that my 1st order compression scheme, which is very knowledgeable about SMILES, is not a good compression method for Zippotron-style comparison. It really is the higher order modeler in gzip kicking in.

Ragel

I decided to write up here what I did, which gave me the incentive to look at Ragel. I've been looking for a way to describe SMILES which can do the parsing without having the overhead of a regular expression engine, and preferably something cross-platform. Because I know Roger, I want the generated result to be as fast as hand-written code, eg, with no extra variable assignments which are thrown away later. (Actually, it could potentially be faster, because Ragel can generate several different types of parsers, eg, if switch statements are faster or slower than state tables for a given compiler and architecture.)

I also wanted it to be flexible. For example, support for being able to instrument state transitions to gather statistics and do predictive modeling. Or to simplify the grammar (eg, no one really uses a charge of '++++' instead of '+4', and no toolkit supports the higher level chiral symmetry classes).

Yesterday I developed a SMILES grammar definition for Ragel. It's again rather undocumented and only somewhat tested. Here are the contents of my simple Makefile:

a.out: smiles_counter.rl
	ragel -e smiles_counter.rl | rlgen-cd 
	g++ smiles_counter.c

The code in there started as a "count the number of atoms and bonds in a SMILES string" program, then turned into a "compute the molecular weight of each SMILES in a file" program as I worked on testing it, then changed a bit more. The original molecular weight numbers were wrong because the code doesn't include implicit hydrogens. I used OpenEye's OEChem to convert my original test data into explicit atom form and to get the molecular weight data.

The timings I got show my code is "faster" than OpenEye's, but it's a worthless comparison because I'm not doing the perception needed to add implicit hydrogens and I'm not verifying that the SMILES is correct. So I won't tell you actual timing numbers. I will, however, compare the 650 or so lines I needed to the following OEChem version:

#include <iostream>
#include "oechem.h"

int main(int argc, char **argv) {
  OEChem::oemolistream ifs("/Users/dalke/src/test.smi");

  OEChem::OEMol mol;

  while (ifs >> mol) {
    std::cout << OEChem::OECalculateMolecularWeight(mol) << std::endl;
  }

  return 0;
}


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


