Computers -- History of Chemical Nomenclature

[ previous | newer ] /home/writings/diary/archive/2003/10/15/Computers

Computers -- History of Chemical Nomenclature

In comes the 1950s, and the start of commercial, general purpose computers. People started to use them to automate many of the mechanical tasks once done by hand. Looking back 50 years, one obvious task is assigning chemical nomenclatures, but there are some big steps to overcome.

First, can computers be taught nomenclature? Sure, there's all this talk of 'electronic brains', but it is real or just hype? Second, the early computers were quite small (in power; definitely not small in size). One reason it took three years to train a chemist to be a nomenclaturist is because chemistry is complex. How can all those rules be encoded in the computer? Third, can computer handle the I/O appropriate to chemistry?

Luckily, there are useful intermediate steps. Before getting to them, let me elaborate on the interesting influence of linguistics in computer science. If you've ever programmed in Fortran you'll notice it has a different feel than C, Java, Python, Lisp and just about every other programming language. It seems to lack from a certain sense of structure, but I'm at a loss for giving a good, clear example. Fortran was designed before there was a good theoretical understanding of languages and parsers.

Chomsky, now best known as a .. political dissident is probably the best term, formalized in 1956 how we think about syntax in computer languages. He introduced context-free grammars, which are now used to describe the syntax of nearly all computer languages. CFGs are succinctly written in a variation of BNF ("Backus-Naur Form") and can easily be turned into a parser using a parser-generator like yacc or ANTLR. There is a CFG for Fortran but it's very complicated; there are definite advantages in having a language be designed to be easy to parse as a CFG. BNF, by the way, was introduced in 1960 as a way to describe the Algol programming language.

Linguistics also provides tools to understand existing languages. Garfield's 1961 PhD thesis, which I've mentioned favorably in a few of my older essays, used linguisitic tools to determine the morphemes (smallest parts of a language) in chemical nomenclature and how they fit together. One result of this analysis, which is the topic of his thesis, was the realization that the molecular formula for a compound could be determined directly from the morphemes in the compound name.

This was surprising, even stunning. Previously (meaning for the previous century), chemists were taught to use the chemical name to make the chemical graph, then count the atoms in the graph to get the molecular weight. This takes time and is error-prone, even for an expert. His thesis showed that 1) chemical nomenclature has a well-defined syntax which can be studied with the tools of linguistics, 2) the results have direct applicability to real needs in chemistry, and 3) can be encoded on a computer, or done by a relatively untrained person. Chemists were shocked that it was so easy, and it took a lot of checking and double checking to verify the results really were correct.

In the 1960s the big pharmas and indexing companies like ISI started using computers for their chemical informatics requirements. Computers could do all the things punch card machines could do, and faster. They could be programmed increasingly nuanced rules of chemistry, to compute molecular weight, figure out the molecular graph, etc. given a WLN or a molecular name. Graphic display systems became mainstream enough that people could sketch the molecular graph on a display instead of on paper, and get a high-quality printout of the graph. The first commercial system for this was ISI's CROSSBOW, (released in 1969 and written in Cobol (!) and assembler [#]). It only had text input but did have graph output. Within a few years there were several available products, including programs which could take graphical input.

Computers got powerful enough to support new types of searches based on structure. IUPAC systematic names, WLNs, and other line notations required years of training in order to produce a unique (or canonical) compound name. This made it hard for normal chemists to find a compound by name. But any chemist could sketch out the molecular graph on the screen, which could easily and automatically be converted to a connection table listing all the atoms and bonds. To find a matching compound in the database, the computer just finds all records with an isomorphic connection table. (In other words, where the graphs were identical.) Real code would use various tricks to filter out obvious rejects, because graph isomorphisms are slow.

Computers can also do substructure and superstructure searches, which look for compounds where part of the molecule matches the sketched out query, or vice versa. The older line notations along with paper-based indicies could handle a few of these, but they were basically precomputed. Computers allowed chemists to search for an arbitrary substructure. Subgraph isomorphism searches are a bit harder than graph isomorphism ones, and required cleverer tricks to prefilter out likely mismatches, but chemists and programmers are clever. I expect true substructure search appeared after structure search, but I don't know when. (I say "true" because CROSSBOW did allow searching for WLN fragment names.)

Computers could also be used for similarity searches. This is mostly done using fingerprints, which is a bit string. A key based system like MACCS applies a chemical meaning to each bit; "bit 4 is only on when there is at least one selenium atom", "bit 5 is on if and only if there is a C=O bond." A path based system like Daylight's looks at all paths up to a certain size and computes a hash value based on the atoms and bonds in the path, which is used to turn a bit on. The result is an invariant fingerprint that can be compared to other fingerprints. If two fingerprints are highly dissimilar than they are different, and if they are similar than the compounds are (hopefully) likely to be similar. If the similarity value is a metric (or metric-like) then it can be used to cluster data sets, which gives yet another way computers can help understand chemistry. (Contact Mesa Analytics if you want to hire people with expertise on this topic. Tell John I apologize for borrowing his ladder for so long :)

(As another aside: the Levenstein distance between two words is the number of character insertions and deletions to transform one into the other. I know of no similar distance measure for graphs. Do you? If so, please let me know.)

All of these searches became possible because computers could apply graph algorithms to the connection table. Even better, computers could be used to make predications of physical properties (like the number of rotatable bonds, flexibility, and CLogP) based only on that connection table. It's no wonder that by the latter part of the 1970s people could exclaim that "Line notations are dead!"

There are some problems with connection tables. They are big. The most commonly used connection table format comes from MSI and looks like

14 14 0 0 0 0 0 0 0 0999 V2000
0.8641 0.1885 -0.0550 C 0 0 0 0 0 0 0 0 0 0 0 0
1.6904 1.2832 -0.0518 C 0 0 0 0 0 0 0 0 0 0 0 0
3.0579 1.1437 -0.0197 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6120 -0.1296 0.0167 C 0 0 0 0 0 0 0 0 0 0 0 0
2.7571 -1.2117 0.0051 C 0 0 0 0 0 0 0 0 0 0 0 0
1.3999 -1.0697 -0.0319 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8348 2.4466 -0.0147 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1949 3.5035 -0.0635 O 0 0 0 0 0 0 0 0 0 0 0 0
5.2264 2.5396 0.0432 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0391 -0.2600 0.0701 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8568 -1.4155 0.2362 C 0 0 0 0 0 0 0 0 0 0 0 0
5.5015 -2.5866 0.4359 O 0 0 0 0 0 0 0 0 0 0 0 0
7.3030 -1.1536 0.2016 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4969 3.4603 0.0349 H 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 6 2 0 0 0 0
2 3 2 0 0 0 0
3 4 1 0 0 0 0
4 5 2 0 0 0 0
5 6 1 0 0 0 0
3 7 1 0 0 0 0
7 8 2 0 0 0 0
7 9 1 0 0 0 0
4 10 1 0 0 0 0
0 11 1 0 0 0 0
1 12 2 0 0 0 0
1 13 1 0 0 0 0
9 14 1 0 0 0 0
M END

According to the filename, that's aspirin, although I don't know what the hydrogen is doing there as the 14th atom. The first three numbers in the atoms fields are for the coordinates. Other fields in the table store the atom symbol (remember Berzeluis?), formal charge, bond type, chirality, and other atom or bond properties. What makes this a connection table is the first two numbers in the second section (starting with 1 2 1. This is the bond section and each field connects one atom to another. It's very hard for people to read a connection table directly, and tedious even to convert it by hand into a chemical graph, which would look like

The full SD file for aspirin takes 928 bytes, which is about 40 times larger than the systematic name of "2-Acetoxybenzoic acid" or the semi-systematic name of "acetylsalicylic acid," or the trivial name of "aspirin." The coordinates aren't needed for the topology, so the connection table could in theory be smaller, but still nowhere comparable in size.

Why is space a problem? In part because there are a lot of known compounds. By the early 1980s, CAS had 6.3 million compounds. Uncompressed it would take over 6 GB just for the connection table, and even with a good binary data structure it needed to live on disk and not in memory. (A Cray supercomputer may have had that much RAM, but likely not your local VAX.)

Even if space wasn't a concern, a good line notation is nice because you can stick it in an text field of a relational database, or into a cell of an Excel spreadsheet. If it's a canonical name, it's easy to check for duplicate compounds and even, with some line notations, remove salts and do other simple manipulations right on the string. And with a really good one, a chemist can convert it to/from a graph with little problem.

It's actually relatively easy to come up with a line notation. The standard way is to create a spanning tree for the graph, label each atom and bond (perhaps with a shortcut notation for things like carbons and single bonds), and come up with a way to write that graph as a string, eg, by using markers for the start of each branch in the spanning tree and markers which join cycles which were broken to make the tree.

A good example of this style of line notation is SMILES. A (non-canonical) SMILES for aspirin is C(=O)(O)c1ccccc1OC(=O)C which is only one character longer than the systematic name. The '(' is the start of a branch and the ')' closes a branch. Digits are used to close open rings, so the 1 closes up the benzene ring.

If you've noticed, I keep mentioning the canonicalization problem. It's easy to make a line notation, but it's really hard to make a canonicalization scheme using that notation. That's the big problem. found in all graph-based chemistry nomenclatures developed since the early 1800s, and was only known solution required people spend a few training to use a nomenclature system. How could a computer hope to match that?

The answer, which was very, very clever, is described in

David Weininger, Arthur Weininger and Joseph L. Weininger, ``SMILES 2: Algorithm for Generation of Unique SMILES Notation'', Journal of Chemical Information and Computer Science (JCICS), Vol. 29, No. 2, pp. 97-101, 1989.

It depends on the Fundamental Theorem of Arithmetic. That's the theorem which states that all integers > 1 have a unique prime decomposition. (It's also the reason what 1 is neither a prime nor a compositite, as then the FToA wouldn't be true, or would need a special-case exclusion of 1.)

It's so cool I'm going to tell you how it works. Assign each atom a number based on all the properties which distinguish it from another atom; two atoms are the same if and only if that number is that same. It doesn't matter what the numbers are, so long as they can be compared that way. Renumber them so the atom with the lowest value is 0, the second lowest is 1, etc. Duplicates are allowed if the original values were the same. Get the corresponding prime (0->2, 1->3, 2->5, 3->7, etc.). Now replace all the numbers with the product of the values of its neighbors. Eg, if an atom is bonded to two other atoms, with values 5 and 7 then its new value is 35. Renumber them again so that the lowest value is 0, etc. If the order hasn't changed then stop.

This gives every atom a unique number, independent of the input order, up to symmetry. (Ie, the two carbons in C-O-C will have the same number. Root the spanning tree at an atom with the lowest value. When a branch is possible, take the one to the atom with the lowest value. Presto - a canonical SMILES! And you might be able to tell that this solution could only have been done on a computer; no one would want to work through the number of multiplications needed to generate it by hand.

As it turns out, the original SMILES paper only establishes atom order and ignores bonds. What happens if there the bonds are of different type? The paper does say that when there are multiple atoms with the same type to take the one with highest bond type first (triple over double over single). It's a good approximate solution, but I think there are cases where it can fail. I don't know the generally accepted solution for those cases.

Daylight Chemical Information Systems is built around SMILES and related languages like SMARTS (a substructure search language). Because the notation is simple, it can be entered by people with minimal training, even using a keyboard. A SMILES can be easily converted by hand into a graph depiction. The SMILES string is small for reasonable-sized molecule, so can easily be manipulated by ~~perl~~Python, Excel, Oracle or other tools. It has a simple grammar so it's easy to write parsers for it. All-in-all, a nice language.

Most importantly, it meant their tools were fast. Instead of doing a graph isomorphism search (even with filters) to find an identical compound, they get the canonical SMILES for the query compound and use a very fast string match. Instead of reading records from a disk, they can stick entire chemical database in memory, which totally eliminates all slow disk I/O.

Pretty cool what a good nomenclature can do!

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me