
Before starting today's lecture, a note: my lecture yesterday was based on John Bradshaw's Introduction to Chemical Systems article. He understands chemistry and the history of chemical systems much better than I do, and the article has some neat pictures. I recommend reading it.

Fingerprints

A chemical fingerprint is a list of binary values (0 or 1) that characterizes a molecule. There are several ways to create the list. I'll describe the widely used MACCS keys and how to use them for similarity comparisons and for database filtering. I'll then switch over to John Barnard's talk titled Chemical Structure Representation and Search Systems, which is a very good and comprehensive overview of the ways people have developed to compare two molecules.

The MACCS keys are a set of questions about a chemical structure. Here are some of the questions:

The result of this is a list of binary values – either true (1) or false (0). This list of values for a given chemical structure is called the MACCS key fingerprint for that structure.

Here's an example. If the molecule is C1CCC1 then the answers to those questions are:

The answers are frequently written as a list of bits (also called a bitstring). The bitstring for this molecule is "1010".

I can repeat that for other compounds. If the input structure is C1(=C(SSC1=O)Cl)Cl then the bitstring is "1101".

An interesting idea, but why do it? Comparing two molecules directly is a hard problem. In bioinformatics you're used to comparing two sequences based on the alignment. That works because the concept of a minimum string edit maps pretty well to the physical model of how evolution works on the sequence. The direct mapping into chemistry is to look for the minimum edit distance of the graphs. That doesn't work because that operation has little physical meaning.

Chemists have worked hard to understand molecules and discovered that some substructure motifs give an indication of the functionality (or lack of it) of a compound. While not a perfect description, these bitstrings have three useful properties: they are easy to compare, a chemist can understand the results, and they have some predictive power.

Here's an easy way to compare two bitstrings of the same length. Compare each pair of bits and add 1 when they are different (one is 1 and the other 0, or vice versa). That count is known as the Hamming distance between the two bitstrings. Divide it by the total number of bits in the string to get a normalized distance between 0 and 1. If the two strings are identical then this value is 0. If one string is the exact opposite of the other then this value is 1.
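To make the comparison concrete, here is a minimal Python sketch of it. The function name bit_difference is my own choice for illustration, and it assumes the fingerprints are stored as strings of "0" and "1" characters, as in the examples above.

```python
def bit_difference(a, b):
    """Return the fraction of positions where two bitstrings differ.

    Both arguments are strings of '0' and '1' characters with the
    same, non-zero length.
    """
    if len(a) != len(b):
        raise ValueError("bitstrings must have the same length")
    if not a:
        raise ValueError("bitstrings must not be empty")
    # Count the positions where the two bits are different.
    differing = sum(1 for x, y in zip(a, b) if x != y)
    # Divide by the total number of bits to get a value from 0 to 1.
    return differing / len(a)


# The two example fingerprints from earlier in this lecture.
print(bit_difference("1010", "1101"))  # 3 of the 4 bits differ: 0.75
```

The two example molecules disagree on three of the four questions, so by this measure they are not very similar.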

I can use the fingerprint bitstrings to search a chemical database. If I think the bitstrings and the comparison method are close enough to the chemistry then I can find compounds similar to the query by comparing bitstrings and keeping only those that are similar enough. Computers are very fast at comparing bits so this technique can be used even in very large databases.
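As a rough sketch of such a search, assume the database is just a dictionary mapping a compound name to its bitstring. The names, the fingerprints other than "1010" and "1101", and the cutoff value are all made up for illustration. Converting each bitstring to an integer lets the computer compare all the bits at once with XOR.

```python
def similarity_search(query, database, max_distance):
    """Return (name, distance) pairs for fingerprints close to the query.

    query is a bitstring, database maps names to bitstrings of the
    same length, and max_distance is the largest allowed fraction of
    differing bits.
    """
    nbits = len(query)
    query_bits = int(query, 2)
    hits = []
    for name, bits in database.items():
        if len(bits) != nbits:
            raise ValueError("fingerprint length mismatch for " + name)
        # XOR sets a 1 wherever the two fingerprints disagree.
        differing = bin(query_bits ^ int(bits, 2)).count("1")
        distance = differing / nbits
        if distance <= max_distance:
            hits.append((name, distance))
    # Report the most similar compounds first.
    hits.sort(key=lambda hit: hit[1])
    return hits


# A hypothetical four-compound database.
database = {
    "cmpd-A": "1010",
    "cmpd-B": "1011",
    "cmpd-C": "1101",
    "cmpd-D": "0010",
}
for name, distance in similarity_search("1010", database, 0.25):
    print(name, distance)
```

With a cutoff of 0.25 this keeps cmpd-A (identical to the query) and the two compounds that differ in a single bit, and drops cmpd-C. A real system would store the fingerprints in a packed binary form instead of text, but the idea is the same.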

Fingerprints are also useful as filters for substructure searching. Suppose each structure has a fingerprint with fields like

At the start of a substructure search the code can analyze the query to see if any of the fingerprint fields can reduce the search space. For example, if the query has 5 or more carbons then there's no need to test those compounds with fewer than 5 carbons. Similarly, if the query has a 6-membered ring then the compounds which don't have a 6-membered ring can be skipped.
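Here is a small sketch of that screening step. It assumes only two fingerprint fields, taken from the two examples in the previous paragraph, and it assumes each field is one where a 1 in the query means any matching compound must also have a 1. The field names and the compounds are hypothetical. A compound can be skipped as soon as the query has a bit set that the compound does not.

```python
# Hypothetical fingerprint fields, one bit each.
HAS_5_CARBONS = 1 << 0   # the structure has 5 or more carbons
HAS_6_RING = 1 << 1      # the structure has a 6-membered ring


def may_contain(query_fp, target_fp):
    """True if every bit set in the query is also set in the target."""
    return (query_fp & target_fp) == query_fp


def screen(query_fp, targets):
    """Return the names of the targets that pass the fingerprint screen."""
    return [name for name, fp in targets.items()
            if may_contain(query_fp, fp)]


# A hypothetical database of precomputed fingerprints.
targets = {
    "small-chain": 0,
    "long-chain": HAS_5_CARBONS,
    "six-ring": HAS_5_CARBONS | HAS_6_RING,
}

# The query has 5 or more carbons and a 6-membered ring.
query_fp = HAS_5_CARBONS | HAS_6_RING
print(screen(query_fp, targets))  # ['six-ring']
```

Passing the screen does not mean the compound matches. It only means the compound can't be ruled out, so the survivors still need the full substructure test.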

There are many ways to compare two compounds and many nuances. John Barnard's talk does a great job of covering the topic so I'll walk through his slides.

For more details see the Daylight chapter on fingerprints and Mesa Analytic's fingerprint overview.



Copyright © 2001-2020 Andrew Dalke Scientific AB