Diary RSS | All of Andrew's writings | Diary archive
Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.Finding the MCSes for the ChEBI ontology #
This is part 3 of a series on MCS:
- MCS background
- fmcs - find the MCS of a set of compounds
- Finding the MCSes for the ChEBI ontology
The industrious folks at EBI have been developing ChEBI, which expands to "Chemical Entities of Biological Interest." Quoting Wikipedia, "[ChEBI] is a database and ontology of molecular entities focused on 'small' chemical compounds, that is part of the Open Biomedical Ontologies effort."
They define several distinct ontologies. One is a chemical structure ontology. For example, the identifier CHEBI:33567 contains catecholamine, and a few examples of catecholamines are hexoprenaline (CHEBI:37950), arbutamine (CHEBI:50580), L-isoprenaline (CHEBI:6257). In addition, catecholamine is a catechol (CHEBI:33566), which in turn is a benzenediol (CHEBI:33570), and so on. A group can have more than one parent; catecholamine is also a monoamine molecular messenger (CHEBI:25375).
The end result is a hierarchial structure. The bottom of the hierarchy are structures, and intermediate nodes are such that all children of the node have some common property.
Some of these common properties map directly to a common substructure. For example, CHEBI:33853 contains phenols, so every compound under that node has "one or more hydroxy groups attached to a benzene or other arene ring."
However, not all of them do. As Chepelev, Hastings, Ennis, Steinbeck, and Dumontier pointed out in "Self-organizing ontology of biochemically relevant small molecules", BMC Bioinformatics 2012, 13:3, the term "'ester' includes compounds that conform to C(=O)OC (i.e. carboxylic esters) and C(=S)OC patterns, among others."
Other cases can't even be represented as SMARTS. They give "bicyclic" as one such example.
Can I find the MCS of all structures in a node in the ontology?
I was curious to see if I could use their data set as a test of fmcs. If their intermediate nodes have a machine-readable way to tell if it's a purely substructure-based node, and if I could get the size information, then I could get all the structures underneath it, find the MCS, and compare my answer to theirs.
Alas, they don't have that annotation information. It's something they are working on, but I didn't get the impression that it's a high priority. (I don't see why it should, either.)
Still, it's an interesting thought - what if I were to generate the MCS for all nodes, and visualize the results somehow?
It took a bit longer than I thought, but I finally downloaded their ontology (in OBO format), parsed it, extracted the hierarchy, figured out the compounds in each node, tossed out the structures that RDKit couldn't parse, and the nodes which didn't have at least two remaining structures in them.
One that was done, I let my MCS algorithm at it. It took about 50 minutes to process. (Well, I had a 15 second timeout on the MCS. I've found that 15 seconds is usually good enough.)
I also developed a visualizer for the result, using Karen Schomburg's SMARTSviewer and Daylight's depictmatch.cgi
Oooh! More pictures!
Here's a snapshot of one of the successful cases, CHEBI:16648, which is dialkyl phosphate:

Most of the results aren't as clear-cut. For example, CHEBI:16389 contains the ubiquinones. I found the MCS:

which is nearly right, but the Wikipedia page for Coenzyme_Q10 ("Coenzyme Q10, also known as ubiquinone, ...") shows a methyl attached to the top-most oxygen this SMARTS depiction. This is because CHEBI:18238 is a structure in the set which does not have that methyl attached!
It this methyl important? I don't know. I'm not a chemist, and this requires expertise I simply don't have.
An oopsie in the oxolanes?
What I do know is that there's a mistake in the oxolanes, CHEBI:26912. Wikipedia calls this tetrahydrofuran and says it's an 5-membered ring with the formula (CH2)4O. I would write it as the SMILES/SMARTS "O1CCCC1".
However, my search finds only "OCCCC"; it doesn't find the cycle. There shouldn't be a problem with this one so I investigated further, wondering if it was a bug. It ended up that acetylblasticidin S (CHEBI:2413) is considered an oxolane. A quick look at the structure though shows that it has no 5-membered ring.
I think that's an annotation error. BTW, I do not envy the job of annotator. There's a lot of data to review, and people like me end up pointing out the mistakes, not the huge amount of work to get all the other parts right.
Even more pictures... most of the ChEBI ontology!
Do you want to see the output of my full analysis? Do you have a lot of memory on your computer? If so, download fmcs_chebi.html.bz2. It's only 7.5 MB but it bzip2 uncompresses to 166 MB. Open fmcs_chebi.html in your browser, and have fun! (Note: I'll probably delete it after a month or so.)
BTW: the images are computed on-demand using servers from the University of Hamburg and from Daylight. I didn't want to show everything at once since that would put a huge demand on those servers. Instead, you'll need to press the "Toggle images" button in order to see the SMARTS and the graphical depiction of the matches.
Comments
If you have comments or questions, leave them here.
fmcs - find the MCS of a set of compounds #
- MCS background
- fmcs - find the MCS of a set of compounds
- Finding the MCSes for the ChEBI ontology
My new MCS program is named "fmcs." It uses RDKit for the cheminformatics and is is available under the BSD 2-clause license. Many thanks to Roche for funding the project and making it available under a free license. Their hope and mine is that fmcs will be included someday as part of RDKit. For now, you can download it from the Bitbucket site. My hope is that people will fund me to continue developing the program.
Show me an example
I'll start with a set of 3669 compounds which I know have a benzotriazole core. I know this because that data set come from doing a substructure search. If you want to try this yourself, the SD file is part of the fmcs distribution.
% fmcs sample_files/benzotriazole.sdf --verbose Loaded 3669 structures from sample_files/benzotriazole.sdf [#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2 9 atoms 10 bonds (complete search) Total time 5.89 seconds: load 2.76 fragment 2.66 select 0.42 enumerate 0.06 (MCS found after 3.13)The core result is the SMARTS pattern on the second line, which represents the common substructure. The first and third line show text sent to stderr when the "--verbose" flag is used. In this case, it took 2.76 seconds to load the 3669 structures, 2.66 seconds to remove atom-bond-atom patterns which aren't in all of the structures, 0.42 seconds to select the reference structure, and only 0.06 seconds to do the actual MCS enumeration algorithm. (It's so fast for this case because the preprocessing stage was enough to identify the MCS.)
Most people can't look at a SMARTS like "[#7]:1:[#7]:[#7]:[#6]:2:[#6]:[#6]:[#6]:[#6]:[#6]:1:2" and easily figure out what it means, so I'll use Karen Schomburg's excellent SMARTSviewer to render it graphically:
as well as RDKit's image renderer to show the match to the first 10 structures of that data file.
Alternate atom and bond comparisons
ignore atom and bond types
By default, the subgraph matches atoms by element and bonds by bond type. There are other options. The "--compare topology" option lets any atom match any other atom, and any bond match any other bond.
% fmcs sample_files/ar_clustered_3D_MM_3.sdf [#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6]-[#6](-[#6]-[#6]-[#6]-[#6])-[#6] 14 atoms 13 bonds (complete search) % fmcs sample_files/ar_clustered_3D_MM_3.sdf --compare topology [*]~1~[*]~[*]~[*]~[*]~2~[*]~[*]~[*]~3~[*](~[*]~1~2)~[*]~[*]~[*]~1~[*]~[*]~[*]~[*]~1~3 17 atoms 20 bonds (complete search) [c-2ec28d18-74736162:~/cvses/fmcs] dalke%Here is a graphical comparison of those two patterns:


That's very interesting. It looks like there's a common substructure, based on the topology, but not when using on the default match parameters. Why is this?
It's a matter of bond perception. Some of the rings found by the topology search are aromatic in some structures but not in others, and some of the bonds are double in one and single in others. I know I gave PubChem a SMARTS pattern to find those structures. I wonder how it ended up like it did.
Match atom types but ignore bond types
The "--compare topology" flag is a shortcut for "--atom-compare any --bond-compare any". You can chose different options for comparing atom types and bond type. For example, if you want to compare atoms by element and ignore bond types then you can do "--atom-compare elements --bond-compare any".
Coming up with a good example of when to use it is left to the student as an exercise. (Let me know the answer when you figure it out, so I can include it here.)
Bond modifiers
Chemists like rings. That's because rings are important. A problem with MCS is that sometimes it matches a long chain of atoms with a strange routing through a ring system. Chemists don't always like that.
If you're one of those people, use the "--ring-matches-ring-only" flag. This make ring atoms match only ring atoms and chain atoms only match chain atoms.
Even that might be too generous. Sometimes chemists really want to see a complete ring in the MCS, and not just a partial one. That is, if the MCS includes a bond which maps to a ring bond in the original structures, then the MCS bond must also be in a ring.
If that's what you want, use the "--complete-rings-only" option.
When I figure out good uses cases for these, I'll update the primary documentation. (I'm told that they are important, but I don't have a good example for you to understand why.)
All that, and more!
I'm still working on the documentation for the newest features, added during the last couple of weeks. You can hijack the "isotopes" atom matcher to define your own atom classes, and specify those classes through a tag in the SD file. Instead of showing the SMARTS as the result, you can see the fragment as SMILES, or you can save the results to an SD file.
Comments
Do you have any comments or questions?
MCS background #
For the last couple of months I've been working on a new MCS implementation, named fmcs. MCS is short for "Maxium Common Substructure", and is the cheminformatics term for the maximum common subgraph isomorphism problem. I'll be posting a set of essays about it, so I'll start with a background piece.
I thought about going into the details of what that means, but the detailed description was too boring. Here's the short version: given two structures, find the largest substructure which is common to both. These can be connected or disconnected substructures; I'm only concerned with connected substructures. Even then there are still variations; does "maximum" mean to maximize the number of atoms in the subgraph, the number of bonds, the number of cycles, or perhaps maximize some more physical property?
There are also variations in how you define atom and bond equivalency. Two atoms might match if and only if they are of the same element type, or you can be more lenient and say that any halogen matches any other halogen. You can say that bonds only match bonds of the same bond type, or that aromatic bonds are allowed to match single or double bonds, or that ring bonds are only allowed to match other ring bonds.
You can also make requirements on overall structure properties, like if a ring bond is part of the MCS then it must also be a ring in the MCS.
That said, the most common MCS is to say that atoms are the same if and only if the element numbers are the same, and bonds are the same if they have the same bond type (single, double, triple, and aromatic types never match each other.) Here's an example of that MCS between acesulfame and saccharin:

(Image thanks to Daylight's
'depictmatch.cgi.')
Yo dawg, I heard you like NP-hard problems ...
The MCS problem is NP-hard. As the number of atoms increases, the run-time should tend to increase exponentially. An advantage to chemical graphs is that atoms have a limited valence; few atoms have more than 4 bonds. This makes the MCS problem a more tractable than the general case. An advantage to working in small-molecule chemistry is that N is usually in the 10s and rarely ever over 100.
So let's make the MCS problem harder (although not in the theoretical sense; it's still NP hard). Instead of finding the MCS between two compounds, find the MCS of two or more compounds. This is sometimes called the "Mulitple MCS problem."
This need often comes up during clustering, where you have have an algorithm which grouped structures together based on various properties, including experimental results. A question might be, is there a structural explanation for this grouping? One way to answer it is to look at the MCS and see if it's a significant part of the structures.
The usual solutions
The multiple MCS problem is not new. People have developed algorithms to find it for several decades. The 2002 review paper by Raymond and Willett, "Maximum common subgraph isomorphism algorithms for the matching of chemical structures", JCAMD 16: 521b lists a variety of solutions, and says "the clique-based approach has been the most prevalent technique involving MCS-based chemical structure manipulation." They mostly discuss the pairwise MCS, but there is some mention of the multiple MCS as well.
A more recent implementation paper is "MultiMCS: a fast algorithm for the maximum common substructure problem on multiple molecules" by Hariharan, Janakiraman, Nilakantan, Singh, Varghese, Landrum, and Schuffenhauer, J Chem Inf Model. 2011 Apr 25;51(4):788-806. Epub 2011 Mar 29. They use the clique-based approach to find all maximal common substructures between pairs of structures, and use those to find the maximum common substructure of the set.
I myself have worked on the multiple MCS problem before, when I was a Bioreason employee in the late 1990s. There I used a back-tracking solution based on "Backtrack Search Algorithms and the Maximal Common Subgraph Problem" by McGregor, Software-Practice and Experience, vol. 12, 23-34 (1982), and also informed by "An algorithm for the multiple common subgraph problem", Bayada, Simpson, Johnson, and Laurenco, J Chem Inf Model. 1992 v32, 680-685. It too enumerated maximal pair-wise solutions to find the MCS for the entire set. We used it for a clustering algorithm based on substructure similarity.
"Maximal" common subgraphs?
You need to be a bit careful when doing a pairwise MCS. It's easy to think that the MCS for the entire set must be a substructure of the MCS for any two pairs, but it's almost as easy to come up with a counter-example. Consider: CCCSSNNN CCCPPPSS NNNPPPOSS. The MCS between the first two structures is "CCC", between the first and third is "NNN", and between the second and third is "PPP" while the MCS between all three is "SS".
Instead, find all the "maximal" common subgraphs between a pair. These are common subgraphs where it's impossible to add a bond to the subgraph and still be common in both graphs. Any MCS must be part of at least one of the maximal subgraphs between a pair of structures. Once you have a maximal common substruture, find the maximal common substructures between that and the next molecule. Keep applying the MCS to successive pairs until done.
One of the nice advantages of this approach is that you get a partial answer early. If you've gotten to the last pairing and ended up with a common substructure for the entire set containing 6 atoms and 6 bonds, then you can quickly reject any future maximal common substructures which are smaller than that. This is called "branch and bound." The algorithm explores different branches (in this case, different maximal common substructure pairings) and sets a bound which can be used to "prune" the branch, that is exclude further searching along a branch.
That was then, this is now
That was the general approach I did in the late 1990s. It took a lot of mental strain and concentration for about 6 weeks, but at the end, it did work. I even released the implementation as PyMCS, under a Python-like license. You can see remnants of the documentation thanks to Archive.org, but no one seems to have access to a copy of it any more.
While it worked, there were some things I didn't like about it. Molecules with lots of local symmetry cause problems, because there are a large number of ways to get maximal common substructure mappings out of it. There are, after all, 12 ways to map a benzene ring to itself. Also, my implementation used threads, where each thread generated the maximimal common substructures for a pair. These days I would use a generator.
Instead, I came up with an alternate solution, based out of ideas I've been thinking of regarding subgraph enumeration. Enumerate all of the subgraphs of a reference structure, convert each subgraph into a subgraph isomorphism query, and test if the subgraph exists in the other structures. The largest such match is the MCS.
There can be exponentially many subgraphs in a structure, but there are ways to make it faster. Again, use a branch and bound technique. If a subgraph doesn't exist in all of the structures then there's no way that any containing subgraph can exist in all structures. So if you use a growth-based method to enumerate the subgraphs, you can test each subgraph and reject the ones which don't work, as well as any further growth in that direction.
Another optimization, also available in RASCAL and MultiMCS, is to check the size of the remaining search space. If the current size, plus the maximum number of atoms or bonds remaining, isn't better than the current best match, then there's no need to keep on searching.
This ends up finding all of the common substructures, and not just the maximal ones. There's a few advantage to this approach. You only need to find one substructure match for each structure, not all of them. Also, subgraph isomorphism tests are well-optimized core parts of a cheminformatics toolkit, so that's work I don't have to do.
Everything old is new again
I thought that this was a new algorithm. I was mostly wrong. It turns out that Takahashi, Satoh, and Sasaki published this idea in "Recognition of largest common structural fragment among a variety of chemical structures" Anal. Sci. 1987, 3, 23b
I didn't read this paper until this evening. It's really neat to see the similarities between their ideas and mine, as well as the differences. They omit a pre-processing step which removes obvious bond mismatches, and they prefer checking all subgraphs of size N before checking subgraphs of size N+1. I'll need to mull it over, and dig up the enumeration method of Varkony, Shiloach, and Smith in "Computer-assisted examination of chemical compounds for structural similarities", J. Chem. Inf. Comput. Sci. 1979, 19 (2), 104bI can make a more comprehensive comparison. Grrr! And that's only available from behind the ACS paywall.
In that paper they report that the MCS between acesulfame and saccharin, in "level 1" (which corresponds to my "topology" option) took their FORTRAN 77 program 21.15 CPU seconds on a Data General ECLIPSE-MV/6000. My code on my desktop takes 0.010 seconds, or 2000 times faster. With 25 years between us, it's meaningless to make any comparison of these other than the normal exclamation that computers have gotten a lot faster. BTW, my PyMCS code from the late 1990s, using larger structures, took about 10 seconds on average.
Curiously, Raymond and Willett categorize this as an "ad hoc procedure." "This group of algorithms represents a diverse set of methods that have typically been designed specifically to fulfill an immediate need for a particular application without much regard to general or wide-scale usage requirements." Based on my reading of the Takahashi algorithm, it is applicable to general and wide-scale usage requirements. I wonder how they drew that conclusion.
Comments
Do you know what Raymond and Willett commentes as they did? Have you implemented the enumeration method of Varkony? Do you know the binary language of moisture ... ooh, sorry, got distracted. Feel free to leave a comment or ask questions about MCS algorithms and implementations.
Python's concurrent.futures #
In this essay I'll describe how to use the concurrent.futures API from Python 3.2. Since I'm still using Python 2.7, I'll use Alex Grönholm's back port instead.
PEP 3148 gives the motivation for the new concurrent module:
Python currently has powerful primitives to construct multi-threaded and multi-process applications but parallelizing simple operations requires a lot of work i.e. explicitly launching processes/threads, constructing a work/results queue, and waiting for completion or some other termination condition (e.g. failure, timeout). It is also difficult to design an application with a global process/thread limit when each component invents its own parallel execution strategy.Basically, using "threading" and "multiprocessing" are harder than they should be.
The guiding problem: analyze web logs
My web site archives the daily server logs. Filenames are of the form "www.20120115.gz". Each access is a single line in "combined log format." Here's an example line:
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"It contains the host IP address, date, URL path, referrer information, user agent, and a few more fields.
I have 169 files which I want to analyze. gzcat *.gz | wc -l says there are 1,346,595 records. I'll use this data set to show some examples of how to use concurrent.futures.
Number of accesses per day (single-threaded)
For the start, how many log events are there per day?
import glob
import gzip
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
print filename.split(".")[1], num_lines
Note: gzip files didn't support context managers until Python 2.7. If
you are on Python 2.6 then you'll get the error message
AttributeError: GzipFile instance has no attribute '__exit__'When I run that I get output which looks like:
20110801 7305 20110802 7594 20110803 7470 20110804 7348 20110805 7504 20110806 4774 20110807 4870 20110808 9815 ... 20120113 18124 20120114 9245 20120115 8100 20120116 14117That's too detailed, and hard to interpret. A graph would be nicer. Here it is:
I seem to get more people during the work week than the weekend, and one of my other essays got on Hacker News in early January.
I made that plot using the matplotlib's "pylab" API:
import glob
import gzip
from pylab import *
import datetime
dates = []
counts = []
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
dates.append(date)
counts.append(num_lines)
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
That code is a bit ugly, so I'll clean it up a bit and conveniently put it into a form which helps transition to the parallelization code:
import glob
import gzip
import datetime
import time
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
for filename in filenames:
date, count = count_lines(filename)
dates.append(date)
counts.append(count)
## Believe or not, but this next line does the same as the previous block(!)
# dates, counts = zip(*(count_lines(filename) for filename in filenames))
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
It's slow. Make it faster!
That code takes 5.5 seconds to read the 1.3 million lines. I have a four core machine - surely I can make better use of my hardware!
I'll start with multiple threads. Python has supported threads since the 1990s, but as we all know, CPython has the Global Interpreter Lock which prevents multiple threads from running Python code at the same time. On the other hand, this task is doing file I/O, and gzip uncompression in code which might release the GIL. Perhaps threads will work here?
I'll use a very standard approach. I'll define a set of jobs, and pass that over to a thread pool. Each job takes a filename to process as input, calls the function "count_lines", and returns the timestamp and number of lines in the file.
Here's how you do that with the concurrent.futures API:
import glob
import gzip
import datetime
from concurrent import futures
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
with futures.ThreadPoolExecutor(max_workers=2) as executor:
for (date, count) in executor.map(count_lines, filenames):
dates.append(date)
counts.append(count)
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
The "ThreadPoolExecutor" creates a thread pool, in this case with two
workers. You can submit as many jobs as you want to this thread pool,
but only two (in this case) will be processed at a time. The thread
pool is also a context manager, and no more jobs can be submitted once
the context is finished.
How are jobs submitted? You can either submit a job using submit() or you can submit a number of jobs using the "map() " idiom, which is what I did here. Remember, this is a Python 3.x API so map() returns an iterator, and not a list like it does in Python 2.x.
What is "map"?
The term "map" comes from functional programming, but functional programming is not emphasized in the Python language. Instead, we more often use a list or generator comprehension, or build a list manually. The following three methods are equivalent:
>>> print [ord(c) for c in "Andrew"] [65, 110, 100, 114, 101, 119]
>>> print map(ord, "Andrew") [65, 110, 100, 114, 101, 119]
>>> result = [] >>> for c in "Andrew": ... result.append(ord(c)) ... >>> print result [65, 110, 100, 114, 101, 119]So "map(count_lines, filenames)" is a roughly the same as:
for filename in filenames:
yield count_lines(filename)
and "executor.map" does the same thing, only it uses a thread in the
thread pool to evaluate the function.
Also, to switch the above code to its almost exact single-threaded version, what you can do is get the Python 2.x iterater version of "map" (in itertools.imap) and rewrite the above as:
import itertools
...
for (date, count) in itertools.imap(count_lines, filenames):
dates.append(date)
counts.append(count)
But is it faster?
No. ;)
With one thread in the thread pool, the task takes 5.5 seconds. The overall time is unchanged from the unthreaded version, as we should expect.
With two worker threads, it takes 7.0 seconds - even longer than with one thread!
Three worker threads takes 7.3 seconds, and four threads takes 7.4. This is not a trend you want to see when you need to parallelize your software.
There are two likely candidates for the slowdown. The GIL is the obvious one, but perhaps my computer doesn't handle parallel disk I/O that well.
What about multiple processes?
What I'll do is switch from the multi-threaded version to the multi-processing version. Instead of using a thread pool, I'll have a process pool, which uses interprocess communications to send the job request to each process and get the results:
import glob
import gzip
import datetime
from concurrent import futures
def count_lines(filename):
with gzip.open(filename) as f:
num_lines = sum(1 for line in f)
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
filenames = glob.glob("www_logs/www.*.gz")
dates = []
counts = []
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for (date, count) in executor.map(count_lines, filenames):
dates.append(date)
counts.append(count)
from pylab import *
plot(dates, counts)
ylim(0, max(counts))
title("My website accesses")
show()
Did you see the difference? I used a "ProcessPoolExecutor" instead of a "ThreadPoolExecutor".
With that small change, a process pool with only one worker finishes in 5.6 seconds, which is a bit slower. That's probably due to the overhead of starting a new process and sending data back and forth.
What's exciting is that two workers finishes in 3.6 seconds, three workers in 2.8 seconds, and four workers in 2.6 seconds. It's obviously not great speedup (perfect scaling would be 5.5, 2.3, 1.8, and 1.1 seconds), but I end up cutting my time in half with relatively little work.
Faster, please
At this point it's safe to assume that most of the gzip+line count code requires the GIL. A quick look at "gzip.py" tells me that, yes, that is the case.
With some non-trivial effort, I could write a specialized C extension to replace the gzip module. That's overkill for this project. Instead, my computer has the usual unix utilities so I'll rewrite the "count_lines" function and let them them do the work instead.
import subprocess
def count_lines(filename):
gzcat = subprocess.Popen(["gzcat", filename],
stdout = subprocess.PIPE)
wc = subprocess.Popen(["wc", "-l"],
stdin = gzcat.stdout,
stdout = subprocess.PIPE)
num_lines = int(wc.stdout.readline())
date = datetime.datetime.strptime(filename, "www_logs/www.%Y%m%d.gz")
return (date, num_lines)
Using this version, my single-threaded time is 3.2 seconds, with two
threads it's 2.0 seconds, three threads is 1.8 seconds, and four
threads is 1.7 seconds.
The respective times with the process pool are 3.3 seconds, 2.1 seconds, 1.8 seconds and 1.8 seconds. This means that very little time in either of these cases is spent in the GIL, and the slightly slower multiprocess times likely reflects extra cost of starting a process and doing interprocess communications (IPC).
What are the top URLs on my site?
Okay, I admit that the previous section was overkill, but it's fun sometimes to try out and compare different alternatives.
I want to mine my logs for more information. What are the top 10 most downloaded URLs?
This is the perfect situation for Python's Counter container. This was added in Python 2.7; see that link for how to support older versions of Python.
I'll start with the simplest single-threaded version; remember that a line in the log file looks like:
198.180.131.21 - - [25/Dec/2011:00:47:19 -0500] "GET /writings/diary/diary-rss.xml HTTP/1.1" 304 174 "-" "Mozilla/5.0 (Windows NT 5.1; rv:8.0) Gecko/20100101 Firefox/8.0"The following analysis code:
import glob
import gzip
from collections import Counter
counter = Counter()
for filename in glob.glob("www_logs/www.*.gz"):
with gzip.open(filename) as f:
for line in f:
# Extract the path field from the log string
request = line.split('"')[1]
path = request.split(" ")[1]
counter[path] += 1
for path, count in counter.most_common(10):
print count, path
takes 8.9 seconds to generate this listing:
170073 /favicon.ico 93354 /writings/diary/diary-rss.xml 81513 /dss.css 78961 /images/toplogo_left.gif 78655 /images/spacer.gif 78526 /images/toplogo_right.gif 74223 /images/news_title.gif 26528 / 25349 /robots.txt 16962 /writings/NBN/python_intro/standard.cssThat's really not exciting information. In a bit, I'll have it only display counts for the information.
A concurrent.futures version
We've determined that Python's gzip reader uses the GIL, so it's pointless to parallelize the above code using threads.
There's another issue. The "counter" is a global data structure, and that can't be shared across multiple Python processes. I'll have to update the algorithm somewhat. I'll let each worker function process a file and create a new counter for that file. Once it's done, I'll send the counter instance back to the main process for further processing.
Here's a worker function which does that.
def count_urls(filename):
counter = Counter()
with gzip.open(filename) as f:
for line in f:
request = line.split('"')[1]
path = request.split(" ")[1]
counter[path] += 1
return counter
The code in the main process has to kick off all of the jobs, collect the counters from each file, merge the counters into one, and report the top hits. The new(ish) Counter object helps make this easy because the "update()" method sums the values for shared keys instead of replacing like it would for a dictionary.
merged_counter = Counter()
filenames = glob.glob("www_logs/www.*.gz")
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for counter in executor.map(count_urls, filenames):
merged_counter.update(counter)
for path, count in merged_counter.most_common(10):
print count, path
(You might be asking "How does it exchange Python objects?" Answer:
Through pickles.)
The above runs in 4.4 seconds, so about 1/2 the time as the single processor version. And after I fixed a bug (I used "counter" my report, not "merged_counter"), I got identical values as the single-threaded version.
4.4 seconds is pretty good. As we saw before, Python's gzip reader is not as fast as calling out to gzcat, so I decided to use a Popen call instead. Also, I changed the code slightly so it only reports paths which end with ".html".
The final code runs in 3.2 seconds, and here it is:
from collections import Counter
from concurrent import futures
import glob
import gzip
import itertools
import subprocess
def count_urls(filename):
counter = Counter()
p = subprocess.Popen(["gzcat", filename],
stdout = subprocess.PIPE)
for line in p.stdout:
request = line.split('"')[1]
path = request.split(" ")[1]
if path.endswith(".html"):
counter[path] += 1
return counter
filenames = glob.glob("www_logs/www.*.gz")
merged_counter = Counter()
with futures.ProcessPoolExecutor(max_workers=4) as executor:
for counter in executor.map(count_urls, filenames):
merged_counter.update(counter)
for path, count in merged_counter.most_common(10):
print count, path
It tells me that the 10 most popular HTML pages from my site are
15830 /Python/PyRSS2Gen.html 13722 /writings/NBN/python_intro/command_line.html 11739 /writings/NBN/threads.html 10663 /writings/NBN/validation.html 6635 /writings/diary/archive/2007/06/01/lolpython.html 4525 /writings/NBN/writing_html.html 3756 /writings/NBN/generators.html 3465 /writings/NBN/parsing_with_ply.html 2958 /writings/diary/archive/2005/04/21/screen_scraping.html 2786 /writings/NBN/blast_parsing.html
Resolving host names from IP addresses
My logs contain bare IP address. I'm curious about where they come from. I write about cheminformatics; are any computers from pharma companies reading my pages? To do that, I need a fully qualified domain name for each IP address. Moreover, I want to save the IP address to domain name mapping so I can use it in other analyses.
Here's how to get the FQDN given an IP address as a string.
>>> import socket
>>> socket.getfqdn("82.94.164.162")
'dinsdale.python.org'
DNS lookups take a surprisingly long time; 0.2 seconds on my desktop, and I understand this is typical. Since I have 117,504 addresses, that may take a few hours. On the other hand, all of that time is spent waiting for the network to respond. This is easily parallelized.
socket.getfqdn() is single-threaded on a Mac
I tried at first to use multiple threads for this, but that didn't work. No matter how many threads I used, the overall time was the same. After a wild goose chase where I suspected that my ISP throttled the number of DNS lookups, I found the problem.
The getfqdn function is a thin wrapper to socket.gethostbyaddr(), which itself is a thin layer on top of the C function "gethostbyaddr()". In most cases, the underlying API may only be called from a single thread. A common solution is to implement a reentrant version, usually named "gethostbyaddr_r", but the OS X developers decided that people should use a different API for that case. ("getaddrinfo ... is a replacement for and provides more flexibility than the gethostbyname(3) and getservbyname(3) functions".) The Python module only calls the single-threaded code, and uses a lock to ensure that only one thread calls it at a time.
The problem is easily solved by using a process pool instead of a thread pool.
Extracting IP addresses from a set of gzip compressed log files
The first step is to get the IP addresses which I want to convert. I only care about unique IP addresses, and don't want to waste time looking up duplicates. The code to extract the IP addresses is straight-forward. Reading the compressed file is not the slow part, so there's no reason to parallelize this or use an external gzip process to speed things up.
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
I don't want to process the entire data set during testing and
debugging. The above returns an iterator, so I use itertools.islice to
get a section of 100 terms, also as an iterator:
filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)
(I started with the range (0, 1000), but then ran into the
gethostbyaddr reentrancy problems. I didn't want my computer to do a
simple local cache lookup, so I change the range to (1000, 1100), then
(1100, 1200) and so on. This show that it took me a while to figure
out what was wrong!)
Using "executor.submit()" instead of "executor.map"
How am I going to do the parallel call? I could do a simple
with ProcessPoolExecutor(max_workers=20) as executor:
for fqdn in executor.map(socket.getfqdn, ip_addresses):
print fqdn
but then I lose track of the original IP address, and I wanted to
cache the IP address to FQDN mapping for later use. While it might be
possible to use a combination of itertools.tee and itertools.izip, I
decided that "map" wasn't the right call in the first place.
The executor's "map" function guarantees that the result order will be the same as the input order. I don't care about the order. Instead, I'll submit each job using the "submit()" method.
with futures.ProcessPoolExecutor(max_workers=10) as executor:
jobs = []
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
jobs.append(job)
The submit function returns a "concurrent.futures.Future"
object. For now, there are two important things about it. You can ask
it for its "result()", like this:
ip_addr, fqdn = job.result()The "result()" method blocks until the Promise has a result. Blocking is bad for performance, so how do you know which job promises are actually ready? Use "concurrent.futures.as_completed()" for that:
for job in futures.as_completed(jobs):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
The last part to this puzzle is to have the actual job return the two
element tuple with both the input IP address and the resulting FQDN
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
Put it all together and the code is:
from concurrent import futures
import glob
import gzip
import itertools
import socket
import time
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
filenames = glob.glob("www_logs/www.*.gz")
ip_addresses = itertools.islice(get_unique_ip_addresses(filenames), 1800, 1900)
with futures.ProcessPoolExecutor(max_workers=20) as executor:
jobs = []
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
jobs.append(job)
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
This processes 100 IP addresses in about 2-8 seconds. The actual time
is highly dependent on DNS response times from servers around the
world. To reduce the variability, I increased the number of IP
addresses I used for my measurements. I found that with 20 processes I
could do about 50 lookups per second, and with 50 processes I could do
about 90 lookups per second. I didn't try a higher number.
Use a dictionary of futures instead of a list
What I did seems somewhat clumsy in that I send the IP address to the process, and the process sends the IP address back to me. I did that because it was easy. The module documentation shows another technique.
You can keep the jobs in a dictionary, where the key is the future object (returned by "submit()"), and its value is the information you want to save. That is, you can rewrite the above as:
with futures.ProcessPoolExecutor(max_workers=20) as executor:
jobs = {}
for ip_addr in ip_addresses:
job = executor.submit(socket.getfqdn, ip_addr)
jobs[job] = ip_addr
# Get the completed jobs whenever they are done
for job in futures.as_completed(jobs):
ip_addr = jobs[job]
fqdn = job.result()
print ip_addr, fqdn
Notice how it doesn't need the "resolve_fqdn" function; it can call
socket.getfqdn directly.
Add a callback to the job future
The conceptual model so far is "create all the jobs" followed by "do something with the results." This works well, except for latency. I only processed 100 IP addresses in my example. I removed the "islice()" call and asked it to process all 117,504 IP addresses in my data set. The code looked like it wasn't working because it wasn't giving output. As it turned out, it was still loading all of the jobs.
The concurrent module uses an asynchronous model, and just like Twisted's Deferred and jQuery's deferred.promise(), there's a way to attach a callback function to a future, which will be called once the answer is ready. Here's how it works:
with futures.ProcessPoolExecutor(max_workers=50) as executor:
for ip_addr in ip_addresses:
job = executor.submit(resolve_fqdn, ip_addr)
job.add_done_callback(print_mapping)
When each job future is ready, the concurrent library will call the
"print_mapping" callback, with the job result as its sole parameter:
def print_mapping(job):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
Technical notes: The callback occurs in the same process which
submitted the job, which is exactly what's needed here. However, the
documentation doesn't say that all of the callbacks will be done from
the same thread, so if you are using a thread pool then you probably
want to use a thread lock around a shared resource. (sys.stdout is a
shared resource, so you would need one around the print statement
here. I'm using a process pool, and the concurrent process pool
implementation uses a single local worker thread, so I don't think I
have to worry about contention. You should verify that.)
Here is the final callback-based code:
from concurrent import futures
import glob
import gzip
import socket
def get_unique_ip_addresses(filenames):
# Report only the unique IP addresses in the set
seen = set()
for filename in filenames:
with gzip.open(filename) as gzfile:
for line in gzfile:
# The IP address is the first word in the line
ip_addr = line.split()[0]
if ip_addr not in seen:
seen.add(ip_addr)
yield ip_addr
def resolve_fqdn(ip_addr):
fqdn = socket.getfqdn(ip_addr)
return ip_addr, fqdn
## A multi-threaded version should use create a resource lock
# import threading
# write_lock = threading.Lock()
def print_mapping(job):
ip_addr, fqdn = job.result()
print ip_addr, fqdn
## A multi-threaded version should use the resource lock
# with write_lock:
# print ip_addr, fqdn
filenames = glob.glob("www_logs/www.*.gz")
with futures.ProcessPoolExecutor(max_workers=50) as executor:
for ip_addr in get_unique_ip_addresses(filenames):
job = executor.submit(resolve_fqdn, ip_addr)
job.add_done_callback(print_mapping)
It processed my 117,504 addresses in 1236 seconds (about 21 minutes),
which means a rate of 95 per second. That's much better than my
original rate of 5 per second!
functools.partial
By the way, just like earlier, there's no absolute need for the worker function to return the ip address. I could have written this as:
job = executor.submit(socket.getfqdn, ip_addr) job.add_done_callback(functools.partial(print_mapping, ip_addr))or even as an ugly-looking lambda function with a default value to get around scoping issues.
In this variation, print_mapping becomes:
def print_mapping(ip_addr, job):
fqdn = job.result()
print ip_addr, fqdn
where the "ip_addr" was stored by the "partial()", and where "job"
comes from the completed promise.
This approach feels more "pure", but I find that methods like this are harder for most people to understand.
Who subscribes to my blog's RSS feed?
A quick check of the list of hostnames shows that no one from AstraZeneca reads my blog from a work machine. Actually, I don't have any accesses from them at all, which is a bit surprising since I know some of them follow what I do. They might use a blog aggregator like Google Reader, or use a home account, or perhaps AZ's data goes through a proxy which doesn't have the name "az" or "astrazeneca" in it.
There are requests from Roche, and Vertex, but no blog subscribers. Who then subscribes to my blog?
Here I print the hostnames for requests which fetch my blog's RSS feed. With 'zgrep' it's fast enough that I'm not going to parallelize the code.
import subprocess
import glob
hostname_table = dict(line.split() for line in open("hostnames"))
filenames = glob.glob("www_logs/www.*.gz")
p = subprocess.Popen(["zgrep", "--no-filename", "/writings/diary/diary-rss.xml"] + filenames,
stdout = subprocess.PIPE)
for line in p.stdout:
ip_addr = line.split()[0]
hostname = hostname_table[ip_addr]
if hostname_table == ip_addr:
# Couldn't find a reverse lookup; ignore
continue
print hostname
A quick look at the output shows a lot of requests from Amazon and
Google, so I removed those, and report the results using:
% python2.7 readers.py | sort | uniq -c | sort -n | grep -v amazon | grep -v google.comSince I have 169 days of log file, I'll say that "avid readers" poll the URL at least once per day. That gives me:
176 modemcable139.154-178-173.mc.videotron.ca 184 62.197.198.100 187 v041222.dynamic.ppp.asahi-net.or.jp 195 modemcable147.252-178-173.mc.videotron.ca 200 94-226-195-151.access.telenet.be 202 217.28.199.236 223 123.124.21.91 223 65.52.56.128 241 5a-m02-d1.data-hotel.net 263 117.218.210-67.q9.net 263 173-11-122-218-sfba.hfc.comcastbusiness.net 274 71-222-225-175.albq.qwest.net 332 modemcable069.85-178-173.mc.videotron.ca 335 no-dns-yet.convergencegroup.co.uk 337 ip-81-210-146-57.unitymediagroup.de 338 embln.embl.de 353 cpe-72-183-122-94.austin.res.rr.com 365 90-227-178-245-no128.tbcn.telia.com 370 138.194.48.143 408 210.96-246-81.adsl-static.isp.belgacom.be 428 k8024-02l.mc.chalmers.se 490 cpe-70-115-243-212.satx.res.rr.com 493 www26006u.sakura.ne.jp 527 219.239.34.54 534 44.186.34.193.bridgep.com 535 5a-m02-d2.data-hotel.net 586 5a-m02-c6.data-hotel.net 666 168-103-109-30.albq.qwest.net 676 y236106.dynamic.ppp.asahi-net.or.jp 698 82-169-211-97.ip.telfort.nl 759 w-192.cust-7150.ip.static.uno.uk.net 1071 adsl-75-23-68-58.dsl.peoril.sbcglobal.net 1217 hekate.eva.mpg.de 1223 211.103.236.94 1307 li147-78.members.linode.com 1342 static24-72-40-170.r.rev.accesscomm.ca 1346 dinsdale.python.org 1398 145.253.161.126 1771 pat1.orbitz.net 2060 artima.com 3164 148.188.1.60 3767 90-224-169-87-no128.tbcn.telia.com 4518 jervis.textdrive.com 5791 cpe-70-114-252-25.austin.res.rr.com 6171 ip21.biogen.com 8264 it18689.research.novo.dk 10919 61.135.216.104I know who comes from one of the Max Planck Institute machines, and a big "hello!" to the readers from Novo Nordisk and Biogen - thanks for subscribing to my blog! "dinsdale.python.org" is the Planet Python aggregator, and artima.com is the Artima Developer Community; another aggregator.
I know more than 60 people read my blog posts within 12 hours of when they are posted, so this tells me that most people read blogs through a web-based aggregator (like Planet Python or Google Reader), and not through a program running on their desktop. I'm glad to know I'm not alone in doing that!
I parallelize an algorithm #
I, with help from Kim Walisch, have been adding OpenMP support to chemfp. This is the first time I've used OpenMP, and I'm pleased to say there are places where it works really well. However, OpenMP, like every other parallization strategy, is no global panacea. It can take a lot of effort to get good scaling, and there are cases where it doesn't feel any easier to use OpenMP than to use pthreads (POSIX threads).
In this essay I'll walk through how I converted a single-threaded algorithm to OpenMP, and compare the results to a version built on an Python async I/O library atop of pthreads.
The single-threaded algorithm
Suppose you have a list of N objects - let's call them "fingerprints" - and a function which compares two fingerprints - call it "tanimoto" - which returns a similarity score value from 0.0 to 1.0. A score of 0.0 means "not similar" and 1.0 means "very similar." The similarity of a fingerprint compared with itself is always 1.0, and the tanimoto function is symmetric, so that tanimoto(x, y) == tanimoto(y, x).
One question you can ask is "which fingerprints are within threshold similarity to the tenth fingerprint?" I'll use the term "neighbor" to include any fingerprint which at least threshold similar to a given fingerprint, so I can restate the above as "what are the threshold neighbors of the tenth fingerprint.
The code for this is not hard:
fingerprint_t fingerprints[] = {array of fingerprint objects};
int query_index = 9; // The tenth fingerprint
assert(0.0 <= threshold && threshold <= 1.0);
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
if (query_index != target_index) {
printf("%d is a neighbor\n", target_index);
}
}
}
The main subtlety is the check that I don't report that the first
fingerprint is a neighbor of itself. There are a few ways to handle
that case: here I chose one which is optimized for performance,
assuming relatively view targets are similar enough.
Fingerprint searches are in a high-dimensional space so optimizations like k-d trees, which work for lower dimensional spaces, suffer from the curse of dimensionality. For exact answers, the best you can expect is linear performance. There are clever ways to get sublinear performance, but the worst case is still linear. Still, computers are fast, and can search 100,000 fingerprints in a blink.
Another question you can ask is, what are the neighbor counts for all of the fingerprints in the data set? Here's code which computes that:
fingerprint_t fingerprints[] = {array of fingerprint objects};
int count, query_index, target_index;
int counts[N] = {}; // initialize to 0
assert(0.0 <= threshold && threshold <= 1.0);
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
}
}
/* The counts are too high by one since it includes the diagonal term */
/* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
/* Decrement by one to get the correct answer */
counts[query_index] += count - 1;
}
What this does is go row-by-row through the NxN comparison matrix,
compute the similarities, and add up the number of times where the
similarity is high enough. Since I include the diagonal term in the
counts, and since the similarity along the diagonal is always 1.0, I
have to subtract off one after computing the total row count.
Some might consider it inelegant that I count the self-similarity in (the main loop and subtract one at the end, but it makes the code short and understandable, and while there are N extra calculations, the double loop has a total of N*N calculations, so it's only a small amount of extra work.)
Parallelizing the NxN algorithm
Parallelizing this with OpenMP is dead-simple. I ask the compiler to evalute the row loop in parallel.
fingerprint_t fingerprints[] = {array of fingerprint objects};
int count, query_index, target_index;
int counts[N] = {}; // initialize to 0
assert(0.0 <= threshold && threshold <= 1.0);
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=0; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
}
}
/* The counts are too high by one since it includes the diagonal term */
/* and tanimoto(fingerprint[i], fingerprint[i]) == 1.0 */
counts[query_index] += count - 1;
}
That's it! OpenMP is a great fit to this case. With one new line of
code, I have very good scaleup across many cores.
It's not perfect scaleup. For one, the tanimoto() calculation is fast; fast enough that memory bandwidth and cache performance is an issue. It might be faster to use Z-ordering or other cache-oblivious ordering. That's outside the scope of this essay. For that matter, I hadn't tested this hypothesis because I use another technique which usually gives good cache behavior for the situations I'm most concerned about.
What about symmetry?
That easy parallelization is great, right? Well, I'm missing out on a simple factor of two speedup. The tanimoto function is symmetric, so I only need to compute the upper triangle terms. Here's the single-threaded implementation:
for (query_index=0; query_index<N; query_index++) {
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
counts[query_index]++;
counts[target_index]++;
}
}
}
It looks simple. Too bad it doesn't parallelize well. Increment is not
an atomic operation. If multiple threads execute "counts[i]++" at the
same time, for the same value of i, then it might be that thread 1
reads a value of 4, thread 2 reads a value of 4, thread 1 writes the
incremented value 5, and thread 2 writes its own incremented value of
5. This is bad.
One solution is to tell OpenMP that the increment code is a "critical" section, which means only one thread can execute it at a time. The resulting code is:
#pragma omp parallel for private(target_index)
for (query_index=0; query_index<N; query_index++) {
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
#pragma omp critical (add_count)
counts[query_index]++;
#pragma omp critical (add_count)
counts[target_index]++;
}
}
}
Here, 'add_count' is the symbolic name for a global lock to a critical
section.
I wrote something like this with a very high threshold, and found and almost perfect two-fold speedup. Go OpenMP!
Amdahl's Law strikes again!
The problem is the critical sections are single-threaded. When I lower the threshold, I find more matches, and more threads try to run the single threaded code. This runs directly into Amdahl's Law. The critical section becomes the limiting factor as all the threads contend for the same lock.
I can reduce the contention a bit by keeping track of the row counts in a thread-local variable:
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
#pragma omp critical (add_count)
counts[target_index]++;
}
}
/* Correction on 2012-01-12; Commenter scott_s on Hacker News pointed out */
/* that only one thread will ever get here with a given query_index. */
/* There is no chance that multiple threads try to change the same value */
/* This critical section can be removed without affecting correctness. */
/* I leave this code here because it affects the timings. */
/* It does not affect the conclusion that lock contention can make things slow */
#pragma omp critical (add_count)
counts[query_index] += count;
}
This is the simplest seemingly-reasonable parallization of the
upper-triangle algorithm.
How well does this work? My desktop has four cores. I can compare the performance of the original non-symmetric code with the symmetric one:
| Algorithm | Tanimoto thresholds | ||
|---|---|---|---|
| 0.8 | 0.6 | 0.5 | |
| symmetric | 40s | 151s | 660s |
| non-symmetric | 82s | 170s | 207s |
That's terrible. I have one algorithm which is best when there are few similarities, and another which is best when there are many similarities, and because the number of similarities is highly data-dependent, I don't have an easy way to figure out which algorithm to use.
Use many critical sections instead of one
The problem is that my four cores all want to use a single critical section. When one core has the lock, the other threads have to wait. What I can do is increase the number of critical sections. For example, I can have one lock to get access to the even-numbered rows, and another lock to get access to the odd-numbered rows. Here's the corresponding code:
#pragma omp parallel for private(count, target_index)
for (query_index=0; query_index<N; query_index++) {
count = 0;
for (target_index=query_index+1; target_index<N; target_index++) {
if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) {
count++;
switch (target_index % 2) {
case 0:
#pragma omp critical (add_count0)
counts[target_index]++;
break;
case 1:
#pragma omp critical (add_count1)
counts[target_index]++;
break;
}
}
}
/* Correction on 2012-01-12; Commenter scott_s on Hacker News pointed out */
/* that only one thread will ever get here with a given query_index. */
/* There is no chance that multiple threads try to change the same value */
/* The following could be replaced with a simple */
/* counts[query_index] += count; */
/* I leave this code here because it affects the timings. */
/* Since this is only called O(N) (instead of O(N*N/2) times), it */
/* should have minimal effect on the overall time. */
switch (query_index % 2) {
case 0:
#pragma omp critical (add_count0)
counts[query_index] += count;
break;
case 1:
#pragma omp critical (add_count1)
counts[query_index] += count;
break;
}
}
}
That's clumsy, but the performance is a bit better. With two critical
sections the times are:
| Algorithm | Tanimoto thresholds | ||
|---|---|---|---|
| 0.8 | 0.6 | 0.5 | |
| symmetric | 39s | 114s | 375s |
| non-symmetric | 82s | 170s | 207s |
What about even more critical sections? I tried a range of values. Here's the table:
| number of critical sections | Tanimoto thresholds | |||||
|---|---|---|---|---|---|---|
| 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 | |
| 1 | 40 | 151 | 660 | |||
| 2 | 39 | 114 | 375 | over 37 minutes | ||
| 16 | 40 | 86 | 133 | 299 | ||
| 64 | 41 | 84 | 105 | 137 | 271 | 307 |
| 128 | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
As you can see, with 128 critical sections and in the worst case, my code which takes advantage of symmetry is the same speed as the one which doesn't. This likely means that the code acquire and release the critical section lock has about the same amount of overhead at the Tanimoto similarity calculation.
I probably should have tried 256 different locks, but I think this code is ugly enough as it is, and very few people use thresholds below 0.6, much less down to 0.2. I do wonder how the times compare if there are, say, 16 cores, but this it's time to try a different solution.
What about per-thread count arrays?
There is an alternate solution. I could sum up the counts in individual, private/per-thread arrays and merge the final counts. Here's what the code looks like:
/* Correction 2012-01-08: Originally I had omp_get_num_threads() here but */ /* as acq points out on Hacker News, this only returns the number of active */ /* threads while in a parallel section. Otherwise it returns 1. I fixed my */ /* actual code during testing, but that fix didn't make its way over here. */ /* I also tried using omp_get_max_threads() but that wasn't supported on my Mac. */ int *parallel_counts = (int *) calloc(omp_get_max_threads() * N, sizeof(int)); int *per_thread_counts; #pragma omp parallel for private(count, target_index, per_thread_counts) for (query_index=0; query_index<N; query_index++) { per_thread_counts = parallel_counts + (N * omp_get_thread_num() ); count = 0; for (target_index=query_index+1; target_index<N; target_index++) { if (tanimoto(fingerprint[query_index], fingerprint[target_index]) >= threshold) { count++; per_thread_counts[target_index]++; } } per_thread_counts[query_index] += count; } for (query_index=0; query_index<N; query_index++) { count = 0; for (thread=0; thread<omp_get_num_threads(); thread++) { count += parallel_counts[query_index+N*thread]; } counts[query_index] += count; } free(parallel_counts);This requires no locking, and only a very small bit of sequential code which is linear in the number of fingerprints. There's more code, but this algorithm should scale better than the previous algorithm.
Here are the timings:
| Tanimoto thresholds | ||||||
|---|---|---|---|---|---|---|
| method | 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 |
| 128 critical sections | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
| per-thread counts | 40 | 83 | 100. | 116 | 135 | 137 |
And now using pthreads from Python
You might conclude this still shows a win for OpenMP. The problem is that the above is essentially identical to how I implement this algorithm using pthreads. I'm rather fond of Python's new concurrent.futures module, so I tested out a pthread-only driver to a single-threaded count function implemented in C.
import ctypes
import time
import itertools
from collections import defaultdict
import threading
import chemfp
from chemfp import futures
import _chemfp
def count_tanimoto_hits(all_counts, arena, threshold, row):
thread_id = threading.current_thread().ident
# This implements essentially:
# for (i=row; i<row+1; i++) {
# for (j=row+1; j<N; j++) {
# if tanimoto(fingerprints[i], fingerprints[j]) >= threshold {
# counts[i]++;
# counts[j]++;
# }
# }
_chemfp.count_tanimoto_hits_arena_symmetric(
threshold, arena.num_bits,
arena.start_padding, arena.end_padding, arena.storage_size, arena.arena,
row, row+1, row+1, len(arena),
arena.popcount_indices, all_counts[thread_id])
def find_counts(arena, threshold, num_threads):
# Allocate per-thread storage (based on the thread-id)
def make_empty_counts():
return (ctypes.c_int*len(arena))()
all_counts = defaultdict(make_empty_counts)
# Use a thread-pool with 4 worker threads
with futures.ThreadPoolExecutor(max_workers=4) as executor:
for row in xrange(len(arena)):
executor.submit(count_tanimoto_hits, all_counts, arena, threshold, row)
# Merge the private counts back into one total list of counts
return [sum(cols) for cols in itertools.izip(*all_counts.values())]
arena = chemfp.load_fingerprints("zinc_drug_like.fps")
chemfp.set_num_threads(1) # Don't use multiple OpenMP threads
for threshold in (0.8, 0.6, 0.5, 0.4, 0.2, 0.01):
t1 = time.time()
x = find_counts(arena, threshold, 4)
t2 = time.time()
print threshold, t2-t1, " ", sum(x)
The "chemfp.set_num_threads(1)" case bypasses the OpenMP-based code
and tells "count_tanimoto_hits_arena_symmetric" to use the simple
upper-right triangle implementation. As a result ...
| Tanimoto thresholds | ||||||
|---|---|---|---|---|---|---|
| method | 0.8 | 0.6 | 0.5 | 0.4 | 0.2 | 0.01 |
| 128 critical sections | 40 | 82 | 102 | 131 | 244 | 278 |
| non-symmetric | 82 | 170 | 207 | 240 | 272 | 280 |
| per-thread counts | 40 | 83 | 100. | 116 | 135 | 137 |
| Python/pthreads | 48 | 92 | 112 | 128 | 145 | 149 |
While I did not test it out, I expect that a corresponding C/C++ implementation would have much less performance overhead. I just don't have the experience of using C pthread API, or an async I/O library for C/C++ like Grand Central Dispatch, or C++'s new promises to try to implement that code directly in C. It really is much easier to use OpenMP than to figure out those alternate solutions for C.
Possible bad benchmark comparisons
BTW, what I ended up doing in my Python code was to define a "band" of 100 rows, and let each thread process 100 rows at a time. This should cut the overhead down from 8 seconds to 0.08 seconds, making the pthread code about comparable to the OpenMP code. I didn't test it out though, because my actual code uses a more sophisticated algorithm which also have the effect of improving cache coherency, and there's evidence that banding makes the coherency worse and causes slowdowns while waiting for memory fetches.
Unfortunately, it also looks like the analysis I did the other day had a flaw which causes the pthread benchmark to have bad memory access behavior. In short, the pthreads were randomly assigned bands to process, while the OpenMP version also gets randomly assigned bands, but all of the cores work on tasks in the same band. Hence, better coherency. (It looks like the pthread performance for one test case goes from 48 seconds with randomly assigned bands to 40 seconds with sequentially assigned bands.)I consider this a win for OpenMP. I did random assignments so I could display a progress bar. Assymmetries in the data mean that the first few bands and the last few bands are much easier to process than ones in the middle. With random band assignment, I get a much better estimate of the time to completion. Using OpenMP gives me that estimate without a noticable performance hit. With pthreads, it's much hard to get both performance and a estimate.
Conclusion
Effective parallelization with good scaleup is hard, no matter which technique you use. There are a lot of subtle issues. You need to understand how the technique works and measure your results to make sure you really do understand the problem.
My experience is that OpenMP is an effective technique to help you with your task. In a few cases, a trivially simple addition of a line of code gives instant speedup. It's more likely though that you've got some work ahead of you to make this happen.
If you're already using pthreads, Grand Central Dispatch, or some other multithreading or asyncronous I/O system, then I don't think that OpenMP adds much new. Instead, I think its biggest advantage is that you can make changes to existing code without introducing a new library API, without having to set up your own event loop/reactor, and with compiler-based thread control and syncronization primitives which make a large number of potential bugs of hand-built systems disappear.
My views on OpenMP #
In private email a correspondent observed that OpenMP makes threading very easy, but "it really seems under utilized in the community." (Here, 'community' is 'scientific programming.') I was surprised to find out that I had strong views on the topic.
OpenMP sits between several other pieces of technology, being:
- GPU computing
- cloud computing
- POSIX and other common threading libraries
The new hotness is GPUs. Wes Faler gave a presentation at the recent 28th Chaos Communication Congress on Evolving Custom Communication Protocols. He mentioned they ported C++ code over to the GPU. The unoptimized version was 7 times slower on the GPU than the CPU. However, they do many evaluations using the same function, and because there are so many compute threads in the GPU, the overall time was a factor of 7 faster. Similarly, Haque et al. showed that a 4 core desktop machine, properly tuned, was "only" about 5x slower than a GPU card.
It looks like GPU computing is currently the approach to take if you do a lot of evaluation of similar tasks, assuming you have the GPUs and programming time available. That performance (and the novel way of computing) interests people who might otherwise use OpenMP.
Cloud computing is another hotness. Alex Martelli was recently interviewed by Larry Hastings in Radio Free Python episode #2. At 33:47 Larry asked about Python's global interpreter lock and Alex's reply was:
I hate threading anyway. Multiprocessing is the way to go, and message-passing, not shared memory. That just doesn't scale. I use multithreading so I can use all of my 16 cores, or whatever is the average number of cores in a machine these days. Big furry deal. I've got a few thousand servers waiting for me in the data center and how do I use those with threading?The topic comes up several times in the ensuing discussion.
What good indeed is OpenMP, which might be used for a 16 node machine, if you're working on problems which involve 10,000 distributed servers?
Even single nodes have multiple cores these days, and a good OpenMP implemenation might help make good use of the nodes in that cloud. However, you have to compare OpenMP to traditional POSIX multithreading. OpenMP works for C/C++ and Fortran, but not for Python nor (it seems) Java, nor other languages which support pthreads. You're out of luck if you want to use OpenMP with one of those other languages.
Some things scale up wonderfully well by adding one or two OpenMP directives, but parallelism is rarely as trivial as giving a few hints to the compiler. I think that the non-trivial cases of parallelizing with OpenMP are about as much work as using pthreads, or a system like Grand Central Dispatch. I'll work through an example of doing that in my next essay.
I do believe that OpenMP scales better than these alternatives for some cases, in part because the compiler is doing the work rather than using a library API. My tests so far show that pthreads and OpenMP have about the same scaling with two processors, and I need four or more cores to show a strong OpenMP advantage.
Most desktop/laptop computers just don't yet have 8+ cores. (Alex Martelli said otherwise, but perhaps he's talking about Google's data centers.) Most people develop for their own computers, which lessens the incentive to work on good multicore scaling.
I have a four-core machine, and I'm willing to write a Python extension in C which uses OpenMP. Even then I've run into some difficulties. It took a while but I figured out how to configure Python's setup.py so it includes the right "use OpenMP" flag for each compiler. It includes a hard-coded list of compilers which do and do not support OpenMP. Also, did you know that on a Mac you must run OpenMP tasks in the main thread, and not in a pthread? Otherwise your program crashes; even when you have a single OpenMP thread! I had to figure out a workaround so I could use my library unchanged inside Django.
People are interested in OpenMP development, but some who might use OpenMP are drawn to other technologies. Some tasks are very appropriate for OpenMP, but they are almost as appropriate for other, more common technologies. OpenMP scales well, but most people don't have the hardware where OpenMP shines. Even when they do, they have to work in one of a handful of languages, and in somewhat restricted circumstances.
All these contribute to diminishing OpenMP utilization in the community.
OpenMP vs. POSIX threads #
A few years ago I heard about OpenMP. It's a form of multi-threaded programming meant to make good use of multiprocessor and multicore hardware.
Earlier this year, I read Anatomy of High-Performance 2D Similarity Calculations, which used OpenMP as part of their Tanimoto search algorithm. This summer, Kim Walisch contributed OpenMP variations to my popcount benchmark.
The changes to the code were almost trivial, so I asked Kim if he would help me add OpenMP support to chemfp. He did, and it will be part of the upcoming 1.1 release.
OpenMP is only one of many ways to make effective use of multiple processors. Another common way is through POSIX threads, or its equivalent on Windows. A third is to spawn off a new process and use IPC to communicate with it.
How do these techniques compare to each other?
I decided to use Python 3.2's new concurrent.futures module to handle the multithreaded and multiprocess cases, or rather, the backport to 2.x. This merges the underlying "threading" and "multiprocessing" APIs into a common form, based on the "future" concept for asynchronous programming. I quickly found that the multiprocess API had too much overhead so I won't talk about it.
My test case computes and stores the NxN Tanimoto similarity matrix between a set of fingerprints. I took N=110885 compounds from the ZINC data set and generated 2048-bit fingerprints using RDKit's hash fingerprint. The chemfp fingerprint type is "RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=2048 nBitsPerHash=4 useHs=1".
I don't actually save the 12,295,483,225 values. For one, simple symmetry reduces that to 6,147,686,170 values. For another, in the problems I'm interested, I can ignore similarities below a threshold. Instead, chemfp internally uses a sparse matrix format.
For this test I used a simple parallelization. I break up the rows of the matrix into bands, and fill in the parts of the upper-right triangle which are at or above the threshold. In the OpenMP version, all the OpenMP threads work on a single band. In the concurrent.futures version, each thread processes its own band. These differences fall naturally out of the how those two APIs work.
Once I got both implementations working, debugged, and optimized, I could finally do some performance numbers. I wanted to see how OpenMP and pthreads scale over a range of processors and range of threshold values. My desktop has two dual-core CPUs, so I decided to rent time on a "High-CPU Extra Large Instance" Amazon EC2 node. It has 8 nodes of the form:
vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5410 @ 2.33GHz stepping : 10 cpu MHz : 2333.336 cache size : 6144 KB fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu de tsc msr pae cx8 sep cmov pat clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good aperfmperf pni ssse3 cx16 sse4_1 hypervisor lahf_lm bogomips : 4666.67 clflush size : 64 cache_alignment : 64 address sizes : 38 bits physical, 48 bits virtual power management:Sadly, this is not a machine which supports the POPCNT instruction, so the chemfp popcount code fell back to Imran Haque's SSSE3-based implementation.
The machine has about 6GB of free memory. I knew from testing on my desktop that I only needed 4.1 GB for the worst case, so I didn't run into memory problems.
I ran my benchmark code through set of combinations of the OpenMP vs. pthreads, with thread counts from 1 to 8, and with threshold values of 1.0, 0.99, 0.97, 0.95, 0.93, 0.9, 0.88, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, and 0.5. I also ran the values several times in order to get some idea of the timing variations. Yes, that took about 20 hours to run.
You can get the raw
data if you're really interested. I used matplotlib to make a
couple of 3D plots. First, here's the overall times:
You can see that the OpenMP code (in red) is usually faster than the
pthread code (in blue). The exception is for thresholds of 0.55 and
lower. BTW, a threshold of 0.5 finds 285,371,794 matches in the NxN
matrix, which means this stores a few gigabytes of data.
To make more sense of this data, here's a plot of the speedup, defined
as T0/T where T0 is the fastest single-threaded time for a given
threshold and T is the fastest time for a given number of
threads. Perfect speedup would give a value of 8 for 8 processors.
The region near threshold=1.0 is so jagged because the search time is
less than the variability in the system, and is close to the minimum
time resolution size of 1 second.
Most people in this field use thresholds in the range 0.7-1.0. It's obvious from this graph that OpenMP is the right solution. It's almost always faster, and overall it makes much better use of a multi-core machine.
ECFP-like fragments in PubChem #
Previously I posted about unique fragments in PubChem. That used my molecular subgraph enumeration algorithm. In this essay I'll report some results from looking at the unique bit counts from RDKit's MorganFingerprint algorith, which is an ECFP variant.
My first graph in the previous essay shows that there are about 2 million unique fragments of size up to 7 in PubChem, and that the second 1/2 of the data files contained few fragments which weren't in the first 1/2. This suggests that there aren't that many substructures of size 7, compared to the number of possible structures of size 7, which is quite curious.
Rather, I expect that the number of unique fragments should level off with enough molecules. In the simplest case, there are 112 elements and 5 elements which can be aromatic, for a total of 117 possible unique atom types. I found 110 of them in my PubChem subset.
Similarly, I found only 1103 unique fragments with two atoms. The breakdown as a function of fragment length is:
- Length 1: 110 unique fragments
- Length 2: 1103 unique fragments
- Length 3: 4209 unique fragments
- Length 4: 19398 unique fragments
- Length 5: 86336 unique fragments
- Length 6: 364342 unique fragments
- Length 7: 1447488 unique fragments
How many possible substructures are there of size 7? Assuming only 10 atom types and ignoring cycles and different bond types gives about 10 million. There are 1+1+2+6+21+112+853=996 connected subgraphs of up to size 7, and I'll guess that 800 of them are chemically accessible. I'll hazard 2**6 different bond types, so about 500 billion possible substructures. That's over three orders of magnitude larger than what I see.
It's curious because it means that any substructure-based fingerprint using up to 7 atoms has at most a small multiple of ~2,000,000 different values. Sure, a hash algorithm might potentially generate values in the range 0-232, but only a few million of them will be actually be generated.
(Some algorithms will generate multiple bits per feature, eg, features with 1 or 2 atoms generates a single bit while features with 3 or more atoms generates two bits. This acts as a simple weighting scheme. There's a perfect correlation between those two bits, so I'm not counting the second one as meaningful.)
Fingerprint statistical models often assume a roughly Bernoulli process, and the number of unique features is unbounded, though increasingly rare. This observation suggests that there is actually an upper limit to the size, which changes the distribution type slightly.
Is this observable in other fingerprints? I used RDKit's MorganFingerprint algorithm, which is a variation of the ECFP "extended connectivity fingerprint." I used radii values of 1, 2, 3, 4, and 5, and with the other parameters at their default. Each step includes increasingly distant information, so should be more diverse.
The following shows the number of unique bits found as a function of the number of molecules processed. The molecules are ranked by PubChem id, although it doesn't include all of the PubChem structures since RDKit couldn't process all of the structures. (It complains about a number of bad valences.) A more complete analysis would randomize the structures to remove local coherence effects.
That graph is almost impossible to understand because the dynamic
range is so large. Log and log-log scales don't help either. The
best solution is to normalize by the maximum number for each
graph. That gives:
This sort of curve is a species discovery curve. It appears to show that the MorganFingerprint(radius=1) saturates at around 100,000 different bit values, and radius=2 might saturate at around 3.5 million unique values. This makes sense, as the radius=2 fingerprint corresponds to about 6-7 atoms, and I found 2 million unique values. (An average branching factor of 2.5 gives 2.5**2 or 6.25 atoms. However, a local branching factor of 3 gives 9 atoms, which adds more unique values to the fingerprint.)
I'll guess that there's under 40 million unique bits for radius=3 but it becomes harder to estimate. As the radius increases, the trend in the diversity of new values clearly gets closer to linear, which means there's less and less saturation. I can't predict the total number of unique values for radius=5 because it's still too flat.
The species growth curve is often fit as A(1-exp(-bx)) or A(log(b*x-1)). The first has a fixed limit, the latter implies there is no upper bound. This case is somewhere in the middle: for good chemical reasons, there's a large but finite number of possible ways to arrange a fixed number of atoms. For the smallest fragment size (1 atom), we are at that limit. For larger sizes, we are nowhere near the chemical limit, and I think a log fit works best.
Equally obvious, I would need to randomize the input order in order to get a smoother curve so I could make that prediction. But the end of the holidays was a couple of days ago and I need to get back to paying work.
Unique fragments in PubChem #
For reasons I'll get into later, I wanted to get an idea of the subgraph distribution of PubChem. That is, given my method for molecular subgraph enumeration, create all subgraphs of up to size 7 atoms and get an idea of how common they are. More specifically, atom uniqueness depends only on the atomic element and aromaticity, as assigned by OEChem, and the unique bond categories are "single-or-aromatic", double, and triple.
Last month I downloaded 2,138 sdf.gz files from PubChem and did structure perception with OpenEye's OEChem. Starting a couple of weeks ago, I use my subgraph enumeration algorithm to process 1,724 of them. For some reason, it stopped at that point. Since it took 7.5 days to process those files, and the data set is already a bit ungainly, I decided to leave the full analysis for another time and to not figure out what happened with the processing.
In the 1,724 files are 21,570,907 PubChem records and my enumeration found 1,925,185 unique substructures.
I kept track of the number of unique fragments per input file and the
running total number of unique fragments over all of the files,
plotted here:

You can see that 50% of the unique fragments are in the first 25% of
the data files and essentially all are found in the first 50% of the
files. (The number does increase after the 1000th file, but it's very
slow.) It's also interesting to see the internal structural diversity
in the different files. I suspect there are some large regions made
from contributed combinitorial libraries.
The unique fragments which exist in the most number of records are:
21387437 C 20195255 O 19959057 c 19892743 cc 19755355 ccc 19457485 cccc 19270867 CC 19015890 ccccc 18599872 cccccc 18488545 c1ccccc1 18386628 N 17672171 Cc 17324074 Ccc 17109361 CN 16985355 Cccc 16533358 C=O 16522121 Ccccc 15993406 Cc(c)c 15759069 Cc(c)cc 15508521 CcccccYou shouldn't be surprised to see that carbon is found in 21,387,437 of the 21,570,907 structures.
I made a distribution plot of the fragments, where the horizontal axis
is rank order (C then O, cc, and so on). I show it at a few different
scales in order to get a better understanding of the
distribution. It's quite obviously *not* a Zipf distribution.

The vertical axis is the count in millions. You can see that the 10,000th most common substructure is in a very small percentage of the structure; it's actually 0.5%.
At the other end of the list, 478,278 fragments (24.8%) exist only once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]), and 132,574 fragments (6.89%) exist thrice. Here's the first 20 values as a table,
1 478278 # In other words, 478,278 substructures exist only once in the data set 2 251372 3 132574 4 100665 5 67536 6 57500 7 42959 8 37983 9 31750 10 28684 11 24016 12 23169 13 18695 14 17659 15 15501 16 14717 17 13452 18 12500 19 11394 20 11276and in graphical form.

Copyright © 2001-2010 Dalke Scientific Software, LLC.


