<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"><channel><title>Andrew Dalke's writings</title><link>http://www.dalkescientific.com/writings/diary/index.html</link><description>Writings from the software side of bioinformatics and
  chemical informatics, with a heaping of Python thrown in for good
  measure.  Code to taste.  Best served at room temperature.</description><lastBuildDate>Sat, 09 Aug 2008 23:24:08 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>Read 0 bytes, run out of memory</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/08/08/consume_memory.html</link><description>&lt;P&gt;

I've been working on a file format for chemical fingerprints,
influenced by the PNG file format.  To make sure I'm doing it right, I
wrote a program to dump blocks from PNG files.  I made a mistake and
my program gave a MemoryError.  How did that happen when my test file
is only a few K long?

&lt;/P&gt;&lt;P&gt;

I tracked it down.  I don't know if it's a bug.  Here's something for
you all to ponder over:

&lt;pre class="code"&gt;
BLOCKSIZE = 10*1024*1024

f=open("empty.txt", "w")
f.close()

f=open("empty.txt")
data = []
for i in range(10000):
    s = f.read(BLOCKSIZE)
    assert len(s) == 0
    data.append(s)
&lt;/pre&gt;

That's an empty file, so the read() must return empty strings.  The
assert statement verifies that that's the case.  But when I run it I get:

&lt;pre class="code"&gt;
Python(18996) malloc: *** vm_allocate(size=10489856) failed (error code=3)
Python(18996) malloc: *** error: can't allocate region
Python(18996) malloc: *** set a breakpoint in szone_error to debug
Traceback (most recent call last):
  File "mem_fill.py", line 9, in &amp;lt;module&amp;gt;
    s = f.read(BLOCKSIZE)
MemoryError
&lt;/pre&gt;

The reason why is in the C implementation of read
(Objects/fileobject.c).  The relevant line i:

&lt;pre class="code"&gt;
        v = PyString_FromStringAndSize((char *)NULL, buffersize);
&lt;/pre&gt;

That preallocates space assuming the requested read size will be
correct.  In my example code it preallocates 10MB of space even though
the result is 0 bytes long.  Since I keep the result around, all of
the preallocated space is also kept around.  Repeat that 10,000 times
and my machine quickly runs out of memory.  So will yours.

&lt;/P&gt;&lt;P&gt;

Bug in Python?  Correct behavior?  You decide.  Feel free to
&lt;a href="http://dalkescientific.blogspot.com/2008/08/python-bug-or-not.html"&gt;make
comments&lt;/a&gt; if you wish.

&lt;/P&gt;&lt;P&gt;

&lt;b&gt;Update 9 August:&lt;/b&gt; I submitted this as &lt;a
href="http://bugs.python.org/issue3531"&gt;issue3531&lt;/a&gt; in the Python
bug tracker.  Antoine Pitrou pointed out that there's a string resize
at the end of the function, so my assumption on the source of the
problem was wrong.  I dug into it some more and tracked the problem
down to obmalloc.c inside of PyObject_Realloc.  The resize ends up
calling the C function 'realloc', and it didn't seem to be reallocing.

&lt;/P&gt;&lt;P&gt;

With those clues I did some searching and found Bob Ippolito's blog post
&lt;a href="http://bob.pythonmac.org/archives/2005/01/01/realloc-doesnt/"&gt;realloc.. doesn't&lt;/a&gt;.

"Apparently, Darwin's implementation of &lt;tt&gt;realloc&lt;/tt&gt; never frees
memory if you ask for a smaller size."  There's a thread about this
from January 2005 on python-dev titled:

"&lt;a
href="http://mail.python.org/pipermail/python-dev/2005-January/050582.html"&gt;Darwin's
realloc(...) implementation never shrinks allocations&lt;/a&gt;."

There's additional discussion with
&lt;a href="http://bugs.python.org/issue1092502"&gt;issue1092502&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

I've been looking through various libraries to see if this causes a
problem in real life.  A denial of service attack is possible but only
for formats where the data describes how many bytes to read.  The PNG
format is one of many.  The reader code has to trust that number and
use it blindly, but any hardened reader knows those values are under
suspicion.  For example, the PNG reader in PIL uses
ImageFile.py:_safe_read:

&lt;pre class="code"&gt;
# Reads large blocks in a safe way.  Unlike fp.read(n), this function
# doesn't trust the user.  If the requested size is larger than
# SAFEBLOCK, the file is read block by block.
#
# @param fp File handle.  Must implement a &lt;b&gt;read&lt;/b&gt; method.
# @param size Number of bytes to read.
# @return A string containing up to &lt;i&gt;size&lt;/i&gt; bytes of data.

def _safe_read(fp, size):
    if size &lt;= 0:
        return ""
    if size &lt;= SAFEBLOCK:
        return fp.read(size)
    data = []
    while size &gt; 0:
        block = fp.read(min(size, SAFEBLOCK))
        if not block:
            break
        data.append(block)
        size = size - len(block)
    return string.join(data, "")
&lt;/pre&gt;

which does not end up with un-realloc'ed blocks.

&lt;/P&gt;&lt;P&gt;

If you aren't using a hardened reader then it's easy to do a standard
denial-of-service attack.

&lt;/P&gt;&lt;P&gt;

I could come up with hypothetical examples, but they are pretty
contrived.  I tried to write one which sounded at least somewhat
reasonable, but I couldn't.  Therefore, I don't think this realloc(3C)
behavior can cause a serious denial-of-service attack.  At worst it
seems to make already bad code somewhat more suspectable to attack.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/08/08/consume_memory.html</guid><pubDate>Fri, 08 Aug 2008 12:00:00 GMT</pubDate></item><item><title>Teaching Python programming for cheminformatics</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/07/23/teaching.html</link><description>&lt;P&gt;

I'm announcing two short &lt;a
href="http://dalkescientific.com/training/"&gt;training classes&lt;/a&gt; in
Python programming for chemical informatics.  I will teach the first
in Leipzig, Germany on 6-7 October.  It will be hosted by the &lt;a
href="http://www.python-academy.com/"&gt;Python Academy&lt;/a&gt;.  I'm still
working on the details of the second.  It will be in the San Francisco
Bay Area in early December.

&lt;/P&gt;&lt;P&gt;

For details about the price, topics and other information, see &lt;a
href="http://dalkescientific.com/training/"&gt;http://dalkescientific.com/training/&lt;/a&gt;
or &lt;a href="mailto:dalke@dalkescientific.com"&gt;send me an email&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

I work with and develop software tools for computational chemists,
mostly in small-molecule chemistry.  These are scientists, not
programmers, but they use computers every day as part of their
research.  Most need to do some programming, pehaps to implement a new
algorithm, or more likely to handle something that's too tedious or
error prone to do manually, like automating analysis of 10,000
compounds.

&lt;/P&gt;&lt;P&gt;

Most chemists have some training in programming, usually a semester or
two of introductory programming classes in college and a lot of
training from the school of experience.  The latter usually means
informal training from labmates, who are also not programming experts.

&lt;/P&gt;&lt;P&gt;

I've seen many of my chemists friends work hard at getting software to
do what they want it to do.  Chemists, like most other people who had
to spend years of mostly isolated work to get a PhD, know how to
perservere.  But they would rather do science, not spend time figuring
out how to use software.

&lt;P&gt;&lt;/P&gt;

My training classes are meant for them.  I'll cover different aspects
of how to be better at Python programming using examples that are
directly relevant to doing small molecule &lt;i&gt;in silico&lt;/i&gt; research.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/07/23/teaching.html</guid><pubDate>Wed, 23 Jul 2008 12:00:00 GMT</pubDate></item><item><title>Testing the bitslice algorithm for popcount</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html</link><description>&lt;P&gt;

This is part 7 of an on-going series dealing with molecular
fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html"&gt;Generating molecular fingerprints with OpenBabel&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;Computing Tanimoto scores, quickly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html"&gt;Block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html"&gt;Faster block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html"&gt;HAKMEM 169 and other popcount implementations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;Testing the bitslice algorithm for popcount&lt;/b&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;
&lt;/P&gt;
&lt;hr /&gt;
&lt;P&gt;

I want to compute the popcount of molecular fingerprints containing
1024 bits.  My algorithm has been to compute the popcount 8, 16, 32,
or 64 bits at a time and repeat that for all of the words in the
fingerprint.  There are a couple of algorithms which compute the
popcount of multiple words in parallel by doing a partial popcount of
each word and combining the intermediate results to get the result
over all words.

&lt;/P&gt;&lt;P&gt;

Lauradoux C&amp;eacute;dric wrote a nice &lt;a
href="http://www.princeton.edu/~claurado/ham/overview.pdf"&gt;overview on
Hamming weight&lt;/a&gt;, including explanations of the tree merging and
bit-slicing algorithms.  In personal email he pointed out &lt;a
href="http://www.ciphersbyritter.com/NEWS4/BITCT.HTM"&gt;this page&lt;/a&gt;
from comp.arch.arithmetic where someone attributes the algorithm to
Robert Harley.

&lt;/P&gt;&lt;P&gt;

It took me a while to understand it.  It's been rather a while since I
had to do much thinking about half-adders.  About 17 years.  But it's
not hard.  I'm not going to explain it though.  It's best in pictures
and that will take me too long to do.  You can follow those links
though.


&lt;/P&gt;&lt;P&gt;

While he hasn't finished the overview explanation on how to combine
the two algorithms, his &lt;a
href="http://www.princeton.edu/~claurado/hamming.html"&gt;Hamming
weight&lt;/a&gt; page includes &lt;a
href="http://www.princeton.edu/~claurado/ham/merging.tar.gz"&gt;source&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

I've added two bit-slicing implementations to my &lt;a
href="http://dalkescientific.com/writings/diary/popcnt.c"&gt;popcnt.c&lt;/a&gt;
benchmark.  The algorithm from comp.arch.arithmetic processes 7 words
at a time and the one from C&amp;eacute;dric processes 24 words at a time.
Both fall back to other methods for the last few words if the input
isn't an exact multiple of 24 or 7.

&lt;/P&gt;&lt;P&gt;

C&amp;eacute;dric implemented several algorithms in his code distributions
and included programs to compare their performance.  The results
suggested that the 24 word bitslice algorithm would be twice as fast
as as the 16 bit looup table.  I implemented it and found that while
it was fast, and in my laptop even faster than the 16 bit lookup
table, it was only about 10% faster and not a factor of 2.

&lt;pre class="code"&gt;
% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM)2 CPU         T7600  @ 2.33GHz
% gcc --version
i686-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5367)
   ... deleted ...
% cc -O3 popcnt.c
% ./a.out
FreeBSD version 1   :  3854904 us; cnt=32511665
FreeBSD version 2   :  3517953 us; cnt=32511665
16-bit LUT          :  2127718 us; cnt=32511665   &lt;i&gt;was the fastest&lt;/i&gt;
8-bit LUT           :  5317244 us; cnt=32511665
8-bit LUT v2        :  3762467 us; cnt=32511665
Wikipedia #2        :  5493756 us; cnt=32511665
Wikipedia #3        :  5295541 us; cnt=32511665
gcc popcount        :  6242596 us; cnt=32511665
gcc popcountll      : 10402364 us; cnt=32511665
HAKMEM 169/X11      :  4305536 us; cnt=32511665
&lt;b&gt;Bitslice(7)         :  2077473 us; cnt=32511665
Bitslice(24)        :  1843217 us; cnt=32511665&lt;/b&gt;  &lt;i&gt;fastest&lt;/i&gt;
naive               : 23460323 us; cnt=32511665
Wegner/Kernigan     : 14976014 us; cnt=32511665
Anderson            : 63971600 us; cnt=32511665
8x shift and add    : 21728414 us; cnt=32511665
32x shift and add   : 19747808 us; cnt=32511665
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

I dug around for a while and used Toby's comment to try Shark as a
profiler.  Things made a bit more sense after I remembered to add the
'-g' option to gcc so I could get Shark to line up the code and
assembly side-by-side.  My conclusion is that C&amp;eacute;dric's code has
too much function call overhead because the 16-bit LUT is being called
for every 32 bit number, while the 24 word bitslice code is only
called every 24 words.  My code doesn't have that overhead, which is
why the times are closer to each other.

&lt;/P&gt;&lt;P&gt;

As before, I wondered how much "-m64" affects things.  I'll declare it
a tie between the new bitslice code and the old 16-bit lookup table.
(While though the numbers aren't the same, the difference is about the
same as when I run the same algorithm again.)

&lt;pre class="code"&gt;
% cc -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1   :  3902385 us; cnt=32511665
FreeBSD version 2   :  3422655 us; cnt=32511665
16-bit LUT          :  1621557 us; cnt=32511665 &lt;i&gt;fastest&lt;/i&gt;
8-bit LUT           :  2070320 us; cnt=32511665
8-bit LUT v2        :  2094362 us; cnt=32511665
Wikipedia #2        :  2135130 us; cnt=32511665
Wikipedia #3        :  2142576 us; cnt=32511665
gcc popcount        :  8128024 us; cnt=32511665
gcc popcountll      :  4087390 us; cnt=32511665
HAKMEM 169/X11      :  5805272 us; cnt=32511665
&lt;b&gt;Bitslice(7)         :  1728262 us; cnt=32511665
Bitslice(24)        :  1698257 us; cnt=32511665&lt;/b&gt; &lt;i&gt;fastest&lt;/i&gt;
naive               : 24990997 us; cnt=32511665
Wegner/Kernigan     : 16864887 us; cnt=32511665
Anderson            : 12463648 us; cnt=32511665
8x shift and add    : 21715482 us; cnt=32511665
32x shift and add   : 20003160 us; cnt=32511665
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

All of this was on one processor and compiler.  I decided to try
another machine with a slightly different processor and an earlier
version of gcc.

&lt;pre class="code"&gt;
%sysctl -w hw.model
hw.model: Intel(R) Core(TM)2 CPU          6420  @ 2.13GHz
%gcc --version
gcc (GCC) 3.4.6 [FreeBSD] 20060305
    ...
%cc popcnt.c -O3 
%./a.out 
FreeBSD version 1   :  4270677 us; cnt=32511665
FreeBSD version 2   :  3785047 us; cnt=32511665
16-bit LUT          :  2034836 us; cnt=32511665
8-bit LUT           :  1935156 us; cnt=32511665   &lt;i&gt;fastest&lt;/i&gt;
8-bit LUT v2        :  1987370 us; cnt=32511665
Wikipedia #2        :  5010639 us; cnt=32511665
Wikipedia #3        :  4931266 us; cnt=32511665
gcc popcount        :  5793299 us; cnt=32511665
gcc popcountll      :  6815575 us; cnt=32511665
HAKMEM 169/X11      : 10788366 us; cnt=32511665
&lt;b&gt;Bitslice(7)         :  2467150 us; cnt=32511665
Bitslice(24)        :  2469399 us; cnt=32511665&lt;/b&gt;
naive               : 27722786 us; cnt=32511665
  ... too slow to worry about ...
&lt;/pre&gt;

In this case I would use the 8-bit lookup table, which is about 20%
faster than the the bitslice code.

&lt;P&gt;&lt;P&gt;

Here's the results on an older Pentium 4 (Prescott) CPU 2.80GHz with
gcc 3.4.6:

&lt;pre class="code"&gt;
FreeBSD version 1   :  3745509 us; cnt=32500610
FreeBSD version 2   :  3774150 us; cnt=32500610
16-bit LUT          :  2762583 us; cnt=32500610
8-bit LUT           :  2430362 us; cnt=32500610    &lt;i&gt;fastest&lt;/i&gt;
8-bit LUT v2        :  2433684 us; cnt=32500610
Wikipedia #2        : 11509758 us; cnt=32500610
Wikipedia #3        :  8540921 us; cnt=32500610
gcc popcount        :  9100055 us; cnt=32500610
gcc popcountll      : 10568664 us; cnt=32500610
HAKMEM 169/X11      : 10970609 us; cnt=32500610
Bitslice(7)         :  4226871 us; cnt=32500610
Bitslice(24)        :  3700020 us; cnt=32500610
naive               : 36117662 us; cnt=32500610
Wegner/Kernigan     : 46479326 us; cnt=32500610
Anderson            : 174285325 us; cnt=32500610
8x shift and add    : 90028247 us; cnt=32500610
32x shift and add   : 86390410 us; cnt=32500610
&lt;/pre&gt;


Bitslice(24) is still the fastest bit-twiddler solution, but quite a
bit slower than using an 8-bit lookup table.  There is no single best
solution.  It seems the code is also compiler dependent so I'm
installing GCC 4.2 to see how that affects things.  That'll probably
take a few hours so it's time to sleep.

&lt;/P&gt;
&lt;br /&gt;
&lt;P&gt;

Okay, fink installed gcc 4.2.2 (the most recent is 4.3.1 but I decided
it wasn't worth the time to compile it myself).  Here's the numbers
on my laptop for 32-bit:

&lt;pre class="code"&gt;
% gcc-4 -O3 popcnt.c
% ./a.out 
FreeBSD version 1   :  4161621 us; cnt=32511665
FreeBSD version 2   :  2930977 us; cnt=32511665  &lt;i&gt;was 3517953&lt;/i&gt;
16-bit LUT          :  2534380 us; cnt=32511665  &lt;i&gt;was 2127718&lt;/i&gt;
8-bit LUT           :  4472733 us; cnt=32511665
8-bit LUT v2        :  4499285 us; cnt=32511665  &lt;i&gt;was 3762467&lt;/i&gt;
Wikipedia #2        :  5031855 us; cnt=32511665
Wikipedia #3        :  3874931 us; cnt=32511665  &lt;i&gt;was 5295541&lt;/i&gt;
gcc popcount        :  4808307 us; cnt=32511665
gcc popcountll      :  5978393 us; cnt=32511665
HAKMEM 169/X11      :  4315751 us; cnt=32511665
Bitslice(7)         :  2111625 us; cnt=32511665
Bitslice(24)        :  1443337 us; cnt=32511665   &lt;i&gt;fastest; was 1843217&lt;/i&gt;
naive               : 51557971 us; cnt=32511665
Wegner/Kernigan     : 15616245 us; cnt=32511665
Anderson            : 63424235 us; cnt=32511665
8x shift and add    : 19179445 us; cnt=32511665
32x shift and add   : 19502222 us; cnt=32511665
%
&lt;/pre&gt;

and for 64 bit

&lt;pre class="code"&gt;
% gcc-4 -O3 popcnt.c -m64
% ./a.out
FreeBSD version 1   :  4192391 us; cnt=32511665
FreeBSD version 2   :  2812570 us; cnt=32511665    &lt;i&gt;was 3422655&lt;/i&gt;
16-bit LUT          :  1494747 us; cnt=32511665    &lt;i&gt;was 1621557&lt;/i&gt;
8-bit LUT           :  2182569 us; cnt=32511665
8-bit LUT v2        :  2170191 us; cnt=32511665    &lt;i&gt;was 2094362&lt;/i&gt;
Wikipedia #2        :  2001489 us; cnt=32511665
Wikipedia #3        :  1361488 us; cnt=32511665    &lt;i&gt;was 2142576&lt;/i&gt;
gcc popcount        :  7350367 us; cnt=32511665
gcc popcountll      :  3718116 us; cnt=32511665
HAKMEM 169/X11      :  4339839 us; cnt=32511665
Bitslice(7)         :  1693849 us; cnt=32511665
Bitslice(24)        :  1206318 us; cnt=32511665    &lt;i&gt;fastest; was 1698257&lt;/i&gt;
naive               : 24914902 us; cnt=32511665
Wegner/Kernigan     : 15557531 us; cnt=32511665
Anderson            :  8764008 us; cnt=32511665
8x shift and add    : 21710211 us; cnt=32511665
32x shift and add   : 19583803 us; cnt=32511665
%
&lt;/pre&gt;

It looks like things are nearly always faster for 64 bit code, and
sometimes faster for 32 bit code.

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html</guid><pubDate>Sat, 05 Jul 2008 12:00:00 GMT</pubDate></item><item><title>HAKMEM 169 and other popcount implementations</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html</link><description>&lt;P&gt;

This is part 6 of an on-going series dealing with molecular
fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html"&gt;Generating molecular fingerprints with OpenBabel&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;Computing Tanimoto scores, quickly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html"&gt;Block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html"&gt;Faster block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;HAKMEM 169 and other popcount implementations&lt;/b&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html"&gt;Testing the bitslice algorithm for popcount&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;

&lt;/P&gt;
&lt;h2&gt;HAKMEM&lt;/h2&gt;

&lt;P&gt;

Many other people have worked on doing high speed population counts.
In 1972 the AI Laboratory at MIT published "&lt;a
href="http://home.pipeline.com/~hbaker1/hakmem/hakmem.html"&gt;HAKMEM&lt;/a&gt;",
or "Artificial Intelligence Memo No. 239".  It is a collection of
algorithms, tricks, observations, and unsolved but interesting
problems.  One of which is to "Solve Go."  &lt;a
href="http://home.pipeline.com/~hbaker1/hakmem/hacks.html#item169"&gt;Item
169&lt;/a&gt; is a PDP-6/10 assembly way to count the number of ones in a
word.  It was called "count ones" and is now called popcount.

&lt;/P&gt;&lt;P&gt;

I first came across HAKMEM Item 169 in 2005 when I last looked at
doing fast Tanimoto calculations.  I tried it and a few other
solutions but concluded that using special Altivec instructions were
the fastest solution for my PowerPC-based PowerBook.  I don't have a
PowerPC any more.  I bought a MacBook Pro last year with a 2.33 GHz
Intel Core 2 Duo.  I've been looking at the MMX instructions
quizzically trying to figure out if there's an equivalent solution.  I
am &lt;i&gt;not&lt;/i&gt; an assembly programmer.

&lt;/P&gt;&lt;P&gt;

The best I could find was an AMD "&lt;a
href="http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf"&gt;x86
Code Optimization Guide&lt;/a&gt;".  It gives a 32-bit integer version using
normal Intel assembly and a 64-bit version using MMX which "can do
popcounts about twice as fast as the integer version (for an identical
number of bits)." 

&lt;/P&gt;&lt;P&gt;

The newer guide (I misplaced the URL) says to use the popcount instruction in SSE4a.

&lt;/P&gt;
&lt;h2&gt;Other popcount references&lt;/h2&gt;
&lt;P&gt;

Computing popcount is apparently very widely used.  You can find many
people have talked about it:

&lt;ul&gt;
 &lt;li&gt;&lt;a href="http://wiki.cs.pdx.edu/forge/popcount.html"&gt;Bart
 Massey&lt;/a&gt;:&lt;br /&gt;
&lt;pre class="code"&gt;
popcount_naive: 1.6666e+07 iters in 688 msecs for 41.28 nsecs/iter
popcount_8: 1e+08 iters in 995 msecs for 9.95 nsecs/iter
popcount_6: 2e+08 iters in 1725 msecs for 8.625 nsecs/iter
popcount_hakmem: 2e+08 iters in 1411 msecs for 7.055 nsecs/iter
popcount_keane: 1e+09 iters in 6566 msecs for 6.566 nsecs/iter
popcount_3: 1e+09 iters in 6270 msecs for 6.27 nsecs/iter
popcount_2: 1e+09 iters in 5658 msecs for 5.658 nsecs/iter
popcount_mult: 1e+09 iters in 4169 msecs for 4.169 nsecs/iter
  ...
At the suggestion of Bill Trost, I added 8 and 16-bit table-driven
popcount implementations. These perform the fastest in the benchmark
test-stand, at about 3ns/iteration. They're about the same speed, so
one would obviously use the 8-bit version.
&lt;/pre&gt;&lt;/li&gt;

&lt;li&gt;&lt;a
href="http://popcnt.org/2007/09/magic-popcount-popcnt-command.html"&gt;Majek&lt;/a&gt;
has pointers to code.  That's where I found the AMD optimization
guide link.&lt;/li&gt;

&lt;li&gt;Lauradoux C&amp;eacute;dric's &lt;a
href="http://www.princeton.edu/~claurado/hamming.html"&gt;Hamming weight&lt;/a&gt;
page has &lt;a
href="http://www.princeton.edu/~claurado/ham/overview.pdf"&gt;an
overview&lt;/a&gt; of the topic.  He reports:

&lt;table style="margin-left: 20ex; border-bottom: 1px solid black"&gt;
&lt;thead"&gt;
&lt;tr&gt;&lt;th style="border-bottom: 1px solid black"&gt;Algorithms&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&lt;/th&gt;&lt;th style="border-bottom: 1px solid black"&gt;Throughput&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tr&gt;&lt;td&gt;Naive &lt;/td&gt;&lt;td&gt;29.97 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Kernighan &lt;/td&gt;&lt;td&gt; 33.34 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tree 1 &lt;/td&gt;&lt;td&gt; 122.97 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tree 2 &lt;/td&gt;&lt;td&gt; 130.60 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Tree 3 &lt;/td&gt;&lt;td&gt; 125.57 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Table 8 bits &lt;/td&gt;&lt;td&gt; 200.94 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Table 16 bits &lt;/td&gt;&lt;td&gt; 244.64 MB/s &lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

and nicely summarizes why few people use the fastest solution:

&lt;blockquote&gt;

The fastest solutions use pre-computation. However, you will never
find this two implementations in current software product (X11, linux
kernel. . . ). The reason of that is simple: the programmers are
reluctant to use solutions implying a big amount of memory since it
can affect the portability of the applications and their
performance. As a consequence, the algorithms based on tree are the
most popular.

&lt;/blockquote&gt;
&lt;/li&gt;

&lt;li&gt;There was a reddit thread in 2007 titled: &lt;a
href="http://www.reddit.com/r/programming/info/22p5v/comments"&gt;Popcount
- the "NSA instruction" missing from modern CPUs (or at least their
datasheets.)&lt;/a&gt;.  In it, a1k0n posted a link to a &lt;a
href="http://www.a1k0n.net/temp/popcnt.c.txt"&gt;popcount benchmark&lt;/a&gt;
he wrote.  After tweaking the 8-bit version slightly for better
performance I got:
&lt;pre class="code"&gt;
FreeBSD version 1   :  3784823 us; cnt=32511665
FreeBSD version 2   :  3421292 us; cnt=32511665
16-bit LUT          :  2068947 us; cnt=32511665
8-bit LUT           :  3632031 us; cnt=32511665
8x shift and add    : 21401951 us; cnt=32511665
32x shift and add   : 19419464 us; cnt=32511665
&lt;/pre&gt;
and he also found that lookups were the fastest solution.
&lt;/li&gt;

&lt;li&gt;There's a &lt;a
href="http://gcc.gnu.org/bugzilla/show_bug.cgi?id=36041"&gt;gcc bug
report&lt;/a&gt; from April 2008 showing that that __builtin_popcount isn't
even as fast as other bit twiddling solutions.  Joseph Zbiciak &lt;a
href="http://gcc.gnu.org/bugzilla/attachment.cgi?id=15529&amp;action=view"&gt;attached
a benchmark&lt;/a&gt;.  On my machine it shows that gcc's builtin solution
is about 1/3rd the speed of a better one:
&lt;pre class="code"&gt;
popcount64_1 = 4114 clocks   &lt;i&gt;gcc's __builtin_popcount&lt;/i&gt;
popcount64_2 = 1881 clocks 
popcount64_3 = 1627 clocks
popcount64_4 = 1381 clocks
&lt;/pre&gt;
&lt;/li&gt;

&lt;li&gt;There's also a number of &lt;a
href="http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive"&gt;popcount
algorithms documented by Sean Eron Anderson&lt;/a&gt; including descriptions 
and further links.&lt;/li&gt;

&lt;li&gt;C&amp;eacute;dric in private mail pointed out a nice &lt;a
href="http://www.ciphersbyritter.com/NEWS4/BITCT.HTM"&gt;collection of
popcount implementations&lt;/a&gt; from 1998/1999 including a cute 11 bit
lookup table and an implementation of Robert Harley's bit slice
solution, which is great when doing popcount across several words at a
time.&lt;/li&gt;

&lt;li&gt;Finally, you can fetch the &lt;a href="http://gmplib.org/"&gt;GMP&lt;/a&gt;
package.  It contains hand-written assembly implementations for many
different architectures.  The Intel ones are based on bit twiddling
and do not use a lookup table.&lt;/li&gt;

&lt;/ul&gt;

&lt;/P&gt;
&lt;h2&gt;Benchmarking popcount in GMP&lt;/h2&gt;
&lt;P&gt;

Interesting.  Since I have &lt;a
href="http://www.aleax.it/gmpy.html"&gt;gmpy&lt;/a&gt; installed on this
machine I can figure out just how good gmp's hand-written bit
twiddling actually is.  Here's what I did, with commentary inserted.

&lt;pre class="code"&gt;
# Read the entire data file into memory
&gt;&gt;&gt; s = open("nci.binary_fp").read()
&gt;&gt;&gt; len(s)
155667200

# Convert each character into hex

&gt;&gt;&gt; s = s.encode("hex")

# Convert the hex number into a gmp integer, using base 16

&gt;&gt;&gt; import gmpy
&gt;&gt;&gt; n = gmpy.mpz(s, 16)

# Do the popcount

&gt;&gt;&gt; gmpy.popcount(n)
136963400

# How long did it take?

&gt;&gt;&gt; import time
&gt;&gt;&gt; t1 = time.time(); gmpy.popcount(n); t2 = time.time()  
136963400
&gt;&gt;&gt; t2-t1
0.36236190795898438
&gt;&gt;&gt; print (155667200/256)*(t2-t1), "fingerprints per second"
220343.217182 fingerprints per second

# Verify that I got the right answer by using an 8-bit lookup table

&gt;&gt;&gt; bitcounts = (
... 0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
... 3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,)
&gt;&gt;&gt; bitcount_table = dict((chr(i), count) for (i,count) in enumerate(bitcounts))

# Double-check the table creation -- this should be 2

&gt;&gt;&gt; bitcount_table["A"]
2

# Do the full popcount; this will take a while

&gt;&gt;&gt; sum(bitcount_table[c] for c in s)
136963400
&gt;&gt;&gt; 
&lt;/pre&gt;

GMP takes about 0.36 seconds to compute one popcount.  My C code
before loop unwinding using the 16-bit lookup table took 0.41 seconds
to compute two popcounts, or about 0.2 seconds for one popcount.  This
is in very good agreement with the numbers from a1k0n's benchmark (3.4
seconds vs. 2.1 seconds).

&lt;/P&gt;
&lt;h2&gt;Another popcount benchmark&lt;/h2&gt;
&lt;P&gt;

So many different algorithms and the benchmarks aren't compatible
enough that I could compare them.  I decided to expand a1k0n's
benchmark and include a few other implementations.  The result is &lt;a
href="http://dalkescientific.com/writings/diary/popcnt.c"&gt;popcnt.c&lt;/a&gt;.
Yes, I didn't even change the name.  It implements quite a few
algorithms but it's still incomplete.  Anyone want to contribute
assembly versions using MMX instructions?

&lt;/P&gt;&lt;P&gt;

Here's what my laptop reports.  It's a 2.33 GHz Intel Core 2 Duo and I
compiled with gcc 4.0.1:

&lt;pre class="code"&gt;
% cc popcnt.c -O3
% ./a.out
FreeBSD version 1   :  3828096 us; cnt=32511665
FreeBSD version 2   :  3474751 us; cnt=32511665
16-bit LUT          :  2095066 us; cnt=32511665
8-bit LUT           :  5329692 us; cnt=32511665
8-bit LUT v2        :  3691475 us; cnt=32511665
8x shift and add    : 21675609 us; cnt=32511665
32x shift and add   : 19687827 us; cnt=32511665
Wikipedia #2        :  5264710 us; cnt=32511665
Wikipedia #3        :  5037636 us; cnt=32511665
gcc popcount        :  6088138 us; cnt=32511665
gcc popcountll      :  7548551 us; cnt=32511665
naive               : 23647023 us; cnt=32511665
Wegner/Kernigan     : 15026099 us; cnt=32511665
Anderson            : 63285941 us; cnt=32511665
HAKMEM 169/X11      :  4205008 us; cnt=32511655
&lt;/pre&gt;

The fastest implementation is still the 16-bit lookup table, at 40%
faster than that fastest bit-twiddling code.  The "8-bit LUT"
implementation is by a1k0n.  I noticed it was a lot slower than my
code so I investigated.  The only difference was the order of
calculations for a simple addition.  The "8-bit LUT v2" is my version.
It's interesting to see that that minor change causes a major timing
difference.  And HAKMEM is holding its own; it's even better than the
Wikipedia ones, which I didn't expect.

&lt;/P&gt;&lt;P&gt;

A Core 2 Duo is a 64-bit processor.  I can ask gcc to compile for
64-bits and see if that makes a difference.

&lt;pre class="code"&gt;
% cc popcnt.c -O3 -m64
% ./a.out
FreeBSD version 1   :  3882201 us; cnt=32511665
FreeBSD version 2   :  3382603 us; cnt=32511665
16-bit LUT          :  1562167 us; cnt=32511665
8-bit LUT           :  2013767 us; cnt=32511665
8-bit LUT v2        :  2054639 us; cnt=32511665
8x shift and add    : 21679291 us; cnt=32511665
32x shift and add   : 20050205 us; cnt=32511665
Wikipedia #2        :  2113876 us; cnt=32511665
Wikipedia #3        :  2102645 us; cnt=32511665
gcc popcount        :  8081774 us; cnt=32511665
gcc popcountll      :  4217908 us; cnt=32511665
naive               : 27717913 us; cnt=32511665
Wegner/Kernigan     : 16618378 us; cnt=32511665
Anderson            : 12300547 us; cnt=32511665
HAKMEM 169/X11      :  5643968 us; cnt=32511655
&lt;/pre&gt;

It does, for some things.  The lookup tables are faster (and the 8-bit
counts are now the same), and the 64-bit algorithms (Wikipedia, gcc
popcountll and Anderson) are much happier.  The parallel algorithm
used by the Wikipedia implementations are almost as fast as the 8-bit
lookup table and only 33% slower than the 16-bit lookup table.  While
HAKMEM got 25% slower.

&lt;/P&gt;&lt;P&gt;

Is there any way I can ask gcc to use the 64-bit instructions but
still talk to a Python which was compiled for 32 bits?

&lt;/P&gt;
&lt;h2&gt;Profile-directed optimizations with gcc&lt;/h2&gt;
&lt;P&gt;

I recently learned how to do profile-directed optimizations with gcc
using the -fprofile-arcs and -fbranch-probabilities options.
Sometimes the compiler needs to make a guess on which assembly code is
best for a given circumstance.  For instance, if a branch for an 'if'
statement is usually taken then it might provide hints to the
processor pipline to optimistically assume the branch will be taken.
It can't guess correctly all the time.  Profile-directed optimizations
works by asking gcc during compilation to instrument the program,
running the program to get timings on how the code is actually used,
saving the results to a file, then recompiling while asking gcc to use
the instrumentation data.

&lt;/P&gt;&lt;P&gt;

Is it useful for my case?

&lt;pre class="code"&gt;
     &lt;i&gt;compile with instrumentation - these will be slower than usual!&lt;/i&gt;
% cc popcnt.c -O3 -fprofile-arcs
% ./a.out
FreeBSD version 1   :  4304773 us; cnt=32511665
FreeBSD version 2   :  4062710 us; cnt=32511665
16-bit LUT          :  2835959 us; cnt=32511665
8-bit LUT           :  5344514 us; cnt=32511665
8-bit LUT v2        :  4535290 us; cnt=32511665
8x shift and add    : 22246178 us; cnt=32511665
32x shift and add   : 19915850 us; cnt=32511665
Wikipedia #2        :  5626697 us; cnt=32511665
Wikipedia #3        :  5318770 us; cnt=32511665
gcc popcount        :  7005410 us; cnt=32511665
gcc popcountll      :  8173944 us; cnt=32511665
naive               : 62215630 us; cnt=32511665
Wegner/Kernigan     : 34541338 us; cnt=32511665
Anderson            : 65256755 us; cnt=32511665
HAKMEM 169/X11      :  4727297 us; cnt=32511665


      &lt;i&gt;See the profile data?&lt;/i&gt;
% ls -l popcnt.*
-rw-r--r--   1 dalke  staff  9177 Jul  3 05:01 popcnt.c
-rw-r--r--   1 dalke  staff  1816 Jul  3 04:43 popcnt.gcda
-rw-r--r--   1 dalke  staff  7872 Jul  3 04:43 popcnt.gcno

      &lt;i&gt;recompile using the profile data&lt;/i&gt;
% cc popcnt.c -O3 -fbranch-probabilities
% ./a.out
FreeBSD version 1   :  3815036 us; cnt=32511665
FreeBSD version 2   :  3477915 us; cnt=32511665
16-bit LUT          :  2251514 us; cnt=32511665
8-bit LUT           :  5291010 us; cnt=32511665
8-bit LUT v2        :  3936142 us; cnt=32511665
8x shift and add    : 21715704 us; cnt=32511665
32x shift and add   : 19708580 us; cnt=32511665
Wikipedia #2        :  5273262 us; cnt=32511665
Wikipedia #3        :  5074212 us; cnt=32511665
gcc popcount        :  6431689 us; cnt=32511665
gcc popcountll      :  7562124 us; cnt=32511665
naive               : 28044028 us; cnt=32511665
Wegner/Kernigan     : 15280455 us; cnt=32511665
Anderson            : 63431368 us; cnt=32511665
HAKMEM 169/X11      :  4209093 us; cnt=32511665
&lt;/pre&gt;

No significant differences over the normal 32-bit code.  Nor did I see
any real differences if I compiled with the architecture-specific
flags:

&lt;pre class="code"&gt;
cc popcnt.c -O3 -march=nocona
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Domain applicability&lt;/h2&gt;
&lt;P&gt;

I expect people will come across this page when looking for popcount
implementations, see my numbers, and end up using the 16-bit lookup
table.  That might be the right solution but bear in mind domain
applicability.  I plan to compute popcount over megabytes.  Most
people, like the proverbial implementation in X11, work on only a few
words.

&lt;/P&gt;&lt;P&gt;

It takes time to bring the lookup table into memory.  On modern
machines it's something like tens of instructions.  If you're only
doing a few popcounts, or a lot of popcounts but not all the time,
then it's probably best to use bit-twiddling.  It's easier to
implement and for that case likely faster.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html</guid><pubDate>Thu, 03 Jul 2008 12:00:00 GMT</pubDate></item><item><title>Faster block Tanimoto calculations</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html</link><description>&lt;P&gt;

This is part 5 of an on-going series dealing with molecular
fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html"&gt;Generating molecular fingerprints with OpenBabel&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;Computing Tanimoto scores, quickly&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html"&gt;Block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;Faster block Tanimoto calculations&lt;/b&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html"&gt;HAKMEM 169 and other popcount implementations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html"&gt;Testing the bitslice algorithm for popcount&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;

&lt;/P&gt;
&lt;hr /&gt;
&lt;P&gt;

After much analysis and digging in &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html"&gt;part
4&lt;/a&gt; I managed to get my C extension to Python to compute
fingerprints pretty quickly.  I estimated that a fingerprint search of
PubChem (19,218,991 1024 bit fingerprints) would take about 13.5
seconds, but that's still almost a factor of two worse than my
estimate of 8 seconds.  What's wrong?

&lt;/P&gt;&lt;P&gt;

To start with, the data bandwidth from the disk is not the limiting
factor.  Processing each byte of my test set using file reads takes
0.26 seconds

&lt;pre class="code"&gt;
% time wc -l nci.binary_fp
 1098650 nci.binary_fp
0.165u 0.089s 0:00.26 92.3%     0+0k 0+0io 0pf+0w
&lt;/pre&gt;

while the code takes 0.84 seconds to run, and 0.50 seconds if I leave
out the actual popcount lookups.  Did the code for the popcount
lookups change?

&lt;/P&gt;&lt;P&gt;

Actually, yes, it did.  My proof-of-concept code used 32 bit integers
and bit shifting:

&lt;pre class="code"&gt;
int popcount(unsigned int x) {
  return (_popcount_counts[x       &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt;  8 &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt; 16 &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt; 24 &amp;amp; 0xff]);
}
&lt;/pre&gt;

while my more recent code uses a byte lookup:

&lt;pre class="code"&gt;
    for (i=0; i&amp;lt;num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&amp;amp;targets[i]];
    }
&lt;/pre&gt;

The most obvious difference is that I'm working with 32 bits at a
time.  This won't work if the fingerprint has, say, 40 bits in it.  On
the plus side, I don't need to check for the end of the loop after
every 8 bits, so that reduces the branch tests by 75%.  (Branches in
modern computers cause performance problems.  Some details at the
Wikipedia page on &lt;a
href="http://en.wikipedia.org/wiki/Branch_predictor"&gt;Branch
predictor&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

Out of curiousity I generated the assembly code for the corresponding
popcount functions (using "gcc -S").  Here's the relevant source:

&lt;pre class="code"&gt;
int byte_lookup(int num_bytes, char *query) {
  int a=0, i;
  for (i=0; i&amp;lt;num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }
  return a;
}

int int_lookup(int num_bytes, int *query) {
  int a=0, q, i;
  for (i=0; i&amp;lt;num_bytes/4; i+=4) {
    q = query[i];
    a += (_popcount_counts[q       &amp;amp; 0xff] +
          _popcount_counts[q &amp;gt;&amp;gt;  8 &amp;amp; 0xff] +
          _popcount_counts[q &amp;gt;&amp;gt; 16 &amp;amp; 0xff] +
          _popcount_counts[q &amp;gt;&amp;gt; 24 &amp;amp; 0xff]);
  }
  return a;
}
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

This is the inner loop of "byte_lookup"

&lt;pre class="code"&gt;
L5:
        movl    12(%ebp), %esi
        movsbl  (%esi,%edx),%eax
        movl    -16(%ebp), %esi
        addl    (%esi,%eax,4), %ecx
        addl    $1, %edx
        cmpl    %edx, %edi
        jne     L5
&lt;/pre&gt;

and this is the inner loop of "int_lookup".

&lt;pre class="code"&gt;
L13:
        movl    -16(%ebp), %eax
        movl    12(%ebp), %edx
        movl    (%edx,%eax,4), %ecx
        movzbl  %cl,%eax
        movzbl  %ch, %esi
        movl    (%edi,%eax,4), %edx
        addl    (%edi,%esi,4), %edx
        movl    %ecx, %eax
        sarl    $16, %eax
        andl    $255, %eax
        addl    (%edi,%eax,4), %edx
        shrl    $24, %ecx
        addl    (%edi,%ecx,4), %edx
        addl    %edx, -20(%ebp)
        addl    $4, -16(%ebp)
        movl    -24(%ebp), %eax
        cmpl    %eax, -16(%ebp)
        jl      L13
&lt;/pre&gt;

The first has 7 instrutions and one branch test but it's called 4
times more often than the second, which has 18 instructions and 1
branch test.  4*7=28, which is more than 18.  I don't know how long it
takes the CPU to run those instructions, so I'll just assume that all
instructions take the same time, or about 33% faster.  This isn't
always true because some instructions are faster than others, and
pipelining and other CPU tricks make things even more complicated.

&lt;/P&gt;&lt;P&gt;

My Python extension using bytes for the lookups is about 40% slower
than my high-speed C test case using ints.  These are commensurate
numbers, or at least suggestive.  It looks like I should rework my
code to use integers.  But there's something I need to worry about
first.

&lt;/P&gt;
&lt;h2&gt;__builtin_popcount and data alignment&lt;/h2&gt;
&lt;P&gt;

Earlier I talked about using the GNU C/C++ "__builtin_popcount"
extension to compute the popcount of an integer but I've been using my
own implementation based on the _popcount_counts lookup table using
bytes.  I did this for a few reasons.  You might not be using the GNU
compiler, though that's a low probability.  You might want to support
fingerprints which are a multiple of 8 but not of 32, though all my
tests are for 1024 bit fingerprints.  The main reason was I was
worried about &lt;a
href="http://en.wikipedia.org/wiki/Data_structure_alignment"&gt;memory
alignment&lt;/a&gt; concerns.

&lt;/P&gt;&lt;P&gt;

Some CPUs require non-character arrays to be memory aligned.  Even for
CPUs which don't have that requirement (like Intel machines),
non-aligned access is slower than aligned access.  I'm using Python to
pass a character pointer to a C function, but I don't know if that
pointer is correctly aligned to be used directly as an integer
pointer.  I decided it was best to ignore the problem until I got the
code working.

&lt;/P&gt;&lt;P&gt;

The Python string object is implemented in C, and my C extension gets
a pointer to the underlying string data.  Is it already aligned such
that I can cast the string directly into a "int *"?  I could figure it
in a couple of ways.  I could look at the source code and working out
what's going on, or I can run the code and asking it.  I'll do both,
starting with the first one, which is the hard one, first.

&lt;/P&gt;&lt;P&gt;

The layout for the string class is in "Include/stringobject.h" and is:

&lt;pre class="code"&gt;
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;
&lt;/pre&gt;

The "PyObject_VAR_HEAD" is used ... well, if you're really interested
then read the documentation in "Include/object.h".  The summary is
that it's a macro which in non-debug builds will expand to

&lt;pre class="code"&gt;
        Py_ssize_t ob_refcnt;
        struct _typeobject *ob_type;
        Py_ssize_t ob_size; /* Number of items in variable part */
&lt;/pre&gt;

Count up the space needed for the variables and you'll see that the
"ob_sval[1]" occurs on a multiple of 4 bytes.  It is aligned such that
it can be cast into an integer pointer without problems.

&lt;/P&gt;&lt;P&gt;

The easier way is to add a printf statement in the block_similarity
function call, like this:

&lt;pre class="code"&gt;
  printf("At %p %d %d\n", query, ((int)query) % 4, ((int)query) % 8);
&lt;/pre&gt;

This prints the pointer location and the location modulo 4 (for 32 bit
'int' alignment) and module 8 (for 64 bit 'long long' alignment).  If
it's integer aligned then the modulo result should be 0.  When I ran
this I got:

&lt;pre class="code"&gt;
At 0x13f1884 0 4
&lt;/pre&gt;

This means the array is integer aligned (because of the '0') but not
long aligned (because of the '4').  This agrees with what I figured
out from looking at the typedef layout.  I can therefore cast the
"char *" pointer to an "int *" then use __builtin_popcount on the
values, and not worry about alignment problems.  On the other hand, I
can't use __builtin_popcountll (the difference is the 'll' a the end;
this one is for "long long"s).

&lt;/P&gt;
&lt;h2&gt;My benchmark&lt;/h2&gt;
&lt;P&gt;

I'm going to focus on making the Tanimoto calculations faster.  As I
showed in earlier essays, my program's total run-time was only a rough
estimate for the calculation time.  It included Python's start-up cost
(importing numpy took 0.25 seconds!) and the implementation choice of
reading a file vs. using a memory mapped file.

&lt;/P&gt;&lt;P&gt;

I created a new benchmark program called "time_block_similarity.py"
which does exactly what it says.  I'll continue to use the
"nci.binary_fp" file as my reference data file, and to avoid disk
issues I'll read the entire file into memory before doing processing.
It's only 150 MB and I have more than that amount of free memory so
there shouldn't be any swapping.

&lt;pre class="code"&gt;
# time_block_similarity.py
import time
import tanimoto

# This is 155,667,200 bytes; 1,216,150 1024-bit fingerprints
targets = open("nci.binary_fp", "rb").read()

# Use the first 1024 bits as the query
query = targets[:1024//8]

t1 = time.time()
results = tanimoto.block_similarity(query, targets)
t2 = time.time()

N = len(results)
print "%s calculations in %.3f seconds; %.0f per second" % (N, t2-t1, N/(t2-t1))
print "Estimated PubChem time: %.1f seconds" % (19218991 * (t2-t1)/N,)
print "First few are:"
for x in results[:5]:
    print x
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Using my existing byte lookup implementation my fastest-of-three time
was:

&lt;pre class="code"&gt;
1216150 calculations in 0.750 seconds; 1620963 per second
Estimated PubChem time: 11.9 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
&lt;/pre&gt;

That's the time I want to beat.

&lt;/P&gt;
&lt;h2&gt;Faster block_similarity using integers&lt;/h2&gt;
&lt;P&gt;

Changing libtanimoto.c:block_similarity to use integers instead of
characters is not hard.  The following assumes that integers are 32
bits, the 'query' and 'targets' are integer aligned, and that
'num_bytes' is a multiple of 4.

&lt;pre class="code"&gt;

/* This is added to libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;

  /* precompute popcount(query) */
  for (i=0; i&amp;lt;num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word       &amp;amp; 0xff] + 
          _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +
          _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +
          _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);
  }

  /* For each fingerprint in the block */
  while (num_targets-- &gt; 0) {
    /* compute popcount(target) and popcount(target&amp;amp;query) */
    b=0; c=0;
    for (i=0; i&amp;lt;num_ints_in_fp; i++) {
      fp_word = targets_as_int[i];
      b += (_popcount_counts[fp_word       &amp;amp; 0xff] + 
            _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +
            _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +
            _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);
      fp_word &amp;= query_as_int[i];
      c += (_popcount_counts[fp_word       &amp;amp; 0xff] + 
            _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +
            _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +
            _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
&lt;/pre&gt;

and my best-of-three output from this is:

&lt;pre class="code"&gt;
1216150 calculations in 0.411 seconds; 2957097 per second
Estimated PubChem time: 6.5 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
&lt;/pre&gt;

That's almost half the time!  (45% faster if you want to be picky.)
I'm also at about the time I predicted from my prototype.  But I can
do better.

&lt;/P&gt;
&lt;h2&gt;Loop unrolling&lt;/h2&gt;
&lt;P&gt;

One reason the integer based code became faster was it's now doing one
memory fetch instead of 4.  Another reason is that it does about 1/4th
the number of branches.  Branches cause CPUs to cry.  I now assume the
fingerprint length is a multiple of 4 bytes, so I can do four
calculations before checking if I've reached the end.

&lt;/P&gt;&lt;P&gt;

This technique is called &lt;a
href="http://en.wikipedia.org/wiki/Loop_unrolling"&gt;loop unrolling&lt;/a&gt;.
(Despite what Wikipedia claims, most people use the term 'loop
unrolling' and do not use 'loop unwinding'.)  As a simple example,
here's a snippet of code which would compute the bitcount for a
fingerprint of length 256 bits, which is 8 32-bit integers.

&lt;pre class="code"&gt;
count = 0;
for (i=0; i&lt;8; i++) {
  count = __builtin_popcount(fp[i]);
}
&lt;/pre&gt;

The snippet uses the loop to make the code shorter.  I can unwind it
completely, like this:
&lt;pre class="code"&gt;
count = (__builtin_popcount(fp[0]) +
         __builtin_popcount(fp[1]) +
         __builtin_popcount(fp[2]) +
         __builtin_popcount(fp[3]) +
         __builtin_popcount(fp[4]) +
         __builtin_popcount(fp[5]) +
         __builtin_popcount(fp[6]) +
         __builtin_popcount(fp[7]));
&lt;/pre&gt;

This new version contains no tests and no branches.  CPUs like that,
and so do compilers.

&lt;/P&gt;&lt;P&gt;

In my benchmark I know the test fingerprints contain 1024 bits.  I
want to see if this idea is useful so I'll write a version which
completely unrolls the loop in the block_similarity function which
computes 'b' and 'c'.  The downside with unrolling is there's a lot
more code.  I'll work around that a bit by using a macro, but the
result is still ugly.

&lt;pre class="code"&gt;
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;

  /* precompute popcount(query) */
  for (i=0; i&amp;lt;num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
    a += (_popcount_counts[fp_word       &amp;amp; 0xff] + 
          _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +
          _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +
          _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);
  }

  /* For each fingerprint in the block */
  while (num_targets-- &gt; 0) {
    /* compute popcount(target) and popcount(target&amp;query) */
    b=0; c=0;
    i=0;

#define _B_AND_C_POPCOUNT \
    fp_word = targets_as_int[i];                    \
    b += (_popcount_counts[fp_word       &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);    \
    fp_word &amp;= query_as_int[i++];                    \
    c += (_popcount_counts[fp_word       &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt;  8 &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 16 &amp;amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 24 &amp;amp; 0xff]);

    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;

    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;

    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Does it make any difference?

&lt;pre class="code"&gt;
1216150 calculations in 0.306 seconds; 3968043 per second
Estimated PubChem time: 4.8 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
&lt;/pre&gt;

Nice!  This is about 25% faster than the previous best time.

&lt;/P&gt;
&lt;h2&gt;16 bit lookup table&lt;/h2&gt;
&lt;P&gt;

There are still more tricks up my sleve.  I wondered how much time was
spent in computing the popcount for a fingerprint so I removed the
calculation of 'b'

&lt;pre class="code"&gt;
    b += (_popcount_counts[fp_word       &amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt;  8 &amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 16 &amp; 0xff] +    \
          _popcount_counts[fp_word &gt;&gt; 24 &amp; 0xff]);    \
&lt;/pre&gt;

from my unwound macro definition.  This means 'b' will always be 0,
and the scores will be &lt;b&gt;completely wrong&lt;/b&gt;.  The resulting report was:

&lt;pre class="code"&gt;
1216150 calculations in 0.185 seconds; 6562436 per second
Estimated PubChem time: 2.9 seconds
First few are:
inf
2.41463414634
1.91666666667
2.04347826087
2.11111111111
&lt;/pre&gt;

which means about 40% of the time is spent computing one of the
fingerprints.  Which makes sense as the entire point is to compute two
fingerprints and I'm trying to avoid other overhead.
&lt;/P&gt;&lt;P&gt;

Can I make those faster?  I've been using an 8 bit lookup table.  I
could cut the number of operations in half by using a 16 bit lookup.
That table only takes about 64K of memory, which fits into cache these
days.  Would that be faster?

&lt;/P&gt;&lt;P&gt;

The 8 bit lookup table was small enough that I could write it in 10
lines of code.  The 16 bit table I did for my own proof-of-concept
testing took 2050 lines.  I don't want to list that here.  Instead,
I'll define space for the entire array and an 'init' function in
libtanimoto.c.  After the Python module, via ctypes, loads the module
it will call the initialization function.

&lt;/P&gt;&lt;P&gt;

It's been a long time since I've shown the full "libtanimoto.c" and
"tanimoto.py" files.  I've made lots of small changes over that time
so it's best to resynchronize.  Here are complete copies of each file,
with the important new parts in bold:

&lt;pre class="code"&gt;
/* libtanimoto.c */

/* Define space for the 16-bit lookup table */
/* Bootstrap using values for the 8-bit table */
static char _popcount_counts[&lt;b&gt;65536&lt;/b&gt;] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};

&lt;b&gt;/* Bootstrap the table.  This will be called by 'tanimoto.py' */
void init_tanimoto(void) {
  int i;
  /* If I know you are using GCC then you could call __builtin_popcount */
  for (i=256; i&amp;lt;65536; i++) {
    _popcount_counts[i] = (_popcount_counts[i &amp;amp; 0xff] +
                           _popcount_counts[i &gt;&gt; 8]);
  }
}&lt;/b&gt;

/* Compute the Tanimoto similarity between two fingerprints */
double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int a=0, b=0, c=0, i;
  for (i=0; i&amp;lt;num_bytes; i++) {
    a += _popcount_counts[query[i]];
    b += _popcount_counts[target[i]];
    c += _popcount_counts[query[i]&amp;amp;target[i]];
  }
  return ((double)c)/(a+b-c);
}

/* Compute the similarity beteen a query fingerprint and a block
   of target fingerprints.  Result values go into 'results'. */
/* This assumes num_bytes == 32 and that query and targets are int aligned */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;
  int *query_as_int = (int *) query;
  int *targets_as_int = (int *) targets;
  int num_ints_in_fp = num_bytes / sizeof(int);
  unsigned int fp_word;

  /* precompute popcount(query) */
  for (i=0; i&amp;lt;num_ints_in_fp; i++) {
    fp_word = query_as_int[i];
&lt;b&gt;    a += (_popcount_counts[fp_word &amp;amp; 0xffff] + 
          _popcount_counts[fp_word &gt;&gt; 16]);&lt;/b&gt;
  }

  /* For each fingerprint in the block */
  while (num_targets-- &gt; 0) {
    /* compute popcount(target) and popcount(target&amp;amp;query) */
    b=0; c=0;
    i=0;

#define _B_AND_C_POPCOUNT    \
    fp_word = targets_as_int[i];                \
&lt;b&gt;    b += (_popcount_counts[fp_word &amp;amp; 0xffff] +  \
          _popcount_counts[fp_word &gt;&gt; 16]);&lt;/b&gt;        \
    fp_word &amp;amp;= query_as_int[i++];                \
&lt;b&gt;    c += (_popcount_counts[fp_word &amp;amp; 0xffff] +  \
          _popcount_counts[fp_word &gt;&gt; 16]);&lt;/b&gt;

    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;

    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;
    _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT; _B_AND_C_POPCOUNT;

    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets_as_int += num_ints_in_fp;
  }
}
&lt;/pre&gt;

and the high-level Python interface to it:

&lt;pre class="code"&gt;
# tanimoto.py - wrapper to the libtanimoto shared library

import ctypes, ctypes.util
import numpy

# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))

&lt;b&gt;
# Call its initialization function
_init_tanimoto = libtanimoto.init_tanimoto
_init_tanimoto.argtypes = []
_init_tanimoto.restype = None
_init_tanimoto()&lt;/b&gt;

# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double

# Define the high-level public interface as a Python function
def similarity(query, target):
    n = len(query)
    m = len(target)
    if n != m:
        raise AssertionError("fingerprints must be the same length")
    return _similarity(n, query, target)

_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None

# The high-level public interface
def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

When I ran it I found:

&lt;pre class="code"&gt;
1216150 calculations in 0.231 seconds; 5269985 per second
Estimated PubChem time: 3.6 seconds
First few are:
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
&lt;/pre&gt;

The code is about 25% faster than the previous fastest version.  The
original byte lookup implementation at the start of this essay took
0.750 seconds, so the code is now about 3 times faster.  Although it's
not good enough for general use because it's hard-coded to 1024 bit,
integer aligned fingerprints.

&lt;/P&gt;&lt;P&gt;

That's okay.  I can add a test for that and if the input meets all the
right criteria then call the specialized function, otherwise use the
more general purpose but slower function.  Not something I'm going to
worry about now, but feel free to do it yourself.  It might start
something like:

&lt;pre class="code"&gt;
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
    /* ensure the data is all integer aligned */
    if (query % 4 == 0 &amp;amp;&amp;amp; targets % 4 == 0) {
      switch(num_bytes) {
        case 128: .. specialized code for 1024 bits ..
        case 64: .. specialized code for 512 bits ..
        case 40: .. specialized code for the MDL MACCS 320 keys ..
        case 32: .. specialized code for 256 bits ..
           .. more special cases ..
       default: break;
      }
    }
     ... general purpose code goes here ..
&lt;/pre&gt;

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html</guid><pubDate>Tue, 01 Jul 2008 12:00:00 GMT</pubDate></item><item><title>Block Tanimoto calculations</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html</link><description>&lt;P&gt;

This is part 4 of an on-going series dealing with molecular
fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html"&gt;Generating molecular fingerprints with OpenBabel&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;Computing Tanimoto scores, quickly&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;Block Tanimoto calculations&lt;/b&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html"&gt;Faster block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html"&gt;HAKMEM 169 and other popcount implementations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html"&gt;Testing the bitslice algorithm for popcount&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;

&lt;/P&gt;

&lt;h2&gt;One query, many targets&lt;/h2&gt;
&lt;P&gt;

In &lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;part
3&lt;/a&gt; I wrote a C function to compute the Tanimoto similarity score
between two fingerprints.  The Tanimoto is most often used when
searching a large set of target compounds.  Some questions are "what's
the Tanimoto similarity for every compound in the data set", "what are
the 10 target compounds that are most similar to the query, sorted by
similarity, and what are their similarity values?" and "how many
compounds are at least some specified similarity threshold to the
query compound?"

&lt;/P&gt;&lt;P&gt;

With the Tanimo.similarity function extension I defined earlier, it's
simple to implement these algorithms.  Here's one which prints the
Tanimoto score for every compound in a given ".binary_fp" file using
the first compound as the query.

&lt;pre class="code"&gt;
# print_similarity.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) &amp;gt; 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits


f.seek(0)
while 1:
    target_fp = f.read(128)   # 1024 bits
    if not target_fp:
        # reached the end of file
        break
    print tanimoto.similarity(query_fp, target_fp)
&lt;/pre&gt;

Here's some output to show you what it writes:

&lt;pre class="code"&gt;
% &lt;b&gt;python print_similarity.py nci.binary_fp | head&lt;/b&gt;
1.0
0.492537313433
0.450980392157
0.546511627907
0.505319148936
0.686813186813
0.132701421801
0.132701421801
0.602564102564
0.459330143541
Traceback (most recent call last):
  File "print_similarity.py", line 19, in &lt;module&gt;
    print tanimoto.similarity(query_fp, target_fp)
IOError: [Errno 32] Broken pipe
&lt;/pre&gt;

How fast is this code?  I took the PubChem SD file for compounds
"09425001 - 09450000" and converted it into a Babel fingerprint file,
and thence into a ".binary_fp" file.

&lt;pre class="code"&gt;
% &lt;b&gt;babel Compound_09425001_09450000.sdf.gz nci_09425001_09450000.fpt -xh&lt;/b&gt;
24323 molecules converted
364134 audit log messages 
% &lt;b&gt;python fpt_to_binary.py nci_09425001_09450000.fpt&lt;/b&gt;
% &lt;b&gt;ls -l nci_09425001_09450000.*&lt;/b&gt;
-rw-r--r--   1 dalke  staff  3113344 Jun  9 16:30 nci_09425001_09450000.binary_fp
-rw-r--r--   1 dalke  staff  8215091 Jun  9 16:20 nci_09425001_09450000.fpt
% &lt;b&gt;python -c 'print 3113344/24323.'&lt;/b&gt;
128.0
&lt;/pre&gt;

I concatenated 50 copies of the binary file together to get a single
large file, called "nci.binary_fp":

&lt;pre class="code"&gt;
% &lt;b&gt;ls -l nci.binary_fp&lt;/b&gt;
-rw-r--r--   1 dalke  staff  155667200 Jun  9 16:46 nci.binary_fp
%
&lt;/pre&gt;

I had to make it this large because I wanted it to be somewhat
realistic.  This corresponds to a fingerprint database of 1,216,150
compounds.

&lt;/P&gt;&lt;P&gt;

How long does it take to print the Tanimoto scores for the entire data
set?

&lt;pre class="code"&gt;
% &lt;b&gt;time python print_similarity.py nci.binary_fp &amp;gt; /dev/null&lt;/b&gt;
9.419u 0.442s 0:09.87 99.7%     0+0k 0+0io 0pf+0w
%
&lt;/pre&gt;

Decent.  It means that PubChem (19,218,991 compounds) would take about
2.5 minutes, which is several times faster than the 7 minutes I
estimated in pure Python, but still quite a bit slower than the 20
seconds I estimated was the limit using pure C.

&lt;/P&gt;
&lt;h2&gt;Push blocks of targets to the C code&lt;/h2&gt;
&lt;P&gt;

Where is the extra time being spent?  Probably because I'm doing a lot
of work in Python in order to do a very little bit of work in C.
Every Tanimoto calculation requires a Python function, plus another to
read 1024 bits, plus a test to see if it's the end of file.  Python
function calls take a lot of time compared to C function calls.

&lt;/P&gt;&lt;P&gt;

I could do some profiling to figure out if this guess is correct but
it's about as easy to implement another solution and compare the
timings.  Rather than computing a single Tanimoto at a time I'm going
to pass many fingerprints to the C function, and get a list of
Tanimoto values back.

&lt;/P&gt;&lt;P&gt;

One of the points I've been bring up in these writings is that there
are many solutions to a problem.  In this case I'm going to pass a set
of fingerprints to a C function, but I get to decide how I want to
pass the data in.  I could use a Python list of strings, or a Numpy
array of unsigned integers, or many more possibilities.  The solutions
I'm exploring are based on experience but it doesn't mean I'm right,
nor does it mean I can get to a good solution without some
exploration.

&lt;/P&gt;
&lt;h2&gt;tanimoto.block_similarity - my desired API&lt;/h2&gt;
&lt;P&gt;

For my solution now I'm going to pass the set of query fingerprints as
a single string, with the first fingerprint as the first 128 bytes,
the second as the next 128 bytes, and so on.  I'll call this single
string which contains many fingerprints a "block" so my new function
will be called "block_similarity."  I want the result to be something
like a Python list, where I can iterate over the results for each
fingerprint, in order.  In code, I want the main loop of the
calcultion to act like the following interface:

&lt;pre class="code"&gt;
while 1:
    target_fps = f.read(128*1024)   # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

I want the results returned as a list but C functions can only return
a simple data type like an integer or a pointer.  Again, there are
several possible solutions.  I'll have my Python wrapper module
allocate space for the results and pass a pointer to that space to the
C function, which assigns the correct values.

&lt;/P&gt;
&lt;h2&gt;libtanimoto.c:block_similarity&lt;/h2&gt;
&lt;P&gt;

With most of the hard work of allocating results handled by some
yet-to-be-written Python code, the C code isn't that hard:

&lt;pre class="code"&gt;
/* This is in libtanimoto.c */
void block_similarity(int num_bytes, unsigned char *query,
                      int num_targets, unsigned char *targets,
                      double *results) {
  int a=0, b=0, c=0, i;

  /* precompute popcount(query) */
  for (i=0; i&amp;lt;num_bytes; i++) {
    a += _popcount_counts[query[i]];
  }

  /* For each fingerprint in the block */
  while (num_targets-- &amp;gt; 0) {
    /* compute popcount(target) and popcount(target&amp;amp;query) */
	b=0; c=0;
    for (i=0; i&amp;lt;num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&amp;amp;targets[i]];
    }
    /* Append the result and go on to the next fingerprint */
    *results = ((double)c)/(a+b-c);
    results++;
    targets += num_bytes;
  }
}
&lt;/pre&gt;

Because I expect to do a Tanimoto calculation using many targets I
could do a bit of optimization.  The Tanimoto function is "c/(a+b-c)"
where "a" is the number of on-bits in the query fingerprint.  That's a
constant so I compute it once.

&lt;/P&gt;
&lt;h2&gt;tanimoto.py:block_similarity&lt;/h2&gt;
&lt;P&gt;

The hard work of allocating memory goes into the 'tanimoto.py' file.
I decided to use a &lt;a href="http://numpy.scipy.org/"&gt;numpy&lt;/a&gt; array
to store the results because they are designed to work with ctypes.  I
can pass the underlying data storage area to a C function with only a
'cast'.  Numpy also supports functions like 'numpy.sort' so working
with numpy arrays can be more convenient than working with a memory
block allocated directly by ctypes, which is my other likely
possibility.  On the other hand it does put a third-party dependency
in the library.

&lt;/P&gt;&lt;P&gt;

Here's the code:

&lt;pre class="code"&gt;
# This is added to 'tanimoto.py'
_block_similarity = libtanimoto.block_similarity
_block_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p,
                              ctypes.c_int, ctypes.c_char_p,
                              ctypes.POINTER(ctypes.c_double)]
_block_similarity.restype = None

import numpy
def block_similarity(query, targets):
    n = len(query)
    m = len(targets)
    if m % n != 0:
        raise AssertionError("length mismatch")
    num_targets = m//n
    result = numpy.zeros(num_targets, numpy.double)
    result_ctypes = ctypes.cast(result.ctypes.data,
                                ctypes.POINTER(ctypes.c_double))
    _block_similarity(n, query, num_targets, targets, result_ctypes)
    return result
&lt;/pre&gt;

Looks easy, right?  You'll be surprised at how long it took to get
that working.  I figured that because the numpy array is a
'numpy.double' then ctypes would see it as a ctypes.c_double array,
without needing the cast.  Instead, without the case ctypes complains:

&lt;pre class="code"&gt;
ctypes.ArgumentError: argument 5: &amp;lt;type 'exceptions.TypeError'&amp;gt;:
 expected LP_c_double instance instead of numpy.ndarray
&lt;/pre&gt;

Strange.  But what I did seems to work, so hopefully others who are
confused will search on the error message and find this essay useful.

&lt;/P&gt;&lt;P&gt;

After I compiled the modified libtanimoto shared library (using
'scons') and updated the 'tanimoto.py' wrapper library I did a quick
test:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; tanimoto.block_similarity("A", "AaB")
array([ 1.        ,  0.66666667,  0.33333333])
&amp;gt;&amp;gt;&amp;gt; 
&amp;gt;&amp;gt;&amp;gt;  [hex(ord(c)) for c in "AaB"]
['0x41', '0x61', '0x42']
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

In this case the query fingerprint is 8 bits long (the character 'A",
which is the bit pattern 01000001) and the target fingerprint block
contains three 8 bit fingerprints (01000001, 11000001 and 01000010).
The result should be three Tanimoto scores, of values 1.0, 2/3 and 1/3
each.  Which I got.  It seems to work.

&lt;/P&gt;&lt;P&gt;

Not shown here is the step where I compared the results of the
"block_similarity" function to one based on the older "similaity"
function, to make sure they gave the same results.

&lt;/P&gt;
&lt;h2&gt;Computing all Tanimoto scores using block_similarity&lt;/h2&gt;
&lt;P&gt;

Once I was convinced I got the right answers, I wrote a
"print_similarity2.py" function which works the same as
"print_similarity.py" (it calculates and prints all of the Tanimoto
scores for the given fingerprint file) except it uses the block-based
similarity calculations.

&lt;pre class="code"&gt;
# print_similarity2.py
import sys
import tanimoto

filename = "drugs.binary_fp"
if len(sys.argv) &amp;gt; 1:
    filename = sys.argv[1]

# Get the query fingerprint
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits


f.seek(0)
while 1:
    target_fps = f.read(128*1024)   # 1024 fingerprints of 1024 bits each
    if not target_fps:
        # reached the end of file
        break
    for score in tanimoto.block_similarity(query_fp, target_fps):
        print score
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

This code takes about 4.1 seconds to run on my 'nci.binary_fp' data
set:

&lt;pre class="code"&gt;
% &lt;b&gt;time python print_similarity2.py nci.binary_fp &amp;gt; /dev/null&lt;/b&gt;
3.571u 0.546s 0:04.12 99.7%     0+0k 0+33io 0pf+0w
&lt;/pre&gt;

which is quite a bit better than the 9.9 seconds of
"print_similarity.py".  Searching PubChem this way should take about
65 seconds, which is about 1/3rd of the speed I expect from C code.

&lt;/P&gt;
&lt;h2&gt;The effects of block size&lt;/h2&gt;
&lt;P&gt;

The code uses a tunable parameter, which is the number of fingerprints
to include in a block.  I timed a variety of block sizes to figure out
what is optimal for my machine:

&lt;pre class="code"&gt;
  number
  of fps    time
 in block  (in s)
      1    42.06
      2    23.41
      5    11.85
     10     8.31
     25     5.68
     50     4.82
    100     4.46
   1000     4.17
   5000     4.17
  10000     4.22
 100000     5.15
&lt;/pre&gt;

It looks like my chosen size of 1024 is a good one.  It gives good
performance while minimizing memory use.  Your machine might be
different.  (Note that I used my computer to listen to music during
this, which is why the numbers are slightly slower than 4.12 seconds.)

&lt;/P&gt;&lt;P&gt;

My data set is "only" 155,667,200 bytes long (149 MB).  It easily fits
into memory.  I was curious on how much time could be saved if I
didn't do any I/O, that is, if I read the entire data file into a
string and only computed the time needed to run "block_similarity" on
that string.  Result: 3.30s, or an estimated PubChem search time of 52
seconds.  About 0.9 seconds is being spent in Python startup time and
reading the data file.

&lt;/P&gt;
&lt;h2&gt;Why's my code so slow?&lt;/h2&gt;
&lt;P&gt;

Almost all of my time is in C code, yet I'm still at 1/3rd the time I
expect from the fastest possible C code.  What's going on?  There's
still a few differences between my code and the "fast" code I wrote.
&lt;ul&gt;
 &lt;li&gt;The fast code repeated the same calculations many times, which meant
it didn't need to fetch data from disk and memory to the CPU.&lt;/li&gt;
 &lt;li&gt;My block_similarity wrapper code allocated memory for the
 results&lt;/li&gt;
 &lt;li&gt;I iterate over the results, and print them to /dev/null&lt;/li&gt;
 &lt;li&gt;My popcount algorithm uses a lookup table, not __builtin_popcount()&lt;/li&gt;
&lt;/ul&gt;

Of these, I'm guessing "print score" probably takes the longest.  Each
number is converted (using __str__) into a string and sent to stdout,
only to be redirected to /dev/null.  To test that guess I changed
"print_similarity2.py" so the inner loop was:

&lt;pre class="code"&gt;
    for score in tanimoto.block_similarity(query_fp, target_fp):
        #print score
        pass
&lt;/pre&gt;

I no longer print out the score.  When I run the code I get:
&lt;pre class="code"&gt;
% &lt;b&gt;time python print_similarity2.py nci.binary_fp&lt;/b&gt;
1.301u 0.538s 0:01.84 99.4%     0+0k 0+0io 0pf+0w
%
&lt;/pre&gt;

4.12 to 1.84 seconds is quite a difference!  The estimated time to
compute the Tanimoto for all of PubChem (but not print out the
results) is now 30 seconds, which is only 4 times slower than my
expected fastest possible performance (using this approach) of 8
seconds.

&lt;/P&gt;&lt;P&gt;

Of course, if I'm not going to print the values then there's no need
to iterate over the return values.  This iteration is slower than
usual in Python because the data is stored in Numpy array, which hold
C doubles, not Python floats.  The iteration has to convert each C
double into a Python float, which also takes time.

&lt;/P&gt;&lt;P&gt;

I removed iteration from my code so the main loop becomes:

&lt;pre class="code"&gt;
while 1:
    target_fp = f.read(128*4096)   # 1024 of 1024 bits
    if not target_fp:
        # reached the end of file
        break
    tanimoto.block_similarity(query_fp, target_fp)
&lt;/pre&gt;

and the total time for my test set is 1.6 seconds,

&lt;pre class="code"&gt;
% &lt;b&gt;time python print_similarity2.py nci.binary_fp&lt;/b&gt;
1.033u 0.536s 0:01.57 99.3%     0+0k 0+11io 0pf+0w
&lt;/pre&gt;

or an estimated 26 seconds for all of PubChem.  I feel better now, as
I wasn't sure why my code was so much slower than it should be.  I
could go back and rewrite my text so you would never see my confusion
and wonderings about this, but where's the fun in that?  In this way
you can see some of how I got to where I did.

&lt;/P&gt;
&lt;h2&gt;mmap - memory-mapped files&lt;/h2&gt;
&lt;P&gt;

Earlier I mentioned that the block size was a tuneable parameter.  I
don't like tuneable parameters.  I prefer self-tuning systems or using
a better algorithm.  It turns out there is another solution - a
memory-mapped file.  This is an operating system mechanism to treat a
a file as a string in memory instead of using file operations like
'read()'.  In theory I would mmap the file to a block of memory then
pass the block directly to to the "block_similarity" function.

&lt;/P&gt;&lt;P&gt;

Sadly, it doesn't work as easily as I thought it would.  Python has a
&lt;a href="http://docs.python.org/lib/module-mmap.html"&gt;mmap module&lt;/a&gt;
which handles making a memory-mapped file, but the mmap object isn't
compatible with the 'ctypes' interface.  It seems that perhaps Python
2.6 makes this easier, but I'm not sure.

&lt;/P&gt;&lt;P&gt;

Instead I bypassed Python's mmap module and called the mmap function
directly from libc, using ctypes.  I ran into some difficulties, which
ended up being because the "off_t" C type is a 64 bit integer, and not
a 32 bit one like I expected.  Getting the data types correct is one
of the hard parts of using ctypes.

&lt;/P&gt;&lt;P&gt;

Here's the code for "print_similarity3.py".  It's using the same
libtanimoto:block_similarity function as "print_similarity2.py" but
because the entire file is mapped to a string, I don't have to worry
about reading a block at a time - I can pass the entire mmap'ed file
to the function.  (This will create in the PubChem case a 152 MByte
Numpy array of doubles, which may be unexpected.)

&lt;pre class="code"&gt;
# print_similarity3.py
import sys
import os
import mmap
import ctypes, ctypes.util
from ctypes import c_void_p, c_int, c_char, c_int64

import tanimoto

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.mmap.restype = c_void_p
libc.mmap.argtypes = (c_int, c_int, c_int, c_int, c_int, c_int64)

filename = "drugs.binary_fp"
if len(sys.argv) &amp;gt; 1:
    filename = sys.argv[1]

# Get the query fingerprint and reset
f = open(filename, "rb")
query_fp = f.read(128)  # 1024 bits
f.seek(0)

size = os.path.getsize(filename)

# mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset);
ptr = libc.mmap(0, size, mmap.PROT_READ, mmap.MAP_SHARED,  f.fileno(), 0)

targets = ctypes.cast(ptr, ctypes.POINTER(c_char*size))[0]

#  Don't iterate and print the scores; that takes a long time
#  and I want to focus on the computation part.
#for score in tanimoto.block_similarity(query_fp, targets):
#    print score
tanimoto.block_similarity(query_fp, targets)

&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Getting timing numbers for this has been a bit tricky.  The first time
I ran this took a few tens of seconds as my Mac swapped pages around.
I had a lot of apps using virtual memory.  The second time it went
faster.  I've found in general that using mmap'ed files gives
performance profiles that are harder to predict, so I tend to shy away
from them.  I also have read strange things about using them with big
files (&amp;gt;1GB, or ast least &amp;gt; main memory) but I don't have the
experience to judge.

&lt;/P&gt;&lt;P&gt;

On the plus side, here's the timings:

&lt;pre class="code"&gt;
% &lt;b&gt;time python print_similarity3.py nci.binary_fp&lt;/b&gt;
0.950u 0.318s 0:01.27 99.2%     0+0k 0+0io 0pf+0w
&lt;/pre&gt;

PubChem's fingerprint file would be 2.3 GB.  If the mmap access scales
to that file size then I can compute all its Tanimoto values in about
20 seconds, which is is impressively fast.

&lt;/P&gt;
&lt;h2&gt;Python and NumPy startup costs&lt;/h2&gt;
&lt;P&gt;

My reference C code could in theory calculate Tanimoto values for
PubChem in 8 seconds so my Python program timings, even with mmap,
seems rather slow.  Then again, these aren't computing the same
things.  For one thing, by simply scaling the time I'm exaggerating
the overhead of starting Python.  That should not be affected by the
number of compounds to compute.

&lt;/P&gt;&lt;P&gt;

I changed the last few lines of the "print_similarity3.py" file so I
could see how much time was spent in running "block_similarity"

&lt;pre class="code"&gt;
  ...
import time
t1 = time.time()
tanimoto.block_similarity(query_fp, targets)
t2 = time.time()
print &amp;gt;&amp;gt;sys.stderr, t2-t1
&lt;/pre&gt;

It was 0.90 seconds out of a total time of 1.28 seconds.  That means
0.4 seconds is being spend on startup costs?  That's unexpected.  Of
that, Python's startup time is 0.05 seconds

&lt;pre class="code"&gt;
% &lt;b&gt;time python -c 'a=1'&lt;/b&gt;
0.014u 0.040s 0:00.05 100.0%    0+0k 0+0io 0pf+0w
&lt;/pre&gt;

I added some print statements and calls to time.time() and found the
time was spent in importing 'numpy'!  You can see that in the
following:

&lt;pre class="code"&gt;
% &lt;b&gt;time python -c 'import numpy'&lt;/b&gt;
0.141u 0.221s 0:00.37 97.2%     0+0k 0+0io 0pf+0w
&lt;/pre&gt;

For whatever reason, 'import numpy' also imports a huge number of
modules, about half of which are submodules.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import sys
&amp;gt;&amp;gt;&amp;gt; len(sys.modules)
28
&amp;gt;&amp;gt;&amp;gt; import numpy
&amp;gt;&amp;gt;&amp;gt; len(sys.modules)
225
&amp;gt;&amp;gt;&amp;gt; len([s for s in sorted(sys.modules) if 'numpy' in s])
127
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

Timing just 'libtanimoto.block_similarity' gave me 0.9 seconds.  Of
that, the difference between

&lt;pre class="code"&gt;
    for (i=0; i&amp;lt;num_bytes; i++) {
      b += _popcount_counts[targets[i]];
      c += _popcount_counts[query[i]&amp;amp;targets[i]];
    }
&lt;/pre&gt;
and the code without the table lookup
&lt;pre class="code"&gt;
	for (i=0; i&amp;lt;num_bytes; i++) {
      b += targets[i];
      c += query[i]&amp;amp;targets[i];
    }
&lt;/pre&gt;

is 0.88 seconds vs. 0.50 seconds (or estimated 14s for PubChem
vs. 7.9s).  Okay, I'm convinced now that a lot time is now being spent
on the Tanimoto calculation.

&lt;/P&gt;&lt;P&gt;

I also tried replacing the "i&amp;lt;num_bytes" with "i&amp;lt;128" and it
turns out that reduces the time from 0.88 seconds to 0.84 seconds (13
seconds for PubChem).  I'll work on more performance optimizations in
the next article.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html</guid><pubDate>Sat, 28 Jun 2008 12:00:00 GMT</pubDate></item><item><title>Computing Tanimoto scores, quickly</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html</link><description>&lt;P&gt;

This is part 3 of an on-going series dealing with molecular
fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html"&gt;Generating molecular fingerprints with OpenBabel&lt;/a&gt;
 &lt;li&gt;&lt;b&gt;Computing Tanimoto scores, quickly&lt;/b&gt;&lt;/li&gt;
&lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/06/28/block_tanimoto_calculations.html"&gt;Block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/01/faster_tanimtocat_search_search.html"&gt;Faster block Tanimoto calculations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a
href="http://www.dalkescientific.com/writings/diary/archive/2008/07/03/hakmem_and_other_popcounts.html"&gt;HAKMEM 169 and other popcount implementations&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/07/05/bitslice_and_popcount.html"&gt;Testing the bitslice algorithm for popcount&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;

&lt;/P&gt;
&lt;hr /&gt;&lt;P&gt;

For now my driving example will be to print the Tanimoto values for
each compound in a data set, in order.  I get to control where the
data comes from and the format that it's in, and I can precompute
whatever I want about the target data set.  I deliberately don't care
about the identifiers, only the Tanimoto scores, so I can focus on the
search algorithm.  In this essay I'll focus on computing the Tanimoto
between two fingerprints.  The next one will compute the Tanimoto
between a single query fingerprint and a large number of targets.


&lt;/P&gt;
&lt;h2&gt;Converting Babel's "fpt" format to "binary_fp" format&lt;/h2&gt;
&lt;P&gt;

I'll start by converting the hex fingerprints from the "fpt" file into
a new binary file format, and I'll assume that there are 1024 bits in
the fingerprint.  This new format will be very simple.  Each
fingerprint takes up 1024/8 = 128 bytes, and in the same order as the
input file.  This is &lt;i&gt;not&lt;/i&gt; a good format for general use.  I'll
talk about why later.  Because it's a bad format, I'm using an ugly
and long extension to the filename.  Don't use this format for real
work!

&lt;/P&gt;&lt;P&gt;

I decided to use Python for this conversion rather than C because
while my eventual goal is to do fast Tanimoto searches, this is a
preprocessing step where the performance is not critical.  Plus, I
usually don't like parsing in C.

&lt;/P&gt;&lt;P&gt;

Parsing the "fpt" format turns out to be a bit of a pain, in almost
the same way that &lt;a
href="http://www.dalkescientific.com/writings/NBN/parsing.html"&gt;parsing
a FASTA file&lt;/a&gt; is a pain.  There only way to know you are done with
one record is to read the first line of the next record, or get to the
end of the file.  I read the header line (which is different for the
first record vs. the other records in the file) and merge the hex
lines into a single hex fingerprint string.

&lt;/P&gt;&lt;P&gt;

(Earlier I made it easier for myself by assuming I could read the
entire file into memory, but because I'm going to parse a large
fingerprint file soon, I can't make that assumption now.)

&lt;/P&gt;&lt;P&gt;

After I have the fingerprint as a hex digits string I need to convert
it into the raw byte values.  I thought this would be tedious until I
remembered there's a "hex" encoding in Python which does that for me,
and which makes this step trivial. 

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; "Hello!".encode("hex")
'48656c6c6f21'
&amp;gt;&amp;gt;&amp;gt; "48656c6c6f21".decode("hex")
'Hello!'
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Here's the conversion code, which I named "fpt_to_binary.py" for
boringly obvious reasons.

&lt;pre class="code"&gt;
# fpt_to_binary_fp.py
# Convert from an OpenBabel "fpt" format with hex fingerprints (use
# -xf to generate the fingerprints) into a binary fingerprint file.

# Fingerprints are written consecutively, with no title information or
# separator between the fingerprints.  This is a very raw format and
# should not be used for anything other then benchmarking and
# bootstrapping into a better format.

# Parsing formats with no explict end-of-record marker is a nuisance.
def read_records(infile):
    # Parse the header line for the first record, which looks like:
    #   &amp;gt;Strychnine   183 bits set 
    # I'm not splitting on the " " because the title might have a space.
    # Using a regular expression proved a bit too cumbersome.
    line = infile.readline()
    assert line[0] == "&amp;gt;"
    # Turn "&amp;gt;Strychnine   183 bits set " into
    #     ["&amp;gt;Strychnine", "183", "bits", "set"]
    words = line.rsplit(None, 3)
    assert words[-2:] == ["bits", "set"]
    title = words[0][1:]

    while 1:
        
        hex_blocks = []
        for line in infile:
            if line.startswith("Possible"):  # "Possible superstructure"
                continue
            if line[:1] == "&amp;gt;":
                break
            hex_blocks.append("".join(line.split()))
        else:
            # End of input
            yield title, "".join(hex_blocks)
            break

        # "line" contains a title line for the next record
        yield title, "".join(hex_blocks)

        # Parse the header line for the other records, which look like:
        #   &amp;gt;cocaine   Tanimoto from Strychnine = 0.353234 
        assert line[0] == "&amp;gt;"
        i = line.rindex("Tanimoto from")
        title = line[1:i].strip()
    
import sys, os

input_filename = sys.argv[1]
output_filename = os.path.splitext(input_filename)[0] + ".binary_fp"

infile = open(input_filename)
outfile = open(output_filename, "wb")
for title, hexdata in read_records(infile):
    #print repr(title)
    outfile.write(hexdata.decode("hex"))
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Does it work?  I'll convert the "drugs.fpt" file and compare its
contents with the newly created "drugs.binary_fp" file.

&lt;pre class="code"&gt;
% &lt;b&gt;python fpt_to_binary_fp.py drugs.fpt&lt;/b&gt;
% &lt;b&gt;head drugs.fpt&lt;/b&gt;
&amp;gt;Strychnine   183 bits set 
00054062 81009600 00100102 81010700 200c0000 00202850 
03102000 28000404 10200882 001849e4 483c0024 50039002 
1402801a 01b90000 00020540 010c100a 22a4c000 82000300 
2ac02018 00002201 60102120 183c9630 2100a080 81500019 
00041401 80008c00 00484810 90001080 040c8202 0006081b 
24020080 a2042610 
&amp;gt;cocaine   Tanimoto from Strychnine = 0.353234
00010000 01000800 00100100 00010700 000c0050 00204010 
00000000 20000000 00000100 010008c0 00200000 50019000 
% &lt;b&gt;od -t xC drugs.binary_fp | head&lt;/b&gt;
0000000    00  05  40  62  81  00  96  00  00  10  01  02  81  01  07  00
0000020    20  0c  00  00  00  20  28  50  03  10  20  00  28  00  04  04
0000040    10  20  08  82  00  18  49  e4  48  3c  00  24  50  03  90  02
0000060    14  02  80  1a  01  b9  00  00  00  02  05  40  01  0c  10  0a
0000100    22  a4  c0  00  82  00  03  00  2a  c0  20  18  00  00  22  01
0000120    60  10  21  20  18  3c  96  30  21  00  a0  80  81  50  00  19
0000140    00  04  14  01  80  00  8c  00  00  48  48  10  90  00  10  80
0000160    04  0c  82  02  00  06  08  1b  24  02  00  80  a2  04  26  10
0000200    00  01  00  00  01  00  08  00  00  10  01  00  00  01  07  00
0000220    00  0c  00  50  00  20  40  10  00  00  00  00  20  00  00  00
% 
&lt;/pre&gt;

The "od -t cH" displays the hex values for the bytes of the
"drugs.binary_fp" file.  You can see that the output file is simply
the input hex digits, expressed directly in binary.  (You might have
to stare at it for a bit.)  So yes, it does work.

&lt;/P&gt;
&lt;h2&gt;Computing the popcount using only Python&lt;/h2&gt;
&lt;P&gt;

Onwards to computing the Tanimoto scores!  This will require computing
the popcount of the intersection of two fingerprints.  Normal Python
is really quite bad at that.  I could convert the 1024 bit/128 byte
fingerprint into a Python integer (like I did earlier).  Computing the
bit intersection between two integers is easy with the '&amp;amp;'
operator, but finding the popcount is a bit complicated.  (See below
for elaboration.)

&lt;/P&gt;&lt;P&gt;

I could keep everything as a byte string.  Finding the popcount of the
fingerprint is the same as the sum of the popcount of each byte.
There are only 256 possible bytes, so that's a small lookup table.
The problem is computing the bits in common to both fingerprints.
Python doesn't support bit operations on characters, so I would have
to convert the bytes to a number before doing the &amp;amp;.  This is
completely doable, but there's a somewhat better solution.

&lt;/P&gt;&lt;P&gt;

I can use the '&lt;a
href="http://docs.python.org/lib/module-array.html"&gt;array&lt;/a&gt;' module
and create a byte array.  This is a seldom used Python module but it
seems just the thing for this case.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt;import array
&amp;gt;&amp;gt;&amp;gt; array.array("b", "This is a string")
array('b', [84, 104, 105, 115, 32, 105, 115, 32, 97, 32, 115, 116, 114, 105, 110, 103])
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

This converted the input string into an array of bytes, stored as
integers.  To compute the total fingerprint popcount I'll sum the
popcount for each byte in the fingerprint.  I can get that by looking
it up in a table of precomputed values for each possible byte.  To
compute the number of bits in common between the two fingerprints I'll
compute the sum of the number of bits in common between the
corresponding bytes in the fingerprints.

&lt;pre class="code"&gt;
import array

# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
#  ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    )
fp = array.array('b', [84, 104, 105, 115, 32, 105, 115, 32,
                       97, 32, 115, 116, 114, 105, 110, 103])
print "popcount =", sum(popcount_in_byte[byte] for byte in fp)
&lt;/pre&gt;

If you copy and past that into a Python terminal (or save it and run
it), you'll get:

&lt;pre class="code"&gt;
popcount = 57
&lt;/pre&gt;

which is the number of 'on' bits in the string "This is a string".  To
verify that, here's a completely different (and slower) method based
on simple bit counting.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; "This is a string".encode("hex")
'54686973206973206120737472696e67'
&amp;gt;&amp;gt;&amp;gt; int("This is a string".encode("hex"), 16)
112197289293498629157862941796076187239L
&amp;gt;&amp;gt;&amp;gt; x = int("This is a string".encode("hex"), 16)
&amp;gt;&amp;gt;&amp;gt; popcount = 0
&amp;gt;&amp;gt;&amp;gt; while x:
...   if x &amp; 1:
...     popcount += 1
...   x &amp;gt;&amp;gt;= 1
... 
&amp;gt;&amp;gt;&amp;gt; popcount
57
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

The two methods agree, so it's very likely that the code works.
Although there could still be mistakes in the lookup table.

&lt;/P&gt;
&lt;h2&gt;Computing the Tanimoto using only Python&lt;/h2&gt;
&lt;P&gt;

Now that I have a popcount it's easy to compute the Tanimoto.  In the
following program I'll print the Tanimoto for every fingerprint in the
specified binary_fp file, using the first fingerprint as the query
fingerprint.  I'll include computing the Tanimoto of the query
fingerprint against itself.

&lt;pre class="code"&gt;
# compute_tanimoto.py
import sys
import array

# Create a lookup table from byte value to 'on' bit count
# chr(0) has 0 'on' bits
# chr(1) has 1 'on' bit
# chr(2) has 1 'on' bit
# chr(3) has 2 'on' bits
#  ...
# chr(255) has 8 'on' bits
popcount_in_byte = (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    )


def tanimoto(query, target):
    a = b = c = 0
    for query_byte, target_byte in zip(query, target):
        a += popcount_in_byte[query_byte]
        b += popcount_in_byte[target_byte]
        c += popcount_in_byte[query_byte &amp; target_byte]
    # I'll define that if neither fingerprint has any
    # set bits (so a == b == 0) then the Tanimoto is 0.
    # Otherwise the equation will try to compute 0/0
    # and Python will throw an exception.
    if not c:
        return 0
    return float(c)/(a+b-c)

if len(sys.argv) &amp;gt; 1:
    filename = sys.argv[1]
else:
    filename = "drugs.binary_fp"

# Use the first fingerprint as the query
infile = open(filename, "rb")
s = infile.read(1024//8)  # 128 bytes in a fingerprint
query = array.array("b", s)

# Reopen and compute the Tanimoto against all fingerprints
# including itself.
infile = open(filename, "rb")

while 1:
    s = infile.read(1024//8)
    if not s:
        # End of file
        break

    target = array.array("b", s)
    print tanimoto(query, target)
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

The output from this program on the previously created
"drugs.binary_fp" file is:

&lt;pre class="code"&gt;
% &lt;b&gt;python compute_tanimoto.py drugs.binary_fp&lt;/b&gt;
1.0
0.353233830846
0.330316742081
0.401486988848
0.332046332046
0.303571428571
0.317406143345
0.22009569378
0.124444444444
0.1
%
&lt;/pre&gt;

How do I verify those are correct?  OpenBabel also computed the
Tanimoto scores when it generated the ".fpt" file.  You can go back
and compare (as I did) and see that they agree.  The only difference
is that I'm printing the results to double precision while OpenBabel
prints them to single precision.  Since the denominator is at best
1024, single precision is more than accurate enough.  In fact, double
implies an accuracy that isn't warranted.

&lt;/P&gt;
&lt;h2&gt;Can I do this faster without writing C code?&lt;/h2&gt;
&lt;P&gt;

There are ways to make the popcount code go faster using standard
Python or available 3rd party extensions.  I'll review some of them
but won't use them outside this section.


&lt;/P&gt;&lt;P&gt;

The standard Python distribution is not a good platform for doing fast
Tanimoto calculations.  There is no builtin function for it and while
I can build them they are all relatively slow.  That's why I'll
shortly develop a C extension for doing this calculation.  Before
doing that, just what other options exist?

&lt;/P&gt;&lt;P&gt;

I could convert the 1024 bit/128 byte fingerprint into a Python
integer.  The first problem is converting the raw fingerprint bytes
into an integer.  I don't know a direct way, but
&lt;tt&gt;int(s.encode("hex"),&amp;nbsp;16)&lt;/tt&gt; works.  Computing the bit
intersection is simple - just use "&amp;amp;".  The problem is computing
the popcount.

&lt;/P&gt;&lt;P&gt;

The simple popcount algorithm is to see if the number has the 1 bit
set, then the 2 bit, then the 4 bit, ...

&lt;pre class="code"&gt;
# This computes about 14,000 popcounts per second
all_bits = [1&amp;lt;&amp;lt;bitno for bitno in range(1024)]

def popcount_1024(x):
    n = 0
    for bit in all_bits:
        if x &amp;amp; bit:
            n += 1
    return n
&lt;/pre&gt;

That's really quite slow.  I could use a clever bit-manipulation
algorithm to calculate it, which I extended from the Wikipedia article
on &lt;a href="http://en.wikipedia.org/wiki/Hamming_weight"&gt;Hamming
weight&lt;/a&gt;.  Here it is:

&lt;pre class="code"&gt;
# Compute the popcount of a Python long, up to 1024 bits.
# My MacBook Pro only does about 47,000 per second.

m1 = int("0x" + "5" * (1024//4), 16)
m2 = int("0x" + "3" * (1024//4), 16)
m3 = int("0x" + "0f" * (1024//8), 16)
m4 = int("0x" + "00ff" * (1024//16), 16)
m5 = int("0x" + "0000ffff" * (1024//32), 16)
m6 = int("0x" + ("0"*8+"f"*8) * (1024//64), 16)
m7 = int("0x" + ("0"*16+"f"*16) * (1024//128), 16)
m8 = int("0x" + ("0"*32+"f"*32) * (1024//256), 16)
m9 = int("0x" + ("0"*64+"f"*64) * (1024//512), 16)
m10 = int("0x" + ("0"*128+"f"*128), 16)
    

def popcount_1024(x):
    x = (x &amp;amp; m1 ) + ((x &amp;gt;&amp;gt;   1) &amp;amp; m1 )
    x = (x &amp;amp; m2 ) + ((x &amp;gt;&amp;gt;   2) &amp;amp; m2 )
    x = (x &amp;amp; m3 ) + ((x &amp;gt;&amp;gt;   4) &amp;amp; m3 )
    x = (x &amp;amp; m4 ) + ((x &amp;gt;&amp;gt;   8) &amp;amp; m4 )
    x = (x &amp;amp; m5 ) + ((x &amp;gt;&amp;gt;  16) &amp;amp; m5 )
    x = (x &amp;amp; m6 ) + ((x &amp;gt;&amp;gt;  32) &amp;amp; m6 )
    x = (x &amp;amp; m7 ) + ((x &amp;gt;&amp;gt;  64) &amp;amp; m7 )
    x = (x &amp;amp; m8 ) + ((x &amp;gt;&amp;gt; 128) &amp;amp; m8 )
    x = (x &amp;amp; m9 ) + ((x &amp;gt;&amp;gt; 256) &amp;amp; m9 )
    x = (x &amp;amp; m10) + ((x &amp;gt;&amp;gt; 512) &amp;amp; m10)
    return x
&lt;/pre&gt;

By being more clever I could make this a bit faster, but I'm not even
going to try.

&lt;/P&gt;&lt;P&gt;

During proofreading I remembered another way to make the popcount of
the and-ed fingerprints be faster - I could use 16 bit integers for
everything and look up the popcounts in a 16 bit lookup table.  But
this doesn't work that well because Python doesn't have a 16 bit array
type.  You would have to use one of the Numpy/numeric/numarray arrays.

&lt;/P&gt;&lt;P&gt;

Another way to compute the popcount of an integer is to use &lt;a
 href="http://members.dsl-only.net/~daniels/bits.html"&gt;Scott David
Daniels' 'bits' module&lt;/a&gt;.  It's a C extension and is quite fast.  I
used it a couple of years ago when I was doing some experiments with
fingerprint searches.

&lt;/P&gt;&lt;P&gt;

If you're going to use a third-party C extension then you might
instead consider using &lt;a
href="http://www.aleax.it/gmpy.html"&gt;GMPY&lt;/a&gt;.  It's based on the &lt;a
href="http://gmplib.org/"&gt;GNU Multiple Precision Library&lt;/a&gt; and
includes a popcount function.  Digging in the source code I see that
the popcount function is implemented in hand-crafted assembly for
different platforms.

&lt;/P&gt;&lt;P&gt;

Yet another possibility for computing the popcount of a fingerprint is
to convert the integer into hex string, or generate a binary pickle
string, and work with the resulting string using the same lookup table
approach I did earlier.  This has the advantage of working with any
length Python integer, but has a lot of overhead.

&lt;pre class="code"&gt;
# Compute the popcount of a Python long
# For 1024 bit fingerprints my MacBook Pro does about 
# 24,000 per second.

popcount_lookup_table = {}
for i, n in enumerate( (
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
    ) ):
    popcount_lookup_table[chr(i)] = n

# "pickle" does 15,000/second
# "cPickle" does 24,000/second
import cPickle as pickle

def popcount(x):
    s = pickle.dumps(x, 2)[4:-1]
    L = popcount_lookup_table
    return sum(L[c] for c in s)
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Or I could stick with the raw byte string.  The hard part there was
computing the number of bits in common between two characters.  The
easiest solution is to convert the characters into integers, do the
bitwise-and, and convert back to a character, but that's a lot of
Python function calls for something that's a single assembly
instruction.  I could perhaps lookup table, where the key in the table
is (query_byte, target_byte) and the value is the number of bits in
common.  I don't like that solution.  For one, it requires
256*256=65,536 entries.

&lt;/P&gt;
&lt;h2&gt;Computing the Tanimoto in C, using GCC extensions&lt;/h2&gt;
&lt;P&gt;

Computing the Tanimoto score is much easier in C, and even easier with
gcc because the latter includes a built-in function for computing the
popcount of an integer, a long, and a long long.  I'll use gcc for now
so I don't need to define a popcount just yet.  (I'll do that in the
next section.)

&lt;/P&gt;&lt;P&gt;

In the following I assume that integers are 32 bits in length, so
there are 32 integer parts to a 1024 bit fingerprint.  I'll write each
unsigned integer in hex, so the number starts with "0x" (to mean
"hex") and ends with "u" (to mean "unsigned").


&lt;pre class="code"&gt;
/* simple_tanimoto.c */
#include &amp;lt;stdio.h&amp;gt;

/* This uses a non-portable builtin function for gcc */
#define popcount(x) __builtin_popcount(x)

unsigned int query[] = {
  /* &amp;gt;Strychnine   183 bits set  */
  0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
  0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
  0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
  0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
  0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
  0x24020080u, 0xa2042610u};

unsigned int target[] = {
  /* &amp;gt;cocaine   Tanimoto from Strychnine = 0.353234 */
  0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
  0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
  0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
  0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
  0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
  0xc0020000u, 0x02002200u};

int main(void) {
  int A=0, B=0, c=0;
  int i;
  for (i=0; i&amp;lt;32; i++) {
    /*printf("  %lx %lx\n", fp1[i], fp2[i]);*/
    A += popcount(query[i]);
    B += popcount(target[i]);
    c += popcount(query[i]&amp;amp;target[i]);
  }
  /* Should print "183 89 71 -&amp;gt; 0.353234" */
  printf("%d %d %d -&amp;gt; %f\n", A, B, c, ((double)c)/(A+B-c));
}
&lt;/pre&gt;

I'll compile and run it:

&lt;pre class="code"&gt;
% &lt;b&gt;cc simple_tanimoto.c -O3 -o simple_tanimoto&lt;/b&gt;
% &lt;b&gt;./simple_tanimoto&lt;/b&gt;
183 89 71 -&amp;gt; 0.353234
%
&lt;/pre&gt;

so you can see that it gave the same Tanimoto score that OpenBabel
computed.

&lt;/P&gt;&lt;P&gt;

To me this bit of C is much easier to understand than the same Python
code.  Part of the reason is that I'm using the __builtin_popcount of
gcc but the real reason is that in C things like byte arrays, integer
arrays, and the bytes and bits in an integer are very fluid.  It's
very easy to go from one form to another.  Even if I have to implement
the bitcount myself, with the following bit of code, I can compute
over 1 million Tanimoto similarities per second. which is more than 20
times faster than my fastest Python code.  Note though that this
computing score between the same pair.  Computing Tanimoto scores for
a large number of targets will be slower because bringing that data to
the CPU takes additional time, but then there are other way to make
the performance be better.  I'll talk about this in a bit.

&lt;/P&gt;
&lt;h2&gt;Computing the Tanimoto in C, without GCC extensions&lt;/h2&gt;
&lt;P&gt;

Here's the equivalent C code using an 8-bit lookup table to compute
the popcount for a 32 bit integer.

&lt;pre class="code"&gt;
/* simple_tanimoto2.c */
#include &amp;lt;stdio.h&amp;gt;

/* Implement popcount(unsigned int) myself. */
/* Source based on http://popcnt.org/2007/09/magic-popcount-popcnt-command.html */
const int _popcount_counts[] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};
/* This assumes that 'unsigned int' is 32 bits */
int popcount(unsigned int x) {
  return (_popcount_counts[x       &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt;  8 &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt; 16 &amp;amp; 0xff] +
          _popcount_counts[x &amp;gt;&amp;gt; 24 &amp;amp; 0xff]);
}

unsigned int query[] = {
  /* Strychnine   183 bits set */
  0x00054062u, 0x81009600u, 0x00100102u, 0x81010700u, 0x200c0000u, 0x00202850u,
  0x03102000u, 0x28000404u, 0x10200882u, 0x001849e4u, 0x483c0024u, 0x50039002u,
  0x1402801au, 0x01b90000u, 0x00020540u, 0x010c100au, 0x22a4c000u, 0x82000300u,
  0x2ac02018u, 0x00002201u, 0x60102120u, 0x183c9630u, 0x2100a080u, 0x81500019u,
  0x00041401u, 0x80008c00u, 0x00484810u, 0x90001080u, 0x040c8202u, 0x0006081bu,
  0x24020080u, 0xa2042610u};

unsigned int target[] = {
  /* cocaine  Tanimoto from Strychnine = 0.353234 */
  0x00010000u, 0x01000800u, 0x00100100u, 0x00010700u, 0x000c0050u, 0x00204010u,
  0x00000000u, 0x20000000u, 0x00000100u, 0x010008c0u, 0x00200000u, 0x50019000u,
  0x1c00801au, 0x01890000u, 0x00000000u, 0x02081008u, 0x00a20000u, 0x02000000u,
  0x01c02010u, 0x00000201u, 0x00000020u, 0x18020a30u, 0x01000080u, 0x00500014u,
  0x00001401u, 0x80002000u, 0x00088000u, 0x10001000u, 0x00000200u, 0x0006000eu,
  0xc0020000u, 0x02002200u};

int main(void) {
  int A=0, B=0, c=0;
  int i;
  for (i=0; i&amp;lt;sizeof(query)/sizeof(unsigned int); i++) {
    /*printf("  %lx %lx\n", fp1[i], fp2[i]);*/
    A += popcount(query[i]);
    B += popcount(target[i]);
    c += popcount(query[i]&amp;amp;target[i]);
  }
  /* Should print "183 89 71 -&amp;gt; 0.353234" */
  printf("%d %d %d -&amp;gt; %f\n", A, B, c, ((double)c)/(A+B-c));
}
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Computing the pairwise Tanimoto in C is fast.  I put the Tanimoto
calculation in a loop and calculated it 100,000,000 times, which took
42 seconds.  The last time I looked, PubChem contained 19,218,991
compounds.  Assuming everything was available in fast memory then I
should be able to search it in 8 seconds using a C program, or about 7
minutes using Python.  Which would you prefer?

&lt;/P&gt;&lt;P&gt;

This is interesting.  Doing the popcount using a lookup table took 43
seconds.  Using __builtin_popcount took 109 seconds.  I was not
expecting that.

&lt;/P&gt;
&lt;h2&gt;Computing the Tanimoto in Python via a C extension&lt;/h2&gt;
&lt;P&gt;

I prefer the speed of C but the general ease of programming using
Python.  (That's a sort of "have my cake and eat it too" answer.)
Using Python for Tanimoto searches is slow in part because there is no
fast way to compute the popcount.  I can fix that by writing a Python
extension using C.  Rather than add only the popcount I'll have the
extension compute the entire Tanimoto between two fingerprints.

&lt;/P&gt;&lt;P&gt;

The easiest way to pass fingerprints around in Python is as string,
and more specifically a byte string.  (As compared to a Unicode
string.)  I want my Python code to work like:

&lt;pre class="code"&gt;
import tanimoto
query = "... 128 bytes ..."
target = "... 128 bytes ..."
print "Tanimoto score is", tanimoto.similarity(query, target)
&lt;/pre&gt;

There are many ways to do this.  Some ways I'm not going to use now
are: writing &lt;a href="http://docs.python.org/ext/ext.html"&gt;C
extension&lt;/a&gt; directly, or using &lt;a
href="http://www.boost.org/doc/libs/1_35_0/libs/python/doc/index.html"&gt;Boost
Python&lt;/a&gt;, &lt;a href="http://www.swig.org/"&gt;SWIG&lt;/a&gt; or &lt;a
href="http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/"&gt;Pyrex&lt;/a&gt;
and its offshoot &lt;a href="http://www.cython.org/"&gt;Cython&lt;/a&gt;.
Instead, I'll use &lt;a
href="http://docs.python.org/lib/module-ctypes.html"&gt;ctypes&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

Ctypes is a "foreign function interface" ("FFI") package for Python.
It lets Python load a shared library and call C functions directly.
You still have to write some interface code to tell ctypes how to call
the function, and perhaps write some extra error checking code, but
not much.

&lt;/P&gt;&lt;P&gt;

I find ctypes to be the easiest way to get Python to talk to C code,
but there are a few downsides.  Calling a function through ctypes has
a bit more overhead (it's a bit slower) than using a special-purpose
compiled interface like the other solutions use.  It's also very easy
to crash Python using ctypes.

&lt;/P&gt;&lt;P&gt;

But the advantage is that I can write C code without worrying about
writing the Python/C interface in C.  I do have to worry, but I'll
worry about it from Python.  Here's a simple C library to compute the
Tanimoto similarity between two fingerprints as stored in unsigned
byte strings.

&lt;pre class="code"&gt;
/* libtanimoto.c */
static const int _popcount_counts[] = {
    0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
    3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};

double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int a=0, b=0, c=0, i;
  for (i=0; i&amp;lt;num_bytes; i++) {
    a += _popcount_counts[query[i]];
    b += _popcount_counts[target[i]];
    c += _popcount_counts[query[i]&amp;amp;target[i]];
  }
  return ((double)c)/(a+b-c);
}
&lt;/pre&gt;

Very simple, yes?  Next, I need to compile it.  Again there are many
possible solutions.  I'll use &lt;a
href="http://www.scons.org/"&gt;SCons&lt;/a&gt; which one of several
replacements for make.  Instead of using a "Makefile" it uses a
configuration file named "SConstruct".  The configuration language is
Python, extended with functions that know how to build binaries,
shared libraries, and other targets.  This is very nice because
building shared libraries across multiple platforms is a nasty mess
and I don't want to do it myself.

&lt;/P&gt;&lt;P&gt;

Here's my 'SConstruct' file:

&lt;pre class="code"&gt;
# SConstruct
SharedLibrary("libtanimoto", ["libtanimoto.c"],
              CFLAGS="-O3")
&lt;/pre&gt;

I compiled it using the "scons" command:

&lt;pre class="code"&gt;
% &lt;b&gt;scons&lt;/b&gt;
scons: Reading SConscript files ...
scons: done reading SConscript files.
scons: Building targets ...
gcc -o libtanimoto.os -c -O3 -fPIC libtanimoto.c
gcc -o libtanimoto.dylib -dynamiclib libtanimoto.os
scons: done building targets.
&lt;/pre&gt;

The result is a "libtanimoto.dylib", which is what Macs use as a
run-time loadable shared library.  Linux and other Unix OSes use the
extension ".so" and MS Windows uses ".dll".

&lt;/P&gt;
&lt;h2&gt;Using ctypes to access the shared library&lt;/h2&gt;
&lt;P&gt;

To load the function in ctypes I first load the shared library
(remember, the extension depends on which OS you are using -
'find_library' handles that detail) then get the function from the
module.  I then have to annotate the function so ctypes knows how to
convert the Python call arguments to the right C types and how to
convert the C result into Python.  Here's an example of me doing that
by hand, with a fingerprint of 4 bytes:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import ctypes, ctypes.util
&amp;gt;&amp;gt;&amp;gt; libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity
&amp;lt;_FuncPtr object at 0x88c00&amp;gt;
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p] 
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity.restype = ctypes.c_double
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\0")
nan
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity(4, "\0\0\0\0", "\0\0\0\1")
0.0
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\1")
1.0
&amp;gt;&amp;gt;&amp;gt; libtanimoto.similarity(4, "\0\0\0\1", "\0\0\0\3")
0.5
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

If I know the number of bytes in the fingerprint then I can call this
interface directly.  Most people would prefer a simpler solution and
not have to worry about passing in the length every time.  After all,
Python strings already know their length.

&lt;/P&gt;
&lt;h2&gt;Making a simplified interface&lt;/h2&gt;
&lt;P&gt;

What I'll do is create a new Python module in the file "tanimoto.py".
This will load the shared library and do the ctypes annotation.  Once
done I'll define a "high-level" function also called "similarity"
which gets the fingerprint lengths, checks that they are the same, and
calls the underlying C similarity implementation.

&lt;pre class="code"&gt;
# tanimoto.py - wrapper to the libtanimoto shared library

import ctypes, ctypes.util

# Load the shared library (the extension depends on the OS type)
libtanimoto = ctypes.CDLL(ctypes.util.find_library("tanimoto"))

# Annotate the function call argument and result types
_similarity = libtanimoto.similarity
_similarity.argtypes = [ctypes.c_int, ctypes.c_char_p, ctypes.c_char_p]
_similarity.restype = ctypes.c_double

# Define the high-level public interface as a Python function
def similarity(query, target):
    n = len(query)
    m = len(target)
    if n != m:
        raise AssertionError("fingerprints must be the same length")
    return _similarity(n, query, target)
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

In action:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import tanimoto
&amp;gt;&amp;gt;&amp;gt; tanimoto.similarity("Andrew", "andrew")
0.95999997854232788
&amp;gt;&amp;gt;&amp;gt; tanimoto.similarity("Andrew", "ANDREW")
0.79166668653488159
&amp;gt;&amp;gt;&amp;gt; tanimoto.similarity("Andrew", "123456")
0.40625
&amp;gt;&amp;gt;&amp;gt; tanimoto.similarity("Andrew", "13456")
Traceback (most recent call last):
  File "&amp;lt;stdin&amp;gt;", line 1, in &amp;lt;module&amp;gt;
  File "tanimoto.py", line 23, in similarity
    m = len(target)
AssertionError: fingerprints must be the same length
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

Looks good.  But does it give the right answer?  I'll give it a test
case where we know the right answer and we'll see what it says:

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; s1 = "".join("""
... 00054062 81009600 00100102 81010700 200c0000 00202850 
... 03102000 28000404 10200882 001849e4 483c0024 50039002 
... 1402801a 01b90000 00020540 010c100a 22a4c000 82000300 
... 2ac02018 00002201 60102120 183c9630 2100a080 81500019 
... 00041401 80008c00 00484810 90001080 040c8202 0006081b 
... 24020080 a2042610 
... """.split()).decode("hex")
&amp;gt;&amp;gt;&amp;gt; s2 = "".join("""
... 00010000 01000800 00100100 00010700 000c0050 00204010 
... 00000000 20000000 00000100 010008c0 00200000 50019000 
... 1c00801a 01890000 00000000 02081008 00a20000 02000000 
... 01c02010 00000201 00000020 18020a30 01000080 00500014 
... 00001401 80002000 00088000 10001000 00000200 0006000e 
... c0020000 02002200 
... """.split()).decode("hex")
&amp;gt;&amp;gt;&amp;gt; 
&amp;gt;&amp;gt;&amp;gt; import tanimoto
&amp;gt;&amp;gt;&amp;gt; tanimoto.similarity(s1, s2)
0.35323384404182434
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

Sweet!  The right answer!  Though we're display it to double precision
while OpenBabel only reports to float precision.  That's fine -
there's only 1024 bits so the resolution of the answer can't be better
than 1/1024.

&lt;/P&gt;&lt;P&gt;

The high-level Python wrapper interface is easier to use than calling
the C interface exposed by ctypes.  This will usually be the case.  C
and Python have different world views.  C libaries tend to optimize
towards performance and require that programmers think more about all
the details, while Python libraries tend be safer and easier to use,
but slower.  I'm hoping to show that it's possible to get both by
mixing the two together.

&lt;/P&gt;
&lt;h2&gt;Performance timings for the C extension&lt;/h2&gt;
&lt;P&gt;

I also did some performance timings.  The public high-level interface
can compute Tanimoto scores about 340,000 fingerprints (of 1024 bits
each) per second.  That's about 15% the speed of the native C code but
about 7 times faster than my pure Python solution.  Remember, the
high-level interface does extra work and then calls the low-level
ctypes function.  If I remove that extra work then I can do about
440,000 calculations per second.

&lt;/P&gt;&lt;P&gt;

Here's how I got those numbers:

&lt;pre class="code"&gt;
# Time the tanimoto.similarity function calls
import tanimoto

import time
def go(n):
  range_n = range(n)
  s = "\1"*128
  
  similarity = tanimoto.similarity
  t1 = time.time()
  for i in range_n:
    similarity(s, s)
  t2 = time.time()
  print "Using public interface", float(n)/(t2-t1), "calls/sec"
  
  _similarity = tanimoto._similarity
  t1 = time.time()
  for i in range_n:
    _similarity(128, s, s)
  t2 = time.time()
  print "Using ctypes interface", float(n)/(t2-t1), "calls/sec"
&lt;/pre&gt;

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; go(100000)
Using public interface 341912.926942 calls/sec
Using ctypes interface 446837.633061 calls/sec
&amp;gt;&amp;gt;&amp;gt; go(1000000)
Using public interface 340170.861425 calls/sec
Using ctypes interface 443564.430149 calls/sec
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

One nice thing about having a benchmark in-place is I can test
alternate implementations and see if I get better performance.  My
Tanimoto calculation uses fingerprint bit counts 'a', 'b', and 'c',
based on the Daylight usage.  This has general applicability because
other similarity measures can be described as a function of those
variables.  The Tanimoto calculation has a simpler form, which is the
number of on bits in the bitwise-and of the two fingerprints divided
by the number of on bits in their bitwise-or.  It's is succienctly
written as:

&lt;pre class="code"&gt;
double similarity(int num_bytes,
                  unsigned char *query, unsigned char *target) {
  int and_count=0, or_count=0, i;
  for (i=0; i&amp;lt;num_bytes; i++) {
    or_count  += _popcount_counts[query[i] | target[i]];
    and_count += _popcount_counts[query[i] &amp;amp; target[i]];
  }
  return ((double)and_count)/or_count;
}
&lt;/pre&gt;

Doing a bitwise-or is faster than doing another popcount.  My
benchmark became about 1% to 2% faster with this variation.  That's
almost in the noise, but useful to note if you want to eek out a bit
more performance.

&lt;/P&gt;&lt;P&gt;

&lt;a href="http://dalkescientific.blogspot.com/2008/06/molecular-fingerprints.html"&gt;Comments?&lt;/a&gt;

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html</guid><pubDate>Fri, 27 Jun 2008 12:00:00 GMT</pubDate></item><item><title>Generating molecular fingerprints with OpenBabel</title><link>http://www.dalkescientific.com/writings/diary/archive/2008/06/27/generating_fingerprints_with_openbabel.html</link><description>&lt;P&gt;
This is part 2 of an on-going series dealing with molecular fingerprints:
&lt;ol&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/26/fingerprint_background.html"&gt;Molecular fingerprints, background&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;b&gt;Generating molecular fingerprints with OpenBabel&lt;/b&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="http://www.dalkescientific.com/writings/diary/archive/2008/06/27/computing_tanimoto_scores.html"&gt;Computing 