<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"><channel><title>Andrew Dalke's writings</title><link>http://www.dalkescientific.com/writings/diary/index.html</link><description>Writings from the software side of bioinformatics and
  chemical informatics, with a heaping of Python thrown in for good
  measure.  Code to taste.  Best served at room temperature.</description><lastBuildDate>Tue, 16 Mar 2010 01:29:03 GMT</lastBuildDate><generator>PyRSS2Gen-1.0.0</generator><docs>http://blogs.law.harvard.edu/tech/rss</docs><item><title>KNIME and beginners</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/03/16/knime_and_beginners.html</link><description>&lt;P&gt;

I gave a presentation at &lt;a
href="http://www.eyesopen.com/about/events/cups-2010/cup11-program-details.html"&gt;OpenEye's
CUP&lt;/a&gt; last week. More precisely, I was assigned a talk with the
title "Evils of KNIME." I didn't choose that sort of name, but the CUP
organizers like to be a bit confrontational with presentation
titles. I used my speaking slot as a platform for expressing my views
on dataflow/visual languages. I don't like them, and think their
effectiveness is limited compared to a text language, so I explained
why. Other people do like them and enjoy them. I've asked them why,
and they have some good reasons. My presentation outlined those
responses with some observations of my own, including suggestions for
ways to improve the text-based toolkits so they are more accessible to
"non-programmers."

&lt;/P&gt;&lt;P&gt;

The next few posts will be based on parts of that talk. Feel free to
&lt;a href="http://dalkescientific.blogspot.com/2010/03/knime-pipeline-pilot-and-visual.html"&gt;leave comments&lt;/a&gt;.

&lt;/P&gt;
&lt;h3&gt;Upcoming training classes (pre-announcement)&lt;/h3&gt;
&lt;P&gt;

I ended by pointing out that these are technological solutions. Why
not spend some time training computational chemists to be more
effective at writing software? I provide &lt;a
href="http://dalkescientific.com/training/"&gt;that sort of
training&lt;/a&gt;. If you are interested, &lt;a
href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt;. I'm pinning down
the dates for a course in Leipzig in mid-May (likely 18-20 May), and
another in Boston in late July. I'll announce them when the dates are
determined. If you want to influence those dates or schedule a course
at your site, let me know.

&lt;/P&gt;
&lt;h2&gt;Sample test case for KNIME&lt;/h2&gt;
&lt;P&gt;

I haven't used KNIME for about two years. That experience was with
KNIME 1.x. People told me that it's gotten better, so I decided it was
well past time to take a fresh look. Last time I couldn't get it to work on
my Mac. I'm happy to report that things have changed, although there
are still some difficulties with it regarding updates.

&lt;/P&gt;&lt;P&gt;

My test case was the first example from the &lt;a href="http://ctr.wikia.com/"&gt;Chemistry Toolkit Rosetta&lt;/a&gt;, specifically, to &lt;a href="http://ctr.wikia.com/wiki/Heavy_atom_counts_from_an_SD_file"&gt;compute the heavy atom counts from an SD file&lt;/a&gt;. The pybel solution is:

&lt;pre class="code"&gt;
import pybel
 
for mol in pybel.readfile("sdf", "benzodiazepine.sdf.gz"):
    print mol.OBMol.NumHvyAtoms()
&lt;/pre&gt;

It's not as short as I would like because I had to specify "sdf" twice
and because it had to reach down into the underlying OpenBabel
molecule object. Still, it's a lot more succinct than using any of the
base toolkits directly, and a good reference of what a text-based
programming language is capable of when designed for ease of use.

&lt;/P&gt;
&lt;h2&gt;What molecular properties can I compute? And how do I do it?&lt;/h2&gt;
&lt;P&gt;

The first step was to find out if KNIME could compute the number of
heavy atoms. When I say "KNIME" I mean "the CDK nodes which come with
KNIME" since KNIME is a dataflow-based visual programming language
with support for a number of extension packages, including chemistry
nodes based on the CDK. Schrodinger, Tripos, ChemAxon and likely other
companies provide nodes based on their respective toolkits, but I
don't have a license to those tools. In any case the Mac version of
KNIME doesn't yet support adding new nodes.

&lt;/P&gt;&lt;P&gt;

The most likely candidate was "Molecular Properties." The help says:

&lt;blockquote&gt;

Create new columns holding molecular properties, computed for each
structure. The computations are based on the CDK toolkit and include
logP, molecular weight, number of aromatic bonds, and many others.

&lt;/blockquote&gt;

What other properties does it compute? I put the node on the workspace
and double clicked on it to bring up the dialog box. The result is:

&lt;blockquote&gt;
The dialog cannot be opened for the following reason:&lt;br /&gt;
No column in spec compatible to "CDKValue".
&lt;/blockquote&gt;

&lt;center&gt;&lt;img src="http://dalkescientific.com/writings/diary/knime1.png" /&gt;&lt;/center&gt; 

Huh? What does that mean?

&lt;/P&gt;&lt;P&gt;

A Google search for that error message found &lt;a
href="http://www.knime.org/node/638#comments"&gt;the same question&lt;/a&gt;
from 9 September 2009 although concerning a different node. Bernd
Wiswedel answered:

&lt;blockquote&gt;

We obviously need to improve on the error messages. You need to
process the output of the SD reader with the "Molecule to CDK" node,
which will parse the structures into an appropriate format for the
Lipinski node. Reason is that the Lipinski node is contributed from
the CDK plugin, so it needs its desired input format.

&lt;/blockquote&gt;

What this means is the inputs need to be set up correctly before I can
see more details. However, it's more complicated than that. If I set
up the nodes as shown: &lt;br /&gt;

&lt;center&gt;&lt;img src="http://dalkescientific.com/writings/diary/knime2.png" /&gt;&lt;/center&gt; 

I still get the same error message when I click on the "Molecular
Properties" box. Double-clicking on the "Molecule to CDK" box gives me

&lt;blockquote&gt;
The dialog cannot be opened for the following reason:&lt;br /&gt;
No column in spec compatible to "SdfValue" "SmilesValue" "MolValue" "Mol2Value" or "CMLValue".
&lt;/blockquote&gt;

Turns out I need to put in a valid SD filename in the "SDF Reader" box
(the one with the exclamation point under it), in order to get the
right inputs to "Molecule to CDK", in order to see the "Molecular
Properties."

&lt;/P&gt;
&lt;h2&gt;How accessible is KNIME to first-time users?&lt;/h2&gt;
&lt;P&gt;

Is that really friendly for first-time users? That is, how is a
first-time user supposed to: 1) know which options are available if
they can't open an unconnected node, 2) know which inputs are required
for a node, or for that matter see what outputs are available, 3) know
that the output of the "SDF Reader" needs to be converted by "Molecule to CDK"
before it can be used by the CDK nodes?

&lt;/P&gt;&lt;P&gt;

Of course all those can be explained in the documentation, and perhaps
they are explained. I admit I haven't read it, but then again the
knime.org documentation doesn't show how to use the CDK nodes. And
should someone have to read the documentation in order to do something
basic like this task? If so, are dataflow systems really any easier
than working with a text-based programming language?

&lt;/P&gt;
&lt;h2&gt;Can't compute the number of heavy atoms?&lt;/h2&gt;
&lt;P&gt;

I looked through the list of properties which could be computed:

&lt;ul&gt;
 &lt;li&gt;Atomic Polarizabilities&lt;/li&gt;
 &lt;li&gt;Aromatic Atoms Count&lt;/li&gt;
 &lt;li&gt;Aromatic Bonds Count&lt;/li&gt;
 &lt;li&gt;Element Count&lt;/li&gt;
 &lt;li&gt;Bond Polarizabilities&lt;/li&gt;
 &lt;li&gt;Bond Count&lt;/li&gt;
 &lt;li&gt;Carbon connectivity index (order 1)&lt;/li&gt;
 &lt;li&gt;Carbon connectivity index (order 0)&lt;/li&gt;
 &lt;li&gt;Eccentric Connectivity Index&lt;/li&gt;
 &lt;li&gt;Fragment Complexity&lt;/li&gt;
 &lt;li&gt;Hydrogen Bond Acceptors&lt;/li&gt;
 &lt;li&gt;Hydrogen Bond Donors&lt;/li&gt;
 &lt;li&gt;Largest Chain&lt;/li&gt;
 &lt;li&gt;Largest Pi Chain&lt;/li&gt;
 &lt;li&gt;Petitjean Number&lt;/li&gt;
 &lt;li&gt;Rotatable Bonds Count&lt;/li&gt;
 &lt;li&gt;Topological Polar Surface Area&lt;/li&gt;
 &lt;li&gt;Vertex adjacency information magnitude&lt;/li&gt;
 &lt;li&gt;Molecular Weight&lt;/li&gt;
 &lt;li&gt;Zagreb Index&lt;/li&gt;
&lt;/ul&gt;

(BTW, it really does have mixed capitalization. Why yes, I am a
nitpicker. How did you guess? &lt;tt&gt;;)&lt;/tt&gt; )

&lt;/P&gt;&lt;P&gt;

No "heavy atom count." Next option is to see if there's a way to
specify the counts based on a SMARTS pattern. Nope, didn't find
anything.

&lt;/P&gt;&lt;P&gt;

As far as I can tell, there's no way with the default nodes to do much
of anything with KNIME. I assume there are additional packages which I
can install, but why aren't there more useful CDK nodes as part of the
standard installation? An obvious one to me would be a SMARTS count
pattern matcher, where I could specify the SMARTS pattern, the option
for unique or non-unique match counts, and the output column name.

&lt;/P&gt;&lt;P&gt;

Is my problem because I'm on a Mac? Do Linux users get more nodes? Or
is there something else I'm missing? How would you find the number of
heavy atoms using KNIME? Is there a solution using the default CDK
nodes or do I have to use one of the commercial toolkits?

&lt;/P&gt;&lt;P&gt;

&lt;a href="http://dalkescientific.blogspot.com/2010/03/knime-pipeline-pilot-and-visual.html"&gt;Leave answers and comments here.&lt;/a&gt;</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/03/16/knime_and_beginners.html</guid><pubDate>Tue, 16 Mar 2010 12:00:00 GMT</pubDate></item><item><title>Instrumenting the AST</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/02/22/instrumenting_the_ast.html</link><description>&lt;P&gt;

The following is a rough retelling of my presentation for the Testing
in Python BoF at PyCon 2010, including removing some in-jokes relevant
only for that session. On the other hand, I expanded it to include
working code. This means you, yes you, could work on this. It's not
for the faint of heart. Have fun! And &lt;a
href="http://dalkescientific.blogspot.com/2010/02/instrumenting-ast.html"&gt;let
me know&lt;/a&gt; what you think.

&lt;/P&gt;
&lt;h2&gt;The AST module&lt;/h2&gt;
&lt;P&gt;

Code coverage is a good thing. I want to do branch coverage. Last year
Ned Batchelder added branch coverage support to coverage.py, which
works by analyzing the byte code. I want to see if there's a better
solution through an entirely different approach.

&lt;/P&gt;&lt;P&gt;

Let me introduce you to Python's "ast" module.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; import ast
&lt;/pre&gt;

It's an interface to Python's internal parser, so a program can
convert a string containing Python code into an abstract syntax tree
(AST).

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; ast.parse("for i in range(10): print i")
&amp;lt;_ast.Module object at 0x1004d06d0&amp;gt;
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

The ast module contains some code to display the contents of the AST
as a string.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; ast.dump(ast.parse("for i in range(10): print i"))
"Module(body=[For(target=Name(id='i', ctx=Store()),
   iter=Call(func=Name(id='range', ctx=Load()),
   args=[Num(n=10)], keywords=[], starargs=None,
   kwargs=None), body=[Print(dest=None,
   values=[Name(id='i', ctx=Load())], nl=True)],
   orelse=[])])"
&lt;/pre&gt;

I've reformatted it to fit my slides as otherwise it's a long
string. I can also ask it to display the position information, which
is the second True in the following.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; ast.dump(ast.parse("for i in range(10): print i"), True, True)
"Module(body=[For(target=Name(id='i', ctx=Store(),
   lineno=1, col_offset=4), iter=Call(func=Name(
   id='range', ctx=Load(), lineno=1, col_offset=9),
   args=[Num(n=10, lineno=1, col_offset=15)],
   keywords=[], starargs=None, kwargs=None,
   lineno=1, col_offset=9), body=[Print(dest=None,
   values=[Name(id='i', ctx=Load(), lineno=1, col_offset=26)],
   nl=True, lineno=1, col_offset=20)], orelse=[], lineno=1,
   col_offset=0)])"
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Programmatically building an AST&lt;/h2&gt;
&lt;P&gt;

You can use the ast module to build a tree directly, without parsing a
string, then compile and execute that code.

&lt;pre class="code"&gt;
&amp;gt;&amp;gt;&amp;gt; from ast import *
&amp;gt;&amp;gt;&amp;gt; tree = Module([Print(None, [Str("PyCon2010!")], True)])
&amp;gt;&amp;gt;&amp;gt; tree.lineno = 1
&amp;gt;&amp;gt;&amp;gt; tree.col_offset = 1
&amp;gt;&amp;gt;&amp;gt; fix_missing_locations(tree)
&amp;lt;_ast.Module object at 0x1004dff50&amp;gt;
&amp;gt;&amp;gt;&amp;gt; tree = fix_missing_locations(tree)
&amp;gt;&amp;gt;&amp;gt; compile(tree, "&amp;lt;TiP&amp;gt;", "exec")
&amp;lt;code object &amp;lt;module&amp;gt; at 0x1004d38a0, file "&amp;lt;TiP&amp;gt;", line 1&amp;gt;
&amp;gt;&amp;gt;&amp;gt; exec compile(tree, "&amp;lt;TiP&amp;gt;", "exec")
PyCon2010!
&amp;gt;&amp;gt;&amp;gt; 
&lt;/pre&gt;

"ast.fix_missing_locations" is a helper function which assigns the
missing position information that the compiler needs when generating
byte code. I end up using it a lot, along with "ast.copy_location",
which copies the location information from one node to another.
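
For readers on Python 3, the same idea can be sketched as follows. This
is not the code from the talk: the node names differ from the Python 2.6
ones above (there is no Print node, and recent versions of Module take a
type_ignores list), and the variable name "message" is made up for the
illustration.

```python
import ast

# Build the tree for:  message = "PyCon2010!"
# ("message" is a name chosen only for this sketch.)
assign = ast.Assign(
    targets=[ast.Name(id="message", ctx=ast.Store())],
    value=ast.Constant(value="PyCon2010!"),
)
tree = ast.Module(body=[assign], type_ignores=[])

# Fill in the lineno/col_offset fields that compile() requires.
ast.fix_missing_locations(tree)

# Compile the tree and run it in a fresh namespace.
namespace = {}
exec(compile(tree, "TiP", "exec"), namespace)
message = namespace["message"]
```

Without the fix_missing_locations call, compile() rejects the tree
because the hand-built nodes carry no position information.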

&lt;/P&gt;
&lt;h2&gt;The mystery of the wrong TypeError&lt;/h2&gt;
&lt;P&gt;


What's the bug with the following?

&lt;pre class="code"&gt;
try:
  raise TypeError("blah: %d" % "I said 'PyCon2010'!")
except TypeError:
  pass
&lt;/pre&gt;

The code correctly raises a TypeError, but it's the wrong
TypeError. I've made this mistake a few times, which is why I try to
remember to check the contents of the exception during my tests.
Notice that unittest doesn't help here, since assertRaises only checks
the exception type, and not the content.
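
Checking the contents of the exception by hand could look something
like the sketch below, written for Python 3. The helper names
"raise_blah" and "raised_message" are hypothetical and exist only for
this example.

```python
def raise_blah(value):
    # If value is not a number, the "%d" conversion itself raises a
    # TypeError, and the intended TypeError is never constructed.
    raise TypeError("blah: %d" % value)

def raised_message(func, *args):
    """Return the message of the TypeError raised by func(*args), or None."""
    try:
        func(*args)
    except TypeError as err:
        return str(err)
    return None

# The intended error: the message starts with the expected text.
good = raised_message(raise_blah, 3)

# The wrong error: still a TypeError, but with a different message.
bad = raised_message(raise_blah, "I said 'PyCon2010'!")
bad_is_intended = bad.startswith("blah: ")
```

A test which only checks the exception type passes in both cases. A
test which also checks the start of the message fails in the second
case, which is the point.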

&lt;/P&gt;&lt;P&gt;

It is possible to check all of these manually. You could defer the
calculation to a "check_mod()" function

&lt;pre class="code"&gt;
try:
  raise TypeError(
         check_mod("blah: %d", "I said 'PyCon2010'!"))
except TypeError:
  pass
&lt;/pre&gt;

A check_mod function might look like

&lt;pre class="code"&gt;
import traceback

def check_mod(left, right):
    try:
        return left % right
    except Exception, err:
        print "Could not interpolate: %s" % (err,)
        traceback.print_stack()
        raise
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Rewriting the AST for fun (and profit?)&lt;/h2&gt;
&lt;P&gt;

The ast module has some support code for creating a new parse tree
based on transforming another parse tree. I can transform all "%"
binary operations to call a function for the left and right
sides. Here's a non-working but mostly complete version of how that
might look.

&lt;pre class="code"&gt;
from ast import *

class RewriteInterpolation(NodeTransformer):
    def visit_BinOp(self, node):
        if isinstance(node.op, Mod):
            new_node = Call(func=Name(id='check_string', ctx=Load()),
                            args=[node.left, node.right,
                                  Num(n=node.lineno),
                                  Num(n=node.col_offset)],
                            keywords = [], starargs=None, kwargs=None
                            )
            copy_location(new_node, node)
            fix_missing_locations(new_node)
            return new_node
        return node

code = open(filename).read()
tree = parse(code, filename)
tree = RewriteInterpolation().visit(tree)
&lt;/pre&gt;

What's missing is the code to define or import check_string. I'll
leave that for later. For now, just get the idea that you can parse
Python code to an AST, rewrite it in order to instrument certain
parts, then execute the result.
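
To make that idea concrete, here is a small self-contained sketch for
Python 3 which parses a snippet, rewrites every "%" operation, and
executes the result. It is an adaptation and not the code from the
talk: Python 3 Call nodes have no starargs or kwargs fields, the
"failures" list is a hypothetical stand-in for real reporting, and
check_string is handed to the code through the exec namespace instead
of being imported.

```python
import ast

failures = []  # hypothetical log of (lineno, col_offset, message)

def check_string(left, right, lineno, col_offset):
    # Do the real "%" operation, recording where any failure happened.
    try:
        return left % right
    except Exception as err:
        failures.append((lineno, col_offset, str(err)))
        raise

class RewriteInterpolation(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)  # also rewrite nested "%" operations
        if isinstance(node.op, ast.Mod):
            new_node = ast.Call(
                func=ast.Name(id="check_string", ctx=ast.Load()),
                args=[node.left, node.right,
                      ast.Constant(value=node.lineno),
                      ast.Constant(value=node.col_offset)],
                keywords=[])
            return ast.copy_location(new_node, node)
        return node

source = """
try:
    raise TypeError("blah: %d" % "I said 'PyCon2010'!")
except TypeError:
    caught = True
"""

tree = RewriteInterpolation().visit(ast.parse(source))
ast.fix_missing_locations(tree)

# The instrumented code finds check_string in its global namespace.
namespace = {"check_string": check_string}
exec(compile(tree, "sketch", "exec"), namespace)
```

After this runs, the except clause has still swallowed the TypeError,
but "failures" holds one entry giving the line and column of the
failed interpolation.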

&lt;/P&gt;&lt;P&gt;

I ran something similar to this against the Python standard library and
tests, in the hopes that I could find a bug. It took a lot of hands-on
fiddling, since some essential Python modules cannot be
instrumented because the full path wasn't fully defined. I got it to
work, and found no bugs. The closest was this code from difflib.py

&lt;pre class="code"&gt;
        try:
            linenum = '%d' % linenum
            id = ' id="%s%s"' % (self._prefix[side],linenum)
        except TypeError:
            # handle blank lines where linenum is '&amp;gt;' or ''
            id = ''
&lt;/pre&gt;

where the comment is needed because otherwise the reason isn't
immediately obvious. While not a bug, it perhaps does show you that
the test revealed something. (Oh, and it also showed the several
hundred tests that the standard library does to test string
interpolation failures.)

&lt;/P&gt;
&lt;h2&gt;(Well-known) Limitations in coverage.py&lt;/h2&gt;
&lt;P&gt;

Take a look at this program. I've used coverage.py to run the program
and annotate the code to display the coverage and highlight the lines
which weren't executed.

&lt;P&gt;
&lt;img src="http://dalkescientific.com/writings/diary/testing_coverage.png"&gt;

&lt;/P&gt;&lt;P&gt;

You can see there are a few other problems which coverage.py did not
test. Line 9 never executes "x=9" and the "raise TypeError" in line 17
is never reached, because of the string interpolation error in the
parameter list.

I've hacked together something over the last 30 hours to show that
something better is possible.

&lt;h2&gt;A different approach&lt;/h2&gt;

I want to generate coverage for this statement:

&lt;pre class="code"&gt;
x = 1
&lt;/pre&gt;

I'll do that by parsing the code into the AST then rewriting the AST
so it's equivalent to:

&lt;pre class="code"&gt;
from ast_report import register_module

ast_enter, ast_leave, ast_reached = \
   register_module('spam/testing.py',
      {0: (1, 0)}, {} )   #  unique identifier -&amp;gt; (lineno, col_offset)
 
if 1:
  ast_enter[0] += 1
  x = 1
  ast_leave[0] += 1
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

The "register_module" function is something I'll write in a bit. It
will take a filename and two location dictionaries. The first
dictionary is for statements like assignment which are supposed to go
to completion. That is, code before it and after it are supposed to
execute. (Compare this to 'return', which will never allow code after
it to run. That's what the second dictionary is used for.)

&lt;/P&gt;&lt;P&gt;

The key is a unique identifier associated with each statement with a
coverage test, and the value is the lineno and col_offset pair which
come from the AST.

&lt;/P&gt;&lt;P&gt;

The ast_enter and ast_leave dictionaries here are default dicts
(though that's an implementation point). The "0" is the same unique
identifier in the location dictionary, and can be used to say that the
statement at line 1, column 1 (col_offset starts with 0), was reached
and left.

&lt;/P&gt;&lt;P&gt;

At this point someone in the audience astutely asked why I used an "if
1:" in the above. That's a limitation of how the ast.NodeTransformer
works. It lets derived classes transform a single term to a single
other term, which means I need to transform a single statement (the
assignment here) into a single other statement, and not three
statements. I chose the "if 1:" because it's easy to write, it can
contain an arbitrary number of sub-statements, and because Python's
byte compiler knows how to optimize away the "if 1:" test.
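
The "if 1:" wrapping can be sketched on its own, for Python 3 and for
assignments only. The class name "WrapAssign" and the use of plain
module-level counters are assumptions made for this example; the
version developed in this post goes through register_module instead.

```python
import ast
from collections import defaultdict

ast_enter = defaultdict(int)
ast_leave = defaultdict(int)

class WrapAssign(ast.NodeTransformer):
    def __init__(self):
        self.locations = {}  # unique identifier to (lineno, col_offset)

    def visit_Assign(self, node):
        ident = len(self.locations)
        self.locations[ident] = (node.lineno, node.col_offset)
        # Let the parser build the two counter statements.
        enter = ast.parse("ast_enter[%d] += 1" % ident).body[0]
        leave = ast.parse("ast_leave[%d] += 1" % ident).body[0]
        # One statement in, one statement out: an "if 1:" holding three.
        wrapper = ast.If(test=ast.Constant(value=1),
                         body=[enter, node, leave], orelse=[])
        return ast.copy_location(wrapper, node)

transformer = WrapAssign()
tree = transformer.visit(ast.parse("x = 1\ny = x + 1\n"))
ast.fix_missing_locations(tree)

namespace = {"ast_enter": ast_enter, "ast_leave": ast_leave}
exec(compile(tree, "sketch", "exec"), namespace)
```

Each assignment is replaced by a single If node, which is what
NodeTransformer expects, and the counters show that both statements
were entered and left once.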

&lt;/P&gt;&lt;P&gt;

If you think about this approach, the run-time overhead is pretty low,
but it's a lot more than simple assignment. I don't know how it
affects real-world code. Remember, it's been 30 hours since I started
this, and I'm at a conference as well.

&lt;/P&gt;
&lt;h2&gt;Instrumenting the AST for code coverage&lt;/h2&gt;
&lt;P&gt;

The next step is to automate all of this: convert a .py file into an
AST, instrument the code to add these coverage checks, implement the
reporting mechanism as an atexit hook, and for good measure, add the
"%" TypeError check. To see if this is effective, convert the AST to
byte code and save it to a .pyc file.

&lt;/P&gt;&lt;P&gt;

This calls for a horrible hack around a call to compileall.py. I've
named the result "ast_compileall.py"

&lt;pre class="code"&gt;
# ast_compileall.py
import __builtin__
from ast import *
import compileall
import sys
import traceback
import itertools


class RewriteInterpolation(NodeTransformer):
    def __init__(self, filename):
        self.filename = filename
        self.enter_linenos = {}  # id -&amp;gt; (lineno, col_offset)
        self.reach_linenos = {}  # id -&amp;gt; (lineno, col_offset)
        self.counter = itertools.count()
    def visit_Module(self, module_node):
        # Need to import and call ast_report.register_module().
        # These must occur after the "from __future__ import ..." statements.
        # Find where I can insert them.
        body_future = []
        body_rest = []
        for node in module_node.body:
            node = self.visit(node)
            if (not body_rest and isinstance(node, ImportFrom) and
                node.module == "__future__"):
                body_future.append(node)
            else:
                body_rest.append(node)

        # It's easier to let Python convert the code to an AST
        import_line = parse("from ast_report import register_module, check_string").body[0]
        print ("ast_enter, ast_leave, ast_reached = register_module(%r, %r, %r)" %
               (self.filename, self.enter_linenos, self.reach_linenos))
        register_line = parse(
            "ast_enter, ast_leave, ast_reached = register_module(%r, %r, %r)" %
            (self.filename, self.enter_linenos, self.reach_linenos)).body[0]

        # Assign a reasonable seeming line number.
        lineno = 1
        if body_future:
            lineno = body_future[0].lineno
        for new_node in (import_line, register_line):
            new_node.col_offset = 1
            new_node.lineno = lineno

        new_body = body_future + [import_line, register_line] + body_rest
        return Module(body=new_body)

    # These are statements which should have an enter and leave
    # (In retrospect, this isn't always true, eg, for 'if')
    def track_enter_leave_lineno(self, node):
        node = self.generic_visit(node)
        id = next(self.counter)
        enter = parse("ast_enter[%d] += 1" % id).body[0]
        leave = parse("ast_leave[%d] += 1" % id).body[0]
        self.enter_linenos[id] = (node.lineno, node.col_offset)
        for new_node in (enter, leave):
            copy_location(new_node, node)

        # This is the code for "if 1: ..."
        n = Num(n=1)
        copy_location(n, node)
        if_node = If(test=n, body=[enter, node, leave], orelse=[])
        copy_location(if_node, node)
        return if_node

    visit_FunctionDef = track_enter_leave_lineno
    visit_ClassDef = track_enter_leave_lineno
    visit_Assign = track_enter_leave_lineno
    visit_AugAssign = track_enter_leave_lineno
    visit_Delete = track_enter_leave_lineno
    visit_Print = track_enter_leave_lineno
    visit_For = track_enter_leave_lineno
    visit_While = track_enter_leave_lineno
    visit_If = track_enter_leave_lineno
    visit_With = track_enter_leave_lineno
    visit_TryExcept = track_enter_leave_lineno
    visit_TryFinally = track_enter_leave_lineno
    visit_Assert = track_enter_leave_lineno
    visit_Import = track_enter_leave_lineno
    visit_ImportFrom = track_enter_leave_lineno
    visit_Exec = track_enter_leave_lineno
    #Global
    visit_Expr = track_enter_leave_lineno
    visit_Pass = track_enter_leave_lineno

    # These statements can be reached, but they change
    # control flow and are never exited.
    def track_reached_lineno(self, node):
        node = self.generic_visit(node)
        id = next(self.counter)
        reach = parse("ast_reached[%d] += 1" % id).body[0]
        self.reach_linenos[id] = (node.lineno, node.col_offset)
        copy_location(reach, node)

        n = Num(n=1)
        copy_location(n, node)
        if_node = If(test=n, body=[reach, node], orelse=[])
        copy_location(if_node, node)
        return if_node

    visit_Return = track_reached_lineno
    visit_Raise = track_reached_lineno
    visit_Break = track_reached_lineno
    visit_Continue = track_reached_lineno
    
    # Some code to instrument the run-time and check for '%' failures.
    def visit_BinOp(self, node):
        if isinstance(node.op, Mod):
            new_node = Call(func=Name(id='check_string', ctx=Load()),
                            args=[node.left, node.right,
                                  Num(n=node.lineno),
                                  Num(n=node.col_offset)],
                            keywords = [], starargs=None, kwargs=None
                            )
            copy_location(new_node, node)
            fix_missing_locations(new_node)
            return new_node
        return node

old_compile = __builtin__.compile

def compile(source, filename, mode, flags=0): # skipping a few parameters
    # My rewrite code uses ast.parse, which ends up calling this
    # function with this argument, so pass it back to the real compile.
    if flags == PyCF_ONLY_AST:
        return old_compile(source, filename, mode, flags)
    assert mode == "exec"
    #traceback.print_stack()
    code = open(filename).read()
    tree = parse(code, filename)
    tree = RewriteInterpolation(filename).visit(tree)
    code = old_compile(tree, filename, "exec")
    return code

# Ugly hack so I can force compileall to use my compile function.
__builtin__.compile = compile

exit_status = int(not compileall.main())
sys.exit(exit_status)
&lt;/pre&gt;

I placed the following test program in "spam/testing.py"

&lt;pre class="code"&gt;
def main():

  def f(x):
    if x &amp;gt; 0:
      return x*x
    1/0

  for i in range(4, 9):
    if f(i) &amp;lt; 0: x=9
    if i == 8:
       continue
       print "Here"
    if i == 10:
       continue

  try:
      raise TypeError("Hi! %d" % "sdfa")
  except TypeError:
      pass

main()
&lt;/pre&gt;

I then compiled all of the .py files in the 'spam' directory with

&lt;pre class="code"&gt;
python ast_compileall.py spam
&lt;/pre&gt;

and I made sure the following was on my PYTHONPATH as "ast_report.py"

&lt;pre class="code"&gt;
# ast_report.py
from collections import defaultdict
import traceback
import atexit
import linecache

loaded_modules = []

class FileInfo(object):
    def __init__(self, filename, enter_linenos, reach_linenos):
        self.filename = filename
        self.enter_linenos = enter_linenos
        self.reach_linenos = reach_linenos
        self.ast_enter = defaultdict(int)
        self.ast_leave = defaultdict(int)
        self.ast_reach = defaultdict(int)

def register_module(filename, enter_linenos, reach_linenos):
    #print filename, enter_linenos, reach_linenos
    info = FileInfo(filename, enter_linenos, reach_linenos)
    loaded_modules.append(info)
    return info.ast_enter, info.ast_leave, info.ast_reach

def check_string(left, right, lineno, col_offset):
    if not isinstance(left, basestring):
        return left % right
    try:
        return left % right
    except Exception, err:
        print "Could not interpolate: %s" % (err,)
        traceback.print_stack()
        raise

# Basic coverage report
def report_coverage():
    for fileinfo in loaded_modules:
        # This will contain a list of all results as a 3-ple of
        #   lineno, col_offset, "text message"
        report = []
        # These should have both 'enter' and 'leave' counts.
        for id, (lineno, col_offset) in fileinfo.enter_linenos.items():
            if id not in fileinfo.ast_enter:
                report.append( (lineno, col_offset, "not entered") )
            elif id not in fileinfo.ast_leave:
                report.append( (lineno, col_offset, "enter %d but never left" %
                                fileinfo.ast_enter[id]) )
            else:
                delta = fileinfo.ast_leave[id] - fileinfo.ast_enter[id]
                report.append( (lineno, col_offset, "enter %d leave %d (diff %d)" %
                                (fileinfo.ast_enter[id], fileinfo.ast_leave[id], delta)) )

        # These only need to be 'reach'ed
        for id, (lineno, col_offset) in fileinfo.reach_linenos.items():
            if id not in fileinfo.ast_reach:
                report.append( (lineno, col_offset, "not reached") )
            else:
                report.append( (lineno, col_offset, "reach %d" % (fileinfo.ast_reach[id],)) )

        # sort by line number, breaking ties by column offset
        report.sort()

        print "Coverage results for file", fileinfo.filename
        for lineno, col_offset, msg in report:
            print "%d:%d %s" % (lineno, col_offset+1, msg)
            print linecache.getline(fileinfo.filename, lineno).rstrip()

# Dump the coverage results when Python exits.
atexit.register(report_coverage)
&lt;/pre&gt;

(While I used an atexit hook here, I did that because it was the
fastest way to get to a proof-of-concept solution. Really I think this
should be more like how coverage.py works, with a command-line script
which sets up the run environment and reports the results at the end.)

&lt;/P&gt;
&lt;h2&gt;Try it out!&lt;/h2&gt;
&lt;P&gt;

This coverage code will only work on modules which were imported,
where the .pyc file is used instead of the .py file. (But perhaps an
import hook would be useful or at least interesting here?) What I do
is import the module via the command-line

&lt;pre class="code"&gt;
% cd spam/
% python -c 'import testing'
Could not interpolate: %d format: a number is required, not str
  File "&amp;lt;string&amp;gt;", line 1, in &amp;lt;module&amp;gt;
  File "spam/testing.py", line 21, in &amp;lt;module&amp;gt;
    main()
  File "spam/testing.py", line 17, in main
  File "ast_report.py", line 30, in check_string
    traceback.print_stack()
Coverage results for file spam/testing.py
1:1 enter 1 leave 1 (diff 0)
def main():
3:3 enter 1 leave 1 (diff 0)
  def f(x):
4:5 enter 5 but never left
    if x &amp;gt; 0:
5:7 reach 5
      return x*x
6:5 not entered
    1/0
8:3 enter 1 leave 1 (diff 0)
  for i in range(4, 9):
9:5 enter 5 leave 5 (diff 0)
    if f(i) &amp;lt; 0: x=9
9:18 not entered
    if f(i) &amp;lt; 0: x=9
10:5 enter 5 leave 4 (diff -1)
    if i == 8:
11:8 reach 1
       continue
12:8 not entered
       print "Here"
13:5 enter 4 leave 4 (diff 0)
    if i == 10:
14:8 not reached
       continue
16:3 enter 1 leave 1 (diff 0)
  try:
17:7 reach 1
      raise TypeError("Hi! %d" % "sdfa")
19:7 enter 1 leave 1 (diff 0)
      pass
21:1 enter 1 leave 1 (diff 0)
main()
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

You can see that it reports the string interpolation failure without a
problem, and if you look closely you'll see that it catches that the
"if" on line 9 is executed while the "x=9" also on line 9 is never
executed.

&lt;/P&gt;&lt;P&gt;

There are also some problems. Line 4 reports that the code was entered 5
times and never left, but that's a bit of a false positive since it
left through a return statement. I think now, after additional
thought, that the better solution is to put the "leave" test on the
first line of each possible branch.

&lt;/P&gt;
&lt;h2&gt;Pluses and minuses&lt;/h2&gt;
&lt;P&gt;

There are some great advantages to this approach.

&lt;ul&gt;
&lt;li&gt;I don't need to look at the stack frame to figure out where I
  am, or even use the sys.settrace() hook.&lt;/li&gt;

&lt;li&gt;I get coverage testing of every statement on a line.&lt;/li&gt;

&lt;li&gt;I can instrument a specific and limited set of Python files&lt;/li&gt;

&lt;li&gt;Full branch coverage is possible.&lt;/li&gt;
  
&lt;li&gt;I can add tests which are almost impossible to add otherwise (like
   "%d" % "asdf"; or what about checking if the RHS of an assert will
   actually work?)&lt;/li&gt;

&lt;li&gt;What about instrumenting all "d.keys()" calls in Python 2.x code
   to check and report if a dict keys() result is ever used as
   something other than the iterator, like it would be in Python 3.x?&lt;/li&gt;

&lt;/ul&gt;

Some very complex things are possible. Some very evil things are also
possible.

&lt;/P&gt;&lt;P&gt;

There are some difficult problems as well. Consider:

&lt;pre class="code"&gt;
x = arg or default_arg or die(_("missing arg"))
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

Branch reporting should say that 'arg' tested both True and False,
that default_arg tested True and False and ... that the result of
die() tested both True and False? 
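
&lt;/P&gt;&lt;P&gt;

To make the question concrete, here is a hand-instrumented sketch
(Python 3) of that one-liner. The record() helper, the "seen" table and
the Missing exception are made up for this illustration, die() is a
stand-in which simply raises, and the _() translation call is left out.

&lt;pre class="code"&gt;
```python
from collections import defaultdict

# Hypothetical bookkeeping: label -> set of truth values seen so far.
seen = defaultdict(set)

def record(label, value):
    """Note whether `value` tested true or false, then pass it through."""
    seen[label].add(bool(value))
    return value

class Missing(Exception):
    pass

def die(message):
    # Stand-in for the die() in the example; it never returns normally.
    raise Missing(message)

def choose(arg, default_arg):
    # Hand-instrumented form of: x = arg or default_arg or die("missing arg")
    return (record("arg", arg)
            or record("default_arg", default_arg)
            or record("die()", die("missing arg")))

choose("given", None)       # arg is true; nothing else is evaluated
choose(None, "fallback")    # arg is false, default_arg is true
try:
    choose(None, None)      # both are false, so die() runs and raises
except Missing:
    pass

for label in ("arg", "default_arg", "die()"):
    print(label, sorted(seen[label]))
```
&lt;/pre&gt;

Both 'arg' and 'default_arg' end up with True and False recorded, while
the die() entry stays empty because the call raises before its result
can be tested. A branch reporter would need some way to know that this
is expected.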

&lt;/P&gt;&lt;P&gt;

And just how should someone visualize all this extra data?

&lt;/P&gt;
&lt;h2&gt;"See also" and ruminations&lt;/h2&gt;
&lt;P&gt;

I talked with Ned some after my presentation. He pointed out that the
complex part of coverage.py, which he's worked on a lot during the
last year, is to make the system configurable so it can be told which
coverage to ignore. I know what he means. In the late 1990s I added
the "#pragma: no cover" option to the early form of coverage.py, which
exists (although not my actual code) to this day.

&lt;/P&gt;&lt;P&gt;

If coverage works on a more fine-grained level, how do you suppress
the false warnings so the true issues aren't hidden in the noise?

&lt;/P&gt;&lt;P&gt;

Ned also pointed out Matthew J. Desmarais' work with
&lt;a href="http://bitbucket.org/desmaj/canopy/wiki/Home"&gt;Canopy&lt;/a&gt; 

&lt;blockquote&gt;
instrument python code to generate robust coverage information. the goal is to provide modified condition/decision coverage metrics.
&lt;/blockquote&gt;

I'm not the only one who has thought about instrumenting the AST, even
in Python. (The Lisp community likely thought of this before I was
born.) What I've hoped to do here is explain it well enough so that
you can figure out how this approach works and come up with ways to
extend it for the future.... or figure out why it fails.

&lt;/P&gt;&lt;P&gt;

If you are doing that, do bear in mind my &lt;a
href="http://www.dalkescientific.com/Python/python4ply.html"&gt;python4ply&lt;/a&gt;
package. It contains a full grammar definition for Python using PLY,
with support for the decrepit AST from the compiler
module. Potentially you could use it to have Python 3 generate an AST
for Python 2, or even vice versa, with a lot more work.

&lt;/P&gt;&lt;P&gt;

Or, if you have both money and interest, perhaps you'll fund me? I am
a consultant, after all. I mostly work in computational chemistry and
my clients aren't interested in this sort of deep language analysis,
so I only work on this during rare intervals. It's not only money, but
access to people who want these sorts of capabilities and can give me
feedback on what they want and how effective a solution is.

&lt;/P&gt;&lt;P&gt;

Or, if you want to work on it yourself - feel free! I hereby release
all of this code to the public domain, and disavow any copyright
interest in the code expressed in this article. You don't even have to
mention my name. Just develop good testing tools.

&lt;/P&gt;&lt;P&gt;

I know there are a number of tools in the greater world of computing
which can work on ASTs. I have no experience with them. Perhaps it's
best to convert the Python AST to some other tree grammar where there
is a tree manipulation language? When I'm feeling crazy I think "just
convert the AST to XML then use XSLT to add the instrumentation, and
convert the resulting XML back to an AST." How sane is that? And it
would mean I would have to learn a lot more about XSLT. Or what about
ANTLR's tree grammars? But then there's &lt;a
href="http://www.antlr.org/article/1170602723163/treewalkers.html"&gt;Manual
Tree Walking Is Better Than Tree Grammars&lt;/a&gt;. It's a Brave New World.

&lt;/P&gt;
&lt;h2&gt;Thanks!&lt;/h2&gt;
&lt;P&gt;

I thank Armin Rigo, Brett Cannon, Grant Edwards, John Ehresman, Jeremy
Hylton, Kurt Kaiser, Neal Norwitz, Neil Schemenauer, Nick Coghlan, Tim
Peters, Martin von L&amp;ouml;wis and everyone else who worked on the ast
module. Without them this would be a much harder problem.

&lt;/P&gt;
&lt;h3&gt;Any comments?&lt;/h3&gt;
&lt;P&gt;

Leave them &lt;a href="http://dalkescientific.blogspot.com/2010/02/instrumenting-ast.html"&gt;here&lt;/a&gt;.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/02/22/instrumenting_the_ast.html</guid><pubDate>Mon, 22 Feb 2010 12:00:00 GMT</pubDate></item><item><title>New Cheminformatics Projects</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/02/04/new_cheminformatics_projects.html</link><description>&lt;P&gt;

I've started two new open projects for cheminformatics and I'm looking
for help in both of them.

&lt;/P&gt;
&lt;h2&gt;Chemistry Toolkit Rosetta&lt;/h2&gt;
&lt;P&gt;

The &lt;a
href="http://ctr.wikia.com/wiki/Chemistry_Toolkit_Rosetta_Wiki"&gt;Chemistry
Toolkit Rosetta&lt;/a&gt; (CTR) is a set of common cheminformatics tasks
implemented using a variety of different toolkits and approaches. It
is meant primarily as a way for people to understand and compare how
the different APIs work.

&lt;/P&gt;&lt;P&gt;

Currently there are 16 tasks, 14 of which are well-defined and have at
least one solution (in OpenEye/Python since that's what I know best).
Several also have solutions in Pybel, and there are a couple of RDKit
and CDK solutions as well.

&lt;/P&gt;&lt;P&gt;

Some of the CTR tasks are:

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://ctr.wikia.com/wiki/Heavy_atom_counts_from_an_SD_file"&gt;Heavy atom counds from an SD file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ctr.wikia.com/wiki/Working_with_SD_tag_data"&gt;Working with SD tag data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ctr.wikia.com/wiki/Find_the_10_nearest_neighbors_in_a_data_set"&gt;Find the 10 nearest neighbors in a data set&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ctr.wikia.com/wiki/Calculate_TPSA"&gt;Calculate TPSA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;/P&gt;&lt;P&gt;

It needs your help. The project started in part because I don't know
RDKit, CDK, or Indigo that well - to say nothing of the commercial
tools available from Symyx, Accelrys, Schrodinger, and others. I know
them a bit better now, but not enough.

&lt;/P&gt;&lt;P&gt;

Feel free to contribute a solution in your toolkit of choice! Or
provide commentary, feedback, or improve an existing solution. You can
even contribute a new task, if it's characteristic of a frequently
encountered cheminformatics-related problem which several toolkits can
handle.

&lt;/P&gt;&lt;P&gt;

By the way, I give a big thanks to Noel O'Boyle for his feedback on
the project direction and for his Pybel and Cinfony contributions to
help flesh out CTR before this public announcement.

&lt;/P&gt;
&lt;h2&gt;Chem Fingerprints&lt;/h2&gt;
&lt;P&gt;

The other project I started is called "&lt;a
href="http://code.google.com/p/chem-fingerprints/"&gt;chem-fingerprints&lt;/a&gt;"
or "chemfp" for short. Its goal is to develop a couple of file formats
for cheminformatics fingerprints as well as tools and libraries which
work with those formats.

&lt;/P&gt;&lt;P&gt;

The main problem it addresses is that there is no widely used
fingerprint format, so each research group or even individual
researcher ends up making a new one, as well as the tools to work with
it. See the &lt;a
href="http://code.google.com/p/chem-fingerprints/wiki/UseCases"&gt;use
cases&lt;/a&gt; for some more detailed examples.

&lt;/P&gt;&lt;P&gt;

So far I've written a proposal for a line-oriented text format called
"&lt;a
href="http://code.google.com/p/chem-fingerprints/wiki/FPS"&gt;FPS&lt;/a&gt;"
meant to be easy to generate and parse, and have sketched out a binary
format called &lt;a
href="http://code.google.com/p/chem-fingerprints/wiki/FPB"&gt;FPB&lt;/a&gt;
meant for fast loading, at the expense of some preprocessing.

&lt;/P&gt;&lt;P&gt;

The FPS format is simple enough that you can likely figure out most of
it from this example, taken from the specification:

&lt;pre class="code"&gt;
 #FPS1
 #num_bits=256
 #software=RDKit/2009Q3_1
 #params=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True
 #source=/Users/dalke/databases/Compound_00000001_00025000.sdf.gz
 #date=2010-01-27T02:22:26
 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 1
 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 2
 ffffbfdfffffffffbfeffffffffffffffffffffffff77efffffffebfffffffef 3
 00c02010002610000080800041100002084000440d100000c055048801224400 4
&lt;/pre&gt;
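
&lt;/P&gt;&lt;P&gt;

As a rough illustration of how little code that takes, here is a
minimal Python 3 sketch of a reader. It is based only on the example
above and not on the full specification: the "key=value" header
handling, the whitespace separator between the hex fingerprint and the
identifier, and the "format" key used for the first line are all
assumptions.

&lt;pre class="code"&gt;
```python
def parse_fps(lines):
    """Split FPS-style lines into header fields and fingerprint records."""
    header = {}
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # "#key=value" lines are metadata. "#FPS1" has no "=", so it
            # is stored here under the made-up key "format".
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key] = value
            else:
                header["format"] = key
            continue
        # Assumed layout: hex-encoded fingerprint, whitespace, identifier.
        hex_fp, identifier = line.split(None, 1)
        records.append((identifier, bytes.fromhex(hex_fp)))
    return header, records

sample = """\
#FPS1
#num_bits=256
#software=RDKit/2009Q3_1
00c02010002610000080800041100002084000440d100000c055048801224400 4
"""
header, records = parse_fps(sample.splitlines())
identifier, fingerprint = records[0]
print(header["num_bits"], identifier, len(fingerprint) * 8)
```
&lt;/pre&gt;

The 64 hex characters on each fingerprint line decode to 32 bytes,
which matches the num_bits=256 header.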

&lt;/P&gt;&lt;P&gt;

I've developed a set of tools to generate FPS fingerprints from
OpenEye, OEChem, and RDKit, as well as to extract fingerprints from SD
tags; specifically the CACTVS substructure keys in PubChem. These are
available from &lt;a
href="http://code.google.com/p/chem-fingerprints/source/checkout"&gt;the
Mercurial repository&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

These tools are in development status, and are primarily meant at this
time as a way to get concrete feedback for the specification.

&lt;/P&gt;&lt;P&gt;

Other tools I would like to develop, perhaps with your help, are
command-line programs for similarity search and substructure filters.

&lt;/P&gt;&lt;P&gt;

I'm also looking for input and feedback on the format definitions, and
for people who want to add support for these formats in their tools.

&lt;/P&gt;&lt;P&gt;

If you are interested in chemfp, then sign up on the &lt;a
href="http://eight.pairlist.net/mailman/listinfo/chemfp"&gt;chemfp
mailing list&lt;/a&gt;.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/02/04/new_cheminformatics_projects.html</guid><pubDate>Thu, 04 Feb 2010 12:00:00 GMT</pubDate></item><item><title>Project hosting options?</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/01/30/project_hosting_options.html</link><description>&lt;P&gt;

I started a new project for cheminformatics fingerprints and want to
make it available for general use. It contains software under the MIT
license and specifications under a license as lenient as I can make
it. (Likely CC-BY.)

&lt;/P&gt;&lt;P&gt;

I looked around for project hosting. My requirements are:
&lt;ul&gt;

&lt;li&gt;Mailing list with only Mailman-style double opt-in. Specifically,
I expect most people who subscribe will not want to have to set up an
account on the project hosting provider before signing up for the
mailing list. &lt;/li&gt;

 &lt;li&gt;Mercurial support&lt;/li&gt;

 &lt;li&gt;Simple web hosting, preferably a wiki where I can put up
specifications and a few related documents and where others can do
some edits&lt;/li&gt;

 &lt;li&gt;Bug and issue tracking would be nice, but not essential since
this is a small project and a TODO in version control should be
fine.&lt;/li&gt;

&lt;/ul&gt;

That's it. Very simple, yes?

&lt;/P&gt;
&lt;h2&gt;The options&lt;/h2&gt;
&lt;P&gt;

I know there's a bunch of resources these days, and in my searches I
found Wikipedia's 

&lt;a
href="http://en.wikipedia.org/wiki/Comparison_of_open_source_software_hosting_facilities"&gt;Comparison
of open source software hosting facilities&lt;/a&gt;. As you can see, there
are quite a few. Sort on version control systems and the ones which
support Mercurial are Alioth, Assembla, BerliOS, Bitbucket, CodePlex,
GNU Savannah, Google Code, JavaForge, KnowledgeForge, Project Kenai,
and SourceForge.

&lt;/P&gt;
&lt;h3&gt;Must have mailing list and web page or wiki hosting&lt;/h3&gt;
&lt;P&gt;

Next, filter out those which don't have mailing lists, which removes
Assembla, Bitbucket, and JavaForge. It's a shame about losing
Bitbucket since that's what I would have liked. With reluctance I also
dropped Google Code since its mailing lists require a Google account.
I think that's too high a barrier to entry. I also dropped GNU
Savannah since it doesn't have web or wiki hosting.

&lt;/P&gt;&lt;P&gt;

What's left are: Alioth, BerliOS, CodePlex, KnowledgeForge, Project
Kenai, and SourceForge.

&lt;/P&gt;
&lt;h3&gt;I want to try something other than SourceForge or a clone&lt;/h3&gt;
&lt;P&gt;

Of those I have only used SourceForge, and done that for over 10
years. It feels very clunky and cluttered compared to Google Code and
downloading packages is a nuisance for people like me who would rather
curl the files than use a browser. Perhaps it's time to try something
different? That puts BerliOS out, since it's derived from the
SourceForge code base, as is GNU Savannah, and so is Alioth through
GForge.

&lt;/P&gt;&lt;P&gt;

What's left? CodePlex, KnowledgeForge, and Project Kenai.

&lt;/P&gt;
&lt;h3&gt;Must support non-member access to a mailing list&lt;/h3&gt;
&lt;P&gt;

I looked at CodePlex. I think you have to be a CodePlex member in
order to leave discussions, and it uses web-based forum software
instead of email. That is, I selected some of the projects which have
been downloaded the most often but never could find a "subscribe to
the mailing list" option. Perhaps most people in the Microsoft Windows
and .Net space don't do email? 

&lt;/P&gt;&lt;P&gt;

In any case, it doesn't seem to fit my requirements.

&lt;/P&gt;
&lt;h3&gt;Remaining options: KnowledgeForge and Project Kenai&lt;/h3&gt;
&lt;P&gt;

I looked at KnowledgeForge and while it seems to fit my requirements,
there aren't many people using it, although others may be using the
underlying KForge application to host their own system. My concern is
that the rough edges wouldn't have been worn down by other users.

&lt;/P&gt;&lt;P&gt;

That left me with Project Kenai, which also seemed to do what I
wanted, and it has more and larger development projects, including &lt;a
href="http://jruby.org/"&gt;JRuby&lt;/a&gt;. Okay, I'll try it out.

&lt;/P&gt;
&lt;h2&gt;Project Kenai&lt;/h2&gt;
&lt;P&gt;

(Update based on &lt;a
href="http://dalkescientific.blogspot.com/2010/01/project-hosting-options.html"&gt;feedback&lt;/a&gt;.
As of 27 Jan 2010 (or about two days after I registered on Kenai, and
two days before I posted this essay), Oracle, who owns Sun, said they
would be "&lt;a
href="http://blogs.sun.com/projectkenai/entry/the_future_of_kenai_com"&gt;phasing
out of the public-facing domain used for the Project Kenai Beta
site&lt;/a&gt;." Therefore, you shouldn't use it.)

&lt;/P&gt;&lt;P&gt;
I requested project hosting and got it. I set up the project, worked
on code, and updated the wiki. It seemed nice enough, with really no
problems to speak of. I was happy enough.

&lt;/P&gt;&lt;P&gt;

I liked some of the tweaks, like how it uses AJAX to update the
displayed content rather than doing a full page submission like when
editing Wikipedia. Though now that I think of it, I adore how
StackOverflow shows the formatted content while you type.

&lt;/P&gt;
&lt;h3&gt;Show stopper - non-member access to the mailing list&lt;/h3&gt;
&lt;P&gt;

Until I got to the email part. Turns out Project Kenai &lt;a
href="http://projectkenai.com/projects/help/pages/MailingLists#Subscribing_to_or_Unsubscribing_From_a_List"&gt;does
allow non-members to join a list&lt;/a&gt;, but they have to &lt;a
href="http://projectkenai.com/projects/help/pages/MailingLists#Using_Email_to_Subscribe_or_Unsubscribe"&gt;email
a subscribe request&lt;/a&gt; to the Sympa email system. It's very much like
the old majordomo list manager, with no web-based front-end to help
out.

&lt;/P&gt;&lt;P&gt;

I found that by searching the help files. There's no clue that that's
even possible from the normal "mailing lists" page for a project. But
perhaps I could remedy that with instructions on how to sign up
without being a member.

&lt;/P&gt;&lt;P&gt;

The only way I could do that was on the wiki home page. I did that
then asked a friend of mine to try it out. He followed the main
mailing-list link from Kenai and never saw my note. Once on that page
he couldn't figure out how to join without being a member, and he
doesn't want to do all that just to join a mailing list.


&lt;/P&gt;&lt;P&gt;

Once I pointed out the manual instructions, he tried that out. I got
an email which said I need to manually confirm him as a member. On his
side he only saw that he was now a member, and didn't like the lack of
the Mailman-style double opt-in. As far as he could tell, anyone could
register anyone else through a forged email.

&lt;/P&gt;&lt;P&gt;

That's a serious down-check, since while technically it meets my
requirements, it doesn't meet the spirit.

&lt;/P&gt;
&lt;h3&gt;No response to a feature request&lt;/h3&gt;
&lt;P&gt;

I posted this request &lt;a
href="http://projectkenai.com/projects/help/forums/features/topics/2398-non-member-mailing-list-subscribers-through-the-web"&gt;to
the features list&lt;/a&gt; a couple of days ago and got no
response.

&lt;/P&gt;&lt;P&gt;

I do realize this is a free project, so I can make no demands nor
should I expect a fast response. That's why I waited a couple of days
before writing this posting. But a reason for trying Project Kenai was
because its size should mean it has more of these kinks worked out,
and its support by Sun should imply there's someone to answer mail.

&lt;/P&gt;
&lt;h2&gt;Just choose SourceForge?&lt;/h2&gt;
&lt;P&gt;

As for the project, my conclusion is to just go ahead and use
SourceForge. It's clumsy but I know it handles my needs.

&lt;/P&gt;&lt;P&gt;

Unless you have a &lt;a
href="http://dalkescientific.blogspot.com/2010/01/project-hosting-options.html"&gt;better
suggestion&lt;/a&gt;? Perhaps you think I should try BerliOS?

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/01/30/project_hosting_options.html</guid><pubDate>Sat, 30 Jan 2010 12:00:00 GMT</pubDate></item><item><title>Cheminformatics, bioinformatics, and system biology positions available</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/01/20/informatics_positions_available.html</link><description>&lt;P&gt;

A couple of people I know have computational chemistry and biology positions available,
which might interest some of my readers.

&lt;/P&gt;
&lt;h2&gt;University College Cork, Ireland&lt;/h2&gt;
&lt;P&gt;

Noel O'Boyle (author of &lt;a
href="http://github.com/baoilleach/twirlymol"&gt;TwirlyMol&lt;/a&gt;, and &lt;a
href="http://code.google.com/p/cinfony/"&gt;Cinfony&lt;/a&gt;, contributor to
&lt;a href="http://openbabel.org/"&gt;OpenBabel&lt;/a&gt; and &lt;a
href="http://blueobelisk.sourceforge.net/wiki/Main_Page"&gt;BlueObelisk&lt;/a&gt;,
researcher in algorithms in cheminformatics and scoring functions for
protein-ligand docking, and all-around good guy), has funding for a
PhD student at the School of Pharmacy, University College Cork.

&lt;/P&gt;&lt;P&gt;

You would be developing open source tools for cheminformatics. I'm
actually tempted by this one, except I think consulting pays better, I
might have to move from Sweden, and I've got a different thesis I
would like to work on. 

&lt;/P&gt;&lt;P&gt;

&lt;a href="http://baoilleach.blogspot.com/2010/01/invitation-to-apply-for-phd-in.html"&gt;Details about this PhD position.&lt;/a&gt;

&lt;/P&gt;
&lt;h2&gt;Technical University of Denmark, on the north side of Copenhagen&lt;/h2&gt;
&lt;P&gt;

Thomas Sicheritz-Ponten (whom I last saw at ISMB/Copenhagen, where he
treated our table at the Scottish bar to a round of whiskys and I
danced hustle with his girlfriend while the American musician played
covers which everyone in the bar followed along to), has four open
positions in his Metagenomics group at the Center for Biological
Sequence Analysis.  That's 3 PhD/postdoc positions in next gen
sequencing related metagenomics and 1 postdoc position in archaeal
genetics. 

&lt;/P&gt;&lt;P&gt;

He also pointed out that there are 13 open positions (PhD, postdoc, and
programmer) for the entire center, including: postdoc within
predictive bioinformatics for molecular epidemiology, scientific
programmer for molecular epidemiology, PhD within epitope prediction,
scientific programmer within disease systems biology, and postdoc or
research assistant, bioinformatics generalist.

&lt;/P&gt;&lt;P&gt;

&lt;a href="http://www.cbs.dtu.dk/staff/jobs.html"&gt;Details about
the CBSA positions.&lt;/a&gt;

&lt;/P&gt;&lt;P&gt;

Get out there and start researching!

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/01/20/informatics_positions_available.html</guid><pubDate>Wed, 20 Jan 2010 12:00:00 GMT</pubDate></item><item><title>Fingerprint File Format</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/01/11/fingerprint_file_format.html</link><description>&lt;P&gt;

(This has nothing at all to do with human fingerprints. I'm talking
about a technique used in chemistry to represent characteristics
of a small molecule as a bit string so that fast bit operations can be
used instead of slow graph operations. It's often rather like a
&lt;a href="http://en.wikipedia.org/wiki/Bloom_filter"&gt;Bloom filter&lt;/a&gt;
where the hash function is based on molecular substructure.)

&lt;/P&gt;&lt;P&gt;

SD files and SMILES files are the two most common small molecule
structure formats. PDB is the most common macromolecular structure
format. FASTA and GenBank are similarly common in sequence
databases. But molecular fingerprints don't have a common format.

&lt;/P&gt;&lt;P&gt;

I propose a new common format for these, called "FingerPrint Format",
or "FPF file", with the standard extension of ".fpf". I'm looking for
&lt;a href="http://dalkescientific.blogspot.com/2010/01/fingerprint-file-format.html"&gt;feedback and suggestions&lt;/a&gt;, and hopefully also uptake by others.

&lt;/P&gt;&lt;P&gt;

One problem of course is that there are many types of fingerprints and
many ways to represent them. I'm limiting myself to those which are
easily represented as a fixed-length bit string of length between 8
and about 8K, and which are dense enough to store the bits directly as
bits rather than run-length or other encoding.

&lt;/P&gt;&lt;P&gt;

My goal is to make fingerprint data sets more portable, so that tools
and algorithms developed by one group can more easily be shared and
tested by another. 

&lt;/P&gt;
&lt;h2&gt;Use cases&lt;/h2&gt;
&lt;P&gt;

Yes, these are all oriented around me. If it wasn't something I wanted
then I would be getting paid for this. (Got funding?)

&lt;/P&gt;&lt;P&gt;
1) One of my clients wanted a 3-nearest-neighbors search (within some
cutoff) of a set of Daylight fingerprints, to be used in part of their
dataflow system. The files were static, and only updated once every 6
months or so. There was no need for a database, so I wrote a one-off
search system for them.

&lt;/P&gt;&lt;P&gt;

2) I want to write a high-speed Tanimoto similarity search algorithm
which works with a pre-built data structure and makes good use of
modern processors. Most formats are not designed for high performance
and require a preprocessing phase to load the data, and in a
command-line tool the data load time will be the limiting factor, not
the search.

&lt;/P&gt;&lt;P&gt;

3) I've found it a bit trying to determine the format and bit order of
each fingerprint tool I use, since they are almost all different.

&lt;/P&gt;&lt;P&gt;

4) I want to develop some screening data sets for substructure queries
and evaluate how effective they are against different inputs.

&lt;/P&gt;&lt;P&gt;

5) Oh, here's one which others want! Do the N**2 nearest-neighbor
searches of a data set across a set of machines without spending much
time building a fingerprint generation and parsing infrastructure.

&lt;/P&gt;
&lt;h2&gt;Design goals&lt;/h2&gt;
&lt;P&gt;

Here are my design goals:

&lt;ul&gt;
  &lt;li&gt;be able to share fingerprint files between different tools&lt;/li&gt;
  &lt;li&gt;store "small" dense fingerprints and unique identifiers&lt;/li&gt;
  &lt;li&gt;fast load times&lt;/li&gt;
  &lt;li&gt;support for memory-mapped access&lt;/li&gt;
  &lt;li&gt;word-aligned binary data (with selectable word sizes)&lt;/li&gt;
  &lt;li&gt;fast lookup from fingerprint match to identifier&lt;/li&gt;
  &lt;li&gt;input order does not need to be preserved&lt;/li&gt;
  &lt;li&gt;allow some future extensibility&lt;/li&gt;
  &lt;li&gt;everything is stored in a single file&lt;/li&gt;
  &lt;li&gt;allow fingerprints sorted by popcount, for Baldi optimization&lt;/li&gt;
  &lt;li&gt;portable across different architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;/P&gt;&lt;P&gt;

These are not design goals:

&lt;ul&gt;
  &lt;li&gt;human-readable format&lt;/li&gt;
  &lt;li&gt;compatibility with an existing fingerprint format&lt;/li&gt;
  &lt;li&gt;minimize space/compressibility&lt;/li&gt;
&lt;/ul&gt;

&lt;/P&gt;&lt;P&gt;

I haven't decided how important these are:

&lt;ul&gt;
  &lt;li&gt;error detection&lt;/li&gt;
  &lt;li&gt;preserving input order&lt;/li&gt;
  &lt;li&gt;linear search of the fingerprint names is acceptable (vs. log time) &lt;/li&gt;
  &lt;li&gt;support largish data sets (&amp;gt;4 million structures with a 1Kbit fingerprint)&lt;/li&gt;
&lt;/ul&gt;

&lt;/P&gt;&lt;P&gt;

Also, while not a design goal, I would like to have a set of reference
implementations for different types of Tanimoto searches and
substructure filtering and others. I also want them to be fast enough to
be used in benchmarking new implementations, since I've seen too many
benchmarks in general where the reference baseline was a naive
(meaning "not fast") implementation.

&lt;/P&gt;
&lt;h2&gt;FPF is based on the PNG block structure&lt;/h2&gt;
&lt;P&gt;

I decided to base the format around the PNG specification. The first 8
bytes are a unique signature, and the rest of the file is a set of
blocks. Each block contains the length of its data as a 4 byte integer,
followed by a 4 byte identifier tag, followed by 'length' many bytes
for the block data, followed by 4 bytes for the CRC checksum.

&lt;/P&gt;&lt;P&gt;

All integers in this proposal are unsigned 32 bit values and stored in
network byte order, taking up four bytes. That leads to a limitation
in the fingerprint block size which I'll get to later.

&lt;/P&gt;&lt;P&gt;

FPF uses the same chunk layout, but with a different signature and
with tags which are more appropriate for fingerprints.

&lt;/P&gt;&lt;P&gt;

The FPF signature as hex bytes is &lt;tt&gt;0x89 0x46 0x50 0x46 0x0D 0x0A 0x1A 0x0A&lt;/tt&gt;. The subsequence &lt;tt&gt;0x46 0x50 0x46&lt;/tt&gt; is "FPF", where PNG uses "PNG".

&lt;/P&gt;&lt;P&gt;

I keep PNG's use of a CRC checksum as a way to check for
incomplete or corrupted files. I don't know how useful that is in
practice.
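
&lt;/P&gt;&lt;P&gt;

Here is a minimal Python 3 sketch of writing and reading blocks with
this layout, using only the standard struct and zlib modules. It
follows PNG's own chunk order, which is the length first, then the 4
byte tag, then the data, then a CRC-32 computed over the tag and the
data. The function names are made up for this example, and since the
proposal is still a draft the details should be read as assumptions.

&lt;pre class="code"&gt;
```python
import struct
import zlib

# Signature bytes as given in the text above.
SIGNATURE = bytes([0x89, 0x46, 0x50, 0x46, 0x0D, 0x0A, 0x1A, 0x0A])

def pack_block(tag, data):
    """Build one block: length, 4-byte tag, data, CRC (PNG's order)."""
    if len(tag) != 4:
        raise ValueError("tag must be exactly 4 bytes")
    crc = zlib.crc32(tag + data)  # PNG's CRC covers the tag and the data
    # "!I" is an unsigned 32 bit integer in network byte order.
    return struct.pack("!I", len(data)) + tag + data + struct.pack("!I", crc)

def iter_blocks(buf):
    """Yield (tag, data) for each block, checking signature and CRCs."""
    if buf[:8] != SIGNATURE:
        raise ValueError("not an FPF file")
    pos = 8
    while pos != len(buf):
        (length,) = struct.unpack_from("!I", buf, pos)
        tag = buf[pos + 4:pos + 8]
        data = buf[pos + 8:pos + 8 + length]
        (crc,) = struct.unpack_from("!I", buf, pos + 8 + length)
        if crc != zlib.crc32(tag + data):
            raise ValueError("bad CRC in block %r" % (tag,))
        yield tag, data
        pos += 12 + length  # 4 length + 4 tag + 4 CRC, plus the data

blob = (SIGNATURE
        + pack_block(b"tEXt", b"Software\x00demo")
        + pack_block(b"NAME", b"\x00"))
for tag, data in iter_blocks(blob):
    print(tag, len(data))
```
&lt;/pre&gt;

A truncated file makes struct.unpack_from raise struct.error, and a
flipped byte shows up as a CRC mismatch, which are the two failure
cases the checksum is meant to catch.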

&lt;/P&gt;
&lt;h2&gt;FPF block types&lt;/h2&gt;

&lt;h3&gt;FHDR: header block&lt;/h3&gt;
&lt;P&gt;

I haven't figured out what goes in here. Perhaps something which
indicates if the fingerprint type is known to be good/bad for
substructure filtering or comparison?

&lt;/P&gt;
&lt;h3&gt;tEXt (and potentially iTXt, zTXt): text block(s)&lt;/h3&gt;
&lt;P&gt;

These are exactly the same as
&lt;a href="http://www.mirrorservice.org/sites/www.libpng.org/pub/png/spec/1.2/png-1.2-pdg.html#C.Anc-text"&gt;from PNG&lt;/a&gt;, which are key/value pairs including the controlled vocabulary of:

&lt;pre&gt;
   Title            Short (one line) title or caption for image
   Author           Name of image's creator
   Description      Description of image (possibly long)
   Copyright        Copyright notice
   Creation Time    Time of original image creation
   Software         Software used to create the image
   Disclaimer       Legal disclaimer
   Warning          Warning of nature of content
   Source           Device used to create the image
   Comment          Miscellaneous comment; conversion from
                    GIF comment
&lt;/pre&gt;

Needless to say, but I'll say so anyway, some of these need to be tweaked for fingerprints.

&lt;/P&gt;&lt;P&gt;

I think FPF only needs tEXt, which is a simple key/value record using
Latin-1. If those should be UTF-8 then look at iTXt or come up with a
new type of text block.
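
&lt;/P&gt;&lt;P&gt;

For reference, the data part of a PNG tEXt chunk is a keyword of 1 to
79 bytes, a NUL separator, and then the text, all in Latin-1. A minimal
Python 3 sketch of that encoding follows. The function names are
hypothetical and the "Software" value is only sample data.

&lt;pre class="code"&gt;
```python
def pack_text(keyword, text):
    """Encode the data part of a tEXt block: keyword, NUL, text."""
    key = keyword.encode("latin-1")
    if len(key) not in range(1, 80) or b"\x00" in key:
        raise ValueError("PNG keywords are 1-79 bytes with no NUL")
    return key + b"\x00" + text.encode("latin-1")

def unpack_text(data):
    """Split tEXt data at the first NUL and decode both halves."""
    key, sep, text = data.partition(b"\x00")
    if not sep:
        raise ValueError("missing NUL separator")
    return key.decode("latin-1"), text.decode("latin-1")

data = pack_text("Software", "RDKit/2009Q3_1")
print(unpack_text(data))
```
&lt;/pre&gt;

Names outside Latin-1 make encode() raise UnicodeEncodeError, which is
the case where iTXt or a new UTF-8 block type would be needed.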

&lt;/P&gt;&lt;P&gt;

Question: should there be some field which encodes the generation
parameters ("OpenBabel FP2, folded to 32 bits")? That lets someone
pass in structures without having to know the details of the given
data set. Yes, these options are very implementation specific and not
portable.

&lt;/P&gt;
&lt;h3&gt;FDNS: dense fingerprint block&lt;/h3&gt;
&lt;P&gt;

The block format is:

&lt;pre class="code"&gt;
[ integer number of bits per fingerprint ]
[ integer number of bytes per fingerprint (including alignment) ]
[ integer number of bytes used for initial alignment = "initial alignment"]
[ "initial alignment" number of bytes of value 0 ]
[ fingerprint 1, which  is "number of bytes per fingerprint" long ]
[ fingerprint 2 ]
[ fingerprint 3 ]
  ...
[ fingerprint N ]
&lt;/pre&gt;

The fingerprints are stored in binary as a set of bytes. The bytes are
written in little-endian order and the bits of each byte are written
in big-endian order. To get bit &lt;i&gt;i&lt;/i&gt; from a fingerprint:

&lt;pre class="code"&gt;
byte_offset = i / 8;
bit_offset = i % 8;
bit = (fingerprint[byte_offset] &amp;gt;&amp;gt; bit_offset) &amp;amp; 1;
&lt;/pre&gt;

(CACTVS uses little-endian for the bytes and the bits, OpenBabel uses
big-endian. I decided this mixture was easier to code.)

&lt;/P&gt;&lt;P&gt;

The minimum number of bytes per fingerprint is
(bits_per_fingerprint+7)/8.  Extra bits used to fill in the remainder
of the last byte must be set to 0. The fingerprint may be padded with
extra 0 bytes in order to make the fingerprint a multiple of a word
size (typically 32, 64, or 128 bits).

&lt;/P&gt;&lt;P&gt;

The "initial alignment" part is there to align the fingerprints in the
case of memory-mapped files. It can be judiciously chosen so that the
first byte of the first fingerprint is word aligned or even page
aligned, if that makes a difference.

&lt;/P&gt;&lt;P&gt;

The number of fingerprints is not stored. It can be calculated based
on the block size, the header size, and the number of bytes per
fingerprint.
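
&lt;/P&gt;&lt;P&gt;

The size arithmetic can be sketched in a few lines of Python 3. This
assumes the three header integers are stored as described above (12
bytes in total, network byte order) and that each input fingerprint
arrives as exactly the minimum number of bytes. The function names and
the word_bytes parameter are made up for this example.

&lt;pre class="code"&gt;
```python
import struct

def bytes_per_fingerprint(num_bits, word_bytes=1):
    """Minimum byte count, rounded up to a multiple of the word size."""
    min_bytes = (num_bits + 7) // 8
    return -(-min_bytes // word_bytes) * word_bytes  # ceiling division

def pack_fdns(num_bits, fingerprints, word_bytes=1, initial_alignment=0):
    """Build the data for a dense fingerprint block."""
    min_bytes = (num_bits + 7) // 8
    fp_bytes = bytes_per_fingerprint(num_bits, word_bytes)
    parts = [struct.pack("!III", num_bits, fp_bytes, initial_alignment),
             bytes(initial_alignment)]  # alignment bytes of value 0
    for fp in fingerprints:
        if len(fp) != min_bytes:
            raise ValueError("fingerprint has the wrong length")
        parts.append(fp.ljust(fp_bytes, b"\x00"))  # pad with 0 bytes
    return b"".join(parts)

def count_fingerprints(block):
    """Recover the number of fingerprints from the block size alone."""
    num_bits, fp_bytes, initial_alignment = struct.unpack_from("!III", block, 0)
    payload = len(block) - 12 - initial_alignment
    count, leftover = divmod(payload, fp_bytes)
    if leftover:
        raise ValueError("block size is not a whole number of fingerprints")
    return count

block = pack_fdns(12, [b"\xff\x0f", b"\x01\x00", b"\x00\x00"],
                  word_bytes=4, initial_alignment=4)
print(len(block), count_fingerprints(block))
```
&lt;/pre&gt;

Here three 12-bit fingerprints are padded from 2 to 4 bytes each, so
the block is 12 + 4 + 3*4 = 28 bytes long and the count comes back
as 3.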

&lt;/P&gt;&lt;P&gt;

By the way, I expect memory alignment to be a down-stream issue where
one program can generate a simple fingerprint file, using no extra
alignment bytes, and another program can tweak the alignments to be
more optimal for a given OS and implementation.

&lt;/P&gt;
&lt;h3&gt;POPC: ordered population count offsets block&lt;/h3&gt;

&lt;P&gt;
&lt;pre class="code"&gt;
[ integer offset in FDNS to first fingerprint with popcount=0 ]
[ integer offset  "  "   to first fingerprint with popcount=1 ]
   ...   
[ integer offset  "  "   to first fingerprint with popcount=N ]
[ integer offset in FDNS to the byte after the last fingerprint with popcount=N ]
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

One way to speed up similarity queries and substructure filters is to
exclude testing fingerprints which trivially cannot be accepted based
on the popcounts of the target and query fingerprints. "Popcount" is
short for "population count" and is the number of set bits in the
fingerprint. For example, in a substructure filter test the query
cannot have a larger popcount than the target, and Baldi explained
how to reduce the Tanimoto search space.
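
&lt;/P&gt;&lt;P&gt;

As a sketch of the Tanimoto case: the similarity of two fingerprints
with popcounts a and b can be at most min(a, b)/max(a, b), so for a
threshold t only targets with popcount between a*t and a/t need to be
tested. That is my understanding of the bound credited to Swamidass and
Baldi. The Python 3 function names below are made up for this example.

&lt;pre class="code"&gt;
```python
from fractions import Fraction
import math

def popcount(fingerprint):
    """Number of set bits in a fingerprint given as bytes."""
    return sum(bin(byte).count("1") for byte in fingerprint)

def tanimoto_popcount_range(query_popcount, threshold, num_bits):
    """Inclusive range of target popcounts that could reach `threshold`.

    Pass the threshold as a Fraction or a string such as "0.8" so the
    bounds are exact; a float like 0.8 carries rounding error.
    """
    t = Fraction(threshold)
    if t == 0:
        return 0, num_bits
    low = math.ceil(query_popcount * t)
    high = min(num_bits, math.floor(query_popcount / t))
    return low, high

query = bytes([0xFF, 0x03])  # 10 bits set out of 16
print(popcount(query),
      tanimoto_popcount_range(popcount(query), Fraction(7, 10), 16))
```
&lt;/pre&gt;

With 10 bits set in the query and a threshold of 0.7, only targets with
between 7 and 14 bits set can possibly match.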

&lt;/P&gt;&lt;P&gt;

FPF stores the popcount information implicitly, by ordering the
fingerprints in the FDNS block so that all fingerprints with
popcount=0 are listed first, then those with popcount=1, then
popcount=2 and so on.

&lt;/P&gt;&lt;P&gt;

The "POPC" block contains byte offsets to the start and end of each
set of fingerprints with the same popcount. If there are N bits per
fingerprint then there will be N+2 offsets, and the fingerprints with
popcount &lt;i&gt;0&amp;lt;=i&amp;lt;=N&lt;/i&gt; are between offsets POPC[i] and POPC[i+1].
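
&lt;/P&gt;&lt;P&gt;

A minimal Python 3 sketch of building and using that offset table
follows. It assumes the fingerprints are already sorted by popcount and
that "start" is the byte offset of the first fingerprint inside the
fingerprint block (12 header bytes plus any initial alignment). The
function names are hypothetical.

&lt;pre class="code"&gt;
```python
def popcount(fingerprint):
    """Number of set bits in a fingerprint given as bytes."""
    return sum(bin(byte).count("1") for byte in fingerprint)

def build_popc_offsets(fingerprints, num_bits, fp_bytes, start=12):
    """Return the N+2 byte offsets for fingerprints sorted by popcount."""
    counts = [0] * (num_bits + 1)
    for fp in fingerprints:
        counts[popcount(fp)] += 1
    offsets = [start]
    for count in counts:
        # Each entry is the previous offset plus the bytes used by
        # every fingerprint with the previous popcount.
        offsets.append(offsets[-1] + count * fp_bytes)
    return offsets

def candidate_byte_range(offsets, low, high):
    """Byte range covering popcounts low..high, both inclusive."""
    return offsets[low], offsets[high + 1]

# Five 8-bit fingerprints with popcounts 0, 1, 2, 2 and 8.
fps = [b"\x00", b"\x01", b"\x03", b"\x03", b"\xff"]
offsets = build_popc_offsets(fps, num_bits=8, fp_bytes=1)
print(offsets)
print(candidate_byte_range(offsets, 1, 2))
```
&lt;/pre&gt;

Popcounts with no fingerprints simply repeat the previous offset, so
an empty range costs nothing to skip.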

&lt;/P&gt;&lt;P&gt;

The last offset is redundant and must be identical to the length of
the FDNS block. I include it here because my C implementations of the
algorithms are easier if they have this information, and storing it
in the data file means one less malloc or special case to consider.

&lt;/P&gt;&lt;P&gt;

If the POPC block is present then the DENS fingerprints must be
ordered by population count.

&lt;/P&gt;
&lt;h2&gt;NAME block - list of fingerprint identifiers&lt;/h2&gt;
&lt;P&gt;
&lt;pre class="code"&gt;
[NUL byte]
[ name 1 ]
[NUL byte]
[ name 2 ]
 ....
[ name N ]
[NUL byte]
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

If present, each fingerprint is associated with a single name and vice
versa. The name must not contain the NUL character, should be a
unique identifier, and should not contain any control characters. (Are
the names Latin-1 or UTF-8 encoded?  If UTF-8, does the string also
undergo a canonicalization step? Or are only ASCII names allowed?)

&lt;/P&gt;&lt;P&gt;

The names are stored in the same order as the fingerprints in the DENS
block, so that the first name corresponds to the first fingerprint,
the second with the second, and so on. Each name must be NUL
terminated. (However, sorted names would make searching possible in
log time instead of linear. On the other hand, most people will load
the names into a local dictionary, and take the linear hit once.)

&lt;/P&gt;&lt;P&gt;

The first byte in the NAME block must be 0 (the NUL character). This
is present even if there are no fingerprints and hence no names.

&lt;/P&gt;&lt;P&gt;

The purpose of the leading NUL character is to simplify text
searches. To find the identifier "ID", construct a search for
NUL+"ID"+NUL and do a linear search. (It also simplifies binary
searches; pick a point p then backtrack to the previous NUL before
doing the test.)
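
&lt;/P&gt;&lt;P&gt;

The linear search can be sketched in a few lines, assuming the whole
NAME block has been read into a byte string. The function name is
made up for this example.

&lt;pre class="code"&gt;
```python
def find_name_offset(name_block, identifier):
    """Return the byte offset of identifier in the NAME block, or None."""
    # The NULs on both sides prevent matching part of a longer name.
    needle = b"\0" + identifier + b"\0"
    position = name_block.find(needle)
    if position == -1:
        return None
    return position + 1  # skip over the leading NUL
```
&lt;/pre&gt;

Without the leading NUL a search for "CHEMBL1" could not be told apart
from a hit on the tail of some longer identifier.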

&lt;/P&gt;
&lt;h3&gt;NOFF block - offsets into the NAME block&lt;/h3&gt;
&lt;P&gt;
&lt;pre class="code"&gt;
[integer offset to name for fingerprint 1]
[integer offset to name for fingerprint 2]
[integer offset to name for fingerprint 3]
  ...
[integer offset to name for the last fingerprint]
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

This block contains offsets into the NAME block. To get the name
for fingerprint &lt;i&gt;i&lt;/i&gt;, go to the i&lt;sup&gt;th&lt;/sup&gt; entry in the table
then use that as the byte offset of the name in the NAME block.

&lt;/P&gt;&lt;P&gt;

To go the other way requires a binary search. Given the byte offset to
a name in the NAME block, use a binary search to find the entry in the
NOFF block. The byte position of the entry gives the index into the
NOFF block, which is also the index of the fingerprint in the DENS
block.
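
&lt;/P&gt;&lt;P&gt;

Both directions might look like the following sketch. It assumes the
NOFF entries have already been decoded into a Python list, that
indices start at 0, and that the offsets increase with the fingerprint
index, which is what the binary search needs. The function names are
hypothetical.

&lt;pre class="code"&gt;
```python
import bisect

def name_for_index(name_block, noff, i):
    """Fingerprint index to name: one table lookup."""
    start = noff[i]
    end = name_block.index(b"\0", start)  # names are NUL terminated
    return name_block[start:end]

def index_for_offset(noff, name_offset):
    """NAME block offset to fingerprint index: a binary search."""
    i = bisect.bisect_left(noff, name_offset)
    if i == len(noff) or noff[i] != name_offset:
        return None  # not the start of a name
    return i
```
&lt;/pre&gt;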

&lt;/P&gt;&lt;P&gt;

Using sorted names means the mapping from identifier to fingerprint
can be done in log time. Using unsorted names requires linear time.

&lt;/P&gt;&lt;P&gt;

However, I expect most people to load the NAME data into a local
dictionary/hash table, which will take linear time but only once. In
that case, keeping the names in the same sort order as the
fingerprints is much easier to deal with.
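
&lt;/P&gt;&lt;P&gt;

A sketch of that one-time load, assuming the names are stored in
fingerprint order and that the NAME block is in memory as a byte
string. The function name is made up for this example.

&lt;pre class="code"&gt;
```python
def load_name_table(name_block):
    """Map each name to its fingerprint index in one linear pass."""
    if name_block[:1] != b"\0" or name_block[-1:] != b"\0":
        raise ValueError("NAME block must start and end with NUL")
    if len(name_block) == 1:
        return {}  # only the leading NUL, so there are no names
    names = name_block[1:-1].split(b"\0")
    return {name: index for index, name in enumerate(names)}
```
&lt;/pre&gt;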

&lt;/P&gt;
&lt;h2&gt;Unresolved issues&lt;/h2&gt;

&lt;h3&gt;Sorted or unsorted names?&lt;/h3&gt;
&lt;P&gt;

I'm repeating it here because it's tricky. How well supported should
fingerprint lookup by name be? If the names are in sorted order then I
can find the name in log(n) time, then find the fingerprint index in
log(n) time, but I have to do that every time. Building a table
mapping from name to index, on the other hand, requires disentangling
the lookup table and is a linear operation.

&lt;/P&gt;&lt;P&gt;

Preserving the fingerprint sort order means even the fastest program
requires an O(n) time lookup in every case, but it makes building the
mapping from name to fingerprint very easy.

&lt;/P&gt;
&lt;h3&gt;Scaling&lt;/h3&gt;
&lt;P&gt;

This description used 32-bit unsigned integers as lengths and
offsets. That means the largest fingerprint data set can contain at
most 2**32 bytes. Consider someone using the 881 CACTVS substructure
keys aligned to 128 bits. This uses 896 bits or 112 bytes per
fingerprint, which means the block can hold a bit over 38 million
fingerprints.

&lt;/P&gt;&lt;P&gt;

Of course, if someone does an 8K fingerprint then that drops to just
over 4 million fingerprints.
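
&lt;/P&gt;&lt;P&gt;

The arithmetic is easy to check. This is only a back-of-the-envelope
sketch, and the function and parameter names are made up for it.

&lt;pre class="code"&gt;
```python
def max_fingerprints(num_bits, alignment_bits=128, offset_bits=32):
    # Round the fingerprint size up to a multiple of the alignment.
    padded_bits = -(-num_bits // alignment_bits) * alignment_bits
    bytes_per_fp = padded_bits // 8
    # The largest block holds 2**offset_bits bytes.
    return 2**offset_bits // bytes_per_fp

print(max_fingerprints(881))   # 38347922
print(max_fingerprints(8192))  # 4194304
```
&lt;/pre&gt;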

&lt;/P&gt;&lt;P&gt;

Is this a realistic concern for the next 10 years? I think most people
don't use fingerprints over 1024 bits, but on the other hand the
databases are currently in the 50 million compound range.

&lt;/P&gt;&lt;P&gt;

There are three ways to address it. One is to use 64 bit values
instead of 32 bit. Another is to use multiple blocks, with a special
block to indicate the start of a new set of blocks. A third is to
simply have multiple FPF files. One thing to bear in mind is that
multiple sections or multiple files are likely easier for people using
map/reduce strategies.

&lt;/P&gt;
&lt;h3&gt;Storing multiple diverse fingerprint sets in the same file&lt;/h3&gt;
&lt;P&gt;

Substructure fingerprints are likely not good similarity fingerprints
and vice versa. Is there a need to store different fingerprint sets in
the same file?

&lt;/P&gt;&lt;P&gt;

I don't think so. I think using multiple files works well for that.

&lt;/P&gt;
&lt;h3&gt;Preserving order&lt;/h3&gt;
&lt;P&gt;

A substructure screen may go through several stages. For example, if
the input structure contains a hormone substructure then the results
of a general query filter might be passed through one optimized to
distinguish between different types of hormones, with the hitlist
passed from the first stage to the next.

&lt;/P&gt;&lt;P&gt;

Using names as the hitlist identifiers does not work that well because
of the expensive lookup cost. It's better to pass around a list of
offsets. But if the DENS block is reordered then the original sort
order is no longer available. Also, there needs to be a way to report
the match using its input order.

&lt;/P&gt;&lt;P&gt;

There are two solutions to that: don't sort by popcount, or add a
table which maps from input order to DENS order. (Or possibly two
tables, although the inverse mapping can be constructed in O(N) time
with only one malloc because it's a bucket sort.)
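
&lt;/P&gt;&lt;P&gt;

A sketch of building the inverse table, assuming the forward table is
a list where entry i is the DENS position of the i-th input
fingerprint. The names are made up for this example.

&lt;pre class="code"&gt;
```python
def invert_permutation(input_to_dens):
    """Build the DENS order to input order table in O(N) time."""
    dens_to_input = [0] * len(input_to_dens)  # the single allocation
    for input_index, dens_index in enumerate(input_to_dens):
        dens_to_input[dens_index] = input_index
    return dens_to_input
```
&lt;/pre&gt;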

&lt;/P&gt;
&lt;h3&gt;Usefulness of Baldi&lt;/h3&gt;
&lt;P&gt;

The Baldi algorithm suggests optimal searching based on ordering the
search of the popcount bins based on the maximum minimum value of the
bin. (The paper actually suggests allocating an array and sorting, but
as the two sides of the distribution are monotonic decreasing it can
also be done with an iterator doing the equivalent of a merge sort of
the two sides.)
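
&lt;/P&gt;&lt;P&gt;

Here's how I would sketch that iterator in Python. It is not the code
from the paper. It assumes the best possible score for a bin with
popcount b against a query with popcount a is min(a, b)/max(a, b), and
it leans on heapq.merge to do the merge step. The function name is
hypothetical.

&lt;pre class="code"&gt;
```python
import heapq
from fractions import Fraction

def bins_by_best_score(query_popcount, num_bits):
    """Yield (popcount, upper bound) pairs, best bound first."""
    a = query_popcount
    if a == 0:
        raise ValueError("the bound is undefined for an empty query")
    # Both sides start next to the query popcount and get worse.
    below = ((Fraction(b, a), b) for b in range(a, -1, -1))
    above = ((Fraction(a, b), b) for b in range(a + 1, num_bits + 1))
    for bound, b in heapq.merge(below, above, reverse=True):
        yield b, bound
```
&lt;/pre&gt;

The bins come out in an order which hops between the two sides of the
query popcount, which is the back-and-forth access pattern.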

&lt;/P&gt;&lt;P&gt;

I implemented it but excepting high similarity searches (perhaps 80%
or so? I did this over a year ago), I didn't notice any real
improvement in Tanimoto searches. I conjectured that non-linear disk
seeks were causing the problem. The memory and especially disk
subsystems are really optimized for forward searches but the Baldi
algorithm jumps back-and-forth, and breaks all the lovely
cache-prefetching going on behind the scenes.

&lt;/P&gt;&lt;P&gt;

While I'm convinced that the Baldi limits are useful, the reordering
to make it fast ends up making the system more complicated.

&lt;/P&gt;&lt;P&gt;

BTW, SSDs (Solid State Drives) fix part of the problem by eliminating
seek time, though not cache prefetching. Another solution might be to
store two sets of fingerprints, with the second sorted in reverse
popcount order and on another disk. Then with two threads going
forward... Hmmm...

&lt;/P&gt;&lt;P&gt;

In any case, Swamidass and Baldi reported good speedups for top-K
searches. I haven't seen their code, but I have seen other search code
which uses a very slow scoring function, so I would like a platform
where it's easier for people to compare against known fast
implementations of different approaches.

&lt;/P&gt;
&lt;h3&gt;Fingerprint density&lt;/h3&gt;
&lt;P&gt;

I said that I'm assuming dense fingerprints, where the bits are
random. Most chemical fingerprints are sparse, with under 20% density
for similarity fingerprints and even less for feature key fingerprints
used for substructure screening.

&lt;/P&gt;&lt;P&gt;

Perhaps using run-length encoding of the bits in a fingerprint, or
run-length encoding of the inverted index, might be better. However, I
have a lot less experience with that, and storing those data
structures on disk is more complex than I want to deal with.

&lt;/P&gt;&lt;P&gt;

CPUs are very fast. A test I did a few years ago suggests that disk
and memory access are the limiting factor, and not the algorithm. If
that's the case then it's more worthwhile to make the CPU do extra
work while it's waiting for less data. On the other hand, another test
showed my code was still rather slower than computing the md5 checksum
of the same file. These tests were suggestive, not rigorous.

&lt;/P&gt;
&lt;h3&gt;Streaming output&lt;/h3&gt;
&lt;P&gt;

Each chunk starts off with a length. This leads to several
complications. It's hard to stream everything to a pipe unless the
output size is known in advance, which likely means storing all of the
fingerprints in memory before generating the output.

&lt;/P&gt;&lt;P&gt;

If the output is a file, not a stream, then it's possible to write
all the fingerprints to the file, then seek back to the length field
and fill it in, and then seek back to the end. They wouldn't be sorted, but a
post-processor could sort a file which is too large to keep in
memory. (Which given that my laptop has 3GB of memory doesn't seem a
real concern.)
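
&lt;/P&gt;&lt;P&gt;

A sketch of the seek-back trick. The chunk layout used here (a 4 byte
big-endian length, then a 4 byte tag, then the data) is an assumption
for the example and not a settled part of the format, and the function
name is made up.

&lt;pre class="code"&gt;
```python
import struct

def write_chunk(outfile, tag, pieces):
    """Write one chunk to a seekable file without knowing its size."""
    length_pos = outfile.tell()
    outfile.write(struct.pack("!I", 0))  # placeholder length
    outfile.write(tag)
    length = 0
    for piece in pieces:
        outfile.write(piece)
        length += len(piece)
    end_pos = outfile.tell()
    outfile.seek(length_pos)
    outfile.write(struct.pack("!I", length))  # fill in the real length
    outfile.seek(end_pos)  # ready for the next chunk
    return length
```
&lt;/pre&gt;

This only works when the output supports seek(), which rules out
pipes.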

&lt;/P&gt;&lt;P&gt;

That would also mean keeping the names in memory until the
fingerprints are written, since the above trick only works for a
single block.

&lt;/P&gt;&lt;P&gt;

If memory is critical (which shouldn't be the case for names, as they
are usually only 10 bytes or so), then an implementation might write
the blocks to the filesystem and merge later on. Just like the old
days of dealing with tape drives. But really, this seems more a
theoretical concern than something to worry about since the point of
an FPF file is to use it as a data source for a pipeline but not part of
the pipeline stream.

&lt;/P&gt;
&lt;h2&gt;Development and future&lt;/h2&gt;
&lt;P&gt;

This is something I've been thinking about for a while but haven't
gotten around to because I'm not working with fingerprints right
now. I am a consultant so if you're interested in funding me, &lt;a
href="mailto:dalke@dalkescientific.com"&gt;email me&lt;/a&gt;. I also do have
other, paying work so this doesn't have much priority.

&lt;/P&gt;&lt;P&gt;

I like the general approach of using PNG blocks. I've implemented
parts of the system, and the resulting code has been nice and
clean. The format I sketched out fits in well with the
high-performance C code for Tanimoto search and substructure filter
codes I wrote a couple of years ago. I also like how the format is
relatively future proof and if someone needs some new data chunk it
isn't hard to add.

&lt;/P&gt;&lt;P&gt;

I do need feedback from others. If you have ideas, or comments, or
perhaps know about existing fingerprint formats I should use instead,
then

&lt;a href="http://dalkescientific.blogspot.com/2010/01/fingerprint-file-format.html"&gt;leave a message&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

If there's enough interest in this, I'll set up some sort of project
repository and mailing list for it, so it's up to you to speak up!

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/01/11/fingerprint_file_format.html</guid><pubDate>Mon, 11 Jan 2010 12:00:00 GMT</pubDate></item><item><title>Content-Disposition bug in browsers?</title><link>http://www.dalkescientific.com/writings/diary/archive/2010/01/04/content_disposition_bug.html</link><description>&lt;P&gt;

Would someone who knows the relevant web specifications double-check
me on this? I think I've found a bug in how Safari, Firefox, and Links
handle Content-Disposition in file uploads. (This is part of how a
form sends a file to the server and not how the server sends a file
back to the browser.) I did the following on a Mac OS X 10.6 ("Snow
Leopard") machine.

&lt;/P&gt;&lt;P&gt;

If you have input or comments, or test results for your browser,
&lt;a href="http://dalkescientific.blogspot.com/2010/01/content-disposition.html"&gt;let me&lt;/a&gt;
know. I looked in Firefox's Bugzilla but found nothing about it, nor
did this problem come up in general Google searches using what I think
are the relevant search phrases.

&lt;/P&gt;&lt;P&gt;

My appeal to the lazyweb: If this is a bug, and you know how to submit
to the relevant trackers, please do so. Figuring out each one and
tracking the comments for each site is a nuisance.

&lt;/P&gt;
&lt;h2&gt;Reproducible&lt;/h2&gt;
&lt;P&gt;

I created a file named:

&lt;pre class="code"&gt;
Evil"; name="fred
&lt;/pre&gt;

(It gets more evil if the filename contains a newline, but I'm not going
to work through an example of that since I want something which is easy
for you to create.)

&lt;/P&gt;&lt;P&gt;

I created a simple form

&lt;pre class="code"&gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;&amp;lt;title&amp;gt;content-disposition test&amp;lt;/title&amp;gt;&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
 &amp;lt;form method="POST" action="http://localhost:8888/" enctype="multipart/form-data"&amp;gt;
  &amp;lt;input type="file" name="blah"&amp;gt;
  &amp;lt;input type="submit"&amp;gt;
 &amp;lt;/form&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

I used netcat to listen for incoming requests

&lt;pre class="code"&gt;
nc -l 8888
&lt;/pre&gt;

&lt;/P&gt;&lt;P&gt;

To put it all together, I loaded the HTML page in a browser, selected
the evil file, and submitted it to netcat. Quick and dirty, but it
works, and it proves that nothing in the server is messing things up.

&lt;/P&gt;&lt;P&gt;

What follows is what I saw for different browsers. The results are
obviously suspicious. It's easy to make the header not follow &lt;a
href="http://www.ietf.org/rfc/rfc2183.txt"&gt;RFC 2183&lt;/a&gt;, which is the
relevant spec. For example, remove one of the quotes.
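
&lt;/P&gt;&lt;P&gt;

My guess at what is going on is sketched below. The first function
pastes the filename into the header unchanged. The second escapes
backslashes and double quotes with a backslash, which is the
quoted-pair rule as I understand it from the RFC 822 family of
specs. Both function names are made up, and I have not checked what
any real server accepts.

&lt;pre class="code"&gt;
```python
def naive_content_disposition(field_name, filename):
    # Pastes the filename in unchanged.
    return 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
        field_name, filename)

def quoted_content_disposition(field_name, filename):
    # Backslash-escape backslashes and double quotes (a quoted-pair).
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return 'Content-Disposition: form-data; name="%s"; filename="%s"' % (
        field_name, escaped)

evil = 'Evil"; name="fred'
print(naive_content_disposition("blah", evil))
print(quoted_content_disposition("blah", evil))
```
&lt;/pre&gt;

The naive version produces a header with two name parameters.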

&lt;/P&gt;
&lt;h2&gt;Safari 4.0.4&lt;/h2&gt;
&lt;P&gt;

&lt;pre class="code"&gt;
------WebKitFormBoundaryO5rrXim+NdIE0npI
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: application/octet-stream
&lt;/pre&gt;


&lt;/P&gt;
&lt;h2&gt;Firefox 3.5.5&lt;/h2&gt;
&lt;P&gt;

&lt;pre class="code"&gt;

-----------------------------94839879511149195311657737442
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: application/octet-stream
&lt;/pre&gt;


&lt;/P&gt;
&lt;h2&gt;Links 2.2&lt;/h2&gt;
&lt;P&gt;

&lt;pre class="code"&gt;

-----------------------------00000000000000000000000000000
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: text/plain; charset=us-ascii
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Opera 10.10&lt;/h2&gt;
&lt;P&gt;

Opera 10.10 is the odd one out. As far as I can tell, it's safe from
evil filenames. It doesn't allow me to even submit filenames
containing a newline. The newline character gets removed from the
name. If there's a semicolon it removes that character and all
following text, so that "simple;semicolon" becomes
filename="simple". (Perhaps this is VMS versioning legacy?)

&lt;/P&gt;&lt;P&gt;

If I use a double quote (") and no other non-letter characters in the
filename then the result is

&lt;pre class="code"&gt;
------------lHwjN2s9agu9VAHOJ1ChbS
Content-Disposition: form-data; name="blah"; filename="default"
Content-Type: application/octet-stream
&lt;/pre&gt;

In other words, Opera does not generate invalid requests here.

&lt;/P&gt;
&lt;h2&gt;Newlines&lt;/h2&gt;
&lt;P&gt;

I did the same tests with a newline character in the filename and
found that Safari and Firefox will upload a file whose name contains
the newline, and the newline is placed in the Content-Disposition field
unaltered. This lets me craft new headers for the part, including a
replacement Content-Disposition header. Useful? Probably not.

&lt;/P&gt;
&lt;h2&gt;Security Vulnerabilities&lt;/h2&gt;
&lt;P&gt;

None that I can think of. There are other ways to craft ill-formatted
requests than using a browser, so the only possible attacks are from
people who have only the ability to create a filename.

&lt;/P&gt;&lt;P&gt;

Though it would be really cool if someone proved me wrong. Is there a
server out there which trusts the 'size' field and tries to
preallocate 2GB of data, just because of a well-constructed upload
filename? If you come up with something, &lt;a
href="http://dalkescientific.blogspot.com/2010/01/content-disposition.html"&gt;let
me know&lt;/a&gt;!

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2010/01/04/content_disposition_bug.html</guid><pubDate>Mon, 04 Jan 2010 12:00:00 GMT</pubDate></item><item><title>Problems with TDD</title><link>http://www.dalkescientific.com/writings/diary/archive/2009/12/29/problems_with_tdd.html</link><description>&lt;P&gt;
If you have not yet read it, please read Maria Siniaalto's 15 page
"&lt;a
href="http://www.agile-itea.org/public/deliverables/ITEA-AGILE-D2.7_v1.0.pdf"&gt;Test-Driven
Development: empirical body of evidence&lt;/a&gt;." It summarizes the few
empirical studies done to evaluate the effectiveness of TDD. In the
conclusion you'll find:

&lt;blockquote&gt;
Based on the findings of the existing studies, it can be concluded
that TDD seems to improve software quality, especially when employed
in an industrial context. The findings were not so obvious in the
semi-industrial or academic context, but none of those studies
reported on decreased quality either. The productivity effects of TDD
were not very obvious, and the results vary regardless of the context
of the study. However, there were indications that TDD does not
necessarily decrease the developer productivity or extend the project
lead-times: In some cases, significant productivity improvements were
achieved with TDD while only two out of thirteen studies reported on
decreased productivity. However, in both of those studies the quality
was improved.

&lt;br /&gt;
&lt;br /&gt;

The empirical evidence on the practical use of TDD and its impacts on
software development are still quite limited.

&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

I mention this first because I've concluded that not only is TDD not
useful for me but I don't think it's a generally useful technique. The
important requirements are to have good, complete automated unit
tests, to develop code for testing, and to do iterative improvement
through refactoring and rewriting. TDD promotes those, but my
experience is that TDD pins down the code too early and my observation
is that TDD by itself ignores certain classes of essential unit tests.

&lt;/P&gt;&lt;P&gt;

My position against TDD will be contentious to some, like those who
believe that TDD is a required component in modern best-practices
development. I quoted Siniaalto to show that there is no strong
evidence to back that belief. I fully expect someone to tell me that
TDD drastically improved their development style. My response will be
they learned good practices, but those practices don't require TDD and
can as easily be learned without TDD.

&lt;/P&gt;&lt;P&gt;

By the way, while my conclusion is in opposition to Siniaalto's, it's
because the most successful TDD paper in her report comes from
&lt;a href="http://collaboration.csc.ncsu.edu/laurie/Papers/MAXIMILIEN_WILLIAMS.PDF"&gt;Maximilien
and Williams&lt;/a&gt;
about their experience at IBM. They went from ad hoc unit testing to
good development practices based on TDD. I think good testing
practices without using TDD would have given the same results.

&lt;/P&gt;&lt;P&gt;

Before going further I'll also quote from Kent Beck's "Test-driven
development: by example":

&lt;blockquote&gt;

One of the ironies of TDD is that it isn't a testing technique (the
Cunningham Koan). It's an analysis technique, a design technique,
really a technique for structuring all the activities of development.

&lt;/blockquote&gt;

This entire essay will describe why TDD is a weak testing technique
and an incomplete development technique. I'll bring up other
techniques which are not part of TDD but end up leading to better unit
tests that should help make you more confident that your code works.

&lt;/P&gt;
&lt;h2&gt;Test first vs. test last vs. good testing&lt;/h2&gt;
&lt;P&gt;

By TDD I mean Test Driven Development, and specifically its test first
approach. Wikipedia describes &lt;a
href="http://en.wikipedia.org/wiki/Test-driven_development"&gt;TDD&lt;/a&gt;
as:

&lt;blockquote&gt;

First the developer writes a failing automated test case that defines
a desired improvement or new function, then produces code to pass that
test and finally refactors the new code to acceptable standards.

&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

By contrast, people also talk about "test last". Test last is the
extreme opposite of "test first". One good definition of
&lt;a href="http://xunitpatterns.com/test%20last%20development.html"&gt;test
last&lt;/a&gt; is:

&lt;blockquote&gt;
testing should be done before the code goes into production; it does
not imply that the tests are automated. 
&lt;/blockquote&gt;

&lt;/P&gt;&lt;P&gt;

When I say that people shouldn't do TDD I do &lt;b&gt;not&lt;/b&gt; mean they
should do test last development instead. That is a false dichotomy, and
it annoys me when I read 
&lt;a
href="http://www.wiziq.com/tutorial/19538-TDD-Overview"&gt;descriptions&lt;/a&gt;
which present those two styles as the only possibilities.

&lt;/P&gt;&lt;P&gt;

My own practice is to have good, automated tests, but these don't get
put into place until the cost/benefit ratio makes the tests
worthwhile; which is rarely at the start of the code development and
always by the end. The tests themselves are guided by the code, and the
knowledge of where the failure cases might be in the code. In
addition, I'll add tests which check the expected input range, and
after the code is done I'll add tests which check my belief that the
code is done, as well as, in some cases, tests driven by code coverage or
other reasons.

&lt;/P&gt;&lt;P&gt;

I expect people to point out that TDD does not preclude other testing
strategies, to fill in those gaps. I completely agree. I agree so much
that I mostly use those other good strategies, and not TDD. TDD seems
to add little to the result.

&lt;/P&gt;
&lt;h2&gt;Worked out TDD examples&lt;/h2&gt;
&lt;P&gt;

I want to base my response in at least the spirit of empirical
research. I can't, because I don't (and neither likely do you) have
the resources to do those tests. What I can do is find some
descriptions of TDD used to implement a problem and make comments
about them to highlight limitations in TDD.

&lt;/P&gt;&lt;P&gt;

I give full props to those who have described the steps they go
through to work on a problem. Even in the simplest of cases it's a lot
of work.

&lt;/P&gt;&lt;P&gt;

I found a number of basic TDD tutorials, based around addition and
subtraction, either with basic
&lt;a href="http://agilesoftwaredevelopment.com/videos/test-driven-development-basic-tutorial"&gt;add() and sub()&lt;/a&gt; 
functions or through depositing and withdrawing money from a bank account
&lt;a href="http://www.parlezuml.com/tutorials/tdd.html"&gt;[1]&lt;/a&gt; and
&lt;a href="http://www.codeproject.com/KB/dotnet/tdd_in_dotnet.aspx#h11"&gt;[2]&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

Those were too simple to have problems. I wanted something more
complex. The most complete examples I found were Robert Martin's

&lt;a
href="http://butunclebob.com/ArticleS.UncleBob.ThePrimeFactorsKata"&gt;Prime
Factors Kata&lt;/a&gt;, which he also works through in &lt;a href="http://www.vimeo.com/2499161"&gt;a
video&lt;/a&gt;,

and implementing the Fibonacci sequence in Gary Bernhardt's blog post
&lt;a
href="http://blog.extracheese.org/2009/11/how_i_started_tdd.html"&gt;How
I started TDD&lt;/a&gt; and Kent Beck's
"&lt;a
href="http://books.google.com/books?id=gFgnde_vwMAC&amp;pg=PA211&amp;lpg=PA211&amp;dq=kent+beck+fibonacci&amp;source=bl&amp;ots=enJpsvZppF&amp;sig=ROnMnlJqoP562Kzcrxi0km8dcmM&amp;hl=en&amp;ei=47s2S-7vLJLC-QbqkuyjCQ&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=1&amp;ved=0CAoQ6AEwAA#v=onepage&amp;q=&amp;f=false"&gt;Test-driven
development: by example&lt;/a&gt;". I don't know if Bernhardt's example is
derived from Beck's, but it's the one I came across first.

&lt;/P&gt;
&lt;h3&gt;Prime Factors&lt;/h3&gt;
&lt;P&gt;

The Prime Factor Kata asks for a function which takes a number and
returns its prime factors in an ordered list, including duplicates.
For example, 12 would return 2, 2, 3. The test cases were 1, 2, 3, 4,
6, 8, and 9 and the kernel of the solution was:

&lt;pre class="code"&gt;
  public static List&amp;lt;Integer&amp;gt; generate(int n) {
    List&amp;lt;Integer&amp;gt; primes = new ArrayList&amp;lt;Integer&amp;gt;();

    for (int candidate = 2; n &amp;gt; 1; candidate++)
      for (; n%candidate == 0; n/=candidate)
        primes.add(candidate);

    return primes;
  }
&lt;/pre&gt;

&lt;/P&gt;
&lt;h3&gt;Fibonacci&lt;/h3&gt;
&lt;P&gt;

The Fibonacci sequence examples checked that the first few outputs
were correct, giving fib(i=0, 1, ...) = 0, 1, 1, 2, 3, 5 . Both people
ended with variations of the classic recursive solution, here from Bernhardt:

&lt;pre class="code"&gt;
def fib(n):
    if n &amp;lt;= 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)
&lt;/pre&gt;

&lt;/P&gt;
&lt;h2&gt;Problem: TDD doesn't emphasize good test cases&lt;/h2&gt;
&lt;P&gt;

When I looked at Martin's prime factors code, I first thought the code
was wrong. It tests to see if 2 is a divisor, then 3, then 4, then 5,
and so on. 4 can never be a prime factor of the input because 4
isn't prime. Why does his code check for that possibility? Was there a
bug?

&lt;/P&gt;&lt;P&gt;

Code should be readable, so that others can understand it and verify
that it works. In the same vein, tests should serve as a way for
others to check that the code is working. I looked at the tests, and
noticed that the only prime factors tested were 2 and 3. Perhaps if 5
was a prime factor then there would be a problem when the code got to
4?

&lt;/P&gt;&lt;P&gt;

I couldn't tell from the tests, so I had to look more closely at the
code. It then became obvious. All factors of 2 were removed, so there
was no way that 4 could be a divisor. By construction, no non-prime
candidate could ever work, so will never be added to the list.

&lt;/P&gt;&lt;P&gt;

The tests were not good enough to minimize doubt that the code
contained bugs. I can think of a couple of simple variations of the
code which would contain bugs and which would pass the tests. Yes, the
tests were enough to help Martin get to a solution, but they shouldn't
have been enough to convince him, much less others, that the code was
right.

&lt;/P&gt;&lt;P&gt;

Some good tests might have included the primes 17 and 97 as well as 91
(=7*13). I can't think of simple bugs to put in Martin's code which
would still let those test cases pass, excepting a hard-coded
upper limit to the search space, which would easily show up on code
review.
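
&lt;/P&gt;&lt;P&gt;

To make that concrete, here is my Python port of the kernel of
Martin's solution along with those extra checks. The port and the
function name are mine, not his.

&lt;pre class="code"&gt;
```python
def prime_factors(n):
    """Trial division by 2, 3, 4, 5, ... as in the kata solution."""
    factors = []
    candidate = 2
    while n > 1:
        # Every smaller prime was already divided out, so a composite
        # candidate can never divide n here.
        while n % candidate == 0:
            factors.append(candidate)
            n //= candidate
        candidate += 1
    return factors

assert prime_factors(17) == [17]
assert prime_factors(97) == [97]
assert prime_factors(91) == [7, 13]
```
&lt;/pre&gt;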

&lt;/P&gt;
&lt;h3&gt;Fibonacci Sequence&lt;/h3&gt;
&lt;P&gt;

Bernhardt's Fibonacci Sequence did test enough numbers that I was
pretty sure that algorithm would come up with the correct answers,
although I would have preferred some larger numbers, like fib(12) =
144. (I picked that one because it's cute that 144=12*12.)

&lt;/P&gt;
&lt;h2&gt;Problem: When do you add tests that should pass?&lt;/h2&gt;
&lt;P&gt;

TDD says to add a failing test then fix the code. What do you do with
tests which are expected to pass? For example, suppose I finished the
prime factors code but upon review of the tests I have a niggling
uncertainty that it handles prime factors greater than 3. I want to
add a test case to find the factors of 91.

&lt;/P&gt;&lt;P&gt;

I asked this of Bernhardt, and he kindly addressed that in his followup
essay "&lt;a
href="http://blog.extracheese.org/2009/11/the_limits_of_tdd.html"&gt;The
Limits of TDD&lt;/a&gt;."

&lt;blockquote&gt;

After the tests drove the first fully-functional design out, I'd add
exactly the types of tests you describe. These wouldn't fail at first,
but that's fine; TDD doesn't preclude such things, they're just
outside its scope. What I would do, to make sure the tests were
honest, is to intentionally break the code, watch them fail (probably
along with several other tests), then unbreak the code. This gives me
at least some of the confidence that TDD does - I know that something
is actually being tested.

&lt;/blockquote&gt;

This is a bit different than what I would do. If the code is supposed
to work then I don't want to touch the code at all. Instead, I add the
test but make sure the test is supposed to fail, perhaps by saying the
factors of 91 are 5 and 13. Seeing the failure is a check that I
didn't make a stupid mistake in writing the test. Then I fix the test
and see that it passes.

&lt;/P&gt;&lt;P&gt;

My approach is not as close to TDD as his, although it is similar. But I want to
highlight his comment that "TDD doesn't preclude such things, they're
just outside its scope."

&lt;/P&gt;&lt;P&gt;

That's exactly my point, and notably in disagreement with Beck's
statement that TDD is "really a technique for structuring all the
activities of development."

&lt;/P&gt;&lt;P&gt;

Other tests and other development approaches besides TDD are needed
for good software development, including approaches which are
conceptually quite close to TDD but not part of it. I say that the
skills that are needed to detect and add good passing tests can
equally be applied to developing good unit tests in the first place.

&lt;/P&gt;&lt;P&gt;

Only without the extra requirement of coming up with all of the tests
first.

&lt;/P&gt;
&lt;h2&gt;Problem: TDD does not consider worst-case scenarios&lt;/h2&gt;
&lt;P&gt;

In "good test cases" I said that TDD doesn't stress the tests needed
to convince yourself or others that the code was right, only tests to
implement the code you think is right. Here I'll talk about a
different sort of unit test that TDD doesn't help with - worst-case
scenarios.

&lt;/P&gt;
&lt;h3&gt;Prime Factors Kata&lt;/h3&gt;
&lt;P&gt;

I implemented the Prime Factors Kata on my own. It took me a while
too. I implemented the Sieve of Eratosthenes to generate prime
factors, and only searched for factors up to sqrt(n). This has been my
general approach for this sort of problem since college. I ended up
with 29 lines of code, and I couldn't understand how Martin was able
to write:

&lt;blockquote&gt;
The final algorithm is three lines of code. Interestingly enough
there are 40 lines of test code.
&lt;/blockquote&gt;

(BTW, I counted 15 total LOC in the program and 43 LOC in the test
module, or 3 vs. 12 if you only talk about "real" code, vs. import
statements, function definitions, lines with only a closing brace, and
so on. In either way of counting, it's still less than my 29 lines of
code.)

&lt;/P&gt;&lt;P&gt;

If you listen closely in Martin's video you'll hear that he considers
his three line solution to be "more elegant" than the Sieve solution.
I really didn't understand that assertion. His solution is going to be slow
for almost all cases. I timed Python implementations of our two
algorithms for numbers around 200,000. His was 150 times slower than my
sieve-based solution, and it gets much worse after that.

&lt;/P&gt;&lt;P&gt;

If you listen even more closely, I think you'll hear the reason. He
introduced the problem by saying his kid was learning about prime
factors at school, and Martin wanted a program which could solve the
same sort of problem. In that case, the prime factors are small. Few
teachers would be so mean as to require their students to find the
prime factors of 524,287 by hand.

&lt;/P&gt;&lt;P&gt;

If the possible input range was only, say, 1 to 150, then I could see
how Martin's code is elegant. But if the input range is 1 to 2**32
(which is more like what I expected), then it's clearly not elegant
because finding that 2**31-1 is prime will take about 2**31 modulo
tests. Computers are fast, but that's excessive. (BTW, it's also cute
that 2**31-1 is both the maximum signed 32-bit integer and a Mersenne
prime.)

&lt;/P&gt;&lt;P&gt;

In either case, there should be tests for values which represent a
worst-case scenario. In this case that would be a prime at the high
end of the expected range. His largest test was 9. Mine was 2**31-1.

&lt;/P&gt;
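&lt;P&gt;

As a sketch of what such worst-case tests could look like, assuming an
input range of 1 to 2**32: the &lt;tt&gt;prime_factors&lt;/tt&gt;
function below is a hypothetical stand-in (trial division that stops at
sqrt(n)), and the test names are invented for this example.

&lt;pre&gt;
```python
import unittest


def prime_factors(n):
    """Hypothetical stand-in: trial division that stops at sqrt(n)."""
    result = []
    candidate = 2
    while n >= candidate * candidate:
        while n % candidate == 0:
            result.append(candidate)
            n //= candidate
        candidate += 1
    if n > 1:
        result.append(n)
    return result


class WorstCaseTests(unittest.TestCase):
    def test_largest_prime_in_range(self):
        # 2**31 - 1 is prime, so no divisor is ever found.
        self.assertEqual(prime_factors(2**31 - 1), [2**31 - 1])

    def test_two_large_prime_factors(self):
        # Both factors sit next to 2**16, close to the sqrt of the range limit.
        self.assertEqual(prime_factors(65521 * 65537), [65521, 65537])

    def test_longest_factor_list(self):
        # The top of the assumed range has the most factors: 32 twos.
        self.assertEqual(prime_factors(2**32), [2] * 32)
```
&lt;/pre&gt;

Saved in a module, these can be run with &lt;tt&gt;python -m unittest&lt;/tt&gt;.
An implementation without the sqrt cutoff would still pass them
eventually, so a test like the first one is mostly useful when someone
notices how long it takes.

&lt;/P&gt;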
&lt;h3&gt;Fibonacci&lt;/h3&gt;
&lt;P&gt;

There are three problems with the Fibonacci implementations. One is
that the classic recursive solution (without memoization) takes
exponential time. I implemented the solution iteratively and compared
the results. Bernhardt's solution for fib(32) takes about as long as
my iterative solution for fib(100000), and after a minute I gave up
computing fib(40) recursively.

&lt;/P&gt;&lt;P&gt;

Another is that Python's default recursion limit is 1000 nested
function calls. Doing fib(1500) quickly gives a "RuntimeError: maximum
recursion depth exceeded" exception.

&lt;/P&gt;&lt;P&gt;
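
Both problems are easy to reproduce. The following is a minimal sketch,
not the code from either video; it assumes the numbering fib(0) == 0 and
fib(1) == 1, and the helper &lt;tt&gt;recursion_fails&lt;/tt&gt; is made
up for this example. In Python 3 the exception is reported as
&lt;tt&gt;RecursionError&lt;/tt&gt;, which is a subclass of
&lt;tt&gt;RuntimeError&lt;/tt&gt;.

&lt;pre&gt;
```python
import sys


def fib_recursive(n):
    """Classic recursive definition, with no memoization."""
    if n in (0, 1):
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Same sequence computed with a loop: linear in n, constant stack depth."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def recursion_fails(n):
    """Return True if fib_recursive(n) runs out of stack.

    Only call this with n above the recursion limit; for smaller n the
    recursive version finishes, but takes exponential time to do so.
    """
    try:
        fib_recursive(n)
    except RecursionError:
        return True
    return False


# The recursive version fails fast once n exceeds the recursion limit,
# because the first chain of calls goes n levels deep before returning.
too_deep = sys.getrecursionlimit() + 500
```
&lt;/pre&gt;

With these definitions &lt;tt&gt;recursion_fails(too_deep)&lt;/tt&gt;
reports the failure almost immediately, while
&lt;tt&gt;fib_iterative(too_deep)&lt;/tt&gt; simply returns a large integer.

&lt;/P&gt;&lt;P&gt;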

The last is in Beck's code. Assuming the recursive solution could
compute it in time, fib(48) is larger than 2**32. He uses a 32-bit
Java integer, so his code would silently overflow.

&lt;/P&gt;
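&lt;P&gt;

The overflow can be imitated in Python, whose own integers never
overflow. This sketch assumes the numbering fib(0) == 0 and
fib(1) == 1, and it emulates a signed 32-bit addition with modular
arithmetic; &lt;tt&gt;to_int32&lt;/tt&gt; and &lt;tt&gt;fib_int32&lt;/tt&gt;
are names invented here, not anything from Beck's code.

&lt;pre&gt;
```python
def to_int32(value):
    """Wrap an integer into the signed 32-bit two's-complement range."""
    return (value + 2**31) % 2**32 - 2**31


def fib_int32(n):
    """Iterative Fibonacci where every addition wraps like a 32-bit int."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, to_int32(a + b)
    return a


def fib_exact(n):
    """Reference version using Python's unbounded integers."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Find the first n where the wrapped result stops matching the exact one.
first_wrong = next(n for n in range(100) if fib_int32(n) != fib_exact(n))
```
&lt;/pre&gt;

Under that numbering &lt;tt&gt;first_wrong&lt;/tt&gt; is 47: the exact value
2971215073 no longer fits in a signed 32-bit integer and comes back as
-1323752223, with no exception raised.

&lt;/P&gt;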
&lt;h3&gt;Discussion&lt;/h3&gt;
&lt;P&gt;

TDD creates unit tests which are used to develop and refactor
code. These tests are only a subset, and not even an essential subset,
of the tests needed to check that the code implements the requested
feature. You may think you are finished with the code and you pass all
the TDD tests, but you still aren't finished with the development
process. You still have to do other important unit tests.

&lt;/P&gt;&lt;P&gt;

I'm certain that Beck and Bernhardt know the limitations of their
Fibonacci implementations. I'm really surprised they didn't mention
the problems in their solutions. It would have been the perfect place
to show that other types of unit tests can't be ignored, and discuss
how to fit them into the TDD development process.

&lt;/P&gt;&lt;P&gt;

I also wish that Martin had been less dismissive of the sieve
solution. It's obvious that others have mentioned it to him. He should
have responded by pointing out that the solution was overkill for the
problem range. I also wish he had included tests for the high end of
that range. (I have the idea based on
&lt;a href="http://en.reddit.com/r/programming/comments/1kth0/test_driven_design_vs_thought_driven_design/c1lkvh"&gt;other writings&lt;/a&gt;
that he's not an algorithms person, so he also might not have been
aware of the performance problems in his solution.)


&lt;/P&gt;
&lt;h2&gt;Problem: TDD doesn't give you confidence that the code works&lt;/h2&gt;
&lt;P&gt;

Many TDD advocates bring up confidence as a reason for doing TDD. In
his book Beck writes:

&lt;blockquote&gt;

Psychological - Having a green bar feels completely different from
having a red bar. When the bar is green, you know where you stand. You
can refactor from there with confidence.

&lt;/blockquote&gt;

and others write similar things.

&lt;/P&gt;&lt;P&gt;

If your goal is to be confident in your code, then TDD is a weak
method for developing those tests of confidence. I've now shown a
couple of TDD examples, which were done with TDD principles foremost
in mind, but which failed to consider worst-case scenarios. You should
not be confident that your code works just because your TDD tests
pass.

&lt;/P&gt;&lt;P&gt;

When I write my code, I'm not confident that it works. I'm not even
confident that a refactoring works despite passing all of the unit
tests. I worry about edge cases I didn't think of, I worry about
implementation flaws, I worry about worst-case scenarios.

&lt;/P&gt;&lt;P&gt;

If I write the tests first, I also worry that I've overfit my code to
the tests. This is a problem that happens in statistical modelling.
Given any set of data points, I can fit them to a model. The next
question is, is the model valid and useful? The way to check is to use
the model to make predictions, and see how well they match reality.
This in turn means testing the model with data which wasn't used to
make the model.

&lt;/P&gt;&lt;P&gt;

I feel the same way about my code. I start with doubt that my program
works, but with confidence that I can develop new tests which should
pass if the code is correct. To reduce doubt, I'll write new tests and
see if they pass or fail. Passing tests reduce my doubt; failing
tests mean I need to figure out what happened, and I'm back to more
code development.

&lt;/P&gt;&lt;P&gt;
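
One way to write such expected-to-pass tests is to check properties of
the answer on inputs that played no part in developing the code. The
sketch below does that for prime factorization. Everything in it is
hypothetical: &lt;tt&gt;prime_factors&lt;/tt&gt; stands in for the code
under test, the checker uses a deliberately naive and independent
primality test, and &lt;tt&gt;math.prod&lt;/tt&gt; assumes Python 3.8 or
later.

&lt;pre&gt;
```python
import math


def prime_factors(n):
    """Hypothetical code under test: trial division that stops at sqrt(n)."""
    result = []
    candidate = 2
    while n >= candidate * candidate:
        while n % candidate == 0:
            result.append(candidate)
            n //= candidate
        candidate += 1
    if n > 1:
        result.append(n)
    return result


def is_prime_slow(n):
    """Independent, deliberately naive primality check used only by the test."""
    return n > 1 and all(n % d != 0 for d in range(2, n))


def check_factorization(n):
    """Raise AssertionError if prime_factors(n) is not a valid factorization."""
    factors = prime_factors(n)
    assert math.prod(factors) == n, "factors do not multiply back to n"
    assert factors == sorted(factors), "factors are not in ascending order"
    assert all(is_prime_slow(f) for f in factors), "a factor is not prime"


def check_range(start, stop):
    """Check every n in range(start, stop); return how many were checked."""
    count = 0
    for n in range(start, stop):
        check_factorization(n)
        count += 1
    return count
```
&lt;/pre&gt;

A call such as &lt;tt&gt;check_range(1000, 1200)&lt;/tt&gt; exercises two
hundred inputs that no test-first step ever asked for.

&lt;/P&gt;&lt;P&gt;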

TDD by itself cannot give you that confidence because it excludes the
idea of adding tests which are expected to pass. On the other hand,
you can develop the unit tests just after the code is written (but
long before it's deployed as is done with test-last), guided by
knowledge of how the software is implemented and experience in how
code can fail. That gives you all the benefits of TDD, and it also
handles the cases that TDD doesn't. TDD is one technique for learning
those skills, but it is not an essential technique.

&lt;/P&gt;
&lt;h2&gt;Incorrect claim: TDD leads to 100% coverage&lt;/h2&gt;
&lt;P&gt;

Beck and others write that TDD naturally leads to nearly 100% test
coverage. In his book he writes "TDD followed religiously should
result in 100% statement coverage." Elsewhere I've seen people write
similar things.

&lt;/P&gt;&lt;P&gt;

That's not true. Yes, under TDD new code should have 100% statement
coverage, but what about refactored code? Coverage is most at risk
when the refactor is more like a rewrite, perhaps to replace an
algorithm with a faster version.

&lt;/P&gt;&lt;P&gt;

If I start with Martin's Prime Factors code and change it to my prime
sieve based code, I can think of several ways where part of the
refactored code wouldn't be tested. You can easily come up with plenty
of other refactorings where parts of the new code are not tested.

&lt;/P&gt;&lt;P&gt;
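
Here is one such way, as a sketch. It assumes a sieve-based rewrite in
which the sieve only runs up to sqrt(n), and it counts by hand how often
the statement that crosses off multiples is executed. The code and the
counter &lt;tt&gt;marking_runs&lt;/tt&gt; are made up for this example; a
coverage tool would report the same thing without the instrumentation.

&lt;pre&gt;
```python
import math

marking_runs = 0  # instrumentation: executions of the sieve's marking statement


def primes_up_to(limit):
    """Sieve of Eratosthenes, with a counter on the marking statement."""
    global marking_runs
    is_prime = [True] * (limit + 1)
    primes = []
    for i in range(2, limit + 1):
        if is_prime[i]:
            primes.append(i)
            for multiple in range(i * i, limit + 1, i):
                marking_runs += 1
                is_prime[multiple] = False
    return primes


def prime_factors(n):
    """Sieve-based rewrite: divide by the primes up to sqrt(n)."""
    result = []
    for p in primes_up_to(math.isqrt(n)):
        if p * p > n:
            break
        while n % p == 0:
            result.append(p)
            n //= p
    if n > 1:
        result.append(n)
    return result


# Expected answers for 1 through 9, the size of input the kata's tests reach.
small_cases = {1: [], 2: [2], 3: [3], 4: [2, 2], 5: [5],
               6: [2, 3], 7: [7], 8: [2, 2, 2], 9: [3, 3]}
for n, expected in small_cases.items():
    assert prime_factors(n) == expected

runs_after_small_cases = marking_runs  # 0: the marking statement never ran
prime_factors(16)                      # needs a sieve up to 4, which crosses off 4
runs_after_16 = marking_runs           # 1
```
&lt;/pre&gt;

Every test up to 9 passes, yet the heart of the sieve is never executed
until an input of at least 16 comes along.

&lt;/P&gt;&lt;P&gt;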

Yes, people will respond that TDD doesn't mean you can stop being
smart, and you must remember to include those tests, or even to add
those tests while refactoring the new code. That's very true. I only
point out that refactoring doesn't have the goal of maintaining full
statement coverage, and therefore TDD doesn't either.

&lt;/P&gt;&lt;P&gt;

If you feel that code coverage is needed, above and beyond code
inspection and manual methods, then there are tools to help automate
coverage tests. The best covered tool I know of is SQLite. Its
"&lt;a href="http://www.sqlite.org/testing.html"&gt;veryquick&lt;/a&gt;"
tests run about 42 thousand tests to get 97.23% coverage of about
66,000 SLOC, with additional tests which get 99.50% statement coverage
of the entire code, and 100% coverage of the core. This was an
intense and dedicated effort which does not and cannot fall out as a
simple consequence of TDD.

&lt;/P&gt;
&lt;h2&gt;Complaint: TDD freezes the API too early&lt;/h2&gt;
&lt;P&gt;

This is my personal complaint. It is not derived from those worked-out
examples.

&lt;/P&gt;&lt;P&gt;

My own development style is a mixture of many techniques. When I've
tried doing TDD I feel like it locks me down too early. My code in the
early stage is very fluid. I'm mostly trying to get a feel for what
it's going to look like. At that stage the code isn't meant to even
compile, and the only machine it runs on is the model in my head.

&lt;/P&gt;&lt;P&gt;

This is especially true for cases where I'm trying to come up with a
good API to implement the new functionality. My test cases are short
programs which would use the API, and I try out different example
programs to get a feel for usefulness, ease-of-use,
ease-of-implementation and other factors.

&lt;/P&gt;&lt;P&gt;

If I use TDD here, I don't know what the API is going to look like, so
how do I write the tests? I won't know what the API is going to look
like until I've had a feel for implementing it, but even then the API
changes often. If I have tests for the API and the API changes, then
there's the extra mental barrier of having to change all the tests for
the new API.

&lt;/P&gt;&lt;P&gt;

Especially bad are the cases when I realize that some function isn't
needed and should be deleted. With TDD that would also mean deleting
the tests which went along with the function, and it would likely mean
I've already spent time debugging the function, now all thrown away.

&lt;/P&gt;&lt;P&gt;

I've seen that in the code katas we do in the
&lt;a href="http://groups.google.com/group/gothpy"&gt;GothPy&lt;/a&gt; meetings (the
local Gothenburg Python Users Group). Once we have working code with
unit tests, I don't want to remove the function, and I start thinking
about ways to adapt it, rather than thinking about ways to simplify
the overall code base.

&lt;/P&gt;&lt;P&gt;

XP allows something like this as a &lt;a
href="http://www.extremeprogramming.org/rules/spike.html"&gt;spike
solution&lt;/a&gt;, but says that you should expect to throw the
implementation away and start anew. I don't.

&lt;/P&gt;&lt;P&gt;

Once I have a good sketch of how the code is going to be, I often
continue by filling in the details. At this point unit tests start to
be useful, but if I'm developing an API I'll write a simple functional
test which uses the API, and make it work. It really might be a
command-line program or even a &lt;tt&gt;__main__&lt;/tt&gt; for the current
module. This helps me get a more concrete solution, and once the code
has solidified enough I start developing my automated unit tests.

&lt;/P&gt;&lt;P&gt;
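
As a purely hypothetical illustration of that kind of throwaway
functional test, here is a small module with a
&lt;tt&gt;__main__&lt;/tt&gt; section that exercises the API it defines.
The function names and the output format are invented for this sketch.

&lt;pre&gt;
```python
def prime_factors(n):
    """Hypothetical API under development: prime factors in ascending order."""
    result = []
    candidate = 2
    while n >= candidate * candidate:
        while n % candidate == 0:
            result.append(candidate)
            n //= candidate
        candidate += 1
    if n > 1:
        result.append(n)
    return result


def format_factorization(n):
    """Render a factorization as text, e.g. for a command-line tool."""
    factors = prime_factors(n)
    return "%d = %s" % (n, " * ".join(str(f) for f in factors))


if __name__ == "__main__":
    # Quick functional check while the API is still fluid: run the module
    # and read the output by eye. Nothing here is automated yet.
    for value in (12, 97, 360, 2**31 - 1):
        print(format_factorization(value))
```
&lt;/pre&gt;

If the API changes, only these few lines need to change with it, which
is a much smaller barrier than a suite of unit tests.

&lt;/P&gt;&lt;P&gt;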

Since I'm not using TDD, I use code coverage (either manually or
through coverage tools) to improve statement coverage, and I use my
knowledge of the problem to come up with good test cases. The result seems
to be no less effective than TDD, plus as a methodology it includes
development tests which TDD does not.

&lt;/P&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;P&gt;

Good testing practices help make good code. Automated unit tests,
written by the developer and run often during the development stage,
are a good testing practice. TDD uses those sorts of tests, but its
focus on test-first, with failing test cases that reflect missing
code, excludes important tests in the development process.

&lt;/P&gt;&lt;P&gt;

TDD can easily be modified to handle these other cases, but the result
is simply "good unit testing", without the test-first aspect that
makes TDD what it is.

&lt;/P&gt;
&lt;h2&gt;Questions or Comments?&lt;/h2&gt;
&lt;P&gt;

This is a contentious topic with a long history and plenty said about
it. I think I've contributed something new to it with my commentaries
on what should be exemplar TDD-based solutions. I hope you found it
interesting if not enlightening or useful. With three nearly complete
rewrites, it was by far the hardest essay I've ever written for my
site.

&lt;/P&gt;&lt;P&gt;

If you have any comments or feedback,
&lt;a href="http://dalkescientific.blogspot.com/2009/12/problems-with-tdd.html"&gt;please do let me know&lt;/a&gt;.

&lt;/P&gt;
</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2009/12/29/problems_with_tdd.html</guid><pubDate>Tue, 29 Dec 2009 12:00:00 GMT</pubDate></item><item><title>License agreements and usability</title><link>http://www.dalkescientific.com/writings/diary/archive/2009/12/16/licenses_and_usability.html</link><description>&lt;P&gt;

Licenses and end user agreements are almost a joke. Have you read the
entire terms of every license agreement before ticking the "I agree"
checkbox? It's more like a magical charm than something people take
seriously.

&lt;/P&gt;&lt;P&gt;

I know there's plenty of legal discussion about this, and I don't want
to talk about any of that. The generally accepted fiction is that
we're supposed to read those licenses. But to uphold that fiction, the
providers of these agreements are supposed to want us to read and
understand the license.

&lt;/P&gt;&lt;P&gt;

The license agreement page is part of the user experience, so I've
decided out of a sense of perversity to read the licenses and see if
they are usable. I'll point out two cases where the providers of a
service agreement clearly never expected anyone to actually read and
understand the full agreement. Both have been unresponsive to my
pointing out the problem. By comparison, I'll end on a high note with
a free software project which wasn't following their copyright license
to the letter but have been extremely responsive in fixing that
problem.

&lt;/P&gt;
&lt;h2&gt;Lufthansa&lt;/h2&gt;
&lt;P&gt;

I bought a flight a few months ago on Lufthansa. They require you to
pay homage to the license gods and tick the "I agree" box. It links to
the

&lt;a href="http://www.lufthansa.com/online/portal/lh/cmn/generalinfo?l=en&amp;nodeid=1818501"&gt;Terms &amp; Conditions - General Conditions of Carriage (Passenger and Baggage)&lt;/a&gt;, under the fiction that you are supposed to read it.

&lt;/P&gt;&lt;P&gt;

Guess what? It's incomplete. For example "Article 9: Schedules,
Delays, Cancellation of Flights" says "If the legal liability rules
apply we offer compensation and assistance according to Art. 14.4.1."
but there is no article 14. You can see article 14 is listed in the
index, where it shows that there should be 17 articles, but the page
ends with article 9.

&lt;/P&gt;&lt;P&gt;

Now, if you click on the hyperlink for article 17 or go to the PDF you
can see the entire terms and conditions, but the top of the page where
it describes how to print the terms clearly means to say that this
page is the complete T&amp;amp;C. It looks like there aren't any
functional tests to ensure that the entire contract is shown.

&lt;/P&gt;&lt;P&gt;

I'm sure a good lawyer could use this to some advantage.

&lt;/P&gt;&lt;P&gt;

I sent a bug report to the most appropriate Lufthansa site. I got an
email back which said to call a number in Germany. I haven't done
that. If they aren't responsive then it's their legal fault.

&lt;/P&gt;
&lt;h2&gt;Apple's iTunes Store&lt;/h2&gt;
&lt;P&gt;

Next is the iTunes license agreement. I bought an iPod last month to
replace one I lost a year ago. My sister and brother-in-law had bought
me an iTunes gift card some time back and I wanted to finally use
it. I went to the store and it said that I had to agree to the &lt;a
href="http://www.apple.com/legal/itunes/us/terms.html"&gt;iTunes Store
Terms and Conditions&lt;/a&gt;.

&lt;/P&gt;&lt;P&gt;

I started to read the license. It is &lt;i&gt;HUGE&lt;/i&gt;. Printing takes 24
pages. The license is an agglomeration of licenses for several
different services and contains many duplicates. That's why mention
of the privacy policy occurs three different times in two
different forms:

&lt;blockquote&gt;

a. Apple's Privacy Policy. Except as otherwise expressly provided for
in this Agreement, the Service is subject to Apple's Privacy Policy at
http://www.apple.com/legal/privacy/, which is expressly made a part of
this Agreement. If you have not already read Apple's Privacy Policy,
you should do so now.

&lt;/blockquote&gt;

&lt;blockquote&gt;

At all times your information will be treated in accordance with
Apple's Customer Privacy Policy which can be viewed at:
www.apple.com/legal/privacy/.

&lt;/blockquote&gt;

&lt;blockquote&gt;

At all times your information will be treated in accordance with
Apple's Customer Privacy Policy which can be viewed at:
www.apple.com/legal/privacy/.

&lt;/blockquote&gt;

Notice how the first gives an escape clause while the latter two do
not? Which one is legally binding?

&lt;/P&gt;&lt;P&gt;

It also includes an agreement for the App Store. As far as I know, use
of the app store has nothing to do with an iPod, and I don't know why
I have to make that agreement in order to purchase music. Plus, is
their use of "virtual ammunition" some sort of legal term? Like how
"munitions" includes cryptography? (I'm stretching things a bit
here. I can guess what they mean.) I'll go with the idea that I'm
agreeing to the iTunes store licenses on that page, and not agreeing
with some other license which just happens to be in the text.

&lt;/P&gt;&lt;P&gt;

But questions of interpretation are a different complaint. I'm talking
about the usability of licenses, not the ability to understand the
legalities of it. Clearly having somewhat contradictory duplicates
hinders understanding.

&lt;/P&gt;&lt;P&gt;

The license says:

&lt;blockquote&gt;

For more information about iTunes Plus, please read the FAQ at
http://phobos.apple.com/WebObjects/MZStore.woa/wa/iTunesPlusFAQPage.

&lt;/blockquote&gt;

I deliberately did not include a hyperlink in that quote. I'm reading
the license inside of iTunes, which neither has a hyperlink there nor
lets me copy and paste the text. Since my perverse goal is to read
what it says to read, I went to Safari, typed in that URL, and pressed
enter. It took me to a page which contained instructions to tell
Safari to tell iTunes to open up the web page.

&lt;/P&gt;&lt;P&gt;

Yes, that's right.  It's not possible to view

&lt;a href="http://phobos.apple.com/WebObjects/MZStore.woa/wa/iTunesPlusFAQPage"&gt;http://phobos.apple.com/WebObjects/MZStore.woa/wa/iTunesPlusFAQPage&lt;/a&gt;

in a standard web browser. Safari automatically opens it in iTunes and Firefox gives me a box saying "This link needs to be opened with an application."

&lt;/P&gt;&lt;P&gt;

Bringing the page up in iTunes stopped my registration session, which
takes place in iTunes. I had to start all over again.

&lt;/P&gt;&lt;P&gt;

I read the license. And read the license. And read it. Craig points
out that there are better things in life to do than to read the
license that has almost no legal reality anyway. I said I'm being
perverse.

&lt;/P&gt;&lt;P&gt;

I FINALLY got to the end. Clicked the "I Agree" button. Guess what?
The session had timed out. Now, I'm a decently fast reader. During
that most recent session I had already read some of the text from the
previous attempt and did not attempt to reread it. There is no way
that anyone has actually read the text of the license agreement during
the short time allotted for them.

&lt;/P&gt;&lt;P&gt;

My solution was to start over yet again, and click "I agree" without
reading the text this time. I assumed that the text had not changed
from the previous time I had read it. That's hard to tell.

&lt;/P&gt;&lt;P&gt;

Apple isn't even keeping alive the fiction that people read the
license presented to them. I guess there was no user testing here!

&lt;/P&gt;&lt;P&gt;

I'm sure a good lawyer could use this to some advantage.

&lt;/P&gt;&lt;P&gt;

I posted a statement about this on the most likely forum at Apple and
got the message that I posted to the wrong place and a pointer to the
right place. I sent something there but have received no response.

&lt;/P&gt;
&lt;h2&gt;On the other hand, CDK has been great&lt;/h2&gt;
&lt;P&gt;

Since I've been on this license kick (it's a bad habit kids; don't
start), I noticed that the &lt;a href="http://cdk.sourceforge.net/"&gt;CDK&lt;/a&gt;
people were distributing code but not following their license. CDK is
the Chemistry Development Kit and is an LGPL'ed collection of tools
for computational chemistry.

&lt;/P&gt;&lt;P&gt;

Their jar distribution includes jars from a few other projects, some of
which are also LGPL'ed. The LGPL says that you have to mention those
other projects in the documentation in some way, but the CDK
documentation omitted those.

&lt;/P&gt;&lt;P&gt;

I emailed Egon, who is one of the main developers, and pointed this
out. It's a technical flaw that almost no one would care about. I'm
very happy to say that he's taken it seriously and the CDK has been
going through all their code to make sure they have everything right.

&lt;/P&gt;&lt;P&gt;

It's a small sample size, and I know I'm comparing service agreements
to a copyright license agreement, but it's interesting that those who
have the fewest resources are also the most responsive and
sensitive. They are also the ones who depend the most on the good will
of other people.

&lt;/P&gt;&lt;P&gt;

To Egon and Stefan and the others at CDK - great job! You all surely
deserve a round of drinks next time we meet. (And no, I wouldn't offer
that to members of Apple or Lufthansa. Then again, they are being paid
for what they do.)

&lt;/P&gt;

</description><guid isPermaLink="true">http://www.dalkescientific.com/writings/diary/archive/2009/12/16/licenses_and_usability.html</guid><pubDate>Wed, 16 Dec 2009 12:00:00 GMT</pubDate></item></channel></rss>