Dalke Scientific Software: More science. Less time. Products

Diary RSS | All of Andrew's writings | Diary archive

Writings from the software side of bioinformatics and chemical informatics, with a heaping of Python thrown in for good measure. Code to taste. Best served at room temperature.

New Cheminformatics Projects #

I've started two new open projects for cheminformatics and I'm looking for help in both of them.

Chemistry Toolkit Rosetta

The Chemistry Toolkit Rosetta (CTR) is a set of common cheminformatics tasks implemented using a variety of different toolkits and approaches. It is meant primarily as a way for people to understand and compare how the different APIs work.

Currently there are 16 tasks, 14 of which are well-defined and have at least one solution (in OpenEye/Python since that's what I know best). Several also have solutions in Pybel, and there are a couple RDKit and CDK solution as well.

Some of the CTR tasks are:

It needs your help. The project started in part because I don't know RDKit, CDK, or Indigo that well - to say nothing of the commercial tools available from Symyx, Accelrys, Schrodinger, and others. I know them a bit better now, but not enough.

Feel free to contribute a solution in your toolkit of choice! Or provide commentary, feedback, or improve an existing solution. You can even contribute a new task, if it's characteristic of a frequently encountered cheminformatics-related problem which several toolkits can handle.

By the way, I give a big thanks to Noel O'Boyle for his feedback on the project direction and for his Pybel and Cinfony contributions to help flesh out CTR before this public annoucement.

Chem Fingerprints

The other project I started is called "chem-fingerprints" or "chemfp" for short. Its goal is to develop a couple of file formats for cheminformatics fingerprints as well as tools and libraries which work with those formats.

The main problem it addresses is that there is no widely used fingerprint format, so each research group or even individual researcher ends up making a new one, as well as the tools to work with it. See the use cases for some more detailed examples.

So far I've written a proposal for a line-oriented text format called "FPS" meant to be easy to generate and parse, and have sketched out a inary format called FPB meant for fast loading, at the expense of some preprocessing.

The FPS format is simple enough that you can likely figure out most of it from this example, taken from the specification:

 #FPS1
 #num_bits=256
 #software=RDKit/2009Q3_1
 #params=RDKit-Fingerprint/1 minPath=1 maxPath=7 fpSize=256 nBitsPerHash=4 useHs=True
 #source=/Users/dalke/databases/Compound_00000001_00025000.sdf.gz
 #date=2010-01-27T02:22:26
 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 1
 fffeffbfb7fffedff7beefdbddf7ffffabff76cf6df7fcf6f7fffebf7d7ffd6f 2
 ffffbfdfffffffffbfeffffffffffffffffffffffff77efffffffebfffffffef 3
 00c02010002610000080800041100002084000440d100000c055048801224400 4

I've developed a set of tools to generate FPS fingerprints from OpenEye, OEChem, and RDKit, as well as to extract fingerprints from SD tags; specifically the CACTVS substructure keys in PubChem. These are available from the Mercurial repository.

These tools are in development status, and are primarily meant at this time as a way to get concrete feedback for the specification.g

Other tools I would like to develop, perhaps with your help, are command-line programs for similarity search and substructure filters.

I'm also looking for input and feedback on the format definitions, and for people who want to add support for these formats in their tools.

If you are interested in chemfp, then sign up on the chemfp mailing list.



Project hosting options? #

I started a new project for cheminformatics fingerprints and want to make it available for general use. It contains software under the MIT license and specifications under a license as lenient as I can make it. (Likely CC-BY.)

I looked around for project hosting. My requirements are:

That's it. Very simple, yes?

The options

I know there's a bunch of resources these days, and in my searches I found Wikipedia's Comparison of open source software hosting facilities. As you can see, there are quite a few. Sort on version control systems and it's Alioth, Assembla, BerliOS, Bitbucket, CodePlex, GNU Savannah, Google Code, JavaForge, KnowledgeForge, Project Kenai, and SourceForge.

Must have mailing list and web page or wiki hosting

Next, filter out those which don't have mailing lists, which removes Assembla, Bitbucket, and JavaForge. It's a shame about losing BitBucket since that's what I would have liked. With reluctance I also dropped Google Code since its mailing lists require a Google account. I think that's too high of a barrier of entry. I also dropped GNU Savannah since it doesn't have web or wiki hosting.

What's left are: Alioth, BerliOS, CodePlex, KnowledgeForge, Project Kenai, and SourceForge.

I want to try something other than SourceForge or a clone

Of those I have only used SourceForge, and done that for over 10 years. It feels very clunky and cluttered compared to Google Code and downloading packages is a nuisance for people like me who would rather curl the files than use a browser. Perhaps it's time to try something different? That puts BerliOS out, since it's derived from the SourceForge code base, as is GNU Savannah, and so is Alioth through GForge.

What's left? CodePlex, KnowledgeForge, and Project Kenai.

Must support non-member access to a mailing list

I looked at CodePlex. I think you have to be a CodePlex member in order to leave dicussions, and it uses web-based forum software instead of email. That is, I selected some of the project which have been downloaded the most often but never could find a "subscribe to the mailing list" option. Perhaps most people in the Microsoft Windows and .Net space don't do email?

In any case, it doesn't seem to fit my requirements.

Remaining options: KnowledgeForget and Project Kenai

I looked at KnowledgeForge and while it seems to fit my requirements, there aren't many people using it, although others may be using the underlying KForge application to host their own system. My concern is that the rough edges wouldn't have been worn down by other users.

That left me with Project Kenai, which also seemed to do what I wanted, and it has more and larger development projects, including JRuby. Okay, I'll try it out.

Project Kenai

(Update based on feedback. As of 27 Jan 2010 (or about two days after I registered on Kenai, and two days before I posted this essay), Oracle, who owns Sun, said they would be "phasing out of the public-facing domain used for the Project Kenai Beta site." Therefore, you shouldn't use it.)

I requested a new project hosting and got it. I set up the project, working on code, and updated the wiki. Seems to be nice enough, with really no problems to speak about. I was happy enough.

I liked some of the tweaks, like how it uses AJAX to update the displayed content rather than doing a full page submission like when editing Wikipedia. Though now that I think of it, I adore how StackOverflow shows the formatted content while you type.

Show stopper - non-member access to the mailing list

Until I got to the email part. Turns out Project Kenai does allow non-members to join a list, but they have to email subscribe request to the Sympa email system. Very much like the old majordomo list manager, and with no web-based front-end to help out.

I found that be searching the help files. There's no clue that that's even possible from the normal "mailing lists" page for a project. But perhaps I could remedy that with instructions on how to sign up without being a member.

The only way I could do that was on the wiki home page. I did that then asked a friend of mine to try it out. He followed the main mailing-list link from Kenai and never saw my note. Once on that page he couldn't figure out how to join without being a member, and he doesn't want to do all that just to join a mailing list.

Once I pointed out the manual instructions, he tried that out. I got an email which said I need to manualy confirm him as a member. On his side he only saw that he was now a member, and didn't like the lack of the Mailman-style double opt-in. As far as he could tell, anyone could register anyone else through a forged email.

That's a serious down-check, since while technically it meets my requirements, it doesn't meet the spirit.

No response to a feature requst

I posted this request to the features list a couple of days ago and got no response.

I do realize this is a free project, so I can make no demands nor should I expect fast response. That's why i waited a couple of days before writing this posting. But a reason for trying Project Kenai was because its size should mean it has more of these kinks worked out, and its support by Sun should imply there's someone to answer mail.

Just choose SourceForge?

As for the project, my conculsion is to just go ahead and use SourceForge. It's clumsy but I know it handles my needs.

Unless you have a better suggestion? Perhaps you think I should try BerliOS?



Cheminformatics, bioinformatics, and system biology positions available #

A couple of people I know have computational chemsitry and biology positions available, which might interest some of my readers.

University College Cork, Ireland

Noel O'Boyle (author of TwirlyMol, and Cinfony, contributor to OpenBabel and BlueObelisk, researcher in algorithms in cheminformatics and scoring functions for protein-ligand docking, and all-around good guy), has funding for a PhD student at the School of Pharmacy, University College Cork.

You would be developing open source tools for cheminformatics. I'm actually tempted by this one, except I think consulting pays better, I might have to move from Sweden, and I've got a different thesis I would like to work on.

Details about this PhD position.

Technical University of Denmark, on the north side of Copenhagen

Thomas Sicheritz-Ponten (whom I last saw at ISMB/Copenhagen, where he treated our table at the Scottish bar to a round of whiskys and I danced hustle with his girlfriend while the American musician played covers which everyone in the bar followed along to), has four open positions in his Metagenomics group at the Center for Biological Sequence Analysis. That's 3 PhD/postdoc positions in next gen sequencing related metagenomics and 1 postdoc position in archaeal genetics.

He also pointed out that there are 13 open position (PhD, postdoc, and programmer) for the entire center, including: postdoc within predictive bioinformatics for molecular epidemiology, scientific programmer for molecular epidemiology, PhD within epitope prediction, scientific programmer within disease systems biology, and postdoc or research assistant, bioinformatics generalist.

Details about the CBSA positions.

Get out there and start researching!



Fingerprint File Format #

(This has nothing at all to do with human fingerprints. I'm talking about a technique used in chemistry used to represent characteristics of a small molecule as a bit string so that fast bit operations can be used instead of slow graph operations. It's often rather like a Bloom filter where the hash function is based on molecular substucture.)

SD files and SMILES files are the two most common small molecule structure formats. PDB is the most common macromolecular structure format. FASTA and GenBank are similarly common in sequence databases. But molecular fingerprints don't have a common format.

I propose a new common format for these, called "FingerPrint Format", or "FPF file", with the standard extension of ".fpf." I'm looking for feedback and suggestions, and hopefully also uptake by others.

One problem of course is that they are many types of fingerprints and many ways to represent them. I'm limiting myself to those which are easily represented as a fixed-length bit string of length between 8 and about 8K, and which are dense enough to store the bits directly as bits rather than run-length or other encoding.

My goal is to make fingerprint data sets more portable, so that tools and algorithms developed by one group can more easily be shared and tested by another.

Use cases

Yes, these are all oriented around me. If it wasn't something I wanted then I would be getting paid for this. (Got funding?)

1) One of my clients wanted a 3-nearest-neighbors search (within some cutoff) of a set of Daylight fingerprints, to be used in part of their dataflow system. The files were static, and only updated once every 6 months or so. There was no need for a database, so I wrote a one-off search system for them.

2) I want to write a high-speed Tanimoto similarity search algorithm which works with a pre-built data structure and makes good use of modern processors. Most formats are not designed for high performance and require a preprocessing phase to load the data, and in a command-line tool the data load time will be the limiting factor, not the search.

3) I've found it a bit trying to determine the format and bit order of each fingerprint tool I use, since they are almost all different.

4) I want to develop some screening data sets for substructure queries and evaluate how effective they are against different inputs.

5) Oh, here's one which others want! Do the N**2 nearest-neighbor searches of a data set across a set of machines without spending much time building a fingerprint generation and parsing infrastructure.

Design goals

Here are my design goals:

These are not design goals:

I haven't decided how important these are:

Also, while not a design goal, I would like to have a set of reference implementation for different types of Tanimoto searches and substructure filtering and others. I also want them to be fast enough to be used in benchmarking new implementations, since I've seen too many benchmarks in general where the reference baseline was a naive (meaning "not fast") implementation.

FPF is based on the PNG block structure

I decided to base the format around the PNG specification. The first 8 bytes are a unique signature, and the rest of the file is a set of blocks. Each block contains a 4 byte identifier tag, followed by the length as a 4 byte integer, followed by 'length' many bytes for the block data, followed by 4 bytes for the CRC checksum.

All integers in this proposal are unsigned 32 bit values and stored in network byte order, taking up four bytes. That leads to a limitation in the fingerprint block size which I'll get to later.

FPF uses the same chunk layout, but with a different signature and with tags which are more appropriate for fingerprints.

The FPF signature as hex bytes is 0x89 0x46 0x50 0x46 0x0D 0x0A 0x1A 0x0A. The subsequence 0x46 0x50 0x46 is "FPF", where PNG uses "PNG".

I keep the PNG's use of a CRC checksum as a way to check for incomplete or corrupted files. I don't know how useful that is in practice.

FPF block types

FHDR: header block

I haven't figured out what goes in here. Perhaps something which indicates if the fingerprint type is known to be good/bad for substructure filtering or comparison?

tEXt (and potentially iTXt, zTXt): text block(s)

These are exactly the same as from PNG, which are key/value pairs including the controlled vocabulary of:

   Title            Short (one line) title or caption for image
   Author           Name of image's creator
   Description      Description of image (possibly long)
   Copyright        Copyright notice
   Creation Time    Time of original image creation
   Software         Software used to create the image
   Disclaimer       Legal disclaimer
   Warning          Warning of nature of content
   Source           Device used to create the image
   Comment          Miscellaneous comment; conversion from
                    GIF comment
Needless to say, but I'll do so anyway, some of these need to be tweaked for fingerprints.

I think FPF only needs tEXt, which is a simple key/value record using Latin-1. If those should be UTF-8 then look at iTXt or come up with a new type of text block.

Question: should there be some field which encodes the generation parameters ("OpenBabel FP2, folded to 32 bits")? That lets someone pass in structures without having to know the details of the given data set. Yes, these options are very implementation specific and not portable.

FDNS: dense fingerprint block

The block format is:

[ integer number of bits per fingerprint ]
[ integer number of bytes per fingerprint (including alignment) ]
[ integer number of bytes used for initial alignment = "initial alignment"]
[ "initial alignment" number of bytes of value 0 ]
[ fingerprint 1, which  is "number of bytes per fingerprint" long ]
[ fingerprint 2 ]
[ fingerprint 3 ]
  ...
[ fingerprint N ]
The fingerprints are stored in binary as a set of bytes. The bytes are written in little-endian order and the bits of each byte are written in big-endian order. To get bit i from a fingerprint:
byte_offset = i / 8;
bit_offset = i % 8;
bit = (fingerprint[byte_offset] >> bit_offset) & 1;
(CACTVS uses little-endian for the bytes and the bits, OpenBabel uses big-endian. I decided this mixture was easier to code.)

The minimum number of bytes per fingerprint is (bits_per_fingerprint+7)/8. Extra bits used to fill in the remainder of the last byte must be set to 0. The fingerprint may be padded with extra 0 bytes in order to make the fingerprint a multiple of a word size (typically 32 bits, 64 or 128 bits).

The "initial alignment" part is there to align the fingerprints in the case of memory-mapped files. It can be judiciously chosen so that the first byte of the first fingerprint is word aligned or even page aligned, if that makes a difference.

The number of fingerprints is not stored. It can be calculated based on the block size, the header size, and the number of bytes per fingerprint.

By the way, I expect memory alignment to be a down-stream issue where one program can generate a simple fingerprint file, using no extra alignment bytes, and another program can tweak the alignments to be more optimal for a given OS and implementation.

POPC: ordered population count offsets block

[ integer offset in DENS to first fingerprint with popcount=0 ]
[ integer offset  "  "   to first fingerprint with popcount=1 ]
   ...   
[ integer offset  "  "   to first fingerprint with popcount=N ]
[ integer offset in DENS to the byte after the last fingerprint with popcount=N ]

One way to speed up similarity queries and substructure filters is to exclude testing fingerprints which trivially cannot be accepted based on the popcounts of the target and query fingerprints. "Popcount" is short for "population count" and is the number of set bits in the fingerprint. For example, in a substructure filter test the query cannot have a smaller popcount than the target, and Baldi explained how reduce the Tanimoto search space.

FPF stores the popcount information implicitly, by ordering the fingerprints in the FDNS block so that all fingerprints with popcount=0 are listed first, then those with popcount=1, then popcount=2 and so on.

The "POPC" block contains byte offsets to the start and end of each set of fingerprints with the same popcount. If there are N bits per fingerprint then there will be N+2 offsets, and the fingerprints with popcount 0<=i<=N are between offsets POPC[i] and POPC[i+1].

The last offset is redundant and must be identical to the length of the DENS block. I include it here because my C implementations of the algorithms were easier if they have this information, and storing it in the data file means one less malloc or special case to consider.

If the POPC block is present then the DENS fingerprints must be ordered by population count.

NAME block - list of fingerprint identifiers

[NUL byte]
[ name 1 ]
[NUL byte]
[ name 2 ]
 ....
[ name N ]
[NUL byte]

If present, each fingerprint is associated with a single name and vice versa. The name must not contain the NUL character and should be a unique identifier and should not contain any control characters. (Are the names Latin-1 or UTF-8 encoded? If UTF-8, does the string also undergo a canonicalization step? Or are only ASCII names allowed?)

The names are stored in the same order as the fingerprints in the DENS block, so that the first name corresponds to the first fingerprint, the second with the second, and so on. Each name must be NUL terminated. (However, sorted names would make searching possible in log time instead of linear. On the other hand, most people will load the names into a local dictionary, and take the linear hit once.)

The first byte in the NAME block must be 0 (the NUL character). This is present even if there are no fingerprints and hence no names.

The purpose of the leading NUL character is to simplify text searches. To find the identifier "ID", construct a search for NUL+"ID"+NUL and do a linear search. (It also simplifies binary searches; pick a point p then backtrack to the previous NUL before doing the test.)

NOFF block - offsets into the NAME block

[integer offset to name for fingerprint 1]
[integer offset to name for fingerprint 2]
[integer offset to name for fingerprint 3]
  ...
[integer offset to name for the last fingerprint]

This block maps contains offsets into the NAME block. To get the name for fingerprint i, go to the ith entry in the table then use that as the byte offset of the name the NAME block.

To go the other way requires a binary search. Given the byte offset to a name in the NAME block, use a binary search to find the entry in the NOFF block. The byte position of the entry gives the index into the NOFF block, which is also the index of the fingerprint in the DENS block.

Using sorted names means the mapping from identifier to fingerprint can be done in log time. Using unsorted names requires linear time.

However, I expect most people to load the NAME data into a local dictionary/hash table, which will take linear time but only once. In that case, keeping the names in the same sort order as the fingerprints is much easier to deal with.

Unresolved issues

Sorted or unsorted names?

I'm repeating it here because it's tricky. How well supported should fingerprint lookup by name be? If the names are in sorted order then I can find the name in log(n) time, then find the fingerprint index in log(n) time, but I had to do that every time. While building a table mapping from name to index requires disentangling the lookup table and an linear operation.

Preserving the fingerprint sort order means the fastest program requires O(n) time lookup for every case but makes building the mapping from name to fingerprint very easy.

Scaling

This description used 32-bit unsigned integers as lengths and offsets. That means the largest fingerprint data set can contain at most 2**32 bytes. Consider someone using the 881 CACTVS substructure keys aligned to 128 bits. This uses 896 bits or 112 bytes per fingerprint, which means the block can hold a bit over 38 million fingerprints.

Of course, if someone does an 8K fingerprint then that drops to just over 4 million fingerprints.

Is this a realistic concern for the next 10 years? I think most people don't use fingerprints over 1024 bits, but on the other hand the databases are currently in the 50 million compound range.

There are three ways to address it. One is to use 64 bit values instead of 32 bit. Another is to use multiple blocks, with a special block to indicate the start of a new set of blocks. A third is to simply have multiple FPF files. One thing to bear in mind is that multiple section or multiple files is likely easier for people using map/reduce strategies.

Storing multiple diverse fingerprint sets in the same file

Substructure fingerprints are likely not good similarity fingerprints and vice versa. Is there a need to store different fingerprint sets in the same file?

I don't think so. I think using multiple files works well for that.

Preserving order

A substructure screen may go through several stages. For example, if the input structure contains a hormone substructure then the results of a general query filter might be passed through one optimized to distinguish between different types of hormones, with the hitlist passed from the first stage to the next.

Using names as the hitlist identifiers does not work that well because of the expensive lookup cost. It's better to pass around a list of offsets. But if the DENS block is reordered then the original sort order is no longer available. Also, there needs to be a way to report the match using its input order.

There are two solutions to that: don't sort by popcount, or add a table which maps from input order to DENS order. (Or possibly two tables, although the inverse mapping can be constructed in O(N) time with only one malloc because it's a bucket sort.)

Usefulness of Baldi

The Baldi algorithm suggests optimial searching based on ordering the search of the popcount bins based on the maximum minimum value of the bin. (The paper actually suggests allocating an array and sorting, but as the two sides of the distribution are monotonic decreasing it can also be done with an iterator doing the equivalent of a merge sort of the two sides.)

I implemented it but excepting high similarity searches (perhaps 80% or so? I did this over a year ago), I didn't notice any real improvement in Tanimoto searches. I conjectured that non-linear disk seeks were causing the problem. The memory and especialy disk subsystems are really optimized for forward searches but the Baldi algorithm jumps back-and-forth, and breaks all the lovely cache-prefetching going on behind the scenes.

While I'm convinced that the Baldi limits are useful, the reordering to make it fast ends up making the system more complicated.

BTW, SSDs (Solid State Drives) fix part of the problem by eliminating seek time, through not cache prefetching. Another solution might be to store two sets of fingerprints, with the second sorted in reverse popcount order and on another disk. Then with two threads going forward... Hmmm...

In any case, Swamidass and Baldi reported good speedups for top-K searches. I haven't seen their code, but I have seen other search code which uses a very slow scoring function, so I would like a platform where it's easier for people to compare against known fast implementations of different approaches.

Fingerprint density

I said that I'm assuming dense fingerprints, where the bits are random. Most chemical fingerprints are sparse, with under 20% density for similarity fingerprints and even less for feature key fingerprints used for substructure screening.

Perhaps using run-length encoding of the bits in a fingerprint, or run-length encoding of the inverted index, might be better. However, I have a lot less experience with that, and storing those data structures on disk is more complex than I want to deal with.

CPUs are very fast. A test I did a few years ago suggests that disk and memory access are the limiting factor, and not the algorithm. If that's the case then it's more worthwhile to make the CPU do extra work while it's waiting for less data. On the other hand, another test showed my code was still rather slower than computing the md5 checksum of the same file. These tests were suggestive, not rigorous.

Streaming output

Each chunk starts off with a length. This leads to several complications. It's hard to stream everything to a pipe except if the output data is known, which likely means storing all of the fingerprints in memory before generating the output.

If the output is a file, not a stream, then it's possible to write the all the fingerprints to the file, then seek back to the length field and, and then seek back to the end. They wouldn't be sorted, but a post-processor could sort a file which is too large to keep in memory. (Which given that my laptop has 3GB of memory doesn't seem a real concern.)

That would also mean keeping the names in memory until the fingerprints are written, since the above trick only works for a single block.

If memory is critical (which shouldn't be the case for names, as they are usually only 10 bytes or so), then an implementation might write the blocks to the filesystem and merge later on. Just like the old days of dealing with tape drives. But really, this seems more a theoretical concern than something to worry about since the point of an FPF file is use it as a data source for a pipeline but not part of the pipeline stream.

Development and future

This is something I've been thinking about for a while but haven't gotten around to because I'm not working with fingerprints right now. I am a consultant so if you're interested in funding me, email me. I also do have other, paying work so this doesn't have much priority.

I like the general approach of using PNG blocks. I've implemented parts of the system, and the resulting code has been nice and clean. The format I sketched out fits in well with the high-performance C code for Tanimoto search and substructure filter codes I wrote a couple of years ago. I also like how the format is relatively future proof and if someone needs some new data chunk it isn't hard to add.

I do need feedback from others. If you have ideas, or comments, or perhaps know about existing fingerprint formats I should use instead, then leave a message.

If there's enough interest in this, I'll set up some sort of project repository and mailing list for it, so it's up to you to speak up!



Content-Disposition bug in browsers? #

Would someone who knows the relevant web specifications double-check me on this? I think I've found a bug in how Safari, Firefox, and links handle Content-Disposition in file uploads. (This is part of how a form sends a file to the server and not how the server sends a file back to the browers.). I did the following on a Mac OS X 10.6 ("Snow Leopard") machine.

If you have input or comments, or test results for your browser, let me know. I looked in Firefox's Bugzilla but found nothing about it, nor did this problem come up in general Google searches using what I think are the relevant search phrases.

My appeal to the lazyweb: If this is a bug, and you know how to submit to the relevant trackers, please do so. Figuring out each one and tracking the comments for each site is a nuisance.

Reproducible

I created a file named:

Evil"; name="fred
(It gets more evil if the file contains a newline, but I'm not going to work though an example of that since I want something which is easy for you to create.)

I created a simple form

<html>
<head><title>content-disposition test</title></head>
<body>
 <form method="POST" action="http://localhost:8888/" enctype="multipart/form-data">
  <input type="file" name="blah">
  <input type="submit">
 </form>
</body>
</html>

I used netcat to listen for incoming requests

nc -l 8888

To put it all together, I loaded the HTML page in a browser, selected the evil file, and submitted it to netcat. Quick and dirty, but it works, and it proves that nothing in the server is messing things up.

What follows is what I saw for different browsers. The results are obviously suspicious. It's easy to make the header not follow RFC 2183, which is the relevant spec. For example, remove one of the quotes.

Safari 4.0.4

------WebKitFormBoundaryO5rrXim+NdIE0npI
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: application/octet-stream

Firefox 3.5.5


-----------------------------94839879511149195311657737442
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: application/octet-stream

Links 2.2


-----------------------------00000000000000000000000000000
Content-Disposition: form-data; name="blah"; filename="Evil"; name="fred"
Content-Type: text/plain; charset=us-ascii

Opera 10.10

Opera 10.10 is the odd one out. As far as I can tell, it's safe from evil filenames. It doesn't allow me to even submit filenames containing a newline. The newline character gets removed from the name. If there's a semicolon it removes that character and all following text, so that "simple;semicolon" becomes filename="simple". (Perhaps this is VMS versioning legacy?)

If I use a double quote (") and no other non-letter characters in the filename then the result is

------------lHwjN2s9agu9VAHOJ1ChbS
Content-Disposition: form-data; name="blah"; filename="default"
Content-Type: application/octet-stream
In other words, Opera does not generate invalid requests here.

Newlines

I did the same tests with a newline character in the filename and found that Safari and Firefox will upload a file containing the newline, and the newline is placed in the Content-Disposition field unaltered. This lets me craft new headers for the part, including a replacement Content-Disposition header. Useful? Probably not.

Security Vulnerabilities

None that I can think of. There are other ways to craft ill-formatted requests than using a browser, so the only possible attacks are from people who have only the ability to create a filename.

Though it would be really cool if someone proved me wrong. Is there a server out there which trusts the 'size' field and tries to preallocate 2GB of data, just because of a well-constructed upload filename? If you come up with something, let me know!



Problems with TDD #

If you have not yet read it, please read Maria Siniaalto's 15 page "Test-Driven Development: empirical body of evidence." It summarizes the few empirical studies done to evaluate the effectiveness of TDD. In the conclusion you'll find:

Based on the findings of the existing studies, it can be concluded that TDD seems to improve software quality, especially when employed in an industrial context. The findings were not so obvious in the semi-industrial or academic context, but none of those studies reported on decreased quality either. The productivity effects of TDD were not very obvious, and the results vary regardless of the context of the study. However, there were indications that TDD does not necessarily decrease the developer productivity or extend the project lead-times: In some cases, significant productivity improvements were achieved with TDD while only two out of thirteen studies reported on decreased productivity. However, in both of those studies the quality was improved.

The empirical evidence on the practical use of TDD and its impacts on software development are still quite limited.

I mention this first because I've concluded that not only is TDD not useful for me but I don't think it's a generally useful technique. The important requirements are to have good, complete automated unit tests, to develop code for testing, and to do interative improvement through refectoring and rewriting. TDD promotes those, but my experience is that TDD pins down the code too early and my observation is that TDD by itself ignores certain classes of essential unit tests.

My position against TDD will be contentious to some, like those who believe that TDD is a required component in modern best-practices development. I quoted Siniaalto to show that there is no strong evidence to back that belief. I fully expect someone to tell me that TDD drastically improved their development style. My response will be they learned good practices, but those practices don't require TDD and can as easily be learned without TDD.

By the way, while my conclusion is in opposition to Siniaalto's, it's because the most successful TDD paper in her report comes from Maximilien and Williams about their experience at IBM. They went from ad hoc unit testing to good development practices based on TDD. I think good testing practices without using TDD would have given the same results.

Before going further I'll also quote from Kent Beck's "Test-driven development: by example":

One of the ironies of TDD is that it isn't a testing technique (the Cunningham Koan). It's an analysis technique, a design technique, really a technique for structuring all the activities of development.
This entire essay will describe why TDD is a weak testing technique and an incomplete development technique. I'll bring up other techniques which are not part of TDD but end up leading to better unit tests that should help make you more confident that your code works.

Test first vs. test last vs. good testing

By TDD I mean Test Driven Development, and specifically its test first approach. Wikipedia describes TDD as:

First the developer writes a failing automated test case that defines a desired improvement or new function, then produces code to pass that test and finally refactors the new code to acceptable standards.

By contrast, people also talk about "test last". Test last is the extreme opposite of "test first". One good definition of test last is:
testing should be done before the code goes into production; it does not imply that the tests are automated.

When I say that people shouldn't do TDD I do not mean they should do test last development instead. That is false dichotomy, and it annoys me when I read descriptions which present those two styles as the only possibilities.

My own practice is to have good, automated tests, but these don't get put into place until the cost/benefit ratio makes the tests worthwhile; which is rarely at the start of the code development and always by the end. The test themselves are guided by the code, and the knowledge of where the failure cases might be in the code. In addition, I'll add tests which check the expected input range, and after the code is done I'll add tests which check my belief that the code is done, as well in some cases tests driven by code coverage or other reasons.

I expect people to point out that TDD does not preclude other testing strategies, to fill in those gaps. I completely agree. I agree so much that I mostly use those other good strategies, and not TDD. TDD seems to add little to the result.

Worked out TDD examples

I want to base my response in at least the spirit of empirical research. I can't, because I don't (and neither likely do you) have the resources to do those tests. What I can do is find some descriptions of TDD used to implement a problem and make comments about them to highlight limitations in TDD.

I give full props to those who have described the steps they go through to work on a problem. Even in the simplest of cases it's a lot of work.

I found number of basic TDD tutorials, based around addition and subtraction, either with basic add() and sub() functions or through depositing and withdrawing money from a bank account [1] and [2].

Those were too simple to have problems. I wanted something more complex. The most complete examples I found were Robert Martin's Prime Factors Kata, which he also works through in a video, and implementing the Fibonacci sequence in Gary Bernhardt's blog post How I started TDD and Kent Beck's "Test-driven development: by example". I don't know if Bernhardt's example is derived from Beck's, but it's the one I came across first.

Prime Factors

The Prime Factor Kata asks for a function which takes a number and returns its prime factors in an ordered list, including duplicates. For example, 12 would return 2, 2, 3. The test cases were 1, 2, 3, 4, 6, 8, and 9 and the kernel of the solution was:

  public static List<Integer> generate(int n) {
    List<Integer> primes = new ArrayList<Integer>();

    for (int candidate = 2; n > 1; candidate++)
      for (; n%candidate == 0; n/=candidate)
        primes.add(candidate);

    return primes;

Fibonacci

The Fibonacci sequence examples checked that the first few outputs were correct, giving fib(i=0, 1, ...) = 0, 1, 1, 2, 3, 5 . Both people ended with variations of the classic recursive solution, here from Bernhardt:

def fib(n):
    if n <= 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

Problem: TDD doesn't emphasize good test cases

When I looked at Martin's Prime Number Sieve, I first thought the code was wrong. It tests to see if 2 is a divisor, then 3, then 4, then 5, and so on. 4 can never be a prime divisor of the candidate because 4 isn't prime. Why does his code check for that possibility? Was there a bug?

Code should be readable, so that others can understand it and verify that it works. In the same vein, tests should serve as a way for others to check that the code is working. I looked at the tests, and noticed that the only prime factors tested were 2 and 3. Perhaps if 5 was a prime factor then there would be a problem when the code got to 4?

I couldn't tell from the tests, so I had to look more closely at the code. It then became obvious. All factors of 2 were removed, so there was no way that 4 could be a divisor. By construction, no non-prime candidate could ever work, so will never be added to the list.

The tests were not good enough to minimize doubt that the code contained bugs. I can think of a couple of simple variations of the code which would contain bugs and which would pass the tests. Yes, the tests were enough to help Martin get to a solution, but they shouldn't have been enough to convince him, much less others, that the code was right.

Some good tests might have included the primes 17 and 97 as well as 91 (=7*13). I can't think of simple bugs to put in Martin's code which would also cause those test cases to fail, excepting a hard-coded upper limit to the search space which would easily show up on code review.

Fibonacci Sequence

Bernhardt's Fibonacci Sequence did test enough numbers that I was pretty sure that algorithm would come up with the correct answers, although I would have preferred some larger numbers, like fib(12) = 144. (I picked that one because it's cute that 144=12*12.)

Problem: When do you add tests that should pass?

TDD says to add a failing test then fix the code. What do you do with tests which are expected to pass? For example, suppose I finished the prime factors code but upon review of the tests I have a niggling uncertainty that it handles prime factors greater than 3. I want to add a test case to find the factors of 91.

I asked this of Bernhardt, and he kindly addressed that in his followup essay "The Limits of TDD."

After the tests drove the first fully-functional design out, I'd add exactly the types of tests you describe. These wouldn't fail at first, but that's fine; TDD doesn't preclude such things, they're just outside its scope. What I would do, to make sure the tests were honest, is to intentionally break the code, watch them fail (probably along with several other tests), then unbreak the code. This gives me at least some of the confidence that TDD does - I know that something is actually being tested.
This is a bit different than what I would do. If the code is supposed to work then I don't want to touch the code at all. Instead, I add the test but make sure the test is supposed to fail, perhaps by saying the factors of 91 are 5 and 13. Seeing the failure is a check that I didn't make a stupid mistake in writing the test. Then I fix the test and see that it passes.

Mine is not his more TDD approach, although close. But I want to highlight his comment that "TDD doesn't preclude such things, they're just outside its scope."

That's exactly my point, and notably in disagreement with Beck's statement that TDD is "really a technique for structuring all the activities of development."

Other tests and other development approaches besides TDD are needed for good software development, including approaches which are conceptually quite close to TDD but not part of it. I say that the skills that are needed to detect and add good passing tests can equally be applied to developing good unit tests in the first place.

Only, without extra requirement of coming up with all of the tests first.

Problem: TDD does not consider worst-case scenarios

In "good test cases" I said that TDD doesn't stress the tests needed to convince yourself or others that the code was right, only tests to implement the code you think is right. Here I'll talk about a different sort of unit test that TDD doesn't help with - worst-case scenarios.

Prime Factors Kata

I implemented the Prime Factors Kata on my own. It took me a while too. I implemented the Sieve of Eratosthenes to generate prime factors, and only searched for factors up to sqrt(n). This has been my general approach for this sort of problem since college. I ended up with 29 lines of code, and I couldn't understand how Martin was able to write:

The final algorithm is three lines of code. Interestingly enough there are 40 lines of test code.
(BTW, I counted 15 total LOC in the program and 43 LOC in the test module, or 3 vs. 12 if you only talk about "real" code, vs. import statements, function definitions, lines with only a closing brace, and so on. In either way of counting, it's still less than my 29 lines of code.)

If you listen closely in Martin's video you'll see that he considers his three line solution to be "more elegant" than the Sieve solution. I really didn't understand assertion. His solution is going to be slow for almost all cases. I timed Python implementations of our two algorithms for numbers around 200,000. His was 150* slower than my sieve-based solution, and it gets much worse after that.

If you listen even more closely, I think you'll hear the reason. He introduced the problem by saying his kid was learning about prime factors at school, and Martin wanted a program which could solve the same sort of problem. In that case, the prime factors are small. Few teachers would be so mean as to require their students to find the prime factors of 524,287 by hand.

If the possible input range was only, say 1 to 150, then I could see how Martin's code is elegant. But if the input range is 1 to 2**32 (which is more like I expected), then it's clearly not elegant because finding that 2**31-1 is prime will take about 2**31 modulo tests. Computers are fast, but that's excessive. (BTW, it's also cute that 2**31-1 is both max signed integer and a Mersenne prime.)

In either case, there should be tests for values which represent a worst-case scenario. In this case that would be a prime at the high end of the expected range. His largest test was 9. Mine was 2**31-1.

Fibonacci

There are three problems with the Fibonacci implementations. One is that the classic recursive solution (without memoization) takes exponential time. I implemented the solution iteratively and compared the results. Bernhardt's solution for fib(32) takes about as long as my iterative soluton for fib(100000), and after a minute I gave up computing fib(40) recursively.

Another is that Python's default stack size is 1000 function calls. Doing fib(1500) quickly gives a "RuntimeError: maximum recursion depth exceeded" exception.

The last is in Beck's code. Assuming the recursive solution could compute it in time, fib(48) is larger than 2**32. He uses a Java 32 bit integer, so his code would silently overflow.

Discussion

TDD creates unit tests which are used to develop and refactor code. These tests are only a subset, and not even an essential subset, of the tests needed to check that the code implements the requested feature. You may think you are finished with the code and you pass all the TDD tests, but you still aren't finished with the development process. You still have to do other important unit tests.

I'm certain that Beck and Bernhardt know the limitations of their Fibbonacci implementations. I'm really surprised they didn't mention the problems in their solution. It would have been the perfect place to show that other types of unit tests can't be ignored, and discuss how to fit them into the TDD development process.

I also wish that Martin has been less dismissive of the sieve solution. It's obvious that others have mentioned it to him. He should have responded by pointing out that the solution was overkill for the problem range. I also wish he had included tests for the high end of that range. (I have the idea based on other writings that he's not an algorithms person, so he also might not have been aware of the performance problems in his solution.)

Problem: TDD doesn't give you confidence that the code works

Many TDD advocates bring up confidence as a reason for doing TDD. In his book Beck writes:

Psychological - Having a green bar feels completely different from having a red bar. When the bar is green, you know where you stand. You can refactor from there with confidence.
and others write similiar things.

If your goal is to be confident in your code, then TDD is a weak method for developing those tests of confidence. I've now shown a couple of TDD examples, which were done with TDD principles foremost in mind, but which failed to consider worst-case solutions. You should not be confident that your code works just because your TDD tests pass.

When I write my code, I'm not confident that it works. I'm not even confident that a refactoring works despite passing all of the unit tests. I worry about edge cases I didn't think of, I worry about implementation flaws, I worry about worst-case scenarios.

If I write the tests first, I also worry that I've overfit my code to the tests. This is a problem that happens in statistical modelling. Given any set of data points, I can fit them to a model. The next question is, is the model valid and useful? The way to check is to use them to make predictions, and see how well it matches reality. This in turn means testing the model with data which wasn't used to make the model.

I feel the same way about my code. I start with doubt that my program works, but with confidence that I can develop new tests which should pass if the code is correct. To reduce doubt, I'll write new tests and see if they pass or fail. Passing tests reduces my doubt, failing tests means I need to figure out what happened, and I'm back to more code development.

TDD by itself cannot give you that confidence because it excludes the idea of adding tests which are expected to pass. On the other hand, developing unit tests even if just after the code is written (but long before it's deployed as is done with test-last), guided by knowledge of how the software is implemented and experience in how the code can fail, can give you all the benefits of TDD, plus be able to handle the cases that TDD doesn't handle. TDD is one technique for learning those skills, but it is not an essential technique.

Incorrect claim: TDD leads to 100% coverage

Beck and others write that TDD naturally leads to nearly 100% test coverage. In his book he writes "TDD followed religiously should result in 100% statement coverage." Elsewhere I've seen people write similar things.

That's not true. Yes, under TDD new code should have 100% statement coverage, but what about refactored code? This is especially true if the refactor is more like a rewrite, perhaps to replace an algorithm with a faster version.

If I start with Martin's Prime Factors code and change it to my prime sieve based code, I can think of several ways where part of the refactored code wouldn't be tested. You can easily come up with plenty of other refactorings where part of the new code are not tested.

Yes, people will respond that TDD doesn't mean you can't stop being smart, and you must remember to include those tests, or even to add those tests while refactoring the new code. That's very true. I only point out that refactoring doesn't have the goal of maintaining full statement coverage, and therefore TDD doesn't either.

If you feel that code coverage is needed, above and beyond code inspection and manual methods, then there are tools to help automate coverage tests. The best covered tool I know of is SQLite. Its "veryquick" tests run about 42 thousand tests to get 97.23% coverage of about 66,000 SLOC, with additional tests which get 99.50% statement coverage of the entire code, and 100% coverage of the core. This was an intense and dedicated effort which does not and cannot fall out as a simple consequence of TDD.

Complaint: TDD freezes the API too early

This is my personal complaint. It is not derived from those worked out examples.

My own development style is a mixture of many techniques. When I've tried doing TDD I feel like it locks me down too early. My code in the early stage is very fluid. I'm mostly trying to get a feel for what it's going to look like. At that stage the code isn't meant to even compile, and the only machine it runs on is the model in my head.

This is especially true for cases where I'm trying to come up with a good API to implement the new functionality. My test cases are short programs which would use the API, and I try out different example programs to get a feel for usefulness, ease-of-use, ease-of-implementation and other factors.

If I use TDD here, I don't know what the API is going to look like, so how do I write the tests? I won't know what the API is going to look like until I've had a feel for implementing it but even then the API changes often. If I have tests for the API and the API changes, then there's the extra mental barrier of having to change all the tests for the new API.

Especially bad are the cases when I realized that some function isn't needed and should be deleted. With TDD that would also mean deleting the tests which went along with the function, and it would likely mean I've already spent time debugging the function, now all thrown away.

I've seen that in the code katas we do in the GothPy meetings (the local Gothenburg Python Users Group). Once we have working code with unit tests, I don't want to remove the function, and I start thinking about ways to adapt it, rather than thinking about ways to simplify the overall code base.

XP allows something this as a spike solution, but says that you should expect to throw the implementation away and start anew. I don't.

Once I have a good sketch of how the code is going to be, I often continue by filling in the details. At this point unit tests starts to be useful, but if I'm developing an API I'll write a simple functional test which uses the API, and make it work. It really might be a command-line program or even a __main__ for the current module. This helps give me get more concrete solution and once that's solidified enough code I start developing my automated unit tests.

Since I'm not using TDD, I used code coverage (either manually or through coverage tools) to improve statement coverage, and I use my knowledge of the problem to come up good test cases. The result seems to be no less effective than TDD, plus as a methodology it includes development tests which TDD does not.

Conclusion

Good testing practices help make good code. Automated unit tests, written by the developer and run often during the development stage, is a good testing practice. TDD uses those sorts of tests, but its focus on test-first, with failing test cases that reflect missing code, exclude important tests in the development process.

TDD can easily be modified to handle these other cases, but the result is simply "good unit testing", without the test-first aspect that makes TDD what it is.

Questions or Comments?

This is a contentious topic with a long history and plenty said about it. I think I've contributed something new to it with my commentaries on what should be exemplar TDD-based solutions. I hope you found it interesting if not enlightening or useful. With three nearly complete rewrites, it was by far the hardest essay I've ever written for my site.

If you have any comments or feedback, please do let me know.



License agreements and usability #

Licenses and end user agreements are almost a joke. Have you read the entire terms of every license agreement before ticking the "I agree" checkbox? It's more like a magical charm than something people take seriously.

I know there's plenty of legal discussion about this, and I don't want to talk about any of that. The generally accepted fiction is that we're supposed to read those licenses. But to uphold that fiction, the providers of these agreements are supposed to want us to read and understand the license.

The license agreement page is part of the user experience, so I've decided out of a sense of perversity to read the licenses and see if they are usable. I'll point out two cases where the providers of a service agreement clearly never expected anyone to actually read and understand the full agreement. Both have been unresponseive to my pointing out the problem. By comparison, I'll end on a high note with a free software project which wasn't following their copyright license to the letter but have been extremely responsive in fixing that problem.

Lufthansa

I bought a flight a few months ago on Lufthansa. They require you to pay homage to the license gods and tick the "I agree" box. It links to the Terms & Conditions - General Conditions of Carriage (Passenger and Baggage), under the fiction that you are supposed to read it.

Guess what? It's incomplete. For example "Article 9: Schedules, Delays, Cancellation of Flights" says "If the legal liability rules apply we offer compensation and assistance according to Art. 14.4.1." but there is no article 14. You can see article 14 is listed in the index, where it shows that there should be 17 articles, but the page ends with article 9.

Now, if you click on the hyperlink for article 17 or go to the PDF you can see the entire terms and conditions, but the top of the page where it describes how to print the terms clearly means to say that this page is the complete T&C. It looks like there aren't any functional tests to ensure that the entire contract is shown.

I'm sure a good lawyer could use this to some advantage.

I sent a bug report to the most appropriate Lufthansa site. I got an email back which said to call a number in Germany. I haven't done that. If they aren't responsive then it's their legal fault.

Apple's iTunes Store

Next is the iTunes license agreement. I bought an iPod last month to replace one I lost a year ago. My sister and brother-in-law had bought me an iTunes gift card some time back and I wanted to finally use it. I went to the store and it said that I had to agree to the iTunes Store Terms and Conditions.

I started to read the license. It is HUGE. Printing takes 24 pages. The license is an agglomeration of licenses for several different services and contains many duplicates. That's why it mentions of the privacy policy occurs three different times in two different forms:

a. Apple's Privacy Policy. Except as otherwise expressly provided for in this Agreement, the Service is subject to Apple's Privacy Policy at http://www.apple.com/legal/privacy/, which is expressly made a part of this Agreement. If you have not already read Apple's Privacy Policy, you should do so now.
At all times your information will be treated in accordance with Apple’s Customer Privacy Policy which can be viewed at: www.apple.com/legal/privacy/.
At all times your information will be treated in accordance with Apple’s Customer Privacy Policy which can be viewed at: www.apple.com/legal/privacy/.
Notice how the first gives an escape clause while the latter two do not? Which one is legally binding?

It also includes an agreement for the App Store. As far as I know, use of the app store has nothing to do with an iPod, and I don't know why I have to make that agreement in order to purchase music. Plus, is their use of "virtual ammunition" some sort of legal term? Like how "munitions" includes cryptography? (I'm stretching things a bit here. I can guess what they mean.) I'll go with the idea that I'm agreeing to the iTunes store licenses on that page, and not agreeing with some other license which just happens to be in the text.

But questions of interpretation a different complaint. I'm talking about the usability of licenses, not the ability to understand the legalities of it. Clearly having somewhat contradictory duplicates hinders understanding.

The license says:

For more information about iTunes Plus, please read the FAQ at http://phobos.apple.com/WebObjects/MZStore.woa/wa/iTunesPlusFAQPage.
I deliberately did not include a hyperlink in that quote. I'm reading the license inside of iTunes, which neither has a hyperlink there nor lets me copy and paste the text. Since my perverse goal is to read what it says to read, I went to Safari, typed in that URL, and pressed enter. It took me to a page which contained instructions to tell Safari to tell iTunes to open up the web page.

Yes, that's right. It's not possible to view http://phobos.apple.com/WebObjects/MZStore.woa/wa/iTunesPlusFAQPage in a standard web browser. Safari automatically opens it in iTunes and Firefox gives me a box saying "This link needs to be opened with an application."

Bringing the page up in iTunes stopped my registration session, which takes place in iTunes. I had to start all over again.

I read the license. And read the license. And read it. Craig points out that there are better things in life to do than to read the license that has almost no legal reality anyway. I said I'm being perverse.

I FINALLY got to the end. Clicked the "I Agree" button. Guess what? The session had timed out. Now, I'm a decently fast reader. During that most recent session I had already read some of the text from the previous attempt and did not attempt to reread it. There is no way that anyone has actually read the text of the license agreement during the short time allotted for them.

My solution was to start over yet again, and click "I agree" without reading the text this time. I assumed that the text had not changed from the previous time I had read it. That's hard to tell.

Apple isn't even keeping alive the fiction that people read the license presented to them. I guess there was no user testing here!

I'm sure a good lawyer could use this to some advantage.

I posted a statement about this on the most likely forum at Apple and got the message that I posted to the wrong place and a pointer to the right place. I sent something there but have received no response.

On the other hand, CDK has been great

Since I've been on this license kick (it's a bad habit kids; don't start), I noticed that the CDK people were distributing code but not following their license. CDK is the Chemistry Development Kit and is an LGPL'ed collection of tools for computational chemistry.

Their jar distribution include jars from a few other projects, some of which are also LGPL'ed. The LGPL says that you have to mention those other projects in the documentation in some way, but the CDK documentation omitted those.

I emailed Egon, who is one of the main developers, and pointed this out. It's a technical flaw that almost no one would care about. I'm very happy to say that he's taken it seriously and the CDK has been going through all their code to make sure they have everything right.

It's a small sample size, and I know I'm comparing service agreements to a copyright license agreement, but it's interesting that those who have the least resources are also the most responsive and sensitive. They are also the ones who depend the most on the good will of other people.

To Egon and Stefan and the others at CDK - great job! You all surely deserve a round of drinks next time we meet. (And no, I wouldn't offer that to members of Apple or Lufthansa. Then again, they are being paid for what they do.)



DNS pre-resolution #

Watched the DNS pre-resolution video where Jim Roskind of Google explains that Chrome does an early DNS lookup. Suppose you follow a URL. Your computer needs to translate the host name into an address. It uses DNS, but that can take a second or so. Chrome optimistically assumes you will want the name resolved to an IP address, so does a DNS lookup for you, saving time. It uses the OS cache to store the address, so just in case the address is no longer in cache, it will also do a DNS lookup during mouseover. During testing they noticed that people take about 200ms between when the mouse first gets to the link and when the button is clicked, which is again time saved.

My second thought (first being "neat trick!") was that this could be used for new sorts of user tracking. Suppose I want to know if there are people in the world who look at old versions of a web page. This might be people who have saved a page for future reference. Every day I'll add a new URL somewhere in my document, like "http://2009-12-04.tracker.dalkescientific.com/". When someone uses Chrome to look at that copy, it will look up that URL. Now, I can control my local DNS server, and I can see if there have been any lookups under the tracker.dalkescientific.com subdomain. Five years from now, if there's a request for that specific domain then I know that someone has been looking at a copy of that page made today.

If the URL is http://$userid.2009-12-04.tracker.dalkescientific.com/ where $userid was the one who read the page in the first place, then I will also have an idea of who is rereading the page, or at least who shared the content in the first place.

DNS preresolution in email programs

Email programs often include a way to view HTML content. I use a Mac and Mail.app does this. The HTML content can contain images loaded from a remote site. If the mail application fetched the image from the remote site, then that site knows that you have received the email. For one, that makes your email address a bit more valuable to those who traffick in that sort of data.

I have Mail.app configured so it does not fetch and display HTML images unless I press a button to specifically do that. If Mail.app (or Firefox or any other mail program which includes an HTML display pane) were to take Google's lead and offer DNS pre-fetching, then I hope it is disabled when image loading is also disabled.

Web analytics

Using special domains gives another way to do web analytics, and one which works even when images and Javascript are turned off. Few browse that way and the results will be much more imprecise than what's possible now. For one, there's no cookie information in the DNS request, so it can't do the user-tracking possible with web bugs.

Even if every browser did DNS prefetching, the only other advantage I can see of this over log analysis is that it's a cheap way to let someone else do the log analysis for you, by pointing to subdomains on a DNS server they control. You'll still need to add those a unique URL in every request of every document, and have a way to map the URL back to the page, so it will be very cumbersome.

Proxies

Hmm. To work properly, Chrome would also have to know that it can't pre-fetch DNS entries which go through a proxy.

Comments not enabled. If you're really passionate about my musing and want to get hold of me, you'll have to figure out my email address and do things the old-fashioned way.



100,000 tasklets: Stackless and Go #

(Sorry for the repost. For some reason I used the wrong date. Now corrected. Since posting this yesterday several people have posted timings which confirm that with the code Pike posted, current Go is roughly 1/2 the performance of current Stackless Python.)

People are talking about Google's Go language. It supports concurrency, which here means light-weight user-space threads and inter-thread communications channels. In Stackless terms these are "tasklets" and "channels." See Richard Tew's more detailed description of the correspondences between these two languages.

In Rob Pike's Google Tech Talk on Go he shows an example which builds 100,000 "goroutines" (their tasklest, and a pun on the word 'coroutine') and does a simple operation in each goroutine. It's at 43:45 into the video. I've transcribed it here by hand, so this might contain typos:

package main

import ("flag"; "fmt")

var ngoroutine = flag.Int("n", 100000, "how many")

func f(left, right chan int) { left <- 1 + <-right }

func main() {
  flag.Parse();
  leftmost := make(chan int);
  var left, right chan int = nil, leftmost;
  for i:= 0; i< *ngoroutine; i++ {
    left, right = right, make(chan int);
    go f(left, right);
  }
  right <- 0;  // bang!
  x := <-leftmost;  // wait for completion
  fmt.Println(x);   // 100000
}

On the Stackless list, Richard Tew commented:

They have an nice example where they chain 100000 microthreads each wrapping the same function that increases the value of a passed argument by one, with channels inbetween. Pumping a value through the chain takes 1.5 seconds. I can't imagine that Stackless will be anything close to that, given the difference between scripting and compiled code.
I was curious so I wrote something up which suggested that Stackless Python was a lot faster than Go for this task. That was on a benchmark of my own devising, based on reading Richard's comment. Yesterday I finally tracked down the code and wrote a direct translation into Stackless:
import stackless
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-n", type="int", dest="num_tasklets", help="how many", default=100000)

def f(left, right):
    left.send(right.receive()+1)

def main():
    options, args = parser.parse_args()
    leftmost = stackless.channel()
    left, right = None, leftmost
    for i in xrange(options.num_tasklets):
        left, right = right, stackless.channel()
        stackless.tasklet(f)(left, right)
    right.send(0)
    x = leftmost.receive()
    print x

stackless.tasklet(main)()
stackless.run()

It's a bit longer and more verbose than Go because there's no syntactical support for the Stackless additions to Python. Question is, how long does it take to run?

My reference is from when Pike did:

wally:~ r$ goc goroutine.go
wally:~ r$ 6.out
100000
wally:~ r$ time 6.out
100000

real    0m1.507s
user    0m0.875s
sys     0m0.626s
wally:~ r$
He jokingly apologized for how long the 'goc' step took, which got a titter because it was quite fast, even on a "little Mac here, it's not very fast."

Let me reproduce that timing test in Stackless:

josiah:~/src dalke$ uptime
14:01  up 2 days, 13:53, 4 users, load averages: 0.46 0.68 0.74
josiah:~/src dalke$ time spython go_100000.py 
100000

real    0m0.655s
user    0m0.512s
sys     0m0.136s
josiah:~/src dalke$ 
("spython" is the name of my local installation of Stackless.) Now, I did my tests on a 2.5 year old MacBook Pro. That should be comparable to his Mac. My Stackless example was faster in every way than the reference Go code, and that includes the code of Python parsing the .py file and compiling it to byte-code.

(Edit: decode contributed timing numbers done on the same machine, with Stackless being almost twice as fast as Go for this benchmark.)

Remember, Go is a compiled language designed for concurrency. I'm working with the C implementation of Python, which compiles only to byte codes, not machine code, and which was not designed for concurrency. Yet the Python code is faster.

Why then does Pike sound so proud about the performance of Go on this timing test? I don't know.

Edit: By the way, I did some other analysis. It takes 0.35 seconds to start up that many tasklets and 0.07 seonds to send a number through them all. I did not time tasklet teardown.

Why do they talk about fast compile times with Go?

One other observation about Go. Pike's video and another I watched stressed the fast compile times. It seems modern development bogs down doing compiles. They gave numbers: 1,000 lines of Go code in 0.2 seconds on some sort of Mac. I took my copy of Python 0.9p1, at 25,000 lines of C code. It compiled in 7 seconds on my MacBook Pro laptop. Scaling up, the same number of lines of Go code would compile (assuming we have identical machines) in 5 seconds. If I compile with CFLAGS=-g then Python compiles in 2.75 seconds.

Since I'm not testing this on the same machine as them it's hard to say anything concrete, but I would be suprised to find out that my laptop was twice as fast as theirs, which it would have to be to make my numbers be worse than what they report for Go. Plus, I've heard the Intel compilers are a lot faster than gcc. (Edit: I tried to build Go on my laptop only to find it doesn't seem to support OS X 10.4.)

Something doesn't make sense here. Why do they tout Go's performance both for goroutine message passing and for compilation as exceptional? The timings seem worse than existing comparables.

Edit based on feedback from Karl on my comments page: It seems that the real advantage is that Go handles large dependancies better than the C or Java language models. They don't want to maintain the Makefile dependencies by hand, and for the large tool sets at Google, simple Java recompiles is dominated by the compiler taking 90+ seconds just to read all the dependency information. Yet in the videos the praise is for the (relatively slow) compile times of a small set of files.

If you have ideas or thoughts, leave a comment.

Addendum

Here's an example of a Go sort:

func Sort(data Interface) {
    for i := 1; i < data.Len(); i++ {
        for j := i; j > 0 && data.Less(j, j-1); j-- {
            data.Swap(j, j-1);
        }
    }
}
It's handled by the interface
type Interface interface {
    Len() int;
    Less(i, j int) bool;
    Swap(i, j int);
}
Types/interfaces are only a partial specification. There's nothing which says that Len() returns a constant value, nor that the size of the array is constant while being sorted. What would happen if the size changes while the sort is in operation?

This was a bug in Python some years back. During list.sort() a user-defined comparison method might reach back to the container and change it. In the worst case situations, I think this caused a crash. This was fixed by making the container read-only during the duration of the sort:

>>> class Evil(object):
...   def __lt__(self, other):
...     data.append(9)
...     return 1
...
>>> data = [1, 2, Evil()]
>>> data.sort()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: list modified during sort
>>>

Why does Go do for this case? Does it crash? The code doesn't expect errors so there's no way to use the multi-valued return types suggested in Effective Go. It doesn't have exceptions. Perhaps there's a global error handler?

If you know, leave a comment.

Addendum #2 - building Go on my laptop

joesb on reddit ask why I didn't compile Go myself, so I could get head-to-head timings.

I went to the installation page thinking to do that. I was put off by the environment variables. This is 2009 and there should be no reason to set any variables. Honestly, that's why put me off.

Still, since this is making its way around the web, I decided to give it a go. The build requires $GOROOT, $GOOS and $GOARCH to be set. I did that. It says $GOBIN is optional, so I didn't set that. Ran the build and "$GOBIN is not a directory or does not exist". Doesn't seem so optional, does it?

I set GOBIN to ~/tmp and restated the build. That got me to "make: quietgcc: Command not found". A bit of digging and I see this where I missed that GOBIN needs to be on the PATH. It does mention that in the documentation, but since it was 'optional' I skipped it. Turns out that if I don't set the option then it uses ~/bin, which doesn't exist on my machine and is why I got the error in the previous paragraph. Only, well, I didn't set GOBIN so the error message is wrong - it should be that ~/bin is not a directory or does not exist.

Fixed my PATH and ended up with

quietgcc -ggdb -I/Users/dalke/cvses/go/include -O2 -fno-inline -c /Users/dalke/cvses/go/src/cmd/cc/y.tab.c
/usr/include/stdio.h:269: error: conflicting types for 'getc'
/Users/dalke/cvses/go/src/cmd/cc/cc.h:573: error: previous declaration of 'getc' was here
make: *** [y.tab.o] Error 1
make: *** Waiting for unfinished jobs....
I've seen this before. It's because I'm running OS X 10.4.11 which has bison version 1.28 and Mac 10.5 updated bison to 2.3, and that introduced some changes. I did write that I have an old Mac, right? And I haven't upgraded it either.

In short, it doesn't look like Go installs on my machine. Perhaps I should get a new OS ... or a new laptop.

I'm not complaining about this limitation. I have an old OS. There's no reason for Google to expect to support it. I'm not interested enough to fix the problem myself. I think the numbers given in Pike's Tech Talk are comparable enough that my question is still valid - why does Pike emphasize the performance of Go's goroutine creation and channel communication when it seems to be slower than Stackless and definitely is not an order of magnitude faster?

Addendum #3 - comparisons on the same hardware

decode compared Go and Stackless on the same machine and reports:

  ~/development/go/dev$ time ./8.out 
  100000
  
  real	0m0.632s
  user	0m0.288s
  sys	0m0.344s
  
  
  
  ~/install/stackless-2.6.4$ time ./python 100k.py 
  100000

  real	0m0.350s
  user	0m0.292s
  sys	0m0.060s
making Python about twice as fast as Go. (The ratio is 1/1.8 if you want to be precise.)

Thanks decode!



Copyright © 2001-2008 Dalke Scientific Software, LLC.