
Wide Finder

More notes on Wide Finder

Background

Some history first. Tim Bray is a well-known developer who writes a widely read blog. In a recent posting he talked about parsing an Apache log file using Erlang to find the top 10 requested URLs under the "/ongoing/When/" path which match a certain pattern.

The Apache log file format contains many lines of the form:

aasb7.neoplus.adsl.tpnet.pl - - [01/Oct/2006:06:34:37 -0700] "GET /ongoing/When/200x/2005/11/01/More-Dots HTTP/1.1"
200 4471 "http://www.google.pl/search?hl=pl&q=more+dots&lr=" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
I added a newline to make it easier to read. Tim uses the regular expression r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) " as the filter, which matches the log file example I gave. Note that the regular expression ends with a space and there's a space in the square brackets. They might not have come across well in this essay.
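
As a quick check (this snippet is mine, not Tim's or Fredrik's), here is that pattern applied to the example line:

# My own sanity check of the filter pattern against the example log line.
import re

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
line = ('aasb7.neoplus.adsl.tpnet.pl - - [01/Oct/2006:06:34:37 -0700] '
        '"GET /ongoing/When/200x/2005/11/01/More-Dots HTTP/1.1" 200 4471')

print pat.search(line).group(1)   # prints: 2005/11/01/More-Dots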

Tim isn't doing this because he needs a way to parse log files. He's doing this as a way to get experience in using Erlang for something he's done before and knows a lot about. I would call it a practice piece, and Emily would say it's a kata.

Tim's conclusion was that regular expressions and file parsing in Erlang were slow compared to languages like Ruby and Perl, which place a stronger emphasis on those features. This stirred up some discussion in the Erlang community as people worked to show how to make the code faster. The belief, perhaps an article of faith, is that it's easier to speed up Erlang's I/O and regexp code than it is for other languages to match Erlang's concurrency support.

Quite a few people chimed in via the comments and by email. Tim provided a roundup of other Wide Finder implementations, in Erlang and in other languages. Santiago Gala contributed a Python version.

Fredrik Lundh's notes and commentary, in Python

Fredrik Lundh wrote a set of notes on Wide Finder starting with Santiago's code. He shows how to speed it up by being smarter with the code and by taking advantage of a dual-core machine. He described what he did through a series of related programs, available from his SVN repository. I summarize each one here along with timings, both as given by Fredrik in his README.txt file and as measured by me on my MacBook Pro with a 2.33GHz Intel Core 2 Duo:

In this essay I'll write a few other variations, all single-threaded, and give timings for each to make them easier to compare. Some interesting observations: I have a faster machine than Fredrik, yet my wf-5 time is a lot slower than his, and there is indeed room for improvement even while staying within the standard library.

[In my next essay I compared these to implementations in other languages.]

What am I doing?

This essay is also a set of practice pieces, since neither Fredrik nor I have any real need for this tool. If I needed something like it I would almost certainly write it using awk, uniq and sort. In addition, the way that Fredrik wrote things up is useful for other programmers because it shows how a good programmer modifies and then tests a program when trying to get better performance out of it.

I decided to carry on in that style. What I wanted to do was compare the timings to a version using mxTextTools. At the recent EuroPython 2007, Marc-André mentioned in passing the new byte code engine in mxTextTools 3.0. Because of the work Fredrik and, to a lesser extent, I did on improving the performance of string operations at the NeedForSpeed event in 2006, I was curious to compare the two timings. And I was curious to see if I could do better than the effbot. :)

I started by implementing my own version using just the standard Python libraries. The biggest speed gain came from using a memory-mapped file instead of using normal file I/O, but Fredrik only showed how that helped the multiprocessing version. I wanted to see how much it sped up the single threaded version.

I used wf-2.py as my base and added to it. I think it's best to start something like this without really reviewing other code, so I didn't look at the other programs in much detail. If this is a practice piece then part of the practice is being able to solve the problem in different ways, and with relatively little guidance. I also find that working out a problem first helps me understand why other implementations were done in a given way.

mmap and re playing together using the buffer interface

Python has a concept called the buffer interface. Strings and a few other data types, including memory-mapped files, implement the buffer interface. String operations (like 'find') and regular expressions can work on buffer objects, which means I can apply the regular expression directly to a memory-mapped file.
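
Here's a small illustration of that first (my own snippet, separate from the numbered programs): both a substring search and a regular expression applied straight to the mmap object. The full program, dalke-wf-7.py, does the real work the same way, with timing code around it.

# Illustration only: string-style searches and regexps both accept the
# memory-mapped file directly, because mmap objects support the buffer interface.
import re, os, mmap

fileobj = open("o1000k.ap")
filemap = mmap.mmap(fileobj.fileno(), os.path.getsize("o1000k.ap"),
                    access=mmap.ACCESS_READ)

print filemap.find("GET /ongoing/When/")                   # substring search on the mmap
print len(re.findall(r"GET /ongoing/When/\S+", filemap))   # regexp applied to the mmap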

# dalke-wf-7.py
import re, os, mmap
from collections import defaultdict

FILE = "o1000k.ap"

import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

t0, t1 = timer(), time.clock()

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

fileobj = open(FILE)
filemap = mmap.mmap(fileobj.fileno(), os.path.getsize(FILE), access=mmap.ACCESS_READ)

count = defaultdict(int)
for m in pat.findall(filemap):
    count[m] += 1

for key in sorted(count, key=count.get)[:10]:
    pass # print "%40s = %s" % (key, count[key])

print timer() - t0, time.clock() - t1

# sanity check
for key in sorted(count, key=count.get)[-10:]:
    print "%40s = %s" % (key, count[key])

In my original code I used finditer instead of findall to find all the matches. finditer returns match objects while findall returns only the matching text. (But read the docs, because findall returns a tuple instead of a string if there is more than one group in the pattern.) Making the extra match objects takes time, and I was able to make my code about 5% faster using findall.
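
Here's the difference in miniature (a standalone snippet of mine, not part of dalke-wf-7.py):

# finditer yields match objects; findall returns the captured text directly.
import re

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
line = '"GET /ongoing/When/200x/2005/11/01/More-Dots HTTP/1.1"'

for m in pat.finditer(line):
    print m.group(1)        # a match object; call group(1) to get the text

print pat.findall(line)     # ['2005/11/01/More-Dots'] -- just the strings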

Had I not tried to implement my code first, I wouldn't have thought of trying 'finditer'. Had I not compared my result to Fredrik's I wouldn't have thought of using 'findall'. Becoming a better programmer means you have to do both.

My single-threaded dalke-wf-7.py runs in 1.3s. Compare that to Fredrik's best single-threaded version at 1.7s and his best dual-process version, wf-6.py, at 1.0s. I think this is the best performance and cleanest code you can get using this general approach, though I would like a shorter way to open a memory-mapped file. I count 12 non-blank/non-timing lines while Tim's original Ruby code has 11. His machine looks about half the speed of mine, so this Python code is roughly 2-3 times faster than his Ruby. That's about what I've seen in other performance comparisons between the two languages.

mxTextTools, with regexps via Martel

Now that I have fast baseline code, I want to compare it to mxTextTools 3.0. There is no regular expression support in mxTextTools. Instead, you define a tag table, which is a set of machine instructions for the tag engine. While there is some documentation on how to use mxTextTools, it's not easy to understand. The examples which come with mxTextTools 3.0 aren't that helpful, and some of them haven't even been updated to work with 3.0.

Some years back I spent some time working on Martel, which uses an (almost) regular expression language as a way to parse semi-structured flat files as if they were XML. The element definitions come from group names in the regular expression, and the text parsing is done with mxTextTools. I wrote some code to have Martel convert Tim's filter regular expression into a tag table. I did change the regexp so that the group has a name, since group names are used as tags in the tag table.

# dalke-wf-8.py
# A version using mxTextTools 3.0, with the tag table generated by Martel

from collections import defaultdict
from mx.TextTools import *

import Martel


FILE = "o1000k.ap"

import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

t0, t1 = timer(), time.clock()

# My Martel package implements a converter from regular expressions to
# mxTextTool tag table, but it doesn't take full advantage of mxTextTools.
expr = Martel.Re("GET /ongoing/When/\d\d\dx/(?P<page>\d\d\d\d/\d\d/\d\d/[^ .]+) ")
tag_table = ("start",
             # For each character, see if it is parsed by the regular expression
             #   if no, go to the next state, if yes go to the "start" tag
             (None, SubTable, expr.make_parser().tagtable, +1, "start"),
             # At end of file?  If no, go to the next state, if yes, stop parsing and declare victory
             (None, EOF, 0, +1, MatchOk),
             # Advance to the next character and start again
             (None, Skip, 1, "start", "start"),
             )
tag_table = TagTable(tag_table)

count = defaultdict(int)

CHUNK = 12*1024

infile = open(FILE)

while 1:
    # Guarantee the complete line is read
    text = infile.read(CHUNK)
    text += infile.readline()
    if not text:
        break

    success, taglist, position = tag(text, tag_table)
    for (_, l, r, _) in taglist:
        count[text[l:r]] += 1

for key in sorted(count, key=count.get)[:10]:
    pass # print "%40s = %s" % (key, count[key])

print timer() - t0, time.clock() - t1

# sanity check
for key in sorted(count, key=count.get)[-10:]:
    print "%40s = %s" % (key, count[key])

Bad news. It's REALLY SLOW. It takes about 26s to process the log file, compared to under 2s for the normal regexp approach and 1.3s for my best single-threaded code.

What's wrong? Well, one thing is that my Martel code is meant as a validator. It assumes most characters will match and it's very bad at searching for a match because it checks the text character-by-character. More sophisticated algorithms, like Boyer-Moore, can analyze the search string and skip testing characters when there's no chance of a match.

For example, the search string "GET /ongoing/When/" only contains the characters " /EGTWeghino". If neither text[i] nor text[i+18] contains a character in that set then it's impossible for the search string to start anywhere in that range. Python's find, index and "x in s" tests use a smarter superlinear substring search algorithm, which Fredrik uses as a first-level filter in his wf-2.py code.
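
Here is a rough sketch of the idea (mine, and much simpler than what Python or mxTextTools actually implement): if the character at the far end of the current window never appears in the search string, no alignment that includes it can possibly match, so skip past it entirely.

# A naive Horspool-style skip search, for illustration only.
PATTERN = "GET /ongoing/When/"
CHARS = set(PATTERN)             # the characters " /EGTWeghino"

def skip_search(text, start=0):
    n = len(PATTERN)
    i = start
    while i + n <= len(text):
        if text[i+n-1] not in CHARS:
            i += n                       # no match can overlap this character
        elif text[i:i+n] == PATTERN:
            return i                     # found a match at position i
        else:
            i += 1
    return -1

The real implementations are cleverer about how far they can skip, but the principle of ruling out hopeless positions cheaply is the same.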

mxTextTools is zippy fast!

mxTextTools also implements a fast substring search algorithm. Rather than using Martel to generate the tag table, I wrote one by hand which takes advantage of this feature:

# dalke-wf-9.py
#  A version using mxTextTools 3.0 and hand-written tag table

from collections import defaultdict
from mx.TextTools import *

FILE = "o1000k.ap"

import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

t0, t1 = timer(), time.clock()

DIGITS = "0123456789"

count = defaultdict(int)

# using the "add_count" CallTag yields about 5% speedup vs. appending
# the result to the tag list and processing after the parsing.
def add_count(taglist, text, l, r, subtags, count=count):
    count[text[l:r]] += 1

tag_table = TagTable( (
    "start",
    # Use mxTextTool's fast "Boyer-Moore-Horspool style algorithm"
    # Failure to find the text means it's the end of the string
    (None, sWordEnd, TextSearch("GET /ongoing/When/"), MatchOk),
    # Match r"\d\d\dx/"
    (None, IsIn, DIGITS, "start"),
    (None, IsIn, DIGITS, "start"),
    (None, IsIn, DIGITS, "start"),
    (None, Word, "x/", "start"),
    # If this part matches, call 'add_count', then go to "start"
    # If the subtable does not match, then go to "start"
    (add_count, SubTable+CallTag, (
                # Match r"\d\d\d\d/\d\d/\d\d/[^ .]+ "
                (None, IsIn, DIGITS),
                (None, IsIn, DIGITS),
                (None, IsIn, DIGITS),
                (None, IsIn, DIGITS),
                (None, Is, "/"),
                (None, IsIn, DIGITS),
                (None, IsIn, DIGITS),
                (None, Is, "/"),
                (None, IsIn, DIGITS),
                (None, IsIn, DIGITS),
                (None, Is, "/"),
                (None, AllNotIn, " ."),
                (None, Is, " ")), "start", "start")
    ) )

# mxTextTools does not support memory mapped files, so process a
# chunk of text at a time.

# Timings for different sized chunks.  There is little variation in these numbers
# 10240K 0.964
#  1024K 0.933
#   128K 0.882
#    16K 1.17     # strange
#    12K 0.728
#    10K 0.745
#     8K 0.777
#     4K 0.962
#     1K 1.455
CHUNK = 12*1024  # best numbers on my machine

infile = open(FILE)

while 1:
    text = infile.read(CHUNK)
    # Guarantee the complete line is read
    text += infile.readline()
    if not text:
        break

    # no need to look at the return type; processing occurs by side effect
    tag(text, tag_table)

for key in sorted(count, key=count.get)[:10]:
    pass # print "%40s = %s" % (key, count[key])

print timer() - t0, time.clock() - t1

# sanity check
for key in sorted(count, key=count.get)[-10:]:
    print "%40s = %s" % (key, count[key])

As you can read from the comments, mxTextTools does not support the buffer interface. It only supports strings, which can be either 8-bit or Unicode, so I had to use a chunking method instead. I read CHUNK bytes followed by a readline to ensure that the resulting text contains only complete lines.

This is a very common technique and it works when there are many "records" in a large file. In this case each record is one line long.
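
The pattern is simple enough to pull out into a helper, shown here as my own sketch (none of the numbered programs use it):

# Read a large file in chunks, extending each chunk to the next newline so
# that no record (here, a log line) is split across two chunks.
def iter_chunks(fileobj, chunksize=12*1024):
    while 1:
        text = fileobj.read(chunksize)
        text += fileobj.readline()       # finish the partial last line
        if not text:
            break
        yield text

# for text in iter_chunks(open("o1000k.ap")):
#     tag(text, tag_table)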

A disadvantage of this approach is that the chunk size is a tunable parameter. You can either guess a reasonable chunk size or run tests to find the best value. I started with a 1MB chunk size and worked my way down to find that 12KB gave me the fastest results. Much to my surprise, 16K was slower than any other value I tried in the entire range, other than 1K. (Memory and disk block sizes are usually 4K, so 1K is not a reasonable chunk size.)
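
The chunk-size numbers in the code comments came from rerunning the same parse with different CHUNK values; a scaffold along these lines (my own reconstruction, simplified) is enough to collect them:

# Time the dalke-wf-9.py parsing loop for several chunk sizes.  The counts
# accumulate across runs via add_count, but only the elapsed time matters here.
import time

def time_chunksize(chunksize, filename="o1000k.ap"):
    infile = open(filename)
    t0 = time.time()
    while 1:
        text = infile.read(chunksize)
        text += infile.readline()
        if not text:
            break
        tag(text, tag_table)             # tag table from dalke-wf-9.py above
    return time.time() - t0

for kb in (1, 4, 8, 10, 12, 16, 128, 1024):
    print "%6dK %.3f" % (kb, time_chunksize(kb * 1024))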

The best case took 0.7s and even the worst sane case at 1.2s shows that using mxTextTools is faster than dalke-wf-7, which was my best single-threaded code at 1.3s. But to get that speed requires using a third-party tool and learning how to use it effectively.

I also commented out the tagging call, so the program does everything except the parsing. The code to do the chunking and the mxTextTools setup takes about 0.36s, or about half of the total time.

Making a faster standard library approach

As I was writing an email to Fredrik describing these results, I came up with another approach to speeding up the performance, using only the standard library.

Fredrik showed that using a two-level filter, with a quick exclusion test using string operations followed by the regular expression test, was faster than doing only the regular expression test. Quoting him:

The RE engine does indeed use special code for literal prefixes, but the superlinear substring search algorithm that was introduced in 2.5 is a lot faster in cases like this, so this simple change gives a noticable speedup.
This works because only about 20% of the lines in the input file match the quick test, and the simple string test is
% python -m timeit -s 's="This is a test.  I was here."*4; t="testXYZ"' 't in s'
10000000 loops, best of 3: 0.194 usec per loop
% python -m timeit -s 'import re;s="This is a test.  I was here."*4; t=re.compile("testXYZ")' 't.search(s)'
1000000 loops, best of 3: 0.98 usec per loop
% python -c 'print 0.98/0.194'
5.05154639175
%
roughly 5 times faster than the regular expression test.

My observation was that I can defer the regular expression test until later. Use the quick string test to find all substrings starting with "GET /ongoing/When/" and ending with the next " " (space). This will include some extra substrings. Tally all of the substrings, including the false positives. This does extra work, but the tallying code is very fast. Once the file has been parsed, post-process the counts dictionary and remove those keys which are not allowed by the regular expression.

This works because there are many duplicate keys. Nearly 50% of the entries which pass the quick string test are duplicates. The keys in the counts dictionary are unique, which means only one regular expression test needs to be done per unique key, instead of one for each match.

If most of the entries were under /ongoing/When/ and most were unique then these optimizations would be a net slowdown. You have to understand your data as well as the software in order to figure out how to improve things, and there will be tradeoffs.

Remember that I mentioned string operations are also available for buffer objects? This means I can do the fast find directly on the memory-mapped file, rather than using a chunk reader. I'll do the quick search for the leading part of the pattern, then another search for the trailing " " (space) character.

# dalke-wf-10.py  fast string ops, mmap, post-process filter
import re, os, mmap
from collections import defaultdict

FILE = "o1000k.ap"

import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

t0, t1 = timer(), time.clock()

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
search = pat.search


def count_file(filename):
    count = defaultdict(int)
    fileobj = open(filename)
    filemap = mmap.mmap(fileobj.fileno(), os.path.getsize(filename), access=mmap.ACCESS_READ)
    i = j = 0
    # For the first pass, include everything which is a reasonable match.
    # It's faster to count everything and filter later than it is to do
    # the filtering now.
    while 1:
        i = filemap.find("GET /ongoing/When/", j)
        if i == -1:
            break
        j = filemap.find(' ', i+19)
        field = filemap[i:j]
        count[field] += 1
        
    # The previous code included fields which aren't allowed by the
    # regular expression.  Filter those which don't match the regexp.
    new_count = {}
    for k, v in count.iteritems():
        # because of the way the key was saved, I didn't keep the
        # trailing space.  Add it back here so the regexp can be used unchanged.
        k = k + " "
        m = pat.search(k)
        if m:
            new_count[m.group(1)] = v
    return new_count


count = count_file(FILE)

for key in sorted(count, key=count.get)[:10]:
    pass # print "%40s = %s" % (key, count[key])

print timer() - t0, time.clock() - t1

# sanity check
for key in sorted(count, key=count.get)[-10:]:
    print "%40s = %s" % (key, count[key])

Variable lookups in module scope are slower than lookups in local scope, so I introduced the count_file function to get a bit more speed. I didn't generate numbers for this change, but experience says it's nearly always a performance advantage.
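
A quick way to see the effect (a standalone sketch, not a measurement from these programs) is to run the same loop once at module scope and once inside a function:

# At module scope, 'total' and 'i' are looked up in the module dictionary on
# every iteration; inside a function they are fast local-variable slots.
import time

t0 = time.time()
total = 0
for i in xrange(1000000):
    total += 1
print "module scope:", time.time() - t0

def loop():
    total = 0
    for i in xrange(1000000):
        total += 1
    return total

t0 = time.time()
loop()
print "local scope: ", time.time() - t0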

The resulting dalke-wf-10 code finishes in 1.0s. Yes, you read that right. It's faster than the mmap/findall solution of dalke-wf-7.py, which took 1.3s. Still not as fast as mxTextTools at 0.7s, but this solution uses only the standard library.

Chunked reading can beat mmap?

I did test a chunk-driven version of this last code, instead of using a mmap'ed file. The strange thing is that with a tuned chunk size I could get it to run faster than the mmap'ed version. It looks like I'll be using a chunk size of 14K in future code! Though I don't know why chunked reads would be faster than mmap here. If you're interested, here it is:

# dalke-wf-11.py  fast string ops, chunk-based reader, post-process filter
import re
from collections import defaultdict

FILE = "o1000k.ap"

import time, sys
if sys.platform == "win32":
    timer = time.clock
else:
    timer = time.time

t0, t1 = timer(), time.clock()

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

# Choosing the right chunk size affects the timings.  Here
# are performance numbers for my machine.
#     255K  1.1477599144 1.12
#      64K  1.18760800362 1.17
#      32K  1.2726931572 1.25
#      16K  1.44729995728 1.42
#      14K  0.951714038849 0.94  # the mmap version is slower, at 1.0 seconds!
#      12K  0.967882871628 0.95
#      10K  0.983233213425 0.97
#       8K  1.02103090286 1.0

CHUNKSIZE = 14*1024

def count_file(filename):
    count = defaultdict(int)
    infile = file(filename)

    # For the first pass, include everything which is a reasonable match.
    # It's faster to count everything and filter later than it is to do
    # the filtering now.
    while 1:
        text = infile.read(CHUNKSIZE)
        text += infile.readline()
        if not text:
            break
        i = j = 0
        while 1:
            i = text.find('GET /ongoing/When/', j)
            if i == -1:
                break
            j = text.find(' ', i+19)
            field = text[i:j]
            count[field] += 1

    # The previous code included fields which aren't allowed by the
    # regular expression.  
    new_count = {}
    for k, v in count.iteritems():
        # because of the way the key was saved, I didn't keep the
        # trailing space.  Add it back here so the regexp can be used unchanged.
        k = k + " "
        m = pat.search(k)
        if m:
            new_count[m.group(1)] = v
    return new_count

count = count_file(FILE)

for key in sorted(count, key=count.get)[:10]:
    pass # print "%40s = %s" % (key, count[key])

print timer() - t0, time.clock() - t1

# sanity check
for key in sorted(count, key=count.get)[-10:]:
    print "%40s = %s" % (key, count[key])

It's all a bad idea

While it's the fastest of the pure-Python solutions, it's not one you should aspire to follow. It's pretty unusual code, as you can see in part by the extra comments I added to help clarify things. It's more prone to accidental breakage if one filter changes without updating the other. It's longer, which means it takes longer to write and to understand. And usually a savings of a second (30%-60%) over the easily understood and maintained version isn't worth the extra problems.

Instead, think of it as a practice piece - a way to hone your skills so that when you do need to get 30% faster code you'll have some ideas of what you can try out and how to measure your results.

Comments

I've never had a comment system here. My web site is just a bunch of static web pages, and I don't want to deal with comment spam. I decided to let Blogger handle it for me, so check out my comment page. (Sorry, no list of the number of comments, or preview of what they are.)

Maybe you can comment on why Tim Bray named this "Wide Finder"? Or tell me why a chunk size of 16K seems so bad on my Mac?



Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me



Copyright © 2001-2013 Andrew Dalke Scientific AB