Martel: Bioinformatics file parsing made easy

Andrew Dalke / Dalke Scientific Software, LLC / dalke@dalkescientific.com

Keywords

Bioinformatics, biopython, computational biology, parsing, Martel, SWISS-PROT, regular expressions, SAX, XML.

Abstract

A goal of the Biopython project [1] is to reduce the amount of effort needed to do computational biology. A large part of that work turns out to be parsing file formats, so we have developed Martel, a parser generator which uses a regular expression as the format description to create a parser that returns the parse tree using the SAX API common in XML processing. The resulting system can handle all of the parsing tasks needed in bioinformatics while being both fast and relatively easy to understand. Martel is not specific to bioinformatics and may be used to parse any regular grammar.
Biopython started in August 1999 as an international collaboration to collect, develop and share free Python tools for computational molecular biology. The distribution contains various parsers, a sequence and an alignment class, algorithms for sequence analysis, a Prosite engine, interfaces to web services including Entrez, Prosite and SCOP, and even a graphical sequence editor. There are five main Biopython developers and the distribution is being used in at least Australia, Europe, South Africa and the United States.

Software development is a large part of modern biology research, so a major goal of Biopython and its sister organizations Bioperl [2], BioJava [3], BioXML [4] and BioCORBA [5] is to reduce the amount of effort needed to develop new science by providing a base set of robust, flexible components on which to build. A surprisingly large amount of that development goes into parsing file formats. Traditional parser generation tools, like lex/yacc, are not very useful for parsing these bioinformatics files, so people have written specialized parsers for each format. Most are written to solve the problem at hand and rarely capture all the data present in a given format, much less provide a general or consistent solution to the wide class of formats.

To remedy that situation we have been developing Martel, a parser generator that uses a regular expression syntax as the format description to create a parser which analyzes a format and returns the results using the SAX API common in XML processing. Martel makes it easy to implement the parsing tasks done by all other available parsers, and it can even do some things which are difficult or impossible in any other existing system. Because of these abilities, I believe Martel will be a killer app for Biopython and the use of Python in bioinformatics. At least, I can hope. Martel is not specific to molecular biology and may be used to parse any regular grammar.

File formats

There are over 200 database formats relevant to molecular biology [6], like Genbank, PDB and SWISS-PROT, and roughly as many program outputs to parse, like BLAST and PHD. These numbers are even higher if the various versions of a given format are included. There are many reasons for the diversity. Each subfield has its own focus and requirements, and the knowledge of what is important and even the vocabulary has changed over time. A nearly-usable format might be too hard to parse, or not be extensible, or be covered under patent. Or it was simply easier to make up yet another format without looking to see if any existing one was usable.

Most readers here will not know the specifics of these formats, but only the general form is important. They would appear familiar to anyone who has worked with data meant for line printers. First off, they are line oriented, so a line of text contains a well-defined piece of data. Lines are arranged in blocks, which may themselves be arranged inside larger blocks. The number of levels of these blocks is fixed and small. All of the formats are designed to be human readable, so people can scan a file or record quickly to see what's in it. The database formats, and some of the program outputs, are also designed to be machine readable, and the machine would usually be programmed in FORTRAN.

Below are the beginning and end parts of a SWISS-PROT record [7], from one of the most commonly used protein databases.

Part of SWISS-PROT record 143E_HUMAN
ID   143E_HUMAN     STANDARD;      PRT;   255 AA.
AC   P42655; P29360; Q63631;
DT   01-NOV-1995 (Rel. 32, Created)
DT   01-NOV-1995 (Rel. 32, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
DE   14-3-3 PROTEIN EPSILON (MITOCHONDRIAL IMPORT STIMULATION FACTOR L
DE   SUBUNIT) (PROTEIN KINASE C INHIBITOR PROTEIN-1) (KCIP-1) (14-3-3E).
GN   YWHAE.
OS   Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat),
OS   Bos taurus (Bovine), and Ovis aries (Sheep).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
...
SQ   SEQUENCE   255 AA;  29174 MW;  40A43E62 CRC32;
     MDDREDLVYQ AKLAEQAERY DEMVESMKKV AGMDVELTVE ERNLLSVAYK NVIGARRASW
     RIISSIEQKE ENKGGEDKLK MIREYRQMVE TELKLICCDI LDVLDKHLIP AANTGESKVF
     YYKMKGDYHR YLAEFATGND RKEAAENSLV AYKAASDIAM TELPPTHPIR LGLALNFSVF
     YYEILNSPDR ACRLAKAAFD DAIAELDTLS EESYKDSTLI MQLLRDNLTL WTSDMQGDGE
     EQNKEALQDV EDENQ
//

Even without reading the format definition [8] you can puzzle out some of the structure. Lines start with a two-character code followed by three spaces, and those first two characters tag the line type. Long lines fold over and repeat the tag text. The semicolon is often used to separate (unnamed) subfields.

Lex/yacc

The standard computer science approach to parsing is to define a lex and yacc grammar for the format. Lex tokenizes the words and passes them to yacc, which puts them together. This works very well for programming languages because they are designed to be parsed with a pipeline architecture, where the lexer sends tokens to the parser with little or no need for information to flow the other way.

Bioinformatics formats are usually messier than that. Often there is a lot of state information to pass around. For example, the word "GENE" might mean four different things in a file depending on the block in which it is found: a keyword, a description, the submitter's name or a protein sequence. It may even depend on the word's character position within the line.

Lex can deal with this using start conditions, but that results in dozens of different states, and the lex description language does not simplify building explicit state machines. The other option is to pass the state information from the grammar definition - which must also describe the token order - back to the lexer. Again, it can be done, but it gets quite messy. Indeed, in the seven years I've been in this field I have only seen lex and yacc used for formats designed to be parsed with those tools.

Instead, most people end up writing their own parsers by hand. Being rational, they don't want to do a lot of extra work, so they write a parser which does exactly what they need it to do. For example, it might read only the data fields which are relevant to a project, or core dump when given improperly formatted data. Rarely are these parsers usable for other projects.

Requirements

Our goal for Martel is to simplify parsing tasks, so we need to be able to do nearly everything people want from a parser. Following is a list of tasks based on long experience in writing and using such systems. This list is somewhat sequence database-centric in that it deals with files containing many sequence records, but otherwise these tasks apply to a wide range of formats, including program outputs. Some of these tasks are subsets of others. For example, a parser which builds a data structure for each record almost certainly can be used to count the number of records present in a file. They are included because people often dash off specialized parsers for just those tasks: using grep ^ID | wc is a very fast way to count the number of records in a SWISS-PROT file. Ideally then, file parsing should be faster if only a few fields are needed or if validity checking is not needed.
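As a rough illustration of how little code such a specialized task takes, here is a Python sketch of that same record count (the filename is made up):

Counting SWISS-PROT records without a full parser
count = 0
infile = open("sprot38.dat")
while 1:
    line = infile.readline()
    if not line:
        break
    if line[:2] == "ID":   # every SWISS-PROT record starts with an ID line
        count = count + 1
infile.close()
print count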

Some additional functional requirements go along with the use cases.

Existing systems

To make sure I fully captured the capabilities people use in a parser, I analyzed 19 different SWISS-PROT parsers [10]. They are listed here for reference: grep/shell utils, Bioperl [2], Biopython [1], BioJava [3], SRS [11], Swissknife [12], Biopy [13], Darwin [14], SeqIO [15], readseq (C) [16], readseq (Java) [17], Boulder [18], molbio++ [19], BioDB-Loader [20], GCG [21], sp2fasta [22], sw2xml [23], NiceProt [24], and get-sprot-entry [25]. All but five are freely available in source form for detailed analysis; information about the others was garnered from published papers, web documentation, web-based interfaces to those programs and by asking experienced people.

None of the programs implemented any parsing abilities which were not given in the task list, and at least one package was able to do each task. (Some of them could do more than parse the input file but I only concentrated on their parsing requirements.) For detailed analysis and discussion of the different programs, please see http://www.biopython.org/wiki/html/BioPython/ParserComparison.html.

SRS and Icarus

Most of the packages are specialized to a single task or use a central dispatcher to determine the line type and then call the associated function to parse the contents of the line. In the latter case, the handlers can often be changed depending on the task. Only one package, SRS with its Icarus language, provides a general parser generator. In many respects, Icarus is very close to what I want Martel to be.

Icarus is the parser and language used by SRS, a commercial bioinformatics database system available from Lion Bioscience. Definitions are available for over 150 different formats [26], including non-sequence databases.

The major problem with lex/yacc was the need for explicit state declaration. Icarus solves that by having the grammar definition drive the parsing instead of the lexer. It only checks those lexical definitions which are expected, making the state declaration implicit.

Icarus is also the programming language used to implement the actions for each terminal in the parse tree. Potentially any data structure can be built but Icarus is a simple language and only really supports the goal of translation to SRS's database data model. I believe in theory Icarus can be used to produce HTML mark-up from an existing format, but the existing SWISS-PROT definition cannot be used for that, nor can it be used to verify that a record is in the correct format.

The most serious problem with Icarus is its specialized nature. It is a parser which has become a programming language and its origins show in the relative opaqueness of the language. I am quite serious when I say I spent an hour trying to figure out a tutorial example and could not understand why everything it used was needed. Because it was designed as a language to normalize input data for the database, it does not allow building new data structures nor are there easy ways to integrate Icarus code with other languages and toolkits.

I have not yet been able to track down enough information about Icarus to judge it fully but I also believe Icarus uses a standard look-ahead 1 algorithm, which means it cannot readily be used for some types of format or version identification that may require arbitrary lookahead.

Finally, the newest versions are commercial products and so Icarus is not free in either sense of the word.

History: DiscoveryBase

Given the lack of general solutions among both the public and commercial parsers, is one really possible? The answer of course is yes, since otherwise this paper wouldn't have been written. How it works and came together is best explained with some history.

I've been writing parsers by hand for bioinformatics data for many years. Part of that time was as an employee of Molecular Applications Group [27]. While there I worked on DiscoveryBase, a Perl-based web application that ties together many different bioinformatics databases and programs. Each database and program required its own parser. These were still hand-written but used a consistent interface with a parser and a callback handler. The formats are nearly all line oriented, so the parser did just enough work to identify a line uniquely, then passed the type information and the line's contents to the callback object for further analysis.

The callback approach turned out to be very handy. The callback handler could decide to do HTML markup, or extract data, or even filter events through to another callback object. Because the task specific work was shifted elsewhere, the parser itself could remain simple and easy to understand. The data blocks were at most only a few levels deep, so the whole parser could be implemented with a few embedded while-loops and without recursive calls. The handlers were mostly a set of simple string and regular expression searches.

Many other parsers, like the ones in BioJava, Biopy and BioDB-Loader, the existing Biopython parser, and the standard Icarus definition, use this approach of having a dispatcher identify line types and pass the text to the appropriate handler. A novel and useful feature specific to the DiscoveryBase parser was what I termed synthetic events, that is, events which were not explicitly tied to a line of text. Each line sent an event to the handler, but some data elements may fold over multiple lines. Synthetic events were sent before and after these data types so that the client code wouldn't need to keep track of the larger context of the file. Below is an example of the events generated from the SWISS-PROT example above.

 
DiscoveryBase events for parts of SWISS-PROT 143E_HUMAN

Event name    Content
__begin
ID            ID   143E_HUMAN     STANDARD;      PRT;   255 AA.
AC            AC   P42655; P29360; Q63631;
__begin_DT
DT            DT   01-NOV-1995 (Rel. 32, Created)
DT            DT   01-NOV-1995 (Rel. 32, Last sequence update)
DT            DT   15-JUL-1999 (Rel. 38, Last annotation update)
__end_DT
__begin_DE
DE            DE   14-3-3 PROTEIN EPSILON (MITOCHONDRIAL IMPORT STIMULATION FACTOR L
DE            DE   SUBUNIT) (PROTEIN KINASE C INHIBITOR PROTEIN-1) (KCIP-1) (14-3-3E).
__end_DE
...
__end

The synthetic events have names starting with "__" and contain no content. I used "__" because I really wanted to use Python instead of Perl.

Most formats are regular

Jeff Chang and I worked together at Molecular Applications Group; Jeff is now the Biopython lead. After we left, we worked on several approaches to generalizing the parsing system from DiscoveryBase. One became the original Biopython parser. It simplified line identification by using a description of how to recognize each line rather than providing a hand-written one. That description uses a regular grammar.

I ended up working on a different parsing system because I was interested in how to increase performance given better guarantees about the data. For example, the output of a program is pretty much guaranteed to be in the right format, or at least a consistent format. Suppose you are interested in only a few items and you know the format is correct. You might be able to seek to the right byte locations and read those items without parsing anything else.

This requires that the parser generator have full details about the language description; details which were hidden inside the callback routines of the Biopython parser. I still liked the idea of describing the format as a set of lines, so I ended up working on a system with two regular languages: one to identify a line using a regular expression and the other to arrange those lines. Jeff's work showed that the line arrangement could be done with regular expressions, which in retrospect is obvious from my statement that the hand-written parsers only needed a few while-loops and no recursion.

The composition of two regular languages is itself regular. That's probably a true-or-false question in an undergraduate language theory course. After a few months, I finally figured out it was true.

That leads to the interesting statement that most bioinformatics formats are described by regular grammars and don't require a full context-free definition. In fact, the context-free formats are almost all easily parseable with existing tools like lex/yacc or SPARK. A context-free grammar can always describe a regular language, but the emphasis here is on easily parseable, so the goal of my project ended up being a way to simplify parsing regular languages.

By the way, formally speaking the formats are not regular. When I say "regular" I am using the informal definition of "can be parsed with Perl5-like regular expressions." Some of the non-regular features of regular expressions are back references, where "(\d+) \1" matches "1 1" and "999 999", and named group repeats, a construct I created and describe below which allows "(?P<count>\d)( \w+){count}" to match "1 Hello" and "4 This is some text."

Features of regular expressions

Regular expression engines meet many of the functional requirements listed earlier yet do not exclude the other requirements.

Perhaps their best feature is that most programmers in bioinformatics use Perl and so know at least a bit about Perl's regular expressions and how to use them. For those that don't, there are many available references on-line and at the bookstore. Additionally, Perl's regular expression syntax has become the accepted standard, so even non-Perl programmers know how to use it.

On a technical level, regular expression engines can potentially be very fast. They also naturally allow for unlimited backtracking, which makes version identification very easy. For example, suppose "P1" is the regular expression for one format and "P2" for another. Then "P1|P2" is an expression which parses either format - if the first fails, the second is attempted.

Returning data

Suppose you have a regular expression which describes a given format. Compile it with the re module and match it against the string to parse. It may match the string, but there's no way to get all the information about what it matched.

Take a simple example from the SWISS-PROT format, which allows multiple accession numbers (a type of identifier) on a line. The format as a regular expression is AC   (\w+);( (\w+);)*. When matched against "AC   P42655; P29360; Q63631;" you'll find that group 1 is "P42655", group 2 is " Q63631;" and group 3 is "Q63631". There is no way to get the list of the two different substrings matching the third group - only the last match for "(\w+)" is returned.

That's because while the regular expression effectively produces a parse tree for a string, the APIs for Python, Perl, Tcl, pcre, GNU Regex and every other package I know of only return the last match for each branch of the tree.
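A quick demonstration with Python's standard re module shows the problem - the middle accession number is simply unrecoverable:

Repeated groups only keep their last match
import re
m = re.match(r"AC   (\w+);( (\w+);)*", "AC   P42655; P29360; Q63631;")
print m.groups()
# prints ('P42655', ' Q63631;', 'Q63631') - "P29360" is gone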

Given that I could not find a regular expression parse engine which does this, I decided to write my own, but I needed to figure out how to pass the parse tree back to the caller. Jumping sideways for a moment, XML documents are also tree data structures so XML processing has the same need. They've solved it by defining two different interfaces: DOM for passing around a tree data structure and SAX for generating callbacks based on a tree traversal.

There are a lot of existing tools and documentation for working with XML so I decided to implement one of those APIs. The DOM model is quite complicated and not easily usable for large data sets, while I had already enjoyed using a callback mechanism in DiscoveryBase, so I went the SAX route - more specifically, I used the SAX2 API.

The difficulty is that SAX needs element names, which aren't available from the Perl5 regular expression syntax. Using the group number for each match is not acceptable because counting embedded parentheses inside a pattern string is very cumbersome, and adding a new parenthesis or data element requires renumbering in all the callbacks.

Luckily, Python regular expressions introduced me to the concept of "named groups," which are defined with the (?P<name>...) notation. The start and end of a named group map very naturally to the startElement/endElement events in SAX. Replacing the (\w+) in the previous definition with (?P<ac_number>\w+) creates a parse tree where some of the nodes have the name "ac_number". During traversal of the parse tree, descent into a named node maps to a startElement("ac_number", ...) and ascent out of the node maps to an endElement("ac_number"). Below is an example of the parse tree for the AC line of the SWISS-PROT example and the corresponding SAX events.

Parse tree for a SWISS-PROT AC line and the SAX events from its traversal

[Figure: the parse tree for the AC line]

characters("AC   ")
startElement("ac_number", ...)
    characters("P42655")
endElement("ac_number")
characters("; ")
startElement("ac_number", ...)
    characters("P29360")
endElement("ac_number")
characters("; ")
startElement("ac_number", ...)
    characters("Q63631")
endElement("ac_number")
characters(";")

Martel - a new regexp engine

Few of the existing regular expression engines understand named groups and none internally create parse trees, so I wrote my own, which is called Martel. It reuses a lot of existing code and ideas. First off, the regular expression parser is a modified version of Fredrik Lundh's sre_parse.py from Python 2.0. It is used to convert the pattern string into an Expression tree. Another part of Martel converts an Expression into a set of tag tables for Marc-André Lemburg's mxTextTools [28], and wraps a SAX-like parser interface around the result. And of course, using SAX means I get to borrow all the existing XML tools and expertise.

Building a large pattern string then parsing it is very error prone, since mistakes - like a missing parenthesis or backslash - are not caught until they are embedded somewhere in a very large string, and regular expression pattern parsers are not very helpful in identifying those problems. Greg Ewing's Plex [29] uses a Python-based regular expression description, which reduces many of the regexp errors to Python errors - and Python is much better at pointing out where an error may be found. I introduced many of his commands into Martel as alternate ways to build up an Expression tree, and added a few more for regular expression features Plex doesn't support.

Most of the standard expression syntax is supported, including branches, greedy repeats, character classes, back references and look-ahead assertions. A few constructs are not, either because they just haven't been written yet or because they are hard to implement on top of mxTextTools, like non-greedy expressions.

Using Martel

There are four steps to using Martel: make the format definition, construct the parser, set up the SAX callbacks and evaluate the input text. Here is how they are applied to parse the AC line of a SWISS-PROT record.

The listing below shows part of the SWISS-PROT format definition from Martel.formats.swissprot38. It is written in my characteristic style of using regular expression strings for small patterns and assembling them with the Plex commands. For the curious, Group("name", Re("pattern")) is identical to Re("(?P<name>pattern)").

Portion of the SWISS-PROT format definition
ID = Martel.Group("ID", Martel.Re(
       r"ID   (?P<entry_name>\w+) +(?P<data_class_table>\w+); +" \
       r"(?P<molecule_type>\w+); +(?P<sequence_length>\d+) AA\.\R"
     ))

AC = Martel.Group("AC", Martel.Re(
       r"AC   (?P<ac_number>\w+);( (?P<ac_number>\w+);)*\R"
     ))
AC_block = Martel.Group("AC_block", Martel.Rep1(AC))

     ...

end = Martel.Group("END", Martel.Str("//") + Martel.AnyEol())

record = Martel.Group("swissprot38_record", \
    ID + \
    AC + \
     ...
    end
    )

(The "\R" and "AnyEol()" expressions will be described shortly.)

Making the parser from an expression is simple. Start with the format definition and use its make_parser() method. Since I want a short example, I'll only make a parser for an AC line and ignore the full record definition.

Making a parser
from Martel.formats import swissprot38
parser = swissprot38.AC.make_parser()

This newly created parser implements the xml.sax.xmlreader.XMLReader interface so the next step is to set up the callbacks. For this example, I'll convert the events to XML by using the XMLGenerator class for the ContentHandler. The default action on errors is to raise an exception so I don't need to change the ErrorHandler.

Setting up callback for XML output
from xml.sax import saxutils
parser.setContentHandler(saxutils.XMLGenerator())

An XMLReader can take a string, URL or file handle as input. For this example, I'll pass in the example SWISS-PROT AC line as a string. (The third accession number was removed to reduce the output size.)

Parse a string to produce XML
>>> parser.parseString("AC   P42655; P29360;\n")

<?xml version="1.0" encoding="iso-8859-1"?>
<AC>AC   <ac_number>P42655</ac_number>; <ac_number>P29360</ac_number>;
</AC>
That showed how to parse a string containing a line from the accession definition. Parsing a complete record from a URL is just as easy:

Converting a full record to XML

>>> from Martel.formats import swissprot38
>>> parser = swissprot38.record.make_parser()
>>> from xml.sax import saxutils
>>> parser.setContentHandler(saxutils.XMLGenerator())
>>> parser.parse("http://www.expasy.ch/cgi-bin/get-sprot-raw.pl?P42655")

<?xml version="1.0" encoding="iso-8859-1"?>
<swissprot38_record><ID>ID   <entry_name>143E_HUMAN</entry_name>     <data_class_table>STANDARD
</data_class_table>;      <molecule_type>PRT</molecule_type>;   <sequence_length>255</sequence_
length> AA.
</ID><AC>AC   <ac_number>P42655</ac_number>; <ac_number>P29360</ac_number>; <ac_number>Q63631</
ac_number>;
</AC><DT_created>DT   <day>01</day>-<month>NOV</month>-<year>1995</year> (Rel. <release>32</rel
ease>, Created)
</DT_created><DT_seq_update>DT   <day>01</day>-<month>NOV</month>-<year>1995</year> (Rel. <rele
ase>32</release>, Last sequence update)
...
</SQ_data><SQ_data> <sequence>EQNKEALQDV EDENQ</sequence> </SQ_data></SQ_data_block></sequence_block><END>// </END></swissprot38_record>

Writing SAX handlers

Suppose you need to reformat the SWISS-PROT record for HTML and want to put the accession numbers in bold text. The simplest way to do this is to create a new ContentHandler that writes the text from the characters callback to stdout and writes the <b> and </b> tags when the ac_number startElement and endElement events arrive. Below is a class which does just that.

ContentHandler to embolden accession numbers
import sys
from xml.sax import handler
class ACBold(handler.ContentHandler):
    def characters(self, s):
        sys.stdout.write(s)
    def startElement(self, name, attrs):
        if name == "ac_number":
            sys.stdout.write("<b>")
    def endElement(self, name):
        if name == "ac_number":
            sys.stdout.write("</b>")

>>> parser = swissprot38.AC.make_parser()
>>> parser.setContentHandler(ACBold())
>>> parser.parseString("AC   P42655; P29360;\n")
AC   <b>P42655</b>; <b>P29360</b>;

Suppose instead you want to capture all of the accession numbers as a list of strings. This can be done by creating a new ContentHandler that stores the characters events between the start and end of the ac_number elements (a sketch appears below), but for variety I'll then use DOM and show a complete example of parsing the elements from a file.
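For comparison, the ContentHandler version is only a few lines (a sketch; the class name is made up):

Collecting accession numbers with a ContentHandler
from xml.sax import handler

class GetACs(handler.ContentHandler):
    def startDocument(self):
        self.acs = []       # accumulated accession numbers
        self.in_ac = 0      # are we inside an ac_number element?
    def startElement(self, name, attrs):
        if name == "ac_number":
            self.in_ac = 1
            self.acs.append("")
    def endElement(self, name):
        if name == "ac_number":
            self.in_ac = 0
    def characters(self, s):
        # characters may arrive in several pieces, so concatenate
        if self.in_ac:
            self.acs[-1] = self.acs[-1] + s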

Get a list of accession numbers using Martel and DOM

from xml.dom.sax_builder import SaxBuilder
from Martel.formats import swissprot38

parser = swissprot38.record.make_parser()
dh = SaxBuilder()
parser.setContentHandler(dh)
parser.parse(open("100K_RAT.swiss").read())

for ac_node in dh.document.getElementsByTagName("ac_number"):
    print "AC", ac_node.get_firstChild().get_data()

By comparison, here is the existing Bioperl and Biopython code to do just the parsing part of this task. The Bioperl example builds up a common data structure and is similar in that respect to SeqIO and readseq. The Biopython code uses a dispatcher and is similar to the BioJava, Biopy and Swissknife approach.

Bioperl code to capture all accession numbers

until (!defined ($buffer)) {
    $_ = $buffer;
    ...
    #accession number(s)
    elsif( /^AC\s+(.+)/) {
        $acc_string .= $acc_string ? " $1": $1;
    }
...
$acc_string =~ s/\;\s*/ /g;
( $acc, $sec ) = split " ", $acc_string;
$seq->accession_number($acc);
foreach my $s (@sec) {
    $seq->add_secondary_accession($s);
}

Biopython code to capture all accession numbers

class _Scanner:
    _scan_fns = [_scan_id, _scan_ac, ...]
    ...
    def _scan_ac(self, uhandle, consumer):
        self.scan_line('AC', uhandle, consumer.accession, any_number = 1)
    ...

class _RecordConsumer(AbstractConsumer):
    def accession(self, line):
        cols = string.split(self._chomp(string.rstrip(line[5:])), ';')
        for ac in cols:
            self.data.accessions.append(string.lstrip(ac))

From these examples you should be able to see that Martel makes it easier to describe and parse the contents of a file than existing solutions.

Reducing callback overhead

The parser doesn't know which events are needed by the ContentHandler so it must send all of them. In the previous example of the ACBold class, only one of the roughly 90 tag types is needed. Python's method call overhead is not light so sending unneeded events may cause an appreciable performance hit. Additionally, for each event there will be a table lookup or a set of if/else statements to determine the appropriate action. Reducing the number of events reduces this overhead as well.

One way to solve the problem is to write a new format definition without the unneeded groups. Martel has a function called select_names which takes an expression and a list of names as its two parameters. It returns a new expression that describes the same format as the original except that only the specified events will be created.
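For example, here is a sketch of trimming the SWISS-PROT definition down to just the accession number events, assuming the swissprot38 module shown earlier and that select_names is exposed at the package level:

Filtering events with select_names
import Martel
from Martel.formats import swissprot38

# The new expression still matches full records, but only the
# "ac_number" groups generate startElement/endElement events.
format = Martel.select_names(swissprot38.record, ["ac_number"])
parser = format.make_parser()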

The first test of the reduced parser was the SwissProtBuilder test, which reads all of the records in the SWISS-PROT 38 release and builds the Biopython SwissProt object for each record. After filtering out the unneeded events, the overall performance, including all of the object creation, improved by about a third.

Debugging

When an error occurs during mxTextTools processing, it only returns the elements which have been completely parsed. It does not return a list of subtables which have only been partially parsed. (Subtables are an mxTextTools feature used to implement groups, both named and unnamed.) If an error occurs while parsing a group, the error position is reported as happening after the last position successfully parsed. Because proper XML demands that a document contain a single outermost tag, all of the format definitions use a named group to describe the full format, which means the error position will be reported as occurring at or after the first byte.

This obviously isn't very helpful, but fixing it has a large performance impact. Instead, better reporting is available by setting a debug_level when creating the parser. This is an optional parameter to the make_parser method, as in "format.make_parser(debug_level=1)". The default level is 0, which adds no additional debugging support. Setting it to 1 inserts code that captures the location of all successful partial matches and assumes the furthest location is the error position. Setting it to 2 also adds printouts at every successful match showing the location around the end of the match and the regular expression pattern which matched. Running with debugging support enabled makes for a slower run-time, but finding errors in the format definition or data file is much faster.

Example debugging output with debug_level = 2
>>> from Martel.formats import swissprot38
>>> parser = swissprot38.AC.make_parser(debug_level = 2)
>>> parser.parseString("AC   P42655; P29360; Q63631;\n")
Match 'AC   P426' (x=1): 'A'
Match 'AC   P4265' (x=2): 'C'
Match 'AC   P42655' (x=3): ' '
Match 'AC   P42655;' (x=4): ' '
Match 'AC   P42655; ' (x=5): ' '
Match 'AC   P42655; P' (x=6): '[\\dA-Z_a-z]'
   ...
Match '  P42655; P29360' (x=11): '(?P<ac_number>[\\dA-Z_a-z]+)'
Match ' P42655; P29360;' (x=12): '\\;'
Match 'P42655; P29360; ' (x=13): ' '
   ...
Match ' Q63631;\012' (x=28): '( (?P<ac_number>[\\dA-Z_a-z]+)\\;)'
Match ' Q63631;\012' (x=28): '( (?P<ac_number>[\\dA-Z_a-z]+)\\;)*'
Match 'Q63631;\012' (x=29): '(\\n|\\r\\n?)'
Match 'Q63631;\012' (x=29): 'AC   (?P<ac_numbe ... ]+)\\;)*(\\n|\\r\\n?)'
Match 'Q63631;\012' (x=29): '(?P<AC>AC   (?P<a ... +)\\;)*(\\n|\\r\\n?))'
>>>

Large files

The mxTextTools engine will only evaluate an in-memory string - it cannot even use a memory-mapped file. Some of the data files to parse are several hundred megabytes in size and don't fit into the memory of at least my laptop, so we needed to develop an alternate solution. Nearly all of the large files are actually concatenations of many much smaller records, with perhaps an additional, short header and footer.

The record locations are designed to be very easy to find. For SWISS-PROT, all records start with a line starting with "ID   " and end with the line "//\n". This let us write a two-stage parser where the first stage reads individual records, which are passed to the second stage for further analysis.

The first stage parsers are called RecordReaders. One RecordReader is StartsWith, which finds records starting with a given string. Another is EndsWith, which finds records whose last line starts with the given string. The readers are implemented by reading a large block of data at a time then passing the string to mxTextTools to find the record start or end positions.

The composition of these two parsers is done at the format definition level. The ParseRecords expression stores an element name and an expression. As a definition it acts like a named group of the given name containing one or more copies of the given expression. A ParseRecords instance also stores a RecordReader constructor and an optional argument list. These are used when the make_parser method is called. What's returned is a RecordParser object that implements the XMLReader interface using the two different stages.

SWISS-PROT format definition reading a record at a time
import Martel
from Martel import RecordReader
from Martel.formats import swissprot38

format = Martel.ParseRecords("swissprot38", swissprot38.record,
                             RecordReader.EndsWith, ("//",))
parser = format.make_parser()

parser.parseFile("/home/dalke/databases/swissprot/sprot38.dat")

There is also a HeaderFooter expression object which handles the more general case where there may be an optional header or footer.

In theory this two-stage separation is not needed. With sufficient analysis of the format expression, the parser generator could determine that no backtracking is possible at a given point (eg, after the "\n//\n" indicating the end of a record) and deallocate any text stored before there. At the other end, it could read input text only when the engine has reached the end of the current in-memory buffer. That analysis is more complicated than I have had time to dedicate to it.

Newlines

Most of the databases are stored using the unix newline conventions of "\n". Macintosh computers have had a long presence in biology labs so a lot of data is produced using the "\r" convention. A lot of people use Microsoft Windows, which uses "\r\n". To make matters even worse, people will concatenate files from different sources into one file with mixed newline conventions.

We need to be able to handle all of these possibilities so we decided to be platform neutral in the same fashion as Java, where any of "\n", "\r" or "\r\n" is recognized as a newline. This means we could not use the readlines method of file objects because that uses the machine's local convention. Instead, the tag tables used by the RecordReaders were modified to handle the three cases.

The other half of the problem was the format definitions. They had been using the character "\n" to mean newline, but the implementation converts that to a check for ASCII character 10. This is fine if the newlines in the input file match the machine's definition and the file is opened in ASCII mode, for then C's stdio does the conversion. That was not the case so we needed to either have "\n" match all three endings or use a new character.

Under the guiding philosophy of "things that look the same should act the same", we decided to keep "\n" as meaning "match ASCII 10" and instead use "\R" to mean "match any of the three standard newline conventions." When used in a character set, as in "[\R]", it means to match either the "\r" or "\n" characters. The sequence "\R" was chosen because it is not used by Perl's or Python's regular expression pattern syntax. The Plex-style equivalent is "AnyEol()."
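As a quick check of the intent, a parser built from a definition ending in \R should accept all three conventions. A sketch using the AC definition from earlier (the default ErrorHandler raises an exception if a line does not match):

One parser, three newline conventions
from Martel.formats import swissprot38

parser = swissprot38.AC.make_parser()
parser.parseString("AC   P42655;\n")    # Unix
parser.parseString("AC   P42655;\r")    # Macintosh
parser.parseString("AC   P42655;\r\n")  # Microsoft Windows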

Named group repeats

A few of the formats could not be parsed with the standard expression syntax because they use one field to store a count of the number of fields that follow. In the simplest example from the MDL Molfile format, the line "SKP N" means the next N lines of input are skipped.

I ended up creating a new expression construct which I call a "named group repeat." It is a modified form of the ranged repeat construct "{n}" and allows any group name to occur in place of n. When evaluated, it takes the integer value of that group's last match and uses it as the repeat count.

This allows the SKP instruction above to be parsed using:

  Martel.Re("SKP (?P<count>\d+)\R([^\R]*\R){count}")
            -- or in the Plex form --
  Martel.Str("SKP ") + Martel.Integer("count") + Martel.AnyEol() + \
     Martel.RepN(Martel.ToEol(), "count")

Performance

Martel's performance is quite good. In fact, for the benchmark test it is just under 20% faster than the equivalent Biopython code and slightly faster than Biopy, which does much less work. The major reason, of course, is that mxTextTools is a really fast parser. In the benchmark it takes under 7 minutes to parse a 225 MB file and another 17 minutes to send all of the callback events. (One of the next projects is to work on reducing that overhead.)

Martel can be sped up by removing events which would be ignored, using the select_names function described earlier. If only two fields are needed, as for making FASTA output, the callback overhead goes down to 3 minutes, or 9.5 minutes total. The closest equivalent code from BioJava takes 2.25 minutes, but it does not validate the format, so a better comparison would be a new Martel grammar which similarly does less identification.

A more surprising result is that Martel is just over 20% faster than the closest equivalent Perl code from the Bioperl project. Both of the Biopython related parsers are faster than Bioperl's while doing more work. I believe the reason may be Bioperl's if/elsif chain in the dispatch handler. If there are N tags and an even distribution of M tag events then there will be O(N*M) checks. By comparison, the two Python projects use a lookup table for the dispatch, which should be O(M). (Using a dispatch table is more important for Martel because it sends many more events than the existing Biopython parser.)

Format and Version detection

The exact details of a database format often change over time as new fields are needed or old ones replaced. The database files are designed to be parsed by machine, so they are usually stable and changes fit a predictable scheme. Program outputs are a different matter. Many programs are written expecting their outputs to be read by humans, so less attention is paid to keeping them easily parseable. They may, for example, add extra output fields, change the field ordering or insert extra whitespace.

The most notorious of these in bioinformatics is BLAST [30], which searches a database for sequences sufficiently similar to a query sequence. I have had a parser break between version 2.0.2 and 2.0.3 of BLAST because an extra space was prepended to a line. It is good that a parser is able to detect unexpected changes in a format and raise an error, since otherwise the change might be silently ignored. There still needs to be a way to handle parsing many alternate versions.

The Bioperl BLAST parser is perhaps the best known parser for this task. Written by Steve Chervitz, it can handle many different versions of NCBI BLAST as well as some of the development forks like WU-BLAST and PSI-BLAST. It works by using a lot of if/else statements to handle each case, as shown below.

Part of Bioperl's BLAST parser code for handling multiple versions
    if( $data =~ /(.+?)${Newline}CPU time: (.*)/so) {
        # NCBI-Blast2 format (v2.04).
        ...
    } elsif( $data =~ /(.+?)${Newline}Parameters:(.*)/so) {
        # NCBI-Blast1 or WashU-Blast2 format.
        ...
    } elsif( $data =~ /(.+?)$Newline\s+Database:(.*)/so) {
        # Gotta watch out for confusion with the Database: line in the header
        # which will be present in the last hit of an internal Blast report
        # in a multi-report stream.
 
        # NCBI-Blast2 format (v2.05).
        ...
 
    } elsif( $data =~ /(.+?)$Newline\s*Searching/so) {
        # trying to detect a Searching at the end of a PSI-blast round.
        # Gotta watch out for confusion with the Searching line in the header
        # which will be present in the last hit of an internal Blast report
        # in a multi-report, non-PSI-blast stream.
 
        # PSI-Blast format (v2.08).
        ...
    }

Steve did an excellent job at writing this parser but maintaining it requires detailed knowledge about what each version might do. He no longer maintains this code and because of its complexity no one else has taken up the maintainer mantle.

Martel includes a BLAST parser, but it is not yet as complete as the Bioperl one. The existing Biopython BLAST parser also does a good job of parsing the different formats, so there has not been a pressing need to work on Martel definitions. As mentioned earlier, Martel should be able to handle alternate versions of a format very easily by specifying that the BLAST parser is:

    blast = blast_2_0_10 | blast_2_0_9 | ... | wublast_2_0 | \
            blast_1_4_11 | ...
Adding support for a new version is a simple matter of prepending a new definition to the list. The BLAST programs all display version information near the top of the file, so identification should also occur quite quickly. Since the format definitions are independent of each other, they are also a lot easier to maintain than a large set of if/else statements.
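For example, when a hypothetical 2.0.11 release appears, its definition goes on the front:

    blast = blast_2_0_11 | blast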

While Martel has not been tested with BLAST versions, it has been used to automatically recognize and parse different file formats. The examples directory of the distribution includes a program named toxml.py which defines a new format as:

    format = swissprot38.format | PDB_2_1.format | MDL_10_1996.format | \
             blastall_2_0_10.format
This new format can be used to make a parser which converts files in any of the given formats to the corresponding XML version. The only other existing programs which manage this, like SeqIO, do so in the Bioperl fashion, with specialized hand-written detectors inside if/else chains.

Validation

The final task in the requirement list was to use Martel to validate a file format. Anyone who has worked with regular expressions will realize how finicky they are, so it's no surprise that Martel can be used to check if a file is in the right format.

What is startling is how few formats agree with their documentation. Sometimes it is because the documentation is incomplete, as when a format contains undocumented fields or when certain fields are allowed to be optional. In other cases, there actually is a problem in the output format, as when lines appear in the wrong order or the year field contains 5 digits. As these have come up we have sent email to the providers asking for clarification.

Still, the number of problems identified in the existing databases is a clear indication that few other programs have achieved this level of validation checking. Perhaps we can convince the database providers to validate all new releases against Martel.

Language independence

Martel is very useful for Python programmers but other people use other languages and that variety is not going to change. A key feature of Martel's format definitions is that they can be written as a single regular expression pattern. Indeed, every expression defines the __str__ method to return the corresponding pattern string.

Ideally then, someone could start with a definition written in Python and export it as a pattern string. The syntax is well defined and could be imported by a Martel-like processor written in another language. The behavior of each part of the pattern is also well defined so this new engine would easily be able to reproduce the same SAX events as Martel given the same inputs. In other words, Martel's format definitions are language independent.
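For example, the AC definition from earlier could be dumped as a plain pattern string for another engine to import. The exact output shown is illustrative; note that \R expands to the newline alternation seen in the debug output:

Exporting a format definition as a pattern string
from Martel.formats import swissprot38

pattern = str(swissprot38.AC)
# yields something like:
#   (?P<AC>AC   (?P<ac_number>\w+)\;( (?P<ac_number>\w+)\;)*(\n|\r\n?))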

We have experimented a bit with Fourthought's XSLT engine [31]. XSLT is a transformation language written in XML that describes how to convert one type of XML to another. XSLT is not a full programming language but does provide enough functionality to merge, rearrange and remove parts of a XML document. XSLT may be useful for two reasons. First, most people are only interested in the semantic information in a file. This requires discarding the extra syntactical data in a file and merging fields that had been folded over several lines. XSLT is powerful enough to be able to do that.

Just as exciting is that XSLT is itself language independent. Combined with a Martel definition, this means there is a language-independent way to parse almost every bioinformatics data file and present the cleaned information (XML with the syntactical fields removed) to a database, search engine or other analysis tool.

Ease of use

Martel is just starting to be used by the other Biopython developers so it is hard to quantify its ease of use. The general consensus seems to be that it is powerful - Brad Chapman termed it "regular expressions on steroids" - and able to meet its other design goals.

There are several areas of concern. A lot of Martel's parts were taken from other systems, including sre, Plex, mxTextTools, XML and SAX. Most of these parts are quite novel in this field, so there is a learning curve to using them effectively. On the other hand, the most user-directed of these, regular expressions and SAX, are very well documented, so help in using them is readily available.

Even then, regular expressions are used much less frequently in Python programming than in Perl. One developer has even pointed out that this is his first serious use of regular expressions; the Plex-style API has helped him out a lot. He also pointed out that since I have a lot of experience writing regular expression patterns, my examples are not very helpful because they can be quite complex. This can be addressed by reworking some of the format definitions to be more Plex-like and by continuing to develop the tutorial.

Another developer is used to regular expressions from Perl programming but is running into problems because Martel's regular expression evaluation behavior is not the same as Perl's. These differences are covered in the Bugs section below.

One problem with writing format definitions is knowing which information is important. The different fields on a line are obviously important, but knowledge of the arrangement of lines into larger blocks is also important though less apparent. It takes some practice to learn where and how to define a format. People seem to be picking up that sense pretty quickly, although more documentation would also help.

Another usability issue concerns the use of SAX itself. I've found that most people are not used to callback based programming and expect a more iterator based approach. Sean McGrath's RAX [34] suggests an intermediate solution where records are built using SAX but returned through an iterator. We have experimented with this concept further and while it may be useful, there is not enough experience yet to judge.

Bugs

There is a major design bug in Martel which thankfully has an easy workaround. Backtracking does not work with the greedy repeat operator because I can't figure out how to do it using mxTextTools. For example, ".*\n" will always fail because the "." will consume the final "\n" and not backtrack. This case can be replaced with a pattern which doesn't depend on backtracking, like "[^\n]*\n", but I still consider it a serious usability problem.

A more subtle problem comes up in the expression "(AB)+A". This will raise an internal assertion when matched against the string "ABA" because the "A" in "(AB)" will partially match the final "A" of the string then backtrack and match with the final "A" of the expression. The bug in this case is that the partial match is not discarded for unnamed groupings. Using a named group will work as expected, as in "(?P<foo>AB)+A".

A last implementation bug exists with the or operator. Once one of the parts of the expression matches, none of the other parts will be tested during any backtracking. Consider the expression "primer|primer_bind" matched against the string "primer_bind". The "primer" part of the expression matches the string. The substring "_bind" remains but there is nothing in the expression for it to match against, so the parser will exit with an error. So far the workaround to this problem is to put the fields in an order where it won't be a problem. With more complex cases it may require a lookahead assertion.
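For the example above, listing the longer alternative first is enough (the variable name is made up):

    feature = Martel.Re("primer_bind|primer")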

With all these bugs, a working Martel definition matches a subset of what the same pattern would match as a full regular expression. If the Martel pattern matches correctly then so will a standard regular expression. This at least allows someone in the future to improve the parser engine without breaking existing definitions.

Possible long-term work

There are a few projects that I would like to see implemented but haven't really mentioned earlier. In most cases, neither I nor the other Biopython developers have the time or expertise to implement them soon.

The simplest of these is keeping up with new format definitions. The database providers change their formats about once a year, as do the outputs from programs. Even limiting things to the major databases and programs, this means tweaking a format definition every week or two. My hope here is to persuade the providers to also distribute a format definition with their data sets and tools - ideally a definition which they have also used to validate their output.

Keeping track of versions is a weak spot in Martel. Currently it is being done at the tag level, so a parser for SWISS-PROT release 38 gets an element named "swissprot38" while one for release 39 will likely get one named "swissprot39." Release 39 is already out, so it is easy to see that it is a superset of release 38. There should be almost no reason to change the callback handlers, but unfortunately they expect some tags containing the format name. Another approach would be to use tag attributes to store the format version. For example, the outermost element could be "<swissprot version='38'>". Most programs would only need the "swissprot" part, and those which are dependent on the version could get that from the "version" attribute.

One of the task scenarios was the ability to do format and version detection. This is easily implemented by or'ing the possible formats together and seeing which one works, but that runs into problems when trying to identify large files. The RecordReaders come into play only for a specified format and not an or'ed definition; when several of those are merged together, the parser reverts to the original in-memory parser. I see no way to solve that without solving the much harder problem of detecting when no valid backtracking is possible. It should, however, be possible to write detectors which identify the format when given at most N characters of text (N may depend on the specific format), then defer the actual parsing to the appropriate parser.

The Friedl book on regular expressions [32] describes a few optimizations for regular expressions but points out that checking for them usually takes too much overhead for their use in Perl and other interactive languages. Martel is a different case: the data files to analyze are usually quite large, so it is worthwhile to do some additional optimization, even if it might take a second or two. The tag tables can be pickled and cached for later use, so the amortized overhead should not be that large.

Apparently the mxTextTools engine can be made even faster, or so says Marc-André Lemburg. For example, the comments mention "The beginnings of a Tag Table compiler."

The original goal of working on Martel was to allow faster parsing when the format is known to be correct. Consider the task of counting the number of records in a SWISS-PROT file. Analysis of the format definition shows that each record must contain at least 200 characters, so a fast counter could skip ahead 200 characters after reading the character sequence "\nID". Potentially this sort of analysis could make Martel faster than even grep for some tasks.

The core interface for Martel is very simple - it takes a format definition and a string to parse and generates a handful of callback events. If rewritten as a C library, this core could be used by developers in C, Perl, Tcl, Java and other languages, which would really emphasize the language independence of Martel's grammar definition. It is likely that the format definition would not be the full regular expression syntax but some specialized and optimized byte stream created by a Python program.

Finally, mentioned above but worth repeating is the idea of getting rid of the two-stage parsers needed for reading large files. With sufficient analysis of the format expression it should be possible to know when no backtracking is possible and discard any text which is no longer needed. Similarly, it should be possible to read data from the input stream only when required, so that the overall memory footprint stays low.

Status

Martel has achieved all of our major design goals and is in its final testing stage. We are in the process of replacing the existing Biopython parsers with it. The biggest tasks now are to identify and hopefully fix any outstanding bugs, write format definitions, and write builders which convert the SAX events to useful data structures. Along with this we will be identifying common subexpressions which should be moved to a common library. An initial set of these, like Integer and Float, is present, but others like Date and URL or the more biologically specific E.C. number are also needed.
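As a sketch of how such library pieces might compose - Integer already exists in the distribution, while this Date helper is hypothetical and handles only the DD-MMM-YYYY layout used by the SWISS-PROT DT lines:

A hypothetical Date subexpression
import Martel

def Date(name):
    # day/month/year are the same group names the swissprot38
    # definition produces for its DT lines.
    return Martel.Group(name, Martel.Re(
        r"(?P<day>\d\d)-(?P<month>[A-Z]{3})-(?P<year>\d{4})"))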

Availability

The development version of Martel is available at http://www.biopython.org/~dalke/Martel/ and will also soon be available as part of the core Biopython distribution. Discussions about its status, use and capabilities can be found on the biopython-dev mailing list at www.biopython.org.

Acknowledgements

Thanks go to Jeff Chang, who has been a great sounding board over the last few years as he and I have explored this topic, to Roger Sayle for helping me figure out some of the formats and pointing out relevant work, and to Cayte Lindner and Brad Chapman for providing usability feedback.

References

[1] Biopython home page - http://www.biopython.org/
[2] Bioperl home page - http://www.bioperl.org/
[3] BioJava home page - http://www.biojava.org/
[4] BioXML - http://www.bioxml.org/
[5] BioCORBA - http://www.biocorba.org/
[6] "over 200 database formats" - Baxevanis, A. D. The Molecular Biology Database Collection: an online compilation of relevant database resources, Nuc.Acids.Res. 2000, v28, no.1.: http://nar.oupjournals.org/cgi/content/abstract/28/1/1
[7] "SWISS-PROT record 143E_HUMAN" - http://expasy.cbr.nrc.ca/cgi-bin/get-sprot-entry?P42655
[8] "SWISS-PROT format definition" - http://expasy.cbr.nrc.ca/txt/userman.txt
[9] "there is about 50 GB of publically available sequence data". This is a conservative estimate made from looking at just four databases and rounding up to include TrEMBL, which translates EMBL to protein sequence. The real number is likely several times greater, but even getting the numbers for these four databases is difficult: [10] "I analyzed 19 different SWISS-PROT parsers" - http://www.biopython.org/wiki/html/BioPython/ParserComparison.html
[11] SRS - Etzold, T., Ulyanov, A., Argos, P., SRS: information retrieval system for molecular biology data banks. Meth.Enzymol. v266, pp. 114-128. http://www.lionbio.co.uk/
[12] Swissknife - Hermjakob, H., Fleischmann, W., Apweiler, R., Swissknife - 'lazy parsing' of SWISS-PROT entries Bioinf. 1999, v15, no.9, pp771-772. ftp://ftp.ebi.ac.uk/pub/software/swissprot/
[13] Biopy - Ramu, C., Gemünd, C., Gibson, T., Object-oriented parsing of biological databases with Python Bioinf. 2000, v16, no.7, pp628-638. http://shag.embl-heidelberg.de:8000/Biopy/
[14] Darwin - http://cbrg.inf.ethz.ch/Darwinshome.html
[15] SeqIO - http://www.cs.ucdavis.edu/~gusfield/seqio.tar.gz
[16] readseq (C) - http://iubio.bio.indiana.edu/soft/molbio/readseq/version1/readseq.shar
[17] readseq (Java) - http://iubio.bio.indiana.edu/soft/molbio/readseq/java/readseq-source.zip
[18] Boulder - http://stein.cshl.org/software/boulder/
[19] molbio++ - ftp://ftp.ebi.ac.uk/pub/software/unix/molbio.tar.Z
[20] BioDB-Loader - http://www.franz.com/services/conferences_seminars/ismb2000/biodb1.tar.Z
[21] GCG - http://www.gcg.com/products/wis-package.html
[22] sp2fasta - part of WU-BLAST: http://blast.wustl.edu/
[23] sw2xml - http://www.vsms.nottingham.ac.uk/biodom/software/protsuite-user-dist/sw2xml-protbot.pl
[24] NiceProt - source not available, see an example at http://expasy.cbr.nrc.ca/cgi-bin/niceprot.pl?P42655
[25] get-sprot-entry - source not available, see an example at http://expasy.cbr.nrc.ca/cgi-bin/get-sprot-entry?P42655
[26] "Definitions are available for over 150 different formats" - http://srs.ebi.ac.uk/srs6bin/cgi-bin/wgetz?-page+databanks+-newId
[27] Molecular Applications Group - sadly no longer exists nor is any instance of DiscoveryBase publically available.
[28] mxTextTools - http://www.lemburg.com/files/python/mxTextTools.html
[29] "Greg Ewing's Plex" - http://www.cosc.canterbury.ac.nz/~greg/python/Plex/
[30] BLAST - http://www.ncbi.nlm.nih.gov/BLAST/
[31] "Fourthought's XSLT engine" - http://www.fourthought.com/
[32] "Friedl's book" - Friedl, J.E.F., Mastering Regular Expressions, O'Reilly, 1997. http://www.oreilly.com/catalog/regex/
[33] "regular expressions on steroids" - personal communications, September 27, 2000.
[34] "Sean McGrath's RAX" - http://www.oreillynet.com/~rael/data/xml/rax/