Dalke Scientific Software: More science. Less time. Products

This is part 2 of 2 in parsing. Part 1 parses chemical equations using hand-written code.

Parsing with PLY

The ALGOL programming language in the late 1950s/early 1960s introduced the idea of computer languages based on a machine understandable grammar. The notation is called BNF for Backus-Naur Form. It wasn't until the 1970s though that people understood the practice well enough to write general-purpose tools which helped build parsers. The first popular book describing the process is known as The Dragon Book because it had a picture of a knight attacking a dragon on the cover; a hint that this was a hard topic being conquered. The back cover had Don Quixote attacking windmills. It taught how to take the theory and turn it into practice. It's considered a classic because of its impact but the material is dated.

This is also the time when lex and yacc were introduced as part of Unix. Lex is a lexer and yacc is "yet another compiler compiler." As you can infer, it wasn't the first. The term "compiler compiler" means it uses a language used to build a parser for another language. The term "compiler" is a misnomer in this case and the classical distinction between "compiled" and "interpreted" language has mostly become irrelevant.

There are many parser tools for Python. A web search will dig them up for you. The tools take several different approaches. The one I liked the most some years back was SPARK, Last month I looked around and found that PLY is the more modern tool for the niche that SPARK filed. The author of PLY is David Beazley, who is also the author of SWIG.

PLY splits the processing up between the lexer and the parser. As with the final code in the "parsing by hand" talk the lexer assumes all the tokens are described by regular expressions and assumes they are all potentially values at any point in the input. This is usually true for languages designed to be read by parser generator tools, but false for most bioinformatics file formats. (Eg, "GENE" can be someone's name, a section header and a protein sequence, all in the same file. The meaning depends on the location in the file. PLY supports stateful systems like this but I haven't tried that out.)

tokenizing with PLY's 'lex' module

PLY uses variables starting with "t_" to indicate the token patterns. If the variable is a string then it is interpreted as a regular expression and the match value is the value for the token. If the variable is a function then its docstring contains the pattern and the function is called with the matched token. The function is free to modify the token or return a new token to be used in its place. If nothing is returned then the match is ignore. Usually the function only changes the "value" attribute, which is initially the matched text. In the following the t_COUNT converts the value to an int.

import lex

tokens = (
    "SYMBOL",
    "COUNT"
)

t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

lex.input("CH3COOH")
for tok in iter(lex.token, None):
    print repr(tok.type), repr(tok.value)
BTW, iter(f, sentinel) is a handy function. It calls f() and returns the return value using the iterator protocol. When the returned value equals the sentinel value it raises a StopIterationError. In other words, I converted the sequence of multiple function calls into an iterator, letting my use a for loop.

When I run the code I get the following

'SYMBOL' 'C'
'SYMBOL' 'H'
'COUNT' 3
'SYMBOL' 'C'
'SYMBOL' 'O'
'SYMBOL' 'O'
'SYMBOL' 'H'
You can see that the count was properly converted to an integer

parsing with PLY's 'yacc' module

PLY's parser works on tokens. It uses a BNF grammar that describes how those tokens are assembled. Parsers can handle some ambiguity. In this chemical equation example the grammar is ambiguous after reading a chemical symbol. There could be repeat count afterward or not. There are a huge number of techniques to resolve the ambiguity such as as lookahead (see what the next tokens are) and precedence rules (given the choice between "*" and "+" use "*" first).

The parsing algorithms go by names like LALR(1), SLR, LL and LR. The most general purpose is the Earley algorithm (which SPARK used) but it's usually slower than the more limited ones I just listed. The limited ones only parse less complicated grammars. You might think that's a problem, but in practice that's rarely the case. People have a hard time with complex grammars as well and it's better to have a syntax which people can understand without deep thought.

Here's the parser part of my chemical equation evaluator.

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count

# When parsing starts, try to make a "chemical_equation" because it's
# the name on left-hand side of the first p_* function definition.
def p_species_list(p):
    "chemical_equation :  chemical_equation species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "chemical_equation : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])
        
def p_error(p):
    print "Syntax error at '%s'" % p.value
    
yacc.yacc()

print yacc.parse("H2SO4")
As you can see, I can have one or more rule defined per parser function, and I can have multiple functions which work on the same left-hand side term. When using yacc, remember to return the information through p[0]. I keep expecting to "return" the object instead.

When I run the above, combined with the lexer code (and with the lexer test code removed) I get the following output

[<__main__.Atom object at 0xb83b0>, <__main__.Atom object at 0xc14d0>, <__main__.Atom object at 0xc12b0>]
That proved rather unhelpful. I can see there are 3 atoms but I can't see what the atoms are. I'll modify the Atom class to add a __repr__ method.
class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count
    def __repr__(self):
        return "Atom(%r, %r)" % (self.symbol, self.count)
and with that in place I see
[Atom('H', 2), Atom('S', 1), Atom('O', 4)]

parsetab.py

Generating the parser can be expensive (that is, take a lot of time). PLY will store the parser in a file called "parsetab.py" and reuse it if the grammar is unchanged. If you with you can change the filename to something else.

The new count functions

There is only a minor change to the count functions. I don't need the tokenize step because PLY handles that for me. I can pass a string directly into the yacc.parse functions. Here's are the new versions

import collections

def atom_count(s):
    count = 0
    for atom in yacc.parse(s):
        count += atom.count
    return count

def element_counts(s):
    counts = collections.defaultdict(int)
    for atom in yacc.parse(s):
        counts[atom.symbol] += atom.count
    return counts 

Because the function APIs are unchanged I can use the same test suite as before. When I do that I found two problems:

1) I didn't get a TypeError when the input was an invalid string. This was because I was only printing the error, with

def p_error(p):
    print "Syntax error at '%s'" % p.value
when I should have raised an expection, with
def p_error(p):
    raise TypeError("unknown text at %r" % (p.value,))

2) I didn't support the empty string. My previous code allowed "" to mean "no chemical compound". Supporting that is straight-forward but required a bit of rearranging. I had to add a new start token which can either be the empty string (in which case p[0] gets an empty list) or be a list of chemical species. The new grammar is

def p_chemical_equation(p):
    """
    chemical_equation :
    chemical_equation : species_list
    """
    if len(p) == 1:
        # the empty string means there are no atomic symbols
        p[0] = []
    else:
        p[0] = p[1]

def p_species_list(p):
    "species_list :  species_list species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "species_list : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])

The final code

For posterity's sake, here's the final code with all the fixes, docstrings and self-test code.

# A grammar for chemical equations like "H2O", "CH3COOH" and "H2SO4"
# Uses David Beazley's PLY parser.
# Implements two functions: count the total number of atoms in the equation and
#   count the number of times each element occurs in the equation.

import lex
import yacc

tokens = (
    "SYMBOL",
    "COUNT"
)

t_SYMBOL = (
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

def t_COUNT(t):
    r"\d+"
    t.value = int(t.value)
    return t

def t_error(t):
    raise TypeError("Unknown text '%s'" % (t.value,))

lex.lex()

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count
    def __repr__(self):
        return "Atom(%r, %r)" % (self.symbol, self.count)

# When parsing starts, try to make a "chemical_equation" because it's
# the name on left-hand side of the first p_* function definition.
# The first rule is empty because I let the empty string be valid
def p_chemical_equation(p):
    """
    chemical_equation :
    chemical_equation : species_list
    """
    if len(p) == 1:
        # the empty string means there are no atomic symbols
        p[0] = []
    else:
        p[0] = p[1]

def p_species_list(p):
    "species_list :  species_list species"
    p[0] = p[1] + [p[2]]

def p_species(p):
    "species_list : species"
    p[0] = [p[1]]

def p_single_species(p):
    """
    species : SYMBOL
    species : SYMBOL COUNT
    """
    if len(p) == 2:
        p[0] = Atom(p[1], 1)
    elif len(p) == 3:
        p[0] = Atom(p[1], p[2])
        
def p_error(p):
    raise TypeError("unknown text at %r" % (p.value,))
    
yacc.yacc()

######

import collections

def atom_count(s):
    """calculates the total number of atoms in the chemical equation
    >>> atom_count("H2SO4")
    7
    >>>
    """
    count = 0
    for atom in yacc.parse(s):
        count += atom.count
    return count

def element_counts(s):
    """calculates counts for each element in the chemical equation
    >>> element_counts("CH3COOH")["C"]
    2
    >>> element_counts("CH3COOH")["H"]
    4
    >>>
    """
    
    counts = collections.defaultdict(int)
    for atom in yacc.parse(s):
        counts[atom.symbol] += atom.count
    return counts 

######
def assert_raises(exc, f, *args):
    try:
        f(*args)
    except exc:
        pass
    else:
        raise AssertionError("Expected %r" % (exc,))

def test_element_counts():
    assert element_counts("CH3COOH") == {"C": 2, "H": 4, "O": 2}
    assert element_counts("Ne") == {"Ne": 1}
    assert element_counts("") == {}
    assert element_counts("NaCl") == {"Na": 1, "Cl": 1}
    assert_raises(TypeError, element_counts, "Blah")
    assert_raises(TypeError, element_counts, "10")
    assert_raises(TypeError, element_counts, "1C")

def test_atom_count():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120
    assert_raises(TypeError, atom_count, "SeZYou")
    assert_raises(TypeError, element_counts, "10")
    assert_raises(TypeError, element_counts, "1C")

def test():
    test_atom_count()
    test_element_counts()

if __name__ == "__main__":
    test()
    print "All tests passed."



Copyright © 2001-2013 Andrew Dalke Scientific AB