Dalke Scientific Software: More science. Less time. Products

This is part 1 of 2 in parsing. Part 2 parses chemical equations using the PLY parser generator.

Parsing by hand

counting the total number of atoms

How many atoms are in "H2SO4"? 2 hydrogens, a sulphur and 4 oxygens = 7. Now write a program to do that.

I'll start with a set of test cases

def test():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120

if __name__ == "__main__":
    test()

Run this and of course it will fail. I need an atom_count function which parses the string and counts the number of atoms, including the optional atom count. This reflects that there are two distinct entities in the chemical notation: element symbols and counts. I'll need some way to distinguish between the two of them. I'll use regular expressions. (BTW, in this case you could combine the two into a single regular expression; this is a teaching example.)

From an earlier project I have a compact regular expression with all of the element symbols in it, including the transuranics.

import re

atom_symbol_pattern = re.compile(
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

count_pattern = re.compile(r"\d+")
The match method of Python's regular expressions support an optional start position (and and optional end position). In the following, group() returns the text of the entire match and match returns None when there is no match. (Python's interactive interpreter does not show None so I'll use a print to force the display.)
>>> atom_symbol_pattern.match("H2SO4", 0)
<_sre.SRE_Match object at 0x5ee90>
>>> atom_symbol_pattern.match("H2SO4", 0).group()
'H'
>>> atom_symbol_pattern.match("H2SO4", 1)
>>> print atom_symbol_pattern.match("H2SO4", 1)
None
>>> atom_symbol_pattern.match("H2SO4", 2)
<_sre.SRE_Match object at 0x5ee90>
>>> atom_symbol_pattern.match("H2SO4", 2).group()
'S'
>>> atom_symbol_pattern.match("H2SO4", 3).group()
'O'
>>>
I can use the end() method to get the character position just past the end of the entire match, which I'll use as the start position for the next match.
>>> atom_symbol_pattern.match("NaCl", 0).group()
'Na'
>>> atom_symbol_pattern.match("NaCl", 0).end()
2
>>> atom_symbol_pattern.match("NaCl", 2).group()
'Cl'
>>> atom_symbol_pattern.match("NaCl", 2).end()
4
>>> 

I'll put the above together in a while loop and add the code which checks for the optional repeat count after the element.

import re

atom_symbol_pattern = re.compile(
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")


count_pattern = re.compile(r"\d+")

def atom_count(s):
    N = len(s)
    count = 0
    pos = 0
    while pos < N:
        # Must have an element
        m = atom_symbol_pattern.match(s, pos)
        if not m:
            raise TypeError("Expecting an element symbol at position %d (%r)" %
                            (pos, s))
        pos = m.end()
        
        # May have an optional repeat count
        m = count_pattern.match(s, pos)
        if m:
            count += int(m.group())
            pos = m.end()
        else:
            count += 1
    return count

def test():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120

if __name__ == "__main__":
    test()
When I ran the above my code had an error; I had forgotten to use "pos" in the count_pattern.match. After fixing that the tests all passed. But did I check all of the cases?

I downloaded Ned Batchelder's coverage module to check. The "-x" option executes the given code and collects the results (by default) in the file ".coverage". This can collect details across multiple tests. The "-a" option generates a detailed coverage report, putting the results in a file ending ",cover".

[~/nbn] % python coverage.py -x atom_count.py
[~/nbn] % python coverage.py -a atom_count.py
[~/nbn] % cat atom_count.py,cover 
> import re
  
> atom_symbol_pattern = re.compile(
>       r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
>       r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
>       r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
>       r"U|V|W|Xe|Yb?|Z[nr]")
  
  
> count_pattern = re.compile(r"\d+")
  
> def atom_count(s):
>       N = len(s)
>       count = 0
>       pos = 0
>       while pos < N:
                # Must have an element
>               m = atom_symbol_pattern.match(s, pos)
>               if not m:
!                       raise TypeError("Expecting an element symbol at position %d (%r)" %
!                                       (pos, s))
>               pos = m.end()
  
                # May have an optional repeat count
>               m = count_pattern.match(s, pos)
>               if m:
>                       count += int(m.group())
>                       pos = m.end()
>               else:
>                       count += 1
>       return count
  
> def test():
>     assert atom_count("He") == 1
>     assert atom_count("H2") == 2
>     assert atom_count("H2SO4") == 7
>     assert atom_count("CH3COOH") == 8
>     assert atom_count("NaCl") == 2
>     assert atom_count("C60H60") == 120
  
> if __name__ == "__main__":
>     test()
Looks like I tested almost everything. I missed testing the error handling, which is not uncommon in any code. Testing all error cases is hard, boring, and generally unsatisfying. Errors do occur in the error handling code so I'll add a test to exercise that case.
def test():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120
    try:
        atom_count("SeZYou")
    except TypeError, err:
        assert "position 2" in str(err)
    else:
        raise AssertionError("should have failed")
With that in place all of my code is tested, but I see the else: in my test code isn't. I can change thing to fix that but a better solution is to use "assertRaises" from the unittest module, or implement my own. For example,
def assert_raises(exc, f, *args):
    try:
        f(*args)
    except exc:
        pass
    else:
        raise AssertionError("Expected %r" % (exc,))

def test():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120
    assert_raises(TypeError, atom_count, "SeZYou")
though a yet better test would check that the error message contains the correct position information.

Individual atom counts

How many atoms of each element are in CH3COOH? There are 4 hydrogens, 2 carbons, and 2 oxygens. Now write a program to do that.

defaultdict

Normally this sort of information is stored in a dictionary mapping element name to its count in the equation. In that case you have to worry about two cases: 1) if "H" is in the dictionary then add the new count to the existing count, 2) if "H" is not there then set it to the new count. The "get" method helps unify the two cases, as in:

>>> counts = {}
>>> counts["H"] += 3
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
KeyError: 'H'
>>> counts["H"] = counts.get("H", 0) + 3
>>> 

The "+= 3" version works in Perl, which gives a default value if the key doesn't exist. Python doesn't make that assumption so until recently you had to test of the value existed first, or use the get/setdefault methods. New with Python 2.5 is a dictionary variant called "defaultdict" which takes a callable (function, class or functor) in the constructor. If requested items does not exist the defaultdict calls the callable and uses the result as the default object.

>>> import collections
>>> counts = collections.defaultdict(int)
>>> counts["H"] += 3
>>> counts
defaultdict(<type 'int'>, {'H': 3})
>>> int()  
0
>>>
To understand why it works, note that "int()" returns 0.

The defaultdict isn't that much of an improvement in this case; only a few characters. It's more useful when default items is complex object. With a normal Python dictionary the setdefault method exists but it's ugly and annoying to use because the default item is always created. With defaultdict you can do

>>> x = collections.defaultdict(list)
>>> x
defaultdict(<type 'list'>, {})
>>> x["vikings"].append("spam")
>>> x
defaultdict(<type 'list'>, {'vikings': ['spam']})
>>> x["vikings"]
['spam']
>>>  

code for element_counts

The code to do the element counts isn't surprising. It's a modification of the atom_count function. The biggest change is keeping the atom_symbol_pattern match around to use the element symbol in the counts defaultdict.

import re
import collections

atom_symbol_pattern = re.compile(
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

count_pattern = re.compile(r"\d+")

def element_counts(s):
    N = len(s)
    counts = collections.defaultdict(int)
    pos = 0
    while pos < N:
        # Must have an element
        element_m = atom_symbol_pattern.match(s, pos)
        if not element_m:
            raise TypeError("Expecting an element symbol at position %d (%r)" %
                            (pos, s))
        pos = element_m.end()
        
        # May have an optional repeat count
        count_m = count_pattern.match(s, pos)
        if count_m:
            counts[element_m.group()] += int(count_m.group())
            pos = count_m.end()
        else:
            # No repeat count; add one for the element symbol
            counts[element_m.group()] += 1
    return counts
        

def test():
    assert element_counts("CH3COOH") == {"C": 2, "H": 4, "O": 2}
    assert element_counts("Ne") == {"Ne": 1}
    assert element_counts("") == {}
    assert element_counts("NaCl") == {"Na": 1, "Cl": 1}

if __name__ == "__main__":
    test()

Tokenizer

It's annoying that the two functions implement the same bit of parsing code. I would rather they share more of the code. They can. I'll have both function call a new function which "tokenizes" the input string. In parsing terminology the element symbol and repeat count are known as tokens.

I'll implement the tokenizer using the 'yield' statement. Err, but in Python 2.5 it's an expression. I'll have to get used to that change. Yield has been around for a while now so I won't explain it here and only show this example

>>> def count_to_five():
...   yield "one"
...   yield "two"
...   yield "three"
...   yield "four"
...   yield "five"
... 
>>> for number in count_to_five():
...   print number
... 
one
two
three
four
five
>>> 

The tokenizer figures out which token is next: atom symbol or repeat count. It needs to let the caller know the token type and information specific to the token. In this case the token type will be the string "element" or "count" and the token information will be the element symbol or the integer count value, respectively.

import re
import collections

atom_symbol_pattern = re.compile(
    r"C[laroudsemf]?|Os?|N[eaibdpos]?|S[icernbmg]?|P[drmtboau]?|"
    r"H[eofgas]?|A[lrsgutcm]|B[eraik]?|Dy|E[urs]|F[erm]?|G[aed]|"
    r"I[nr]?|Kr?|L[iaur]|M[gnodt]|R[buhenaf]|T[icebmalh]|"
    r"U|V|W|Xe|Yb?|Z[nr]")

count_pattern = re.compile(r"\d+")

def tokenize(s):
    N = len(s)
    pos = 0
    while pos < N:
        m = atom_symbol_pattern.match(s, pos)
        if not m:
            raise TypeError("Expecting an element symbol at position %d (%r)" %
                            (pos, s))
        yield ("element", m.group())
        pos = m.end()
        
        m = count_pattern.match(s, pos)
        if m:
            yield ("count", int(m.group()))
            pos = m.end()

def atom_count(s):
    count = 0
    for (token_type, token) in tokenize(s):
        if token_type == "element":
            count += 1
        else:
            # -1 because the element already added 1
            count += token-1
    return count

def element_counts(s):
    counts = collections.defaultdict(int)
    element = None
    for (token_type, token) in tokenize(s):
        if token_type == "element":
            element = token
            counts[element] += 1
        else:
            # -1 because the element already added 1
            counts[element] += token-1
    return counts

######
def assert_raises(exc, f, *args):
    try:
        f(*args)
    except exc:
        pass
    else:
        raise AssertionError("Expected %r" % (exc,))

def test_element_counts():
    assert element_counts("CH3COOH") == {"C": 2, "H": 4, "O": 2}
    assert element_counts("Ne") == {"Ne": 1}
    assert element_counts("") == {}
    assert element_counts("NaCl") == {"Na": 1, "Cl": 1}
    assert_raises(TypeError, element_counts, "Blah")
    assert_raises(TypeError, element_counts, "10")
    assert_raises(TypeError, element_counts, "1C")

def test_atom_count():
    assert atom_count("He") == 1
    assert atom_count("H2") == 2
    assert atom_count("H2SO4") == 7
    assert atom_count("CH3COOH") == 8
    assert atom_count("NaCl") == 2
    assert atom_count("C60H60") == 120
    assert_raises(TypeError, atom_count, "SeZYou")
    assert_raises(TypeError, element_counts, "10")
    assert_raises(TypeError, element_counts, "1C")

def test():
    test_atom_count()
    test_element_counts()

if __name__ == "__main__":
    test()
    print "All tests passed."

The parser

The count functions are still complicated. They have to handle each token type different, and the element_counts function has to remember what the previous element was in case the count comes next. The tokenizer is also more complicated than it needs to be. It addition to tokenizing it verifies that the tokens are in the right order. In this case the two tokens - element symbols and numbers - cannot be confused with each other. The tokenizer should just identify the next token and let something else verify that the tokens are in the correct order. That something else could also simplify matters by merging the symbol and count information into a single data structure.

That "something else" is called a parser. It consumes tokens and generates what's usually called a parse tree or an abstract syntax tree. In this case though the chemical equations are linear and there is no tree so the parser can generate "Atom" objects with the symbol and count information.

class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count

Rather than show the whole code I'll only include the parts that changed. You'll need to add the import statements and test code yourself. I hope though that you see the usefulness of a good test suite. I can make large changes to the underlying implementation and be pretty sure that because it passes the tests the new code is as correct as the old code. Also, I did check that the coverage was still complete.

def tokenize(s):
    N = len(s)
    pos = 0
    while pos < N:
        m = atom_symbol_pattern.match(s, pos)
        if m:
            yield ("element", pos, m.group())
            pos = m.end()
            continue
        
        m = count_pattern.match(s, pos)
        if m:
            yield ("count", pos, int(m.group()))
            pos = m.end()
            continue

        raise TypeError("Unknown text at position %d (%r)" %
                        (pos, s))

# The "parse tree" is a list of Atom instances, returned through a generator
class Atom(object):
    def __init__(self, symbol, count):
        self.symbol = symbol
        self.count = count
        
# take a list or iterator returning token information; yield Atom instances
def parse(tokens):
    token_stream = iter(tokens)
    symbol = None
    for token_type, pos, token in token_stream:
        if token_type == "element":
            if symbol is not None:
                # there was a previous element waiting for a count
                yield Atom(symbol, 1)
            symbol = token
        else:
            if symbol is None:
                raise TypeError(
                "count token %r at position %d but no element symbol to repeat" %
                (token, pos))
            else:
                # token contains the integer repeat count
                yield Atom(symbol, token)
                symbol = None
    if symbol is not None:
        # last element symbol has no repeat count; has a count of 1
        yield Atom(symbol, 1)
            
def atom_count(s):
    count = 0
    for atom in parse(tokenize(s)):
        count += atom.count
    return count

def element_counts(s):
    counts = collections.defaultdict(int)
    for atom in parse(tokenize(s)):
        counts[atom.symbol] += atom.count
    return counts
You can see that the atom_count and element_counts functions are now very simple, and with little code duplication. I could merge the parse(tokenize(s)) calls into a new function, but that's trivial. The parse() function is rather complex, mostly because it's now in charge of verifying the syntax and combining the element symbol information with the optional repeat count.



Copyright © 2001-2013 Andrew Dalke Scientific AB