
Screen scraping

Most of the interesting servers in the world are web servers. While web pages are written in HTML that a machine can handle (with some effort), the essential data in those pages is meant for humans to read and is rarely designed to be easily extracted by software. But there are ways.

I considered using OpenEye's demo site and PubChem as possible examples but they proved to be too complex for this essay. After some searching I came across a program at NIST that allows searching for compounds based on molecular weight.

For the first version of this code I'll only support searching for a given molecular weight +/- 0.5 amu. I want the interface to look like this:

>>> results = mw_search(145)
>>> len(results)
118
>>> results[0]
(144.86, 'AsCl2', 'AsCl2')
>>>
That is, I give it a value and it returns a list of the hits. Each hit is a 3-tuple of the weight (as a float), the chemical name in plain ASCII, and the chemical formula marked up as HTML.

The HTTP protocol used for the web supports several request types. The two that matter here are GET and POST. The easiest way to identify a GET request is to look at the URL of a results page. If it's "complex" (has a '?' followed by additional text) then it's probably a GET request. One test is to bookmark the page, leave, then come back to the bookmark. If the results are unchanged then it's a GET request. Another way to find out is to look at the HTML of the form that starts the search. If it says <form method="post" ...> then it's a POST. If the method isn't specified then it's a GET request.

When I tried the search through a web page the results page had the URL:

http://webbook.nist.gov/cgi/cbook.cgi?
Value=145&VType=MW&Formula=&AllowExtra=on&Units=SI
I split it over two lines to fit on the screen.

This is almost certainly a GET request. To test it I changed the "145", which was my MW search criterion, to "146". The new results page changed accordingly. GET searches are easier to handle because you can see everything on the URL line. For POST requests you need to look at the HTML, use a debugging proxy or a network sniffer, or perhaps these days a Firefox extension.
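As a preview of the Python code coming below, here's how the two request types look with the urllib and urllib2 libraries. This is a minimal sketch; the POST variant is only for illustration since this particular search expects a GET:

import urllib
import urllib2

params = urllib.urlencode({"Value": "145", "VType": "MW",
                           "Formula": "", "AllowExtra": "on",
                           "Units": "SI"})

# GET: the parameters ride in the URL after the '?'
f = urllib2.urlopen("http://webbook.nist.gov/cgi/cbook.cgi?" + params)

# POST: the same parameters are sent in the request body instead
# (only for illustration -- this particular search expects a GET)
f = urllib2.urlopen("http://webbook.nist.gov/cgi/cbook.cgi", params)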

Trying to figure out how something works like this is called reverse engineering. In this case it's pretty simple. The parameters are easily matched to the inputs on the main page: Value is the weight I typed in, VType=MW says the value is a molecular weight, Formula is the (empty) formula field, AllowExtra=on appears to correspond to the checkbox allowing elements beyond those in the formula, and Units=SI selects the units for the results.

Python has several libraries for working with the web. There are libraries for the different protocols (HTTP, FTP) and, on top of those, libraries for working with URLs. Actually there are two of the latter: urllib and urllib2. I'll use the second, which is meant to address some of the shortcomings of the first. In the following I'll give it the known good URL. It returns a file-like object. I'll read the full response and display the first 200 characters.

>>> import urllib2
>>> f = urllib2.urlopen("http://webbook.nist.gov/cgi/cbook.cgi?"
...            "Value=145&VType=MW&Formula=&AllowExtra=on&Units=SI")
>>> s = f.read()
>>> print s[:200]
<!DOCTYPE html
      PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
     "DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Search Result
>>> 
The data I want starts further down:
>>> print s[1700:2200]
following will be displayed:
</p>
<ul>
<li>Molecular&#160;weight</li>
<li>Chemical name</li>
<li>Chemical formula</li>
</ul>
<p>
Click on the name to see more data.
</p>
<ol>
<li><strong>&#160; 144.86 </strong>  <a href="/cgi/cbook.cgi?ID=C41996376&Units=SI">AsCl2</a>  (AsCl<sub>2</sub>)</li>
<li><strong>&#160; 144.86 </strong>  <a href="/cgi/cbook.cgi?ID=B131&Units=SI">AsCl2 anion</a>  (AsCl<sub>2</sub><sup>-</sup>)</li>
<li><strong>&#160; 144.89 </strong>  <a href="/cgi/cbook.cgi?ID=C1
>>> 
I just need to make my function create the right URL query string. That's a simple string substitution:
import urllib2

_weight_query = ("http://webbook.nist.gov/cgi/cbook.cgi?"
                 "Value=%f&VType=MW&Formula=&AllowExtra=on&Units=SI")
                 #      ^^ the weight goes here

def mw_search(weight):
    query = _weight_query % (weight,)
    return urllib2.urlopen(query)

print mw_search(145).read()
I can run this and see the raw HTML printed to the screen.

After looking at the HTML for a bit I see that the lines I want always start with "<li><strong>". If I assume the format never changes I can use a pretty simple parser to get the fields I want. Here it is:

import urllib2

_weight_query = ("http://webbook.nist.gov/cgi/cbook.cgi?"
                 "Value=%f&VType=MW&Formula=&AllowExtra=on&Units=SI")
                 #      ^^ the weight goes here

def _extract_data(infile):
    results = []
    for line in infile:
        if not line.startswith("<li><strong>"):
            continue
        # These lines contain the data I want

        # The weight is between the ';' and the '<'
        # <li><strong>&#160; 144.86 </strong>
        weight_start = line.index(";")+1
        weight_end = line.index("<", weight_start)
        weight = float(line[weight_start:weight_end])

        # The chemical name is between the 'SI">' and the next '<'
        #  SI">AsCl2</a>
        name_start = line.index('SI">')+4
        name_end = line.index('<', name_start)
        name = line[name_start:name_end]

        # The chemical formula (in HTML) is between the parens
        formula_start = line.index("(", name_end) + 1
        formula_end = line.index(")", formula_start)
        formula = line[formula_start:formula_end]

        results.append( (weight, name, formula) )

    return results

def mw_search(weight):
    query = _weight_query % (weight,)
    f = urllib2.urlopen(query)
    return _extract_data(f)

if __name__ == "__main__":
    results = mw_search(145)
    print results[0]
    print len(results)
Not a very elegant parser, but it works. Here's the output:
(144.86000000000001, 'AsCl2', 'AsCl<sub>2</sub>')
118

This process of extracting data from the HTML is called screen scraping because it's scraping the data off the screen instead of getting the data more directly. The basic process is exactly like this example: construct a request, then parse the response. In more complicated cases it may take several request/parse rounds before getting the needed results.

Parsing the HTML is often the trickiest part of the problem. The HTML returned from the server is ill-defined and often not even valid. Even when valid, there's nothing to define which elements are where or how to identify the data to be extracted. That needs to be figured out by inspection combined with experience.

One helpful library for HTML screen scraping is BeautifulSoup. It tries to convert even poor quality HTML into a tree structure that's easier to parse than working with the HTML as a string.

It does require thinking of the document as a tree structure instead of a set of lines. In this case it looks like the chemical information is in the li elements of the only ol in the document.

>>> import urllib2
>>> f = urllib2.urlopen("http://webbook.nist.gov/cgi/cbook.cgi?"
...            "Value=145&VType=MW&Formula=&AllowExtra=on&Units=SI")
>>> s = f.read()
>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup(s)
>>> ol = soup.first("ol")
>>> ol.first("li")
<li><strong>&#160; 144.86 </strong> <a href="/cgi/cbook.cgi?ID=C41996376&Units=SI">AsCl2</a>  (AsCl<sub>2</sub>)</li>
>>> 
With some experimentation and testing, here's the BeautifulSoup version of the parser:
import BeautifulSoup
import urllib2

_weight_query = ("http://webbook.nist.gov/cgi/cbook.cgi?"
                 "Value=%f&VType=MW&Formula=&AllowExtra=on&Units=SI")
                 #      ^^ the weight goes here

def _extract_data(soup):
    results = []
    ol = soup.first("ol")
    for li in ol.fetch("li"):
        weight_term = li.first("strong").string
        # Ignore that leading unicode character
        weight = float(weight_term.split()[1])

        name = li.first("a").string

        # I still need to use text searching for this  :(
        s = str(li)
        formula_start = s.find("(")+1
        formula_end = s.find(")", formula_start)
        formula = s[formula_start:formula_end]

        results.append( (weight, name, formula) )
    return results


def mw_search(weight):
    query = _weight_query % (weight,)
    f = urllib2.urlopen(query)
    soup = BeautifulSoup.BeautifulSoup(f.read())
    return _extract_data(soup)


if __name__ == "__main__":
    results = mw_search(145)
    print results[0]
    print len(results)

For this case it's only a bit clearer than the original line-oriented parser, mostly because I chose a server that was easy to parse and because I haven't tried to deal with errors. So let's do that.

What does the server do if I pass it a value that's negative? Trying it interactively I get the page:


No Matching Species Found

No species with the requested data and a molecular weight in the range of [-145.50, -144.50] were found in the database.


and using the function above I get an empty list. That's what I wanted.
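That is, with the BeautifulSoup version of the code loaded, the interactive check looks like:

>>> mw_search(-145)
[]
>>>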

Okay, what if there's only one match? I found that searching for mw=2011 returns


In14P13 anion


followed by some additional information. It looks like when there's only one compound the server shows more data, and in a different format. The relevant HTML for the parsing is:
<h1><a id="Top" name="Top">In14P13 anion</a></h1>
<ul>
<li><strong>Formula:</strong> In<sub>14</sub>P<sub>13</sub><sup>-</sup></li>
<li><strong>Molecular Weight:</strong> 2010.11</li>
I can write a parser for this case; I just need to know when to use which one. After looking at the HTML for a bit: if the h1 element has an a in it then it's the detailed information for a single compound. Otherwise it's a list of results or an error message saying there were no results in that range. Not the most satisfying of solutions, but that's typical when screen scraping.

The following code implements that logic. Notice how I have one function to identify the contents of the soup, which then passes the soup off to the appropriate parser to extract the right data. This partitioning makes the code easier to read and test.

import BeautifulSoup
import urllib2

_weight_query = ("http://webbook.nist.gov/cgi/cbook.cgi?"
                 "Value=%f&VType=MW&Formula=&AllowExtra=on&Units=SI")
                 #      ^^ the weight goes here

## Parses the following
# <ol>
# <li><strong>  144.86 </strong>  <a href="/cgi/cbook.cgi?ID=C41996376&Units=SI">AsCl2</a>  (AsCl<sub>2</sub>)</li>
# <li><strong>  144.86 </strong>  <a href="/cgi/cbook.cgi?ID=B131&Units=SI">AsCl2 anion</a>  (AsCl<sub>2</sub><sup>-</sup>)</li>
# <li><strong>  144.89 </strong>  <a href="/cgi/cbook.cgi?ID=C166899805&Units=SI">Al3S2 anion</a>  (Al<sub>3</sub>S<sub>2</sub><sup>-</sup>)</li>
def _extract_search_results(soup):
    results = []
    ol = soup.first("ol")
    for li in ol.fetch("li"):
        weight_term = li.first("strong").string
        # Ignore that leading unicode character
        weight = float(weight_term.split()[1])

        name = li.first("a").string

        # I still need to use text searching for this  :(
        s = str(li)
        formula_start = s.find("(")+1
        formula_end = s.find(")", formula_start)
        formula = s[formula_start:formula_end]

        results.append( (weight, name, formula) )
    return results

## Parses the following
# <h1><a id="Top" name="Top">In14P13 anion</a></h1>
# <ul>
# <li><strong>Formula:</strong> In<sub>14</sub>P<sub>13</sub><sup>-</sup></li>
# <li><strong>Molecular Weight:</strong> 2010.11</li>
# <li><strong>CAS Registry Number:</strong> 243867-98-3</li>

def _extract_single_result(soup):
    name = soup.first("h1").first("a").string
    
    lis = soup.first("ul").fetch("li")
    # It's the text between the space and the </li>
    s = str(lis[0])
    formula_start = s.index(" ")+1
    formula_end = s.index("</li>")
    formula = s[formula_start:formula_end]

    weight = float(lis[1].contents[1].string)

    return [(weight, name, formula)]

def _extract_data(soup):
    h1 = soup.first("h1")

    # If there's an 'a' tag in the 'h1' then it's
    # a single element result
    if h1.first("a") is not BeautifulSoup.Null:
        return _extract_single_result(soup)
    else:
        return _extract_search_results(soup)


def mw_search(weight):
    query = _weight_query % (weight,)
    f = urllib2.urlopen(query)
    soup = BeautifulSoup.BeautifulSoup(f.read())
    return _extract_data(soup)

There's more that can be done. The function call could be expanded to support more of the server search parameters. The parser for the list results page could include the links to the detailed information about each compound. The information it returns should include a flag if the search limit was reached. The parser for the compound details could extract more of the available details on the page; the image for the structure, 2D mol file, CAS number, alternate names, and so on.

The overall process for developing a client to a web service is similar to the one earlier for interfacing with a subprocess-wrapped executable. First, figure out how you want to interact with the system, balancing your expectations with feasibility. The data model you come up with might not match that on the server: you can partition overloaded server functions into different API functions, or make one object that merges multiple server requests.
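As a sketch of that last idea, a hypothetical Compound class could carry one search hit and fetch its detail page only when asked. The class name and attributes here are made up for illustration:

import urllib2

class Compound(object):
    # One search hit; the detail page is fetched only on demand.
    def __init__(self, weight, name, formula, detail_url):
        self.weight = weight
        self.name = name
        self.formula = formula
        self._detail_url = detail_url
        self._detail_html = None

    def detail_html(self):
        # Fetch and cache the raw detail page the first time it's
        # asked for; parsing it into fields is left as an exercise.
        if self._detail_html is None:
            self._detail_html = urllib2.urlopen(self._detail_url).read()
        return self._detail_html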

Code the basic functionality. When that works, figure out how to make the server fail in strange ways. Be creative. Remember to put those tests into an automated testing system. After you rewrite or clean up some code, run the tests. When they pass you know your changes didn't introduce new problems or reintroduce known old ones.
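For example, the checks from this essay fit naturally into a small unittest script. This is a minimal sketch: it assumes the final version of the code is saved as a module named nist_search (a hypothetical name), and since it talks to the live server it's really an integration test:

import unittest
from nist_search import mw_search   # hypothetical module name

class TestMWSearch(unittest.TestCase):
    def test_many_hits(self):
        # the search used throughout this essay
        results = mw_search(145)
        self.assertEqual(results[0][1], "AsCl2")

    def test_no_hits(self):
        # negative weights match nothing
        self.assertEqual(mw_search(-145), [])

    def test_single_hit(self):
        # this weight returns a single-compound page
        results = mw_search(2011)
        self.assertEqual(len(results), 1)
        self.assertEqual(results[0][1], "In14P13 anion")

if __name__ == "__main__":
    unittest.main()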

Expand, test, inspect, break, fix. Repeat until you have what you need, bearing in mind that you don't need to implement features you aren't going to need or test for failures that aren't going to happen.

By the way, if you are going to construct more complicated URL query strings, make sure you use the urllib.urlencode() function. In this essay the only user-defined parameter was a number, which is easy to handle using a %f. In most other cases a parameter field may contain arbitrary characters. The rules for URLs put restrictions on which characters are allowed in the URL proper. Any other character must be escaped according to those rules, which urlencode does for you.
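For example, a hypothetical search-by-name parameter would need its space escaped:

>>> import urllib
>>> urllib.urlencode({"Name": "acetic acid"})
'Name=acetic+acid'
>>>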


Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me


