PyDaylight, part II
       -or-
  What has Andrew been doing the last four weeks?


I'll start with part I, which was covered at my MUG presentation.  I
want to go into some more detail since I believe this is a more
technical audience.

The work was all developed in Python, and extends Roger Critchlow's
dayswig work.  The only new C code was to provide the callback hooks
to support the drawing library.  I'll talk about that in a bit.

DayPerl, DayTcl and DaySWIG all wrap the C-level Daylight toolkit and
provide interfaces to "very high-level languages" (VHLL).  An
advantage to these languages is they provide many higher-level
constructs, like auto-resizable lists, associative arrays, exceptions
and garbage collection, which reduces the time, expertise and amount
of code needed to develop new software.

Making the toolkit accessible to the VHLL doesn't automatically add
these new behaviours.  If dt_cansmiles returns a NULL, the wrapper
code doesn't raise an exception in the VHLL, it just returns the
'invalid string' data type.  In fact, in some cases the return type
may or may not be an error, depending on the context.  dt_charge of an
atom can return an integer, such as -1, but dt_charge of a molecule is
an error, so it returns -1.

   Reducing polymorphism with classes

Python, my VHLL of choice, is an object-oriented langauge.  It allows
you to create and manipulate new data types, like atom, bond, SMARTS
pattern, Thor server and Merlin pool.  The toolkit also has a type
system, including those same data types.  However, because the library
was developed to work with C and FORTRAN codes, the type information
is hidden behind opaque integer handles.

Thus, toolkit functions can (must!) be very polymorphic, to get around
C and FORTRAN's lack of overloading.  The same function can do many
different things, depending on the underlying toolkit types.  But all
that the C compiler knows is that you are manipulating integers.  So
it is easy to make simple mistakes, like call dt_charge(molecule)
instead of dt_charge(atom), but the compiler doesn't know enough to
stop you.  The developer must remember more about how the underlying
toolkit types work, which increases the difficulty of writing new
code.

One way to fix the problem is to wrap dt_charge with a function like:

def checked_dt_charge(handle):
    if dt_type(handle) == TYP_ATOM:
        return dt_charge(handle)
    raise Error("Cannot get the charge: " + dt_typename(handle))

It's straight-forward to see how to add type checks to all of the
toolkit functions in this manner.

The problem is, it doesn't take advantage of my ability to create new
classes.  Conceptually, "charge" is an attribute of an atom, and of
nothing else.  I would prefer using a system which explicitly reflects
that, like:

  print atom.charge

rather than

  print dt_charge(atom)

Functionally, you can also wonder if dt_charge(atom) adds a charge to
the atom, whereas viewing it as an attribute you don't have that
question.

So having classes helps determine the context I was missing earlier.
In PyDaylight, dt_charge will never be called on anything other than
an atom, so I can completely forego the dt_type() check.  Instead, the
check is moved to the language's type system.  If you do
molecule.charge in Python you get an AttributeError exception, and if
you redo what I did in C++, the compiler would give you the compile
time error message that molecules don't have charge.

What's interesting about this is that the polymorphic toolkit
functions do perform an explicit type check on all the input
parameters, very much like my "checked_dt_charge" above.  By switching
to a language with a better type system, nearly all of these checks
could be done automatically, which would reduce the size of the
toolkit and increase its performance.

   Run-time checks as exceptions

Not all of errors are caused by using the wrong data type in a
polymorphic function.  Some are truely run-time issues, like trying to
add an atom when the molecule's "mod" state is off (the molecule is in
a "read-only" mode).  In PyDaylight, these are explicitly checked, as
in:

    a = dt_addatom(mol, atomic_number, hcount)
    if a == NULL_OB:
	raise Error('Cannot add atom')

The problem with a language without exceptions, like C or FORTRAN, is
that most programmers don't like adding all of the error checking.
The check code usually doubles the amount of code written, and it is
hard to test all the code paths.  Instead, most people do as is
encouraged in the Daylight programming manual, and defer the error
checking to later.  Then when errors do occur, it is much harder to
track down just where the error started, since the locality was lost.

I should also add that I don't check all of the errors, just the ones
that lead to confusion, like add atom.

As an observational fact, I know people weren't checking all of the
function results, since I found a couple places where the return value
of dt_dealloc was wrong.  :)


   Attributes

Atoms have an attribute called "charge", but how does that call
dt_charge to get the charge?  It uses a very cool feature of Python,
so let me start with the easy version first.

A Python class definition looks like:

class Atom:
  def __init__(self, handle):
    self.handle = handle

  def get_charge(self):
     return dt_charge(self.handle)

  def set_charge(self, charge):
     if dt_setcharge(self.handle) ==


This defines two methods, "__init__", which is the constructor, and
"charge".  For C++ programmers, the "self" is very much like the
"this" pointer, but Python require that it be mentioned explicitly.

Suppose you know that 6 is a handle to an atom in the toolkit.  Then
you can create the Atom wrapper like:

a = Atom(6)

The value of a's "handle" will be set to 6.  To print the charge:

print a.get_charge()

To set it to 2

a.set_charge(2)

I would like to say "print a.charge" or "a.charge = 5", but in Python,
attributes are simply references, and no function can get called.
However, Python does have a special method which gets called if a
lookup fails.  It's called "__getattr__" and is called with the name
of the variable which failed:


class Atom:
  def __init__(self, handle):
    self.handle = handle

  def __getattr__(self, name):
     if name == "charge":
       return dt_charge(self.handle)
     raise AttributeError, name

a = Atom(6)

So if you do "print a.charge", Python lookups up the attribute
"charge".  These isn't one, so it finds the __getattr__ method and
calls it with the name "charge".  This method then does the lookup to
call the toolkit function and return its results.

The "AttributeError" is a special Python exception type which is used
to indicate that a attribute lookup failed.

Another special method, __setattr__, overrides the default way to set
attributes.  It is called with the attribute name and value, like the
following:

   def __setattr__(self, name, value):
     if name == "charge":
        return dt_setcharge(self.handle, value)
     self.__dict__[name] = value

The special attribute "__dict__" contains a dictionary (associative
array) of the standard attributes in an object.  That piece of code
emulates the original way to set attributes.

By using these __getattr__ and __setattr__ methods, it is possible to
map attributes to toolkit functions, like charge, or stringvalue, or
typename, or ....


(Aside: It's possibly to do that same thing in C++, so that
"atom.charge" gets mapped to a function.  However, the usual way to
get/set values in C++ is through functions, which is fine since C++
supports inlining.)


  Memory management

Another advantage to most VHLLs (and most languages in general!) is
automatic garbage collection.  C requires explicit garbage collection,
so you have to match your mallocs with your frees.  The toolkit can
create objects, which must be properly dt_dealloc'ed.  By looking at
example code, about 20% of the toolkit function calls were to
dt_dealloc, so this is definitely an important part of toolkit
programming.

The simpliest form of garbage collection is reference counting.  When
an object is created, it has a reference count of 1.  If something
else use that object, the reference count goes up by one.  When it is
no longer needed, the reference count goes down.  If the reference
count is 0, it is deallocated.

Python uses reference counting for its objects.  But the Daylight
object is that opaque integer handle, and Python doesn't know enough
to deallocate the handle when it isn't needed.  Instead, PyDaylight
wraps the integer with something which is garbage collected, a "smart
pointer" instance.  The real class definition is only slightly more
complicated than:

class smart_ptr:
   def __init__(self, handle):
      self.handle = handle
   def __del__(self):
      dt_dealloc(self.handle)
   def __int__(self):
      return self.handle


You know __init__ is the constructor.  The __del__ is something like
the destructor, expect that it only gets called after the reference
count goes to 0.  It's more like the Java finializer.  The third
special method is "__int__".  All of the toolkit handles are integers,
so when an object is used instead of an integer, Python will try to
coerce the object to an integer.  This is done by calling "__int__",
if it exists.

The smart pointer is used like:

class Molecule:
  def __init__(self, handle):
     self.handle = handle

def smilin(s):
   smart_h = smart_ptr(dt_smilin(s))
   return Molecule(smart_h)

mol = smilin("c1ccccc1")
del mol

When the variable "mol" is del'ed, its reference count goes to 0.
Then its attributes' reference counts are reduced by 1, including the
"handle" attribute, which is the smart pointer.  If this goes to 0,
then the smart_ptr.__del__ method is called, which deallocates the
handle.

There is a similar class called "smart_list" used to wrap streams
where all of the items of the stream also need to be deallocated when
the stream is deallocated.

Somewhat tricky, but it works.  The result is that there are no calls
to dt_dealloc by PyDaylight users -- except to delete parts of a
molecule or reaction.

I'll say that again -- PyDaylight users do not need to worry about
memory management.


  Iterators

There is one final aspect to this "Part I" talk, iterators.  The
toolkit provides two hybrid list/iterator data types, a "stream" type
and a "sequence" type.  Python has a native data type called a "list"
which is similar, except that it can be random access.

  Python also lets you create new "list-like" classes.  Again, this is
done with a special method, called "__getitem__", which takes an
index.  An example class is:

class FakeList:
   def __getitem__(self, index):
      if index == 0:
         return "Andrew"
      if index == 1:
         return "Dalke"
      raise IndexError, index

As with the AttributeError for __getattr__, the IndexError is a
special exception type to indicate that the index is out of range.

It is possible to wrap the Daylight stream with list-like class.  The
simplest case is:

class StreamList:
   def __init__(self, handle):
      self.handle = handle
   def __getitem__(self, index):
      x = dt_next(self.handle)
      if x == NULL_OB:            # technically, should check if I'm at the end ...
         raise IndexError, index
      return x

which can be used:

atom_stream = dt_stream(molecule, TYP_ATOM)
atoms = StreamList(atom_stream)
print atoms[0], atoms[1]

  -or-

atom_stream = dt_stream(molecule, TYP_ATOM)
for atom in StreamList(atom_stream):
  print Atom(atom).charge

This only allows you to go forward through the list but doesn't check
that the input index is correct.  A safer version might look like:

class StreamListForward:
   def __init__(self, handle):
      self.handle = handle
      self.i = 0
   def __getitem__(self, index):
      if index != self.i:
         raise AssertionError, "forward iterator"
      self.i = self.i + 1
      x = dt_next(self.handle)
      if x == NULL_OB:
         raise IndexError, index
      return x

Or the code could emulate a random access iterator by advancing to the
correct position, possibly resetting if needed.  I think you get the
idea :)

On top of that, the current version only returns the integer handle.
Often you really want the given data type, so you need some sort of
converter.

class TypedStreamList:
   def __init__(self, handle, converter):
      self.handle = handle
      self.converter = converter
   def __getitem__(self, index):
      x = dt_next(self.handle)
      if x == NULL_OB:
         raise IndexError, index
      return self.converter(x)

atom_stream = dt_stream(molecule, TYP_ATOM)
for atom in StreamList(atom_stream, Atom):
  print atom.charge


Using the explicit "dt_stream" is tedious, but even that can be
hidden.  You can think of the list of atoms as an attribute of a
molecule, and create a new attribute name understood by __getattr__:

class Molecule:
   def __init__(self, handle):
      self.handle = handle
   def __getattr__(self, name):
      if name == "atoms":
         stream = dt_stream(self.handle, TYP_ATOM)
         smart_stream = smart_ptr(stream)
         return TypedStreamList(smart_stream, Atom)


then you can say (assuming 1 is a handle to a molecule):

mol = smilin("c1ccccc1")
for atom in mol.atoms:
  print atom.charge

I've been lying to you.  PyDaylight only rarely exposes the StreamList
style of class to a user.  In most cases it converts the whole stream
to a native list and uses the list.  This is because Daylight's
iterator concept is bound too closely to the data container.  If you
want to have two iterators over the data (as with doing an NxN
comparison), you need to make copy of the data for one of the
iterators.

The only places PyDaylight exposes an explicit iterator object are
when it streams through datatrees and strings of a database.


These capabilities (object wrappers with dynamic attribute lookup,
exceptions, and reference counting) are sufficient to handle what
Bioreason needed for manipulating Molecule objects (with Atom, Bond
and Cycle) and for doing SMARTs searches.  The framework has proven to
be very reliable and easy to use, and I estimate has cut the
development time down by at least a factor or 4, and made it easier
for more people to do toolkit programming.


  End of Part I.


   Interlude -- Depictions

Bioreason has had some support for depictions for a while.  Depictions
are troublesome since the depiction library needs to be linked to a
drawing library, written in C.  So I wrote a small library to map the
C function calls back to Python and included these functions as part
of dayswig when it compiles.

When you call the function you give it a "callback object".  This
contains methods which implement the toolkit functions.  For example,
when the drawing library C function "setlinewidth" is called, it is
forwarded to the callback object's "setlinewidth" method.

Currently supported, more or less, are callback objects for Tk and
PIL, the Python Imaging Library.  I gotta say, though, that fonts are
a bitch -- I would love to use the vector character set used by
Daylight.


import Image, ImageFont
from daylight import Smiles, Depict
from daylight.Depict import PILCallback

im = Image.new("RGB", (350, 200) )

cb = PILCallback.PILCallback(im, font = ImageFont.load_path(
    "-adobe-courier-bold-o-normal--24-240-75-75-m-150-iso8859-1.pil"))
                             
mol = Smiles.smilin("O[C@H]1CCCC[C@H]1O")
dep = Depict.Depiction(mol)
dep.calcxy()
dep.color_atoms()
dep.depict(cb)
im.save(open("cis-resorcinol.png", "wb"), "png")


   Part II.

The major new addition to PyDaylight during my stay ay Daylight was
support for Thor and Merlin.  Of course, support was already there
since the toolkit functions were available from DaySWIG, but they
weren't wrapped to support exceptions, attributes nor reference
counting.

   Where are the objects?

The first step is to discover the objects and how they interrelate.
Some are easy to discover -- if there's a Daylight type, then it's
likely an object.

It isn't all that easy when you consider what they're supposed to do.
Take Thor, which is where I started.

There is a Thor server.  It can be used to open, create or delete
databases.  It can administer server passwords, and you can modify the
userpath.  You can kick people off and a few more things.  Seems easy,
right?

Let'c consider use cases.  I want to change a user password.


   Data Discovery


Python has pretty good support for what is now called "introspection",
that is, the ability to discover information about objects.  If you
have a dictionary you can get a list of its keys ...

Thor.open --> ThorServer
ThorServer.open -> ThorDatabase
ThorServer.databases -> DatabaseManager
 DatabaseManager["name"] -> DBManger
 DBManager.open() -> ThorDatabase
ThorServer.users -> UserManager
ThorServer.passwords -> PasswordManager

Thor is straightforward

   Reference counting, again

There are a few functions in the toolkit which reuse existing handles.
For example, the full call to open a Thor connection is:

     dt_Handle dt_open(dt_Handle server, dt_Integer dblen, dt_String
       dbname, dt_Integer mlen, dt_String mode, dt_Integer plen, dt_String
       passwd, dt_Integer *isnew)

The last parameter, "isnew", is TRUE if the handle returned is reused.
This interferes with the reference counted garbage collector since the
smart_ptr doesn't know how many other objects are using the handle.

What's needed is a shared area where the different smart pointers can
communicate to see who really is the last one.  It is possible to do
this in pure Python, but I was worried about how to play nicely with
other Daylight/Python modules or possible C extensions.  Instead, I
used the Daylight properties to store an integer field called
"__ref_count" and created a new smart pointer called "isnew_smart_ptr"
to manage that property.  When the smart pointer's finalizer is
called, it checks the __ref_count, and if the count is 1 (meaning this
was the last reference), the object is dealloc'ed.


The Merlin and Thor interfaces are similar, so there are base classes
called "Server" shared by ThorServer and MerlinServer, and a
"Database" shared by ThorDatabase and MerlinPool.


Merlin more complex

Do sorts and searches on hitlists

New class, "Task".
The manual gives a C fragment for doing tasks.
Encapsulate and simplify for reuse.

task = hits.sort( ... some parameters here ...)
task.finish()

# SMARTS search (takes some time)
task = hits.superselect( ... some parameters here ...)
task.wait(Task.TextStatusBar())
print


I'm hiding the parameters because they are part of the next point.  

     dt_Integer dt_mer_sort(dt_Handle hitlist, dt_Handle column, dt_Integer
        sortmethod, dt_Integer direction, dt_Integer *status)

DaySWIG converts this to (ahh, the joys of being able to return more
than a scalar!):

   progress, status = dt_mer_sort(hitlist, column, sortmethod, direction)


After looking at it for a while, I realized that the column and
sortmethod terms are related; string sort methods only work with
string columns, and numeric with numeric.  This is really expressing


   progress, status = dt_mer_sort(hitlist, (column, sortmethod), direction)

where the (column, sortmethod) is some combination data type.  This
seems interesting, but how does it help?  Take a look at another task
function:

     dt_Integer dt_mer_strsearch(dt_Handle hitlist, dt_Handle column, dt_Integer
        searchtype, dt_Integer action, dt_Integer find_next, dt_Integer *
        status, dt_Integer s1len, dt_String s1, dt_Integer s2len, dt_String s2)

or, in Python:

   progress, status = dt_mer_strsearch(hitlist, column, searchtype, action,
                                       find_next, s1, s2)

Some of the parameters are related to each other:

  if action is not MOVE_HITS or MOVE_NONHITS then find_next is ignored
  
  the searchtype and s1, s2 values define the type of string search to do,
     and some searchtype and string values are illegal

  you can only do string searches on non-numeric columns


After even more thought, this can be made into:

   progress, status = dt_mer_strsearch(hitlist, (column, (searchtype, s1, s2)),
                                       (action, find_next))

or generically as:

   progress, status = dt_mer_strsearch(hitlist, searcher, actioner)

Written this way, the strsearch algorithm is generically written as:

    matches = []
    for i in range(len(hitlist)):
      if searcher.match(hitlist, i):
        matches.append(i)
    return actioner.combine(hitlist, matches)

I realize that the Thor server is much smarter than this algorithm,
but the key thing is that this algorithm is generic, that is, the
algorithm works on anything you pass it.  Suppose it was a string
search for a specific match, then the search object might be:

class StringMatch:
  def __init__(self, column, s):
     self.column = column
     self.s = s
  def match(self, hitlist, i):
     return dt_mer_cellvalue(self.column, hitlist, i) == self.s

Suppose it was to find if two different columns matched at the given
position, then the class might be:

class PairMatch:
  def __init__(self, col1, col2
     self.col1 = col1
     self.col2 = col2
  def match(self, hitlist, i):
     return dt_mer_cellvalue(self.col1, hitlist, i) == \
            dt_mer_cellvalue(self.col2, hitlist, i)


The actioner can also do what it needs to do to add, delete, move,
etc. given the old hitlist and the list of matches.  

Realistically there will be some collusion between the three different
components for performance reasons, but the result is identical.

The -er suffix is to say that these are objects which do something --
they are functors, or function objects.  These are like function
pointers, except they also can contain data.


MCL converter

====================
import daylight
from daylight import Merlin, Grid

database = Merlin.open_fullname("medchem98@origin")
SMILES = database.typename_column("$SMI")
NAME = database.typename_column("$NAM")
ACTIVITY = database.typename_column("AC")

hits = database.hitlist()

hits.zapna(SMILES)
hits.zapna(NAME)
hits.zapna(ACTIVITY)
task = hits.sort(NAME.sorts.ASC)
task.finish()

print "There are", len(hits), "elements.  The top 5 are:"
g = Grid.Grid(hits, [SMILES, NAME, ACTIVITY])
for i in range(5):
  for j in range(3):
     print "  %.20s" % g[i,j],
  print
====================

Server Hacks