PyDaylight, part II -or- What has Andrew been doing the last four weeks? I'll start with part I, which was covered at my MUG presentation. I want to go into some more detail since I believe this is a more technical audience. The work was all developed in Python, and extends Roger Critchlow's dayswig work. The only new C code was to provide the callback hooks to support the drawing library. I'll talk about that in a bit. DayPerl, DayTcl and DaySWIG all wrap the C-level Daylight toolkit and provide interfaces to "very high-level languages" (VHLL). An advantage to these languages is they provide many higher-level constructs, like auto-resizable lists, associative arrays, exceptions and garbage collection, which reduces the time, expertise and amount of code needed to develop new software. Making the toolkit accessible to the VHLL doesn't automatically add these new behaviours. If dt_cansmiles returns a NULL, the wrapper code doesn't raise an exception in the VHLL, it just returns the 'invalid string' data type. In fact, in some cases the return type may or may not be an error, depending on the context. dt_charge of an atom can return an integer, such as -1, but dt_charge of a molecule is an error, so it returns -1. Reducing polymorphism with classes Python, my VHLL of choice, is an object-oriented langauge. It allows you to create and manipulate new data types, like atom, bond, SMARTS pattern, Thor server and Merlin pool. The toolkit also has a type system, including those same data types. However, because the library was developed to work with C and FORTRAN codes, the type information is hidden behind opaque integer handles. Thus, toolkit functions can (must!) be very polymorphic, to get around C and FORTRAN's lack of overloading. The same function can do many different things, depending on the underlying toolkit types. But all that the C compiler knows is that you are manipulating integers. So it is easy to make simple mistakes, like call dt_charge(molecule) instead of dt_charge(atom), but the compiler doesn't know enough to stop you. The developer must remember more about how the underlying toolkit types work, which increases the difficulty of writing new code. One way to fix the problem is to wrap dt_charge with a function like: def checked_dt_charge(handle): if dt_type(handle) == TYP_ATOM: return dt_charge(handle) raise Error("Cannot get the charge: " + dt_typename(handle)) It's straight-forward to see how to add type checks to all of the toolkit functions in this manner. The problem is, it doesn't take advantage of my ability to create new classes. Conceptually, "charge" is an attribute of an atom, and of nothing else. I would prefer using a system which explicitly reflects that, like: print atom.charge rather than print dt_charge(atom) Functionally, you can also wonder if dt_charge(atom) adds a charge to the atom, whereas viewing it as an attribute you don't have that question. So having classes helps determine the context I was missing earlier. In PyDaylight, dt_charge will never be called on anything other than an atom, so I can completely forego the dt_type() check. Instead, the check is moved to the language's type system. If you do molecule.charge in Python you get an AttributeError exception, and if you redo what I did in C++, the compiler would give you the compile time error message that molecules don't have charge. What's interesting about this is that the polymorphic toolkit functions do perform an explicit type check on all the input parameters, very much like my "checked_dt_charge" above. By switching to a language with a better type system, nearly all of these checks could be done automatically, which would reduce the size of the toolkit and increase its performance. Run-time checks as exceptions Not all of errors are caused by using the wrong data type in a polymorphic function. Some are truely run-time issues, like trying to add an atom when the molecule's "mod" state is off (the molecule is in a "read-only" mode). In PyDaylight, these are explicitly checked, as in: a = dt_addatom(mol, atomic_number, hcount) if a == NULL_OB: raise Error('Cannot add atom') The problem with a language without exceptions, like C or FORTRAN, is that most programmers don't like adding all of the error checking. The check code usually doubles the amount of code written, and it is hard to test all the code paths. Instead, most people do as is encouraged in the Daylight programming manual, and defer the error checking to later. Then when errors do occur, it is much harder to track down just where the error started, since the locality was lost. I should also add that I don't check all of the errors, just the ones that lead to confusion, like add atom. As an observational fact, I know people weren't checking all of the function results, since I found a couple places where the return value of dt_dealloc was wrong. :) Attributes Atoms have an attribute called "charge", but how does that call dt_charge to get the charge? It uses a very cool feature of Python, so let me start with the easy version first. A Python class definition looks like: class Atom: def __init__(self, handle): self.handle = handle def get_charge(self): return dt_charge(self.handle) def set_charge(self, charge): if dt_setcharge(self.handle) == This defines two methods, "__init__", which is the constructor, and "charge". For C++ programmers, the "self" is very much like the "this" pointer, but Python require that it be mentioned explicitly. Suppose you know that 6 is a handle to an atom in the toolkit. Then you can create the Atom wrapper like: a = Atom(6) The value of a's "handle" will be set to 6. To print the charge: print a.get_charge() To set it to 2 a.set_charge(2) I would like to say "print a.charge" or "a.charge = 5", but in Python, attributes are simply references, and no function can get called. However, Python does have a special method which gets called if a lookup fails. It's called "__getattr__" and is called with the name of the variable which failed: class Atom: def __init__(self, handle): self.handle = handle def __getattr__(self, name): if name == "charge": return dt_charge(self.handle) raise AttributeError, name a = Atom(6) So if you do "print a.charge", Python lookups up the attribute "charge". These isn't one, so it finds the __getattr__ method and calls it with the name "charge". This method then does the lookup to call the toolkit function and return its results. The "AttributeError" is a special Python exception type which is used to indicate that a attribute lookup failed. Another special method, __setattr__, overrides the default way to set attributes. It is called with the attribute name and value, like the following: def __setattr__(self, name, value): if name == "charge": return dt_setcharge(self.handle, value) self.__dict__[name] = value The special attribute "__dict__" contains a dictionary (associative array) of the standard attributes in an object. That piece of code emulates the original way to set attributes. By using these __getattr__ and __setattr__ methods, it is possible to map attributes to toolkit functions, like charge, or stringvalue, or typename, or .... (Aside: It's possibly to do that same thing in C++, so that "atom.charge" gets mapped to a function. However, the usual way to get/set values in C++ is through functions, which is fine since C++ supports inlining.) Memory management Another advantage to most VHLLs (and most languages in general!) is automatic garbage collection. C requires explicit garbage collection, so you have to match your mallocs with your frees. The toolkit can create objects, which must be properly dt_dealloc'ed. By looking at example code, about 20% of the toolkit function calls were to dt_dealloc, so this is definitely an important part of toolkit programming. The simpliest form of garbage collection is reference counting. When an object is created, it has a reference count of 1. If something else use that object, the reference count goes up by one. When it is no longer needed, the reference count goes down. If the reference count is 0, it is deallocated. Python uses reference counting for its objects. But the Daylight object is that opaque integer handle, and Python doesn't know enough to deallocate the handle when it isn't needed. Instead, PyDaylight wraps the integer with something which is garbage collected, a "smart pointer" instance. The real class definition is only slightly more complicated than: class smart_ptr: def __init__(self, handle): self.handle = handle def __del__(self): dt_dealloc(self.handle) def __int__(self): return self.handle You know __init__ is the constructor. The __del__ is something like the destructor, expect that it only gets called after the reference count goes to 0. It's more like the Java finializer. The third special method is "__int__". All of the toolkit handles are integers, so when an object is used instead of an integer, Python will try to coerce the object to an integer. This is done by calling "__int__", if it exists. The smart pointer is used like: class Molecule: def __init__(self, handle): self.handle = handle def smilin(s): smart_h = smart_ptr(dt_smilin(s)) return Molecule(smart_h) mol = smilin("c1ccccc1") del mol When the variable "mol" is del'ed, its reference count goes to 0. Then its attributes' reference counts are reduced by 1, including the "handle" attribute, which is the smart pointer. If this goes to 0, then the smart_ptr.__del__ method is called, which deallocates the handle. There is a similar class called "smart_list" used to wrap streams where all of the items of the stream also need to be deallocated when the stream is deallocated. Somewhat tricky, but it works. The result is that there are no calls to dt_dealloc by PyDaylight users -- except to delete parts of a molecule or reaction. I'll say that again -- PyDaylight users do not need to worry about memory management. Iterators There is one final aspect to this "Part I" talk, iterators. The toolkit provides two hybrid list/iterator data types, a "stream" type and a "sequence" type. Python has a native data type called a "list" which is similar, except that it can be random access. Python also lets you create new "list-like" classes. Again, this is done with a special method, called "__getitem__", which takes an index. An example class is: class FakeList: def __getitem__(self, index): if index == 0: return "Andrew" if index == 1: return "Dalke" raise IndexError, index As with the AttributeError for __getattr__, the IndexError is a special exception type to indicate that the index is out of range. It is possible to wrap the Daylight stream with list-like class. The simplest case is: class StreamList: def __init__(self, handle): self.handle = handle def __getitem__(self, index): x = dt_next(self.handle) if x == NULL_OB: # technically, should check if I'm at the end ... raise IndexError, index return x which can be used: atom_stream = dt_stream(molecule, TYP_ATOM) atoms = StreamList(atom_stream) print atoms[0], atoms[1] -or- atom_stream = dt_stream(molecule, TYP_ATOM) for atom in StreamList(atom_stream): print Atom(atom).charge This only allows you to go forward through the list but doesn't check that the input index is correct. A safer version might look like: class StreamListForward: def __init__(self, handle): self.handle = handle self.i = 0 def __getitem__(self, index): if index != self.i: raise AssertionError, "forward iterator" self.i = self.i + 1 x = dt_next(self.handle) if x == NULL_OB: raise IndexError, index return x Or the code could emulate a random access iterator by advancing to the correct position, possibly resetting if needed. I think you get the idea :) On top of that, the current version only returns the integer handle. Often you really want the given data type, so you need some sort of converter. class TypedStreamList: def __init__(self, handle, converter): self.handle = handle self.converter = converter def __getitem__(self, index): x = dt_next(self.handle) if x == NULL_OB: raise IndexError, index return self.converter(x) atom_stream = dt_stream(molecule, TYP_ATOM) for atom in StreamList(atom_stream, Atom): print atom.charge Using the explicit "dt_stream" is tedious, but even that can be hidden. You can think of the list of atoms as an attribute of a molecule, and create a new attribute name understood by __getattr__: class Molecule: def __init__(self, handle): self.handle = handle def __getattr__(self, name): if name == "atoms": stream = dt_stream(self.handle, TYP_ATOM) smart_stream = smart_ptr(stream) return TypedStreamList(smart_stream, Atom) then you can say (assuming 1 is a handle to a molecule): mol = smilin("c1ccccc1") for atom in mol.atoms: print atom.charge I've been lying to you. PyDaylight only rarely exposes the StreamList style of class to a user. In most cases it converts the whole stream to a native list and uses the list. This is because Daylight's iterator concept is bound too closely to the data container. If you want to have two iterators over the data (as with doing an NxN comparison), you need to make copy of the data for one of the iterators. The only places PyDaylight exposes an explicit iterator object are when it streams through datatrees and strings of a database. These capabilities (object wrappers with dynamic attribute lookup, exceptions, and reference counting) are sufficient to handle what Bioreason needed for manipulating Molecule objects (with Atom, Bond and Cycle) and for doing SMARTs searches. The framework has proven to be very reliable and easy to use, and I estimate has cut the development time down by at least a factor or 4, and made it easier for more people to do toolkit programming. End of Part I. Interlude -- Depictions Bioreason has had some support for depictions for a while. Depictions are troublesome since the depiction library needs to be linked to a drawing library, written in C. So I wrote a small library to map the C function calls back to Python and included these functions as part of dayswig when it compiles. When you call the function you give it a "callback object". This contains methods which implement the toolkit functions. For example, when the drawing library C function "setlinewidth" is called, it is forwarded to the callback object's "setlinewidth" method. Currently supported, more or less, are callback objects for Tk and PIL, the Python Imaging Library. I gotta say, though, that fonts are a bitch -- I would love to use the vector character set used by Daylight. import Image, ImageFont from daylight import Smiles, Depict from daylight.Depict import PILCallback im = Image.new("RGB", (350, 200) ) cb = PILCallback.PILCallback(im, font = ImageFont.load_path( "-adobe-courier-bold-o-normal--24-240-75-75-m-150-iso8859-1.pil")) mol = Smiles.smilin("O[C@H]1CCCC[C@H]1O") dep = Depict.Depiction(mol) dep.calcxy() dep.color_atoms() dep.depict(cb) im.save(open("cis-resorcinol.png", "wb"), "png") Part II. The major new addition to PyDaylight during my stay ay Daylight was support for Thor and Merlin. Of course, support was already there since the toolkit functions were available from DaySWIG, but they weren't wrapped to support exceptions, attributes nor reference counting. Where are the objects? The first step is to discover the objects and how they interrelate. Some are easy to discover -- if there's a Daylight type, then it's likely an object. It isn't all that easy when you consider what they're supposed to do. Take Thor, which is where I started. There is a Thor server. It can be used to open, create or delete databases. It can administer server passwords, and you can modify the userpath. You can kick people off and a few more things. Seems easy, right? Let'c consider use cases. I want to change a user password. Data Discovery Python has pretty good support for what is now called "introspection", that is, the ability to discover information about objects. If you have a dictionary you can get a list of its keys ... Thor.open --> ThorServer ThorServer.open -> ThorDatabase ThorServer.databases -> DatabaseManager DatabaseManager["name"] -> DBManger DBManager.open() -> ThorDatabase ThorServer.users -> UserManager ThorServer.passwords -> PasswordManager Thor is straightforward Reference counting, again There are a few functions in the toolkit which reuse existing handles. For example, the full call to open a Thor connection is: dt_Handle dt_open(dt_Handle server, dt_Integer dblen, dt_String dbname, dt_Integer mlen, dt_String mode, dt_Integer plen, dt_String passwd, dt_Integer *isnew) The last parameter, "isnew", is TRUE if the handle returned is reused. This interferes with the reference counted garbage collector since the smart_ptr doesn't know how many other objects are using the handle. What's needed is a shared area where the different smart pointers can communicate to see who really is the last one. It is possible to do this in pure Python, but I was worried about how to play nicely with other Daylight/Python modules or possible C extensions. Instead, I used the Daylight properties to store an integer field called "__ref_count" and created a new smart pointer called "isnew_smart_ptr" to manage that property. When the smart pointer's finalizer is called, it checks the __ref_count, and if the count is 1 (meaning this was the last reference), the object is dealloc'ed. The Merlin and Thor interfaces are similar, so there are base classes called "Server" shared by ThorServer and MerlinServer, and a "Database" shared by ThorDatabase and MerlinPool. Merlin more complex Do sorts and searches on hitlists New class, "Task". The manual gives a C fragment for doing tasks. Encapsulate and simplify for reuse. task = hits.sort( ... some parameters here ...) task.finish() # SMARTS search (takes some time) task = hits.superselect( ... some parameters here ...) task.wait(Task.TextStatusBar()) print I'm hiding the parameters because they are part of the next point. dt_Integer dt_mer_sort(dt_Handle hitlist, dt_Handle column, dt_Integer sortmethod, dt_Integer direction, dt_Integer *status) DaySWIG converts this to (ahh, the joys of being able to return more than a scalar!): progress, status = dt_mer_sort(hitlist, column, sortmethod, direction) After looking at it for a while, I realized that the column and sortmethod terms are related; string sort methods only work with string columns, and numeric with numeric. This is really expressing progress, status = dt_mer_sort(hitlist, (column, sortmethod), direction) where the (column, sortmethod) is some combination data type. This seems interesting, but how does it help? Take a look at another task function: dt_Integer dt_mer_strsearch(dt_Handle hitlist, dt_Handle column, dt_Integer searchtype, dt_Integer action, dt_Integer find_next, dt_Integer * status, dt_Integer s1len, dt_String s1, dt_Integer s2len, dt_String s2) or, in Python: progress, status = dt_mer_strsearch(hitlist, column, searchtype, action, find_next, s1, s2) Some of the parameters are related to each other: if action is not MOVE_HITS or MOVE_NONHITS then find_next is ignored the searchtype and s1, s2 values define the type of string search to do, and some searchtype and string values are illegal you can only do string searches on non-numeric columns After even more thought, this can be made into: progress, status = dt_mer_strsearch(hitlist, (column, (searchtype, s1, s2)), (action, find_next)) or generically as: progress, status = dt_mer_strsearch(hitlist, searcher, actioner) Written this way, the strsearch algorithm is generically written as: matches = [] for i in range(len(hitlist)): if searcher.match(hitlist, i): matches.append(i) return actioner.combine(hitlist, matches) I realize that the Thor server is much smarter than this algorithm, but the key thing is that this algorithm is generic, that is, the algorithm works on anything you pass it. Suppose it was a string search for a specific match, then the search object might be: class StringMatch: def __init__(self, column, s): self.column = column self.s = s def match(self, hitlist, i): return dt_mer_cellvalue(self.column, hitlist, i) == self.s Suppose it was to find if two different columns matched at the given position, then the class might be: class PairMatch: def __init__(self, col1, col2 self.col1 = col1 self.col2 = col2 def match(self, hitlist, i): return dt_mer_cellvalue(self.col1, hitlist, i) == \ dt_mer_cellvalue(self.col2, hitlist, i) The actioner can also do what it needs to do to add, delete, move, etc. given the old hitlist and the list of matches. Realistically there will be some collusion between the three different components for performance reasons, but the result is identical. The -er suffix is to say that these are objects which do something -- they are functors, or function objects. These are like function pointers, except they also can contain data. MCL converter ==================== import daylight from daylight import Merlin, Grid database = Merlin.open_fullname("medchem98@origin") SMILES = database.typename_column("$SMI") NAME = database.typename_column("$NAM") ACTIVITY = database.typename_column("AC") hits = database.hitlist() hits.zapna(SMILES) hits.zapna(NAME) hits.zapna(ACTIVITY) task = hits.sort(NAME.sorts.ASC) task.finish() print "There are", len(hits), "elements. The top 5 are:" g = Grid.Grid(hits, [SMILES, NAME, ACTIVITY]) for i in range(5): for j in range(3): print " %.20s" % g[i,j], print ==================== Server Hacks