Dalke Scientific Software: More science. Less time.

Python Training for Cheminformatics

I teach training courses in Python programming for computational chemistry, with an emphasis on cheminformatics. No public courses are currently scheduled. I will be in Ireland for OpenEye's CUP conference in September and in England in October. If you are interested in on-site training during those trips, let me know!

To apply or ask questions, send email to trainingdiscard@dalkescientific.com. Up-to-date information can always be found at http://dalkescientific.com/training/.

Why programming courses for chemists?

Computational chemists are not programmers, but programming is an essential skill for developing new algorithms, generating data, and analyzing results. Most working scientists have little training in programming and end up spending a lot of time figuring out how to parse a file format or work with a software library, rather than figuring out the science.

My training course is designed for just these people.

I teach both corporate courses and public ones. The presentations are a mixture of lecture and hands-on exercises in a similar style to my NBN courses. The specific topics will vary based on the needs of the audience. Contact me if there's something you specifically want me to discuss, and see the bottom of this page for a partial list of topics I can cover.

Who am I?

My name is Andrew Dalke. I am a professional software developer with years of experience creating tools for cheminformatics, molecular modeling, bioinformatics, and related fields. Some of the more public projects I've been part of are VMD, NAMD, BioPython, PyDaylight, and the Open Bioinformatics Foundation. Throughout my career I have worked closely with chemists to help them be more effective, by developing software, providing one-on-one advice, training, and writing essays about the software side of this field.

Most of my work over the last 10 years has been in Python, which is the most popular high-level language in computational chemistry. Many tools, especially in molecular visualization and chemical informatics, have Python interfaces. Python is also one of the most popular computer languages in the world, with mature software libraries for everything from image manipulation and SQL databases to GUI and web development. I am a member of the Python Software Foundation, which is the non-profit that holds the copyright to Python.

Scheduled courses

The following courses are meant for computational chemists with some programming background. They are not introduction-to-programming courses. You must know how to write programs (see "What experience do you need?" below) and should have some experience with Python.

Course fees include coffee breaks, lunch, and all presentation materials. Each day starts at 9.00 and ends around 17.30.

If you have any questions, send me an email at trainingdiscard@dalkescientific.com. If you are from an academic or non-profit group then you may qualify for a discount.

Python programming

Leipzig, Germany, 14-15 February 2011

This intense two-day course covers the basics of how to use Python. The first day is an overview of the core Python language and OEChem. The second day covers essential libraries for calling out to command-line programs, handling CSV files, making plots, and more.

Attendees must have some programming experience (know how to use variables, for-loops, and if-statements) and know how to use text editors and command-line tools.

Registration will be €900 including VAT and is limited to 8 people. Special discount: attend this course plus the Django course for €2,000 instead of €2,200.

The course will be hosted by and invoices sent by the Python Academy, located at Zur Schule 20, Leipzig. Register for the Python course or contact me at trainingdiscard@dalkescientific.com if you have any questions.

Web Application Development with Django

Leipzig, Germany, 16-18 February 2011

This three-day course walks through two real-world examples based on the Django web application framework: an interactive descriptor calculator and a PubChem database search system. This course is meant for computational chemists who want to set up an in-house cheminformatics server for specific analysis tasks. The topics I'll cover are:

Attendees must have some existing Python experience. I will be teaching a two-day Python course immediately before the Django course. If you attend both courses you will get a discount.

Registration for the Django course is €1,300 and is limited to 8 people. Special discount: attend this course plus the Python course for €2,000 instead of €2,200.

The course will be hosted by and invoices sent by the Python Academy, located at Zur Schule 20, Leipzig. Register for the Django course or contact me at trainingdiscard@dalkescientific.com if you have any questions.

What experience do you need?

My courses are meant for computational chemists who are not programmers but have some programming experience. In this case, computational chemistry means small-molecule chemistry with an emphasis on cheminformatics and a bit of molecular modeling. You must already know the basic science, like SMILES, molecular graph representation, SMARTS, and substructure searches.

The phrase "some programming experience" means people who are comfortable with strings, integers, floats, variables, if-statements, for-loops, lists/arrays, and defining functions. You must also be comfortable working on the command line and using a text editor or IDE to write programs. You do not need to know object-oriented programming.

You should have some experience with Python but that is not essential. I'll teach the Python-specific features as I work through my examples. Most of the code will look similar in other languages so it should be easy to follow.

For those just starting off in Python, the Python Beginner's Guide contains links to many resources including online tutorials and a list of books. You might be interested in the tutor mailing list "for folks who want to ask questions regarding how to learn computer programming with the Python language."

I have taught beginning programming to bioinformatics graduate students. You can see my lecture notes under the header "Introduction to Programming for Bioinformatics in Python."

Possible topics

I cover different topics in my courses, depending on the length of the course, the background of the participants, and what they want me to focus on. For each course I list the topics I will cover, and I may also cover some additional topics depending on the time and class expertise. The following are some of the topics I can cover, so you can get an idea of what to expect. I will not cover all of these in a single course. Contact me if there's something special you want to know more about.

Overview of Python

Some background about Python, how it got to where it is, and where it's going. This will be about how Python fits into the world. It is not an introduction to programming course.


IPython

IPython enhances the normal interactive Python interpreter to make it better for exploratory programming. New features include easier access to normal shell commands, improved history, support for matplotlib, and better help. You might find Jeff Rush's IPython videos helpful.


matplotlib

matplotlib is a Python 2D plotting library which produces publication-quality figures for hard-copy and interactive use. I'll work through several examples based on chemistry data sets, such as producing a scatter plot and exporting the result. See the gallery for some examples of what it can do.

parsing CSV files

A lot of chemistry data is passed around as comma-, space-, or tab-separated files. I'll work through how to use Python to parse these sorts of files, with a focus on SMILES files.
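To give a flavor of this topic, here is a minimal sketch (not taken from the course materials) of reading a tab-separated SMILES file with Python's standard csv module. The sample data, the two-column layout, and the read_smiles helper are assumptions made for illustration; real SMILES files vary in delimiter and column order.

```python
import csv
import io

# Hypothetical sample data: one SMILES string and one identifier per line,
# separated by a tab.
SAMPLE = "CCO\tethanol\nc1ccccc1\tbenzene\nCC(=O)O\tacetic acid\n"


def read_smiles(infile, delimiter="\t"):
    """Yield (smiles, identifier) pairs from a delimited text file."""
    for row in csv.reader(infile, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        smiles = row[0]
        # Use the second column as the identifier when there is one.
        identifier = row[1] if len(row) > 1 else ""
        yield smiles, identifier


# io.StringIO stands in for an open file so the example is self-contained.
records = list(read_smiles(io.StringIO(SAMPLE)))
for smiles, identifier in records:
    print(identifier, smiles)
```

To read a real file, pass an open file object instead of the io.StringIO wrapper, and change the delimiter argument for comma- or space-separated data.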


OEChem

OEChem is a commercial programming library from OpenEye for small molecule chemistry. It's full-featured, fast, and powerful, but a bit on the hard side to use. Some of the topics I can cover are:
  • parsing SMILES and working with the molecule object
  • parsing SMARTS and substructure searches
  • working with SMILES and SD files
  • maximum common substructure
  • translation between chemistry models

OpenBabel and RDKit

OpenBabel and RDKit are two open source cheminformatics toolkits with Python interfaces. I've used OpenBabel a bit, mostly through the high-level pybel library.

You might be thinking, "Why not teach OpenBabel instead of OEChem?" After all, the library is freely available, so anyone can install it without needing a special license. I could teach it, but I don't have as much experience with it as I do with OEChem. Every library has certain nuances, and I don't know them for OpenBabel. I also think that OEChem is a better library for most uses, if you have the money and consider proprietary software acceptable.

As an example of a difference, OpenBabel follows the Daylight model and assumes that the right behavior is to convert everything into the same chemistry model. OEChem doesn't make that assumption and instead has functions to convert between different chemistry models; you have to call those functions. This causes small differences in how the libraries treat conditions like aromaticity. I have more experience with the OEChem way and can explain it better.

If you are interested in me teaching specifically OpenBabel or RDKit then contact me.

Generating and searching fingerprints

I've developed a set of Python tools to work with molecular fingerprints. I'll describe how to use them to generate fingerprints and do Tanimoto searches.
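For readers new to the idea, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. The following is a minimal pure-Python sketch of that calculation and a threshold search. It is not the toolset mentioned above: the 8-bit fingerprints, the names, and the tanimoto and search functions are all hypothetical, and real fingerprints are much longer and are produced by a chemistry toolkit.

```python
def tanimoto(fp1, fp2):
    """Tanimoto similarity of two fingerprints stored as Python integers."""
    common = bin(fp1 & fp2).count("1")  # bits set in both fingerprints
    total = bin(fp1 | fp2).count("1")   # bits set in either fingerprint
    if total == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return common / total


def search(query, targets, threshold=0.5):
    """Return (score, name) pairs at or above threshold, best score first."""
    hits = []
    for name, fp in targets.items():
        score = tanimoto(query, fp)
        if score >= threshold:
            hits.append((score, name))
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits


# Hypothetical 8-bit fingerprints, for illustration only.
query = 0b10110010
targets = {"mol1": 0b10110010, "mol2": 0b10010000, "mol3": 0b01001101}

for score, name in search(query, targets):
    print(name, score)
```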


PyMol

PyMol is a very popular structure visualization program from DeLano Scientific. It contains an extensive Python programming interface which can be used for structure analysis, movie making, and more. I'll show you some of the things you can do with it.


numpy

numpy is a collection of numeric tools for Python, with an emphasis on arrays and linear algebra.


subprocess

Many chemistry programs are only available through the command line. I'll show how to use Python's subprocess module to call them from Python.
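As a rough illustration, the sketch below sends text to an external program on standard input and captures what it prints. No particular chemistry program is assumed: a child Python process that counts input lines stands in for the real tool, and the run_program helper is a hypothetical name. It uses subprocess.run, which requires Python 3.7 or later for the capture_output argument.

```python
import subprocess
import sys


def run_program(args, input_text):
    """Run a command, feed it input_text on stdin, and return its stdout."""
    result = subprocess.run(
        args,
        input=input_text,
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout


# Stand-in for a real command-line tool: a child Python process that
# reports how many lines it received on stdin.
command = [sys.executable, "-c", "import sys; print(len(sys.stdin.readlines()))"]
output = run_program(command, "CCO\nc1ccccc1\n")
print(output.strip())
```

Passing the command as a list of arguments, rather than a single shell string, avoids quoting problems when file names or SMILES strings contain special characters.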


BeautifulSoup

A lot of data is available through web pages designed for people, not software. BeautifulSoup makes it easier to get access to that data in your programs. One example I can cover is how to get data from a PubChem web page.

working with a relational database

Corporate compound databases are often stored in a relational database like Oracle. You don't need to be corporate IS to set up a relational database. MySQL is a widely used free database server. I'll show some examples of creating a database schema, loading compound data into the database, and searching the data set. I've developed a set of chemistry extensions to MySQL so you can get some experience in doing chemically-aware queries, like substructure searches and similarity ranking.
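The schema, load, and search steps can be sketched with Python's built-in sqlite3 module, used here as a stand-in for MySQL so the example runs without a database server. The table layout and compound data are assumptions for illustration, and the chemistry extensions mentioned above are not shown; the query filters on an ordinary numeric column.

```python
import sqlite3

# An in-memory SQLite database stands in for a MySQL server.
conn = sqlite3.connect(":memory:")

# Create a simple, hypothetical schema for compound records.
conn.execute(
    "CREATE TABLE compounds ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " smiles TEXT NOT NULL,"
    " mol_weight REAL)"
)

# Load compound data, using placeholders rather than string formatting.
rows = [
    ("ethanol", "CCO", 46.07),
    ("benzene", "c1ccccc1", 78.11),
    ("acetic acid", "CC(=O)O", 60.05),
]
conn.executemany(
    "INSERT INTO compounds (name, smiles, mol_weight) VALUES (?, ?, ?)", rows
)
conn.commit()

# Search the data set: compounds lighter than 65 g/mol, lightest first.
light = conn.execute(
    "SELECT name, smiles FROM compounds WHERE mol_weight < ? ORDER BY mol_weight",
    (65.0,),
).fetchall()
print(light)
```

The same pattern of connect, execute, and fetch carries over to other databases through Python's DB-API, though the connection call and placeholder style depend on the driver.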

developing web applications

Django is a popular web development framework for Python which makes developing web applications much simpler than traditional CGI programming.

These lectures will be based around developing a web application for substructure searches and will cover how to work with the database, generate templates, structure the URLs, use CSS, and add JavaScript interactivity with jQuery.

Software development best practices

Software development is more than sitting down and programming. How do you keep track of changes to the code over time? If you change code, how do you figure out if the change broke something? How do multiple people work together on the same code base? What are some of the common development traps that people can get stuck in? Why do you need to care about security?

Many of these are covered in Greg Wilson's Software Carpentry. I'll specifically talk about version control, project builds, testing with nose, and development practices like code reviews, YAGNI, and agile development.
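As a small illustration of the testing style involved, nose collects plain functions whose names start with "test" and treats a failed assert as a test failure. The sketch below shows that layout; the count_atoms_by_symbol function and its tests are hypothetical examples, not course material.

```python
def count_atoms_by_symbol(symbols):
    """Count how many times each element symbol occurs in a list."""
    counts = {}
    for symbol in symbols:
        counts[symbol] = counts.get(symbol, 0) + 1
    return counts


# Test functions: a test runner calls each one, and an AssertionError
# marks that test as failed.
def test_empty_input():
    assert count_atoms_by_symbol([]) == {}


def test_repeated_symbols():
    assert count_atoms_by_symbol(["C", "C", "O"]) == {"C": 2, "O": 1}


if __name__ == "__main__":
    # Without a test runner, the tests can simply be called directly.
    test_empty_input()
    test_repeated_symbols()
    print("all tests passed")
```

Rerunning tests like these after every change is one way to answer the question of whether a change broke something.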


XML

Despite the best efforts of some, there's still very little direct influence of XML in computational chemistry. We still use SD and SMILES files. But there is sometimes a need, especially when getting data from IT-oriented software developers. I'll show some examples of using ElementTree to read and extract fields from an XML file.
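Here is a minimal sketch of that kind of field extraction using the standard library's xml.etree.ElementTree. The XML layout, with its compound, name, and smiles elements, is invented for illustration and does not correspond to any particular chemistry format.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, for illustration only.
XML_TEXT = """\
<compounds>
  <compound id="c1">
    <name>ethanol</name>
    <smiles>CCO</smiles>
  </compound>
  <compound id="c2">
    <name>benzene</name>
    <smiles>c1ccccc1</smiles>
  </compound>
</compounds>
"""

root = ET.fromstring(XML_TEXT)

compounds = []
for node in root.findall("compound"):
    compound_id = node.get("id")      # attribute value
    name = node.findtext("name")      # text of the <name> child element
    smiles = node.findtext("smiles")  # text of the <smiles> child element
    compounds.append((compound_id, name, smiles))

for record in compounds:
    print(record)
```

To read from a file instead of a string, ET.parse(filename).getroot() returns the same kind of root element.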

Using COM to control Excel

Python for Windows has excellent support for COM, which is how Windows programs can embed or control other Windows programs. Chemists use Excel a lot. It's not hard to write a Python program which opens Excel, creates a new spreadsheet, loads data into the spreadsheet, and creates a plot of the results. Not hard for experienced programmers, that is. I'll show a few examples you can build on and provide pointers to more in-depth information.

R and Python

R is a great software environment for statistical computing and generating plots. If you are building models or doing data mining then you should know about this project. R includes its own programming language and a number of high-quality analysis packages. It's not a general purpose language like Python and it doesn't include the diversity of modules that Python has.

RPy is an interface module which lets Python call R functions directly, including plotting. I've used it in a model calculation system that uses OEChem and other tools to compute descriptor values, then passes them over to R to evaluate the model.

R and Python have different ways of doing things. RPy minimizes the difference, but it's still there. I'll describe some of the basic R data types, how to create them from Python, how to call the R functions through RPy, and how to understand the R documentation well enough to be able to call it from Python.

Copyright © 2001-2013 Andrew Dalke Scientific AB. All rights reserved.