Dalke Scientific Software: More science. Less time.

Python Training for Cheminformatics

I teach training courses in Python programming for computational chemistry, with an emphasis on cheminformatics. Upcoming courses are:

  • 6-7 October, 2008 in Leipzig, Germany [details]
  • early December, 2008, San Francisco Bay Area [details]

To apply or ask questions, send email to trainingdiscard@dalkescientific.com. Up-to-date information can always be found at http://dalkescientific.com/training/.

Why programming courses for chemists?

Computational chemists are not programmers, but programming is an essential skill for developing new algorithms, generating data, and analyzing results. Most working scientists have little training in programming and end up spending a lot of time figuring out how to parse a file format or work with a software library, rather than figuring out the science.

My training course is designed for just these people.

I teach both corporate courses and public ones. The presentations are a mixture of lecture and hands-on exercises in a similar style to my NBN courses. The specific topics will vary based on the needs of the audience. Contact me if there's something you specifically want me to discuss, and see the bottom of this page for a partial list of topics I can cover.

Who am I?

My name is Andrew Dalke. I am a professional software developer with years of experience creating tools for cheminformatics, molecular modeling, bioinformatics, and related fields. Some of the more public projects I've been part of are VMD, NAMD, BioPython, PyDaylight, and the Open Bioinformatics Foundation. Throughout my career I have worked closely with chemists to help them be more effective by developing software, providing one-on-one advice, training, and writing essays about the software side of this field.

Most of my work over the last 10 years has been in Python, which is the most popular high-level language in computational chemistry. Many tools, especially in molecular visualization and chemical informatics, have Python interfaces. Python is also one of the most popular computer languages in the world, with mature software libraries for everything from image manipulation and SQL databases to GUI and web development. I am a member of the Python Software Foundation, which is the non-profit that holds the copyright to Python.


Scheduled courses

Leipzig, Germany, 6-7 October 2008

I will be teaching a training course in Leipzig on 6-7 October. Interested? Want to register? Have questions? Send me an email at trainingdiscard@dalkescientific.com.

This course is meant for computational chemists with some programming background. It is not an introduction to programming course. You must know how to write programs (see "What experience do you need?" below) and should have some experience with Python.

Registration on or before 10 September is €800 and after 10 September is €900. Prices include VAT. Maximum class size is 8. Minimum class size is 4. You can register up until the day of the course, but if fewer than 4 people are signed up by 12 September then the course will be canceled and all payments returned.

The registration fee includes two lunches, dinner on 6 October, and coffee breaks with snacks. It does not include lodging. I can provide pointers to local area hotels for those who want them.

The course will be hosted by and invoices sent by the Python Academy, located at

Zur Schule 20
04158 Leipzig
Germany

They will provide Windows desktop machines and WLAN for those with their own laptops. I will install the needed software and licenses on the Windows machines. If you want to use your own laptop for the course then let me know so I can give you a list of software to install before coming.

The topics I plan to cover are still in flux but will likely be:

See below for details of each topic. Let me know if there's anything specific you want me to cover.

San Francisco Bay Area, early December, 2008

I am planning a course in the Bay Area during the first week of December, probably 2-3 December (Tuesday and Wednesday). It will be very similar to the Leipzig course. I'm currently searching for a venue.

If you are interested in attending or hosting this course, email me at trainingdiscard@dalkescientific.com.


What experience do you need?

My courses are meant for computational chemists who are not programmers but have some programming experience. "Computational chemistry" in this case means small-molecule chemistry with an emphasis on cheminformatics and a bit of molecular modeling. You must already know the basic science, like SMILES, molecular graph representation, SMARTS, and substructure searches.

The phrase "some programming experience" means people who are comfortable with strings, integers, floats, variables, if-statements, for-loops, lists/arrays, and defining functions. You must also be comfortable working on the command line and using a text editor or IDE to write programs. You do not need to know object-oriented programming.

You should have some experience with Python but that is not essential. I'll teach the Python-specific features as I work through my examples. Most of the code will look similar in other languages so it should be easy to follow.

For those just starting off in Python, the Python Beginner's Guide contains links to many resources including online tutorials and a list of books. You might be interested in the tutor mailing list "for folks who want to ask questions regarding how to learn computer programming with the Python language."

I have taught beginning programming to bioinformatics graduate students. You can see my lecture notes under the header "Introduction to Programming for Bioinformatics in Python."

Possible topics

I cover different topics in my courses, depending on the length of the course, the background of the participants, and what they want me to focus on. For each course I list the topics I will cover, and I may also cover some additional topics depending on the time and class expertise. The following are some of the topics I can cover, so you can get an idea of what to expect. I will not cover all of these in a single course. Contact me if there's something special you want to know more about.

Overview of Python

Some background about Python, how it got to where it is, and where it's going. This will be about how Python fits into the world. It is not an introduction to programming course.

IPython

IPython enhances the normal interactive Python interpreter to make it better for exploratory programming. New features include easier access to normal shell commands, improved history, support for matplotlib, and better help. You might find Jeff Rush's IPython videos helpful.

matplotlib

matplotlib is a Python 2D plotting library which produces publication-quality figures for hard-copy and interactive use. I'll work through several examples based on chemistry data sets, such as producing a scatter plot and exporting the result. See the gallery for some examples of what it can do.

parsing CSV files

A lot of chemistry data is passed around as comma-, space-, or tab-separated files. I'll work through how to use Python to parse these sorts of files, with a focus on SMILES files.
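
As a rough idea of what this kind of parsing looks like, here is a minimal sketch using the csv module from Python's standard library. The two-column layout (SMILES first, then an identifier), the tab delimiter, and the sample records are assumptions made for illustration; real SMILES files vary.

```python
import csv
import io

# Hypothetical sample data: one record per line, SMILES then an identifier,
# separated by a tab. Real SMILES files vary in delimiter and column order.
SAMPLE = "CCO\tethanol\nc1ccccc1\tbenzene\nCC(=O)O\tacetic acid\n"


def read_smiles(infile, delimiter="\t"):
    """Return a list of (smiles, identifier) tuples from an open text file."""
    records = []
    for row in csv.reader(infile, delimiter=delimiter):
        if not row:
            continue  # skip blank lines
        smiles = row[0]
        # Fall back to an empty identifier if the second column is missing.
        identifier = row[1] if len(row) > 1 else ""
        records.append((smiles, identifier))
    return records


# io.StringIO stands in for an open file so the example is self-contained.
records = read_smiles(io.StringIO(SAMPLE))
for smiles, identifier in records:
    print(identifier, smiles)
```

For a file on disk, pass the object returned by open(filename, newline="") instead of the io.StringIO wrapper.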

OEChem

OEChem is a commercial programming library from OpenEye for small molecule chemistry. It's full-featured, fast, and powerful, but a bit on the hard side to use. Some of the topics I can cover are:
  • parsing SMILES and working with the molecule object
  • parsing SMARTS and substructure searches
  • working with SMILES and SD files
  • maximum common substructure
  • translation between chemistry models

OpenBabel and RDKit

OpenBabel and RDKit are two open source cheminformatics toolkits with Python interfaces. I've used OpenBabel a bit, mostly through the high-level pybel library. I still can't get RDKit to compile on my Mac.

You might be thinking, "why not teach OpenBabel instead of OEChem?" After all, the library is freely available so anyone can install it without needing a special license. I could teach it, but I don't have as much experience with it as I do with OEChem. Every library has its nuances, and I don't know them for OpenBabel. I also think that OEChem is a better library for most uses, if you have the money and consider proprietary software acceptable.

As an example of a difference, OpenBabel follows the Daylight model and assumes that the right behavior is to convert everything into the same chemistry model. OEChem doesn't make that assumption and instead has functions to convert between different chemistry models; you have to call those functions. This causes small differences in how the libraries treat conditions like aromaticity. I have more experience with the OEChem way and can explain it better.

If you are interested in me teaching specifically OpenBabel or RDKit then contact me.

Generating and searching fingerprints

I've developed a set of Python tools to work with molecular fingerprints. I'll describe how to use them to generate fingerprints and do Tanimoto searches.
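
The tools themselves are not shown here. To give a feel for the underlying idea, the following is a minimal sketch of a Tanimoto search in plain Python, with each fingerprint stored as an integer used as a set of bits. The function names and the 8-bit fingerprints are made up for illustration and are not taken from my tools.

```python
def popcount(bits):
    """Count the number of set bits in a non-negative integer."""
    return bin(bits).count("1")


def tanimoto(fp1, fp2):
    """Tanimoto similarity: shared on-bits divided by total distinct on-bits."""
    union = popcount(fp1 | fp2)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return popcount(fp1 & fp2) / union


def search(query, targets, threshold):
    """Return (score, name) pairs at or above threshold, best first."""
    hits = [(tanimoto(query, fp), name) for name, fp in targets.items()]
    return sorted((hit for hit in hits if hit[0] >= threshold), reverse=True)


# Made-up 8-bit fingerprints, for illustration only.
targets = {"mol_a": 0b10110010, "mol_b": 0b10110000, "mol_c": 0b01001101}
query = 0b10110011

hits = search(query, targets, threshold=0.5)
for score, name in hits:
    print(name, score)
```

Real fingerprints are much longer (hundreds or thousands of bits), and a production search would use a faster bit-counting method, but the arithmetic is the same.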

PyMol

PyMol is a very popular structure visualization program from DeLano Scientific. It contains an extensive Python programming interface which can be used for structure analysis, movie making, and more. I'll show you some of the things you can do with it.

numpy

numpy is a collection of numeric tools for Python, with an emphasis on arrays and linear algebra.

subprocess

Many chemistry programs are only available through the command line. I'll show how to use Python's subprocess module to call them from Python.
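
A minimal sketch of the approach, assuming a program that reads from stdin and writes to stdout. No particular chemistry program is assumed: the command below is a stand-in that runs a Python one-liner to count its input lines, and run_tool is a hypothetical helper name.

```python
import subprocess
import sys

# Stand-in for a command-line chemistry program: a Python one-liner that
# prints how many lines it read on stdin. Replace this list with the real
# executable and its arguments.
command = [sys.executable, "-c", "import sys; print(len(sys.stdin.readlines()))"]


def run_tool(command, input_text):
    """Run command, feed it input_text on stdin, and return its stdout."""
    result = subprocess.run(
        command,
        input=input_text,
        capture_output=True,
        text=True,   # pass and receive str instead of bytes
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


output = run_tool(command, "CCO ethanol\nc1ccccc1 benzene\n")
print(output.strip())
```

Passing the command as a list rather than a single string avoids shell quoting problems with filenames and SMILES strings.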

BeautifulSoup

A lot of data is available through web pages designed for people, not software. BeautifulSoup makes it easier to get access to that data in your programs. One example I can cover is how to get data from a PubChem web page.

working with a relational database

Corporate compound databases are often stored in a relational database like Oracle. The best way to access Oracle data with Python is through the cx_Oracle extension module. There are similar modules for the other database servers like MySQL and PostgreSQL. All of them implement the Python DB-API 2.0, which standardizes most of the details of talking to the database.

Python comes with a simple but powerful relational database called sqlite. I'll show some examples of creating a database schema, loading compound data into the database, and searching the data set. This will not include working with chemistry cartridges.
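
Here is a minimal sketch of those three steps using the sqlite3 module from the standard library. The table layout, column names, and the three compounds are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical schema and compound data, for illustration only.
conn = sqlite3.connect(":memory:")  # in-memory database; use a filename to persist
conn.execute(
    """
    CREATE TABLE compounds (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        smiles TEXT NOT NULL,
        mol_weight REAL
    )
    """
)

compounds = [
    ("ethanol", "CCO", 46.07),
    ("benzene", "c1ccccc1", 78.11),
    ("acetic acid", "CC(=O)O", 60.05),
]
# The "?" placeholders are the DB-API parameter style used by sqlite3;
# other DB-API modules may use a different style.
conn.executemany(
    "INSERT INTO compounds (name, smiles, mol_weight) VALUES (?, ?, ?)",
    compounds,
)
conn.commit()

# Search: all compounds lighter than 70, lightest first.
cursor = conn.execute(
    "SELECT name, smiles FROM compounds WHERE mol_weight < ? ORDER BY mol_weight",
    (70.0,),
)
light = cursor.fetchall()
print(light)
conn.close()
```

Always pass values through placeholders instead of building the SQL string yourself. It avoids quoting mistakes and SQL injection.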

developing web applications

This can be a several-day course in its own right. Previously I taught this using TurboGears. I'm currently evaluating whether I should teach it with Django instead.

Software development best practices

Software development is more than sitting down and programming. How do you keep track of changes to the code over time? If you change code, how do you figure out if the change broke something? How do multiple people work together on the same code base? What are some of the common development traps that people can get stuck in? Why do you need to care about security?

Many of these are covered in Greg Wilson's Software Carpentry. I'll specifically talk about version control, project builds, testing with nose, and development practices like code reviews, YAGNI, and agile development.

XML

Despite the best efforts of some, there's still very little direct influence of XML in computational chemistry. We still use SD and SMILES files. But there is sometimes a need, especially when getting data from IT-oriented software developers. I'll show some examples of using ElementTree to read and extract fields from an XML file.
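
A minimal sketch of reading and extracting fields with ElementTree follows. The XML layout, element names, and records are hypothetical; files you get from another group will look different.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, for illustration only.
XML_TEXT = """\
<compounds>
  <compound id="C001">
    <name>ethanol</name>
    <smiles>CCO</smiles>
  </compound>
  <compound id="C002">
    <name>benzene</name>
    <smiles>c1ccccc1</smiles>
  </compound>
</compounds>
"""

root = ET.fromstring(XML_TEXT)  # use ET.parse(filename).getroot() for a file
rows = []
for compound in root.findall("compound"):
    # get() reads an attribute; findtext() reads the text of a child element.
    rows.append(
        (compound.get("id"), compound.findtext("name"), compound.findtext("smiles"))
    )

for row in rows:
    print("\t".join(row))
```

The loop turns each compound element into a plain tuple, which is usually the easiest form to pass on to the rest of a chemistry workflow.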

Working with COM on Windows

Python for Windows has excellent support for COM, which is how Windows programs can embed or control other Windows programs. Chemists use Excel a lot. It's not hard to write a Python program which opens Excel, creates a new spreadsheet, loads data into the spreadsheet, and creates a plot of the results. Not hard for experienced programmers, that is. I'll show a few examples you can build on and provide pointers to where you can get more in-depth information.

R and Python

R is a great software environment for statistical computing and generating plots. If you are building models or doing data mining then you should know about this project. R includes its own programming language and a number of high-quality analysis packages. It's not a general purpose language like Python and it doesn't include the diversity of modules that Python has.

RPy is an interface module which lets Python call R functions directly, including plotting. I've used it in a model calculation system which used OEChem and other tools to compute descriptor values and then passed them over to R to evaluate the model.

R and Python have different ways of doing things. RPy minimizes the difference, but it's still there. I'll describe some of the basic R data types, how to create them from Python, how to call the R functions through RPy, and how to understand the R documentation well enough to be able to call it from Python.



Copyright © 2001-2008 Dalke Scientific Software, LLC. All rights reserved.