Dalke Scientific Software: More science. Less time.

See Assignment #1 for the instructions on how to submit this assignment. The short version is to send me a tar or zip archive of a directory named "assignment5" with answers in the README file. You will also include at least one other file with your answers.

You will use the following SMILES data set (listing various drugs) to answer some of the questions:

N12CCC36C1CC(C(C2)=CCOC4CC5=O)C4C3N5c7ccccc76 Strychnine
c1ccccc1C(=O)OC2CC(N3C)CCC3C2C(=O)OC cocaine
COc1cc2c(ccnc2cc1)C(O)C4CC(CC3)C(C=C)CN34 quinine
OC(=O)C1CN(C)C2CC3=CCNc(ccc4)c3c4C2=C1 lysergic acid
CCN(CC)C(=O)C1CN(C)C2CC3=CNc(ccc4)c3c4C2=C1 LSD
C123C5C(O)C=CC2C(N(C)CC1)Cc(ccc4O)c3c4O5 morphine
C123C5C(OC(=O)C)C=CC2C(N(C)CC1)Cc(ccc4OC(=O)C)c3c4O5 heroin
c1ncccc1C1CCCN1C nicotine
CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12 caffeine
C1C(C)=C(C=CC(C)=CC=CC(C)=CCO)C(C)(C)C1 vitamin a

Part 1

Here is a list of fingerprint rules. The given bit is set (meaning it's True/1) if the structure:

  1. contains two or more oxygens
  2. has a ring of size 5
  3. contains elements besides C, N, O, S or H
  4. has only 1 ring
  5. has a linear subgraph of 10 or more non-hydrogen atoms
(NOTE: it may be that some of these cannot be done with a SMARTS pattern.)

Some people have asked about the 5th bit. I'm looking for 10 atoms which are bonded in a row, without branches and without returning to an atom already counted. That is, can you start at one atom and count out 10 atoms in a row without making a loop? The atoms may be in a cycle; it's the subgraph which cannot have a cycle.
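To make the "10 atoms in a row" idea concrete, here is a small sketch that searches a toy molecular graph for a path of distinct atoms. It is only an illustration of the definition, not the SMARTS pattern the question asks for. The function name has_simple_path and the two adjacency dictionaries are hypothetical, made up for this example.

```python
def has_simple_path(adjacency, length):
    """Return True if `length` distinct atoms can be visited in a row.

    `adjacency` maps each atom index to a list of its bonded neighbors.
    A path may run through ring atoms, but it never revisits an atom.
    """
    def extend(atom, visited):
        if len(visited) >= length:
            return True
        for neighbor in adjacency[atom]:
            if neighbor not in visited:
                if extend(neighbor, visited | {neighbor}):
                    return True
        return False

    return any(extend(start, {start}) for start in adjacency)


# Hypothetical toy graphs: a six-membered ring and a 10-atom chain.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

ring_has_6 = has_simple_path(ring, 6)      # all six ring atoms in a row
ring_has_7 = has_simple_path(ring, 7)      # would need to revisit an atom
chain_has_10 = has_simple_path(chain, 10)
```

In the ring, a path can cover all six atoms but cannot continue to a seventh without closing the loop, which is the situation the rule excludes.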

Questions:

(Hint for the 5th one: [*] matches any atom, and by default implicit hydrogens are never matched by a SMARTS pattern.)

What is the fingerprint for ...

The answers will look like "00101".
Use OpenEye's depict matcher or Daylight's depict matcher to help answer these questions.
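As an illustration of the answer format, the sketch below turns five True/False rule results into a string like "00101". It assumes that bit 1 is the leftmost character, which is not stated above, and the helper name to_bitstring is hypothetical.

```python
def to_bitstring(flags):
    """Join a sequence of booleans into a string of "0" and "1" characters.

    Assumption: the first rule corresponds to the leftmost character.
    """
    return "".join("1" if flag else "0" for flag in flags)


# Made-up rule results for illustration only: rules 3 and 5 match.
example = to_bitstring([False, False, True, False, True])
```

With these made-up results the string is "00101", matching the sample format.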

Part 2

Write a Python function named fp_count which takes two bitstrings, each represented as a string containing only the characters "0" and "1", and returns the bit counts a, b, c, and d using the Daylight definitions:

a is the count of bits on in object A but not in object B.
b is the count of bits on in object B but not in object A.
c is the count of the bits on in both object A and object B.
d is the count of the bits off in both object A and object B.
The return value will be the 4-tuple of (a, b, c, d). Assuming I didn't make a mistake in my code, your code must be able to pass this test:
for (s1, s2, a, b, c, d) in (
    ("0", "0", 0, 0, 0, 1),
    ("1", "0", 1, 0, 0, 0),
    ("0", "1", 0, 1, 0, 0),
    ("1", "1", 0, 0, 1, 0),
    ("01", "00", 1, 0, 0, 1),
    ("11", "00", 2, 0, 0, 0),
    ("00", "11", 0, 2, 0, 0),
    ("01", "11", 0, 1, 1, 0),
    ("1011001010101", "0101010011011", 4, 4, 3, 2),
    ):
    x = fp_count(s1, s2)
    if x != (a, b, c, d):
        raise AssertionError( (x, (a,b,c,d) ) )
Put the function definition in the file "fp_search.py" and include the above code as a test function which is called when run from the command-line. (Use the if __name__ == "__main__": technique.)
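For reference, one possible shape for such a function is sketched below. It walks the two strings in parallel and sorts each bit position into one of the four counts. The length check and the ValueError are assumptions of this sketch; the assignment does not say what should happen when the strings differ in length.

```python
def fp_count(fp1, fp2):
    """Return the Daylight-style counts (a, b, c, d) for two bitstrings."""
    # Assumption: bitstrings of different lengths are treated as an error.
    if len(fp1) != len(fp2):
        raise ValueError("bitstrings must have the same length")
    a = b = c = d = 0
    for bit1, bit2 in zip(fp1, fp2):
        if bit1 == "1" and bit2 == "1":
            c += 1      # on in both
        elif bit1 == "1":
            a += 1      # on in A only
        elif bit2 == "1":
            b += 1      # on in B only
        else:
            d += 1      # off in both
    return (a, b, c, d)


if __name__ == "__main__":
    # Quick self-check when run from the command line.
    assert fp_count("01", "11") == (0, 1, 1, 0)
```

The four counts always sum to the length of the bitstrings, which is a handy sanity check when writing further tests.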

Part 3

Define two new functions in fp_search.py, "tanimoto" which computes the Tanimoto measure and "yule" which computes the Yule measure. The two functions will take the values a, b, c and d as input. Your functions should be defined like this:


def tanimoto(a, b, c, d):
    ...

def yule(a, b, c, d):
    ...
Be sure to add test code for these functions.
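A minimal sketch of the two measures follows, assuming the usual Daylight formulas: Tanimoto = c / (a + b + c) and Yule = (c*d - a*b) / (c*d + a*b). Returning 0.0 when the denominator is zero is an assumption of this sketch, not something the assignment specifies; you may prefer to raise an exception instead.

```python
def tanimoto(a, b, c, d):
    """Tanimoto measure: c / (a + b + c). The d count is unused."""
    denominator = a + b + c
    if denominator == 0:
        return 0.0  # assumption: no bits set in either fingerprint
    return float(c) / denominator


def yule(a, b, c, d):
    """Yule measure: (c*d - a*b) / (c*d + a*b)."""
    denominator = c * d + a * b
    if denominator == 0:
        return 0.0  # assumption: undefined case reported as 0.0
    return float(c * d - a * b) / denominator


# Counts from the last fp_count test case: (4, 4, 3, 2).
t = tanimoto(4, 4, 3, 2)   # 3 / 11
y = yule(4, 4, 3, 2)       # (6 - 16) / (6 + 16)
```

Note that Tanimoto ranges from 0 to 1, while Yule ranges from -1 to 1, so test code should include at least one case with a negative Yule value.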

NOTE: By default in Python 2, if you divide an integer by an integer you'll get an integer, so 1/2 == 0. To make Python do what you expect, either convert one of the integers into a float (e.g., float(1)/2) or place the following at the top of your file to make all divisions in the file work as you expect.

from __future__ import division
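The short sketch below shows the float conversion in action without relying on the __future__ import. It also uses the // operator, which performs floor division regardless of which division mode is active. The variable names are made up for this example.

```python
# Converting one operand to float forces true division.
true_quotient = float(1) / 2

# The // operator always floors, with or without the __future__ import.
floor_quotient = 1 // 2

# The same conversion applied to a ratio of bit counts (made-up values).
c, total = 3, 11
ratio = float(c) / total
```

Here true_quotient is 0.5 and floor_quotient is 0, so a similarity computed with the float conversion keeps its fractional part.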

Question:
Given your bitstrings from Part 1, what is the Tanimoto similarity between:

Part 4

OpenEye has a molecular 2D similarity demo page based on the Mesa implementation of the MACCS keys. Use it to do similarity searches of the above SMILES in order to answer the following two questions:

Optional:

Optional:



Copyright © 2001-2020 Andrew Dalke Scientific AB