Indexing ChEMBL for chemistry search

[ previous | newer ] /home/writings/diary/archive/2014/11/28/indexing_chembl

Indexing ChEMBL for chemistry search

I've just downloaded the SD file for ChEMBL-19. I'm happy that after many years they have finally placed the record identifier in the title line of each record! It used to contain either a blank line, or occasionally text that was essentially junk.

Now I want to be able to work with the data in the file. More specifically, I want a simple cheminformatics system where I can:

Get the original SD record given its identifier.
Get the canonical SMILES for a given identifier.
Search for records similar to a given query.
Check if a given SMILES is in the database.

and I want to show how to use the new features in chemfp-1.2 to do this.

Parse records, find the SMILES, and add to the SQLite database

The easiest way is to set up two files. The first is a SQLite database containing a Compound table. The Compound will contain three columns: the record identifier, the original record (as a compressed string), and the canonical SMILES string. SQLite is a very widely used SQL database, and it's included as part of the standard Python distribution. Here's the SQL that I'll used to create the database schema.

CREATE TABLE Compounds (
   id TEXT PRIMARY KEY NOT NULL,
   record BLOB,
   smiles TEXT)

which when embedded in Python code looks like:

import sqlite3

database_filename = "chembl_19.sqlite3"
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE Compounds (
       id TEXT PRIMARY KEY NOT NULL,
       record BLOB,
       smiles TEXT)
       """)

I need some way to get the SD records for each one, so I can save it as a compressed blob in the database, and so I can create the canonical SMILES string for the record. I pointed out in my previous essay that chemfp includes a "read_sdf_ids_and_records" function in its text_toolkit submodule, which reads each record and also pulls out the identifier. The code for this, along with commented-out code to support older ChEMBL releases which only stored the identifier in the "chembl_id" tag, looks like:

from chemfp import text_toolkit

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases

reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")

for (id, record) in sdf_reader:
   # work with the identifier and SDF record

In practice I also want to track the current record number. I can use the enumerate() function for that, and I'll seed it so that the first record is 1.

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
   # work with the identifier and SDF record

You might wonder why I set recno to 0 before doing the iteration. This handles the case where the input file is empty. If that happens then recno will never be set, leaving it with whatever the old value was. If the value was never set, which is the case here, it would raise a NameError. Here's an example:

>>> for i, c in enumerate(""):
...   print i, c
... 
>>> print i
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'i' is not defined

I want to include the isomeric canonical SMILES string. Chemfp-1.2 has a common toolkit API to work with RDKit, OpenEye/OEChem, and Open Babel, so I'll simply use "toolkit" for respectively "rdkit_toolkit", "openeye_toolkit", or "openbabel_toolkit". The conversion is straight-forward; turn the SDF record into a toolkit molecule, then turn the toolkit molecule into the "smistring" (that's the chemfp format name for "isomeric canonical SMILES string").

toolkit_mol = toolkit.parse_molecule(record, "sdf")
smistring = toolkit.create_string(toolkit_mol, "smistring")

It's possible that the toolkit doesn't handle a given record, and unlikely but still perhaps possible that toolkit can't make a SMILES string for the molecule. If that happens, these functions will raise a chemfp.ParseError, which I can catch, ignore, and continue on to the next molecule.

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        continue
    .... work with the id, SDF record, and SMILES string ...

All that's left for the SQLite database is to compress the SDF record using the zlib library, and add the new row to the database.

    # Compress the record (result is about 1/3rd of the original string)
    compressed_record = zlib.compress(record)

    # Save into the database and write to the fingerprint file
    conn.execute("""
        INSERT INTO Compounds (id, record, smiles)
                    VALUES (?, ?, ?)""",
        (id, sqlite3.Binary(compressed_record), smistring))

Compression is useful because the uncompressed file takes up 2.8GB while the original compressed file takes 470 MB for a 6x compression ratio. Compressing each record independently isn't as effective, because the compressor can't share any of its analysis between records, but it still manages about a 3x compression ratio.

However, the compressed string is a byte string with NUL characters and non-ASCII data. The BLOB and the sqlite3.Binary() tell SQLite and Python's sqlite3 API to treat the data as a byte string and not as a Unicode string.

The last bit of code creates an index on the SMILES column (so I can see if a given SMILES exists or find records with the given SMILES) and commits the new data to the database and closes it.

conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")

conn.commit()
conn.close()

Generate fingerprint and save to the FPB fingerprint file

The other file I need is an FPB fingerprint file containing the fingerprints for similarity search. I'll use RDKit's hash fingerprints, and since I'm using RDKit to generate the fingerprints I might as well use RDKit as the toolkit to convert the SDF record into a SMILES string.

fingerprint_type = "RDKit-Fingerprint fpSize=2048"
fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit

I'll ask chemfp to open up a fingerprint writer. I want it to include the canonical fingerprint type string in the metadata, and for good record-keeping I'll also record the original source filename for the structures. The fingerprint type's get_metadata() has a helper function to create the correct metatdata object, which I can pass to the writer.

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
fingerprint_filename = "chembl_19.fpb"
metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)

Then for each toolkit molecule, generate the fingerprint and save the id and the fingerprint to the file.

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    # This part is shared with the SQLite builder
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        pass
     ...
    # Compute the fingerprint and save to the file
    fp = fptype.compute_fingerprint(toolkit_mol)

    fp_writer.write_fingerprint(id, fp)

Once finished, I'll close the file explicitly, rather than depend on the garbage collector to do it for me.

fp_writer.close()

Putting it all together

You've now seen the details, here's it is when I put it all together, along with some progress information and a summary report when it's complete.

import sys
import sqlite3
import zlib
import time

import chemfp
from chemfp import text_toolkit

# Configuration information

filename = "/Users/dalke/databases/chembl_19.sdf.gz"
id_tag = None # Use the title as the id
#id_tag = "chembl_id" # Uncomment for pre-ChEMBL-19 releases

fingerprint_type = "RDKit-Fingerprint fpSize=2048"

database_filename = "chembl_19.sqlite3"
fingerprint_filename = "chembl_19.fpb"

# Open the output database and create the schema

#import os  # Uncomment to delete the database before trying again
#os.unlink(database_filename)
conn = sqlite3.connect(database_filename)
cursor = conn.cursor()

cursor.execute("""
    CREATE TABLE Compounds (
       id TEXT PRIMARY KEY NOT NULL,
       record BLOB,
       smiles TEXT)
       """)

# Use the fingerprint type string to get the fingerprint type
# and the toolkit for that type

fptype = chemfp.get_fingerprint_type(fingerprint_type)
toolkit = fptype.toolkit

# Open the fingerprint writer, along with the correct metadata.
# (It handles the canonical fingerprint type string automatically.
# I give it a bit more information about the source.)

metadata = fptype.get_metadata(sources=[filename])
fp_writer = chemfp.open_fingerprint_writer(fingerprint_filename, metadata=metadata)

# Iterate through the file as a set of SDF records.

sdf_reader = text_toolkit.read_sdf_ids_and_records(filename, id_tag=id_tag, errors="ignore")
start_time = time.time()

recno = 0
for recno, (id, record) in enumerate(sdf_reader, 1):
    # Display some progress information
    if recno == 1 or recno % 1000 == 0:
        dt = time.time() - start_time
        sys.stderr.write("Processing record %d. Elapsed time: %.1f seconds\n" % (recno, dt))

    # Convert the record into a molecule and then get the SMILES and fingerprint.
    # Ignore any conversion errors.
    
    try:
        toolkit_mol = toolkit.parse_molecule(record, "sdf")
        smistring = toolkit.create_string(toolkit_mol, "smistring")
    except chemfp.ParseError:
        continue

    # Compute the fingerprint
    fp = fptype.compute_fingerprint(toolkit_mol)

    # Compress the record (result is about 1/3rd of the original string)
    compressed_record = zlib.compress(record)

    # Save into the database and write to the fingerprint file
    conn.execute("""
        INSERT INTO Compounds (id, record, smiles)
                    VALUES (?, ?, ?)""",
        (id, sqlite3.Binary(compressed_record), smistring))
    
    # Save the id and fingerprint to the output fingerprint file
    fp_writer.write_fingerprint(id, fp)

# How long did it take?
end_read_time = time.time()
dt = end_read_time - start_time
sys.stderr.write("Processed %d records in %.1f seconds. %.2f records/sec\n" %
                 (recno, dt, recno/dt))

# Create the index to be able to search on SMILES.
conn.execute("CREATE INDEX Compounds_by_smiles on Compounds (smiles)")
dt = time.time() - end_read_time
sys.stderr.write("Indexing took %.1f seconds.\n" % (dt,))

# Save everything
conn.commit()
conn.close()
fp_writer.close()

I ran the above code, and at the end it reports:

Processed 1404752 records in 7243.6 seconds. 193.93 records/sec
Indexing took 29.1 seconds.

When I profile the code, I can see that most of the time is spent creating the fingerprint. When I first ran this code I used an older version of RDKit, which took twice as long. I'm glad that RDKit has gotten faster over time.

Let's see if it works, first using the sqlite interactive shell:

% sqlite3 chembl_19.sqlite3 
SQLite version 3.6.12
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> select count(*) from Compounds;
1404532

(Oddly, it takes a long time to compute this count the first time I run it, but once done I can quit and restart sqlite and it's fast.)

How about some more interesting queries?

sqlite> select smiles from Compounds where id = "CHEMBL23456";
CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1
sqlite> select id, smiles from Compounds limit 5;
CHEMBL153534|Cc1cc(-c2csc(N=C(N)N)n2)cn1C
CHEMBL440060|CC[C@H](C)[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O)[C@@H](N)CCSC
)[C@@H](C)O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](Cc1c[nH]cn1)C(=O)N
[C@@H](CC(N)=O)C(=O)NCC(=O)N[C@@H](C)C(=O)N[C@@H](C)C(=O)N[C@@H](CCC(N)=O)C(=O)N
[C@@H](CC(C)C)C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](CCC(N)=O)
C(=O)N[C@@H](CC(C)C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)NCC(=O)N[C@@H](CCC(N)=O)C(=O)N[
C@@H](CC(C)C)C(=O)NCC(=O)N1CCC[C@H]1C(=O)N1CCC[C@H]1C(=O)NCC(=O)N[C@@H](CO)C(=O)
N[C@@H](CCCN=C(N)N)C(N)=O
CHEMBL440067|CC(C)[C@@H]1NC(=O)[C@@H](CCCCN)NC(=O)[C@H](Cc2c[nH]c3ccccc23)NC(=O)
[C@@H](Cc2c[nH]cn2)NC(=O)[C@H](NC(=O)[C@@H](N)Cc2ccc3ccccc3c2)CSSC[C@@H](C(=O)N[
C@@H](Cc2cccc(-c3ccccc3)c2)C(N)=O)NC1=O
CHEMBL440245|CCCC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CCCCN)NC(=O)[C@H](CCCN=C(N
)N)NC(=O)[C@H](CC(N)=O)NC(=O)[C@H](CO)NC(=O)[C@H](Cc1c[nH]cn1)NC(=O)[C@H](C)NC(=
O)[C@H](CCC(N)=O)NC(=O)[C@H](CCC(N)=O)NC(=O)[C@H](C)NC(=O)[C@H](CC(C)C)NC(=O)[C@
H](CCC(N)=O)NC(=O)[C@@H]1CCCCNC(=O)CC[C@H](NC(=O)[C@H](CC(C)C)NC(=O)[C@@H](NC(=O
)[C@H](CCC(=O)O)NC(=O)[C@H](CCCN=C(N)N)NC(=O)[C@H](CC(C)C)NC(=O)[C@H](CC(C)C)NC(
=O)[C@H](Cc2c[nH]cn2)NC(=O)[C@H](N)Cc2ccccc2)C(C)C)C(=O)N[C@@H](CCCC)C(=O)N[C@@H
](C)C(=O)N[C@@H](CCCN=C(N)N)C(=O)N[C@@H](C)C(=O)N1)C(=O)N[C@@H](CCC(=O)O)C(=O)N[
C@H](C(=O)N[C@H](C(=O)C(N)=O)[C@@H](C)CC)[C@@H](C)CC
CHEMBL440249|CC(C)C[C@@H]1NC(=O)CNC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H]([C@@H](C)O
)NC(=O)[C@H](c2ccc(O[C@H]3O[C@H](CO)[C@@H](O)[C@H](O)[C@@H]3O[C@H]3O[C@H](CO)[C@
@H](O)[C@H](O)[C@@H]3O)cc2)NC(=O)[C@@H](CCCN)NC(=O)[C@H](Cc2ccccc2)NC(=O)[C@H]([
C@@H](C)O)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@H](c2ccc(O)cc2)NC(=O)[C@@H](C(C)C)NC
(=O)[C@@H](CCCN)NC(=O)[C@@H](c2ccc(O)cc2)NC(=O)[C@@H](CNC(=O)[C@H](CC(N)=O)NC(=O
)Cc2cccc3ccccc32)[C@@H](C(N)=O)OC(=O)[C@H](c2ccc(O)c(Cl)c2)NC(=O)[C@@H](C)NC1=O
sqlite> select id, smiles from Compounds order by length(smiles) limit 5;
CHEMBL17564|C
CHEMBL1098659|O
CHEMBL1160819|N
CHEMBL1200739|S
CHEMBL1233550|I
sqlite> .quit

That last query takes a little while because the database isn't indexed by SMILES length.

Okay, now what about the FPB file? When know from the previous code that CHEMBL23456 has the SMILES CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1, so I'll search for its nearest 5 neighbors.

% time simsearch --query 'CNC(=O)c1[nH]c2ccc(Cl)cc2c1Sc1ccccc1' -k 5 chembl_19.fpb
#Simsearch/1
#num_bits=2048
#type=Tanimoto k=5 threshold=0.0
#software=chemfp/1.2b4
#targets=chembl_19.fpb
#query_sources=/Users/dalke/databases/chembl_19.sdf.gz
#target_sources=/Users/dalke/databases/chembl_19.sdf.gz
5	Query1	CHEMBL23456	1.00000	CHEMBL180330	0.94105	CHEMBL286759	0.93508
CHEMBL314618	0.93295	CHEMBL382002	0.92812
0.262u 0.139s 0:00.47 82.9%	0+0k 0+0io 0pf+0w

You can verify that CHEMBL23456 is in the output, with the expected similarity of 1.0.

Also, take a look at the timings. Because of the new binary format in chemfp-1.2, it takes less than 0.5 seconds to search 1.4 million fingerprints from ChEMBL, from the command-line. I'm rather proud of that.

Finally, I'll using Python's interactive shell to connect to the database and display the uncompressed SD record for CHEMBL286759.

>>> import sqlite3
>>> db = sqlite3.connect("chembl_19.sqlite3")
>>> c = db.cursor()
>>> c.execute('SELECT record FROM Compounds WHERE id =  "CHEMBL286759"')
<sqlite3.Cursor object at 0x100d690e0>
>>> record, = c.fetchone()
>>> record
<read-write buffer ptr 0x100c64280, size 385 at 0x100c64240>
>>> 
>>> import zlib
>>> print zlib.decompress(record)
CHEMBL286759
          11280714492D 1   1.00000     0.00000     0

 20 22  0     0  0            999 V2000
    6.7292   -2.0542    0.0000 C   0  0  0  0  0  0           0  0  0
    6.2417   -1.3875    0.0000 C   0  0  0  0  0  0           0  0  0
    6.2417   -2.7292    0.0000 N   0  0  0  0  0  0           0  0  0
    5.4542   -1.6417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.4542   -2.4750    0.0000 C   0  0  0  0  0  0           0  0  0
    7.5542   -2.0500    0.0000 C   0  0  0  0  0  0           0  0  0
    6.4917   -0.6042    0.0000 S   0  0  0  0  0  0           0  0  0
    4.7375   -1.2292    0.0000 C   0  0  0  0  0  0           0  0  0
    7.9625   -1.3292    0.0000 O   0  0  0  0  0  0           0  0  0
    4.7375   -2.8875    0.0000 C   0  0  0  0  0  0           0  0  0
    7.9667   -2.7625    0.0000 N   0  0  0  0  0  0           0  0  0
    4.0167   -1.6417    0.0000 C   0  0  0  0  0  0           0  0  0
    5.9417    0.0083    0.0000 C   0  0  0  0  0  0           0  0  0
    4.0167   -2.4750    0.0000 C   0  0  0  0  0  0           0  0  0
    3.3000   -1.2292    0.0000 Cl  0  0  0  0  0  0           0  0  0
    6.2000    0.7958    0.0000 C   0  0  0  0  0  0           0  0  0
    5.1417   -0.1667    0.0000 C   0  0  0  0  0  0           0  0  0
    4.5875    0.4458    0.0000 C   0  0  0  0  0  0           0  0  0
    5.6417    1.4083    0.0000 C   0  0  0  0  0  0           0  0  0
    4.8375    1.2333    0.0000 C   0  0  0  0  0  0           0  0  0
  2  1  2  0     0  0
  3  1  1  0     0  0
  4  2  1  0     0  0
  5  3  1  0     0  0
  6  1  1  0     0  0
  7  2  1  0     0  0
  8  4  1  0     0  0
  9  6  2  0     0  0
 10  5  1  0     0  0
 11  6  1  0     0  0
 12  8  2  0     0  0
 13  7  1  0     0  0
 14 10  2  0     0  0
 15 12  1  0     0  0
 16 13  2  0     0  0
 17 13  1  0     0  0
 18 17  2  0     0  0
 19 16  1  0     0  0
 20 18  1  0     0  0
  4  5  2  0     0  0
 14 12  1  0     0  0
 20 19  2  0     0  0
M  END
> <chembl_id>
CHEMBL286759

$$$$

Simple variations

You can easily see how to expand the database. You likely also want to store and index the non-isomeric canonical SMILES, which you can do by asking for the "cansmiles" format. You might also want to pre-compute and store molecular properites, like cLogP and molecular weight, along with a searchable index. It's very easy to make your own specialized database once you know how.

Of course, I really would like to integrate the SQL query with the fingerprint query, to search for things like "similar records with a given logP range." It's possible to hack these a bit, but what I would like to do some day is write a chemfp virtual table for SQLite - a sort of similarity cartridge - so they can be integrated. I just need funding. ;)

Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me