Optimizing substructure keys
Advertisement: This essay describes a research project I've slowly been working on during the last 18 months. I believe I can do very fast substructure searches of large compound data sets. I would like to make it more real, but that requires funding. Contact me if you want to collaborate and can pay for it.
There are two related goals in my quest for an inverted index library, which is unfortunate because what I want for one is not necessarily the best for the other. One goal is to see if I can do PubChem-scale substructure searches with "interactive" performance (under 0.1 seconds). I think I can do that with a lot of precomputation based on substructure enumeration and optimized key development. Based on the feedback from various people, I think this is best handled within an existing database, and most probably PostgreSQL.
The first goal requires developing new database substructure search methods. My other goal is to see if I can optimize substructure keys within the framework used in most existing substructure search programs.
The existing approaches use either a explicit substructure-based fingerprint like the MACCS or CACTVS/PubChem keys, or an exhaustive hash-based fingerprint like the Daylight fingerprints. The substructure-based ones are hand-optimized and small (under 1000 bits), but don't do well with unusual chemistry. The exhaustive ones can handle diverse chemistries, but are larger (typically 1024-4096 bits).
I want to see if I can use my subgraph enumeration approach to do target- and query-based key optimization. That is, given a set of structures in the database and a characteristic set of queries, I would like to develop short, dense fingerprints which can optimize those queries.
Problem: I lack a query set
I'm not there yet. The biggest problem is that I lack any sort of representative sample of the types of queries people do. People at pharmas don't want to contribute their data sets because it's hard to remove all of the proprietary queries from the dataset. By law, PubChem is not allowed to release this data.
Do you have a good (1,000+) query set that I can use for a benchmark? If yes, email me!
So until I get a good query set, I'll have to assume that the targets themselves are representative queries. This of course is incorrect. By definition, a substructure search will likely be smaller than what's in the database. Plus, if someone uses a sketch program to make the query, then they are going to build it up an atom or ring at a time, so the first dozen queries or so will likely be small and have lots of matches.
Optimizing the query keys
At its heart, a substructure query is a subgraph isomorphism test. This is slow. A substructure filter is a quick test which tries to reduce the number of required isomorphism tests. For example, if an "N" (nitrogen) exists in the query but does not exist in the target then there's no need to do the full isomorpism test.
However, if "N" does not exist in the query then it might still match a target molecule which has an "N" in it. For example, the query "O" should still match the target "N#[N+][O-]" (nitrous oxide).
So if the query feature is "1" then the target feature must be a "1", otherwise the isomorphism test is required. This is a simple decision tree; all compounds with featurei take one branch, and all which don't take the other. Since I'm assuming that the queries are sampled from the targets, then what I want is to end up with the set of features which minimize the size of the number of nodes at the end of the decision tree.
Last year I tried a genetic algorithm to find this, but it didn't converge well, and the results weren't pleasing. For example, if I decide that I want to add 32 bits to a fingerprint, then I have to redo the search from scratch and I end up with a completely different set of fingerprints.A greedy optimization algorithm
I came up with a greedy optimization instead. The base of the decision tree has all of the records in a single node. I find the "most divisive" feature, defined as the feature which best separates the tree into two branches, with about half of the records in one branch and half in the other. Each iteration finds the features which most evenly tries to split all of the nodes. I use the scoring function:
def score(current_leaves, feature): score = 0 for node in current_leaves: num_on = len(node & feature) num_off = len(node) - num_on score += (num_on-num_off)**2 return scoreThis does a lot of set intersection tests; about 30 million per iteration in my benchmark. Actually, since all I care about is the number of elements in the intersection, and not the set itself, I wish that Python had a "set.intersection_size()" method.
I use the scoring function to find the most divisive feature. This is the one which generates the smallest score. I use that feature to split all of the current leaves into "haves" and "have nots." As an optimization, I've decided that if there are fewer than 10 elements in a node then I don't care about further splitting that node. That's small enough that I can easily do a subgraph isomorphism test.
There are a few other optimizations. I ignore duplicate fingerprints, and I ignore features with fewer than 10 records in them. Set intersections are expensive, so I also do a short-circuit evaluation while computing the scoring function and abort the calculation if the current score is ever worse than the current best score. I also sort the nodes so the ones with the largest size go first. This increases chance of early short-circuiting.
I do some other optimizations, but you'll need to see the code for all the details. Yesterday I found that pypy 1.9's set intersection was about twice as fast as the one in cpython, so that's what I used to run the code.
Preliminary results
I used the same benchmark set (37 MB 7zip compressed) from yesterday's essay, which is a small subset of my full data set. From previous analysis, I know it has high local fingerprint diversity, so I figure it's a good enough start. Here's the subset of output containing the feature keys and the amount of time it took to find each one:
0 Cccc 5.5 seconds 1 CCO 3.6 seconds 2 CN 3.8 seconds 3 C=O 3.8 seconds 4 CCCC 5.1 seconds 5 cN 4.7 seconds 6 Cl 9.0 seconds 7 CC(C)C 10.2 seconds 8 n 8.2 seconds 9 C=C 10.5 seconds 10 cO 10.1 seconds 11 S 11.3 seconds 12 COC 11.7 seconds 13 cc(c)c 24.3 seconds 14 CC(C)O 24.1 seconds 15 O 24.5 seconds 16 CN(C)C 27.4 seconds 17 C1CCCCC1 41.4 seconds 18 CC 36.8 seconds 19 CCC(=O)O 44.1 seconds 20 F 55.2 seconds 21 CCCCCC 52.1 seconds 22 Br 48.2 seconds 23 CCC 52.3 seconds 24 C 51.1 seconds 25 C(CO)O 53.5 seconds 26 CccC 64.2 seconds 27 CCNCC 66.1 seconds 28 P 60.2 seconds 29 CCC(C)C=O 59.6 seconds 30 c1ccccc1 68.4 seconds 31 N 74.9 seconds 32 C1CCCC1 80.9 seconds 33 cCO 85.1 seconds 34 NO 96.5 seconds 35 C(CCO)CO 97.8 seconds 36 C(CN)N 96.4 seconds 37 cc(c)cc(c)c 99.7 seconds 38 CC(C)(C)C 99.2 seconds 39 I 102.6 seconds 40 CcccC 208.8 seconds 41 CCCCCCC 100.7 seconds 42 cc(c)nc 98.0 seconds 43 [Si] 93.7 seconds 44 CNC 95.3 seconds 45 CC(=O)C 94.5 seconds 46 c(cO)O 91.7 seconds 47 CCCCC 89.6 seconds 48 cCc 90.5 seconds 49 CC(C)(C)O 82.6 seconds 50 [Na] 82.6 seconds 51 o 84.3 seconds 52 [As] 85.6 seconds 53 NN 84.6 seconds 54 C(O)O 81.6 seconds 55 c(n)n 83.3 seconds 56 [Se] 80.4 seconds 57 CC=CC 80.1 seconds 58 COCCCOC 79.6 seconds 59 c(cCl)cCl 78.1 seconds 60 C=N 105.0 seconds 61 C#C 59.6 seconds 62 [Cr] 58.5 seconds 63 c(ccN)cN 58.1 seconds 64 CccccC 58.0 seconds 65 CC(C)C(C)C 54.9 seconds 66 ccc 52.7 seconds 67 B 53.3 seconds 68 CCCOCCC 55.1 seconds 69 CccO 51.0 seconds 70 [K] 48.7 seconds 71 CCN(CC)CC 48.6 seconds 72 CCCC=C(C)C 47.4 seconds 73 [Te] 47.8 seconds 74 cc=O 47.4 seconds 75 [Ti] 47.4 seconds 76 O=S(=O)O 47.7 seconds 77 CCC(CC)CO 46.7 seconds 78 [Fe] 46.8 seconds 79 c1ccncc1 46.7 seconds 80 CCOCCOC 52.5 seconds 81 c(ccN)ccN 35.2 seconds 82 [Al] 34.5 seconds 83 CC(C)N 34.2 seconds 84 [Sn] 32.4 seconds 85 c(cCl)Cl 32.4 seconds 86 CSC 31.9 seconds 87 C=CC=C 30.9 seconds 88 c(ccO)ccO 30.3 seconds 89 [Mo] 30.0 seconds 90 s 30.0 seconds 91 C(Cl)Cl 29.3 seconds 92 [Ca] 28.7 seconds 93 CccN 28.3 seconds 94 C(CO)CO 27.3 seconds 95 [W] 26.7 seconds 96 O(P)P 26.8 seconds 97 [Cu] 27.0 seconds 98 CCC(=O)CC 27.5 seconds 99 CCCNCCC 26.8 seconds 100 CO 28.0 seconds 101 [Pb] 19.6 seconds 102 CCOC(=O)C 19.6 seconds 103 [Zn] 18.8 seconds 104 N=N 18.6 seconds 105 [Mg] 18.2 seconds 106 CC(C)CC(C)C 18.0 seconds 107 [Hg] 17.7 seconds 108 C=CCCO 17.2 seconds 109 [V] 17.1 seconds 110 [H] 17.0 seconds 111 cc(c)c(c)c 17.1 seconds 112 c(ccO)cO 16.8 seconds 113 [Zr] 16.5 seconds 114 CC(=O)N 16.5 seconds 115 c(ccCl)cCl 15.9 seconds 116 cc(c)cN 15.8 seconds 117 [Mn] 15.6 seconds 118 COP(=O)OC 15.5 seconds 119 Ccnc 15.4 seconds 120 [Ni] 15.7 seconds 121 CCCCCC=O 12.4 seconds 122 C1CC1 12.1 seconds 123 cOc 11.9 seconds 124 [Co] 11.7 seconds 125 c1cccc1 11.7 seconds 126 [Sb] 11.5 seconds 127 [Pt] 11.5 seconds 128 C(N)N 11.4 seconds 129 O=S 11.0 seconds 130 [Nb] 10.8 seconds 131 C(CCO)CCO 10.8 seconds 132 [U] 10.6 seconds 133 c(cBr)cBr 10.7 seconds 134 CCc 10.6 seconds 135 [Ba] 10.2 seconds 136 CC(C)OC(C)C 10.0 seconds 137 C[Si](C)C 10.1 seconds 138 ccNcc 9.8 seconds 139 [Ru] 9.5 seconds 140 C=CCC=C 9.7 seconds 141 CN(C)(C)C 7.5 seconds 142 SS 7.3 seconds 143 C(F)(F)F 7.2 seconds 144 [Re] 7.0 seconds 145 ccnccn 6.9 seconds 146 CC(C)(C)CCO 6.8 seconds 147 [Cd] 6.7 seconds 148 CCC(C)CC 6.7 seconds 149 CCSCC 6.5 seconds 150 [Ge] 6.5 seconds 151 Cc(c)c(c)c 6.4 seconds 152 [Sr] 6.3 seconds 153 C(COCCO)O 6.2 seconds 154 Ccc(C)cC 6.0 seconds 155 c(c(cCl)Cl)Cl 6.0 seconds 156 OO 6.0 seconds 157 ccnc=O 5.9 seconds 158 C(CCl)Cl 5.8 seconds 159 [Pd] 5.7 seconds 160 c(cO)N 5.8 seconds 161 [Ag] 4.7 seconds 162 C1CCCCCC1 4.7 seconds 163 [Ta] 4.6 seconds 164 [Ce] 4.6 seconds 165 C(CC=O)C=O 4.5 seconds 166 cccCl 4.4 seconds 167 CCC(CC)O 4.3 seconds 168 [Gd] 4.2 seconds 169 [Li] 4.2 seconds 170 cc(c)c(c)O 4.2 seconds 171 C=CCO 4.1 seconds 172 CP 4.1 seconds 173 C#N 4.0 seconds 174 COcc(c)OC 3.9 seconds 175 [Bi] 3.8 seconds 176 c(cN)cS 3.8 seconds 177 [Y] 3.8 seconds 178 CC(C)(C)C=O 3.7 seconds 179 c(ccBr)cBr 3.7 seconds 180 [Yb] 3.8 seconds 181 [In] 3.1 seconds 182 CC(C)CO 3.1 seconds 183 Cc(c)cccO 3.0 seconds 184 COCC(O)O 3.0 seconds 185 [La] 2.9 seconds 186 c(cN)N 2.9 seconds 187 [Eu] 2.9 seconds 188 CCC(CC)CC 2.8 seconds 189 [Ga] 2.8 seconds 190 cCCc 2.8 seconds 191 COC(CO)CO 2.7 seconds 192 [Th] 2.7 seconds 193 C=C(C)C 2.7 seconds 194 c(cF)cF 2.6 seconds 195 C[Si]C 2.6 seconds 196 CncccnC 2.6 seconds 197 cc(c)nc(c)c 2.6 seconds 198 C(CNCCN)N 2.5 seconds 199 [Tl] 2.5 seconds 200 C=C(C)C(C)(C)C 2.5 seconds 201 [Ho] 2.2 seconds 202 [Au] 2.2 seconds 203 CC(C)S 2.2 seconds 204 [Nd] 2.1 seconds 205 cc(cCl)c(c)Cl 2.1 seconds 206 CCC(CC)OC 2.1 seconds 207 c(ccS)ccS 2.1 seconds 208 C(=O)OC=O 2.0 seconds 209 [Be] 2.0 seconds 210 CC(C)c 2.0 seconds 211 [Tb] 2.0 seconds 212 COc 1.9 seconds 213 CCCCO 1.9 seconds 214 c(cBr)Br 1.9 seconds 215 CcccccC 1.8 seconds 216 C(COP)COP 1.8 seconds 217 C(CBr)Br 1.8 seconds 218 [Rh] 1.7 seconds 219 CCCCNC=O 1.7 seconds 220 C(CO)N 1.7 seconds 221 O[Si]O[Si]O 1.5 seconds 222 COcccOC 1.5 seconds 223 CC(C)CC=O 1.4 seconds 224 Cn(c)c(c)c 1.4 seconds 225 [Sm] 1.4 seconds 226 CCC(CO)CO 1.4 seconds 227 C(Cl)(Cl)Cl 1.4 seconds 228 CS 1.3 seconds 229 [Cs] 1.3 seconds 230 cc(c)c(cCl)Cl 1.3 seconds 231 CC1CCC(C)C1 1.3 seconds 232 cc(c)c(c)F 1.2 seconds 233 C(S)S 1.2 seconds 234 cnn 1.2 seconds 235 O=P(O)O 1.1 seconds 236 CCOCOCC 1.1 seconds 237 O=[As]O 1.1 seconds 238 CccccN 1.1 seconds 239 [Er] 1.0 seconds 240 C1CCNCC1 1.0 seconds 241 C(=O)CO 0.7 seconds 242 CC(=C(C)C)C 0.6 seconds 243 [Tm] 0.6 seconds 244 cc1ccncc1 0.6 seconds 245 N=O 0.6 seconds 246 C[Si](C)(C)[Si] 0.6 seconds 247 O=POPOP=O 0.6 seconds 248 CCCOC(C)C 0.6 seconds 249 Ccc(c(C)c)O 0.6 seconds 250 C(CC=O)CC=O 0.6 seconds 251 c(cN=O)cN=O 0.6 seconds 252 c(c(cBr)Br)Br 0.6 seconds 253 C(=CCO)CO 0.6 seconds 254 CCCCCCN 0.5 seconds 255 [Dy] 0.5 seconds 256 c(cCl)OccCl 0.5 seconds 257 CC(C)(F)F 0.5 seconds 258 c(cI)cI 0.5 seconds 259 Cc(c)ccCO 0.5 seconds 260 CCCCC(=O)O 0.5 seconds 261 [Ir] 0.4 seconds 262 c(ccCl)cN 0.4 seconds 263 cc(c)cccs 0.4 seconds 264 cc(c)oc(c)c 0.4 seconds 265 cc(c)c(c)Cl 0.4 seconds 266 COCCC(O)O 0.4 seconds 267 [Pu] 0.4 seconds 268 CCCCOC=O 0.4 seconds 269 cc(c)c(c)S 0.3 seconds 270 c(cn)cncN 0.3 seconds 271 C=CCCC=C 0.3 seconds 272 [Rb] 0.3 seconds 273 CNCCCNC 0.3 seconds 274 C1CC2CCC1C2 0.3 seconds 275 C(=O)ccC(=O)O 0.3 seconds 276 CCCSCCC 0.3 seconds 277 C(C=O)C=O 0.3 seconds 278 CCcccCC 0.3 seconds 279 p 0.3 seconds 280 COc(c)cc=O 0.3 seconds 281 cc1cc(c)sc1 0.2 seconds 282 [Pr] 0.2 seconds 283 CNCO 0.2 seconds 284 CCCC(C)C 0.2 seconds 285 CC[Sn](CC)CC 0.2 seconds 286 C(=O)NC=O 0.2 seconds 287 CC(cc(c)c)O 0.2 seconds 288 c(ccSN)cN 0.2 seconds 289 Ccccc(c)c 0.2 seconds 290 C(C=O)CO 0.2 seconds 291 CCP(CC)CC 0.2 seconds 292 [Hf] 0.2 seconds 293 CN(C)CccO 0.1 seconds 294 O([Si])[Si]O[Si]O[Si] 0.1 seconds 295 CCCCCOC 0.1 seconds 296 CCcccOC 0.1 seconds 297 C1CCC1 0.1 seconds 298 CC(C(C)(C)C)O 0.1 seconds 299 cc1cncc1c 0.1 seconds 300 c1cccccc1 0.1 seconds 301 CCCNcc=O 0.1 seconds 302 CSccc(c)c 0.1 seconds 303 [Os] 0.1 seconds 304 CC(C[Se]C)N 0.1 seconds 305 CCC(CC)Cl 0.1 seconds 306 CCC=CCC 0.1 seconds 307 C(=O)C=O 0.1 seconds 308 CCccCC 0.1 seconds 309 c(cc=O)c=O 0.1 seconds 310 CC(c)c 0.1 seconds 311 CCCOC1CC1 0.1 seconds 312 O=Cl(=O)O 0.1 seconds 313 c(cN(=O)O)cO 0.1 seconds 314 C(CCl)O 0.1 seconds 315 C(C(CO)(O)O)O 0.1 seconds 316 CCC(CC)Br 0.1 seconds 317 cc(cBr)cBr 0.1 seconds 318 C(ccccO)O 0.1 seconds 319 CCCCCCO 0.1 seconds 320 CCCN(C)P 0.1 seconds 321 COCcc(c)c 0.0 seconds 322 C[Si](C)([Si])[Si] 0.0 seconds 323 Cc(c)cc(c)S 0.0 seconds 324 C=C(C)CC(C)c 0.0 seconds 325 CBr 0.0 seconds 326 C(CC(Cl)Cl)CCl 0.0 seconds 327 C(S(=O)=O)S(=O)=O 0.0 seconds 328 C1=CCCCC1 0.0 seconds 329 C1CNCCN1 0.0 seconds 330 C=Cc(c)c(c)c 0.0 seconds 331 CC(c)N(C)Cc 0.0 seconds 332 CCC(=O)OCC 0.0 seconds 333 CcncccF 0.0 seconds 334 O(P)POPOP 0.0 seconds 335 c(c(cF)F)N 0.0 seconds 336 cc(c)cccI 0.0 seconds 337 C(C=O)COC=O 0.0 seconds 338 CCC(CC)N 0.0 seconds 339 CCCC(=O)S 0.0 seconds 340 CC[Pb](CC)CC 0.0 seconds 341 COP(OC)OC 0.0 seconds 342 C[Si](C)(C)C[Si] 0.0 seconds 343 Ccc(C=O)cC 0.0 seconds 344 cc1ccocc1 0.0 seconds 345 ccscc(c)c 0.0 seconds 346 CC(C)(C)COC 0.0 seconds 347 cc(c)cccN 0.0 seconds 348 cc(c)cccO 0.0 seconds 349 C(CO)COC=O 0.0 seconds 350 COc(c)cOc 0.0 seconds 351 [as] 0.0 seconds 352 c(ccCl)ccCl 0.0 secondsThis says that 353 bits are all that's needed in order to put 202,632 unique fingerprints into bins with at most about 10 unique fingerprints in a given bin.
I think that's quite amazing given that most people use fingerprints which are at least twice and usually at least four times that length.
Also, it took 1.75 hours to generate this, and most of the time is spent doing len(set1 & set).
Is it useful?
I really wish I had a good query set to use for tuning - and another dataset to use for validation. I don't, but Greg Landrum made a synthetic one for testing the RDKit/PostgreSQL cartridge. Last December we worked on optimizing the fingerprint filters. I used an earlier version of this program to generate 54 query bits, which he fused with his hash-based fingerprint. Greg reported:
Tacking your most recent 54 bits, interpreted as SMILES, onto the end of a shorter version of my new fingerprint gives: [06:10:40] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 80.30 [06:10:40] INFO: screenout: 0.03, accuracy: 0.84 The original performance for the shorter fingerprint was: [05:04:23] INFO: FINISHED 50001 (41150823 total, 1693315 searched, 1118055 found) in 90.90 [05:04:23] INFO: screenout: 0.04, accuracy: 0.66 so those bits definitely help with the screenout. Putting the bits at the beginning of the fingerprint gets a bit more speed: [06:14:48] INFO: FINISHED 50001 (41150823 total, 1325699 searched, 1118055 found) in 71.09 [06:14:48] INFO: screenout: 0.03, accuracy: 0.84I think a 20% speedup is nothing to sneeze at.
I would still like an inverted index library
The greedy algorithm does a lot of set intersections, partitions sets into subsets, and removes old sets. I feel like this is not what databases are designed for, That's why I'm looking for an inverted index library which lets me manage boolean set operations myself. Then again I'm not a database person and I don't know if my belief is true.
I ran my greedy feature selection algorithm until it ran out of data, at only 353 bits. I want to have 1024 substructure bits. This means I need to feed my algorithm a lot more data. That's fine - my test set is only 1/200th of the data I have. However, my implementation running the test set takes up 2 GB of memory on my 12 GB desktop machine, and took 6300 seconds (1h45m) to run.
My concern is that a distributed memory implementation of my algorithm is more complex to write than a shared memory one. The Amazon high-memory instances have 68 GB of RAM. I estimate I need 400 GB with Python sets, and under 40 GB with a more compact bitset implemented in C. This seems eminently doable, which is why I'm looking to see if someone has already done the work.
Based on my searching, and the lack of pointers by others, the answer is "no."
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology. Need contract programming, help, or training? Contact me
Copyright © 2001-2020 Andrew Dalke Scientific AB