Unique fragments in PubChem
For reasons I'll get into later, I wanted to get an idea of the subgraph distribution of PubChem. That is, given my method for molecular subgraph enumeration, generate all subgraphs with up to 7 atoms and see how common each one is. More specifically, two atoms are the same type if they have the same element and aromaticity, as assigned by OEChem, and the bond types are "single-or-aromatic", double, and triple.
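Here is a minimal sketch of that typing scheme. It assumes each atom is already available as an element symbol plus an aromaticity flag, and each bond as an integer order plus an aromaticity flag. The function names are made up for illustration and are not OEChem calls, and writing aromatic atoms in lowercase is just the usual SMILES convention.

```python
def atom_class(symbol, is_aromatic):
    """Return the atom type: element symbol, lowercase if aromatic."""
    return symbol.lower() if is_aromatic else symbol


def bond_class(order, is_aromatic):
    """Collapse a bond into one of the three categories used here."""
    if is_aromatic or order == 1:
        return "single-or-aromatic"
    if order == 2:
        return "double"
    if order == 3:
        return "triple"
    # Anything else (e.g. a quadruple bond) is outside this sketch.
    raise ValueError("unsupported bond order: %r" % (order,))
```

With this scheme an aliphatic carbon and an aromatic carbon are different atom types ("C" and "c"), while a single bond and an aromatic bond fall into the same bond category.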
Last month I downloaded 2,138 sdf.gz files from PubChem and did structure perception with OpenEye's OEChem. Starting a couple of weeks ago, I used my subgraph enumeration algorithm to process 1,724 of them. For some reason, it stopped at that point. Since it took 7.5 days to process those files, and the data set is already a bit ungainly, I decided to leave the full analysis for another time and not to figure out what happened with the processing.
In the 1,724 files are 21,570,907 PubChem records and my enumeration found 1,925,185 unique substructures.
I kept track of the number of unique fragments per input file and the
running total number of unique fragments over all of the files.
You can see that 50% of the unique fragments are in the first 25% of the data files and essentially all are found in the first 50% of the files. (The number does increase after the 1000th file, but very slowly.) It's also interesting to see the internal structural diversity in the different files. I suspect there are some large regions made from contributed combinatorial libraries.
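The bookkeeping behind those two numbers can be sketched roughly like this. It is not the code I ran; it assumes the fragments for each file are already available as an iterable of canonical fragment strings, and the function name is made up.

```python
def track_unique_fragments(files):
    """files: iterable of (filename, iterable of fragment strings).

    Returns one row per file:
    (filename, unique in this file, new to the data set, running total).
    """
    seen = set()  # every fragment seen so far, across all files
    rows = []
    for filename, fragments in files:
        in_file = set(fragments)
        new = in_file - seen
        seen |= in_file
        rows.append((filename, len(in_file), len(new), len(seen)))
    return rows
```

The second column shows the diversity inside a single file, and the last column gives the running total over all of the files.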
The unique fragments which occur in the largest number of records are:

    21387437 C
    20195255 O
    19959057 c
    19892743 cc
    19755355 ccc
    19457485 cccc
    19270867 CC
    19015890 ccccc
    18599872 cccccc
    18488545 c1ccccc1
    18386628 N
    17672171 Cc
    17324074 Ccc
    17109361 CN
    16985355 Cccc
    16533358 C=O
    16522121 Ccccc
    15993406 Cc(c)c
    15759069 Cc(c)cc
    15508521 Cccccc

You shouldn't be surprised to see that carbon is found in 21,387,437 of the 21,570,907 structures.
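Note that these are counts of records, not of occurrences: a fragment which matches several times in one structure still counts once. Here's a minimal sketch of that kind of tally, assuming each record has already been reduced to a collection of fragment strings (the function names are hypothetical):

```python
from collections import Counter


def count_records_per_fragment(records):
    """records: iterable of iterables of fragment strings, one per record."""
    counts = Counter()
    for fragments in records:
        # set() so a fragment is counted at most once per record
        counts.update(set(fragments))
    return counts


def most_common_fragments(counts, n=20):
    """Return the n fragments found in the most records, as (fragment, count)."""
    return counts.most_common(n)
```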
I made a distribution plot of the fragments, where the horizontal axis
is rank order (C, then O, c, cc, and so on). I show it at a few different
scales in order to get a better understanding of the
distribution. It's quite obviously *not* a Zipf distribution.
The vertical axis is the count in millions. You can see that the 10,000th most common substructure is in a very small percentage of the structures; it's actually 0.5%.
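One quick way to see the mismatch with Zipf is to compare the top of the list against what a Zipf distribution would predict. This sketch assumes the simplest form, with exponent 1, where the count at rank r is the top count divided by r. The counts are the ten largest from the list of most common fragments; the function names are made up.

```python
# Record counts for the ten most common fragments (C, O, c, cc, ...).
top_counts = [21387437, 20195255, 19959057, 19892743, 19755355,
              19457485, 19270867, 19015890, 18599872, 18488545]


def zipf_expected(first_count, rank, exponent=1.0):
    """Count predicted at a given rank by a Zipf law (assumed exponent)."""
    return first_count / rank ** exponent


def compare_to_zipf(counts):
    """Return (rank, observed, expected, observed/expected) for each count."""
    first = counts[0]
    rows = []
    for rank, observed in enumerate(counts, start=1):
        expected = zipf_expected(first, rank)
        rows.append((rank, observed, expected, observed / expected))
    return rows


for rank, observed, expected, ratio in compare_to_zipf(top_counts):
    print("%2d %9d %12.1f %5.2f" % (rank, observed, expected, ratio))
```

Under that assumption the second-ranked fragment should be in about half as many records as the first, and the tenth in about a tenth as many. Instead the tenth is still in more than 85% as many records as the first, so the top of the curve is far flatter than Zipf.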
At the other end of the list, 478,278 fragments (24.8%) exist only once (like C#NF), 251,372 fragments (13.1%) exist twice (like B#[Cr]), and 132,574 fragments (6.89%) exist thrice. Here are the first 20 values as a table, where the first column is the number of times a fragment exists and the second is the number of fragments:

     1 478278   # In other words, 478,278 substructures exist only once in the data set
     2 251372
     3 132574
     4 100665
     5  67536
     6  57500
     7  42959
     8  37983
     9  31750
    10  28684
    11  24016
    12  23169
    13  18695
    14  17659
    15  15501
    16  14717
    17  13452
    18  12500
    19  11394
    20  11276

and in graphical form.
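That table is a "count of counts": for each number of records, how many fragments are found in exactly that many. A minimal sketch, assuming a dictionary which maps each fragment to its record count (the function names are made up), along with the arithmetic behind the percentages:

```python
from collections import Counter


def count_of_counts(fragment_counts):
    """Map each record count to the number of fragments having that count."""
    return Counter(fragment_counts.values())


def percent_of_fragments(n, total):
    """Express n fragments as a percentage of all unique fragments."""
    return 100.0 * n / total


# The percentages quoted above, out of 1,925,185 unique fragments.
TOTAL_UNIQUE = 1925185
for times, n in [(1, 478278), (2, 251372), (3, 132574)]:
    print(times, n, "%.2f%%" % percent_of_fragments(n, TOTAL_UNIQUE))
```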
Andrew Dalke is an independent consultant focusing on software development for computational chemistry and biology.
Copyright © 2001-2010 Dalke Scientific Software, LLC.