Molecular Coding
In this episode, recorded March 8, 2011, I talk with Brian Cole and Imran Haque about their experience with GPU computing. Brian works at OpenEye, whose GPU implementation of ROCS, FastROCS, is about 1,000 times faster than the CPU version. Imran Haque is at Stanford, where he implemented a GPU version of the LINGO algorithm and contributed a chapter to GPU Computing Gems: Emerald Edition.
Transcript begins:
A: Welcome to Molecular Coding: the podcast about the software side of cheminformatics. I started recording this series last November with two interviews at the German conference on cheminformatics in Goslar. Earlier this year, I recorded two more interviews at OpenEye's CUP user conference in Santa Fe, New Mexico. It took me until now to start editing and putting them online. For this podcast I talk with Imran Haque and Brian Cole about their experiences with GPU computing and their success at implementing the ROCS and Lingo algorithms on the GPU.
*music*
A: Welcome to my podcast, Molecular Coding. Came up with the name. Very happy with it. And we're now at the OpenEye CUP conference. 12th year of the CUP conferences in Santa Fe, New Mexico. I'm here with Imran and Brian and we're going to talk about GPUs and GPUs in computational chemistry. So, if you two could introduce yourselves. Brian?
B: Alright, so I'm Brian Cole. I work for OpenEye. I run the conference here.
I: I'm Imran Haque, a grad student at Vijay Pande's lab at Stanford. I'm working there on computational techniques for drug discovery.
A: So, I wanted to talk with you because both of you have been doing work with GPUs and making some of the algorithms for cheminformatics very fast. So, for instance, Brian, you've done work in making ROCS... I suppose you've done work in making ROCS fast.
B: Right, correct.
A: So, what is ROCS, first off?
B: So, ROCS is... Literally, what it stands for is Rapid Overlay of Chemical Structures. Basically, it's a 3D comparison method for taking two chemical structures and trying to overlay them, and then, what you're actually calculating is the volume of the overlap. You get a similarity measure out of it called Tanimoto. So, essentially, they use it for...it could be applied to many, many different things, but the most traditional is virtual screening, basically.
A: Alright.
B: I have an active and I want to find another active in my database that looks like it. So...
A: And so, how fast is ROCS?
B: So, for ROCS on a CPU, the typical number we like to use is about a thousand per second. It's a good way to think of it.
A: And FastROCS?
B: FastROCS on like a desktop machine, can easily go a million a second. Over a million a second.
A: So, a thousand times faster.
B: A thousand times faster.
A: What is it about the algorithm that makes it appropriate for GPUs? Why is this algorithm so much faster on a GPU?
B: A lot of it comes down to, basically: one, GPUs excel at single precision, and the algorithm doesn't actually need more precision than that, so that's one.
A: Mm, hm.
B: The other is that they excel at 3D operations; it's a 3D method. And a lot of it actually gets down to what GPUs are good at, and that's hiding memory latency. The inner loop of ROCS has always been bound by memory latency, or basically a random hit into memory, if you can think of it that way. So on CPUs you would always see these very high cache miss rates, and there was no way to get around that. GPUs offer a very nice way of basically just flooding the thing with as many threads as possible to try to hide that memory latency.
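To make that concrete, here is a minimal CUDA sketch of my own (not OpenEye's code): a kernel whose only real work is a data-dependent load, effectively the random hit into memory Brian describes. On a CPU this pattern shows up as cache misses; the GPU copes by keeping far more threads resident than it has cores and switching among them while loads are in flight.

    // Hypothetical example: each thread does a data-dependent
    // (effectively random, uncoalesced) load from a large table.
    __global__ void gather(const float *table, const int *idx,
                           float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = table[idx[i]];  // latency-bound random access
    }

    // Launched with many more threads than cores, so the scheduler can
    // hide each load's latency behind other warps:
    //   gather<<<(n + 255) / 256, 256>>>(d_table, d_idx, d_out, n);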
A: So, Imran, you've also been working on the FastROCS code. You also worked on other projects, too.
I: Yeah, that's right.
A: What else have you worked on?
I: So, I haven't actually worked on FastROCS. I developed my own version of a GPU-enabled, you know, shape overlay method, which we call "Paper." And it works slightly differently to FastROCS. But some of the other things I've been working on are accelerated versions of 2D chemical searching. So, for example, there's an algorithm called Lingo, which compares molecules by looking at overlapping substrings in their SMILES strings and comparing the intersections and differences between those sets of substrings.
A: So, when you start working on an algorithm, like the Lingo algorithm, do you start with the algorithm as it exists in C code and just port that directly over, or how does the translation of an algorithm go?
I: So, it depends a lot on the particular algorithm that you're using. So, um, for example, the algorithms used in shape overlay and in ROCS are much more straightforward, I think, to put on a GPU, because of things like memory latency and arithmetic intensity that are really well suited to a GPU. On the other hand, the algorithm that's canonical for CPU Lingo is actually very poorly suited for the GPU, because it's extremely branch heavy, or it requires a very large jump table to implement a finite state automaton.
A: That's the DFA solution for doing Lingo.
I: Exactly. And so, in order to get Lingo to run well on the GPU, I actually had to go back to, sort of, the definition of Lingo and figure out a completely different algorithm to implement it, to let it run efficiently.
B: That was something we had to do as well: basically, from the ground up, we had to look at every piece of the algorithm.
A: For doing the FastROCS.
B: For doing FastROCS, and it was basically... It wasn't just taking, you know, some old C code that we had lying around and copying and pasting it in. Literally every line of code was inspected and analyzed in terms of how it would run on the GPU. So a lot of it was basically just, uh, you know, going through every line of code and asking, "Okay, what would cause things like branch divergence?" 'Cause actually, one of the tricky bits of ROCS is that it's an optimizer. And by an optimizer's very nature, it's branchy.
A: Mm, hm.
B: You know, “Do I go left? Do I go right?” laughs
A: So do you solve this by... I mean, I know that in some cases when you do branching, you can, say, merge the branches so that arithmetically you're doing both branches but just using the proper result. How do you handle the branching problem? How do you solve it? How do you get around it?
B: The trick is you just try to keep your IF statements as small as possible. That's kind of one way of thinking about it.
A: So you cancel out a lot of IF statements?
B: Huh. It's... some of it's canceling out, and some of it's just kind of, "Okay, these IF statements are equivalent." But other parts of it are just like... so, for example, say you wanted to negate a vector, but you only wanted to negate the vector sometimes. You could go, "IF true, negate my entire vector," or you can always multiply the vector by some constant, and just set that constant to minus one when you want the negation. You're switching on what that constant is in a very... a much smaller IF statement in that block. So you can use lots of tricks like that to kind of cut down on branch divergence.
I: And actually, for really small statements like that the GPUs have hardware support to
B: for...
I: to...
B: for predicating it...
I: to deal with that.
B: So if it's small enough it doesn't really affect anything...
I: Right. There is no branch even though it looks like there is one in the code.
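Here is a minimal CUDA sketch (hypothetical code, not OpenEye's) of the vector-negation trick Brian describes, written both ways:

    // Branchy version: threads in a warp that disagree on 'flip'
    // diverge, and the warp executes both paths serially.
    __device__ void negate_branchy(float3 &v, bool flip) {
        if (flip) {
            v.x = -v.x;
            v.y = -v.y;
            v.z = -v.z;
        }
    }

    // Branch-free version: every thread always does the same multiply;
    // only the value of the constant differs. The tiny conditional that
    // picks the constant is small enough for the hardware to predicate,
    // as Imran notes above.
    __device__ void negate_by_constant(float3 &v, bool flip) {
        float sign = flip ? -1.0f : 1.0f;
        v.x *= sign;
        v.y *= sign;
        v.z *= sign;
    }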
A: So, if I wanted to get started doing GPU programming, I understand I can use my laptop and just work with that. Cause my laptop has this GPU in it...
B: Mm, hm.
I: Right.
A: What do I do to get started? What software do I use?
Laughing
B: This is where we're gonna differ.
I: I don't think so, actually. So, I mean, I think for people who are starting out with GPU programming, it probably makes sense to look at OpenCL as the language that you're going to use. So, there are essentially two major competing standards in the market right now for doing GPU programming. One of them is CUDA, which is a proprietary language from NVIDIA; the other is OpenCL. In terms of the code that actually runs on the GPU, they're fairly similar C-like languages with basically the same programming model. The code that you have on the host side to support it looks a little bit different. I think the key differences are really that CUDA runs only on NVIDIA, whereas OpenCL is sort of an open vendor standard. But as a consequence, there are sometimes performance differences and, you know, speed-of-feature-adoption differences between the two languages, so...
B: I thought he was gonna say "CUDA," so I'm gonna take the other route and say CUDA, actually.
Laughing
I: Just to play devil's advocate was it?
B: 'Cause if it's a hobbyist kind of person, and if it's just you know, you wanna experiment...
A: On a Mac?
B: On a Mac. And you're just a hobbyist and you just, you know, “I wanna do a little bit of GPU computing,” it would actually be quicker and easier to get up and running with CUDA.
A: Why's that?
B: Uh... simple things like just setting your kernel arguments in OpenCL: each one is an individual function call with an individual index.
I: Right.
B: And it can be quite a nightmare to make sure, "Okay, are all my arguments matched up?" CUDA takes care of that automatically by being a single-source compile.
I: So, what makes CUDA really nice is they actually have two different APIs to program in on the host side: one called the "runtime API" and one called the "driver API." And the runtime API makes things really simple. To call a function on the GPU, you basically, you know, add three angle brackets and it makes a function call; off it goes. Whereas OpenCL is basically equivalent to the CUDA driver API, which is a much bigger pain to use. Now, as far as getting started with GPU programming on a Mac, I'd actually say there's a bit of a caveat, in that the newest MacBook Pros are using AMD GPUs, so CUDA won't run on them. But at least for the last-generation ones, which used NVIDIA, CUDA is still a good solution.
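As a rough illustration of the difference (hypothetical kernel and variable names, not code from either project), here is the same launch in the CUDA runtime API and, in the comment, the equivalent OpenCL host calls:

    // CUDA runtime API: the launch reads like a function call, and the
    // argument list is checked by the compiler in a single-source build.
    __global__ void scale(float *data, int n, float factor) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= factor;
    }

    void launch_with_cuda(float *d_data, int n) {
        scale<<<(n + 255) / 256, 256>>>(d_data, n, 2.0f);
    }

    /* The OpenCL equivalent binds each argument by index with a
       separate call, which is easy to get out of sync with the kernel:

         clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_data);
         clSetKernelArg(kernel, 1, sizeof(int),    &n);
         clSetKernelArg(kernel, 2, sizeof(float),  &factor);
         clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                &global_size, &local_size, 0, NULL, NULL);
    */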
B: Okay, so with OpenCL, you could make it a lot easier by using some of the higher-level things that are out there, like PyOpenCL and the C++ bindings to OpenCL. Those will take care of a lot of the kluginess of trying to use the C API, which is actually the standard; the standard is defined as a C API. So if you use some of the higher-level stuff, like PyOpenCL, it takes care of some of that klugy argument setting and stuff like that for you. Plus it has a lot of other really nice high-level features in it. So, yeah, you can use it.
I: Yeah, and I'll throw in a plug for PyOpenCL and PyCUDA. If you're trying to do GPU stuff from Python, they're actually really good.
B: They're actually written by the same person.
Laughing
A: Okay, so now, pretend I have now done some GPU programming for a while, and I wanna make this work in my research group. How do I bring these in? Do I get a bunch of laptops with graphics cards, or what do I do?
B: It really depends on what you're trying to do. So, like, FastROCS was designed as a server. It's kind of this large, in-memory service that any client can hit. So, it's more along the lines of buying one of these machines with as many GPUs packed into it as possible.
A: And it can cool off.
B: And can cool off, and a lot of other neat things like that. And it can handle the power draw. Actually, some of these rack-mountable things require, like, 220-volt power, that kind of thing, 'cause they just draw a lot of power. So, if you're going for that sort of server-type architecture... for FastROCS, actually, you can get a machine that's up and running at two million conformers a second for less than 20 grand.
A: Okay.
B: You go to, like, a Supermicro reseller-type company and get, like, four Tesla GPUs in it for like 15 grand. Really, it depends on how much money you wanna spend, because if you don't wanna spend a lot of money, you can actually get up and going with gamer cards. Gamer cards are actually faster than the Tesla high-performance computing cards.
I: With a couple problems.
B: With a couple problems, but they are faster.
A: So you just take a couple of gamer cards and plug them in...
B: That's where part of the problem comes in: "a couple gamer cards." Like, if you only want one card in your system, it's pretty easy to get a gamer card in there. It's when you're scaling to multiple cards that...
I: Well, that's not really that big of a deal. So, my...
B: If you can get the machine that can do it...
I: So, we were looking for... So, to go back to your original question, "How do I get people working with them?": I think, you know, getting a couple of laptops with GPUs is an alright solution when you start playing with it, but the performance implications on a larger GPU are often quite different. Especially because laptop GPU architectures tend to lag a generation or two behind; sometimes they're slightly different.
B: And they're particularly low powered on purpose.
I: Yes.
B: To try to conserve power.
A: They've never been good gaming machines.
I: Also, the memory subsystems on laptop GPUs are usually far more limited. Particularly for memory-bound things like FastROCS, that'll make a big difference.
A: So, when you're working with GPUs, you could start off with a laptop, but you can't get good conclusive results until you've actually tested it on a real...
I: I think so.
B: You could easily see... I just ran this the other week: FastROCS on somebody's Quadro card in the office, and it was getting like one-tenth the performance. But the Quadro card wasn't designed for that sort of thing; it was designed just for visualization.
I: Right, and a lot of the visualization cards have a relatively low shader count, so they're perfectly adequate for graphics, but they're not really designed for really high-end compute stuff. For example, the first machine that I was programming on for GPU stuff had a low-end, like $60, GPU. When it became clear there really was a "there" there, we got some high-end GPUs and started pushing those. So, on what Brian was saying about gamer GPUs: the primary GPU desktop machine I use in the lab now actually has a pair of gamer GPUs, a GTX 480 and a GTX 260. And when we were buying the machine, we were trying to find, you know, just some online vendor where you could plug in... worried about a power supply big enough to deal with two GPUs. So we compared, you know, Dell workstations and HP workstations, and in the end it turned out the cheapest solution was just an Alienware gaming desktop.
B: Any reason you mixed the 260 and the 480? 'Cause, they're different generations...was that on purpose?
I: Yeah, well, no, it's because when we bought the machine, they were out of the 480, so I got one with a 260 and Vijay said he would buy me a 480 when it became available.
A: I don't know if you've looked at it... Amazon now offers GPU custom...
I: GPU nodes.
A: GPU nodes, yes. And it's $2.10 an hour for two NVIDIA Tesla Fermis. That's good, I assume?
I: I'm insulated from the costs, cause I work for an academic institution, so I don't really know.
A: 'Cause I'm thinking about the overhead if you're developing this for yourself in a group: you've got the "who maintains it?", the "who has the specialized knowledge?" of what kind of power, CPU... all the stuff you just talked about to get this working. And especially... when you do stuff with FastROCS, is that usually gonna be a few minutes of work, or a few hours of work, or what?
I: A typical query, in the PubChem demo that we had just last night, was something like 68 million conformers; that was 130 seconds to run.
B: But, that depends on having it all loaded up.
I: So, that depends on having it all loaded up, and that's where it becomes tricky. That's why you wouldn't want to go for a cloud solution like Amazon with FastROCS, because load time is everything! The idea is that it runs as a service, continuously, and to do that on Amazon you'd be, you know, paying constantly.
Laughing
A: I'm sure they'd love that!
B: For one-off screening it might make sense, but if you want to do large-scale clustering, where you're loading it once into memory and hitting it many, many times while you actually have a GPU [something] going, then it might be viable.
I: Right, as long as you can amortize that load time.
B: As long as you can amortize that load time. So, one of the chief things you have to learn when you start to do GPU computing is that it's all about data movement: between your main memory and your GPU memory, and between your GPU memory and the local caches on the actual chips in the GPU. You're constantly thinking about that data movement and this hierarchy of memory. That's the downside of the cloud - you have that network in between.
I: And if you're working on really large data sets, which I think are a lot of the reason people are interested in GPUs, then you really do start to worry about moving from disk just to system memory. I think that's where the load time in FastROCS, as well as some of the load time in Paper, comes from. We use slightly different data formats, but I still have to read this giant HDF5 array from disk and load it in.
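The "load once, query many times" pattern they are describing looks roughly like this (a hypothetical sketch using the CUDA runtime API, not FastROCS or Paper code): the expensive transfer happens once at start-up, and each query then runs against the copy already resident on the GPU.

    #include <cuda_runtime.h>

    // Paid once per service start-up: move the whole database from host
    // memory (itself read from disk) into GPU memory.
    float *load_database_to_gpu(const float *host_db, size_t n_floats) {
        float *d_db = NULL;
        cudaMalloc((void **)&d_db, n_floats * sizeof(float));
        cudaMemcpy(d_db, host_db, n_floats * sizeof(float),
                   cudaMemcpyHostToDevice);
        return d_db;
    }

    // Each query launches kernels against the resident copy, so a
    // long-running service amortizes the one-time load across many
    // queries; a pay-per-hour cloud node only pays off if it stays up.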
A: Of course if you have the algorithm and you want to test it out on a real machine, you go to Amazon and try it out there first.
I: Absolutely.
B: Oh yeah.
A: So how did you two ... these are the sort of skills not typically taught in computational chemistry. How did you get into this field?
B: I got into this field through a friend, actually. Basically, he offered me an internship at Wyeth. I had started doing Python programming in my teens, and he's like, "Oh, you know Python? Well, I've got a lot of Python that needs writing." I started that way.
A: Did you come in as a chemist?
B: I came in... actually, I was only one year into college. I hadn't even selected a major yet. It actually kind of helped me select a major, getting into this field: "Oh, I do like computers and I like science, so okay, I'll do that." But for how to learn GPU computing on your own, the best I can say is "read everything." Don't go, "I'm programming in OpenCL, so I shouldn't read the CUDA documentation," because there's a lot you can glean from the CUDA world and apply to OpenCL, and vice versa.
A: I saw there's the book "GPU Gems"
B: (pointing to Imran) And he's an author of a chapter.
A: Congratulations.
I: They were calling them volume 1 and volume 2, and now I think it's called the Emerald Edition. There's a couple of books called GPU Computing Gems, one out now and one out in about six months. The first volume is on methods in scientific computing, and I think the second is on programming libraries and more systems-software support stuff. Basically, the books are just a collection of case studies from different groups about applications they've managed to port successfully to the GPU, and then sort of the tricks they used to make them work. The chapter I wrote is about how we made Paper go fast for shape overlay, and how we made LINGO go fast by this algorithmic transformation, sort of exploring the different optimizations you make along the way and what kind of performance impact they have.
A: Nice. Is that book out already or is it soon?
I: The first one is supposedly out now, though we haven't gotten the copy we're supposed to get so I'm not sure.
B: I haven't read it yet since it isn't out, but do you actually walk through "okay, this was the first implementation and we transformed it this way?"
I: The chapter that I wrote... obviously there are several thousand lines of code, so it focuses on the main objective function kernel: what can we do with this? It gives you a listing of the critical few lines in there and asks, "Okay, what can we do? Why is this slow?" and walks through four different iterations, and shows that by the end of it we got something that performs 3x as fast. And then, what can we do with the different kernels on the GPU as far as minimizing the data movement that Brian was talking about? It turns out that because the data movement is so bad, there are actually a couple of steps of the algorithm that run really poorly on the GPU, like they really underutilize the GPU, but because you don't have to copy data back to the CPU, it's a huge win.
A: So I'll ask you the same question I asked Brian, which is how did you get into this field? You came into this field from chemistry and then you started doing computational work and then GPUs?
I: No. I actually came in from the electrical engineering and computer science side, so I was an EE in undergrad, and I did some undergrad research with a parallel computing group. The way that came about was actually long before GPUs were, sort of, flexibly programmable; this was like shader model 1.0 stuff. I talked to my advisor about doing some GPU computing project, and she said, "That's way too hard. Why don't you learn traditional supercomputers?" So I did that, and I did some EE stuff for a while. In undergrad I got interested in computational biology and that sort of stuff, so when I came to grad school I told my current advisor, "You know, I'm interested in all these sorts of things." Basically, the way I got into GPU computing was that I was running ROCS: "I want to do really big data sets and it's not fast enough. I heard GPUs are cool. I wonder if I can do something with this." That's just how it came about.
I: As far as how people can get into GPU computing, I think Brian is right: just read as many of the resources as you can. I think when both of us got started, a lot of them didn't exist, so it was a lot harder to figure out what you were doing. It took me like five or ten readings of the first few chapters of the CUDA programming guide before I understood the memory model it was talking about. Read the NVIDIA guide; I think the AMD docs have gotten a lot better as well. The CL spec is a spec, so it's sometimes a bit hard to understand; I think the respective vendor programming guides give you a better introduction. And just write code. The first version of Paper I wrote... I remember coming back to it a year afterwards and just finding things that were horrifying, after I had learned more about the GPU. But, you know, it still worked.
A: So what would be a good first project for someone in chemistry? ROCS or shape overlay?
B: I like the Tanimoto.
I: Which Tanimoto?
B: The bit fingerprint Tanimoto.
I: Yeah, the bit vector Tanimoto is not a bad one to go for.
B: It's trivial to get your first thing in there and you'll get a huge speedup. There's also this iterative thing you can do to really really try to squeeze performance out of it as you go.
I: I agree. It's a simple enough calculation and yet it encompasses a lot of the aspects of parallelism and bandwidth and things like that.
B: You'll learn a lot.
I: Yeah, it's self-contained.
A: You both know I have a blog where I write things every once in a while, and I have a few entries on the Tanimoto. So once I take my first step into doing some GPU programming, I'm going to try that one out, and then I'll see what you all did and see how much slower I am.
Laughing.
A: Alright, anything else to add? ... Sounds like a no.
I: If you are doing fingerprint Tanimoto, I think it's easier to go faster in CUDA because it has a population count intrinsic -
A: That's cheating!
I: - so in OpenCL you'll have to write your own, but that's a cute problem on its own.
A: I've done that, yes.
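For anyone who wants to try the suggested first project, here is a minimal CUDA sketch of the bit-vector Tanimoto (my own illustration, not the speakers' code), using the __popc population-count intrinsic Imran mentions. In OpenCL of that era you would substitute your own popcount.

    // Assumes 1024-bit fingerprints stored as 32 unsigned 32-bit words.
    #define WORDS_PER_FP 32

    // Query fingerprint, visible to all threads via constant memory.
    __constant__ unsigned int query[WORDS_PER_FP];

    // One thread scores one database fingerprint against the query.
    __global__ void tanimoto(const unsigned int *db, int n_fps,
                             float *scores) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_fps) return;
        const unsigned int *fp = db + (size_t)i * WORDS_PER_FP;
        int common = 0, a_bits = 0, b_bits = 0;
        for (int w = 0; w < WORDS_PER_FP; ++w) {
            common += __popc(query[w] & fp[w]);  // bits set in both
            a_bits += __popc(query[w]);
            b_bits += __popc(fp[w]);
        }
        // Tanimoto = |A & B| / (|A| + |B| - |A & B|)
        int denom = a_bits + b_bits - common;
        scores[i] = denom ? (float)common / (float)denom : 0.0f;
    }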
A: Alright, so thank you both very much for talking about GPUs and let's get back to the conference.
*music*
A: I'm Andrew Dalke. Thank you for listening to Molecular Coding, and thanks to Imran and Brian for being my guests. The recording took place on March 8th, 2011 during a break at CUP. The content of Molecular Coding and any accompanying show notes are licensed under a Creative Commons Attribution No Derivative Works 3.0 US license. The theme music was composed and performed by Andreas Steffen.
Transcript ends. Thanks to my wife, Sara Marie Dalke, for transcribing this episode.