Matrix Computations on the GPU in Clojure (in TFLOPS!)
First some explanations, the code is down there. (working midje tests on github)
EDIT: since version 0.6.0, Neanderthal supports all major GPUs: AMD, Nvidia, and even Intel!
So, you’ve installed Neanderthal and learned how to work with vectors and matrices on the CPU at much more than 10x than the speed of pure Java libs. It is fast, running at tens of GFLOPS, but you’ve read that nowadays the computations on GPUs are what all the hipster geeks do, so you wonder whether there is something to it. If only you could connect to CUBLAS, your algorithms would speed up thousandfold with almost no sweat, as in commercials…
I have news for you:
The bad: it is not that simple. Your algorithms probably suck on massively parallel architectures, and you’d need to learn quite a few new tricks to collect the benefits you see on NVIDIA and AMD websites.
The good: Neanderthal implements BLAS algorithms and abstracts most of that complexity for vector and matrix computations away, behind a friendly Clojure API.
TL/DR: Multiplication of large matrices can be more than 1000x faster with Neanderthal than with optimized pure Java libraries, 50x faster than Neanderthal native engine and many thousands of times faster than the nested looping code, even with primitives, that you’d write yourself.
What You Need to Know About GPU Programming Before We Begin
The most important thing when working with parallel accellerators (GPUs and other) is to be aware that a large part of your code will run on a physically separate device, which also has its own working memory. It can not access data from the main memory, such as your objects, arrays, primitives and such. It also can not run your arbitrary program code (Java, C, R). Even if/when it could (Aparapi) it sucks at it big time, because it is made of thousands of very dumb processors that are excellent at one thing only - raw computations - and poor at everything else, including logic.
So, a typical approach is that your program consists of two parts: host and device. Host is a typical Clojure (Java) program that do the usual things Clojure programs do: talks to the web, formats and extracts data, does the database dance, recursively compute factorials, and perform other logic-heavy tasks. Then, when it has collected the data and stuffed it in a big raw chunk of a bazillion primitive floating point numbers it sends it to the device memory and tells the device to run the kernel programs and compute the data. When the main program wants to see the results, it has to transfer them from the device memory to the main host memory.
Obviously, if the computations are not intensive and demanding, the whole host/device communication might eat all the performance benefits. That’s why you should not bother to send two numbers to the GPU to be computed in microseconds (‘thanks’ to the communication overhead) when even Java would compute that in a nanosecond; in the same way as you should not call a REST service each time you need to add two numbers.
So, the moral of this story is: avoid copying data to and from the device unless it is absolutely necessary. Even avoid the communication with the device unless it is absolutely necessary.
The typical workflow would be: prepare the input data in your Clojure program and send it to the device. Then, call many Neanderthal and ClojureCL functions that work with that data without transferring it back. Only transfer the final result.
You’ll still write your algorithms in Clojure as any normal Clojure code, you just have to be aware that the data is actually on the device, and that the Neanderthal functions you call run the kernels (supplied by Neanderthal) that are on the device.
Require the Right Namespaces and Set up the Context
Functions for creating the appropriate OpenCL (that means GPU but also other accellerators) vectors or matrices are in the
Import the appropriate namespaces:
core for computation functions,
native for ordinary CPU constructors and
opencl for the accelerated stuff. We also need to discover and set up the device, so we need
And, to be sure that this code is always in the proper working condition,
I’ll write a bunch of midje test facts and include them in the test suite.
Therefore, do not mind these
=>s - they’re not part of Neanderthal.
(ns uncomplicate.neanderthal.examples.guides.tutorial-opencl-test (:require [midje.sweet :refer [facts => truthy]] [criterium.core :refer [quick-bench with-progress-reporting]] [uncomplicate.commons.core :refer [with-release]] [uncomplicate.clojurecl.core :refer [with-default finish!]] [uncomplicate.neanderthal [core :refer [asum dot axpy! mv! mm! transfer! copy]] [native :refer [sv sge]] [opencl :refer [with-default-engine clv clge]]]))
To be able to communicate with the GPU, we need to connect to the device drivers via the ClojureCL library and create appropriate contexts and queues through which we can fine tune the computation executions. Neanderthal can also work with the default setting, which we’ll do here because we do not need all the ClojureCL knobs for the beginning.
So, we will wrap all code we work with into
with-default-engine. Our code will then be executed on the first available device on the first available OpenCL platform. On my machine, that would activate the most capable GPU using the AMD’s OpenCL drivers. Your setup may be different. If you have a pre-2.0 OpenCL platform (Nvidia, OS X), you’ll have to use
with-default-1 macro from the
(with-default (with-default-engine (facts "We'll write our GPU code here, but for now here is only the plain CPU stuff you recognize from the plain Neanderthal tutorial." (asum (sv 1 -2 3)) => 6.0)))
Do The Computations
Let’s see how to do the same computation on the GPU:
(with-default (with-default-engine (with-release [gpu-x (transfer! (sv 1 -2 3) (clv 3))] (facts "Create a vector on the device, write into it the data from the host vector and compute the sum of absolute values." (asum gpu-x))) => 6.0))
And that is all you need to begin. That sum has just been computed on your GPU!
I’ll make a cup of coffee and spice it with some chocolate milk so I can drink it while I am explaining what we have just done here. In the meantime, you can study that code, and probably figure it out yourself.
Making the drink…
Here are the important steps:
Create the 3-element empty vector on the GPU device with
clv. That is short for (with the default engine) ‘single precision floating point CL vector’. We are using single precision because affordable consumer-grade GPU devices offer amazing performance with floats, and are much slower (Radeons 290X - 8x, GeForce 980 - 32x) with doubles. Most of the time you do not even need double precision, but when you do, you need to shell more $$$ for the professional grade products such as FirePro and Tesla. Neanderthal supports doubles just the same.
Write the data from an ordinary vector
(sv 1 -2 3)to the GPU. That data needs to travel from Java to raw memory and from raw memory, over PCIe bus, to the GPU memory. That is a lot of steps, and Neanderthal does all these with only one physical copying, but anyway, that data needs to travel and it takes some precious time, so you should do this as rarely as possible.
Very important: hold the reference to that GPU vector and release it once you do not need it. Neanderthal can do that bookkeeping for you if you use ClojureCL’s
with-releasemacro which works just like Clojure’s let form, but keeps in mind to release the memory on the device when the code reaches the end or if any exception happens to be thrown. Neanderthal would work without this, but your GPU memory will fill up after some time, and refuse to work further.
The happy stuff: call Neanderthal core functions in the same way you’d do for the plain CPU Neanderthal vectors and matrices. Yes, it is that easy.
Measure the Performance
So, what speedups should you expect over native optimized CBLAS that is Neanderthal’s default? Let’s measure it. I’m running this on Intel i7 4790k CPU and AMD Radeon R9 290x GPU. Your hardware will give different numbers.
(with-default (with-default-engine (with-release [host-x (sv 1 -2 3) gpu-x (clv 1 -2 3)] (facts "Compare the speed of computing small vectors on CPU and GPU" (asum host-x) => 6.0 (println "CPU:") (with-progress-reporting (quick-bench (asum host-x))) (asum gpu-x) => 6.0 (println "GPU:") (with-progress-reporting (quick-bench (do (asum gpu-x) (finish!))))))))
When measuring very fast code, the
time function gives wildly imprecise results - thus we replace the calls to
time with calls for criterium
quick-bench and it shows much faster and precise measurements. Anyway, we can see that CPU is much faster: 28 nanoseconds vs many microseconds. This is because calling GPU takes some time (on the order of magnitude of 20-30 microseconds per kernel enqueue, depending on the device and/or engine), and when the device starts computing, most of its many computing units are idling because we have loaded it with only 3 numbers to compute, which it does instantly. When measuring GPU calls, we add the call to
finish! to make sure we wait for the computation to actually be completed. If we didn’t do this, we’d measure only the time it takes to tell the GPU to do the computation.
Let’s try again with more data!
(with-default (with-default-engine (let [cnt (long (Math/pow 2 20))] (with-release [host-x (sv (range cnt)) gpu-x (transfer! host-x (clv cnt))] (facts "Let's try with 2^20. That's more than a million." (asum host-x) => (float 5.49754798E11) (println "CPU:") (with-progress-reporting (quick-bench (asum host-x))) (asum gpu-x) => 5.497552896E11 (println "GPU:") (with-progress-reporting (quick-bench (do (asum gpu-x) (finish!)))))))))
On my machine, it’s almost a tie. Criterium reports 99 microseconds on the CPU vs 107 microseconds on the GPU with amd-gcn engine and 375 microseconds with the (default) clblast engine.
A million is still smallish, though. Let’s get serious. Let’s give a vector of 1GB (that’s 268 million entries) to both:
(with-default (with-default-engine ;; I had to change it to 2^28 because a recent update for my GPU driver caused ;; it to complain about insufficient memory, but this is probably a temporary issue. (let [cnt (long (Math/pow 2 28))] (with-release [host-x (sv (range cnt)) gpu-x (transfer! host-x (clv cnt))] (facts "Let's try with 2^28. That's 1GB, half the maximum that Java buffers can currently handle. Java 9 would hopefully increase that." ;; note the wrong result in the CPU vector. That's because single precision floats ;; are not precise enough for so many accumulations. In real life, ;; you must use doubles in such cases. (asum host-x) => (float 3.6780519E16) (println "CPU:") (with-progress-reporting (quick-bench (asum host-x))) ;; GPU engine uses doubles for this accumulation, so the result is more precise. (asum gpu-x) => 3.6028792723996672E16 (println "GPU:") (with-progress-reporting (quick-bench (do (asum gpu-x) (finish!)))))))))
CPU: 45 milliseconds GPU: 12 milliseconds
Underwhelming. Is that it? This GPU has 5632 GFLOPS, while the CPU has only 32 or so. That’s 176x more muscle! Should the difference be much bigger? The point is: we should keep that muscle busy, and we can not, because the computing units are still idling most of the time waiting data to be transferred from the device memory to the device registers. Sum is a so simple operation that the main constraint is memory throughput, not computing power.
(with-default (with-default-engine (let [cnt (long (Math/pow 2 28))] (with-release [host-x (sv (range cnt)) host-y (copy host-x) gpu-x (transfer! host-x (clv cnt)) gpu-y (copy gpu-x)] (facts "Let's try with a more parallel linear operation: adding two vectors. I'll set them to 1GB each because my GPU does not have enough memory to hold 4GB of data (it has 4GB total memory)." (axpy! 3 host-x host-y) => host-y (println "CPU:") (with-progress-reporting (quick-bench (axpy! 3 host-x host-y))) (axpy! 3 gpu-x gpu-y) => truthy (println "GPU:") (with-progress-reporting (quick-bench (do (axpy! 3 gpu-x gpu-y) (finish!)))))))))
CPU: 134 ms GPU: 16 ms
Not bad, but still less than 10x faster. Linear 1D operations are simply so easy on computation that GPU can not show it’s power. They are still useful, though. If your vector data is already on the GPU, where it participates in some complex computations that GPU shines at, then it is easier to compute it on the GPU than to transfer it back and forth to the CPU.
Let’s try with some BLAS 2 operation. Their quadratic complexity should matter. We’ll do a matrix - vector multiplication.
(with-default (with-default-engine (let [cnt 8192] (with-release [host-a (sge cnt cnt (range (* cnt cnt))) host-x (sv (range cnt)) host-y (copy host-x) gpu-a (transfer! host-a (clge cnt cnt)) gpu-x (transfer! host-x (clv cnt)) gpu-y (copy gpu-x)] (facts "Matrix-vector multiplication. Matrices of 8192x8192 (268 MB) are usually demanding enough." (mv! 3 host-a host-x 2 host-y) => host-y (println "CPU:") (with-progress-reporting (quick-bench (mv! 3 host-a host-x 2 host-y))) (mv! 3 gpu-a gpu-x 2 gpu-y) => gpu-y (println "GPU:") (with-progress-reporting (quick-bench (do (mv! 3 gpu-a gpu-x 2 gpu-y) (finish!)))))))))
CPU: 15.4 ms GPU: 1.13 ms
That’s a 15x win for the GPU. Nothing too much, but still ok. Let’s try matrix multiplication and see how that goes.
(with-default (with-default-engine (let [cnt 8192] (with-release [host-a (sge cnt cnt (range (* cnt cnt))) host-b (copy host-a) host-c (copy host-a) gpu-a (transfer! host-a (clge cnt cnt)) gpu-b (copy gpu-a) gpu-c (copy gpu-a)] (facts "Matrix-matrix multiplication. Matrices of 8192x8192 (268 MB) are usually demanding enough." (println "CPU:") (time (mm! 3 host-a host-b 2 host-c)) => host-c (mm! 3 gpu-a gpu-b 2 gpu-c) => gpu-c (finish!) (println "GPU:") (time (do (mm! 3 gpu-a gpu-b 2 gpu-c) (finish!))))))))
CPU: 17678 ms GPU: 293 ms
That’s almost 60x faster than the CPU! But, still, shouldn’t it be even faster? You’ve probably seen those benchmarks with 1000x speed improvements!
Let’s consider matrix multiplication. It is a complex operation - O(n^3), but at its core a very simple computation. For each few float multiplications and additions there is also a couple of memory readings and writings. Therefore, GPU wins hugely, but it still has unused computation power.
Thinking About the Results
Where do those 1000x numbers come from then? That depends on what you compare to.
This is a very important issue. Remember, here we’ve compared Neanderthal’s GPU speed to Neanderthal’s highly optimized native ATLAS BLAS engine, which is a speed demon in its own right! And we got a 60x speedup.
How does Neanderthal compare to pure Java? Check out the Neanderthal Benchmarks page. For 8192x8192 matrices, an optimized and decently fast pure Java library Vectorz (which is the core.matrix flagship implementation) working with primitives and optimizing cache use, needs 6.14 minutes to compute. That’s 368400 milliseconds. Neanderthal GPU is 1200x faster than that! And, there are several GPUs on the market that are considerably faster than my Radeon 290X.
Of course, if you try to write your own nested loops to compute these matrices, even pure Java libraries will run circles around your operations, and Neanderthal will be several thousands times faster, even when you write tight Java loops with primitives.
Matrix algebra is only a start. The real benefit is when you use Neanderthal as a gate and a foundation to write your own ClojureCL numerical computation kernels for your own number crunching algorithms. If they are computationally intensive enough and parallel, THEN you can hope for real thousandfold improvements.
Can You Run This On Your Own Machine?
At the time of writing of this text, Neanderthal builds its accelerator support on ClojureCL, which is a Clojure library for programming with OpenCL, which is in turn an open standard for GPU and accelerated computing. Thanks to Neanderthal’s pluggable architecture, BLAS engines optimized for any OpenCL-compatible architecture can be transparently added, and the code of this tutorial does not need to change at all.
EDIT: since version 0.6.0, Neanderthal supports all major GPUs: AMD, Nvidia, and even Intel!
I am running this tutorial with clblast-single engine on AMD Radeon R9 290X.
(facts "Are you ready to write number crunching Clojure code now?" :answer => truthy)
Tell Us What You Think!
Please take some time to tell us about your experience with the library and this site. Let us know what we should be explaining or is not clear enough. If you are willing to contribute improvements, even better!