slembcke 2011-11-15

Chipmunk Pro goes NEON, up to 40% faster!

So this last weekend I was bored. I could have played any number of video games (including my recently acquired copy of Wario Land 2 😀 ), but instead found myself reading instruction manuals for instructions (CPU instructions that is).

On a number of occasions I’ve found myself looking into ways to better optimize Chipmunk specifically for iOS devices. A few years ago I spent a bunch of time looking into the VFP (the vector math extensions) for the original iPhone and iPod friends. VFP was the SIMD (Single Instruction Multiple Data) instruction set supported by the ARM 6 hardware. The idea is that instead of adding up numbers one at a time, you put them into a vector and add (multiply, divide, etc) them up 2 or more at a time and be 2 or more times faster! Nice idea, not so great implementation. VFP’s big letdown was that it didn’t really make anything 2 or more times faster. It just sort of uses less instructions and is a little bit faster. The really big problem is that in order to get “a little bit faster” you have to write a lot of assembly code, and you have have to do it just right or it will end up being slower than the compiler generated code anyway. Generally people stuck to using VFP for implementing things that were straightforward and well ordered like matrix math or particle effects. Trying to use VFP to optimize Chipmunk seemed like a colossal waste of time.

Now enter the 3GS with it’s shiny new ARM 7 CPU. It has a fancy new vector unit called NEON. Unlike VFP, with NEON you can process a vector of two floats in the same amount of time as applying the same operation to a single float. Also, because NEON uses ordinary instructions without needing to set extra state in the CPU, the compiler provides “intrinsics”. Basically, these are functions that map more or less directly onto the instructions the CPU provides. That way you can let the compiler allocate registers for your variables and move your data into and out of memory, and pack and unpack vectors while you just write the math. There is still one problem though. You can’t just switch back and forth between vectors and scalars whenever it’s convenient. Simply replacing Chipmunks vector operators (cpvadd, cpvsub, etc) with NEON intrinsics (vadd_f32, vsub_f32, etc) works, but the performance is awful (I tried it…). The problem is, despite being a vector instruction set, it doesn’t cover a lot of vector algebra (dot or cross products, etc.) because it’s designed to be simple to implement in silicon. This basically leaves you to do some of the compiler’s work and figure out how to rewrite the code onto the limited vector instruction set.

I’ve actually made several attempts to speed Chipmunk up using NEON now. It’s sort of demoralizing to spend hours rewriting a chunk of code to be more complicated only to have it run slower. Ugh! The first time I simply tried replacing the individual vector functions (cpvadd, cpvmult, cpvdot, etc) and seeing what happened. It was a lot slower. The problem was that all the rest of the code didn’t use NEON vectors, and some operations (such as cpvdot or cpvcross) couldn’t be written to use NEON efficiently. So it ended up spending more time packing and unpacking vectors than it ever saved by processing two numbers at a time. Months later I tried rewriting only cpArbiterApplyImpulse() to be vectorized. In most Chipmunk programs, half or more of the CPU time of everything used by Chipmunk is spent in cpArbiterApplyImpulse() so it’s a good candidate for optimizations and has already been the target of a half dozen passes. After hours of tracking down new bugs that I had made, it just ran slightly slower again. Ugh… The problem was that I hadn’t gone all in. I only tried to vectorize the parts that were easy and still avoided things like dot products and all the fancier operations that I really needed to be able to use. So I was still doing a bunch of unnecessary packing/unpacking. The general trick to get around that is to do something like process two pixels at a time instead of trying to process one pixel twice as fast. Unfortunately, while it’s easy with pixels, I couldn’t really do that when processing contacts because it was just too complicated to set up in a sequential solver.

This time I decided I was going to do the whole function (or just give up). Several of the key vector operations that I needed didn’t map onto the instruction set and processing two contacts at once was too complicated. I needed to do something else and realized that I had a third option, why not just rearrange a bunch of math so that I can perform two unrelated dot products or cross products (or whatever) at the same time? When it’s not possible to pair two operations up, just continue processing as a vector anyway and throw away the unused component at the end instead of unpacking several vectors along the way. So I spent an evening turning the 30 lines of fairly readable code into 90 fairly unreadable ones. The best case scenario is that you can double the speed of a piece of code if it maps perfectly onto the vector instructions. Realistically I was expecting maybe a 10 or 15% speedup due to some inevitable packing/unpacking costs and a few places where I had to waste some of the calculations. Instead I got a 40% speedup. Not bad at all! Running the benchmark demos I have, I found that it was a 19% average speedup *overall*! The demos that had a lot of stacked objects were as much as 30-35% faster. The best part is that unlike multithreading, there is 0 overhead and it works on 95% of the iOS devices out there. You can leave it on all the time for any sort of simulation and it will always make things faster.

This is all very exciting and whatnot, but when will it see the light of day, right? Expect it to be available as part of Chipmunk Pro very soon when I release 6.0.3 (along with properly integrated AutoGeometry). You might also be glad to hear that I’ve made the Git repository for Chipmunk Pro to be readable by our customers. That way when I work on improvements like this or need to fix a bug, you can always pull and use the latest without having to wait for the next release. If you have bought Chipmunk Pro and want a free performance boost _right now_, all you have to do is pull from the repository, build it, and replace ChipmunkSpace with ChipmunkHastySpace (or cpHastySpace if you are using the C API).

To check out Chipmunk Pro from git run this, substituting in your own Chipmunk Pro login name and password.
git clone --recursive http://myLogin@chipmunk-physics.net/release/chipmunkPro/ChipmunkPro.git

From there you can just run the iphonestatic.command script (double click or run on the command line with ruby iphonestatic.command) and copy out the ChipmunkPro-iPhone directory into your project.