SSE SIMD programming is a fascinating subject, but also one that can be a bit difficult to approach. In this post I'm going to create a SIMD version of my RGB->CMYK algorithm and, in the process, show a bunch of handy tricks for working with SIMD.
This post deals with some of the problems and challenges we face when implementing SIMD code, paying close attention to intrinsics, basic SIMD code setup, and buffer type conversion.
The first aspect to consider when creating SIMD code is how you want to handle the actual coding process. We have two main options. The first is to code in raw assembly. This is a solid approach, but it means we, as coders, have to deal with stack variables, stack management, loop code, and several other advanced topics.
As an alternative we can use intrinsics: compiler-provided functions that map more or less directly to the underlying assembly instructions, and which allow us to stay in the native C++ or C coding environment.
For this post we’ll use Intrinsics.
The good news is that using intrinsics means simply including a header file in your code, as intrinsics are not a library but rather a function of the compiler. Which header file we include depends on the SSE version we need, which in turn depends on the target architecture/processor. In general, include the proper header as your code's needs demand:
#ifdef __MMX__
#include <mmintrin.h>
#endif

#ifdef __SSE__
#include <xmmintrin.h>
#endif

#ifdef __SSE2__
#include <emmintrin.h>
#endif

#ifdef __SSE3__
#include <pmmintrin.h>
#endif

#ifdef __SSSE3__
#include <tmmintrin.h>
#endif

#if defined (__SSE4_2__) || defined (__SSE4_1__)
#include <smmintrin.h>
#endif
On the most basic level, to use SSE2 intrinsics all we need to do is include the following line in our code file:

#include <emmintrin.h>
You can read more about basic Intel Intrinsics here.
One final point for Qt developers: to use these intrinsics we also need to give the compiler a hint by adding this to our .pro file:
QMAKE_CXXFLAGS += -msse2
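As a quick sanity check that the header and flag are wired up correctly, here is a minimal sketch (the function name is my own) that multiplies four floats in a single SSE instruction:

```cpp
#include <emmintrin.h>
#include <cassert>

// Multiply four packed single-precision floats: out[i] = a[i] * b[i].
// A minimal sketch to confirm the header and the -msse2 flag work.
void mul4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb)); // 4 multiplies in one instruction
}
```

If this compiles and links, the intrinsics environment is ready to go.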
The Instructions We’ll Use
The basic workings of SIMD computing are actually quite simple to describe. Instead of working on 1 value at a time, we group values together and apply the same calculation over all elements at once. How many elements depends on the variable type, but the values generally range from 2 to 16. This means in theory your code could see a 16x performance improvement if working with that smallest variable size.
In reality, performance gains tend to be more modest, but still very respectable. In our case we have the potential for a 4x speedup. As part of converting RGB to CMYK we must convert our input values into single-precision floats, which means we're limited to working on 4 values at a time.
A First Example
To get started, let's take a look at the following C++ code, which multiplies two arrays of 8 single-precision floats element by element:
// Raw C++ Method 1 - 144/152 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z3[8];

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    z3[i] = z1[i] * z2[i];
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << z3[i] << endl;
}
Now let’s compare that with an SSE Intrinsics Version:
// SIMD Method 1 - Using Pointers : 56/64 clocks
float a1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float a2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float a3[8];

__asm__ __volatile__("nop" :::);

__m128 *v_a1 = (__m128*)a1;
__m128 *v_a2 = (__m128*)a2;
__m128 *v_a3 = (__m128*)a3;

for (int i = 0; i < 2; i++) {
    *v_a3 = _mm_mul_ps(*v_a1, *v_a2);
    v_a1++;
    v_a2++;
    v_a3++;
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << a3[i] << endl;
}
There are a couple of things of note here. First is how we tie our C variables to the __m128 data items. In short, we do not operate on __m128s as we do normal C++ variables; instead we use them as holders of values, which we pass as operands to the various intrinsic functions rather than assigning to or referencing them directly.

This seems limiting until you see how the actual 'tying' operation is performed:
__m128 *v1 = (__m128*)a1;
In other words, we create a pointer of type __m128 and assign it the same memory address as the 'normal' a1 float array. The net effect is that, behind the scenes, the compiler intrinsics generate the proper code to transparently link the two together.
We can do this another way, though: without pointers, using explicit load instructions instead:
// SIMD Method 3 - Using Loads/Stores : 88/96 clocks
float b1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float b2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float b3[8];

__m128 v_b1, v_b2, v_b3;
int j = 0;

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 2; i++) {
    v_b1 = _mm_load_ps(b1 + j);
    v_b2 = _mm_load_ps(b2 + j);
    v_b3 = _mm_mul_ps(v_b1, v_b2);
    _mm_store_ps(b3 + j, v_b3);
    j += 4;
}

__asm__ __volatile__("nop" :::);
The big difference is how we load and save our buffers. In the first example we use pointers to alias the two arrays together, in the second we use explicit declarations by way of the _mm_load_ps and _mm_store_ps statements.
Before we talk about the performance trade-offs of each method, note that in both examples it’s key to understand how we increment the vectors to point to the next four elements.
In the pointer example we simply use the ++ operator. In assembly, the following is performed:
0x0000000100000b12 <+0389> addq $0x10,-0x48(%rbp)
That is, we add 16 to the base address of the arrays so that the next loop grabs the next four elements instead of the previous ones.
The load/store example instead confines the book-keeping to a single index variable (j), which is incremented with:
0x0000000100000b19 <+0500> addl $0x4,-0x1c(%rbp)
We then use j as an offset to the next batch of items from our source array:
v_b1 = _mm_load_ps(b1 + j);
Which produces the following (unoptimized) assembly (please note all assembly in this post uses AT&T syntax, where operands are ‘reversed’ compared to Intel syntax):
0x0000000100000a6b <+0326> mov       -0x1c(%rbp),%eax  ; base pointer move to eax
0x0000000100000a6e <+0329> cltq                        ; convert %eax double word to quad word
0x0000000100000a70 <+0331> shl       $0x2,%rax         ; left shift
0x0000000100000a74 <+0335> mov       %rax,%rdx         ; %rdx now contains quad word j value
0x0000000100000a77 <+0338> lea       -0xe0(%rbp),%rax
0x0000000100000a7e <+0345> add       %rdx,%rax         ; increment offset pointer with j
0x0000000100000a81 <+0348> mov       %rax,-0x50(%rbp)
0x0000000100000a85 <+0352> mov       -0x50(%rbp),%rax  ; align instructions (stall)
0x0000000100000a89 <+0356> movaps    (%rax),%xmm0      ; finally, move (aligned) floats to xmm0
That’s quite a few instructions, which leads us to:
Which method is better?
Setting aside the raw C++ method for a moment: as you can see from the comments at the top of each listing, a simple benchmark shows the pointer version to be faster by a good amount, in terms of both the lowest and the highest average clock count.
The main reason is that the second method, in assembly, employs several expensive conversion and instruction/data alignment operations for the 'separate/extra' j variable. This will be an important constraint moving forward, as our RGB->CMYK algorithm will suffer from not being able to use the faster pointer method directly. More on that later.
At this point we should mention something quite important: what about unrolling our raw C++ code?
// Raw C++ Method 2 - 1 x Unrolled : 80/88 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z3[8];

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i += 4) {
    z3[i]   = z1[i]   * z2[i];
    z3[i+1] = z1[i+1] * z2[i+1];
    z3[i+2] = z1[i+2] * z2[i+2];
    z3[i+3] = z1[i+3] * z2[i+3];
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << z3[i] << endl;
}
Note the clock times of 80/88. Our unrolled version is just as fast as the Load/Store SIMD one!
Why is this so? Simple: in unrolled mode GCC, even with no optimizations turned on (debugging output), is able to create fast code using simple floating-point movss and mulss instructions. This takes us to…
SIMD Lesson 1
SIMD can in fact be detrimental for smaller data sets, or for operations where we’re bound by the memory sub-system. Yes our pointer version is still a good deal faster, but that’s in part because it’s a very simple example.
Why is ‘being simple’ pertinent? One of the challenges of SIMD programming is that we should treat each vector as a single unit, not as autonomous agents. As soon as we violate this rule by extracting individual values performance degrades quickly. This means we must be very clever in how we devise our algorithms, something we’ll see first hand when we get into the RGB->CMYK conversion later in this post.
Of course, if the algorithm is already complex, SIMD may make it more so. The bottom line is that SIMD is only really useful if you have an algorithm that allows the same operation to be applied to an entire vector of elements at once. There are many areas where this does apply, and in those cases SIMD is a real firecracker, but not always. At a very minimum, in order to gain the advantages SIMD offers we need to pack 2 or more separate values into an xmm register, and this packing process is not free.
When you toss in the fact that vectorized code can be difficult to create and manage, it becomes obvious that SIMD'izing your code should not be the first optimization choice for most applications; choosing the right algorithm should get that distinction.
As a last point on code size and speed, consider that a fully unrolled version of our C++ code takes 32/40 cycles, or 16/24 if the array definitions are moved outside the benchmarked area. The best performance I could achieve with unrolled SIMD is 48 cycles, or 24/32 if, again, the __m128 and array items are moved outside the benchmark section.
Packed Values
It may be obvious by now, but the basic concept we need to master in SIMD code is the art of packing several scalar values together into a single xmm register and then applying our logic to this ‘packed’ bit string.
We are by no means limited to only dealing with the entire packed vector; one thing you’ll notice when working with intrinsics is a multitude of related instructions for dealing with packed and non-packed values.
Generally speaking, we’ll see intrinsics instructions taking the form of:
_mm_rsqrt_ps
_mm_rsqrt_ss
These differ only in the last two letters: the first operates on Packed Single-precision values, the second on a Single Scalar value.

As the names suggest, the _ps variants of an intrinsic operate on all values of the xmm register at once, while the _ss variants touch only the lowest value.
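To make the distinction concrete, here is a small sketch (the function names are my own, and I've used add rather than rsqrt so the lane-by-lane effect is exact and easy to verify): the packed add touches all four lanes, while the scalar add changes only the lowest lane and passes the upper three lanes of the first operand through unchanged.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Packed add: every lane of the result is a[i] + b[i].
void add_ps_demo(float out[4])
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes 0..3 hold 1,2,3,4
    __m128 b = _mm_set_ps1(10.0f);                 // 10 in every lane
    _mm_storeu_ps(out, _mm_add_ps(a, b));          // {11, 12, 13, 14}
}

// Scalar add: only lane 0 is a[0] + b[0]; lanes 1..3 copy a.
void add_ss_demo(float out[4])
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps1(10.0f);
    _mm_storeu_ps(out, _mm_add_ss(a, b));          // {11, 2, 3, 4}
}
```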
When we combine this knowledge with the shuffle intrinsics such as _mm_shuffle_ps, we now have a way of ordering and selecting individual values from our vectors.
This is another way of saying that this is how we introduce logic into the hammer-like approach of SSE processing. Sure, we're most efficient when hammering out several operations at once, but shuffles and _ss instructions give us a more fine-grained way of dealing with individual values.
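As a small shuffle sketch (the function name is my own): _MM_SHUFFLE lists source lane indices from highest destination lane to lowest, so _MM_SHUFFLE(0, 1, 2, 3) selects lanes 3, 2, 1, 0, a full reversal of the vector.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Reverse the four lanes of a vector with _mm_shuffle_ps.
// The two vector operands are the same here, so we are simply
// permuting one vector's lanes.
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}
```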
Real World SIMD Applications
Now that we have a good idea of the very basics of how to link C++ and intrinsic data types together, we can look at a more complex example.
Of course there are plenty of examples of SIMD on the Internet, from simple tutorials by hobbyists to professional write-ups from Intel. The point of this post is to show you the process I’ve taken to implement a SIMD solution for an already high-performance C++ code block, with an emphasis on discovery and basic concepts.
That said, we must keep in mind that there are important design decisions to be made, as well as potential pitfalls and trade-offs that could very well make writing SIMD code a poor choice for our optimization targets.
As a perfect case-in-point, consider one of the most common performance issues, the conversion of data from one type to another. To see why this is an issue, let’s take a look at the algorithm we want to vectorize:
uchar *bits = imin.bits();

qreal c_c, c_m, c_y, c_k;
int j = 0;

for (int i = 0; i < wt * ht; i++) {

    c_c = 1.0 - (bits[j+2] / qreal(255));
    c_m = 1.0 - (bits[j+1] / qreal(255));
    c_y = 1.0 - (bits[j] / qreal(255));

    c_k = qMin(c_c, qMin(c_m, c_y));

    if (!qFuzzyCompare(c_k, 1)) {
        c_c = (c_c - c_k) / (1.0 - c_k);
        c_m = (c_m - c_k) / (1.0 - c_k);
        c_y = (c_y - c_k) / (1.0 - c_k);
    }

    cmyk_temp[j]     = c_c * 255;
    cmyk_temp[j + 1] = c_m * 255;
    cmyk_temp[j + 2] = c_y * 255;
    cmyk_temp[j + 3] = c_k * 255;

    j += 4;
}
The code follows the following form:
We read from an input buffer consisting of blocks of four unsigned chars that will always be between 1 and 255. We take the first 3 values of each block to create the base cyan, magenta, and yellow color channels by dividing each value by 255 and subtracting the result from 1.

From the standpoint of the color values being created, this has the effect of constraining our values between 0 and 1. For example, for a pixel containing 100% cyan and no other color, the JPEG file may encode this as:
255, 255, 1, 255 (the last 255 is the alpha channel, which is always 255 for our images and is ignored)
This means the lower numbers become quite small, and the higher numbers less than or equal to 1:
c = 1 / 255 = 0.003921568627451
m = 255 / 255 = 1
y = 255 / 255 = 1
Now we subtract each value from 1, but it's this step, combined with the division, which causes a big problem. Notice the format of our potential result:

1 - 0.003921568627451 = 0.996078431372549

We start with unsigned chars (small integers, really), but our conversion calculation creates intermediate fractional floats. This is the reason we're limited to 4 values at a time. In theory we could do 4 pixels (16 values) at a time if everything stayed as unsigned chars.
This just isn't possible though, so not only are we limited to doing 1 pixel at a time, we're further penalized by having to convert everything into a float in the first place. This is not our choice, but a fact of life: Qt's image routines only return unsigned chars.
As we saw above, using the pointer method to link a source buffer and an __m128 vector produces the fastest code. In essence, we use intrinsics to say, "this value exists as a float array and as an __m128; go ahead and use whichever interpretation you need to create our final output buffer values".
This is convenient because, as high-level coders, we do not even need to think about any conversion process; the compiler simply does it for us if required. The problem is this only works well when our source and intermediate values are of the same type. While on my x86-64 machine floats and unsigned ints are both 4 bytes, we cannot treat unsigned ints as floats and expect to get accurate results back.
This fact gets at a very important aspect of type conversions, which is that unless we know how they work, we can easily produce faulty code.
Consider:
unsigned int a = 8;
unsigned int b = 9;
unsigned int c = a / b;
cout << c << endl;
The output from this code: 0.
The reason becomes apparent when we see the assembly code:
0x0000000100000b37 <+0118> movl   $0x8,-0x1c(%rbp)  ; move a
0x0000000100000b3e <+0125> movl   $0x9,-0x20(%rbp)  ; move b
0x0000000100000b45 <+0132> mov    -0x1c(%rbp),%eax  ; move a to eax
0x0000000100000b48 <+0135> mov    $0x0,%edx         ; push 0 to edx
0x0000000100000b4d <+0140> divl   -0x20(%rbp)       ; divide eax with b
0x0000000100000b50 <+0143> mov    %eax,-0x24(%rbp)  ; push result to stack
8 is moved into %eax, with 9 being referenced as divl's second operand via the stack at -0x20(%rbp). The result is then placed into %eax and pushed back onto the stack at -0x24.
The divl instruction returns any remainder in the %edx register, but as we can see, only %eax is pushed back to memory. The end result, then, is that if our division produces a whole integer we'll get a correct value back. If it doesn't, we'll get at best a truncated value. Of course this is a result that's totally unacceptable for our algorithm.
The fact is that floats are represented completely differently from ints in our CPUs, which in turn means that if we expect an accurate result back from a calculation that produces a fractional value, we must use floating point values, or casts, for at least one of the input operands.
This means we could not, for example, change just the destination parameter to a float and expect a proper result:
unsigned int a = 10;
unsigned int b = 7;
float c = a / b;
cout << c << endl;
Despite using a float as the destination, the value we receive back is still 1.
The pertinent question of course is: why?
Before we answer that question, just know that this is the reason we cannot simply use pointers to alias our integer buffer and the __m128 containers as preferred. We need to employ the explicit load method to correctly convert our unsigned chars to floats and back. The downside, of course, is that we already know this conversion will produce slower code, sometimes significantly so.
So why does at least one input variable need to be a float? To help answer this, let's take a look at the GCC assembly this code produces:
unsigned int a = 10;
unsigned int b = 3;
float c = a / b;
cout << c << endl;
0x0000000100000af3 <+0118> movl   $0xa,-0x1c(%rbp)  ; a var
0x0000000100000afa <+0125> movl   $0x3,-0x20(%rbp)  ; b var
0x0000000100000b01 <+0132> mov    -0x1c(%rbp),%eax  ; move a into eax
0x0000000100000b04 <+0135> mov    $0x0,%edx         ; place 0 into edx (clears the remainder register)
0x0000000100000b09 <+0140> divl   -0x20(%rbp)       ; divide eax with b
0x0000000100000b0c <+0143> mov    %eax,%eax         ; aligned nop
The key point is that at no point during our division task do we use floating point registers, which means the integer workaround for handling remainders is all we have. In short, we will never get a floating point result back, as the two source values going into c are not floating point numbers; we end up using a form of division that cannot produce a proper floating point result.
However, you’ll notice c is still defined as a floating point variable, which is then used in the cout call.
Here’s the kicker: as we pass cout a float, the compiler, without being asked, ends up converting the int to a float with the following assembly code which is found below our division code:
0x0000000100000b37 <+0186> cvtsi2ssq %rax,%xmm0
0x0000000100000b3c <+0191> movaps    %xmm0,%xmm1
0x0000000100000b3f <+0194> addss     %xmm0,%xmm1
0x0000000100000b43 <+0198> movss     %xmm1,-0x6c(%rbp)
0x0000000100000b48 <+0203> movss     -0x6c(%rbp),%xmm0
0x0000000100000b4d <+0208> movss     %xmm0,-0x24(%rbp)
0x0000000100000b52 <+0213> movss     -0x24(%rbp),%xmm0
0x0000000100000b57 <+0218> mov       0x4d2(%rip),%rdi  # 0x100001030
0x0000000100000b5e <+0225> callq     0x100000d28 <dyld_stub__ZNSolsEf>
Just above this code block we switch from using %eax to %rax, which, as you may recall, is the full 64 bits of the ax register. The point of this switch is, as we can see in the first line of assembly, to convert (via cvtsi2ss) and push this value into the %xmm0 register, which means we're now in floating point mode. Of course, as the original division operation produced an int result, converting to a float at this point is more or less pointless, since the input to this code block is always an int.
This conversion step is key, as it takes us to heart of why casting can be an expensive operation to undertake.
Just to drive this point home from the low-level perspective, let’s say we update our code to use float casts:
unsigned int a = 10;
unsigned int b = 3;
float c = (float)a / (float)b;
cout << c << endl;
Here’s what the commented assembly looks like:
0x0000000100000abb <+0118> movl      $0xa,-0x1c(%rbp)  ; a
0x0000000100000ac2 <+0125> movl      $0x3,-0x20(%rbp)  ; b
0x0000000100000ac9 <+0132> mov       -0x1c(%rbp),%eax
0x0000000100000acc <+0135> mov       %rax,-0x70(%rbp)
0x0000000100000ad0 <+0139> cmpq      $0x0,-0x70(%rbp)
0x0000000100000ad5 <+0144> js        0x100000ae4 <_Z6testerPxS_+159>
0x0000000100000ad7 <+0146> cvtsi2ssq -0x70(%rbp),%xmm0 ; convert a to a float
0x0000000100000add <+0152> movss     %xmm0,-0x68(%rbp)
0x0000000100000ae2 <+0157> jmp       0x100000b06 <_Z6testerPxS_+193>
0x0000000100000ae4 <+0159> mov       -0x70(%rbp),%rax
0x0000000100000ae8 <+0163> shr       %rax
0x0000000100000aeb <+0166> mov       -0x70(%rbp),%rdx
0x0000000100000aef <+0170> and       $0x1,%edx
0x0000000100000af2 <+0173> or        %rdx,%rax
0x0000000100000af5 <+0176> cvtsi2ssq %rax,%xmm0
0x0000000100000afa <+0181> movaps    %xmm0,%xmm1
0x0000000100000afd <+0184> addss     %xmm0,%xmm1
0x0000000100000b01 <+0188> movss     %xmm1,-0x68(%rbp)
0x0000000100000b06 <+0193> mov       -0x20(%rbp),%eax  ; move b to eax
0x0000000100000b09 <+0196> mov       %rax,-0x78(%rbp)  ; move 64 bits to stack
0x0000000100000b0d <+0200> cmpq      $0x0,-0x78(%rbp)  ; is 0?
0x0000000100000b12 <+0205> js        0x100000b21 <_Z6testerPxS_+220> ; jmp if signed
0x0000000100000b14 <+0207> cvtsi2ssq -0x78(%rbp),%xmm0 ; if not signed, convert b to float
0x0000000100000b1a <+0213> movss     %xmm0,-0x64(%rbp)
0x0000000100000b1f <+0218> jmp       0x100000b43 <_Z6testerPxS_+254>
0x0000000100000b21 <+0220> mov       -0x78(%rbp),%rax
0x0000000100000b25 <+0224> shr       %rax
0x0000000100000b28 <+0227> mov       -0x78(%rbp),%rdx
0x0000000100000b2c <+0231> and       $0x1,%edx
0x0000000100000b2f <+0234> or        %rdx,%rax
0x0000000100000b32 <+0237> cvtsi2ssq %rax,%xmm0
0x0000000100000b37 <+0242> movaps    %xmm0,%xmm1
0x0000000100000b3a <+0245> addss     %xmm0,%xmm1
0x0000000100000b3e <+0249> movss     %xmm1,-0x64(%rbp)
0x0000000100000b43 <+0254> movss     -0x68(%rbp),%xmm0
0x0000000100000b48 <+0259> divss     -0x64(%rbp),%xmm0 ; divide a and b
0x0000000100000b4d <+0264> movss     %xmm0,-0x24(%rbp)
0x0000000100000b52 <+0269> movss     -0x24(%rbp),%xmm0
0x0000000100000b57 <+0274> mov       0x4d2(%rip),%rdi
The end result of having to convert our parameters to floats: the clock cycle count is 40 with the conversions and 24 without. Clearly conversions have their cost, but in our case, what choice do we have? For our RGB to CMYK conversion process it is an unfortunate necessity: we get an array of uchars, but the calculations produce floats. The conversion is simply something we have to deal with.
As for our original question of why having one float parameter allows for a correct answer, the short version is: GCC cannot divide a float by an int. It must first convert the non-float operands to floats, as seen below when the following C++ code is fed to GCC:
float a = 10;
unsigned int b = 3;
float c = a / b;
Commented assembly code:
0x0000000100000af7 <+0118> mov       $0x41200000,%eax  ; IEEE floating point representation of 10
0x0000000100000afc <+0123> mov       %eax,-0x1c(%rbp)
0x0000000100000aff <+0126> movl      $0x3,-0x20(%rbp)  ; move b to stack
0x0000000100000b06 <+0133> mov       -0x20(%rbp),%eax  ; move to eax for conversion
0x0000000100000b09 <+0136> mov       %rax,-0x70(%rbp)  ; push full 64 bit value to stack
0x0000000100000b0d <+0140> cmpq      $0x0,-0x70(%rbp)  ; test for 0
0x0000000100000b12 <+0145> js        0x100000b21 <_Z6testerPxS_+160> ; if signed
0x0000000100000b14 <+0147> cvtsi2ssq -0x70(%rbp),%xmm0 ; convert b to float
0x0000000100000b1a <+0153> movss     %xmm0,-0x64(%rbp) ; move new float value to stack
0x0000000100000b1f <+0158> jmp       0x100000b43 <_Z6testerPxS_+194>
0x0000000100000b21 <+0160> mov       -0x70(%rbp),%rax
0x0000000100000b25 <+0164> shr       %rax
0x0000000100000b28 <+0167> mov       -0x70(%rbp),%rdx
0x0000000100000b2c <+0171> and       $0x1,%edx
0x0000000100000b2f <+0174> or        %rdx,%rax
0x0000000100000b32 <+0177> cvtsi2ssq %rax,%xmm0
0x0000000100000b37 <+0182> movaps    %xmm0,%xmm1
0x0000000100000b3a <+0185> addss     %xmm0,%xmm1
0x0000000100000b3e <+0189> movss     %xmm1,-0x64(%rbp)
0x0000000100000b43 <+0194> movss     -0x1c(%rbp),%xmm0 ; move a into xmm0
0x0000000100000b48 <+0199> divss     -0x64(%rbp),%xmm0 ; divide by b
0x0000000100000b4d <+0204> movss     %xmm0,-0x24(%rbp) ; c
As a quick aside, please see this link for a handy calculator of decimal to IEEE floating point values, which helps to explain the first assembly instruction.
As you can see, when one parameter is a float the entire chain of parameters must also be floats, and so GCC will, without being asked, convert the non-float parameters accordingly. This conversion process is not free, of course; each conversion instance adds around 16 clock cycles on my machine.
As a final cost-of-conversion example, this C++ code takes two floats and stores the result into an unsigned int. Note how the assembly created uses an SSE register to perform the division, but then issues a cvttss2si (Scalar Single-FP to Signed INT32 Conversion) instruction to convert the float to an int. Also note how the conversion is implicit: we do not cast the result, the cast simply happens because it's logically required for the program to work. The cost of this conversion is 8 cycles on my machine.
float a = 10;
float b = 3;
unsigned int c = a / b;
0x0000000100000b2f <+0118> mov       $0x41200000,%eax  ; push a to eax
0x0000000100000b34 <+0123> mov       %eax,-0x1c(%rbp)  ; move a to the stack
0x0000000100000b37 <+0126> mov       $0x40400000,%eax  ; push b to eax
0x0000000100000b3c <+0131> mov       %eax,-0x20(%rbp)  ; move b to the stack
0x0000000100000b3f <+0134> movss     -0x1c(%rbp),%xmm0 ; move a to xmm0
0x0000000100000b44 <+0139> divss     -0x20(%rbp),%xmm0 ; divide
0x0000000100000b49 <+0144> cvttss2siq %xmm0,%rax       ; convert to int
0x0000000100000b4e <+0149> mov       %eax,-0x24(%rbp)  ; move lower 32 bits to stack
The final point to make here is that just as we cannot use non-float operands for floating point results, we cannot, and this is important, directly alias __m128 vectors onto our integer data. This means we must first convert each uchar value to a float before it can be used.
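A hedged sketch of what that widening could look like with SSE2 only (the helper name is my own, and this is one possible approach, not the code used later in the post): zero-extend the bytes to 16-bit, then to 32-bit integers via unpack-with-zero, and finish with a packed int32-to-float conversion.

```cpp
#include <emmintrin.h>
#include <cstring>
#include <cassert>

// Widen four unsigned chars into a vector of four floats using SSE2.
__m128 uchar4_to_ps(const unsigned char *p)
{
    int raw;
    memcpy(&raw, p, 4);                              // 4 bytes, no aliasing tricks
    __m128i zero  = _mm_setzero_si128();
    __m128i bytes = _mm_cvtsi32_si128(raw);          // bytes land in the low lane
    __m128i w16   = _mm_unpacklo_epi8(bytes, zero);  // 8-bit  -> 16-bit
    __m128i w32   = _mm_unpacklo_epi16(w16, zero);   // 16-bit -> 32-bit
    return _mm_cvtepi32_ps(w32);                     // int32  -> float
}
```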
Implementing Our Code – A Few More Challenges
This conversion discussion brings us to the crux of the task: no matter how I structure the algorithm, we must use floating point numbers for the core conversion calculation. This means we'll be able to do at most one pixel per sub-iteration (though this doesn't mean we can't unroll the loop), as SSE allows for at most 4 single-precision floating point values per instruction. This is fine, but may undermine any potential gains using SSE brings to the table. As our example above shows, we may well be better served by simply unrolling our scalar loop.
Yet another issue: while we start with only 3 data elements per pixel, we must add a fourth, the k channel, based on the min value of the first calculation block. As stated above, one of the major no-no's of SIMD is accessing individual elements of a packed vector. This means that in order to add this new value we'll need to use the shuffle instructions, or at some point grab and manipulate individual values out of the vector. Both options introduce performance penalties.
Finally, we should note that this concept also ties into the idea of data dependencies, which is the notion that some values defined in a computation may depend on other values set at some earlier stage. In our case the algorithm is filled with such dependencies: The k value depends on the first computation step (a true dependency), which in turn creates a control dependency for the next computational step (the if statement), and so on.
The end result is that these data dependencies may limit our ability to shuffle instructions around to eliminate cache misses, unroll the loop, and so on.
To sum up then, we have four general challenges:
- We cannot directly link our source buffer and the __m128 containers, and will also need to use conversions.
- We are limited to 4 values at a time, but can unroll if needed.
- We would like to dynamically access and set vector elements without resorting to non-SIMD instructions.
- Various data dependencies limit how we can actually optimize the algorithm.
The First Benchmark
As a first look at an SSE implementation that offers minor performance improvements, consider this loop, which processes 35 2,500 x 1,800px CMYK JPEG images in 1.5 seconds. The non-SSE version performs the same task in 2 seconds:
// 7 seconds (core loop time around 1.5 seconds)
uchar *bits = imin.bits(); // Qt grabs all pixel data and places in uchar array

__m128 *v1 = (__m128*)bits;
__m128 *v2 = (__m128*)cmyk_temp;

__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);

int j = 0;

for (int i = 0; i < wt * ht; i++) {

    // holder
    float tmp[4] = {bits[j+2], bits[j+1], bits[j], 0};

    // vector containing 4 values
    __m128 *v_tmp = (__m128*)&tmp;

    // divide by 255
    *v_tmp = _mm_div_ps(*v_tmp, m_255);

    // subtract from 1
    *v_tmp = _mm_sub_ps(m_1, *v_tmp);

    // find min
    tmp[3] = qMin(tmp[0], qMin(tmp[1], tmp[2]));
    float tmp_3 = tmp[3]; // cache original value

    if (!qFuzzyCompare(tmp[3], 1)) {

        // subtract original c_k
        *v_tmp = _mm_sub_ps(*v_tmp, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));

        // subtract c_k from 1
        float bk_min = 1.0 - tmp[3];

        // divide (cmy - c_k) by (1.0 - c_k)
        *v_tmp = _mm_div_ps(*v_tmp, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));

        // reset black item
        tmp[3] = tmp_3;
    }

    // create final values
    *v_tmp = _mm_mul_ps(*v_tmp, m_255);

    // store values back into buffer
    cmyk_temp[j]   = fabs(tmp[0]); // 0
    cmyk_temp[j+1] = fabs(tmp[1]); // 1
    cmyk_temp[j+2] = fabs(tmp[2]); // 2
    cmyk_temp[j+3] = fabs(tmp[3]); // 3

    // increment counter
    j += 4;
}
While this code works, the problem is that we rely far too heavily on temporary values and on repeated extraction of single float values from the vectors to get the job done.
We also issue our own conversion instructions via fabs(), which works, but is a bit heavy handed. It would be better to let the compiler choose its own conversion strategy.
A second attempt addresses some of these issues, shaving off a half-second. It looks like:
// 6.5 seconds (core loop time around 1 second)
j = 0;
uchar *bits = imin.bits();

__m128 v2;
__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);

float tmp[4];
float tmp_3;

for (int i = 0; i < wt * ht; i++) {

    v2 = _mm_set_ps(1, bits[j], bits[j+1], bits[j+2]); // rev order!

    // divide by 255
    v2 = _mm_div_ps(v2, m_255);

    // subtract from 1
    v2 = _mm_sub_ps(m_1, v2);

    // store to find min
    _mm_store_ps(tmp, v2);

    // find min (_mm_min_ps)
    tmp_3 = qMin(tmp[0], qMin(tmp[1], tmp[2]));

    if (!qFuzzyCompare(tmp_3, 1)) {

        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));

        // subtract k from 1
        float bk_min = 1.0 - tmp_3;

        // divide (cmy - k) by (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));
    }

    v2 = _mm_mul_ps(v2, m_255); // multiply all by 255

    _mm_store_ps(tmp, v2); // save back to tmp buffer

    cmyk_temp[j]   = tmp[0]; // c
    cmyk_temp[j+1] = tmp[1]; // m
    cmyk_temp[j+2] = tmp[2]; // y
    cmyk_temp[j+3] = tmp_3 * 255; // k

    j += 4;
}
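The qMin chain above is one of the spots where we drop out of the vector domain. One possible alternative (the function name is my own, and I haven't benchmarked it inside the full loop) keeps the minimum computation in SSE registers, using _mm_min_ps plus two shuffles. For our CMYK case the unused fourth lane would need to be pre-filled with a large value so it cannot win the min.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Horizontal minimum across the four lanes of a vector.
float horizontal_min(__m128 v)
{
    // Fold the vector onto itself twice: pairwise min, then cross min.
    __m128 m = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    m = _mm_min_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(m); // the lowest lane now holds the overall min
}
```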
This code runs slightly faster, with improvements coming from several locations:
First, we’ve moved unneeded code outside of the loop. This won’t create huge gains, but it does make the inner loop a bit cleaner.
The next big change is that we've eliminated the pointer logic and instead rely on a temporary array to link the __m128 and final buffer values together. We already know this is slower, but again, we have no choice, as we cannot directly alias unsigned char and __m128 items.
All told, and despite the less-than-perfect code, the loop's SSE logic runs quite fast. While the entire process takes around 7 seconds for 35 large images, only 1 second of that total is spent in our core loop. The majority of the time is spent in other libraries such as Little CMS and libjpeg.
Still, the biggest hit we're taking now is the final part of the loop, where we write to the output buffer. As we now know, this means converting our floats back to ints, a costly operation. All told, these 4 lines account for half of the loop's total execution time:
cmyk_temp[j]   = tmp[0];      // c
cmyk_temp[j+1] = tmp[1];      // m
cmyk_temp[j+2] = tmp[2];      // y
cmyk_temp[j+3] = tmp_3 * 255; // k
As discussed previously, each float-to-int conversion takes around 8 cycles, and all these extra instructions really add up. Combine that with the expensive memory access needed to write to the buffer in the first place, and it's not hard to see why this is slow.
How Fast Will It Go?
Of course it's not all bad news. When we enable -O2 optimizations we'll see our execution speed increase for all versions, and of course we can also unroll our loop to provide better utilization of instruction pairing. Finally, SSE provides instructions for bypassing the cache mechanisms of the processor, as well as others for pre-fetching the next set of values if using the cache.
Thus, to see how fast we can make this code we’ll make use of loop unrolling and pre-fetching in this final example.
_mm_prefetch() is issued within the loop with the intention of starting our memory access ahead of time, so that when the next line of data is needed it's read from a cache line rather than main memory. This gives a measurable and consistent speedup of around 50 ms.
It is worth noting, however, that when this code is profiled for L2 cache misses we see some definite saturation going on. Optimizing cache access can be a tricky affair, requiring that we manually check for hot spots in our code to see where we might run out of cache lines.
The general rule is that the working set of our loop should fit within the cache size. The cache size, in turn, depends on the processor. As our loop consumes huge amounts of data relative to the cache, it's important that we don't unroll too far, as:
a) we will run out of registers to hold intermediate values, causing more cache reads (and potential misses)
b) we can overfill the cache lines with raw data if we make the working set too large.
Good tools for viewing cache performance include Shark on OS X, PAPI or Valgrind on Linux, and Visual Studio on Windows.
Finally, we also use the _mm_shuffle_ps instruction to insert the k value into each vector. This eliminates some of the overhead of member-wise access, but is still not perfect. Again, though, as the source and destination buffers are not floats, this is a price we have to pay.
j = 0;
uchar *bits = imin.bits();
__m128 v1, v2, v3, v4;
__m128 k1, k2, k3, k4;
__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);
float tmp1[4], tmp2[4], tmp3[4], tmp4[4];
float tmp_1, tmp_2, tmp_3, tmp_4;

for ( int i = 0; i < wt * ht; i += 4){
    v1 = _mm_set_ps(1, bits[j],    bits[j+1],  bits[j+2]);  // rev order!
    v2 = _mm_set_ps(1, bits[j+4],  bits[j+5],  bits[j+6]);  // rev order!
    v3 = _mm_set_ps(1, bits[j+8],  bits[j+9],  bits[j+10]); // rev order!
    v4 = _mm_set_ps(1, bits[j+12], bits[j+13], bits[j+14]); // rev order!

    // fetch next data values
    _mm_prefetch(&bits[j+16], _MM_HINT_T0);

    // divide by 255
    v1 = _mm_div_ps(v1, m_255);
    v2 = _mm_div_ps(v2, m_255);
    v3 = _mm_div_ps(v3, m_255);
    v4 = _mm_div_ps(v4, m_255);

    // subtract 1
    v1 = _mm_sub_ps(m_1, v1);
    v2 = _mm_sub_ps(m_1, v2);
    v3 = _mm_sub_ps(m_1, v3);
    v4 = _mm_sub_ps(m_1, v4);

    // store to find min
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);

    // find min (_mm_min_ps)
    tmp_1 = qMin(tmp1[0], qMin(tmp1[1], tmp1[2]));
    tmp_2 = qMin(tmp2[0], qMin(tmp2[1], tmp2[2]));
    tmp_3 = qMin(tmp3[0], qMin(tmp3[1], tmp3[2]));
    tmp_4 = qMin(tmp4[0], qMin(tmp4[1], tmp4[2]));

    if (!qFuzzyCompare(tmp1[3], 1)) {
        // subtract original k
        v1 = _mm_sub_ps(v1, _mm_set_ps(tmp_1, tmp_1, tmp_1, tmp_1));
        // subtract 1 from base k
        float bk_min1 = 1.0 - tmp_1;
        // divide by (cmy - k) - (1.0 - k)
        v1 = _mm_div_ps(v1, _mm_set_ps(bk_min1, bk_min1, bk_min1, bk_min1));
    }

    if (!qFuzzyCompare(tmp2[3], 1)) {
        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_2, tmp_2, tmp_2, tmp_2));
        // subtract 1 from base k
        float bk_min2 = 1.0 - tmp_2;
        // divide by (cmy - k) - (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min2, bk_min2, bk_min2, bk_min2));
    }

    if (!qFuzzyCompare(tmp3[3], 1)) {
        // subtract original k
        v3 = _mm_sub_ps(v3, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));
        // subtract 1 from base k
        float bk_min3 = 1.0 - tmp_3;
        // divide by (cmy - k) - (1.0 - k)
        v3 = _mm_div_ps(v3, _mm_set_ps(bk_min3, bk_min3, bk_min3, bk_min3));
    }

    if (!qFuzzyCompare(tmp4[3], 1)) {
        // subtract original k
        v4 = _mm_sub_ps(v4, _mm_set_ps(tmp_4, tmp_4, tmp_4, tmp_4));
        // subtract 1 from base k
        float bk_min4 = 1.0 - tmp_4;
        // divide by (cmy - k) - (1.0 - k)
        v4 = _mm_div_ps(v4, _mm_set_ps(bk_min4, bk_min4, bk_min4, bk_min4));
    }

    // store to extract (prevents green cast)
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);

    k1 = _mm_set_ps(tmp_1, tmp1[2], 0, 0); // reverse
    k2 = _mm_set_ps(tmp_2, tmp2[2], 0, 0); // reverse
    k3 = _mm_set_ps(tmp_3, tmp3[2], 0, 0); // reverse
    k4 = _mm_set_ps(tmp_4, tmp4[2], 0, 0); // reverse

    // add fourth vector back in (k)
    v1 = _mm_shuffle_ps(v1, k1, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v2 = _mm_shuffle_ps(v2, k2, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v3 = _mm_shuffle_ps(v3, k3, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v4 = _mm_shuffle_ps(v4, k4, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest

    // multiply all by 255
    v1 = _mm_mul_ps(v1, m_255);
    v2 = _mm_mul_ps(v2, m_255);
    v3 = _mm_mul_ps(v3, m_255);
    v4 = _mm_mul_ps(v4, m_255);

    // standard store
    _mm_store_ps(tmp1, v1); // save back to tmp buffer
    _mm_store_ps(tmp2, v2); // save back to tmp buffer
    _mm_store_ps(tmp3, v3); // save back to tmp buffer
    _mm_store_ps(tmp4, v4); // save back to tmp buffer

    // mem write
    cmyk_temp[j]    = tmp1[0]; // c
    cmyk_temp[j+1]  = tmp1[1]; // m
    cmyk_temp[j+2]  = tmp1[2]; // y
    cmyk_temp[j+3]  = tmp1[3]; // k
    cmyk_temp[j+4]  = tmp2[0]; // c
    cmyk_temp[j+5]  = tmp2[1]; // m
    cmyk_temp[j+6]  = tmp2[2]; // y
    cmyk_temp[j+7]  = tmp2[3]; // k
    cmyk_temp[j+8]  = tmp3[0]; // c
    cmyk_temp[j+9]  = tmp3[1]; // m
    cmyk_temp[j+10] = tmp3[2]; // y
    cmyk_temp[j+11] = tmp3[3]; // k
    cmyk_temp[j+12] = tmp4[0]; // c
    cmyk_temp[j+13] = tmp4[1]; // m
    cmyk_temp[j+14] = tmp4[2]; // y
    cmyk_temp[j+15] = tmp4[3]; // k

    j += 16;
}
As a baseline, our previous SSE version from the last section runs in 2505 milliseconds, or 1384 ms with -O2 optimizations enabled. This is a noticeable improvement* over the non-SSE version, which runs at 2857 ms unoptimized and 2670 ms with -O2.
* These times are now per core, whereas the previously quoted values were across all 4 cores.
Our new unrolled and prefetched version performs even better, clocking in at 842 ms. Compared to our original non-SSE version this represents around a 3x speedup. Not bad!
The Final Frontier
The last big area of contention now appears to be non-user code in the form of the Little CMS and libjpeg libraries. Specifically, as we can see from the Instruments output below, Little CMS's Eval4Inputs and Qt's call to jpeg_idct_islow are now the biggest performance bottlenecks:
[Instruments time profile]
As you can also see, our SSE-optimized routine (the yellow highlight) is now at 878 ms, faster than Eval4Inputs and jpeg_idct_islow. This was of course not the situation before the SSE optimizations.
What can we do about these slower library calls? For one, there is a faster, freely-available JPEG library called libjpeg-turbo, which uses SSE to significantly speed up the decoding of JPEG files. The best part is that the API is the same as the older libjpeg V6 and V8 versions, so using this faster library is literally a matter of changing our include paths.
In the case of a Qt project, all we need to do is update the INCLUDEPATH and LIBS directives in our .pro file. On my Mac, where libjpeg-turbo provides a convenient installer, the lib and header files were located at:
INCLUDEPATH += /Users/grdinic/Documents/libs/boost_1_44_0
#INCLUDEPATH += /user/local/include/
INCLUDEPATH += /opt/libjpeg-turbo/include/

LIBS += -L/usr/local/lib -ltiff -L/usr/local/lib -llcms2
#LIBS += -L/usr/local/lib -ljpeg
LIBS += -L/opt/libjpeg-turbo/lib64 -lturbojpeg
As you can see, I have commented out the ‘older’ version of the library so the new one is used.
Unfortunately, despite our own user code now using libjpeg-turbo, it appears that Qt is still calling its own version of the jpeg library, as the expanded call stack shows:
[Instruments call stack]
This of course makes it very difficult to realize the speed gains of libjpeg-turbo without rewriting Qt's image-reading routines.
The calls to Eval4Inputs are trickier still, as that would mean rewriting parts of Little CMS. This is not something we'll do in this post, though we may revisit it at some point in the future. If nothing else, the name Eval4Inputs is rather evocative.
Conclusion
All told, as I had feared, the limitations of SSE coding (namely the conversion steps and the cost of accessing single elements of a vector, i.e. member-wise access) meant that the first iteration of the non-optimized SSE code was not appreciably faster than the non-SSE version.
However, as soon as we rewrote the code to make better use of -O2 optimizations, instruction pairing, cache pre-fetching, and loop unrolling, we saw a very solid gain of 3x. Even the non-unrolled version saw an average gain of 1.5-2x.
The bottom line is that with -O2 optimizations on, it would be hard to write SSE code that's slower than non-SSE code, provided we follow the basic rules: stay aware of conversion costs and limit member-wise access.
When you further consider that just about every desktop CPU has at minimum SSE 2 support, the speed and efficiency gained by SSE coding outweighs just about any potential down-side.
Useful Links
http://easycalculation.com/binary-converter.php
http://msdn.microsoft.com/en-us/library/4d3eabky.aspx
http://en.wikipedia.org/wiki/SSE4
http://www.intel.com/products/processor/manuals/index.htm
http://software.intel.com/en-us/articles/avx-emulation-header-file/
http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00145.html