SSE SIMD programming is a fascinating subject, but also one that can be a bit difficult to approach. In this post I'm going to create a SIMD version of my RGB->CMYK algorithm and, in the process, show a bunch of handy tricks for working with SIMD.
This post deals with some of the problems and challenges we face when implementing SIMD code, paying close attention to intrinsics, basic SIMD code setup, and buffer type conversion.
The first aspect to consider when creating SIMD code is how you want to handle the actual coding process. We have two main options. The first is to code in raw assembly. This is a solid approach, but it means we, as coders, have to deal with stack variables, stack management, loop code, and several other advanced topics.
As an alternative we can use intrinsics: compiler-provided functions that map more or less directly to the underlying assembly instructions, and which allow us to stay in the native C++ or C coding environment.
For this post we’ll use Intrinsics.
The good news is that using intrinsics means simply including a header file in your code, as intrinsics are not a library but rather a function of the compiler. Which header file we include depends on the SSE version we need, which in turn depends on the target architecture/processor. In general, include the proper header as your code's needs demand:
#ifdef __MMX__
#include <mmintrin.h>
#endif

#ifdef __SSE__
#include <xmmintrin.h>
#endif

#ifdef __SSE2__
#include <emmintrin.h>
#endif

#ifdef __SSE3__
#include <pmmintrin.h>
#endif

#ifdef __SSSE3__
#include <tmmintrin.h>
#endif

#if defined (__SSE4_2__) || defined (__SSE4_1__)
#include <smmintrin.h>
#endif
On the most basic level, to use SSE2 intrinsics all we need to do is include the following line in our code file:

#include <emmintrin.h>
You can read more about basic Intel Intrinsics here.
One final point for Qt developers: to use these intrinsics we also need to give the compiler a hint by adding this to our .pro file:
QMAKE_CXXFLAGS += -msse2
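As a quick sanity check that the header and flag are wired up correctly, here is a minimal sketch (the function name is my own) that multiplies four floats in a single SSE instruction:

```cpp
#include <emmintrin.h>
#include <cassert>

// Multiply four packed single-precision floats: out[i] = a[i] * b[i].
// A minimal sketch to confirm the header and the -msse2 flag work.
void mul4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);            // unaligned load of 4 floats
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_mul_ps(va, vb)); // 4 multiplies in one instruction
}
```

If this compiles and links, the intrinsics environment is ready to go.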
The Instructions We’ll Use
The basic workings of SIMD computing are actually quite simple to describe. Instead of working on 1 value at a time, we group values together and apply the same calculation over all elements at once. How many elements depends on the variable type, but the values generally range from 2 to 16. This means in theory your code could see a 16x performance improvement if working with that smallest variable size.
In reality, performance gains tend to be more modest, but still very respectable. In our case we have the potential for a 4x speedup. As part of converting RGB to CMYK we must convert our input values into single-precision floats, which means we're limited to working on 4 values at a time.
A First Example
To get started, let's take a look at the following C++ code, which multiplies two arrays of 8 single-precision floats element by element:
// Raw C++ Method 1 - 144/152 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z3[8];

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    z3[i] = z1[i] * z2[i];
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << z3[i] << endl;
}
Now let’s compare that with an SSE Intrinsics Version:
// SIMD Method 1 - Using Pointers : 56/64 clocks
float a1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float a2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float a3[8];

__asm__ __volatile__("nop" :::);

__m128 *v_a1 = (__m128*)a1;
__m128 *v_a2 = (__m128*)a2;
__m128 *v_a3 = (__m128*)a3;

for (int i = 0; i < 2; i++) {
    *v_a3 = _mm_mul_ps(*v_a1, *v_a2);
    v_a1++;
    v_a2++;
    v_a3++;
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << a3[i] << endl;
}
There are a couple of things of note here. First is how we tie our C variables to the __m128 data items. In short, we do not operate on __m128s as we do normal C++ variables; instead we use them as holders of values, which we pass as operands to the various intrinsic functions rather than assigning to or referencing them directly.

This seems limiting until you see how the actual 'tying' operation is performed:
__m128 *v1 = (__m128*)a1;
In other words, we create a pointer of type __m128 and assign it the same memory address as the 'normal' a1 float array. The net effect is that, behind the scenes, the compiler intrinsics generate the proper code to transparently link the two together.
We can do this another way, though: without pointers, using explicit load instructions instead:
// SIMD Method 3 - Using Loads/Stores : 88/96 clocks
float b1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float b2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float b3[8];

__m128 v_b1, v_b2, v_b3;
int j = 0;

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 2; i++) {
    v_b1 = _mm_load_ps(b1 + j);
    v_b2 = _mm_load_ps(b2 + j);
    v_b3 = _mm_mul_ps(v_b1, v_b2);
    _mm_store_ps(b3 + j, v_b3);
    j += 4;
}

__asm__ __volatile__("nop" :::);
The big difference is how we load and save our buffers. In the first example we use pointers to alias the two arrays together, in the second we use explicit declarations by way of the _mm_load_ps and _mm_store_ps statements.
Before we talk about the performance trade-offs of each method, note that in both examples it’s key to understand how we increment the vectors to point to the next four elements.
In the pointer example we simply use the ++ operator. In assembly, the following is performed:
0x0000000100000b12 <+0389> addq $0x10,-0x48(%rbp)
That is, we add 16 to the base address of the arrays so that the next loop grabs the next four elements instead of the previous ones.
The load/store example instead confines the book-keeping to a single index variable (j), which is incremented with:
0x0000000100000b19 <+0500> addl $0x4,-0x1c(%rbp)
We then use j as an offset to the next batch of items from our source array:
v_b1 = _mm_load_ps(b1 + j);
Which produces the following (unoptimized) assembly (please note all assembly in this post uses AT&T syntax, where operands are ‘reversed’ compared to Intel syntax):
0x0000000100000a6b <+0326> mov       -0x1c(%rbp),%eax  ; base pointer move to eax
0x0000000100000a6e <+0329> cltq                        ; convert %eax double word to quad word
0x0000000100000a70 <+0331> shl       $0x2,%rax         ; left shift
0x0000000100000a74 <+0335> mov       %rax,%rdx         ; %rdx now contains quad word j value
0x0000000100000a77 <+0338> lea       -0xe0(%rbp),%rax
0x0000000100000a7e <+0345> add       %rdx,%rax         ; increment offset pointer with j
0x0000000100000a81 <+0348> mov       %rax,-0x50(%rbp)
0x0000000100000a85 <+0352> mov       -0x50(%rbp),%rax  ; align instructions (stall)
0x0000000100000a89 <+0356> movaps    (%rax),%xmm0      ; finally, move (aligned) floats to xmm0
That’s quite a few instructions, which leads us to:
Which method is better?
Setting aside the raw C++ method for a moment: as you can see from the comments at the top of each listing, a simple benchmark shows the pointer version to be faster by a good amount, in terms of both the lowest and the highest average clock count.
The main reason is that the second method, in assembly, employs several expensive conversion and instruction/data alignment operations for the 'separate/extra' j variable. This will be an important constraint moving forward, as our RGB->CMYK algorithm will suffer from not being able to use the faster pointer method directly. More on that later.
At this point we should mention something quite important: what about unrolling our raw C++ code?
// Raw C++ Method 2 - 1 x Unrolled : 80/88 clocks
float z1[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z2[] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
float z3[8];

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i += 4) {
    z3[i]   = z1[i]   * z2[i];
    z3[i+1] = z1[i+1] * z2[i+1];
    z3[i+2] = z1[i+2] * z2[i+2];
    z3[i+3] = z1[i+3] * z2[i+3];
}

__asm__ __volatile__("nop" :::);

for (int i = 0; i < 8; i++) {
    cout << z3[i] << endl;
}
Note the clock times of 80/88. Our unrolled version is just as fast as the Load/Store SIMD one!
Why is this so? Simple: in unrolled mode GCC, even with no optimizations turned on (debugging output), is able to create fast code using simple floating-point movss and mulss instructions. This takes us to…
SIMD Lesson 1
SIMD can in fact be detrimental for smaller data sets, or for operations where we’re bound by the memory sub-system. Yes our pointer version is still a good deal faster, but that’s in part because it’s a very simple example.
Why is ‘being simple’ pertinent? One of the challenges of SIMD programming is that we should treat each vector as a single unit, not as autonomous agents. As soon as we violate this rule by extracting individual values performance degrades quickly. This means we must be very clever in how we devise our algorithms, something we’ll see first hand when we get into the RGB->CMYK conversion later in this post.
Of course, if the algorithm is already complex, SIMD may make it more so. The bottom line is that SIMD is only really useful if you have an algorithm that allows the same operation to be applied to an entire vector of elements at once. There are many areas where this does apply, and in those cases SIMD is a real firecracker, but not always. At a very minimum, in order to gain the advantages SIMD offers we need to pack 2 or more separate values into an xmm register, and this packing process is not free.
When you toss in the fact that vectorized code can be difficult to create and manage, it becomes obvious that SIMD'izing your code should not be the first optimization choice for most applications; choosing the right algorithm should get that distinction.
As a last point on code size and speed, consider that a fully unrolled version of our C++ code takes 32/40 cycles, or 16/24 if the array definitions are moved outside the benchmarked area. The best performance I could achieve with unrolled SIMD is 48 cycles, or 24/32 if, again, the __m128 and array items are moved outside the benchmark section.
Packed Values
It may be obvious by now, but the basic concept we need to master in SIMD code is the art of packing several scalar values together into a single xmm register and then applying our logic to this ‘packed’ bit string.
We are by no means limited to only dealing with the entire packed vector; one thing you’ll notice when working with intrinsics is a multitude of related instructions for dealing with packed and non-packed values.
Generally speaking, we’ll see intrinsics instructions taking the form of:
_mm_rsqrt_ps
_mm_rsqrt_ss
These differ only in the last two letters: the first operates on Packed Single-precision values, the second on a Single Scalar value.

As the names suggest, the _ps variants of an intrinsic operate on all values of the xmm register at once, while the _ss variants touch only the lowest value.
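To make the distinction concrete, here is a small sketch (the function names are my own, and I've used add rather than rsqrt so the lane-by-lane effect is exact and easy to verify): the packed add touches all four lanes, while the scalar add changes only the lowest lane and passes the upper three lanes of the first operand through unchanged.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Packed add: every lane of the result is a[i] + b[i].
void add_ps_demo(float out[4])
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f); // lanes 0..3 hold 1,2,3,4
    __m128 b = _mm_set_ps1(10.0f);                 // 10 in every lane
    _mm_storeu_ps(out, _mm_add_ps(a, b));          // {11, 12, 13, 14}
}

// Scalar add: only lane 0 is a[0] + b[0]; lanes 1..3 copy a.
void add_ss_demo(float out[4])
{
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
    __m128 b = _mm_set_ps1(10.0f);
    _mm_storeu_ps(out, _mm_add_ss(a, b));          // {11, 2, 3, 4}
}
```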
When we combine this knowledge with the shuffle intrinsics such as _mm_shuffle_ps, we now have a way of ordering and selecting individual values from our vectors.
This is another way of saying that this is how we introduce logic into the hammer-like approach of SSE processing. Sure, we're most efficient when hammering out several operations at once, but shuffles and _ss instructions give us a more fine-grained way of dealing with individual values.
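As a small shuffle sketch (the function name is my own): _MM_SHUFFLE lists source lane indices from highest destination lane to lowest, so _MM_SHUFFLE(0, 1, 2, 3) selects lanes 3, 2, 1, 0, a full reversal of the vector.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Reverse the four lanes of a vector with _mm_shuffle_ps.
// The two vector operands are the same here, so we are simply
// permuting one vector's lanes.
void reverse4(const float *in, float *out)
{
    __m128 v = _mm_loadu_ps(in);
    __m128 r = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
    _mm_storeu_ps(out, r);
}
```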
Real World SIMD Applications
Now that we have a good idea of the very basics of how to link C++ and intrinsic data types together, we can look at a more complex example.
Of course there are plenty of examples of SIMD on the Internet, from simple tutorials by hobbyists to professional write-ups from Intel. The point of this post is to show you the process I’ve taken to implement a SIMD solution for an already high-performance C++ code block, with an emphasis on discovery and basic concepts.
That said, we must keep in mind that there are important design decisions to be made, as well as potential pitfalls and trade-offs that could very well make writing SIMD code a poor choice for our optimization targets.
As a perfect case-in-point, consider one of the most common performance issues, the conversion of data from one type to another. To see why this is an issue, let’s take a look at the algorithm we want to vectorize:
uchar *bits = imin.bits();

qreal c_c, c_m, c_y, c_k;
int j = 0;

for (int i = 0; i < wt * ht; i++) {

    c_c = 1.0 - (bits[j+2] / qreal(255));
    c_m = 1.0 - (bits[j+1] / qreal(255));
    c_y = 1.0 - (bits[j] / qreal(255));

    c_k = qMin(c_c, qMin(c_m, c_y));

    if (!qFuzzyCompare(c_k, 1)) {
        c_c = (c_c - c_k) / (1.0 - c_k);
        c_m = (c_m - c_k) / (1.0 - c_k);
        c_y = (c_y - c_k) / (1.0 - c_k);
    }

    cmyk_temp[j]     = c_c * 255;
    cmyk_temp[j + 1] = c_m * 255;
    cmyk_temp[j + 2] = c_y * 255;
    cmyk_temp[j + 3] = c_k * 255;

    j += 4;
}
The code follows the following form:
We read from an input buffer consisting of blocks of four unsigned chars that will always be between 1 and 255. We take the first 3 values of each block to create the base cyan, magenta, and yellow color channels by dividing each value by 255 and subtracting the result from 1.

From the standpoint of the color values being created, this has the effect of constraining our values between 0 and 1. For example, for a pixel containing 100% cyan and no other color, the JPEG file may encode this as:
255, 255, 1, 255 (the last 255 is the alpha channel, which is always 255 for our images and is ignored)
This means the lower numbers become quite small, and the higher numbers less than or equal to 1:
c = 1 / 255 = 0.003921568627451
m = 255 / 255 = 1
y = 255 / 255 = 1
Now we subtract each value from 1, but it's this step, combined with the division, which causes a big problem. Notice the format of our potential result:

1 - 0.003921568627451 = 0.996078431372549

We start with unsigned chars (small integers, really), but our conversion calculation creates intermediate fractional floats. This is the reason we're limited to 4 values at a time. In theory we could do 4 pixels (16 values) at a time if everything stayed as unsigned chars.
This just isn't possible though, so not only are we limited to doing 1 pixel at a time, we're further penalized by having to convert everything into a float in the first place. This is not our choice, but a fact of life: Qt's image routines only return unsigned chars.
As we saw above, using the pointer method to link a source buffer and an __m128 vector produces the fastest code. In essence, we use intrinsics to say, "this value exists as a float array and as an __m128; go ahead and use whichever interpretation you need to create our final output buffer values".
This is convenient because, as high-level coders, we do not even need to think about any conversion process; the compiler simply does it for us if required. The problem is this only works well when our source and intermediate values are of the same type. While on my x86-64 machine floats and unsigned ints are both 4 bytes, we cannot treat unsigned ints as floats and expect to get accurate results back.
This fact gets at a very important aspect of type conversions, which is that unless we know how they work, we can easily produce faulty code.
Consider:
unsigned int a = 8;
unsigned int b = 9;
unsigned int c = a / b;
cout << c << endl;
The output from this code: 0.
The reason becomes apparent when we see the assembly code:
0x0000000100000b37 <+0118> movl   $0x8,-0x1c(%rbp)  ; move a
0x0000000100000b3e <+0125> movl   $0x9,-0x20(%rbp)  ; move b
0x0000000100000b45 <+0132> mov    -0x1c(%rbp),%eax  ; move a to eax
0x0000000100000b48 <+0135> mov    $0x0,%edx         ; push 0 to edx
0x0000000100000b4d <+0140> divl   -0x20(%rbp)       ; divide eax with b
0x0000000100000b50 <+0143> mov    %eax,-0x24(%rbp)  ; push result to stack
8 is moved into %eax, with 9 being referenced as divl's second operand via the stack at -0x20(%rbp). The result is then placed into %eax and pushed back onto the stack at -0x24.
The divl instruction returns any remainder in the %edx register, but as we can see, only %eax is pushed back to memory. The end result, then, is that if our division produces a whole integer we'll get a correct value back. If it doesn't, we'll get at best a truncated value. Of course this is a result that's totally unacceptable for our algorithm.
The fact is that floats are represented completely differently from ints in our CPUs, which in turn means that if we expect an accurate result back from a calculation that produces a fractional value, we must use floating point values, or casts, for at least one of the input operands.
This means we could not, for example, change just the destination parameter to a float and expect a proper result:
unsigned int a = 10;
unsigned int b = 7;
float c = a / b;
cout << c << endl;
Despite using a float as the destination, the value we receive back is still 1.
The pertinent question of course is: why?
Before we answer that question, just know that this is the reason we cannot simply use pointers to alias our integer buffer and the __m128 containers as preferred. We need to employ the explicit load method to correctly convert our unsigned chars to floats and back. The downside, of course, is that we already know this conversion will produce slower code, sometimes significantly so.
So why does at least one input variable need to be a float? To help answer this, let's take a look at the GCC assembly this code produces:
unsigned int a = 10;
unsigned int b = 3;
float c = a / b;
cout << c << endl;
0x0000000100000af3 <+0118> movl   $0xa,-0x1c(%rbp)  ; a var
0x0000000100000afa <+0125> movl   $0x3,-0x20(%rbp)  ; b var
0x0000000100000b01 <+0132> mov    -0x1c(%rbp),%eax  ; move a into eax
0x0000000100000b04 <+0135> mov    $0x0,%edx         ; place 0 into edx (clears the remainder register)
0x0000000100000b09 <+0140> divl   -0x20(%rbp)       ; divide eax with b
0x0000000100000b0c <+0143> mov    %eax,%eax         ; aligned nop
The key point is that at no point during our division task do we use floating point registers, which means the integer workaround for handling remainders is all we have. In short, we will never get a floating point result back, as the two source values going into c are not floating point numbers; we end up using a form of division that cannot produce a proper floating point result.
However, you’ll notice c is still defined as a floating point variable, which is then used in the cout call.
Here’s the kicker: as we pass cout a float, the compiler, without being asked, ends up converting the int to a float with the following assembly code which is found below our division code:
0x0000000100000b37 <+0186> cvtsi2ssq %rax,%xmm0
0x0000000100000b3c <+0191> movaps    %xmm0,%xmm1
0x0000000100000b3f <+0194> addss     %xmm0,%xmm1
0x0000000100000b43 <+0198> movss     %xmm1,-0x6c(%rbp)
0x0000000100000b48 <+0203> movss     -0x6c(%rbp),%xmm0
0x0000000100000b4d <+0208> movss     %xmm0,-0x24(%rbp)
0x0000000100000b52 <+0213> movss     -0x24(%rbp),%xmm0
0x0000000100000b57 <+0218> mov       0x4d2(%rip),%rdi  # 0x100001030
0x0000000100000b5e <+0225> callq     0x100000d28 <dyld_stub__ZNSolsEf>
Just above this code block we switch from using %eax to %rax, which, as you may recall, is the full 64 bits of the ax register. The point of this switch is, as we can see in the first line of assembly, to convert (via cvtsi2ss) and push this value into the %xmm0 register, which means we're now in floating point mode. Of course, as the original division operation produced an int result, converting to a float at this point is more or less pointless, since the input to this code block is always an int.
This conversion step is key, as it takes us to heart of why casting can be an expensive operation to undertake.
Just to drive this point home from the low-level perspective, let’s say we update our code to use float casts:
unsigned int a = 10;
unsigned int b = 3;
float c = (float)a / (float)b;
cout << c << endl;
Here’s what the commented assembly looks like:
0x0000000100000abb <+0118> movl      $0xa,-0x1c(%rbp)  ; a
0x0000000100000ac2 <+0125> movl      $0x3,-0x20(%rbp)  ; b
0x0000000100000ac9 <+0132> mov       -0x1c(%rbp),%eax
0x0000000100000acc <+0135> mov       %rax,-0x70(%rbp)
0x0000000100000ad0 <+0139> cmpq      $0x0,-0x70(%rbp)
0x0000000100000ad5 <+0144> js        0x100000ae4 <_Z6testerPxS_+159>
0x0000000100000ad7 <+0146> cvtsi2ssq -0x70(%rbp),%xmm0 ; convert a to a float
0x0000000100000add <+0152> movss     %xmm0,-0x68(%rbp)
0x0000000100000ae2 <+0157> jmp       0x100000b06 <_Z6testerPxS_+193>
0x0000000100000ae4 <+0159> mov       -0x70(%rbp),%rax
0x0000000100000ae8 <+0163> shr       %rax
0x0000000100000aeb <+0166> mov       -0x70(%rbp),%rdx
0x0000000100000aef <+0170> and       $0x1,%edx
0x0000000100000af2 <+0173> or        %rdx,%rax
0x0000000100000af5 <+0176> cvtsi2ssq %rax,%xmm0
0x0000000100000afa <+0181> movaps    %xmm0,%xmm1
0x0000000100000afd <+0184> addss     %xmm0,%xmm1
0x0000000100000b01 <+0188> movss     %xmm1,-0x68(%rbp)
0x0000000100000b06 <+0193> mov       -0x20(%rbp),%eax  ; move b to eax
0x0000000100000b09 <+0196> mov       %rax,-0x78(%rbp)  ; move 64 bits to stack
0x0000000100000b0d <+0200> cmpq      $0x0,-0x78(%rbp)  ; is 0?
0x0000000100000b12 <+0205> js        0x100000b21 <_Z6testerPxS_+220> ; jmp if signed
0x0000000100000b14 <+0207> cvtsi2ssq -0x78(%rbp),%xmm0 ; if not signed, convert b to float
0x0000000100000b1a <+0213> movss     %xmm0,-0x64(%rbp)
0x0000000100000b1f <+0218> jmp       0x100000b43 <_Z6testerPxS_+254>
0x0000000100000b21 <+0220> mov       -0x78(%rbp),%rax
0x0000000100000b25 <+0224> shr       %rax
0x0000000100000b28 <+0227> mov       -0x78(%rbp),%rdx
0x0000000100000b2c <+0231> and       $0x1,%edx
0x0000000100000b2f <+0234> or        %rdx,%rax
0x0000000100000b32 <+0237> cvtsi2ssq %rax,%xmm0
0x0000000100000b37 <+0242> movaps    %xmm0,%xmm1
0x0000000100000b3a <+0245> addss     %xmm0,%xmm1
0x0000000100000b3e <+0249> movss     %xmm1,-0x64(%rbp)
0x0000000100000b43 <+0254> movss     -0x68(%rbp),%xmm0
0x0000000100000b48 <+0259> divss     -0x64(%rbp),%xmm0 ; divide a and b
0x0000000100000b4d <+0264> movss     %xmm0,-0x24(%rbp)
0x0000000100000b52 <+0269> movss     -0x24(%rbp),%xmm0
0x0000000100000b57 <+0274> mov       0x4d2(%rip),%rdi
The end result of having to convert our parameters to floats: the clock cycle count is 40 with the conversions and 24 without. Clearly conversions have their cost, but in our case, what choice do we have? For our RGB to CMYK conversion process it is an unfortunate necessity: we get an array of uchars, but the calculations produce floats. The conversion is simply something we have to deal with.
As for our original question of why having one float parameter allows for a correct answer, the short version is: GCC cannot divide a float by an int. It must first convert the non-float operands to floats, as seen below when the following C++ code is fed to GCC:
float a = 10;
unsigned int b = 3;
float c = a / b;
Commented assembly code:
0x0000000100000af7 <+0118> mov       $0x41200000,%eax  ; IEEE floating point representation of 10
0x0000000100000afc <+0123> mov       %eax,-0x1c(%rbp)
0x0000000100000aff <+0126> movl      $0x3,-0x20(%rbp)  ; move b to stack
0x0000000100000b06 <+0133> mov       -0x20(%rbp),%eax  ; move to eax for conversion
0x0000000100000b09 <+0136> mov       %rax,-0x70(%rbp)  ; push full 64 bit value to stack
0x0000000100000b0d <+0140> cmpq      $0x0,-0x70(%rbp)  ; test for 0
0x0000000100000b12 <+0145> js        0x100000b21 <_Z6testerPxS_+160> ; if signed
0x0000000100000b14 <+0147> cvtsi2ssq -0x70(%rbp),%xmm0 ; convert b to float
0x0000000100000b1a <+0153> movss     %xmm0,-0x64(%rbp) ; move new float value to stack
0x0000000100000b1f <+0158> jmp       0x100000b43 <_Z6testerPxS_+194>
0x0000000100000b21 <+0160> mov       -0x70(%rbp),%rax
0x0000000100000b25 <+0164> shr       %rax
0x0000000100000b28 <+0167> mov       -0x70(%rbp),%rdx
0x0000000100000b2c <+0171> and       $0x1,%edx
0x0000000100000b2f <+0174> or        %rdx,%rax
0x0000000100000b32 <+0177> cvtsi2ssq %rax,%xmm0
0x0000000100000b37 <+0182> movaps    %xmm0,%xmm1
0x0000000100000b3a <+0185> addss     %xmm0,%xmm1
0x0000000100000b3e <+0189> movss     %xmm1,-0x64(%rbp)
0x0000000100000b43 <+0194> movss     -0x1c(%rbp),%xmm0 ; move a into xmm0
0x0000000100000b48 <+0199> divss     -0x64(%rbp),%xmm0 ; divide by b
0x0000000100000b4d <+0204> movss     %xmm0,-0x24(%rbp) ; c
As a quick aside, please see this link for a handy calculator of decimal to IEEE floating point values, which helps to explain the first assembly instruction.
As you can see, when one parameter is a float the entire chain of parameters must also be floats, and so GCC will, without being asked, convert the non-float parameters accordingly. This conversion process is not free, of course; each conversion instance adds around 16 clock cycles on my machine.
As a final cost-of-conversion example, this C++ code takes two floats and stores the result into an unsigned int. Note how the assembly created uses an SSE register to perform the division, but then issues a cvttss2si (Scalar Single-FP to Signed INT32 Conversion) instruction to convert the float to an int. Also note how the conversion is implicit: we do not cast the result, the cast simply happens because it's logically required for the program to work. The cost of this conversion is 8 cycles on my machine.
float a = 10;
float b = 3;
unsigned int c = a / b;
0x0000000100000b2f <+0118> mov       $0x41200000,%eax  ; push a to eax
0x0000000100000b34 <+0123> mov       %eax,-0x1c(%rbp)  ; move a to the stack
0x0000000100000b37 <+0126> mov       $0x40400000,%eax  ; push b to eax
0x0000000100000b3c <+0131> mov       %eax,-0x20(%rbp)  ; move b to the stack
0x0000000100000b3f <+0134> movss     -0x1c(%rbp),%xmm0 ; move a to xmm0
0x0000000100000b44 <+0139> divss     -0x20(%rbp),%xmm0 ; divide
0x0000000100000b49 <+0144> cvttss2siq %xmm0,%rax       ; convert to int
0x0000000100000b4e <+0149> mov       %eax,-0x24(%rbp)  ; move lower 32 bits to stack
The final point to make here is that just as we cannot use non-float operands for floating point results, we cannot, and this is important, directly alias __m128 vectors onto our integer data. This means we must first convert each uchar value to a float before it can be used.
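A hedged sketch of what that widening could look like with SSE2 only (the helper name is my own, and this is one possible approach, not the code used later in the post): zero-extend the bytes to 16-bit, then to 32-bit integers via unpack-with-zero, and finish with a packed int32-to-float conversion.

```cpp
#include <emmintrin.h>
#include <cstring>
#include <cassert>

// Widen four unsigned chars into a vector of four floats using SSE2.
__m128 uchar4_to_ps(const unsigned char *p)
{
    int raw;
    memcpy(&raw, p, 4);                              // 4 bytes, no aliasing tricks
    __m128i zero  = _mm_setzero_si128();
    __m128i bytes = _mm_cvtsi32_si128(raw);          // bytes land in the low lane
    __m128i w16   = _mm_unpacklo_epi8(bytes, zero);  // 8-bit  -> 16-bit
    __m128i w32   = _mm_unpacklo_epi16(w16, zero);   // 16-bit -> 32-bit
    return _mm_cvtepi32_ps(w32);                     // int32  -> float
}
```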
Implementing Our Code – A Few More Challenges
This conversion discussion brings us to the crux of the task: no matter how I structure the algorithm, we must use floating point numbers for the core conversion calculation. This means we'll be able to do at most one pixel per sub-iteration (though this doesn't mean we can't unroll the loop), as SSE allows for at most 4 single-precision floating point values per instruction. This is fine, but may undermine any potential gains using SSE brings to the table. As our example above shows, we may well be better served by simply unrolling our scalar loop.
Yet another issue: while we start with only 3 data elements per pixel, we must add a fourth, the k channel, based on the min value of the first calculation block. As stated above, one of the major no-no's of SIMD is accessing individual elements of a packed vector. This means that in order to add this new value we'll need to use the shuffle instructions, or at some point grab and manipulate individual values out of the vector. Both options introduce performance penalties.
Finally, we should note that this concept also ties into the idea of data dependencies, which is the notion that some values defined in a computation may depend on other values set at some earlier stage. In our case the algorithm is filled with such dependencies: The k value depends on the first computation step (a true dependency), which in turn creates a control dependency for the next computational step (the if statement), and so on.
The end result is that these data dependencies may limit our ability to shuffle instructions around to eliminate cache misses, unroll the loop, and so on.
To sum up then, we have four general challenges:
- We cannot directly link our source buffer and the __m128 containers, and will also need to use conversions.
- We are limited to 4 values at a time, but can unroll if needed.
- We would like to dynamically access and set vector elements without resorting to non-SIMD instructions.
- Various data dependencies limit how we can actually optimize the algorithm.
The First Benchmark
As a first look at an SSE implementation that offers minor performance improvements, consider this loop, which processes 35 2,500 x 1,800px CMYK JPEG images in 1.5 seconds. The non-SSE version performs the same task in 2 seconds:
// 7 seconds (core loop time around 1.5 seconds)
uchar *bits = imin.bits(); // Qt grabs all pixel data and places in uchar array

__m128 *v1 = (__m128*)bits;
__m128 *v2 = (__m128*)cmyk_temp;

__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);

int j = 0;

for (int i = 0; i < wt * ht; i++) {

    // holder
    float tmp[4] = {bits[j+2], bits[j+1], bits[j], 0};

    // vector containing 4 values
    __m128 *v_tmp = (__m128*)&tmp;

    // divide by 255
    *v_tmp = _mm_div_ps(*v_tmp, m_255);

    // subtract from 1
    *v_tmp = _mm_sub_ps(m_1, *v_tmp);

    // find min
    tmp[3] = qMin(tmp[0], qMin(tmp[1], tmp[2]));
    float tmp_3 = tmp[3]; // cache original value

    if (!qFuzzyCompare(tmp[3], 1)) {

        // subtract original c_k
        *v_tmp = _mm_sub_ps(*v_tmp, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));

        // subtract c_k from 1
        float bk_min = 1.0 - tmp[3];

        // divide (cmy - c_k) by (1.0 - c_k)
        *v_tmp = _mm_div_ps(*v_tmp, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));

        // reset black item
        tmp[3] = tmp_3;
    }

    // create final values
    *v_tmp = _mm_mul_ps(*v_tmp, m_255);

    // store values back into buffer
    cmyk_temp[j]   = fabs(tmp[0]); // 0
    cmyk_temp[j+1] = fabs(tmp[1]); // 1
    cmyk_temp[j+2] = fabs(tmp[2]); // 2
    cmyk_temp[j+3] = fabs(tmp[3]); // 3

    // increment counter
    j += 4;
}
While this code works, the problem is that we rely far too heavily on temporary values and on repeated extraction of single float values from the vectors to get the job done.
We also issue our own conversion instructions via fabs(), which works, but is a bit heavy handed. It would be better to let the compiler choose its own conversion strategy.
A second attempt addresses some of these issues, shaving off a half-second. It looks like:
// 6.5 seconds (core loop time around 1 second)
j = 0;
uchar *bits = imin.bits();

__m128 v2;
__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);

float tmp[4];
float tmp_3;

for (int i = 0; i < wt * ht; i++) {

    v2 = _mm_set_ps(1, bits[j], bits[j+1], bits[j+2]); // rev order!

    // divide by 255
    v2 = _mm_div_ps(v2, m_255);

    // subtract from 1
    v2 = _mm_sub_ps(m_1, v2);

    // store to find min
    _mm_store_ps(tmp, v2);

    // find min (_mm_min_ps)
    tmp_3 = qMin(tmp[0], qMin(tmp[1], tmp[2]));

    if (!qFuzzyCompare(tmp_3, 1)) {

        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));

        // subtract k from 1
        float bk_min = 1.0 - tmp_3;

        // divide (cmy - k) by (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min, bk_min, bk_min, bk_min));
    }

    v2 = _mm_mul_ps(v2, m_255); // multiply all by 255

    _mm_store_ps(tmp, v2); // save back to tmp buffer

    cmyk_temp[j]   = tmp[0]; // c
    cmyk_temp[j+1] = tmp[1]; // m
    cmyk_temp[j+2] = tmp[2]; // y
    cmyk_temp[j+3] = tmp_3 * 255; // k

    j += 4;
}
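The qMin chain above is one of the spots where we drop out of the vector domain. One possible alternative (the function name is my own, and I haven't benchmarked it inside the full loop) keeps the minimum computation in SSE registers, using _mm_min_ps plus two shuffles. For our CMYK case the unused fourth lane would need to be pre-filled with a large value so it cannot win the min.

```cpp
#include <xmmintrin.h>
#include <cassert>

// Horizontal minimum across the four lanes of a vector.
float horizontal_min(__m128 v)
{
    // Fold the vector onto itself twice: pairwise min, then cross min.
    __m128 m = _mm_min_ps(v, _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)));
    m = _mm_min_ps(m, _mm_shuffle_ps(m, m, _MM_SHUFFLE(1, 0, 3, 2)));
    return _mm_cvtss_f32(m); // the lowest lane now holds the overall min
}
```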
This code runs slightly faster, with improvements coming from several locations:
First, we’ve moved unneeded code outside of the loop. This won’t create huge gains, but it does make the inner loop a bit cleaner.
The next big change is that we've eliminated the pointer logic and instead rely on a temporary array to link the __m128 and final buffer values together. We already know this is slower, but again, we have no choice, as we cannot directly alias unsigned char and __m128 items.
All told, and despite the less-than-perfect code, the loop's SSE logic runs quite fast. While the entire process takes around 7 seconds for 35 large images, only 1 second of that total is spent in our core loop. The majority of the time is spent in other libraries such as Little CMS and libjpeg.
Still, the biggest hit we're taking now is the final part of the loop, where we write to the output buffer. As we now know, this means converting our floats back to ints, a costly operation. All told, these 4 lines account for half of the loop's total execution time:
cmyk_temp[j]   = tmp[0];      // c
cmyk_temp[j+1] = tmp[1];      // m
cmyk_temp[j+2] = tmp[2];      // y
cmyk_temp[j+3] = tmp_3 * 255; // k
As discussed previously, each float-to-int conversion takes around 8 cycles, and all these extra instructions really add up. Combine that with the expensive memory access needed to write to the buffer in the first place, and it's not hard to see why this is slow.
How Fast Will It Go?
Of course it's not all bad news. When we enable -O2 optimizations we'll see our execution speed increase for all versions, and of course we can also unroll our loop to provide better utilization of instruction pairing. Finally, SSE provides instructions for bypassing the cache mechanisms of the processor, as well as others for pre-fetching the next set of values if using the cache.
Thus, to see how fast we can make this code we’ll make use of loop unrolling and pre-fetching in this final example.
_mm_prefetch() is issued within the loop with the intention of starting our memory access ahead of time, so that when the next line of data is needed it's read from a cache line rather than main memory. This gives a measurable and consistent speedup of around 50 ms.
It is worth noting, however, that when this code is profiled for L2 cache misses we see some definite saturation going on. Optimizing cache access can be a tricky affair, requiring that we manually check for hot spots in our code to see where we might run out of cache lines.
The general rule is that the working set of our loop should fit within the cache size. The cache size, in turn, depends on the processor. As our loop consumes huge amounts of data relative to the cache, it's important that we don't unroll too far, as:
a) we will run out of registers to hold intermediate values, causing more cache reads (and potential misses)
b) we can overfill the cache lines with raw data if we make the working set too large.
Good tools for viewing cache performance include Shark on OS X, PAPI or Valgrind on Linux, and Visual Studio on Windows.
Finally, we also use the _mm_shuffle_ps instruction to insert the k value into each vector. This eliminates some of the overhead of member-wise access, but is still not perfect. Again, though, as the source and destination buffers are not floats, this is a price we have to pay.
j = 0;
uchar *bits = imin.bits();
__m128 v1, v2, v3, v4;
__m128 k1, k2, k3, k4;
__m128 m_255 = _mm_set_ps1(255.0f);
__m128 m_1 = _mm_set_ps1(1.0f);
float tmp1[4], tmp2[4], tmp3[4], tmp4[4];
float tmp_1, tmp_2, tmp_3, tmp_4;

for ( int i = 0; i < wt * ht; i += 4){
    v1 = _mm_set_ps(1, bits[j],    bits[j+1],  bits[j+2]);  // rev order!
    v2 = _mm_set_ps(1, bits[j+4],  bits[j+5],  bits[j+6]);  // rev order!
    v3 = _mm_set_ps(1, bits[j+8],  bits[j+9],  bits[j+10]); // rev order!
    v4 = _mm_set_ps(1, bits[j+12], bits[j+13], bits[j+14]); // rev order!

    // fetch next data values
    _mm_prefetch(&bits[j+16], _MM_HINT_T0);

    // divide by 255
    v1 = _mm_div_ps(v1, m_255);
    v2 = _mm_div_ps(v2, m_255);
    v3 = _mm_div_ps(v3, m_255);
    v4 = _mm_div_ps(v4, m_255);

    // subtract 1
    v1 = _mm_sub_ps(m_1, v1);
    v2 = _mm_sub_ps(m_1, v2);
    v3 = _mm_sub_ps(m_1, v3);
    v4 = _mm_sub_ps(m_1, v4);

    // store to find min
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);

    // find min (_mm_min_ps)
    tmp_1 = qMin(tmp1[0], qMin(tmp1[1], tmp1[2]));
    tmp_2 = qMin(tmp2[0], qMin(tmp2[1], tmp2[2]));
    tmp_3 = qMin(tmp3[0], qMin(tmp3[1], tmp3[2]));
    tmp_4 = qMin(tmp4[0], qMin(tmp4[1], tmp4[2]));

    if (!qFuzzyCompare(tmp1[3], 1)) {
        // subtract original k
        v1 = _mm_sub_ps(v1, _mm_set_ps(tmp_1, tmp_1, tmp_1, tmp_1));
        // subtract 1 from base k
        float bk_min1 = 1.0 - tmp_1;
        // divide by (cmy - k) - (1.0 - k)
        v1 = _mm_div_ps(v1, _mm_set_ps(bk_min1, bk_min1, bk_min1, bk_min1));
    }

    if (!qFuzzyCompare(tmp2[3], 1)) {
        // subtract original k
        v2 = _mm_sub_ps(v2, _mm_set_ps(tmp_2, tmp_2, tmp_2, tmp_2));
        // subtract 1 from base k
        float bk_min2 = 1.0 - tmp_2;
        // divide by (cmy - k) - (1.0 - k)
        v2 = _mm_div_ps(v2, _mm_set_ps(bk_min2, bk_min2, bk_min2, bk_min2));
    }

    if (!qFuzzyCompare(tmp3[3], 1)) {
        // subtract original k
        v3 = _mm_sub_ps(v3, _mm_set_ps(tmp_3, tmp_3, tmp_3, tmp_3));
        // subtract 1 from base k
        float bk_min3 = 1.0 - tmp_3;
        // divide by (cmy - k) - (1.0 - k)
        v3 = _mm_div_ps(v3, _mm_set_ps(bk_min3, bk_min3, bk_min3, bk_min3));
    }

    if (!qFuzzyCompare(tmp4[3], 1)) {
        // subtract original k
        v4 = _mm_sub_ps(v4, _mm_set_ps(tmp_4, tmp_4, tmp_4, tmp_4));
        // subtract 1 from base k
        float bk_min4 = 1.0 - tmp_4;
        // divide by (cmy - k) - (1.0 - k)
        v4 = _mm_div_ps(v4, _mm_set_ps(bk_min4, bk_min4, bk_min4, bk_min4));
    }

    // store to extract (prevents green cast)
    _mm_store_ps(tmp1, v1);
    _mm_store_ps(tmp2, v2);
    _mm_store_ps(tmp3, v3);
    _mm_store_ps(tmp4, v4);

    k1 = _mm_set_ps(tmp_1, tmp1[2], 0, 0); // reverse
    k2 = _mm_set_ps(tmp_2, tmp2[2], 0, 0); // reverse
    k3 = _mm_set_ps(tmp_3, tmp3[2], 0, 0); // reverse
    k4 = _mm_set_ps(tmp_4, tmp4[2], 0, 0); // reverse

    // add fourth vector back in (k)
    v1 = _mm_shuffle_ps(v1, k1, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v2 = _mm_shuffle_ps(v2, k2, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v3 = _mm_shuffle_ps(v3, k3, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest
    v4 = _mm_shuffle_ps(v4, k4, _MM_SHUFFLE(3, 2, 1, 0)); // source, dest

    // multiply all by 255
    v1 = _mm_mul_ps(v1, m_255);
    v2 = _mm_mul_ps(v2, m_255);
    v3 = _mm_mul_ps(v3, m_255);
    v4 = _mm_mul_ps(v4, m_255);

    // standard store
    _mm_store_ps(tmp1, v1); // save back to tmp buffer
    _mm_store_ps(tmp2, v2); // save back to tmp buffer
    _mm_store_ps(tmp3, v3); // save back to tmp buffer
    _mm_store_ps(tmp4, v4); // save back to tmp buffer

    // mem write
    cmyk_temp[j]    = tmp1[0]; // c
    cmyk_temp[j+1]  = tmp1[1]; // m
    cmyk_temp[j+2]  = tmp1[2]; // y
    cmyk_temp[j+3]  = tmp1[3]; // k
    cmyk_temp[j+4]  = tmp2[0]; // c
    cmyk_temp[j+5]  = tmp2[1]; // m
    cmyk_temp[j+6]  = tmp2[2]; // y
    cmyk_temp[j+7]  = tmp2[3]; // k
    cmyk_temp[j+8]  = tmp3[0]; // c
    cmyk_temp[j+9]  = tmp3[1]; // m
    cmyk_temp[j+10] = tmp3[2]; // y
    cmyk_temp[j+11] = tmp3[3]; // k
    cmyk_temp[j+12] = tmp4[0]; // c
    cmyk_temp[j+13] = tmp4[1]; // m
    cmyk_temp[j+14] = tmp4[2]; // y
    cmyk_temp[j+15] = tmp4[3]; // k

    j += 16;
}
As a baseline, our previous SSE version from the last section runs in 2505 milliseconds, or 1384 ms with -O2 optimizations enabled. This is a noticeable improvement* over the non-SSE version, which runs at 2857 ms unoptimized and 2670 ms with -O2.
* These times are now per core, whereas the previously quoted values were across all 4 cores.
Our new unrolled and prefetched version performs even better, clocking in at 842 ms. Compared to our original non-SSE version this represents around a 3x speedup. Not bad!
The Final Frontier
The last big area of contention now appears to be non-user code in the form of the Little CMS and libjpeg libraries. Specifically, as we can see from the Instruments output below, Little CMS's Eval4Inputs and Qt's call to jpeg_idct_islow are now the biggest performance bottlenecks:
[Instruments time profile]
As you can also see, our SSE-optimized routine (the yellow highlight) is now at 878 ms, faster than Eval4Inputs and jpeg_idct_islow. This was of course not the situation before the SSE optimizations.
What can we do about these slower library calls? For one, there is a faster, freely-available JPEG library called libjpeg-turbo, which uses SSE to significantly speed up the decoding of JPEG files. The best part is that the API is the same as the older libjpeg V6 and V8 versions, so using this faster library is literally a matter of changing our include paths.
In the case of a Qt project, all we need to do is update the INCLUDEPATH and LIBS directives in our .pro file. On my Mac, where libjpeg-turbo provides a convenient installer, the lib and header files were located at:
INCLUDEPATH += /Users/grdinic/Documents/libs/boost_1_44_0
#INCLUDEPATH += /user/local/include/
INCLUDEPATH += /opt/libjpeg-turbo/include/

LIBS += -L/usr/local/lib -ltiff -L/usr/local/lib -llcms2
#LIBS += -L/usr/local/lib -ljpeg
LIBS += -L/opt/libjpeg-turbo/lib64 -lturbojpeg
As you can see, I have commented out the ‘older’ version of the library so the new one is used.
Unfortunately, despite our own user code now using libjpeg-turbo, it appears that Qt is still calling its own version of the jpeg library, as the expanded call stack shows:
[Instruments call stack]
This of course makes it very difficult to realize the speed gains of libjpeg-turbo without rewriting Qt's image-reading routines.
The calls to Eval4Inputs are trickier still, as that would mean rewriting parts of Little CMS. This is not something we'll do in this post, though we may revisit it at some point in the future. If nothing else, the name Eval4Inputs is rather evocative.
Conclusion
All told, as I had feared, the limitations of SSE coding (namely the conversion steps and the cost of accessing single elements of a vector, i.e. member-wise access) meant that the first iteration of the non-optimized SSE code was not appreciably faster than the non-SSE version.
However, as soon as we rewrote the code to make better use of -O2 optimizations, instruction pairing, cache pre-fetching, and loop unrolling, we saw a very solid gain of 3x. Even the non-unrolled version saw an average gain of 1.5-2x.
The bottom line is that with -O2 optimizations on, it would be hard to write SSE code that's slower than non-SSE code, provided we follow the basic rules: stay aware of conversion costs and limit member-wise access.
When you further consider that just about every desktop CPU has at minimum SSE 2 support, the speed and efficiency gained by SSE coding outweighs just about any potential down-side.
Useful Links
http://easycalculation.com/binary-converter.php
http://msdn.microsoft.com/en-us/library/4d3eabky.aspx
http://en.wikipedia.org/wiki/SSE4
http://www.intel.com/products/processor/manuals/index.htm
http://software.intel.com/en-us/articles/avx-emulation-header-file/
http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00145.html