A couple of years ago I wrote an article about the Curiously Recurring Template Pattern in C++, focusing on the motivation behind it and how to implement it.
That article mentioned runtime performance as the main reason for employing CRTP instead of the more traditional runtime polymorphism (dispatch via virtual functions). While some rationale for the cost of virtual calls was given, I didn't go too deep into it. Today I want to fix that by carefully analyzing the performance of virtual calls as opposed to the static calls made possible by CRTP.
Benchmarking in 2013 is really hard. Today's CPUs are super-pipelined, branch-predicting out-of-order executing beasts. The memory hierarchy is very deep and the caches have complex behavior. All of this makes detailed performance analysis devilishly complex, and the results are sometimes baffling. We're clearly long past counting MIPS. Add to that overly-clever optimizing compilers that occasionally produce not quite the code you expected, and it's apparent why so many online resources and articles provide bad benchmarks.
So any benchmarks need to be taken with a large grain of salt, including the ones posted here. Personally, I try to validate the benchmarks I'm running by attacking them with the scientific method:

1. Create a hypothesis about the relative speed of the two approaches.
2. Study the machine code the compiler generates for each.
3. Based on that code, predict the numbers the measurement should produce, then run the benchmark.
4. If the measurements don't match the prediction, dig in until they do and the full picture emerges.
The previous article listed the following components in the runtime cost of virtual calls:

- Extra indirection (pointer dereference) for each call to a virtual method.
- Virtual methods usually can't be inlined, which may be a significant cost hit for some small methods.
- An additional pointer (the vptr) per object; on 64-bit systems this is 8 bytes per object.
While the third component can definitely play a role in some scenarios (e.g. a lot of small objects, where the additional memory means fewer of them fit into the L1 data cache), I'll focus on the first two in this article, because they are easier to expose in a simple synthetic benchmark.
There's a plethora of uses for polymorphism in C++. Here I'll focus on a basic one that will let me expose the performance characteristics of virtual calls. I'll define a simple interface with a couple of methods and one implementation of it:
```cpp
class DynamicInterface {
public:
  virtual void tick(uint64_t n) = 0;
  virtual uint64_t getvalue() = 0;
};

class DynamicImplementation : public DynamicInterface {
  uint64_t counter;

public:
  DynamicImplementation()
    : counter(0) {
  }

  virtual void tick(uint64_t n) {
    counter += n;
  }

  virtual uint64_t getvalue() {
    return counter;
  }
};
```
The following code runs the actual benchmark:
```cpp
const unsigned N = 40000;

void run_dynamic(DynamicInterface* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
```
What this does is simply invoke the virtual method tick on the base pointer obj, on the order of N^2 times.
The alternative statically-polymorphic implementation is this:
```cpp
template <typename Implementation>
class CRTPInterface {
public:
  void tick(uint64_t n) {
    impl().tick(n);
  }

  uint64_t getvalue() {
    return impl().getvalue();
  }

private:
  Implementation& impl() {
    return *static_cast<Implementation*>(this);
  }
};

class CRTPImplementation : public CRTPInterface<CRTPImplementation> {
  uint64_t counter;

public:
  CRTPImplementation()
    : counter(0) {
  }

  void tick(uint64_t n) {
    counter += n;
  }

  uint64_t getvalue() {
    return counter;
  }
};

template <typename Implementation>
void run_crtp(CRTPInterface<Implementation>* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
```
Now let's spend some time studying the machine code generated by gcc -O2 (version 4.8) from the code above. The code for DynamicImplementation::tick is very compact:
```
0000000000400cf0 <_ZN21DynamicImplementation4tickEm>:
  400cf0:       add    %rsi,0x8(%rdi)
  400cf4:       retq
```
To understand what this means, some familiarity with the Itanium C++ ABI is required. The ABI in this case mandates both the name mangling that produces the weird symbol name and the layout of the object in memory, which determines how its fields are accessed. Here's a short description for the code above:
Since DynamicInterface has virtual methods, the class hierarchy it begets comes with a virtual method table, a pointer to which resides in each object. This is the way the compiler arranges for the runtime code to call the correct method when an actual object is used. The address of the virtual method table (vptr) is in the beginning of the object, and the actual class members come afterwards. So counter lives at offset 8 in DynamicImplementation objects.
```
add %rsi,0x8(%rdi)
```
%rdi is the first argument to tick, which is the hidden this pointer - the address of the object. Hence 0x8(%rdi) is the address of this->counter. The instruction, then, adds n (passed in %rsi according to the calling convention) to this->counter.
By the way, whenever you're curious about object layouts and want to verify your understanding of the ABI, I find Clang's ability to dump the class record layouts very helpful. In this case:
```
*** Dumping AST Record Layout
   0 | class DynamicImplementation
   0 |   class DynamicInterface (primary base)
   0 |     (DynamicInterface vtable pointer)
   8 |   uint64_t counter
     | [sizeof=16, dsize=16, align=8
     |  nvsize=16, nvalign=8]

*** Dumping AST Record Layout
   0 | class CRTPImplementation
   0 |   class CRTPInterface<class CRTPImplementation> (base) (empty)
   0 |   uint64_t counter
     | [sizeof=8, dsize=8, align=8
     |  nvsize=8, nvalign=8]
```
On to the invocation of tick now. This is the disassembly for run_dynamic, annotated with comments:
```
0000000000400c10 <_Z11run_dynamicP16DynamicInterface>:

  400c10:       push   %r13
  400c12:       mov    $0x1,%r13d
  400c18:       push   %r12
        // r12d holds i, initialized to 0
  400c1a:       xor    %r12d,%r12d
  400c1d:       push   %rbp
        // Place obj in %rbp
  400c1e:       mov    %rdi,%rbp
  400c21:       push   %rbx
  400c22:       sub    $0x8,%rsp
  400c26:       nopw   %cs:0x0(%rax,%rax,1)

  400c30:       test   %r12d,%r12d
        // when i is 0, the body of the loop won't run, so increment
        // both i and j and try again.
  400c33:       je     400c5e
        // rbx holds j, initialized to 0
  400c35:       xor    %ebx,%ebx
  400c37:       nopw   0x0(%rax,%rax,1)
        // Place the address of obj's vtable in rax
  400c40:       mov    0x0(%rbp),%rax
        // j is the second argument of tick
  400c44:       mov    %rbx,%rsi
        // j++
  400c47:       add    $0x1,%rbx
        // obj is the first argument of tick ('this' pointer)
  400c4b:       mov    %rbp,%rdi
        // tick is the first entry in the vtable.
        // This calls obj->tick(obj, j)
  400c4e:       callq  *(%rax)
        // Compare j < i and perform inner loop
  400c50:       cmp    %ebx,%r12d
  400c53:       ja     400c40
        // Compare i == 40000 and perform outer loop
  400c55:       cmp    $0x9c40,%r13d
  400c5c:       je     400c68
  400c5e:       add    $0x1,%r13d
  400c62:       add    $0x1,%r12d
  400c66:       jmp    400c30
  400c68:       add    $0x8,%rsp
  400c6c:       pop    %rbx
  400c6d:       pop    %rbp
  400c6e:       pop    %r12
  400c70:       pop    %r13
  400c72:       retq
  400c73:       data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
```
The interesting parts here are:

- How obj->tick is actually invoked: the address of obj's vtable is loaded from the object, and the call goes indirectly through the vtable's first slot (callq *(%rax)) - a double indirection to reach tick.
- The arguments have to be placed in registers anew for every call: obj into %rdi (the this pointer) and j into %rsi.
- The vtable pointer is re-loaded from memory on each inner-loop iteration, even though obj never changes.
Now it's time to disassemble the equivalent code that uses CRTP for static polymorphism. Again, we'll want to start with CRTPImplementation::tick, but we won't find it in the disassembly because it was fully inlined into run_crtp. The compiler was able to inline it because it could know statically (at compile time) which method is called. Such inlining is an important tenet of the "zero-cost abstractions" philosophy of modern C++.
Let's go straight to run_crtp, then:
```
0000000000400d00 <_Z8run_crtpI18CRTPImplementationEvP13CRTPInterfaceIT_E>:
        // Place obj->counter into rdx
  400d00:       mov    (%rdi),%rdx
  400d03:       mov    $0x1,%esi
        // rcx holds i, initialized to 0
  400d08:       xor    %ecx,%ecx
  400d0a:       nopw   0x0(%rax,%rax,1)
  400d10:       test   %ecx,%ecx
  400d12:       je     400d36
        // rax holds j, initialized to 0
  400d14:       xor    %eax,%eax
  400d16:       nopw   %cs:0x0(%rax,%rax,1)
        // counter += j
  400d20:       add    %rax,%rdx
        // j++ and perform inner loop
  400d23:       add    $0x1,%rax
  400d27:       cmp    %eax,%ecx
  400d29:       ja     400d20
  400d2b:       cmp    $0x9c40,%esi
        // when we're done, put the final value back into obj->counter
  400d31:       mov    %rdx,(%rdi)
  400d34:       je     400d3e
  400d36:       add    $0x1,%esi
  400d39:       add    $0x1,%ecx
  400d3c:       jmp    400d10
  400d3e:       repz retq
```
It's not hard to see we'd expect this code to run much faster, for two main reasons:

1. Since tick was inlined, there's no call in the inner loop at all - not even a direct one; the addition happens right in place.
2. The compiler is free to keep counter in a register (%rdx) for the entire run, loading it from the object once at the start and storing it back once at the end, instead of touching memory on every iteration.
As expected, the CRTP approach is much faster. The benchmark above takes 1.25 seconds on my i7-4771 CPU for run_dynamic and 0.21 seconds for run_crtp. This is a huge difference, and it's much larger than I expected. I was looking for a 2x boost, not 6x. So here comes the 4th bullet of the benchmarking methodology I outlined above. Let's look more carefully at the numbers.
I'll start by producing a trace of the inner loop for both cases, to see the sequence of instructions executed. Since the loop is short, this can be done easily by reading the disassembly, and also verified with gdb by stepping through the execution for a few iterations.
Here is the inner loop for run_dynamic:
```
400c40:       mov    0x0(%rbp),%rax
400c44:       mov    %rbx,%rsi
400c47:       add    $0x1,%rbx
400c4b:       mov    %rbp,%rdi
400c4e:       callq  *(%rax) ... calls tick
    400ce0:       add    %rsi,0x8(%rdi)
    400ce4:       retq
400c50:       cmp    %ebx,%r12d
400c53:       ja     400c40
```
How many times would we expect it to run? The double loop has a simple summing pattern, so we can compute that the count is in the vicinity of N/2 * N, which in our case means 800e6 (800 million times).
Since the loop above is 9 instructions long, it means 7.2e9 instructions total. Let's look at detailed perf stat numbers for this run:
```
Performance counter stats for 'build/vcall-benchmark d':

      1253.807247 task-clock                #    0.999 CPUs utilized
              107 context-switches          #    0.085 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.254 K/sec
    4,807,848,980 cycles                    #    3.835 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    7,203,771,146 instructions              #    1.50  insns per cycle
    2,400,716,784 branches                  # 1914.742 M/sec
           58,358 branch-misses             #    0.00% of all branches

      1.255560284 seconds time elapsed
```
Indeed, the number of instructions fits our expectation.
Now, let's turn to run_crtp. Its inner loop is this:
```
400d20:       add    %rax,%rdx
400d23:       add    $0x1,%rax
400d27:       cmp    %eax,%ecx
400d29:       ja     400d20
```
So only 4 instructions. In other words, we'd expect the total number of instructions executed to be in the area of 3.2e9. Let's see:
```
Performance counter stats for 'build/vcall-benchmark c':

       215.919352 task-clock                #    0.997 CPUs utilized
               18 context-switches          #    0.083 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.001 M/sec
      809,355,502 cycles                    #    3.748 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    3,202,645,106 instructions              #    3.96  insns per cycle
      800,522,521 branches                  # 3707.507 M/sec
           53,684 branch-misses             #    0.01% of all branches

      0.216596060 seconds time elapsed
```
Bingo!
But wait, a 2.25x difference in instruction count should not have translated to a 6x difference in runtime, right? Note the number of branches, though. While the CRTP run has one branch per inner loop iteration, the numbers for the dynamic run show 3 branches per inner loop iteration (for a total of 2.4e9). What gives?
The CPU considers indirect calls and returns as branches for this purpose, and if you think about it, this makes sense. An indirect branch or return transfers control to a location the CPU cannot determine statically (unlike a direct call, for instance) - it depends on the contents of registers & stack. So the CPU doesn't know where to fetch instructions ahead of time in order to satisfy its eternally hungry super-pipeline. True, the branch predictor alleviates most of that cost, but such instructions are still more expensive for the CPU than, say, simple adds, because they cannot pump through the pipeline as quickly.
Moreover, the call and ret instructions push data to and pop it from the stack, which resides in memory. It's almost certainly in L1 cache, but that's still more expensive to access than registers.
Vigilant readers might have noticed that I did not set the highest optimization level of gcc for this benchmark. This was done on purpose, to make the results simpler to explain.
When compiled with -O3, the dynamic version runs as before (and the code produced for it is the same), but the CRTP version runs even faster and finishes within 0.17 seconds, which is 7.2x faster than the dynamic version.
The extra boost comes from auto-vectorization. When one looks at the code produced by the compiler for run_crtp, one can see SIMD instructions in there. The inner loop was unrolled 4x and the operations are performed on whole quad-words, combining several inner loop iterations at a time.
So this is an example where previous optimizations (inlining) enabled the compiler to apply even more advanced optimizations such as vectorization to make the code faster yet.
It's also interesting to build the benchmark with -fno-inline and compare the results. Curiously, in this case the CRTP approach runs 1.5x slower than virtual calls. Before you read on, can you guess why?
The reason is quite simple. Note that for proper CRTP, the interface class implements the interface methods and calls through to the implementation. So to actually invoke tick, run_crtp calls:

- CRTPInterface<CRTPImplementation>::tick, which calls
- CRTPInterface<CRTPImplementation>::impl, and then calls
- CRTPImplementation::tick
This is a lot of calls, which all have to be executed when the inliner is turned off. When it's turned on, all of these calls get inlined and the actual instructions of the leaf call are embedded into run_crtp.
There are two lessons here:

1. Be careful how you benchmark: compilation flags such as inlining can completely reverse the results.
2. CRTP's performance advantage rests entirely on inlining; the extra forwarding calls it introduces make it slower than virtual dispatch when the inliner is off.
A brand new optimization that I recently heard about is devirtualization. The idea is to find cases of dynamic dispatch where the actual type at a given call site can be proven to always be known at compile time, and specialize those call sites to dispatch statically. This carries the promise of making virtual calls as fast as static dispatch in some special cases.
While this definitely sounds interesting, at the time of writing this article devirtualization is still experimental (support in gcc started trickling in with version 4.7). In any case, the example examined in this article is probably simple enough to trigger the optimization, but as you can see it didn't happen, even though the -fdevirtualize flag should be on in gcc with optimization levels -O2 and -O3. It will be interesting to follow the development of this optimization and see what cases of virtual calls it can detect and optimize in the future.
There are a lot of lessons to be learned here, so I'll just list them in an arbitrary order:

- Benchmarks have to be validated with something like the scientific method: predict the numbers you expect, measure them, and investigate any discrepancy.
- The cost of a virtual call is more than the extra indirection: it defeats inlining, and indirect calls are harder on the CPU's pipeline than direct ones.
- Inlining is what enables further optimizations such as auto-vectorization; it underlies the zero-cost abstractions modern C++ promises.
- Compilers keep getting smarter; devirtualization may eventually narrow the gap for some virtual call sites.
This is a degenerate use of CRTP, for sure. It's not here to be realistic - just to demonstrate the same mechanism used in a simple scenario. See the previous article for a more use-focused discussion of CRTP.
These numbers depend on the CPU, of course. When I tried the same benchmark on a Xeon E5-2690 (Sandy Bridge) with gcc 4.6.3 (same code generated), the speed difference is just 3x (0.46 vs 1.39 sec).