A couple of years ago I wrote an article about the Curiously Recurring Template Pattern in C++, focusing on the motivation behind it and how to implement it.
That article mentioned runtime performance as the main reason for employing CRTP instead of the more traditional runtime polymorphism (dispatch via virtual functions). While some rationale for the cost of virtual calls was given, I didn't go too deep into it. Today I want to fix that by carefully analyzing the performance of virtual calls as opposed to the static calls made possible by CRTP.
Benchmarking in 2013 is really hard. Today's CPUs are super-pipelined, branch-predicting out-of-order executing beasts. The memory hierarchy is very deep and the caches have complex behavior. All of this makes detailed performance analysis devilishly complex, and the results are sometimes baffling. We're clearly long past counting MIPS. Add to that overly-clever optimizing compilers that occasionally produce not quite the code you expected, and it's apparent why so many online resources and articles provide bad benchmarks.
So any benchmarks need to be taken with a large grain of salt, including the ones posted here. Personally, I try to validate the benchmarks I'm running by attacking them with the scientific method:

1. Create a hypothesis about the relative speed of the two approaches.
2. Study the machine code the compiler generates for each.
3. Based on that code, predict the numbers the measurement should produce, then run the benchmark.
4. If the measurements don't match the prediction, dig in until they do and the full picture emerges.
The previous article listed the following components in the runtime cost of virtual calls:

- Extra indirection (pointer dereference) for each call to a virtual method.
- Virtual methods usually can't be inlined, which may be a significant cost hit for some small methods.
- An additional pointer (the vptr) per object; on 64-bit systems this is 8 bytes per object.
While the third component can definitely play a role in some scenarios (e.g. a lot of small objects, where the additional memory means fewer of them fit into the L1 data cache), I'll focus on the first two in this article, because they are easier to expose in a simple synthetic benchmark.
There's a plethora of uses for polymorphism in C++. Here I'll focus on a basic one that will let me expose the performance characteristics of virtual calls. I'll define a simple interface with a couple of methods and one implementation of it:
```cpp
class DynamicInterface {
public:
  virtual void tick(uint64_t n) = 0;
  virtual uint64_t getvalue() = 0;
};

class DynamicImplementation : public DynamicInterface {
  uint64_t counter;

public:
  DynamicImplementation()
    : counter(0) {
  }

  virtual void tick(uint64_t n) {
    counter += n;
  }

  virtual uint64_t getvalue() {
    return counter;
  }
};
```
The following code runs the actual benchmark:
```cpp
const unsigned N = 40000;

void run_dynamic(DynamicInterface* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
```
What this does is simply invoke the virtual method tick on the base pointer obj, on the order of N^2 times.
The alternative statically-polymorphic implementation is this:
```cpp
template <typename Implementation>
class CRTPInterface {
public:
  void tick(uint64_t n) {
    impl().tick(n);
  }

  uint64_t getvalue() {
    return impl().getvalue();
  }

private:
  Implementation& impl() {
    return *static_cast<Implementation*>(this);
  }
};

class CRTPImplementation : public CRTPInterface<CRTPImplementation> {
  uint64_t counter;

public:
  CRTPImplementation()
    : counter(0) {
  }

  void tick(uint64_t n) {
    counter += n;
  }

  uint64_t getvalue() {
    return counter;
  }
};

template <typename Implementation>
void run_crtp(CRTPInterface<Implementation>* obj) {
  for (unsigned i = 0; i < N; ++i) {
    for (unsigned j = 0; j < i; ++j) {
      obj->tick(j);
    }
  }
}
```
Now let's spend some time studying the machine code generated by gcc -O2 (version 4.8) from the code above. The code for DynamicImplementation::tick is very compact:
```
0000000000400cf0 <_ZN21DynamicImplementation4tickEm>:
  400cf0:       add    %rsi,0x8(%rdi)
  400cf4:       retq
```
To understand what this means, some familiarity with the Itanium C++ ABI is required. The ABI in this case mandates both the name mangling that produces the weird symbol name and the layout of the object in memory, which determines how its fields are accessed. Here's a short description for the code above:
Since DynamicInterface has virtual methods, the class hierarchy it begets comes with a virtual method table, a pointer to which resides in each object. This is the way the compiler arranges for the runtime code to call the correct method when an actual object is used. The address of the virtual method table (vptr) is in the beginning of the object, and the actual class members come afterwards. So counter lives at offset 8 in DynamicImplementation objects.
```
add %rsi,0x8(%rdi)
```
%rdi is the first argument to tick, which is the hidden this pointer - the address of the object. Hence 0x8(%rdi) is the address of this->counter. The instruction, then, adds n (passed in %rsi according to the calling convention) to this->counter.
By the way, whenever you're curious about object layouts and want to verify your understanding of the ABI, I find Clang's ability to dump the class record layouts very helpful. In this case:
```
*** Dumping AST Record Layout
   0 | class DynamicImplementation
   0 |   class DynamicInterface (primary base)
   0 |     (DynamicInterface vtable pointer)
   8 |   uint64_t counter
     | [sizeof=16, dsize=16, align=8
     |  nvsize=16, nvalign=8]

*** Dumping AST Record Layout
   0 | class CRTPImplementation
   0 |   class CRTPInterface<class CRTPImplementation> (base) (empty)
   0 |   uint64_t counter
     | [sizeof=8, dsize=8, align=8
     |  nvsize=8, nvalign=8]
```
On to the invocation of tick now. This is the disassembly for run_dynamic, annotated with comments:
```
0000000000400c10 <_Z11run_dynamicP16DynamicInterface>:

  400c10:       push   %r13
  400c12:       mov    $0x1,%r13d
  400c18:       push   %r12
        // r12d holds i, initialized to 0
  400c1a:       xor    %r12d,%r12d
  400c1d:       push   %rbp
        // Place obj in %rbp
  400c1e:       mov    %rdi,%rbp
  400c21:       push   %rbx
  400c22:       sub    $0x8,%rsp
  400c26:       nopw   %cs:0x0(%rax,%rax,1)

  400c30:       test   %r12d,%r12d
        // when i is 0, the body of the loop won't run, so increment
        // both i and j and try again.
  400c33:       je     400c5e
        // rbx holds j, initialized to 0
  400c35:       xor    %ebx,%ebx
  400c37:       nopw   0x0(%rax,%rax,1)
        // Place the address of obj's vtable in rax
  400c40:       mov    0x0(%rbp),%rax
        // j is the second argument of tick
  400c44:       mov    %rbx,%rsi
        // j++
  400c47:       add    $0x1,%rbx
        // obj is the first argument of tick ('this' pointer)
  400c4b:       mov    %rbp,%rdi
        // tick is the first entry in the vtable.
        // This calls obj->tick(obj, j)
  400c4e:       callq  *(%rax)
        // Compare j < i and perform inner loop
  400c50:       cmp    %ebx,%r12d
  400c53:       ja     400c40
        // Compare i == 40000 and perform outer loop
  400c55:       cmp    $0x9c40,%r13d
  400c5c:       je     400c68
  400c5e:       add    $0x1,%r13d
  400c62:       add    $0x1,%r12d
  400c66:       jmp    400c30
  400c68:       add    $0x8,%rsp
  400c6c:       pop    %rbx
  400c6d:       pop    %rbp
  400c6e:       pop    %r12
  400c70:       pop    %r13
  400c72:       retq
  400c73:       data32 data32 data32 nopw %cs:0x0(%rax,%rax,1)
```
The interesting parts here are:

- How obj->tick is actually invoked: the address of obj's vtable is loaded from the object, and the call goes indirectly through the vtable's first slot (callq *(%rax)) - a double indirection to reach tick.
- The arguments have to be placed in registers anew for every call: obj into %rdi (the this pointer) and j into %rsi.
- The vtable pointer is re-loaded from memory on each inner-loop iteration, even though obj never changes.
Now it's time to disassemble the equivalent code that uses CRTP for static polymorphism. Again, we'll want to start with CRTPImplementation::tick, but we won't find it in the disassembly because it was fully inlined into run_crtp. The compiler was able to inline it because it could know statically (at compile time) which method is called. Such inlining is an important tenet of the "zero-cost abstractions" philosophy of modern C++.
Let's go straight to run_crtp, then:
```
0000000000400d00 <_Z8run_crtpI18CRTPImplementationEvP13CRTPInterfaceIT_E>:
        // Place obj->counter into rdx
  400d00:       mov    (%rdi),%rdx
  400d03:       mov    $0x1,%esi
        // rcx holds i, initialized to 0
  400d08:       xor    %ecx,%ecx
  400d0a:       nopw   0x0(%rax,%rax,1)
  400d10:       test   %ecx,%ecx
  400d12:       je     400d36
        // rax holds j, initialized to 0
  400d14:       xor    %eax,%eax
  400d16:       nopw   %cs:0x0(%rax,%rax,1)
        // counter += j
  400d20:       add    %rax,%rdx
        // j++ and perform inner loop
  400d23:       add    $0x1,%rax
  400d27:       cmp    %eax,%ecx
  400d29:       ja     400d20
  400d2b:       cmp    $0x9c40,%esi
        // when we're done, put the final value back into obj->counter
  400d31:       mov    %rdx,(%rdi)
  400d34:       je     400d3e
  400d36:       add    $0x1,%esi
  400d39:       add    $0x1,%ecx
  400d3c:       jmp    400d10
  400d3e:       repz retq
```
It's not hard to see we'd expect this code to run much faster, for two main reasons:

1. Since tick was inlined, there's no call in the inner loop at all - not even a direct one; the addition happens right in place.
2. The compiler is free to keep counter in a register (%rdx) for the entire run, loading it from the object once at the start and storing it back once at the end, instead of touching memory on every iteration.
As expected, the CRTP approach is much faster. The benchmark above takes 1.25 seconds on my i7-4771 CPU for run_dynamic and 0.21 seconds for run_crtp. This is a huge difference, and it's much larger than I expected. I was looking for a 2x boost, not 6x. So here comes the 4th bullet of the benchmarking methodology I outlined above. Let's look more carefully at the numbers.
I'll start by producing a trace of the inner loop for both cases, to see the sequence of instructions executed. Since the loop is short, this can be done easily by reading the disassembly, and also verified with gdb by stepping through the execution for a few iterations.
Here is the inner loop for run_dynamic:
```
400c40:       mov    0x0(%rbp),%rax
400c44:       mov    %rbx,%rsi
400c47:       add    $0x1,%rbx
400c4b:       mov    %rbp,%rdi
400c4e:       callq  *(%rax) ... calls tick
    400ce0:       add    %rsi,0x8(%rdi)
    400ce4:       retq
400c50:       cmp    %ebx,%r12d
400c53:       ja     400c40
```
How many times would we expect it to run? The double loop has a simple summing pattern, so we can compute that the count is in the vicinity of N/2 * N, which in our case means 800e6 (800 million times).
Since the loop above is 9 instructions long, it means 7.2e9 instructions total. Let's look at detailed perf stat numbers for this run:
```
Performance counter stats for 'build/vcall-benchmark d':

      1253.807247 task-clock                #    0.999 CPUs utilized
              107 context-switches          #    0.085 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.254 K/sec
    4,807,848,980 cycles                    #    3.835 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    7,203,771,146 instructions              #    1.50  insns per cycle
    2,400,716,784 branches                  # 1914.742 M/sec
           58,358 branch-misses             #    0.00% of all branches

      1.255560284 seconds time elapsed
```
Indeed, the number of instructions fits our expectation.
Now, let's turn to run_crtp. Its inner loop is this:
```
400d20:       add    %rax,%rdx
400d23:       add    $0x1,%rax
400d27:       cmp    %eax,%ecx
400d29:       ja     400d20
```
So only 4 instructions. In other words, we'd expect the total number of instructions executed to be in the area of 3.2e9. Let's see:
```
Performance counter stats for 'build/vcall-benchmark c':

       215.919352 task-clock                #    0.997 CPUs utilized
               18 context-switches          #    0.083 K/sec
                0 cpu-migrations            #    0.000 K/sec
              318 page-faults               #    0.001 M/sec
      809,355,502 cycles                    #    3.748 GHz
  <not supported> stalled-cycles-frontend
  <not supported> stalled-cycles-backend
    3,202,645,106 instructions              #    3.96  insns per cycle
      800,522,521 branches                  # 3707.507 M/sec
           53,684 branch-misses             #    0.01% of all branches

      0.216596060 seconds time elapsed
```
Bingo!
But wait, a 2.25x difference in instruction count should not have translated to a 6x difference in runtime, right? Note the number of branches, though. While the CRTP run has one branch per inner loop iteration, the numbers for the dynamic run show 3 branches per inner loop iteration (for a total of 2.4e9). What gives?
The CPU considers indirect calls and returns as branches for this purpose, and if you think about it, this makes sense. An indirect branch or return transfers control to a location the CPU cannot determine statically (unlike a direct call, for instance) - it depends on the contents of registers & stack. So the CPU doesn't know where to fetch instructions ahead of time in order to satisfy its eternally hungry super-pipeline. True, the branch predictor alleviates most of that cost, but such instructions are still more expensive for the CPU than, say, simple adds, because they cannot pump through the pipeline as quickly.
Moreover, the call and ret instructions push data to and pop it from the stack, which resides in memory. It's almost certainly in L1 cache, but that's still more expensive to access than registers.
Vigilant readers might have noticed that I did not set the highest optimization level of gcc for this benchmark. This was done on purpose, to make the results simpler to explain.
When compiled with -O3, the dynamic version runs as before (and the code produced for it is the same), but the CRTP version runs even faster and finishes within 0.17 seconds, which is 7.2x faster than the dynamic version.
The extra boost comes from auto-vectorization. When one looks at the code produced by the compiler for run_crtp, one can see SIMD instructions in there. The inner loop was unrolled 4x and the operations are performed on whole quad-words, combining several inner loop iterations at a time.
So this is an example where previous optimizations (inlining) enabled the compiler to apply even more advanced optimizations such as vectorization to make the code faster yet.
It's also interesting to build the benchmark with -fno-inline and compare the results. Curiously, in this case the CRTP approach runs 1.5x slower than virtual calls. Before you read on, can you guess why?
The reason is quite simple. Note that for proper CRTP, the interface class implements the interface methods and calls through to the implementation. So to actually invoke tick, run_crtp calls:

- CRTPInterface<CRTPImplementation>::tick, which calls
- CRTPInterface<CRTPImplementation>::impl, and then calls
- CRTPImplementation::tick
This is a lot of calls, which all have to be executed when the inliner is turned off. When it's turned on, all of these calls get inlined and the actual instructions of the leaf call are embedded into run_crtp.
There are two lessons here:

1. Be careful how you benchmark: compilation flags such as inlining can completely reverse the results.
2. CRTP's performance advantage rests entirely on inlining; the extra forwarding calls it introduces make it slower than virtual dispatch when the inliner is off.
A brand new optimization that I recently heard about is devirtualization. The idea is to find cases of dynamic dispatch where the actual type at a given call site can be proven to always be known at compile time, and specialize those call sites to dispatch statically. This carries the promise of making virtual calls as fast as static dispatch in some special cases.
While this definitely sounds interesting, at the time of writing this article devirtualization is still experimental (support in gcc started trickling in with version 4.7). In any case, the example examined in this article is probably simple enough to trigger the optimization, but as you can see it didn't happen, even though the -fdevirtualize flag should be on in gcc with optimization levels -O2 and -O3. It will be interesting to follow the development of this optimization and see what cases of virtual calls it can detect and optimize in the future.
There are a lot of lessons to be learned here, so I'll just list them in an arbitrary order:

- Benchmarks have to be validated with something like the scientific method: predict the numbers you expect, measure them, and investigate any discrepancy.
- The cost of a virtual call is more than the extra indirection: it defeats inlining, and indirect calls are harder on the CPU's pipeline than direct ones.
- Inlining is what enables further optimizations such as auto-vectorization; it underlies the zero-cost abstractions modern C++ promises.
- Compilers keep getting smarter; devirtualization may eventually narrow the gap for some virtual call sites.
This is a degenerate use of CRTP, for sure. It's not here to be realistic - just to demonstrate the same mechanism used in a simple scenario. See the previous article for a more use-focused discussion of CRTP.
These numbers depend on the CPU, of course. When I tried the same benchmark on a Xeon E5-2690 (Sandy Bridge) with gcc 4.6.3 (same code generated), the speed difference is just 3x (0.46 vs 1.39 sec).