
Weight Gain and Perf Loss

Tue Sep 5, 2023

I recently ran into an incredible development puzzle, the kind I had not seen in a very long time: through an unfortunate series of events, my C++ application suffered a 5% performance loss from the removal of a single line, in the form of #include <utils.hpp>.

Hold on: as it turned out, the higher processing time was only due to an increase in the size of the binary. But why did the size increase when the include directive was removed? Grab a cup of your favorite beverage, here comes the full investigation.

Side effect by inclusion

Facing an unexpected behavior with no visible cause, we must start by narrowing down the possibilities and isolating the change that introduced the regression. There is no secret recipe here: I took the 42 files (not kidding) I had modified and restored them by bisection until the problem disappeared. Let’s say the problematic file was bar.cpp.

The only change in this file was the removal of the inclusion of utils.hpp, you know, the kind of file that includes many things in order to reach the highest possible coupling. When I restored the inclusion the problem disappeared, but since there was no direct link between utils.hpp and bar.cpp, I removed one level of indirection by replacing this inclusion with another one, previously pulled in indirectly via utils.hpp. Let’s call this other header base.hpp. The performance is still good with this change. Nice, it makes sense.

Seeing that nothing in base.hpp could explain the performance loss, and since this header also includes many other headers, I took a shortcut and looked at the code actually processed by the compiler once the preprocessor has been applied:

  1. Retrieve the command used to compile bar.cpp.o with ninja -v. It looks like g++ -I… -f… -D… -o …/bar.cpp.o -c …/bar.cpp
  2. Remove the output file and add the -E flag to display the preprocessor output: g++ -I… -f… -D… -E -c …/bar.cpp.

I kept the preprocessor output with and without the inclusion of base.hpp and passed both to diff. There were many differences. Trying to get a hint from the source, I went back into bar.cpp and saw that it uses std::abs(int). I focused on this function and observed that the preprocessed code with base.hpp contains many definitions of abs. One of them comes from cstdlib (i.e. the function from the C library):

inline long
abs(long __i) { return __builtin_labs(__i); }

The other definition is the one from cmath (i.e. the function from the STL):

template<typename _Tp>
inline constexpr
typename __gnu_cxx::__enable_if<__is_integer<_Tp>::__value,
                                    double>::__type
abs(_Tp __x)
{ return __builtin_fabs(__x); }

As you can see, the version of std::abs provided by GCC 4.8 uses fabs even when the argument is an integer. Consequently, there are two conversions: int -> double, then double -> int. It is only when cstdlib is included that the overload taking a long argument is available and preferred over the template.
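
To make the effect concrete, here is a minimal sketch of the call-site behavior described above; the calling function is made up for illustration, only the include pattern matters:

// Minimal sketch, assuming a GCC 4.8 era libstdc++. The function is hypothetical.
#include <cmath>
// #include <cstdlib> // uncomment to make the integer overloads of std::abs visible

int manhattan(int x1, int y1, int x2, int y2)
{
  // Without cstdlib, each call resolves to the template returning double:
  // an int -> double conversion, a fabs, then a double -> int conversion.
  return std::abs(x1 - x2) + std::abs(y1 - y2);
}

With the cstdlib line uncommented, the same calls bind to the integer overloads and no floating-point round trip is emitted.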

At this point I have to make it clear that the fault does not originate from GCC but rather from an ambiguity in the C++ standard which was resolved at the time of GCC 7 (cf. LWG 2192 and LWG 2294). Unlucky me, I am stuck with GCC 4.8 :'(

Confirming the cause

In order to confirm that the problem arises from this header, I replaced the inclusion of base.hpp with cstdlib in bar.cpp. The performance is still good \o/

So, I have isolated the problem to a single header; now I have to confirm that it is indeed due to the calls to std::abs. For this, I begin by comparing the assembly of the program with and without the inclusion of cstdlib. Note that producing a diff-able form of a disassembled program is not straightforward, most notably because the addresses used by jump and call instructions change as soon as an instruction is added. To work around this issue I tinkered with sed:

objdump --demangle \
        --disassemble \
        --no-show-raw-insn \
        -M intel \
        my_program \
    | sed 's/ \+#.\+$//' \
    | sed 's/0x[a-f0-9]\+/HEX/g' \
    | sed 's/\(\(call\|j..\) \+\)[0-9a-f]\+/\1HEX/' \
    | sed 's/^\([ ]\+\)[0-9a-f]\+:/\1  HEX:/'

Some explanations on these commands. Executing objdump with these options displays the assembly of the given program. I pipe this assembly through sed to replace everything that would prevent a clean diff with a generic placeholder (split into multiple invocations of sed for clarity):

  • End of line comments are removed,
  • hexadecimal values (e.g. for memory accesses) are replaced by HEX,
  • addresses for jumps and calls are replaced by HEX,
  • addresses of the instructions themselves are replaced by HEX.

Transformed like this, the outputs can be passed to diff, which clearly shows that the only meaningful difference between the versions with and without cstdlib is the additional int <-> double conversions in the calls to std::abs:

cvtsi2sd  xmm0, edi    ; integer to double
andps     xmm0, XMMWORD PTR .L_2il0floatpacket.0[rip]
cvttsd2si eax, xmm0    ; double to integer

Side effect of the side effect

This is a nice result but it does not explain the performance loss yet. See, these functions with all these int <-> double conversions are not executed in the benchmark, as it uses an alternate implementation based on AVX2 intrinsics. Thus the loss is not directly linked to the change in the implementation of abs.
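
For the curious, the AVX2 path looks something like the sketch below. This is a hypothetical reconstruction, not the actual code from the application; the point is simply that the hot loop computes absolute values with intrinsics and never calls std::abs.

// Hypothetical sketch of an AVX2 absolute-difference kernel, for illustration only.
#include <immintrin.h>
#include <cstddef>

void abs_diff(const int* a, const int* b, int* out, std::size_t n)
{
  std::size_t i = 0;

  // Process eight 32-bit integers per iteration with AVX2 intrinsics.
  for (; i + 8 <= n; i += 8)
    {
      const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
      const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
      const __m256i d = _mm256_sub_epi32(va, vb);
      _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i), _mm256_abs_epi32(d));
    }

  // Scalar tail, written without std::abs to keep the example self-contained.
  for (; i != n; ++i)
    out[i] = a[i] >= b[i] ? a[i] - b[i] : b[i] - a[i];
}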

On the other hand, when I look at the assembly diff I can see many extra instructions when cstdlib is omitted; around 100'000 additional ones, for 500 KB added to the final binary. Now, if everything happens in code that is never executed, our best guess is that it hinders efficient retrieval of the instructions of the code that does run. I try to confirm that with a pass of the perf tool to measure the instruction cache misses (the L1-icache-load-misses event), and hurrah: 11 to 20% extra cache misses when cstdlib is missing. A-ha! Finally!

Interesting point: I observe this loss when the final binary goes from 12.1 MB to 12.6 MB (both binaries compiled with LTO), but there is no loss when going from 13.2 to 13.7 MB (binaries compiled without LTO). The explanation is left as an exercise for the reader.

Final thoughts

I wanted to clean up my inclusions, and because I use an old compiler this resulted in a different generated binary. The change increased the size of the binary, a size that went over a threshold, causing a huge increase in cache misses and thus a performance loss.

Phew, it was not much but it took me weeks, all because of the removal of a single include directive. It shows that it is much better to be excessively inclusive… Wait, what? Someone is telling me that I am mixing things up. Never mind, it’s over.

P.S.: On the topic of the size of the executable binary you can read a nice series of posts by Sandor Dargo, for example: https://www.sandordargo.com/blog/2023/07/19/binary-sizes-and-compiler-flags