I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced normal-looking assembly code. At -O3, gcc produced a monstrosity [0] that I don't feel like fully deciphering.
However, from a brief glance, it does not appear to have created a closed form solution. Instead, it contains a single loop:
which seems to be using a SIMD instruction (paddd [1]) that does four 32-bit integer additions in parallel.
After this loop, it does some "housekeeping" (read: something I don't understand) before proceeding to an unrolled version of the last iterations of the loop:
Where .L2 is just rep ret. I assume that this is just some form of return, but the documentation I could find [2] seems to suggest that rep is a prefix for string operations, which doesn't make sense here.
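The rep in rep ret is ignored by the CPU; it is just padding (a workaround for a branch-prediction penalty on older AMD processors when a one-byte ret is a branch target).
The 'housekeeping' code handles loop counts that are not a multiple of 8.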
Still, unless I'm missing something, the code should be executing 8 adds per clock[2]; at 4 GHz, that is still above 1 µs for 500k adds.
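To make that estimate explicit, here is a back-of-the-envelope sketch. The helper name estimate_us is made up for illustration, and the model assumes perfectly sustained throughput with no loop overhead or memory stalls, so treat the numbers as lower bounds rather than predictions.

```cpp
// Estimated run time in microseconds for `adds` additions, assuming
// `adds_per_clock` additions complete every cycle at `ghz` GHz.
// Hypothetical helper: all three parameters are illustrative inputs,
// not measurements.
constexpr double estimate_us(double adds, double adds_per_clock, double ghz) {
    // ghz * 1e3 is the number of cycles per microsecond.
    return (adds / adds_per_clock) / (ghz * 1e3);
}

// Even at 24 adds per clock, 500k adds at 4 GHz stay above 1 us.
static_assert(estimate_us(500000.0, 24.0, 4.0) > 1.0,
              "500k adds should take more than 1 us under this model");
```

Under these assumptions, 500k adds at 4 GHz come out to roughly 15.6 µs at 8 adds per clock, 10.4 µs at 12, and 5.2 µs at 24.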
GCC doesn't seem to be able to fold the loop given a constant argument unless the function is explicitly declared constexpr, in which case it complains about the accumulator overflowing. So gcc clearly can detect the overflow, but it doesn't seem to be taking advantage of it when optimizing.
Clang does not vectorize the loop but will replace it with a constant given a constant parameter.
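For anyone who wants to poke at the folding behavior themselves, here is a minimal sketch of the kind of loop I'm assuming is under discussion. The article's actual function isn't reproduced in this thread, so sum_loop and sum_closed are hypothetical stand-ins, and I've used a 64-bit unsigned accumulator so the example itself has no overflow. It needs C++14 for the loop inside a constexpr function.

```cpp
#include <cstdint>

// Hypothetical stand-in for the article's loop: sums 0 .. n-1.
// A 64-bit unsigned accumulator avoids the signed overflow discussed above.
constexpr std::uint64_t sum_loop(std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        acc += i;
    }
    return acc;
}

// The closed form a compiler could in principle substitute: n(n-1)/2.
constexpr std::uint64_t sum_closed(std::uint64_t n) {
    return n == 0 ? 0 : n * (n - 1) / 2;
}

// Evaluated at compile time; a small n keeps this well within the
// compiler's constexpr evaluation limits.
static_assert(sum_loop(1000) == sum_closed(1000),
              "loop and closed form should agree");
```

For n = 500000 the sum is 124,999,750,000, which does not fit in a 32-bit int; that would be consistent with the overflow complaint if the original accumulator is an int.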
Bottom line, I'm not sure what's going on with the article's measurements.
[2] Potentially 12 for Skylake, or even 24 with AVX.
[0] https://pastebin.com/raw/Y55gQG7p
[1] http://x86.renejeschke.de/html/file_module_x86_id_226.html
[2] https://c9x.me/x86/html/file_module_x86_id_279.html