I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced normal lookin...

gpderetta · on July 11, 2017

The rep in rep ret is ignored; it is just used for alignment. The 'housekeeping' code is there to handle loop counts that are not a multiple of 8.

Still, unless I'm missing something, the code should be executing 8 adds per clock[2]; at 4 GHz, that is still above 1 µs for 500k adds.
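To make that arithmetic explicit, here is a rough sketch of the estimate. The helper name estimate_ns is made up for illustration, and the throughput and clock figures are just the assumptions from this comment, not measurements:

```cpp
// Back-of-the-envelope lower bound on the loop's run time, in nanoseconds.
// estimate_ns is a hypothetical helper; adds_per_clock and ghz are assumed
// figures taken from the comment, not measured values.
constexpr double estimate_ns(double adds, double adds_per_clock, double ghz) {
    // adds / adds_per_clock gives cycles; a core at `ghz` GHz runs `ghz` cycles per ns.
    return adds / adds_per_clock / ghz;
}

// 500k adds at 8 adds/clock and 4 GHz: 62,500 cycles, i.e. 15,625 ns.
static_assert(estimate_ns(500000, 8, 4) == 15625.0, "about 15.6 us");
```

Even with the optimistic 24 adds per clock, estimate_ns(500000, 24, 4) comes out a bit above 5,000 ns, so the estimate stays well over 1 µs.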

GCC doesn't seem to be able to fold the loop given a constant expression unless the function is explicitly declared constexpr. In that case it complains about the accumulator overflowing, but otherwise GCC doesn't seem to be taking advantage of it.

Clang does not vectorize the loop but will replace it with a constant given a constant parameter.
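For anyone who wants to poke at the constant-folding behaviour, a minimal sketch follows. sum_to is a hypothetical stand-in, not the article's actual function, and it uses an unsigned accumulator so that wraparound is well defined and no overflow diagnostic is triggered:

```cpp
#include <cstdint>

// Hypothetical stand-in for the article's loop: sums 0 .. n-1.
// Needs C++14 or later, which allows loops inside constexpr functions.
constexpr std::uint64_t sum_to(std::uint64_t n) {
    std::uint64_t acc = 0;  // unsigned, so wraparound is defined behaviour
    for (std::uint64_t i = 0; i < n; ++i) {
        acc += i;
    }
    return acc;
}

// static_assert forces the compiler to evaluate the loop at compile time.
static_assert(sum_to(1000) == 499500, "evaluated at compile time");
```

The iteration count is kept small on purpose, since compilers cap how much work they will do during constant evaluation. Whether a plain runtime call such as sum_to(n) with a constant n gets folded is up to the optimizer and may differ between GCC and Clang.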

Bottom line, I'm not sure what's going on with the article's measurements.

[2] Potentially 12 on Skylake, or even 24 with AVX.

implr · on July 11, 2017

If you pass -march=skylake it will get even more monstrous, with AVX: https://godbolt.org/g/v7iKF3