Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I just tested this on my machine (gcc 5.4.0). At -O2, gcc produced normal looking assembly code. At -O3, gcc produced a monstrosity [0] that I don't feel like fully deciphering.

However, from a brief glance, it does not appear to have created a closed form solution. Instead, it contains a single loop:

    .L4:
        addl    $1, %edx
        paddd   %xmm1, %xmm0
        paddd   %xmm2, %xmm1
        cmpl    %edx, %eax
        ja  .L4 
which seems to be using a SIMD instruction (paddd[1]) that adds does 4 32-bit integer additions in parallel.

After this loop, it does some "housekeeping" (read, something I don't understand) before proceeding to an unwound version of the last iterations of the loop:

    leal    4(%rdx), %ecx
    addl    %edx, %eax
    cmpl    %ecx, %edi
    jl  .L2 
    addl    %ecx, %eax
    leal    8(%rdx), %ecx
    cmpl    %ecx, %edi
    ...
    jl  .L2 
    addl    %ecx, %eax
    leal    28(%rdx), %ecx
    cmpl    %ecx, %edi
    jl  .L2
    addl    %ecx, %eax
    addl    $32, %edx
    leal    (%rax,%rdx), %ecx
    cmpl    %edx, %edi
    cmovge  %ecx, %eax
    ret
Where .L2 is just:

    .L2:
        rep ret
I assume that this is just some form of return, but the documentation I could find [2] seems to suggest that rep is a prefix for string operations, which doesn't make sense.

[0]https://pastebin.com/raw/Y55gQG7p

[1] http://x86.renejeschke.de/html/file_module_x86_id_226.html

[2] https://c9x.me/x86/html/file_module_x86_id_279.html



the rep in rep ret is ignored, is just used for alignment; the 'housekeeping' code is to handle non-multiple of 8 loop counts.

Still, unless I'm missing something, the code should be executing 8 adds per clock[2]; at 4ghz, that still above 1us for 500k adds.

GCC doesn't seem to be able to fold the loop given a constant expression, unless the function is explicitly declared constexpr; in which case it will complain about the accumulator overflowing, but gcc doesn't seem to be taking advantage of it.

Clang does not vectorize the loop but will replace it with a constant given a constant parameter.

Bottom line, I'm not sure what's going on with the article's measurements.

[2] potentially 12 for skylake or even 24 with avx.


If you pass -march=skylake it will get even more monstrous, with AVX: https://godbolt.org/g/v7iKF3




Consider applying for YC's Summer 2026 batch! Applications are open till May 4

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: