Skip to content

Instantly share code, notes, and snippets.

@RealNeGate
Last active September 14, 2023 14:52

Revisions

  1. RealNeGate revised this gist Sep 14, 2023. No changes.
  2. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 6 additions and 6 deletions.
    12 changes: 6 additions & 6 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -61,17 +61,17 @@ if (cond) {
    }
    ```
    the big win is making it easier to do later loop optimizations.
    their codegen generates `inc rax` instead of `add rax, 1` which doesn't really matter to me
    oh and the other one is that my register allocator does not coalesce a copy of the loop index
    in practice these loops run at about the same speed but the gap grows in more complex examples
    loop rotation eventually becomes helpful once you need to hoist memory ops from loops
    their codegen generates `inc rax` instead of `add rax, 1` which doesn't really matter to me.
    oh and the other one is that my register allocator does not coalesce a copy of the loop index.
    In practice these loops run at about the same speed but the gap grows in more complex examples.
    Loop rotation eventually becomes helpful once you need to hoist memory ops from loops:
    ```c
    // broadcast item across all elements
    for (int i = 0; i < n; i++) {
    dst[i] = src[0];
    }
    ```
    given the src is a pointer i haven't yet accessed and n can be 0, i can't hoist that load because just placing it outside of the loop means it'll force a load even if n == 0 which could be a problem
    given the src is a pointer i haven't yet accessed and n can be 0, i can't hoist that load because just placing it outside of the loop means it'll force a load even if n == 0 which could be a problem.
    but when we do loop rotation...
    ```c
    size_t i = 0;
    @@ -83,7 +83,7 @@ if (i < n) {
    }
    ```
    we can now guarentee that once we're in the loop body we're doing at least one trip
    so hoisting is now fair game
    so hoisting is now fair game:
    ```diff
    + // if logic proved we do one trip which means we're met the condition to read src[0]
    + size_t hoisted = src[0];
  3. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 2 additions and 2 deletions.
    4 changes: 2 additions & 2 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -60,8 +60,8 @@ if (cond) {
    } while (cond);
    }
    ```
    the big win is making it easier to do later loop optimizations
    their tiling generates inc rax instead of add rax, 1 which doesn't really matter to me
    the big win is making it easier to do later loop optimizations.
    their codegen generates `inc rax` instead of `add rax, 1` which doesn't really matter to me
    oh and the other one is that my register allocator does not coalesce a copy of the loop index
    in practice these loops run at about the same speed but the gap grows in more complex examples
    loop rotation eventually becomes helpful once you need to hoist memory ops from loops
  4. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 10 additions and 10 deletions.
    20 changes: 10 additions & 10 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -30,18 +30,18 @@ L0000021FE98F0540:
    # clang -O1
    max_array:
    test rdi, rdi
    je .LBB0_3
    xor eax, eax
    test rdi, rdi
    je .LBB0_3
    xor eax, eax
    .LBB0_2:
    movss xmm0, dword ptr [rsi + 4*rax]
    maxss xmm0, dword ptr [rdx + 4*rax]
    movss dword ptr [rsi + 4*rax], xmm0
    inc rax
    cmp rdi, rax
    jne .LBB0_2
    movss xmm0, dword ptr [rsi + 4*rax]
    maxss xmm0, dword ptr [rdx + 4*rax]
    movss dword ptr [rsi + 4*rax], xmm0
    inc rax
    cmp rdi, rax
    jne .LBB0_2
    .LBB0_3:
    ret
    ret
    ```

    you'll notice two checks instead of one this is because they do loop rotation simple rundown is this
  5. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 4 additions and 4 deletions.
    8 changes: 4 additions & 4 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -45,7 +45,7 @@ max_array:
    ```

    you'll notice two checks instead of one this is because they do loop rotation simple rundown is this
    ```
    ```c
    while (cond) {
    ...
    }
    @@ -65,15 +65,15 @@ their tiling generates inc rax instead of add rax, 1 which doesn't really matter
    oh and the other one is that my register allocator does not coalesce a copy of the loop index
    in practice these loops run at about the same speed but the gap grows in more complex examples
    loop rotation eventually becomes helpful once you need to hoist memory ops from loops
    ```
    ```c
    // broadcast item across all elements
    for (int i = 0; i < n; i++) {
    dst[i] = src[0];
    }
    ```
    given the src is a pointer i haven't yet accessed and n can be 0, i can't hoist that load because just placing it outside of the loop means it'll force a load even if n == 0 which could be a problem
    but when we do loop rotation...
    ```
    ```c
    size_t i = 0;
    if (i < n) {
    do {
    @@ -84,7 +84,7 @@ if (i < n) {
    ```
    we can now guarentee that once we're in the loop body we're doing at least one trip
    so hoisting is now fair game
    ```
    ```diff
    + // if logic proved we do one trip which means we're met the condition to read src[0]
    + size_t hoisted = src[0];
    do {
  6. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 1 addition and 7 deletions.
    8 changes: 1 addition & 7 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -92,10 +92,4 @@ so hoisting is now fair game
    + dst[i] = hoisted;
    } while (i < n);
    ```

    yea that's my ted talk.

    I started implementing this a while back and stopped because i only do optimizer stuff on the side, it's lower priority than other crap.
    it won't be ready until far later.

    - NeGate
    Yea that's my ted talk. I started implementing this a while back and stopped because I only do optimizer stuff on the side, it's low priority for now.
  7. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 6 additions and 3 deletions.
    9 changes: 6 additions & 3 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -93,6 +93,9 @@ so hoisting is now fair game
    } while (i < n);
    ```

    yea that's my ted talk
    i started implementing this a while back and stopped because i only do optimizer stuff on the side, it's lower priority than other crap
    it won't be ready until far later
    yea that's my ted talk.

    I started implementing this a while back and stopped because i only do optimizer stuff on the side, it's lower priority than other crap.
    it won't be ready until far later.

    - NeGate
  8. RealNeGate revised this gist Sep 14, 2023. 1 changed file with 3 additions and 4 deletions.
    7 changes: 3 additions & 4 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -44,9 +44,8 @@ max_array:
    ret
    ```

    you'll notice two checks instead of one
    this is because they do loop rotation
    simple rundown is this
    you'll notice two checks instead of one this is because they do loop rotation simple rundown is this
    ```
    while (cond) {
    ...
    }
    @@ -60,7 +59,7 @@ if (cond) {
    ...
    } while (cond);
    }
    ```
    the big win is making it easier to do later loop optimizations
    their tiling generates inc rax instead of add rax, 1 which doesn't really matter to me
    oh and the other one is that my register allocator does not coalesce a copy of the loop index
  9. RealNeGate created this gist Sep 14, 2023.
    99 changes: 99 additions & 0 deletions tb_comparison.md
    Original file line number Diff line number Diff line change
    @@ -0,0 +1,99 @@
    Let's compare Clang and Cuik/TB a bit to get a picture of what i need to do.

    Given this C code:
    ```c
    void max_array(size_t n, float* x, float* y) {
    for (size_t i = 0; i < n; i++) {
    x[i] = x[i] > y[i] ? x[i] : y[i];
    }
    }
    ```
    Clang and I generate:
    ```x86
    # cuik -O1
    max_array:
    xor RAX, RAX
    L0000021FE98F03C0:
    mov R9, RAX
    cmp R9, RCX
    jnb L0000021FE98F0540
    L0000021FE98F0440:
    movss XMM0, xmmword [R8 + RAX*4]
    maxss XMM0, xmmword [RDX + RAX*4]
    movss xmmword [RDX + RAX*4], XMM0
    add RAX, 1
    jmp L0000021FE98F03C0
    L0000021FE98F0540:
    ret
    # clang -O1
    max_array:
    test rdi, rdi
    je .LBB0_3
    xor eax, eax
    .LBB0_2:
    movss xmm0, dword ptr [rsi + 4*rax]
    maxss xmm0, dword ptr [rdx + 4*rax]
    movss dword ptr [rsi + 4*rax], xmm0
    inc rax
    cmp rdi, rax
    jne .LBB0_2
    .LBB0_3:
    ret
    ```

    you'll notice two checks instead of one
    this is because they do loop rotation
    simple rundown is this
    while (cond) {
    ...
    }

    // into...

    if (cond) {
    // loop invariants are placed here
    // none in this example :p
    do {
    ...
    } while (cond);
    }

    the big win is making it easier to do later loop optimizations
    their tiling generates inc rax instead of add rax, 1 which doesn't really matter to me
    oh and the other one is that my register allocator does not coalesce a copy of the loop index
    in practice these loops run at about the same speed but the gap grows in more complex examples
    loop rotation eventually becomes helpful once you need to hoist memory ops from loops
    ```
    // broadcast item across all elements
    for (int i = 0; i < n; i++) {
    dst[i] = src[0];
    }
    ```
    given the src is a pointer i haven't yet accessed and n can be 0, i can't hoist that load because just placing it outside of the loop means it'll force a load even if n == 0 which could be a problem
    but when we do loop rotation...
    ```
    size_t i = 0;
    if (i < n) {
    do {
    dst[i] = src[0];
    i += 1;
    } while (i < n);
    }
    ```
    we can now guarentee that once we're in the loop body we're doing at least one trip
    so hoisting is now fair game
    ```
    + // if logic proved we do one trip which means we're met the condition to read src[0]
    + size_t hoisted = src[0];
    do {
    - dst[i] = src[0];
    + dst[i] = hoisted;
    } while (i < n);
    ```

    yea that's my ted talk
    i started implementing this a while back and stopped because i only do optimizer stuff on the side, it's lower priority than other crap
    it won't be ready until far later