Friday, 7 February 2020

Rust zero cost abstractions in action

One of my colleagues was experimenting with Rust. He started by writing a sudoku solver which he has already written in C before. Once he was completed writing it in Rust, he was very disappointed because Rust version was twice as fast than the C version which was hand-optimised by pulling off all the tricks he knew to make it perform well. He eventually managed to make the C version as fast as the Rust version by removing the intrinsics.

This piqued my interest in looking into what kind of assembly code is being generated in the final binary.

Sum numbers up to N

The following code is summing up numbers from 1 to n.

pub fn sum(n: i32) -> i32 {
    let mut sum = 0;
    for i in 1..n {
      sum += i;
    }
    sum
}

Generated assembly for this is like this:

example::sum:
        xor     eax, eax
        cmp     edi, 2
        jl      .LBB0_2
        lea     eax, [rdi - 2]
        lea     ecx, [rdi - 3]
        imul    rcx, rax
        shr     rcx
        lea     eax, [rcx + 2*rdi]
        add     eax, -3
.LBB0_2:
        ret

I don’t know anything about reading x86 opcodes other the basics but I can confidently tell that I don’t see any loops here.

After reading a little bit about the opcodes and what various registers do, I was able to make sense of the generated code.

RDI register is where our argument n is stored. RAX register is where the return result is stored. So if we replace RAX with RETURN and RDI with N then the generated assembly roughly looks like this:

example::sum:
        xor     eax, eax
        cmp     edi, 2
        jl      .LBB0_2
        lea     eax, [N - 2]
        lea     ecx, [N - 3]
        imul    rcx, rax
        shr     rcx
        lea     RETURN, [rcx + 2*N]
        add     RETURN, -3
.LBB0_2:
        ret

which essentially calculates the formula: (N-2)*(N-3)/2 + 2*N - 3 which can be simplified to N*(N-1)/2. That’s the formula used for summing numbers between 1 to N!

That’s surprising that the compiler was smart enough to recognise this and replaced it with the actual formula.

Zero cost abstractions

One of the promises of Rust is zero-cost abstractions. To bluntly put it, your flavour of code doesn’t affect the performance. It doesn’t matter if you use loops or closures, they all compile down to the same assembly.

So if I rewrite the same code by using fold, then the generated assembly should be roughly the same.

pub fn sum2(n: i32) -> i32 {
   (1..n).fold(0, |x,y| x + y)
}

generated the following:

example::sum2:
        xor     eax, eax
        cmp     edi, 2
        jl      .LBB0_2
        lea     eax, [rdi - 2]
        lea     ecx, [rdi - 3]
        imul    rcx, rax
        shr     rcx
        lea     eax, [rcx + 2*rdi]
        add     eax, -3
.LBB0_2:
        ret

It’s the same assembly!

How about using sum instead?

pub fn sum3(n: i32) -> i32 {
   (1..n).sum()
}

Same result.

I am really impressed with what I have seen so far, but the following just blew my mind off.

pub fn main() -> i32 {
    return sum(5) + sum2(5) + sum3(5);
}

compiled down to the following:

example::main:
        mov     eax, 30
        ret

Doesn’t even call the functions. It’s 30.

Conclusion

I knew about const folding that compilers perform to generate fast code but I didn’t know it could go this far.

Rust is known to be very performant but the best thing about writing Rust code that you never really care about the performance while writing your code. It just happens to be fast!.