Branching vs. branchless add with data dependency
Introduction
Modern CPUs implement branch prediction to achieve massive performance gains, even when facing data dependencies that would otherwise expose the Von Neumann bottleneck. The weakness of branch prediction only becomes apparent in the relatively rare case where a condition in a hot loop is randomly determined and does not overwhelmingly favor one branch (>99%). In such cases, the solution can be a refactor that eliminates the branch, but branchless code comes with its own trade-offs.
Comparing
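The exact source under comparison is not reproduced here, but a minimal Rust sketch consistent with the assembly walkthrough below could look like this; the element type (u8), the signatures, and which vector corresponds to a true selector are assumptions:

```rust
// Hypothetical sketches of the two functions under comparison; element type,
// signatures, and the true/false mapping are assumptions, not the original code.
pub fn branching_add(v1: &[u8], v2: &[u8], b: &[bool]) -> u8 {
    let mut acc: u8 = 0;
    for i in 0..v1.len() {
        // A conditional branch decides which vector contributes to the sum.
        if b[i] {
            acc = acc.wrapping_add(v1[i]);
        } else {
            acc = acc.wrapping_add(v2[i]);
        }
    }
    acc
}

pub fn branchless_add(v1: &[u8], v2: &[u8], b: &[bool]) -> u8 {
    let mut acc: u8 = 0;
    for i in 0..v1.len() {
        // Both elements are loaded and the selector is turned into an index,
        // so no conditional jump is needed inside the loop.
        let vals = [v2[i], v1[i]];
        acc = acc.wrapping_add(vals[b[i] as usize]);
    }
    acc
}
```

The branchless version trades the conditional jump for unconditional loads from both vectors, which is exactly the difference the assembly below exposes.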
Assembly
Assembly Explanation
branching_add
First, the function prologue performs the setup steps and tests whether the length of v1 is zero, in which case the function returns early. If rsi (the length of v1) is not 0, execution continues.
Some more length checking (bounds checking) follows, this time on b.
Now check whether to add v1 or v2 to the final result, along with more bounds checking.
At this point either the base address of v1 or v2 is stored in r11. With the loop counter r10 as the offset, the i'th element relative to r11 is added to rax, and the loop counter is compared against the length of v1 to determine whether the end of the loop has been reached.
LBB0_7 is entered if a bounds check failed on b, and debug information is provided to the error routine.
This is the function epilogue that marks the end of a successful call to branching_add. It is executed regardless of the length of v1.
Simplified branching_add
If we ignore the function prologue, epilogue, error handling, and the early return on a zero-length v1, we can focus on the expected happy path.
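A rough Rust-level rendering of that happy path is sketched below; the register mapping in the comments follows the walkthrough above, while the code itself is an approximation rather than actual compiler output:

```rust
// Sketch of the simplified branching_add happy path; an approximation, not compiler output.
fn branching_happy_path(v1: &[u8], v2: &[u8], b: &[bool]) -> u8 {
    let mut acc: u8 = 0;   // accumulated in rax/al
    let mut i: usize = 0;  // loop counter held in r10
    while i != v1.len() {  // loop ends when r10 equals the length of v1 (rsi)
        // The branch on b[i] decides which base address ends up in r11.
        let base = if b[i] { v1 } else { v2 };
        // The i'th element relative to r11 is added to the accumulator.
        acc = acc.wrapping_add(base[i]);
        i += 1;
    }
    acc
}
```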
branchless_add
The function prologue includes allocating stack space for the 2-element vals array, clearing the return value register, checking whether v1 has a length of zero (early return), and clearing the loop counter register.
Now entering the body of the loop, starting with a bounds check on v2. This is where the branching is erased: both v1[r10] and v2[r10] are loaded, and the value of b[r10] is used to calculate the offset when performing the add to al.
The function epilogue deallocates the stack space for the vals array and restores rbx before returning.
Finally there are error sections for a failed bounds check on v2 and b, respectively.
LBB0_6 is called if the bounds check fails on v2.
LBB0_7 is entered if a bounds check failed on b, and debug information is provided to the error routine.
Simplified branchless_add
If we ignore the function prologue, epilogue, error handling, and the early return on a zero-length v1, we can focus on the expected happy path, which is simply setting eax (retval) to zero and going through the loop.
The loop starts by implementing line 4 from branchless_add, which builds the two-element vals array, interleaved with the bounds check instructions. Line 5, the add into al of the element selected by b, is implemented next; both lines are sketched below.
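In terms of the hypothetical sketch from earlier, those two lines correspond roughly to the body of a single loop iteration (again an assumption, not the author's exact source):

```rust
// Hypothetical shape of "line 4" and "line 5" discussed above, shown as one
// loop iteration; this mirrors the earlier sketch, not the actual source.
fn branchless_iteration(acc: u8, v1: &[u8], v2: &[u8], b: &[bool], i: usize) -> u8 {
    let vals = [v2[i], v1[i]];             // line 4: both elements are loaded onto the stack
    acc.wrapping_add(vals[b[i] as usize])  // line 5: b[i] selects the offset for the add into al
}
```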
Finally, the condition for jumping to the start of the loop or continuing to the epilogue is the equality of the loop counter and the length of v1.
LLVM Machine Code Analysis
The analysis is performed by first generating AT&T style assembly, then saving the relevant assembly from the output and feeding each set of instructions into llvm-mca.
Dividing the instruction and cycle counts by the number of iterations gives us:
| | Branching | Branchless |
|---|---|---|
| Instructions | 28 | 30 |
| Total Cycles | 9.08 | 10.03 |
So there are more instructions, and it takes more CPU cycles to go through the Branchless scenario. That is not surprising, but what does the trade-off of removing the branching mean?
Benchmarking with perf stat
The rust-perf-comp project was used to carry out benchmarks and gather statistics about each run.
The values of the vectors v1, v2, and b were randomly determined, and the benchmarks were carried out with various ratios of true/false values in the b vector. The randomness of the b values defeats the branch predictor's ability to detect a pattern, and the varying ratios let us examine the effect of varying degrees of branch misprediction.
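As an illustration of this setup (not the project's actual code), a selector vector with a given true ratio could be generated as follows, assuming the rand crate:

```rust
use rand::Rng;

/// Build a selector vector where roughly `ratio` of the entries are `true`,
/// in random order so the branch predictor cannot learn a pattern.
/// Illustrative only; rust-perf-comp may generate its inputs differently.
fn random_selectors(len: usize, ratio: f64) -> Vec<bool> {
    let mut rng = rand::thread_rng();
    (0..len).map(|_| rng.gen_bool(ratio)).collect()
}
```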
The measurements of numbers of CPU instructions can be seen in the graph below.
CPU core is the name of the standard micro-architecture for Intel CPUs, while CPU atom is the low-power micro-architecture.
While there appears to be a lot of noise in the data, it is clear that branching requires more total CPU instructions when the ratio is nearer 50/50, which is where the highest degree of branch misprediction is seen. The branchless implementation is much more stable but executes more instructions at the extremities, where the ratio is 100% true or 100% false.
Timing measurements and percent of branch misses are depicted below.
The cost of branch misses is very clear in the timing measurements: at worst the branching code is 7x slower, while the branchless implementation is very stable at around 500-600 ms. However, we see that the branching code beats the branchless version at the extremities again.
Conclusion
Writing code with a high degree of branching does not rule out highly efficient code, thanks to modern CPU architectures with advanced branch prediction capabilities. There are, however, scenarios where branching code incurs a significant performance penalty, and these can be resolved by refactoring branching code into branchless code. It is important to measure before and after refactoring to branchless code, as it is not a pure win: there is a trade-off, and without measuring there is a risk of introducing a performance regression instead of an improvement.