Branchless Programming Intro in C++

It's All about Performance


Every programmer working with C++ is, in some way, obsessed with performance. Unlike Java or Python, C++ provides relatively few abstractions over the machine, which is what makes it a true systems programming language, and systems programming is all about performance.

To optimize performance, a C++ programmer needs to understand not only the language itself (using it efficiently and choosing the right algorithm), but also the underlying computing system (using the hardware efficiently). This includes how data moves in and out of memory, how caches work, and how the CPU, or even the GPU, executes calculations.

In this article, we will focus on branchless programming and see how you can speed up CPU execution with C++.

Know Your Hardware


Inside a modern CPU we have, of course, execution units, caches, a decoder, and microcode. Beyond that, we also get branch prediction, out-of-order execution, speculative execution, register renaming, and multi-stage instruction pipelines.

Back in the day, a CPU core could execute only one instruction at a time, which left little room for this kind of hardware-level optimization. Modern pipelining lets a core work on multiple instructions at once: instructions can be fetched, decoded, executed, and written back in parallel stages.

With pipelining, branches become a critical factor that can slow down the CPU's instruction pipeline. This happens because the CPU must know the outcome of a branch before it can proceed down the correct path. Modern CPUs include a branch predictor in the front-end, which lets the pipeline continue speculatively instead of stalling. But if the guess is wrong, the CPU has to flush the pipeline and fetch the correct path instead, which can be painful: a misprediction can cost on the order of 20 cycles while the correct instructions are reloaded.

Most modern CPUs implement dynamic branch prediction using a table that records branch history, and the predictor consults this history to make educated guesses about which path a branch will take. The predictor doesn't have unlimited capacity, but its goal is simple: use recent execution outcomes to optimize future decisions.

In general, the branch misprediction rate is below 10%, and when the branches in your code follow a consistent pattern, the misprediction rate can drop significantly, sometimes to less than 0.1%.

// Example of a well-patterned branch: the loop's back-edge is taken
// 9,999 times in a row, so the predictor almost never misses:
for (size_t i = 0; i < 10000; i++) {
	//...
}

What Are Branches


From different perspectives, the concept of a "branch" can be understood differently. Assume we have the code below:

if (x || y) {
	// Branch a
} else {
	// Branch b
}

To a programmer, x || y reads as a single compound condition, so it is natural to assume that the CPU evaluates the entire logical expression as one unit.

However, from the perspective of the compiler and the CPU, this involves two distinct conditions: x and y. Here's how the CPU unfolds the evaluation:

  • If x is true, the short-circuit behavior of || kicks in and y is never evaluated. The flow proceeds directly to branch a.
  • If x is false, only then does the machine evaluate y. Only at this point does branch b come into consideration.

In this example, it is this short-circuit evaluation that introduces the extra, hard-to-predict branch. If we could evaluate x and y at the same time, the prediction would be unnecessary. So, what can we do? We can use integer or bitwise arithmetic on boolean values, which lets the CPU evaluate both operands unconditionally. For example:

if (bool(x) + bool(y)) {
	//...
} else {
	//...
}

// or

if (bool(x) | bool(y)) {
	//...
} else {
	//...
}

// or

auto flag = bool(x) + bool(y);
if (flag) {
	//...
} else {
	//...
}

But you may ask: isn't the CPU doing more work this way? Yes, it evaluates both operands. But as discussed in Know Your Hardware, modern CPUs are built to handle multiple things at once. Thanks to instruction pipelining and out-of-order execution, evaluating a couple of simple expressions in parallel is trivially fast: you get that extra computation essentially for free, while a pipeline stall is a big deal.

Another Common Example


In the last example, we reduced two branches to one. In the example below, we can use a trick to reduce one branch to zero, achieving truly branchless execution.

Instead of relying on a conditional branch:

sum += cond ? expr1 : expr2;

We could restructure the code to this:

int terms[2] = {expr2, expr1}; // use the actual type of the expressions
sum += terms[bool(cond)];

Here, we pre-compute both options and select the correct one based on the boolean value of the condition. This trick avoids branching entirely and allows the compiler to generate straight-line code with no jumps. However, not all branch substitutions lead to actual performance gains. Some would not work at all and instead result in what's known as a spectacular pessimization. For example:

if (cond) {
	f1();
} else {
	f2();
}

which can be written as:

void (*f[2])() = {&f2, &f1};
f[bool(cond)]();

While this does eliminate the explicit branch, it introduces new problems. The function pointers prevent the compiler from applying optimizations such as inlining, and the indirect call may lead to instruction cache misses. As a result, performance might degrade rather than improve.

This is it


So this is branchless programming: instead of introducing branches, you just do a little more work. (Yes, it's a trade-off.)