Dreamscape Sep 13, 2024

CPU Psychic

“Just put an empty else there” – Said to me my, older, more experienced, colleague.
”Why, what good would that do?!” - I asked.
-”Just do it!”
I did what he said, and lo and behold, the performance of algorithm that we tried to optimize changed. Dafuq happened? Branch predictor happened.
Now, that was a long time ago, and I don't actually remember how empty else block affected the code (and why wasn't it optimized out), but that event prompted me to learn more about branch predictor.

What are branches?

When a CPU executes instructions, it sometimes encounters decision points, like for example "if" statements or loops. These decision points, known as branches, direct the CPU to take one of two possible paths: continue with the next sequential instruction or jump to a different part of the code.
For instance, in an if-else statement:

    
        if (x > 0)
        {
            // Do something
        }
        else
        {
            // Do something else
        }
    
The CPU must decide whether to execute the code block under if or else. This decision can potentially disrupt the smooth flow of instruction execution. Why, you ask? Because modern processors execute instruction in pipelines, where multiple instructions are processed in parallel. Take a quick look at this Wikipedia article if you wanna know more about instruction pipelines: Instruction pipelining

Why Are Branch Predictors Necessary?

As we said, modern CPUs are designed with deep pipelines, meaning they work on multiple instructions simultaneously, rather than waiting for one to complete before starting the next. When the CPU encounters a branch, it needs to decide which instructions to load into the pipeline next, before the condition that determines which branch needs to be executed is known. When this happens, CPU can do one of two things:

  1. Wait until the condition instruction is executed up to the point when we know the condition, and thus know which branch to take.
  2. Try to guess which branch will be executed, and load next instruction from that branch into the pipeline.
The second option usually happens. If the CPU guesses incorrectly, it has to discard the incorrectly loaded instructions, leading to performance penalties—a process known as a pipeline stall. To minimize these stalls, CPUs use branch predictors to guess the outcome of a branch instruction before it is fully evaluated. By accurately predicting the outcome, the CPU can continue loading the correct instructions into the pipeline, maintaining smooth execution flow.

How Do Branch Predictors Work?

Branch predictors use various algorithms and strategies to predict the outcome of a branch. Here are some possible (rather simple) types:

Static Prediction:

The simplest form of branch prediction. Based on a fixed strategy, like always predicting that forward branches are not taken and backward branches (like loops) are taken. While simple, static prediction often lacks the accuracy needed for complex, real-world code.

Dynamic Prediction:

More advanced and adapts based on the program's behavior. Uses historical data to make predictions. Common dynamic prediction techniques include:
1-Bit Predictor: Uses a single bit to record whether the branch was taken last time. If it was taken, the predictor guesses it will be taken again.
2-Bit Predictor: More sophisticated; allows two consecutive incorrect predictions before changing its prediction, reducing the chance of incorrect predictions for branches that occasionally alternate.

Branch History Table (BHT)

A table storing the history of branch outcomes. The CPU checks this table to predict the branch's outcome based on past occurrences.

Global History Predictor

Tracks the outcomes of multiple recent branches to predict the current branch, considering the context in which branches typically occur.

Hybrid Predictors:

Combines multiple predictors to take advantage of their strengths. For example, it might use a simple predictor for certain types of branches and a more complex one for others, selecting the most accurate predictor for each scenario.

These are just some possible simple implementation, modern CPUs of today can use much more complex predictors, for example Ryzen architecture uses predictors based on neural networks.

The Impact of Accurate Branch Prediction

Accurate branch prediction significantly improves CPU performance by reducing the number of pipeline stalls. In modern processors, where pipelines are deep and complex, even small improvements in branch prediction accuracy can lead to substantial performance gains. However, perfect prediction is impossible due to the inherently unpredictable nature of certain branches, particularly those dependent on user input or complex computations. Therefore, branch predictors are continuously evolving, with CPU designers implementing increasingly sophisticated algorithms to push the boundaries of performance. Here is an example of just how much (in)correct prediction can impact performance:

    
        for (int i = 0; i < ARRAY_SIZE; ++i)
        {
            if (numbers[i] > ARRAY_SIZE / 2)
                ++count;
        }
    

When measuring execution time of this loop when the array is unsorted, it was 1306 microseconds. When the array was sorted, it was only 181 microseconds. Do you realize why?
Here is the link to the full demo source code: branch predictor demo