Branch Prediction Deep Dive: From Bimodal to GShare

Jan 20, 2025·3 min read

Branch Prediction Deep Dive: From Bimodal to GShare

Computer ArchitectureBranch PredictionC++

Branch prediction is one of the most studied problems in computer architecture. Modern out-of-order processors execute dozens of instructions ahead of where they actually are in program flow, which means a mispredicted branch can flush a pipeline with 15–20 stages and waste significant work. Understanding prediction strategies is fundamental to understanding processor performance.

The Cost of Misprediction

On a modern superscalar processor:

Pipeline depth: 14–20 stages
Misprediction penalty: 15–20 cycles
Branch frequency: ~1 in every 5–7 instructions

At 3 GHz with 5% misprediction rate, a 15-cycle penalty costs you ~450 million wasted cycles per second per core — roughly 15% IPC loss for a typical workload.

Smith Predictor: The Baseline

The simplest useful predictor is the Smith n-bit counter. Each branch maps to a saturating counter with 2ⁿ states. For a 2-bit counter:

States: Strongly Not-Taken (00) → Weakly Not-Taken (01) → Weakly Taken (10) → Strongly Taken (11)

The advantage over a 1-bit counter is that a single anomalous branch outcome (e.g., a loop exit) doesn't flip the prediction. The "strongly" states provide hysteresis.

Limitation: A single counter for the entire program ignores that different branches have different behavior. The Bimodal predictor fixes this.

Bimodal Predictor

The Bimodal predictor uses the lower bits of the Program Counter (PC) to index a table of 2-bit counters. Each branch gets its own counter (modulo table size).

index = PC[k:1]  // lower k bits of PC
prediction = table[index][1]  // MSB of counter = taken/not-taken

Aliasing: Two different branches can map to the same table entry and interfere with each other. Larger tables reduce aliasing but increase hardware cost.

GShare: Exploiting Global History

GShare improves on Bimodal by incorporating global branch history into the index:

index = (PC[k:1]) XOR (global_history_register[k:1])

The Global History Register (GHR) is a shift register that records the last k branch outcomes. XORing it with the PC means that the same branch instruction maps to different table entries depending on what branches preceded it. This is powerful for correlated branches — common in loops and if-else chains.

Results on SPECint95

Simulating on SPECint95 traces (from my BranchPredictionSim project):

| Predictor | Bits | Avg Misprediction Rate | |-----------|------|----------------------| | Smith 2-bit | — | ~9.5% | | Bimodal | 13-bit index | ~4.8% | | GShare | 13-bit history+PC | ~3.2% | | Hybrid (chooser) | 13-bit | ~2.9% |

The hybrid predictor dynamically selects between Bimodal and GShare using a meta-predictor, slightly improving over GShare alone.

Key Insight

No single predictor dominates all workloads. GShare excels on correlated branches (loops, nested conditions) but can underperform Bimodal on branches with strong local patterns. The hybrid predictor hedges, but adds hardware cost. Modern processors use more sophisticated TAGE (Tagged Geometric history length) predictors, but the intuitions from these classical designs remain foundational.

← Back to Blog