Chapter 2
Internal Architecture
20695H/0—March 1998
AMD-K6
Processor Data Sheet
Preliminary Information
2.7 Branch-Prediction Logic
The AMD-K6 processor incorporates sophisticated branch logic
that can minimize or hide the impact of changes in program flow.
Branches in x86 code fall into two categories: unconditional
branches, which always change program flow (that is, the branches
are always taken), and conditional branches, which may or may not
divert program flow (that is, the branches are taken or not
taken). When a conditional branch is not taken, the processor
simply continues decoding and executing the next instructions in
memory.
In typical applications, up to 10% of instructions are unconditional
branches and another 10% to 20% are conditional branches. The AMD-K6
branch logic has been designed to handle this type of program
behavior and its negative effects on instruction execution, such
as stalls due to delayed instruction fetching and the draining of
the processor pipeline. The branch logic contains an 8192-entry
branch history table, a 16-entry by 16-byte branch target cache,
a 16-entry return address stack, and a branch execution unit.
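The role of the return address stack can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name ReturnAddressStack, the push-on-call and pop-on-return interface, and the policy of discarding the oldest entry when the stack is full are all hypothetical and are not taken from this data sheet. Only the depth of 16 entries comes from the text above.

```python
class ReturnAddressStack:
    """Fixed-depth return address stack (hypothetical model, not AMD's design)."""

    def __init__(self, depth=16):
        self.depth = depth      # 16 entries, as stated in the data sheet
        self.entries = []

    def push(self, return_address):
        # On a CALL: remember the address of the instruction after the call.
        # Assumption: when the stack is full, the oldest entry is discarded.
        if len(self.entries) == self.depth:
            self.entries.pop(0)
        self.entries.append(return_address)

    def pop(self):
        # On a RET: predict the most recently pushed return address.
        # Returns None when the stack is empty (no prediction available).
        return self.entries.pop() if self.entries else None


ras = ReturnAddressStack()
ras.push(0x1005)    # outer call returns to 0x1005
ras.push(0x2009)    # nested call returns to 0x2009
inner = ras.pop()   # predicted target of the inner RET
outer = ras.pop()   # predicted target of the outer RET
```

In this model, nested calls deeper than 16 levels lose their oldest return addresses, so the corresponding returns would have no prediction.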
Branch History Table
The AMD-K6 processor handles unconditional branches
without any penalty by redirecting instruction fetching to the
target address of the unconditional branch. However,
conditional branches require the use of the dynamic
branch-prediction mechanism built into the AMD-K6. A
two-level adaptive history algorithm is implemented in an
8192-entry branch history table. This table stores executed
branch information, predicts individual branches, and predicts
the behavior of groups of branches. To accommodate the large
branch history table, the AMD-K6 processor does not store
predicted target addresses. Instead, the branch target
addresses are calculated on the fly using ALUs during the
decode stage. The adders calculate all possible target addresses
before the instructions are fully decoded, and the processor
then chooses which addresses are valid.
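The following is a minimal sketch of how a two-level adaptive predictor with an 8192-entry table can work, together with the target calculation that replaces stored target addresses. Several details are assumptions made for illustration and are not specified in this data sheet: the 9-bit global history register, the index formed from the history bits and the low 4 bits of the branch address, the 2-bit saturating counters, and the names TwoLevelPredictor and relative_target.

```python
TABLE_ENTRIES = 8192            # table size stated in the data sheet
INDEX_MASK = TABLE_ENTRIES - 1
HISTORY_BITS = 9                # assumption: length of the global history register


class TwoLevelPredictor:
    """Hypothetical two-level adaptive predictor (not AMD's actual design)."""

    def __init__(self):
        # Second level: 2-bit saturating counters, starting at "weakly not taken".
        self.counters = [1] * TABLE_ENTRIES
        # First level: outcomes of the most recent branches, newest in bit 0.
        self.history = 0

    def _index(self, branch_address):
        # Assumption: history bits concatenated with 4 low branch-address bits.
        return ((self.history << 4) | (branch_address & 0xF)) & INDEX_MASK

    def predict(self, branch_address):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        # Train the counter selected by the current history, then shift the
        # actual outcome into the history register.
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)


def relative_target(branch_address, instruction_length, displacement):
    # x86 relative branches encode a signed displacement from the address of
    # the instruction that follows the branch, so one addition gives the target.
    return (branch_address + instruction_length + displacement) & 0xFFFFFFFF


predictor = TwoLevelPredictor()
for _ in range(20):                       # train on a branch that is always taken
    predictor.update(0x401000, True)
prediction = predictor.predict(0x401000)  # direction comes from the table
target = relative_target(0x401000, 2, -16)  # target comes from an adder
```

Because the counter is selected by recent branch outcomes rather than by the branch address alone, a model like this can also learn repeating patterns such as a strictly alternating taken/not-taken branch.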
Branch Target Cache
To avoid a one-clock cache-fetch penalty when a branch is
predicted taken, a built-in branch target cache supplies the first
16 bytes of instructions directly to the instruction buffer
(assuming the target address hits this cache). (See Figure 3 on
page 13.) The branch target cache is organized as 16 entries of
16 bytes. Overall, the branch-prediction logic achieves
prediction rates greater than 95%.
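As a rough illustration of the branch target cache organization, the sketch below models 16 entries of 16 instruction bytes each (256 bytes in total). The lookup scheme and replacement policy are not described in this data sheet, so the fully associative lookup by target address, the oldest-first replacement, and the names BranchTargetCache, lookup, and fill are assumptions made for this example only.

```python
ENTRIES = 16        # number of cache entries, from the data sheet
LINE_BYTES = 16     # instruction bytes held per entry, from the data sheet


class BranchTargetCache:
    """Hypothetical model of a 16-entry by 16-byte branch target cache."""

    def __init__(self):
        # Maps a branch target address to the first 16 instruction bytes there.
        # Assumption: fully associative lookup with oldest-first replacement.
        self.lines = {}

    def lookup(self, target_address):
        # Hit: returns the cached bytes, which can go straight to the
        # instruction buffer. Miss: returns None, so the bytes must come
        # from the instruction cache instead.
        return self.lines.get(target_address)

    def fill(self, target_address, instruction_bytes):
        if target_address not in self.lines and len(self.lines) == ENTRIES:
            oldest = next(iter(self.lines))   # dicts preserve insertion order
            del self.lines[oldest]
        self.lines[target_address] = bytes(instruction_bytes[:LINE_BYTES])


btc = BranchTargetCache()
miss = btc.lookup(0x1000)              # nothing cached yet
btc.fill(0x1000, bytes(range(32)))     # only the first 16 bytes are kept
hit = btc.lookup(0x1000)
```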