CPU Architecture of the DEC PWS 500au (Alpha 21164):
The Alpha 21164 has the following features:
- 64-bit RISC architecture
- fully pipelined
- superscalar 4-way instruction issue (2 integer
pipelines, 2 floating-point pipelines)
- 32 integer registers (+ 8 PAL shadow registers)
- 32-entry, 64-bit floating-point register file
- 8 KB, direct-mapped L1 instruction cache
- 8 KB, direct-mapped, write-through L1 data cache
- 96 KB, 3-way set-associative, write-back L2
data and instruction cache (on-chip)
- supports an optional board-level L3 cache (1 MB -
64 MB)
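The cache sizes above imply simple derived geometries (lines per cache, number of sets). A minimal sketch, assuming the 32-byte L1 line size and the 64-byte L2 block size stated later in this document:

```python
def cache_geometry(size_bytes, line_bytes, ways):
    """Return (sets, total lines) for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return sets, lines

# 8 KB direct-mapped L1 caches with 32-byte lines
l1_sets, l1_lines = cache_geometry(8 * 1024, 32, 1)
# 96 KB 3-way set-associative L2 cache with 64-byte blocks
l2_sets, l2_lines = cache_geometry(96 * 1024, 64, 3)

print(l1_sets, l1_lines)   # 256 sets, 256 lines
print(l2_sets, l2_lines)   # 512 sets, 1536 lines
```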
Scheduling and Issuing
- There are 2 integer pipelines (E0 and E1), a
floating-point add pipeline (FA), and a
floating-point multiply pipeline (FM).
- Loads are executed in E0 or E1, stores in E0.
Loads and stores cannot be issued simultaneously.
IBranches are issued in E1, IAdds in E0 or E1,
IMuls and shifts in E0. FAdds are issued in FA,
FDivs in FA, FMuls in FM, FBranches in FA.
- Up to 4 instructions from an INT16 (a naturally
aligned block of 16 bytes) can be issued
simultaneously to the 4 pipelines, as long as the
resources for the instructions are available.
- Out-of-order issue is not performed. Therefore,
if one of the instructions in the INT16 block
cannot be issued, the following instructions in the
block are also delayed, even if their resources are
available.
- The next INT16 block is issued only when all
instructions of the previous INT16 block have been
issued.
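The issue rules above can be sketched as a toy model. This is an assumption-laden simplification, not the 21164's actual slotting logic: the load/store mutual-exclusion rule is omitted, and ties are broken by an arbitrary E0/FA-first preference.

```python
# Pipelines each instruction class may use (from the issue rules above).
PIPES = {
    "IADD": {"E0", "E1"}, "IMUL": {"E0"}, "SHIFT": {"E0"},
    "LOAD": {"E0", "E1"}, "STORE": {"E0"}, "IBR": {"E1"},
    "FADD": {"FA"}, "FDIV": {"FA"}, "FMUL": {"FM"}, "FBR": {"FA"},
}

def issue_int16(block):
    """Number of instructions issued this cycle from one INT16 block."""
    free = {"E0", "E1", "FA", "FM"}
    issued = 0
    for op in block:
        usable = PIPES[op] & free
        if not usable:
            break                        # in-order: stop at the first stall
        free.remove(sorted(usable)[0])   # arbitrary preference: E0/FA first
        issued += 1
    return issued

print(issue_int16(["LOAD", "IADD", "FADD", "FMUL"]))   # 4: one per pipeline
print(issue_int16(["STORE", "IMUL", "IADD", "FADD"]))  # 1: IMUL needs E0, taken
```

The second call shows the in-order rule: once the IMul cannot issue (E0 is occupied by the store), the IAdd and FAdd behind it are delayed too, even though E1 and FA are free.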
Sources of Latency (Processor)
- Cache misses
- TLB misses
- Register dependencies
- Branch mispredictions
- Memory barrier instructions
- Replay traps
- Cache coherence protocol (mainly in
multiprocessor systems)
Memory Architecture of the DEC PWS 500au (Alpha 21164):
Instruction Translation Buffer:
- Fully associative TLB.
- 48 entries, not-last-used replacement.
- Each entry can map 1, 8, 64, or 512 contiguous
pages of 8 kB size.
Data Translation Buffer:
- Fully associative TLB: 43-bit VA --> 40-bit PA.
- 64 entries, not-last-used replacement.
- Each entry can map 1, 8, 64, or 512 contiguous
pages of 8 kB size (i.e., 8 kB, 64 kB, 512 kB, 4
MB). The size of each mapping is specified by
hint bits stored in the entry.
- 1 cycle for address translation (one pipeline
stage).
- Address translation is done in parallel with
data cache access.
- Dual-ported, i.e., 2 address translations per
cycle are possible.
- ITB and DTB implement 7-bit address space
numbers (ASNs) per entry to indicate the context
for which the address translation entry is valid.
- ITB and DTB misses have significant penalties,
i.e., PALcode entry and, potentially, memory
accesses.
- TLB miss latency: > 20 cycles! (Assumption
of a study on a 300 MHz Alpha.)
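The mapping granularities and the 64-entry DTB capacity above determine the possible mapping sizes and the maximum DTB reach; a quick arithmetic check:

```python
PAGE = 8 * 1024                   # 8 kB base page size
GRANULARITY = [1, 8, 64, 512]     # contiguous pages per TLB entry (hint bits)

sizes = [n * PAGE for n in GRANULARITY]
print([s // 1024 for s in sizes])  # [8, 64, 512, 4096] kB, i.e. 8 kB .. 4 MB

# Maximum DTB reach: all 64 entries using the largest (4 MB) mapping.
max_reach = 64 * sizes[-1]
print(max_reach // (1024 * 1024))  # 256 MB
```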
L1 Instruction Cache:
- On chip, direct mapped, 8 kB, 32-byte blocks.
- Part of instruction unit.
L1 Data Cache:
- On chip, direct mapped, 8 kB, 32-byte blocks,
write through.
- Dual-read-ported, i.e., 2 reads per cycle are
possible.
- Non-blocking, i.e., does not block to serve a
miss.
- Latency: 2 cycles! Load: cache read is done in
S4 and S5, data is loaded into register in S6.
Store: cache hit is determined in S4 and S5, data
is stored into cache in S6.
- This latency figure assumes that the
instruction after a load is an operate instruction;
the data fetched from the L1 cache is directly fed
into the I or FP pipeline. The data is available in
a register 3 cycles after the load.
- Example of instruction stream and optimal
execution time (L1 cache hit):
Cycle i: LDL R2, 0(R1)
Cycle i+1: NOP
Cycle i+2: ADDL R2, R3, R4
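The example above can be restated as a toy cycle count (assumptions: an in-order, single-issue view, an L1 hit, and the 2-cycle load-use gap stated above). The point is that replacing the NOP with an independent instruction hides the load latency at no cost:

```python
LOAD_USE_GAP = 2  # cycles until a load's result can be consumed (L1 hit)

def cycles(stream):
    """Cycle count for a toy in-order, single-issue instruction stream.
    stream: list of (opcode, depends_on_previous_load) pairs."""
    ready = 0   # cycle in which the most recent load result is available
    clock = 0
    for op, depends in stream:
        if depends and clock < ready:
            clock = ready            # stall until the loaded value is ready
        if op == "LDL":
            ready = clock + LOAD_USE_GAP
        clock += 1
    return clock

# ADDL immediately after the load: one stall cycle, 3 cycles total.
print(cycles([("LDL", False), ("ADDL", True)]))                   # 3
# An independent SUBL fills the load shadow: still 3 cycles, but
# three useful instructions instead of two.
print(cycles([("LDL", False), ("SUBL", False), ("ADDL", True)]))  # 3
```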
Memory Load and Store Merge:
- Connects (or decouples) the L1 caches and the L2
cache.
- 2 components: MAF and WB.
- Miss Address File (MAF):
- Buffers loads that missed in the L1 data
cache.
- Merges multiple loads to the same 32-byte
block (under certain restrictions), up to 2 loads
per cycle.
- Capacity: 6 entries (32 bytes each) for up to
21 different data accesses; 4 entries for
instruction fetches.
- Write Buffer (WB):
- Buffers stores that missed in the L1 data
cache.
- Merges multiple stores to the same 32-byte
block, 1 store per cycle.
- Capacity: 6 entries (32 bytes each).
- Stops merging into a block once its write has
been sent to the L2 cache.
- Holds an entry until completion is signaled
by the L2 cache.
- Connection to L2 cache:
- MAF and WB send one (merged) load or store
as a 32-byte block access to the L2 cache every 2
cycles.
- Priorities: data load, store, instruction
fetch (in that order).
- Effects of these components:
- Some L1 cache misses may see only part of
effective L2 cache access latency.
- L2 cache bandwidth is saved through
buffering and merging.
- Data accesses may be finished out of order,
however, access ordering as given by the
instruction stream is guaranteed.
Memory Barrier Instructions and Replay Traps:
- Memory Barriers cause all outstanding memory
accesses to be finished.
- Replay Traps are generated if an access would
cause a buffer to overflow; they cause the access
to be repeated later on.
- Effect: noticeably increased memory access
latency.
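The Write Buffer's merging and overflow behavior can be sketched as follows. Only the 32-byte block alignment and the 6-entry capacity are modeled; the one-store-per-cycle merge limit and the L2 handshake are left out, and "replay" is simplified to dropping the overflowing store:

```python
BLOCK = 32      # bytes per WB entry
CAPACITY = 6    # WB entries

def buffer_stores(addresses):
    """Return (entries used, replay traps) for a sequence of store addresses."""
    entries = []            # block-aligned addresses currently buffered
    replays = 0
    for a in addresses:
        block = a - a % BLOCK
        if block in entries:
            continue                # merged into an existing entry
        if len(entries) == CAPACITY:
            replays += 1            # would overflow -> replay trap; the
            continue                # store is repeated later on
        entries.append(block)
    return len(entries), replays

# 8 stores falling into 2 blocks merge into 2 entries, no traps:
print(buffer_stores([0, 8, 16, 24, 32, 40, 48, 56]))   # (2, 0)
# 7 stores to 7 distinct blocks overflow the 6 entries once:
print(buffer_stores([b * 32 for b in range(7)]))       # (6, 1)
```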
L2 Cache:
- On chip, 3-way set associative, 96 kB, write
back, unified I&D, single-ported.
- Configurable to handle 64-byte or 32-byte
blocks.
- Writes back to off-chip cache only the modified
16-byte data words.
- 128-bit read and write paths to execution units
and L1 caches; 32 bytes are transferred in 2
cycles.
- Load access: data are directly transferred to
the I or FP pipeline, then to L1 data cache, then
written into register (called "fill").
- A "fill" of the I pipeline, L1 cache and
registers from L2 cache causes 2 or 3 empty cycles
to be allocated so that the L1 cache port is free
for the fill.
- Minimum load latency: 8 cycles! (I.e., the next
instruction depending on the register value can be
issued 8 cycles later.)
- Minimum store latency: 5 cycles!(?) (Due to
store buffering, the CPU usually does not encounter
this latency.)
- Latencies are higher if there are arbitration
conflicts for the L2 cache.
Bus Interface Unit File:
- Between the L2 and L3 caches.
- Buffers memory accesses that missed in the L2
cache.
- Merges read requests to 32-byte blocks within
the same 64-byte block.
- 2 entries.
- Effect: may lower the average L3 cache access
latency.
L3 Cache:
- Off chip, control is on chip.
- Direct mapped, write back.
- Maximum 64 MB, typically 2 MB; optional in some
systems.
- Block size configurable to 32 or 64 bytes.
- Minimum load latency: 12 cycles!(?)
Main Memory:
- Example: Personal Workstation with 21174 Memory
Controller.
- SDRAM memory, 66 MHz system clock.
- 4 cycles of the 128-bit data bus for a 64-byte
L2 cache line fill.
- Minimum load latency: 75 cycles @ 500 MHz (10
bus cycles = 150 ns)!(?)
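The stated figures are mutually consistent; a quick arithmetic check (75 CPU cycles at 500 MHz against roughly 10 cycles of the 66 MHz system bus, and the 4-transfer line fill on the 128-bit bus):

```python
cpu_hz, bus_hz = 500e6, 66e6

latency_ns = 75 / cpu_hz * 1e9       # 75 CPU cycles at 500 MHz
bus_ns = 10 / bus_hz * 1e9           # 10 bus cycles at 66 MHz
print(round(latency_ns))             # 150 ns
print(round(bus_ns))                 # 152 ns, i.e. about 150 ns as stated

# A 64-byte L2 line fill on the 128-bit (16-byte) data bus:
print(64 // 16)                      # 4 bus data cycles
```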
|Type of access |DEC PWS 500au (500 MHz) |300 MHz Alpha 21164 (measured)
|TLB miss       |> 20 cycles             |
|L2 cache load  |> 8 cycles              |21 cycles
|L3 cache load  |> 12 cycles             |80 cycles
|Memory load    |> 75 cycles             |125 cycles
- The Alpha instruction set has several
cache-related instructions:
- FETCH - Prefetch Data:
- Hint to the implementation. The
implementation may optionally attempt to move all
or part of a 512-byte block of data to a
faster-access part of the memory hierarchy. These
instructions are "architecturally optional". In the
21164, "partial hardware implementation is
provided".
- FETCH_M - Prefetch Data, Modify
- Same as FETCH, but gives the additional
hint that modifications to the data block are
anticipated.
- ECB - Evict Data Cache Block:
- Makes a particular cache location available
for reuse by evicting and invalidating its
contents.
- WH64 - Write Hint 64 Bytes:
- Provides the performance hint that the
contents of the 64-byte block will never be read
again, but will be overwritten in the near future.
- Causes for stall cycles in commercial workloads:
- Stall cycles are about evenly divided between
instruction-related and data-related.
- About half of memory system stalls are hits
in L2 and L3 caches.
- L2 cache hits mostly account for > 20%
of overall execution time.
- Branch mispredictions and replay traps
account for about 10%-20% of execution time.
Related Information on the Web:
- Digital Equipment Corporation. Digital
Semiconductor 21164 Alpha Microprocessor Hardware
Reference Manual. EC-QP99B-TE. Feb. 1997.
- J.E. Edmondson et al. "Internal Organization of
the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS
RISC Microprocessor". Digital Technical Journal,
vol. 7, no. 1, 1995, pp. 119-132.
- K.M. Weiss, K.A. House. "DIGITAL Personal
Workstations: The Design of High-performance,
Low-cost Alpha Systems". Digital Technical Journal,
vol. 9, no. 2, 1997, pp. 45-56.
- R.C. Schumann. "Design of the 21174 Memory
Controller for DIGITAL Personal Workstations".
Digital Technical Journal, vol. 9, no. 2, 1997, pp.
- L.A. Barroso, K. Gharachorloo, E. Bugnion.
"Memory System Characterization of Commercial
Workloads". To appear in: Proc. 25th Int'l. Symp.
on Computer Architecture (ISCA-25), June 1998.
- Digital Equipment Corporation. Alpha
Architecture Handbook. EC-QD2KB-TE. Oct. 1996.