Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations

Digital A21164 Hardware Overview

CPU Architecture of the DEC PWS 500au:

The Alpha 21164 has the following features:

64 bit RISC architecture
fully pipelined
superscalar 4-way instruction issue (2 integer pipelines, 2 floating-point pipelines)
32 integer registers (+ 8 PALshadow registers)
32 entry, 64 bit floating-point register file
8 KB, direct-mapped, L1 instruction cache (onchip)
8 KB, direct-mapped, write through L1 data cache (onchip)
96 KB, 3-way, set-associative, write-back L2 data and instruction cache (onchip)
supports optional board-level L3 cache (1 MB - 64 MB)

Scheduling and Issuing Rules:

There are 2 integer pipelines (E0 and E1), a floating-point add pipeline (FA), and a floating-point multiply pipeline (FM).
Loads are executed in E0 or E1, stores in E0. Loads and stores cannot be issued simultaneously. IBranches are issued in E1, IAdds in E0 or E1, IMults and shifts in E0. Fadds are issued in FA, Fdivs in FA, Fmuls in FM, FBranches in FA.
Up to 4 instructions of an INT16 (a naturally aligned block of 16 bytes) can be issued simultaneously to the 4 pipelines, provided that the resources for the instructions are available.
Out-of-order issue is not performed. Therefore, if one of the instructions in the INT16 block cannot be issued, the following instructions in the block are also delayed, even if their resources are available.
The next INT16 block is issued when all instructions of the previous INT16 block are issued.
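
The issuing rules above can be illustrated with a small sketch. It is a simplified model, not a description of the actual 21164 slotting logic: it ignores register dependencies and latencies, and the names ALLOWED, can_issue_together, and issue_cycles are hypothetical.

```python
from itertools import permutations

PIPES = ("E0", "E1", "FA", "FM")

# Pipelines each instruction class may issue to, taken from the rules above.
ALLOWED = {
    "load": ("E0", "E1"),
    "store": ("E0",),
    "ibranch": ("E1",),
    "iadd": ("E0", "E1"),
    "imul": ("E0",),
    "shift": ("E0",),
    "fadd": ("FA",),
    "fdiv": ("FA",),
    "fbranch": ("FA",),
    "fmul": ("FM",),
}

def can_issue_together(group):
    """True if every instruction in `group` can be given its own pipeline."""
    if "load" in group and "store" in group:
        return False  # loads and stores cannot be issued simultaneously
    if len(group) > len(PIPES):
        return False
    # Try every assignment of distinct pipelines to the instructions.
    return any(
        all(pipe in ALLOWED[kind] for kind, pipe in zip(group, perm))
        for perm in permutations(PIPES, len(group))
    )

def issue_cycles(block):
    """Cycles needed to issue one INT16 block (up to 4 instructions) in order."""
    cycles, pos = 0, 0
    while pos < len(block):
        n = 1
        # Extend the issue group until the next instruction no longer fits;
        # everything behind that instruction waits for the next cycle.
        while pos + n < len(block) and can_issue_together(block[pos:pos + n + 1]):
            n += 1
        pos += n
        cycles += 1
    return cycles
```

In this model, the block (imul, shift, fadd, fmul) needs two cycles: the shift has to wait for E0, and the fadd and fmul behind it are delayed as well although FA and FM are free.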

Sources of Latency (Processor Stalls):

Cache misses
TLB misses
Register dependencies
Branch mispredictions
Memory barrier instructions
Replay traps
Cache coherence protocol (mainly in multiprocessor systems)

Memory Architecture of the DEC PWS 500au (Alpha 21164):

Instruction Translation Buffer (ITB):

Fully associative TLB.
48 entries, not-last-used replacement.
Each entry can map 1, 8, 64, or 512 contiguous pages of 8 kB size.

Data Translation Buffer (DTB):

Fully associative TLB: 43-bit VA --> 40-bit PA.
64 entries, not-last-used replacement.
Each entry can map 1, 8, 64, or 512 contiguous pages of 8 kB size (i.e., 8 kB, 64 kB, 512 kB, 4 MB). The size of each mapping is specified by hint bits stored in the entry.
1 cycle for address translation: pipeline stage S4.
Address translation is done in parallel with data cache access.
Dual-ported, i.e., 2 address translations per cycle are possible.
ITB and DTB implement 7-bit address space numbers (ASNs) per entry to indicate the context for which the address translation entry is valid.
ITB and DTB misses have significant penalties, i.e., PALcode entry and, potentially, memory accesses.
TLB miss latency: > 20 cycles! (Assumption of a study on a 300 MHz Alpha [5])
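
The mapping sizes above determine how much memory the translation buffers can cover without a miss. The following sketch only restates that arithmetic; the function names are hypothetical, and it assumes that all entries use the same mapping size.

```python
PAGE_SIZE = 8 * 1024            # 8 kB pages
GRANULARITY = (1, 8, 64, 512)   # contiguous pages one entry can map

def mapping_sizes():
    """Bytes covered by a single entry for each granularity hint."""
    return [n * PAGE_SIZE for n in GRANULARITY]

def tlb_reach(entries, pages_per_entry):
    """Bytes covered by a TLB whose entries all map `pages_per_entry` pages."""
    return entries * pages_per_entry * PAGE_SIZE

if __name__ == "__main__":
    print(mapping_sizes())          # per-entry mapping sizes in bytes
    print(tlb_reach(64, 1))         # DTB reach with plain 8 kB pages
    print(tlb_reach(64, 512))       # DTB reach with 4 MB mappings
    print(tlb_reach(48, 1))         # ITB reach with plain 8 kB pages
```

With plain 8 kB pages the 64-entry DTB covers only 512 kB, so a data set that is swept with large strides can cause TLB misses long before it exceeds the L3 cache.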

L1 Instruction Cache:

On chip, direct mapped, 8 kB, 32-byte blocks.
Part of instruction unit.

L1 Data Cache:

On chip, direct mapped, 8 kB, 32-byte blocks, write through.
Dual-read-ported, i.e., 2 reads per cycle are possible; single-write-ported.
Non-blocking, i.e., does not block to serve a miss.
Latency: 2 cycles! Load: cache read is done in S4 and S5, data is loaded into register in S6. Store: cache hit is determined in S4 and S5, data is stored into cache in S6.
This latency figure assumes that the instruction after a load is an operate instruction; the data fetched from the L1 cache is directly fed into the I or FP pipeline. The data is available in a register 3 cycles after the load.
Example of instruction stream and optimal execution time (L1 cache hit):
Cycle i: LDL R2, 0 (R1)
Cycle i+1: NOP
Cycle i+2: ADDL R2, R3, R4
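
Because the L1 data cache is direct mapped, each address has exactly one possible cache line. The sketch below shows the line selection implied by the 8 kB size and 32-byte blocks; the function names are hypothetical, and the address is simply treated as an integer.

```python
CACHE_SIZE = 8 * 1024
BLOCK_SIZE = 32
NUM_LINES = CACHE_SIZE // BLOCK_SIZE   # 256 lines

def l1_line(addr):
    """Index of the L1 line that the byte at `addr` maps to."""
    return (addr // BLOCK_SIZE) % NUM_LINES

def conflict(a, b):
    """True if `a` and `b` lie in different blocks that share one L1 line."""
    return l1_line(a) == l1_line(b) and a // BLOCK_SIZE != b // BLOCK_SIZE

if __name__ == "__main__":
    print(conflict(0, 8192))   # addresses 8 kB apart evict each other
    print(conflict(0, 16))     # same block, no conflict
```

Two arrays whose start addresses differ by a multiple of 8 kB therefore compete for the same lines when they are traversed with the same index.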

Memory Load and Store Merge Logic:

Connects (or, decouples) L1 caches and L2 cache.
2 components: MAF and WB.
Miss Address File (MAF):
- Buffers loads that missed in the L1 caches.
- Merges multiple loads into same 32-byte block (under certain restrictions), up to 2 loads per cycle.
- Capacity: 6 entries (32 bytes each) for 21 different data loads; 4 entries for instruction fetches.
Write Buffer (WB):
- Buffers stores that missed in the L1 data cache.
- Merges multiple stores into same 32-byte block, 1 store per cycle.
- Capacity: 6 entries (32 bytes each).
- Stops merging when block write sent to L2 cache.
- Holds entry until completion is signaled by L2 cache.
Connection to L2 cache:
- MAF and WB send one (merged) load or store as a 32-byte block access to L2 cache every 2 cycles.
- Priorities: data load, store, instruction fetch.
Effects of these components:
- Some L1 cache misses may see only part of effective L2 cache access latency.
- L2 cache bandwidth is saved through buffering and merging.
- Data accesses may complete out of order; however, the access ordering given by the instruction stream is guaranteed.
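
The merging behavior of the write buffer can be sketched as follows. This is a deliberately simplified model under stated assumptions: no entry is drained to the L2 cache while the stores arrive, merging is always allowed, and the function name merge_stores is hypothetical.

```python
BLOCK = 32        # bytes per write buffer entry
WB_ENTRIES = 6    # capacity of the write buffer

def merge_stores(addresses, entries=WB_ENTRIES):
    """Return (block base addresses held, number of stores that found the buffer full)."""
    buffer = []       # block base addresses, in allocation order
    rejected = 0
    for addr in addresses:
        base = addr - addr % BLOCK
        if base in buffer:
            continue              # store merges into an existing entry
        if len(buffer) < entries:
            buffer.append(base)   # store allocates a new entry
        else:
            rejected += 1         # buffer full: the store would have to wait
    return buffer, rejected

if __name__ == "__main__":
    # Eight consecutive 8-byte stores occupy only two entries.
    print(merge_stores(range(0, 64, 8)))
    # Eight stores with a 32-byte stride need eight entries; two do not fit.
    print(merge_stores([i * 32 for i in range(8)]))
```

The comparison suggests why unit-stride stores are cheaper than strided ones: consecutive stores share entries, whereas strided stores fill the buffer quickly.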

Memory Barrier Instructions and Replay Traps:

Memory Barriers cause all outstanding memory accesses to be finished.
Replay Traps are generated if an access would cause a buffer to overflow; the access is then repeated later on.
Effect: noticeable memory access latency penalty!

L2 Cache:

On chip, 3-way set associative, 96 kB, write back, unified I&D, single-ported.
Configurable to handle 64-byte blocks or 32-byte subblocks.
Writes back to off-chip cache only the modified 16-byte data words.
128-bit read and write paths to execution units and L1 caches; 32 bytes are transferred in 2 successive cycles.
Load access: data are directly transferred to the I or FP pipeline, then to L1 data cache, then written into register (called "fill").
A "fill" of the I pipeline, L1 cache and registers from L2 cache causes 2 or 3 empty cycles to be allocated so that the L1 cache port is available!
Minimum load latency: 8 cycles! (I.e., the next instruction depending on the register value can be issued 8 cycles later.)
Minimum store latency: 5 cycles!(?) (Due to store buffering, the CPU does not encounter this latency, though.)
Latencies are higher if there are arbitration conflicts for the L2 cache.
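
The set structure that follows from these L2 parameters can be sketched as below. The sketch assumes the 64-byte block mode and treats the given integers as the addresses used for indexing; the function names are hypothetical.

```python
from collections import Counter

L2_SIZE = 96 * 1024
WAYS = 3
BLOCK = 64                             # assumption: 64-byte block mode
NUM_SETS = L2_SIZE // (WAYS * BLOCK)   # 512 sets

def l2_set(addr):
    """Index of the L2 set that the byte at `addr` maps to."""
    return (addr // BLOCK) % NUM_SETS

def fits_in_l2(addresses):
    """True if no set is needed by more than WAYS distinct blocks."""
    blocks = {addr // BLOCK for addr in addresses}
    per_set = Counter(block % NUM_SETS for block in blocks)
    return all(count <= WAYS for count in per_set.values())

if __name__ == "__main__":
    stride = NUM_SETS * BLOCK          # 32 kB: addresses this far apart share a set
    print(fits_in_l2([i * stride for i in range(3)]))   # three blocks fit in one set
    print(fits_in_l2([i * stride for i in range(4)]))   # a fourth block forces an eviction
```

Addresses that are a multiple of 32 kB apart fall into the same set, so at most three such blocks can reside in the L2 cache at the same time.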

Bus Interface Unit File (BAF):

Between L2 and L3 cache.
Buffers memory accesses that missed in L2 cache.
Merges read requests to 32-byte blocks within the same 64-byte block.
2 entries.
Effect: may lower average L3 cache access latency.

L3 Cache:

Off chip, control is on chip.
Direct mapped, write back.
1 MB - 64 MB, typical 2 MB; optional in some machines.
Block size configurable to 32 or 64 bytes.
Minimum load latency: 12 cycles!(?)

Main Memory:

Example: Personal Workstation with 21174 Memory Controller.
SDRAM memory, 66 MHz system clock.
4 cycles of the 128-bit data bus for a 64-byte L2 cache line fill.
Minimum load latency: 75 cycles @ 500 MHz (10 bus cycles = 150 ns)!(?)
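
The figures above are related by simple clock arithmetic, which the following sketch reproduces. It assumes that the nominal 66 MHz system clock is exactly 200/3 MHz (66.67 MHz), which is what makes 10 bus cycles equal 150 ns; the names are hypothetical.

```python
from fractions import Fraction

BUS_HZ = Fraction(200_000_000, 3)   # assumption: nominal "66 MHz" = 66.67 MHz
CPU_HZ = 500_000_000                # 500 MHz CPU clock
BUS_WIDTH_BYTES = 128 // 8          # 128-bit data bus

def bus_cycles_to_ns(n):
    """Duration of `n` bus cycles in nanoseconds."""
    return n / BUS_HZ * 10**9

def bus_cycles_to_cpu_cycles(n):
    """Number of CPU cycles that elapse during `n` bus cycles."""
    return n * CPU_HZ / BUS_HZ

def line_fill_bytes(bus_cycles):
    """Bytes transferred in `bus_cycles` data cycles of the bus."""
    return bus_cycles * BUS_WIDTH_BYTES

if __name__ == "__main__":
    print(bus_cycles_to_ns(10))           # 150 (ns)
    print(bus_cycles_to_cpu_cycles(10))   # 75 (CPU cycles)
    print(line_fill_bytes(4))             # 64 (bytes)
```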

Latency Summary:

Type of access	According to [1]-[4]	According to [5]
TLB miss	?	> 20 cycles
L1 cache	2 cycles	2 cycles
L2 cache	> 8 cycles	7 cycles
L3 cache	> 12 cycles	21 cycles @ 300 MHz
Main memory	> 75 cycles	80 cycles @ 300 MHz
C-to-C transfer	?	125 cycles @ 300 MHz
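
Since the two columns refer to different clock rates, the cycle counts are easier to compare after conversion to nanoseconds. The sketch below does this for some of the entries; it assumes that the main memory figure from [1]-[4] refers to the 500 MHz clock, as stated in the main memory section, and the function name is hypothetical.

```python
def cycles_to_ns(cycles, mhz):
    """Convert a latency in CPU cycles at `mhz` MHz to nanoseconds."""
    return cycles * 1000 / mhz

if __name__ == "__main__":
    rows = [
        ("L3 cache, [5]", 21, 300),
        ("Main memory, [5]", 80, 300),
        ("C-to-C transfer, [5]", 125, 300),
        ("Main memory, [1]-[4]", 75, 500),
    ]
    for name, cycles, mhz in rows:
        print(f"{name}: {cycles} cycles @ {mhz} MHz = {cycles_to_ns(cycles, mhz):.1f} ns")
```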

Miscellaneous:

The Alpha instruction set has several cache related instructions [6]:

FETCH - Prefetch Data:

Hint to implementation. Implementation may optionally attempt to move all or part of a 512-byte block of data to a faster-access part of the memory hierarchy. These instructions are "architecturally optional" [1]. In the 21164, "partial hardware implementation is provided" [1].

FETCH_M - Prefetch Data, Modify Intent:

Same as FETCH, but gives the additional hint that modifications to the data block are anticipated.

ECB - Evict Data Cache Block:

Makes a particular cache location available for reuse by evicting and invalidating its contents.

WH64 - Write Hint 64 Bytes:

Provides a performance hint that the contents of the 64-byte block will never be read again, but will be overwritten in the near future.

Causes for stall cycles in commercial workloads [5]:

- Stall cycles are about evenly divided between instruction- and data-related stalls.
- About half of memory system stalls are hits in L2 and L3 caches.
- L2 cache hits mostly account for > 20% of overall execution time.
- Branch mispredictions and replay traps account for about 10%-20% of execution time.

References:

[1]: Digital Equipment Corporation. Digital Semiconductor 21164 Alpha Microprocessor Hardware Reference Manual. EC-QP99B-TE. Feb. 1997.
[2]: J.E. Edmondson et al. "Internal Organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC Microprocessor". Digital Technical Journal, vol. 7, no. 1, 1995, pp. 119-132.
[3]: K.M. Weiss, K.A. House. "DIGITAL Personal Workstations: The Design of High-performance, Low-cost Alpha Systems". Digital Technical Journal, vol. 9, no. 2, 1997, pp. 45-56.
[4]: R.C. Schumann. "Design of the 21174 Memory Controller for DIGITAL Personal Workstations". Digital Technical Journal, vol. 9, no. 2, 1997, pp. 57-70.
[5]: L.A. Barroso, K. Gharachorloo, E. Bugnion. "Memory System Characterization of Commercial Workloads". To appear in: Proc. 25th Int'l. Symp. on Computer Architecture (ISCA-25), June 1998.
[6]: Digital Equipment Corporation. Alpha Architecture Handbook. EC-QD2KB-TE. Oct. 1996.

cs10-dime@fau.de

Last Modified: 10 January 2008