Test Environment and Table Explanation
Environment:
Several different programs are presented. All of
them implement a 5-point stencil red-black
Gauss-Seidel relaxation in 2D, each in a different way.
Since all of them respect the data dependencies of the
standard red-black Gauss-Seidel method, they all produce
bitwise identical results.
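For orientation, a red-black sweep can be sketched as follows. This is a minimal illustration and not one of the measured Fortran77 programs: it assumes the Poisson model problem with Dirichlet boundary values, and the names red_black_sweep, u, f and h are hypothetical. The exact operator and operation count of the measured kernels are not given here.

```python
def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep with a 5-point stencil.

    Assumption: model problem -Laplace(u) = f with Dirichlet boundaries.
    u and f are (size+1) x (size+1) lists of lists; the boundary entries
    of u are read but never written.
    """
    n = len(u) - 1  # n == Size; interior indices run from 1 to n-1
    for colour in (0, 1):  # 0 = red points, 1 = black points
        for i in range(1, n):
            # first interior column j with (i + j) % 2 == colour
            j_start = 1 + (i + 1 + colour) % 2
            for j in range(j_start, n, 2):
                u[i][j] = 0.25 * (h * h * f[i][j]
                                  + u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1])
    return u


# Usage: Size = 4 gives a 5 x 5 grid with (Size-1)^2 = 9 unknowns.
size = 4
h = 1.0 / size
u = [[1.0 if i in (0, size) or j in (0, size) else 0.0
      for j in range(size + 1)] for i in range(size + 1)]
f = [[0.0] * (size + 1) for _ in range(size + 1)]
for _ in range(200):
    red_black_sweep(u, f, h)
```

All points of one colour depend only on points of the other colour, so any implementation that finishes the red update of a point's neighbours before that point's black update performs the same arithmetic in the same order per point.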
The programs were compiled with DEC Fortran77 on a
DEC PWS 500au. The number of accesses that hit in
each level of the memory hierarchy and the
number of cycles spent for each reason were
measured with the profiling tool DCPI.
DEC PWS 500au:
Hardware:
- CPU: 500 MHz Alpha A21164 (ev56)
- AlphaPC 164LX Motherboard
- 2 MB third-level cache (board level)
- 128 MByte SDRAM
Software:
- Digital UNIX V4.0
- DEC C V5.2-033 (Compiler driver: 3.11)
  used options: -O5 -ansi_alias -fast -tune host
- DEC F77 Driver X4.1-134-335G (Compiler driver: X4.1-6)
  used options: -extend_source -fast -notransform_loops -O5 -tune host
- DCPI V2.11
Programs:
Table: Memory Access Behaviour
The table below describes the memory behaviour of
the "Red-Black Gauss-Seidel Relaxation" program for
different data set sizes. In particular, it shows
which levels of the memory hierarchy serve the data
accesses and how frequently each level is used.
Memory access behaviour
Size | MBytes/sec | % of all accesses which go into
     |            | ± | 1. Level | 2. Level | 3. Level | Memory
- Size:
- The program operates on a (Size+1) x (Size+1)
grid in 2D. The number of unknowns is
(Size-1)².
- MBytes/sec:
- Average memory bandwidth achieved during
execution. Accesses to registers are not counted.
The number is calculated by dividing the number of
accesses actually performed to the first level
cache by the execution time of the program.
- ±:
- The theoretical number of memory accesses is
proportional to the number of relaxations performed
by the algorithm (6 loads and 1 store per relaxation).
This expected number of memory accesses is taken as
100%. The column shows the percentage of expected
memory accesses that were not actually observed
with DCPI. The number is
positive if fewer accesses were observed than
expected. If the number is very high, it is likely
that the compiler was able to reuse data held in
registers instead of reloading it from main memory
or cache. (DCPI can count
accesses to the first level cache but not to
registers.) E.g. if the column shows "25", then one
quarter of all expected memory accesses was
satisfied by register accesses.
- 1. Level:
- Represents the fraction of all expected memory
accesses which are served by the first level of
the memory hierarchy. The number of accesses
served by the L1 cache is measured with
DCPI by counting the accesses performed
to the L1 cache and the number of L1
cache misses (= accesses which cannot be served by
the L1 cache).
- 2. Level:
- Represents the fraction of all expected memory
accesses which are served by the second level of
the memory hierarchy. The number of accesses
served by the L2 cache is measured with
DCPI by counting the L1 misses (= accesses
which go to the L2 cache) and the number of
L2 cache misses (= accesses which cannot be served
by the L2 cache).
- 3. Level:
- Represents the fraction of all expected memory
accesses which are served by the third level of
the memory hierarchy. The number of accesses
served by the L3 cache is measured with
DCPI by counting the L2 misses (= accesses
which go to the L3 cache) and the number of
L3 cache misses (= accesses which cannot be served
by the L3 cache).
- Memory:
- Represents the fraction of all expected memory
accesses which are not served by any of the caches
and must be served by a main memory access. The
number of accesses which must be served by
main memory is measured with DCPI by
counting the L3 cache misses.
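The percentage columns described above can be derived from raw counts roughly as follows. This is a sketch under assumptions: the function name and its parameters are hypothetical stand-ins for the relaxation count and the DCPI event counts, and it assumes each level's share is the difference between the accesses reaching that level and the misses leaving it.

```python
def memory_access_breakdown(relaxations, l1_accesses, l1_misses,
                            l2_misses, l3_misses):
    """Return the table columns as percentages of the expected accesses.

    Assumption: 7 expected accesses (6 loads + 1 store) per relaxation.
    All counter arguments are hypothetical names for measured counts.
    """
    expected = 7 * relaxations

    def pct(count):
        return 100.0 * count / expected

    return {
        "±": pct(expected - l1_accesses),         # expected but not observed
        "1. Level": pct(l1_accesses - l1_misses),  # served by L1
        "2. Level": pct(l1_misses - l2_misses),    # served by L2
        "3. Level": pct(l2_misses - l3_misses),    # served by L3
        "Memory": pct(l3_misses),                  # served by main memory
    }


# Made-up counts for illustration only, not measured values.
row = memory_access_breakdown(relaxations=1000, l1_accesses=5250,
                              l1_misses=700, l2_misses=140, l3_misses=70)
```

With these made-up counts one quarter of the expected accesses never reaches the L1 cache, which matches the "25" example in the ± description. Under the stated assumption the five columns add up to 100.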
Table: Runtime Behaviour
The table below describes the runtime behaviour of
the "Red-Black Gauss-Seidel Relaxation" program for
different data set sizes. In particular, it shows the
share of cycles spent on execution and on each kind of stall.
Runtime behaviour
Size | MFlops/sec | % of cycles used for
     |            | ± | Base | Exec | Cache | DTB | Branch | R dep | Nops
- Size:
- The program operates on a (Size+1) x (Size+1)
grid in 2D. The number of unknowns is
(Size-1)².
- MFlops/sec:
- Achieved MFlops/sec rate. It is assumed that 10
floating-point operations are required for the
relaxation of each node. The programs are slightly
obfuscated so that the version of the DEC
Fortran77 compiler used is not able to eliminate
operations.
- ±:
- Shows the remainder of the cycles, i.e. those not spent for
any of the listed reasons. The remainder may include
measurement errors or one of the following reasons
for CPU stalls: instruction cache misses,
instruction TLB misses, slotting problems, busy
functional units (IMUL or FDIV), etc.
- Base:
- The measurements are performed with
DCPI, which assumes that 2 instructions are
issued per cycle. Hence, if two instructions per
cycle are issued, the CPU achieves 100% execution.
However, the DEC PWS 500au is able to issue up to
four instructions per cycle. If the CPU
issues more than 2 instructions per cycle, this is reported as
"unexplained gain" and added to the 100%. So, when
the CPU issues 3 instructions each cycle,
the Base column shows
150%.
- Exec:
- Per cent of cycles which are spent on the
execution of instructions. If 100% execution is
achieved, then on average 2 instructions are issued
per cycle.
- Cache:
- Per cent of cycles which are spent on data
cache miss stalls. If 100% stalls occur, then on
average 2 instructions are stalled each cycle.
- DTB:
- Per cent of cycles which are spent on data
translation lookaside buffer miss stalls. If 100% stalls
occur, then on average 2 instructions are stalled
each cycle.
- Branch:
- Per cent of cycles which are spent on branch
misprediction stalls. If 100% stalls occur, then
on average 2 instructions are stalled each
cycle.
- R dep:
- Per cent of cycles stalled because of
register dependencies on previous
instructions. Possible causes are
write-read conflicts (which cause a stall) and
write-write conflicts (which prohibit dual issue of the
instructions).
- Nops:
- Per cent of cycles spent on executing
Nops. If 100% nops occur, then on
average 2 nops per cycle are executed.
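The MFlops/sec and Base columns can be reproduced with simple arithmetic. The sketch below is illustrative only: the function names and the sweeps, seconds, instructions_issued and cycles parameters are hypothetical, and it assumes that one sweep relaxes each of the (Size-1)² unknowns exactly once.

```python
def mflops_rate(size, sweeps, seconds, flops_per_node=10):
    """MFlops/sec, assuming 10 floating-point operations per relaxed node.

    Assumption: each sweep relaxes every one of the (size-1)^2 unknowns once.
    """
    relaxations = sweeps * (size - 1) ** 2
    return flops_per_node * relaxations / seconds / 1e6


def base_percent(instructions_issued, cycles):
    """Base column: 100% corresponds to 2 instructions issued per cycle."""
    return 100.0 * instructions_issued / (2 * cycles)


# Made-up inputs for illustration only, not measured values.
rate = mflops_rate(size=1025, sweeps=10, seconds=1.0)
base = base_percent(instructions_issued=3000, cycles=1000)
```

With 3 instructions issued per cycle, base_percent yields 150, matching the example given for the Base column.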