Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations

Test Environment and Table Explanation

Environment:

Several different programs are presented. All of them implement a 5-point stencil red-black Gauss-Seidel relaxation in 2D, each in a different way. Since they all observe the data dependencies of the standard red-black Gauss-Seidel method, they produce bitwise identical results.
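For orientation, a standard red-black sweep can be sketched as follows. This is a minimal illustration, not one of the measured Fortran77 programs. It assumes a Poisson-type model problem with mesh width `h` and right-hand side `f`, which the text above does not specify, so its load and operation counts need not match the figures quoted in the table explanations. The function name and its parameters are hypothetical.

```python
def red_black_gauss_seidel(u, f, h, sweeps=1):
    """Relax u in place on an (n+1) x (n+1) grid; boundary values stay fixed.

    Assumed model problem: -Laplace(u) = f with a 5-point stencil.
    """
    n = len(u) - 1
    for _ in range(sweeps):
        # Colour 0 ("red"): points with (i + j) even; colour 1 ("black"): odd.
        # All points of one colour are updated before any point of the other,
        # so every update only reads values of the opposite colour.
        for colour in (0, 1):
            for i in range(1, n):
                # First interior column j with (i + j) % 2 == colour.
                j_start = 1 + (i + 1 + colour) % 2
                for j in range(j_start, n, 2):
                    u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                      + u[i][j - 1] + u[i][j + 1]
                                      + h * h * f[i][j])
    return u


# Usage: Size = 4, boundary fixed at 1, zero right-hand side.
n = 4
u = [[1.0 if i in (0, n) or j in (0, n) else 0.0 for j in range(n + 1)]
     for i in range(n + 1)]
f = [[0.0] * (n + 1) for _ in range(n + 1)]
red_black_gauss_seidel(u, f, 1.0 / n, sweeps=200)
```

Because points of one colour depend only on points of the other colour, the order in which points of the same colour are visited can be changed freely without altering the result, which is what allows differently structured programs to agree bit for bit.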

The programs were compiled with DEC Fortran77 on a DEC PWS 500au. The number of accesses that hit in each level of the memory hierarchy and the number of cycles spent for each reason were measured with the profiling tool DCPI.

DEC PWS 500au:

Hardware:

CPU: 500 MHz Alpha A21164 (ev56)
AlphaPC 164LX Motherboard
2 MB 3. Level Cache (board level)
128 MByte SDRAM

Software:

Digital UNIX V4.0
DEC C V5.2-033 (Compiler driver: 3.11)
used options: -O5 -ansi_alias -fast -tune host
DEC F77 Driver X4.1-134-335G (Compiler driver: X4.1-6)
used options: -extend_source -fast -notransform_loops -O5 -tune host
DCPI V2.11

Programs:

Table: Memory Access Behaviour

The table below describes the memory behaviour of the "Red-Black Gauss-Seidel Relaxation" program for different data set sizes. In particular, it examines which levels of the memory hierarchy serve the data accesses and how frequently each level is used.

Memory access behaviour
Size	MBytes/sec	% of all accesses which go into
		±	1. Level	2. Level	3. Level	Memory

Size:: The program operates on a (Size+1) x (Size+1) grid in 2D. The number of unknowns is (Size-1)².
MBytes/sec:: Average memory bandwidth achieved during execution. Accesses to registers are not counted. The number is calculated by dividing the number of accesses to the first level cache that were actually performed by the execution time of the program.
±:: The theoretical number of memory accesses is proportional to the number of relaxations performed by the algorithm (6 loads and 1 store per relaxation). This expected number of memory accesses is taken as 100%. The column shows the fraction of expected memory accesses that were not observed by DCPI. The number is positive if fewer accesses were observed than expected. If the number is very high, the compiler was probably able to reuse data held in registers instead of reloading it from cache or main memory. (DCPI can count accesses to the first level cache but not accesses to registers.) E.g. if the column shows "25", then one quarter of all expected memory accesses were satisfied from registers.
1. Level:: The fraction of all expected memory accesses that are served by the first level of the memory hierarchy. It is measured with DCPI by counting the accesses to the L1 cache and the L1 cache misses (= accesses which cannot be served by the L1 cache).
2. Level:: The fraction of all expected memory accesses that are served by the second level of the memory hierarchy. It is measured with DCPI by counting the L1 misses (= accesses which go to the L2 cache) and the L2 cache misses (= accesses which cannot be served by the L2 cache).
3. Level:: The fraction of all expected memory accesses that are served by the third level of the memory hierarchy. It is measured with DCPI by counting the L2 misses (= accesses which go to the L3 cache) and the L3 cache misses (= accesses which cannot be served by the L3 cache).
Memory:: The fraction of all expected memory accesses that are not served by any of the caches and must be served by main memory. It is measured with DCPI by counting the L3 cache misses.
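
One way to read the column definitions above is sketched below. It is an interpretation, not the script used to produce the table: the function name and the counter parameters are hypothetical, and the input values in the usage example are invented for illustration. Under this reading, the five percentage columns add up to 100.

```python
def memory_access_columns(relaxations, l1_accesses, l1_misses, l2_misses, l3_misses):
    """Derive the percentage columns from raw counts (hypothetical inputs)."""
    # Expected accesses: 6 loads and 1 store per relaxation, taken as 100%.
    expected = 7 * relaxations

    def pct(count):
        return 100.0 * count / expected

    return {
        "±": pct(expected - l1_accesses),         # expected but not observed
        "1. Level": pct(l1_accesses - l1_misses),  # served by the L1 cache
        "2. Level": pct(l1_misses - l2_misses),    # served by the L2 cache
        "3. Level": pct(l2_misses - l3_misses),    # served by the L3 cache
        "Memory": pct(l3_misses),                  # served by main memory
    }


# Usage with made-up counts: 100 relaxations, so 700 expected accesses.
columns = memory_access_columns(100, 525, 175, 70, 35)
for name, value in columns.items():
    print(name, value)
```

With these invented counts, a quarter of the expected accesses never reach the L1 cache, which corresponds to the "25" example in the ± description.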

Table: Runtime Behaviour

The table below describes the runtime behaviour of the "Red-Black Gauss-Seidel Relaxation" program for different data set sizes.

Runtime behaviour
Size	MFlops/sec	% of cycles used for
		±	Base	Exec	Cache	DTB	Branch	R dep	Nops

Size:: The program operates on a (Size+1) x (Size+1) grid in 2D. The number of unknowns is (Size-1)².
MFlops/sec:: Achieved MFlops/sec rate. It is assumed that 10 floating-point operations are required for the relaxation of each node. The programs are slightly obfuscated so that the version of the DEC Fortran77 compiler used cannot eliminate operations.
±:: The remainder of the cycles, i.e. those not spent for any of the reasons shown. The remainder may include measurement errors or one of the following reasons for CPU stalls: instruction cache misses, instruction TLB misses, slotting problems, busy functional units (IMUL or FDIV), etc.
Base:: The measurements are performed with DCPI, which assumes that 2 instructions are issued per cycle. Hence, if two instructions are issued per cycle, the CPU is reported as performing 100% execution. However, the CPU of the DEC PWS 500au can issue up to four instructions per cycle. If the CPU issues more than 2 instructions per cycle, this is reported as "unexplained gain" and added to the 100%. So when the CPU issues 3 instructions each cycle, the Base column shows 150%.
Exec:: Per cent of cycles spent executing instructions. If 100% execution is achieved, then on average 2 instructions are issued per cycle.
Cache:: Per cent of cycles spent on data cache miss stalls. If 100% stalls occur, then on average 2 instructions are stalled each cycle.
DTB:: Per cent of cycles spent on data translation lookaside buffer miss stalls. If 100% stalls occur, then on average 2 instructions are stalled each cycle.
Branch:: Per cent of cycles spent on branch misprediction stalls. If 100% stalls occur, then on average 2 instructions are stalled each cycle.
R dep:: Per cent of cycles stalled because of register dependencies on previous instructions. Possible reasons are write-read conflicts (which cause a stall) and write-write conflicts (which prohibit dual issue of the instructions).
Nops:: Per cent of cycles spent executing nops. If 100% nops occur, then on average 2 nops are executed per cycle.
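
The arithmetic behind the MFlops/sec and Base columns can be sketched as follows. The function names and parameters are hypothetical, the input values are invented, and the sketch assumes that one sweep relaxes every unknown exactly once, which the definitions above do not state explicitly.

```python
def mflops_rate(size, sweeps, seconds):
    """MFlops/sec, assuming 10 floating-point operations per relaxed node."""
    unknowns = (size - 1) ** 2
    flops = 10 * unknowns * sweeps  # assumption: each sweep relaxes every unknown once
    return flops / seconds / 1e6


def base_percent(instructions_issued, cycles):
    """Base column: issue rate relative to DCPI's assumed 2 instructions per cycle."""
    return 100.0 * (instructions_issued / cycles) / 2


# Usage with made-up numbers.
print(mflops_rate(size=1025, sweeps=10, seconds=1.0))         # rate for a 1026 x 1026 grid
print(base_percent(instructions_issued=3000, cycles=1000))    # 3 instructions per cycle
```

The second call reproduces the case from the Base description: issuing 3 instructions per cycle corresponds to 150%.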

cs10-dime@fau.de

Last Modified: 10 January 2008