banner image

Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations

logo
home
staff
coorperations
publications
talks
tutorials
software
results
contact

A cooperation
between
lss logo
and
lrr logo.

Funded by
dfg logo.

rb2.F Program Description

File Name:

rb2.F

Description:

Data Dependencies in Red-Black Relaxation Algorithm When implementing the standard red-black Gauss-Seidel algorithm, the usual practice is to do one complete sweep of the grid from bottom to top updating all the red nodes and then one complete sweep updating all of the black nodes. The first point to note is that as a red node in row i is updated the black node in row i-1 directly underneath it may also be updated. If a 5-point stencil is placed over one of the black nodes (see figure) then we can see that all of the red points it touches are up to date (as long as the red node above it is up to date). Consequently, we update a red node in a row i and the black node in row i-1 directly underneath it in pairs in all rows i (except the first row of the grid) before moving to the next red node in row i.

Comment:

Instead of doing a red sweep and then a black sweep, we just do one sweep of the grid updating the red and black nodes as we move through. The consequence is that the grid must be transfered from main memory to cache only once per update sweep instead of twice, as long as at least four lines of the grid fit in the cache. Three lines must fit in the cache to provide the data for the update of the black points and one additional line for the update of the red points in the line above the three lines.

An additinal effect which can improve the data locality is that the compiler might keep the value of the two updated nodes in a register, so that the update of the black node saves 2 load operations. Since 14 memory operations are required for the two update operations (6 loads and 1 store for each update) about 15% of all memory operations can be saved.

Results:

Memory access behaviour
Size MBytes
/sec
% of all access which go into
± 1. Level 2. Level 3. Level Memory
16 1813.7 19.6 80.4 0.0 0.0 0.0
32 2025.4 21.0 62.8 16.1 0.0 0.0
64 2579.5 18.4 70.4 10.9 0.3 0.0
128 1579.7 22.3 64.7 6.1 6.8 0.0
256 1567.0 21.6 42.6 30.2 5.4 0.0
512 544.3 21.1 31.5 41.9 1.9 3.6
1024 499.6 20.9 28.9 43.1 3.4 3.6
2048 349.3 20.7 29.2 34.5 12.0 3.6

Runtime behaviour
Size MFlops
/sec
% of cycles used for
± Base Exec Cache DTB Branch R dep Nops
16 402.6 3.5 114.8 62.3 0.6 5.2 6.2 29.6 7.4
32 458.0 2.1 126.1 76.6 14.4 4.0 3.5 13.5 12.0
64 564.2 -0.8 154.8 112.6 7.4 14.2 4.3 0.0 17.1
128 363.0 -1.2 127.7 64.4 46.0 5.1 2.5 0.0 10.9
256 357.0 -2.2 121.5 62.7 41.3 6.7 2.4 0.0 10.6
512 123.2 -0.9 105.3 21.2 68.8 11.8 0.8 0.0 3.6
1024 112.8 -0.4 103.6 19.1 74.3 6.7 0.7 0.0 3.2
2048 78.6 -0.1 102.1 12.9 85.3 1.4 0.5 0.0 2.1
Table explanation

cs10-dime@fau.de
Last Modified: 10 January 2008
Valid HTML 4.01! Powered by vim