rb2.F Program
Description
File Name:rb2.F
Description:
When implementing the
standard red-black Gauss-Seidel algorithm, the usual
practice is to do one complete sweep of the grid from
bottom to top updating all the red nodes and then one
complete sweep updating all of the black nodes. The
first point to note is that as a red node in row
i is updated the black node in row
i-1 directly underneath it may also be
updated. If a 5-point stencil is placed over one of
the black nodes (see figure) then we can see that all
of the red points it touches are up to date (as long
as the red node above it is up to date).
Consequently, we update a red node in a row
i and the black node in row i-1
directly underneath it in pairs in all rows
i (except the first row of the grid) before
moving to the next red node in row i.
Comment:
Instead of doing a red sweep and then a black
sweep, we just do one sweep of the grid updating the
red and black nodes as we move through. The
consequence is that the grid must be transfered from
main memory to cache only once per update sweep
instead of twice, as long as at least four lines of
the grid fit in the cache. Three lines must fit in
the cache to provide the data for the update of the
black points and one additional line for the update
of the red points in the line above the three
lines.
An additinal effect which can improve the data
locality is that the compiler might keep the value of
the two updated nodes in a register, so that the
update of the black node saves 2 load operations.
Since 14 memory operations are required for the two
update operations (6 loads and 1 store for each
update) about 15% of all memory operations can be
saved.
Results:
Memory access behaviour
Size |
MBytes
/sec |
% of all access which go
into |
± |
1. Level |
2. Level |
3. Level |
Memory |
16 |
1813.7 |
19.6 |
80.4 |
0.0 |
0.0 |
0.0 |
32 |
2025.4 |
21.0 |
62.8 |
16.1 |
0.0 |
0.0 |
64 |
2579.5 |
18.4 |
70.4 |
10.9 |
0.3 |
0.0 |
128 |
1579.7 |
22.3 |
64.7 |
6.1 |
6.8 |
0.0 |
256 |
1567.0 |
21.6 |
42.6 |
30.2 |
5.4 |
0.0 |
512 |
544.3 |
21.1 |
31.5 |
41.9 |
1.9 |
3.6 |
1024 |
499.6 |
20.9 |
28.9 |
43.1 |
3.4 |
3.6 |
2048 |
349.3 |
20.7 |
29.2 |
34.5 |
12.0 |
3.6 |
Runtime behaviour
Size |
MFlops
/sec |
% of cycles used for |
± |
Base |
Exec |
Cache |
DTB |
Branch |
R dep |
Nops |
16 |
402.6 |
3.5 |
114.8 |
62.3 |
0.6 |
5.2 |
6.2 |
29.6 |
7.4 |
32 |
458.0 |
2.1 |
126.1 |
76.6 |
14.4 |
4.0 |
3.5 |
13.5 |
12.0 |
64 |
564.2 |
-0.8 |
154.8 |
112.6 |
7.4 |
14.2 |
4.3 |
0.0 |
17.1 |
128 |
363.0 |
-1.2 |
127.7 |
64.4 |
46.0 |
5.1 |
2.5 |
0.0 |
10.9 |
256 |
357.0 |
-2.2 |
121.5 |
62.7 |
41.3 |
6.7 |
2.4 |
0.0 |
10.6 |
512 |
123.2 |
-0.9 |
105.3 |
21.2 |
68.8 |
11.8 |
0.8 |
0.0 |
3.6 |
1024 |
112.8 |
-0.4 |
103.6 |
19.1 |
74.3 |
6.7 |
0.7 |
0.0 |
3.2 |
2048 |
78.6 |
-0.1 |
102.1 |
12.9 |
85.3 |
1.4 |
0.5 |
0.0 |
2.1 |
Table
explanation
|