Impact of data
locality optimizations on red-black Gauss Seidel
performance
Overview:
The smoother is one of the most computationally
expensive parts of a multigrid algorithm. Therefore,
improving the performance of this relatively small
module can cause a significant improvement of the
performance of the whole multigrid method.
We started our data locality studies with a
straightforward implementation of red-black
Gauss-Seidel in Fortran. We consider Poisson's
equation delta u=f in two dimensions, assuming
boundary conditions of Dirichlet type. We chose a
uniform grid and discretise the Laplacian operator
using the standard 5-point stencil with constant
coefficients.
We performed several transformations on the source
code and the best performances of all transformations
compared to the standard red-black Gauss-Seidel
algorithm for different grid sizes on a DEC PWS 500au
are shown in the figure below.
Benchmark Results:

MFlops/sec rates for Red-Black Gauss-Seidel
The figure clearly shows that the performance can
be improved for every grid size; even for the small
ones. However, the greatest improvements can be
achieved when the grid does not fit completely in one
of the caches of the memory hierachy and the data
must be fetched from main memory. In this case the
improvement in the MFlops/sec rates can be more than
a factor of 4. The table below shows the MFlops/sec
rates of different code transformations for several
grid sizes.
Program |
Melt |
Pad |
16 |
32 |
64 |
128 |
256 |
512 |
1024 |
2048 |
rb1.F |
1 |
no |
346.9 |
354.8 |
453.9 |
205.5 |
182.9 |
63.7 |
58.8 |
55.9 |
rb2.F |
1 |
no |
402.6 |
458.0 |
564.2 |
363.0 |
357.0 |
123.2 |
112.8 |
78.6 |
rb3.F |
1 |
no |
337.0 |
402.4 |
519.2 |
314.6 |
248.9 |
108.3 |
98.9 |
83.2 |
rb4.F |
2 |
no |
332.1 |
420.6 |
541.9 |
359.5 |
294.3 |
161.7 |
143.0 |
109.1 |
3 |
no |
341.8 |
400.3 |
517.5 |
372.0 |
311.4 |
197.3 |
156.3 |
119.9 |
rb5.F |
2 |
no |
391.3 |
489.2 |
609.9 |
404.1 |
364.6 |
180.1 |
148.6 |
89.6 |
3 |
no |
397.4 |
444.3 |
561.0 |
404.3 |
368.2 |
226.9 |
150.9 |
92.5 |
3 |
yes |
395.4 |
524.8 |
481.7 |
406.5 |
376.9 |
222.2 |
167.8 |
129.3 |
rb6.F |
2 |
no |
286.7 |
282.3 |
237.0 |
231.2 |
206.5 |
139.0 |
115.5 |
85.8 |
3 |
no |
325.7 |
317.4 |
269.9 |
264.3 |
248.1 |
177.6 |
120.6 |
91.5 |
rb7.F |
2 |
no |
355.3 |
378.0 |
362.5 |
191.1 |
276.7 |
158.3 |
138.3 |
85.4 |
3 |
no |
354.2 |
376.0 |
362.1 |
192.7 |
275.0 |
163.5 |
138.7 |
86.6 |
rb8.F |
2 |
no |
348.8 |
373.2 |
363.5 |
304.6 |
271.1 |
162.4 |
141.8 |
91.4 |
3 |
no |
348.7 |
372.3 |
363.6 |
305.1 |
271.2 |
162.8 |
142.5 |
91.8 |
rb9.F |
4 |
no |
337.0 |
331.5 |
361.3 |
314.6 |
299.3 |
45.6 |
87.3 |
57.4 |
4 |
yes |
327.7 |
406.3 |
412.9 |
388.6 |
391.7 |
264.7 |
267.9 |
250.6 |
MFlops/sec Rates for Different Implementations
of Red-Black Gauss-Seidel
Working Set Sizes:
All optimized programs try to reuse the data as
soon as possible to reduce the size of the
working set. The working set is the
data which is needed for the current computation plus
the data which was loaded for an earlier computation
and is needed again later. In general the working set
changes during the computation of the program since
new data may be needed for computation and old data
can be discarded because it will not be reused again
in the program.
In the optimal case the whole working set fits in
L1 cache during the whole computation. However, in
general the working set is to big to completely fit
in any of the caches. The figure below shows an
overview what amount of data fits in the caches of
the DEC PWS 500au. In the case of standard red-black
Gauss-Seidel, the working set of the program is
equivalent to the whole grid plus the right-hand side
of the equation.
Size |
Memory
Consumption of |
Data which fits
in |
Line |
Grid |
Grid + RHS |
L1 cache |
L2 cache |
L3 cache |
8 |
72 Byte |
648 Byte |
1.3 KByte |
Grid+RHS |
16 |
136 Byte |
2.2 KByte |
4.5 KByte |
Grid+RHS |
32 |
264 Byte |
8.5 KByte |
17.0 KByte |
31.0 lines |
Grid+RHS |
64 |
520 Byte |
33.0 KByte |
66.0 KByte |
15.8 lines |
Grid+RHS |
128 |
1 KByte |
130.0 KByte |
260.0 KByte |
7.9 lines |
95.2 lines |
Grid+RHS |
256 |
2 KByte |
516.0 KByte |
1.0 MByte |
3.9 lines |
47.8 lines |
Grid+RHS |
512 |
4 KByte |
2.0 MByte |
4.0 MByte |
1.9 lines |
23.9 lines |
511.0 lines |
1024 |
8 KByte |
8.0 MByte |
16.0 MByte |
0.9 lines |
11.9 lines |
255.7 lines |
2048 |
16 KByte |
32.0 MByte |
64.0 MByte |
0.4 lines |
5.9 lines |
127.9 lines |
Memory Consumption of Red-Black Gauss-Seidel
|