banner image

Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations

logo
home
staff
coorperations
publications
talks
tutorials
software
results
contact

A cooperation
between
lss logo
and
lrr logo.

Funded by
dfg logo.

Impact of data locality optimizations on red-black Gauss Seidel performance

Overview:

The smoother is one of the most computationally expensive parts of a multigrid algorithm. Therefore, improving the performance of this relatively small module can cause a significant improvement of the performance of the whole multigrid method.

We started our data locality studies with a straightforward implementation of red-black Gauss-Seidel in Fortran. We consider Poisson's equation delta u=f in two dimensions, assuming boundary conditions of Dirichlet type. We chose a uniform grid and discretise the Laplacian operator using the standard 5-point stencil with constant coefficients.

We performed several transformations on the source code and the best performances of all transformations compared to the standard red-black Gauss-Seidel algorithm for different grid sizes on a DEC PWS 500au are shown in the figure below.

Benchmark Results:

MFlops/sec rates for Red-Black Gauss-Seidel
MFlops/sec rates for Red-Black Gauss-Seidel

The figure clearly shows that the performance can be improved for every grid size; even for the small ones. However, the greatest improvements can be achieved when the grid does not fit completely in one of the caches of the memory hierachy and the data must be fetched from main memory. In this case the improvement in the MFlops/sec rates can be more than a factor of 4. The table below shows the MFlops/sec rates of different code transformations for several grid sizes.

Program Melt Pad 16 32 64 128 256 512 1024 2048
rb1.F 1 no 346.9 354.8 453.9 205.5 182.9 63.7 58.8 55.9
rb2.F 1 no 402.6 458.0 564.2 363.0 357.0 123.2 112.8 78.6
rb3.F 1 no 337.0 402.4 519.2 314.6 248.9 108.3 98.9 83.2
rb4.F 2 no 332.1 420.6 541.9 359.5 294.3 161.7 143.0 109.1
3 no 341.8 400.3 517.5 372.0 311.4 197.3 156.3 119.9
rb5.F 2 no 391.3 489.2 609.9 404.1 364.6 180.1 148.6 89.6
3 no 397.4 444.3 561.0 404.3 368.2 226.9 150.9 92.5
3 yes 395.4 524.8 481.7 406.5 376.9 222.2 167.8 129.3
rb6.F 2 no 286.7 282.3 237.0 231.2 206.5 139.0 115.5 85.8
3 no 325.7 317.4 269.9 264.3 248.1 177.6 120.6 91.5
rb7.F 2 no 355.3 378.0 362.5 191.1 276.7 158.3 138.3 85.4
3 no 354.2 376.0 362.1 192.7 275.0 163.5 138.7 86.6
rb8.F 2 no 348.8 373.2 363.5 304.6 271.1 162.4 141.8 91.4
3 no 348.7 372.3 363.6 305.1 271.2 162.8 142.5 91.8
rb9.F 4 no 337.0 331.5 361.3 314.6 299.3 45.6 87.3 57.4
4 yes 327.7 406.3 412.9 388.6 391.7 264.7 267.9 250.6
MFlops/sec Rates for Different Implementations of Red-Black Gauss-Seidel

Working Set Sizes:

All optimized programs try to reuse the data as soon as possible to reduce the size of the working set. The working set is the data which is needed for the current computation plus the data which was loaded for an earlier computation and is needed again later. In general the working set changes during the computation of the program since new data may be needed for computation and old data can be discarded because it will not be reused again in the program.

In the optimal case the whole working set fits in L1 cache during the whole computation. However, in general the working set is to big to completely fit in any of the caches. The figure below shows an overview what amount of data fits in the caches of the DEC PWS 500au. In the case of standard red-black Gauss-Seidel, the working set of the program is equivalent to the whole grid plus the right-hand side of the equation.

Size Memory Consumption of Data which fits in
Line Grid Grid + RHS L1 cache L2 cache L3 cache
8 72 Byte 648 Byte 1.3 KByte Grid+RHS
16 136 Byte 2.2 KByte 4.5 KByte Grid+RHS
32 264 Byte 8.5 KByte 17.0 KByte 31.0 lines Grid+RHS
64 520 Byte 33.0 KByte 66.0 KByte 15.8 lines Grid+RHS
128 1 KByte 130.0 KByte 260.0 KByte 7.9 lines 95.2 lines Grid+RHS
256 2 KByte 516.0 KByte 1.0 MByte 3.9 lines 47.8 lines Grid+RHS
512 4 KByte 2.0 MByte 4.0 MByte 1.9 lines 23.9 lines 511.0 lines
1024 8 KByte 8.0 MByte 16.0 MByte 0.9 lines 11.9 lines 255.7 lines
2048 16 KByte 32.0 MByte 64.0 MByte 0.4 lines 5.9 lines 127.9 lines
Memory Consumption of Red-Black Gauss-Seidel

cs10-dime@fau.de
Last Modified: 10 January 2008
Valid HTML 4.01! Powered by vim