More Background Information
Computer Architecture:
Over the last 10 years the speed of
microprocessors has increased at a rate of around 80%
per year [3]. Through the use of pipelining, fast
multiply-add operations and other techniques, a
single processor can reach a theoretical peak speed of
around 1000 Mflops (as of March 1997). The development of
processors with higher clock rates and more internal
parallelism leads to the projection that they will
reach speeds as high as 4 Gflops by the year 2000
[12].
Relative to the computational speed, memory
access times and throughput have improved very
slowly, at a rate of only 5-10% per year [3].
Therefore, in modern computers the main bottleneck is
the time it takes to read data from and write data to
memory. As a concrete example, consider the AlphaServer
8400 from DEC [8]. Its maximum computational
speed is over 600 Mflops, which would require a
memory bandwidth of 14400 Mbytes per second for a
daxpy [10] operation, but the main memory throughput
is only 200 Mbytes per second [11].
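To make the imbalance concrete, the following sketch (in C, purely
illustrative) shows the daxpy operation; the byte and flop counts in the
comments are the standard accounting for 8-byte doubles and are given only
to illustrate the ratio of memory traffic to arithmetic.

/* Sketch of the BLAS daxpy operation y := a*x + y [10].  Each loop     */
/* iteration performs 2 floating-point operations (one multiply, one    */
/* add) but moves 24 bytes of data: two 8-byte loads (x[i], y[i]) and   */
/* one 8-byte store (y[i]).  Running anywhere near the peak flop rate   */
/* therefore requires a memory bandwidth far above the roughly          */
/* 200 Mbytes per second that main memory can deliver.                  */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}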
A cost-effective and realisable solution to such
memory bottlenecks is the use of a hierarchical
memory system. Near the bottom of the hierarchy sits
the main memory, usually composed of cheaper DRAM
(dynamic random access memory), while at the
top sits the fast-access cache, composed of SRAM
(static random access memory). There are many
different variations on this theme, with some
manufacturers offering several layers of cache
memory, both on and off chip. Note that a
hierarchical view of memory is also very
important in parallel computing and will become even
more so as the concept of virtual memory takes a
stronger hold. If programmers are not aware of the
underlying memory structure, they will not be
able to exploit the full potential of the system.
When looking at the system as a whole, we also
have to take the software, such as compilers and
optimisers, into account, as it can have a
considerable effect on performance. A good
compiler can organise the code to make efficient use
of pipelining and the memory hierarchy by applying
techniques such as loop unrolling, blocking and prefetch
operations [4, 5]. Performance tools and profilers
also play an important role.
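As an illustration of one such transformation, the sketch below shows a
dot-product loop before and after four-way unrolling. It is only a schematic
example of the kind of code a compiler (or programmer) might produce, not
output from any particular compiler, and it assumes the vector length is a
multiple of four.

/* Original loop: one multiply-add per iteration, with every addition   */
/* depending on the previous one through the single accumulator s.      */
double dot(int n, const double *x, const double *y)
{
    int i;
    double s = 0.0;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Four-way unrolled version: four independent partial sums expose more */
/* instruction-level parallelism to the pipeline and reduce the number  */
/* of branches (n assumed to be a multiple of 4 to keep the sketch      */
/* short).                                                               */
double dot_unrolled(int n, const double *x, const double *y)
{
    int i;
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}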
Numerical Algorithms:
The effectiveness of exploiting data locality has been
well demonstrated in LAPACK [1], a linear algebra package
for dense matrices, and in FFT (fast Fourier transform)
algorithms (see for example [2]). However, very
little has been done with iterative techniques, and
research into more advanced iterative methods such as
multigrid has only just begun [7].
The most efficient methods (in terms of the number of
operations) for the solution of elliptic
partial differential equations are the multilevel
methods [9]. Other methods which also do well
include the Conjugate Gradient method with
hierarchical preconditioners [13, 14]. In general,
these methods converge at a rate independent of the
number of unknowns, or at a rate only slightly
(logarithmically) dependent on the number of
unknowns.
The theory behind multigrid methods for
elliptic partial differential equations is well
developed. The same cannot be said of multigrid
software: there is a definite lack of software
which is both efficient and able to handle complex
problems.
The difficulty with iterative methods can be illustrated
by the following example. The program
mgd9v2 [6] is a robust multigrid program
designed to solve 2D elliptic partial differential
equations and is optimised for vector computers.
Around 35% of the computation is spent calculating
the residual; however, on the HP 720
the residual calculation achieves only 6 Mflops out of a
possible 50 Mflops, and the efficiency is worse on the
HP 735, with 9 Mflops out of a possible 198
Mflops.
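The behaviour is typical of stencil-based residual sweeps. The sketch below
shows a simple five-point residual on a regular grid (not the actual mgd9v2
kernel, which uses a matrix-dependent nine-point stencil); it illustrates why
such a sweep is bound by memory rather than by arithmetic.

/* Illustrative five-point residual r = f - A*u on an nx-by-ny grid,    */
/* stored row by row, with boundary values assumed fixed.  Each         */
/* interior point costs about 6 flops but reads from three large        */
/* arrays, so once the grid no longer fits in cache the sweep runs at   */
/* the speed of main memory, not of the floating-point units.           */
void residual5(int nx, int ny, const double *u, const double *f, double *r)
{
    int i, j;
    for (j = 1; j < ny - 1; j++)
        for (i = 1; i < nx - 1; i++) {
            int k = j * nx + i;
            r[k] = f[k] - (4.0 * u[k]
                           - u[k - 1] - u[k + 1]
                           - u[k - nx] - u[k + nx]);
        }
}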
The limiting factor for iterative methods is thus not
the numerical computation but memory access,
and as the above example implies, if the processor
speed is increased the efficiency of these methods
decreases. The picture looks even bleaker when one
considers that the data structures of most adaptive
refinement algorithms use indirect addressing.
Iterative methods are conventionally written so
that a global operation is called during every
iteration. Such global operations mean that the whole
data structure must be transported through the memory
hierarchy, and thus they do not allow for efficient use
of the cache.
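Schematically, such a solver has the structure sketched below; the function
names are hypothetical and stand in for the usual global operations. Each of
them sweeps all of the unknowns, so the data loaded into cache by one sweep
has been evicted before the next sweep could reuse it.

#include <stdlib.h>

/* Hypothetical global operations, each a full sweep over all n unknowns. */
void relax_sweep(int n, double *u, const double *f);
void compute_residual(int n, const double *u, const double *f, double *r);
double residual_norm(int n, const double *r);

/* Conventional structure of an iterative solver: every pass through    */
/* the loop streams the whole data structure through the memory         */
/* hierarchy three times, leaving the cache no opportunity for reuse.   */
void solve(int n, double *u, const double *f, double tol)
{
    double *r = malloc(n * sizeof(double));
    do {
        relax_sweep(n, u, f);          /* first global sweep  */
        compute_residual(n, u, f, r);  /* second global sweep */
    } while (residual_norm(n, r) > tol);   /* third global sweep  */
    free(r);
}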
As a counterexample, let us look at the efficiency
of the solution of a dense linear system. For a
system of dimension 1000 the standard
implementation gives 8.6 Mflops on the HP 720 and
13.0 Mflops on the HP 735. However, when the program
is optimised for the cache the efficiency increases
to 34.8 Mflops (max = 50 Mflops) on the HP 720 and
103.0 Mflops (max = 198 Mflops) on the HP 735.
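The cache optimisation behind such gains is essentially loop blocking
(tiling), sketched below for the matrix-matrix multiplication that dominates
a blocked dense solver. The block size BS is an assumed, machine-dependent
parameter chosen so that a few blocks fit in cache, and the sketch assumes
the matrix dimension is a multiple of the block size.

#define BS 64   /* assumed block size, tuned so a few BS-by-BS blocks fit in cache */

/* Blocked (tiled) matrix multiplication C := C + A*B for n-by-n         */
/* matrices stored row by row (n assumed to be a multiple of BS).  Each  */
/* block is reused BS times while it is resident in cache, which raises  */
/* the ratio of flops to main-memory traffic from O(1) to O(BS).         */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < n; ii += BS)
        for (kk = 0; kk < n; kk += BS)
            for (jj = 0; jj < n; jj += BS)
                for (i = ii; i < ii + BS; i++)
                    for (k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];
                        for (j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}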
Despite the problems mentioned above, we believe
that iterative methods should also be able to benefit
from cache optimisation. The aim of this project is to
find out to what extent this is true.
References:
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J.
Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A.
McKenney, S. Ostrouchov, and D. Sorensen. LAPACK
Users' Guide. Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1992.
[2] D. Bailey. RISC microprocessors and scientific
computing. RNR Technical Report 93-004, NASA Ames
Research Center, March 1993.
[3] D. C. Burger, J. R. Goodman, and A. Kägi.
The declining effectiveness of dynamic caching for
general-purpose microprocessors. Technical Report
TR-95-1261, University of Wisconsin, Dept. of
Computer Science, Madison, 1995.
[4] S. Carr. Combining optimization for cache and
instruction-level parallelism. Technical Report
95-05, Michigan Technological University, September
1995.
[5] S. Carr and K. Kennedy. Compiler blockability
of numerical algorithms. In Proceedings of
Supercomputing '92, pages 114-124, Minneapolis,
MN, November 1992.
[6] P. M. De Zeeuw. Matrix-dependent prolongations
and restrictions in a blackbox multigrid solver.
J. Comput. Appl. Math., 33:1-27, 1990.
[7] C. Douglas. Caching in with multigrid
algorithms: Problems in two dimensions. IBM Research
Report RC 20091, IBM, Yorktown Heights, NY, 1995.
[8] D. M. Fenwick, D. J. Foley, W. B. Gist, S. R.
VanDoren, and D. Wissel. The AlphaServer 8000 series:
High-end server platform development. Digital
Technical Journal, 3(1), 1995.
[9] W. Hackbusch. Multigrid Methods and
Applications. Springer Verlag, Berlin, 1985.
[10] C. L. Lawson, R. J. Hanson, D. R. Kincaid,
and F. T. Krogh. Basic linear algebra subprograms for
Fortran usage. ACM Trans. Math. Software,
5:308-323, September 1979.
[11] J. D. McCalpin. STREAM benchmark results, 1996.
[12] H. Simon. Parallel supercomputing 1987-2000.
In H. W. Meuer, editor, Supercomputer 1995:
Anwendungen, Architekturen, Trends. K.G. Saur
Verlag, 1995.
[13] J. Xu. A new class of iterative methods for
nonselfadjoint or indefinite problems. SIAM J.
Numer. Anal., 29(2):303-319, April 1992.
[14] H. Yserentant. Old and new convergence proofs
for multigrid methods. Acta Numerica, pages
285-326, 1993.