More Background Information
Computer Architecture:
Over the last 10 years the speed of
microprocessors has increased at a rate of around 80%
per year [3]. Through the use of pipelining, fast
multiply-add operations and other techniques, a
single processor can reach a theoretical peak speed of
around 1000 Mflops (as of March 1997). The development of
processors with higher clock rates and more internal
parallelism leads to the projection that they will
reach speeds as high as 4 Gflops by the year 2000
[12].
Relative to the computational speed, memory
access times and throughput have improved very
slowly, at a rate of only 5-10% per year [3].
Therefore, in modern computers the main bottleneck is
the time it takes to read data from and write data to
memory. As a concrete example, consider the AlphaServer
8400 from DEC [8]. Its maximum computational
speed is over 600 Mflops, which would require a
memory bandwidth of 14400 Mbytes per second for a
daxpy [10] operation, but the main memory throughput
is only 200 Mbytes per second [11].
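To make the imbalance concrete, the following sketch (in C, purely
illustrative) shows the daxpy operation; the byte and flop counts in the
comments are the standard accounting for 8-byte doubles and are given only
to illustrate the ratio of memory traffic to arithmetic.

/* Sketch of the BLAS daxpy operation y := a*x + y [10].  Each loop     */
/* iteration performs 2 floating-point operations (one multiply, one    */
/* add) but moves 24 bytes of data: two 8-byte loads (x[i], y[i]) and   */
/* one 8-byte store (y[i]).  Running anywhere near the peak flop rate   */
/* therefore requires a memory bandwidth far above the roughly          */
/* 200 Mbytes per second that main memory can deliver.                  */
void daxpy(int n, double a, const double *x, double *y)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}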
A cost-effective and realisable solution to such
memory bottlenecks is the use of a hierarchical
memory system. Near the bottom of the hierarchy sits
the main memory, usually composed of cheaper DRAM
(dynamic random access memory), while at the
top sits the fast-access cache, composed of SRAM
(static random access memory). There are many
different variations on this theme, with some
manufacturers offering several layers of cache
memory, both on and off chip. Note that a
hierarchical view of memory is also very
important in parallel computing and will become even
more so as the concept of virtual memory takes a
stronger hold. If programmers are not aware of the
underlying memory structure, they will not be
able to exploit the full potential of the system.
When looking at the system as a whole, we also
have to take the software, such as compilers and
optimisers, into account, as it can have a
considerable effect on performance. A good
compiler can organise the code to make efficient use
of pipelining and the memory hierarchy by applying
techniques such as loop unrolling, blocking and prefetch
operations [4, 5]. Performance tools and profilers
also play an important role.
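As an illustration of one such transformation, the sketch below shows a
dot-product loop before and after four-way unrolling. It is only a schematic
example of the kind of code a compiler (or programmer) might produce, not
output from any particular compiler, and it assumes the vector length is a
multiple of four.

/* Original loop: one multiply-add per iteration, with every addition   */
/* depending on the previous one through the single accumulator s.      */
double dot(int n, const double *x, const double *y)
{
    int i;
    double s = 0.0;
    for (i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}

/* Four-way unrolled version: four independent partial sums expose more */
/* instruction-level parallelism to the pipeline and reduce the number  */
/* of branches (n assumed to be a multiple of 4 to keep the sketch      */
/* short).                                                               */
double dot_unrolled(int n, const double *x, const double *y)
{
    int i;
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (i = 0; i < n; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}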
Numerical Algorithms:
The effectiveness of exploiting data locality has been
well demonstrated in LAPACK [1], a linear algebra package
for dense matrices, and in FFT (fast Fourier transform)
algorithms (see for example [2]). However, very
little has been done with iterative techniques, and
research into more advanced iterative methods such as
multigrid has only just begun [7].
The most efficient methods (in terms of the number of
operations) for the solution of elliptic
partial differential equations are the multilevel
methods [9]. Other methods which also do well
include the Conjugate Gradient method with
hierarchical preconditioners [13, 14]. In general,
these methods converge at a rate independent of the
number of unknowns, or at a rate only slightly
(logarithmically) dependent on the number of
unknowns.
The theory behind multigrid methods for
elliptic partial differential equations is well
developed. The same cannot be said of multigrid
software: there is a definite lack of software
which is both efficient and able to handle complex
problems.
The difficulty with iterative methods can be illustrated
by the following example. The program
mgd9v2 [6] is a robust multigrid program
designed to solve 2D elliptic partial differential
equations and is optimised for vector computers.
Around 35% of the computation is spent calculating
the residual; however, on the HP 720
the residual calculation achieves only 6 Mflops out of a
possible 50 Mflops, and the efficiency is worse on the
HP 735, with 9 Mflops out of a possible 198
Mflops.
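The behaviour is typical of stencil-based residual sweeps. The sketch below
shows a simple five-point residual on a regular grid (not the actual mgd9v2
kernel, which uses a matrix-dependent nine-point stencil); it illustrates why
such a sweep is bound by memory rather than by arithmetic.

/* Illustrative five-point residual r = f - A*u on an nx-by-ny grid,    */
/* stored row by row, with boundary values assumed fixed.  Each         */
/* interior point costs about 6 flops but reads from three large        */
/* arrays, so once the grid no longer fits in cache the sweep runs at   */
/* the speed of main memory, not of the floating-point units.           */
void residual5(int nx, int ny, const double *u, const double *f, double *r)
{
    int i, j;
    for (j = 1; j < ny - 1; j++)
        for (i = 1; i < nx - 1; i++) {
            int k = j * nx + i;
            r[k] = f[k] - (4.0 * u[k]
                           - u[k - 1] - u[k + 1]
                           - u[k - nx] - u[k + nx]);
        }
}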
The limiting factor for iterative methods is thus not
the numerical computation but memory access,
and as the above example implies, if the processor
speed is increased the efficiency of these methods
decreases. The picture looks even bleaker when one
considers that the data structures of most adaptive
refinement algorithms use indirect addressing.
Iterative methods are conventionally written so
that a global operation is called during every
iteration. Such global operations mean that the whole
data structure must be transported through the memory
hierarchy, and thus they do not allow for efficient use
of the cache.
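Schematically, such a solver has the structure sketched below; the function
names are hypothetical and stand in for the usual global operations. Each of
them sweeps all of the unknowns, so the data loaded into cache by one sweep
has been evicted before the next sweep could reuse it.

#include <stdlib.h>

/* Hypothetical global operations, each a full sweep over all n unknowns. */
void relax_sweep(int n, double *u, const double *f);
void compute_residual(int n, const double *u, const double *f, double *r);
double residual_norm(int n, const double *r);

/* Conventional structure of an iterative solver: every pass through    */
/* the loop streams the whole data structure through the memory         */
/* hierarchy three times, leaving the cache no opportunity for reuse.   */
void solve(int n, double *u, const double *f, double tol)
{
    double *r = malloc(n * sizeof(double));
    do {
        relax_sweep(n, u, f);          /* first global sweep  */
        compute_residual(n, u, f, r);  /* second global sweep */
    } while (residual_norm(n, r) > tol);   /* third global sweep  */
    free(r);
}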
As a counterexample, let us look at the efficiency
of the solution of a dense linear system. For a
system of dimension 1000 the standard
implementation gives 8.6 Mflops on the HP 720 and
13.0 Mflops on the HP 735. However, when the program
is optimised for the cache the efficiency increases
to 34.8 Mflops (max = 50 Mflops) on the HP 720 and
103.0 Mflops (max = 198 Mflops) on the HP 735.
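The cache optimisation behind such gains is essentially loop blocking
(tiling), sketched below for the matrix-matrix multiplication that dominates
a blocked dense solver. The block size BS is an assumed, machine-dependent
parameter chosen so that a few blocks fit in cache, and the sketch assumes
the matrix dimension is a multiple of the block size.

#define BS 64   /* assumed block size, tuned so a few BS-by-BS blocks fit in cache */

/* Blocked (tiled) matrix multiplication C := C + A*B for n-by-n         */
/* matrices stored row by row (n assumed to be a multiple of BS).  Each  */
/* block is reused BS times while it is resident in cache, which raises  */
/* the ratio of flops to main-memory traffic from O(1) to O(BS).         */
void matmul_blocked(int n, const double *A, const double *B, double *C)
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < n; ii += BS)
        for (kk = 0; kk < n; kk += BS)
            for (jj = 0; jj < n; jj += BS)
                for (i = ii; i < ii + BS; i++)
                    for (k = kk; k < kk + BS; k++) {
                        double a = A[i * n + k];
                        for (j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}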
Despite the problems mentioned above, we believe
that iterative methods should also be able to benefit
from cache optimisation. The aim of this project is to
find out to what extent this is true.
References:
[1] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J.
Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A.
McKenney, S. Ostrouchov, and D. Sorensen. LAPACK
Users' Guide. Society for Industrial and Applied
Mathematics, Philadelphia, PA, USA, 1992.
[2] D. Bailey. RISC microprocessors and scientific
computing. RNR Technical Report 93-004, NASA Ames
Research Center, March 1993.
[3] D. C. Burger, J. R. Goodman, and A. Kägi.
The declining effectiveness of dynamic caching for
general-purpose microprocessors. Technical Report
TR-95-1261, University of Wisconsin, Dept. of
Computer Science, Madison, 1995.
[4] S. Carr. Combining optimization for cache and
instruction-level parallelism. Technical Report
95-05, Michigan Technological University, September
1995.
[5] S. Carr and K. Kennedy. Compiler blockability
of numerical algorithms. In Proceedings of
Supercomputing '92, pages 114-124, Minneapolis,
MN, November 1992.
[6] P. M. De Zeeuw. Matrix-dependent prolongations
and restrictions in a blackbox multigrid solver.
J. Comput. Appl. Math., 33:1-27, 1990.
[7] C. Douglas. Caching in with multigrid
algorithms: Problems in two dimensions. IBM Research
Report RC 20091, IBM, Yorktown Heights, NY, 1995.
[8] D. M. Fenwick, D. J. Foley, W. B. Gist, S. R.
VanDoren, and D. Wissel. The AlphaServer 8000 series:
High-end server platform development. Digital
Technical Journal, 3(1), 1995.
[9] W. Hackbusch. Multigrid Methods and
Applications. Springer Verlag, Berlin, 1985.
[10] C. L. Lawson, R. J. Hanson, D. R. Kincaid,
and F. T. Krogh. Basic linear algebra subprograms for
Fortran usage. ACM Trans. Math. Software,
5:308-323, September 1979.
[11] J. D. McCalpin. STREAM benchmark results, 1996.
[12] H. Simon. Parallel supercomputing 1987-2000.
In H. W. Meuer, editor, Supercomputer 1995:
Anwendungen, Architekturen, Trends. K.G. Saur
Verlag, 1995.
[13] J. Xu. A new class of iterative methods for
nonselfadjoint or indefinite problems. SIAM J.
Numer. Anal., 29(2):303-319, April 1992.
[14] H. Yserentant. Old and new convergence proofs
for multigrid methods. Acta Numerica, pages
285-326, 1993.