Status:
The DiME project has been completed. If you are
interested, you can have a look at the final
DiME report.
You can read about performance optimization
activities at the Lehrstuhl für
Systemsimulation, Erlangen, in the Performance
Optimization for Future Hardware blog.
Overview:
In order to mitigate the impact of the growing gap
between CPU speed and main memory performance,
today's computer architectures implement hierarchical
memory structures. The idea behind this approach is
to hide both the low bandwidth of main memory and the
high latency of main memory accesses, which are slow
in comparison with the floating-point performance of
the CPUs. At the top of the hierarchy sits a small and
expensive high-speed memory, the CPU registers, which
is integrated into the processor chip and provides
data with low latency and high bandwidth.
Moving further away from the
CPU, the layers of memory successively become larger
and slower. The memory components which are located
between the processor core and main memory are called
cache memories or caches. They are
intended to contain copies of main memory blocks to
speed up accesses to frequently needed data. The next
lower level of the memory hierarchy is the main
memory which is large but also comparatively slow.
External memory such as hard disk drives or remote
memory components in a distributed computing
environment represents the lower end of any common
hierarchical memory design.
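For illustration, these hierarchy levels can be made
visible with a simple microbenchmark. The following is
a minimal C sketch, assuming a POSIX system with
clock_gettime(); all sizes and names are illustrative
assumptions, not taken from any project code. It
repeatedly sweeps over arrays of growing size, and the
average time per access increases whenever the working
set outgrows another cache level.

    /* Sweep over working sets of growing size and report the
     * average time per access; the cost per access jumps whenever
     * the working set no longer fits into the next cache level. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    int main(void) {
        for (size_t n = 1 << 10; n <= (1 << 25); n <<= 1) {
            double *a = malloc(n * sizeof *a);
            if (!a) return 1;
            for (size_t i = 0; i < n; ++i) a[i] = 1.0;

            struct timespec t0, t1;
            volatile double sum = 0.0;           /* prevents elimination */
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (int rep = 0; rep < 10; ++rep)   /* repeated sweeps */
                for (size_t i = 0; i < n; ++i)
                    sum += a[i];
            clock_gettime(CLOCK_MONOTONIC, &t1);

            double ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
            printf("%9zu doubles: %6.2f ns per access\n",
                   n, ns / (10.0 * (double)n));
            free(a);
        }
        return 0;
    }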
Efficient program execution can only be expected
if the codes respect the underlying hierarchical
memory design. Unfortunately, today's compilers
cannot introduce highly sophisticated cache-based
transformations and, consequently, much of this
optimization effort is left to the programmer.
This is particularly true for numerically
intensive codes. Such codes occur in almost all
science and engineering disciplines; e.g.,
computational fluid dynamics, computational physics,
and mechanical engineering. They are characterized
both by a large portion of floating-point operations
and by the fact that most of their execution
time is spent in small computational kernels based on
loop nests. Thus, instruction cache misses have no
significant impact on execution performance. However,
the underlying data sets are typically far too
large to be kept in a higher level of the memory
hierarchy; i.e., in cache.
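As an illustration, the following C sketch shows such
a kernel: one Gauss-Seidel sweep of a 5-point stencil
for -Laplace(u) = f, a typical multigrid smoother. The
grid size and data layout are illustrative
assumptions; for a large N, the grid does not fit into
any cache, so every sweep streams the entire data set
through the memory hierarchy.

    #define N 2048   /* 2048 x 2048 doubles = 32 MB, far larger than cache */

    /* One lexicographic Gauss-Seidel sweep of the 5-point stencil. */
    void gs_sweep(double u[N][N], double f[N][N], double h2)
    {
        for (int i = 1; i < N - 1; ++i)
            for (int j = 1; j < N - 1; ++j)
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                + u[i][j-1] + u[i][j+1]
                                + h2 * f[i][j]);
    }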
We focus on numerical methods for partial
differential equations, particularly multigrid
algorithms, which have been shown to be among the
fastest algorithms for the solution of, e.g., elliptic
boundary value problems. Our work covers
data layout transformations as well as data access
transformations to enhance the cache performance of
multigrid codes. These transformations do not change
the numerical properties (e.g., convergence speed) of
the algorithm and yield results that are identical to
those of the original code (a sketch of both
transformation types is given below). Furthermore, we
are currently
investigating new multigrid algorithms which
inherently contain a higher potential for data
locality due to their non-deterministic formulation.
Our work presently focuses on the Fully Adaptive
Multigrid Method, which was introduced by U. Rüde.
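To make the two transformation classes concrete, the
following C sketch applies both to the Gauss-Seidel
kernel shown above; the padding size, grid size, and
names are illustrative assumptions. Array padding is a
data layout transformation that avoids cache conflict
misses for power-of-two grid sizes; fusing two
successive sweeps is a simple data access
transformation that reuses a band of rows while it is
still resident in cache. The update order changes, but
all data dependences are respected, so the computed
results are identical to those of two separate sweeps.

    #define N   2048
    #define PAD 8     /* layout transformation: pad rows to avoid conflicts */

    static double u[N][N + PAD], f[N][N + PAD];

    /* Access transformation: fuse two successive Gauss-Seidel sweeps.
     * When row i receives its first update, row i-1 (still in cache)
     * immediately receives its second one. */
    void gs_two_fused_sweeps(double h2)
    {
        for (int i = 1; i < N - 1; ++i) {
            for (int j = 1; j < N - 1; ++j)       /* first sweep, row i */
                u[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                + u[i][j-1] + u[i][j+1] + h2 * f[i][j]);
            if (i >= 2)                           /* second sweep, row i-1 */
                for (int j = 1; j < N - 1; ++j)
                    u[i-1][j] = 0.25 * (u[i-2][j] + u[i][j]
                                      + u[i-1][j-1] + u[i-1][j+1]
                                      + h2 * f[i-1][j]);
        }
        for (int j = 1; j < N - 1; ++j)           /* second sweep, last row */
            u[N-2][j] = 0.25 * (u[N-3][j] + u[N-1][j]
                              + u[N-2][j-1] + u[N-2][j+1]
                              + h2 * f[N-2][j]);
    }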
Recent research has shown that our techniques can
also be applied to improve the performance
of the lattice Boltzmann method (LBM). The LBM is a
particle-oriented technique which is mainly used in
computational fluid dynamics (CFD). It is based on a
microscopic model of the moving fluid particles. From
an abstract point of view, the LBM is similar to
Jacobi-like iterative linear solvers on regular mesh
structures, where the iteration loop of the linear
solver corresponds to the time loop in the LBM.
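To illustrate this structural analogy, the following
minimal C sketch (with illustrative names and sizes)
performs Jacobi-style time stepping on two grids: each
step reads the source grid and writes the destination
grid, and the two grids are swapped afterwards, just
as an LBM time step computes the new particle
distribution functions of every cell from the values
of the previous time step.

    #define N     1024
    #define STEPS 100

    static double grid_a[N][N], grid_b[N][N];

    int main(void)
    {
        double (*src)[N] = grid_a, (*dst)[N] = grid_b;

        for (int t = 0; t < STEPS; ++t) {  /* time loop ~ iteration loop */
            for (int i = 1; i < N - 1; ++i)
                for (int j = 1; j < N - 1; ++j)
                    /* in the LBM, this update would be the stream-and-
                     * collide step for the distribution functions of
                     * cell (i, j) */
                    dst[i][j] = 0.25 * (src[i-1][j] + src[i+1][j]
                                      + src[i][j-1] + src[i][j+1]);
            double (*tmp)[N] = src; src = dst; dst = tmp;  /* swap grids */
        }
        return 0;
    }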