banner image

Data Local Iterative Methods For The Efficient Solution of Partial Differential Equations


A cooperation
lss logo
lrr logo.

Funded by
dfg logo.


The DiME project has been completed. If you are interested, you can have a look into the final DiME report.

You can read about performance optimization activities at the Lehrstuhl für Systemsimulation, Erlangen, in the Performance Optimization for Future Hardware blog.


In order to mitigate the impact of the growing gap between CPU speed and main memory performance, today's computer architectures implement hierarchical memory structures. The idea behind this approach is to hide both the low main memory bandwidth and the latency of main memory accesses which is slow in contrast to the floating-point performance of the CPUs. Usually, there is a small and expensive high speed memory sitting on top of the hierarchy which is usually integrated within the processor chip to provide data with low latency and high bandwidth; i.e., the CPU registers. Moving further away from the CPU, the layers of memory successively become larger and slower. The memory components which are located between the processor core and main memory are called cache memories or caches. They are intended to contain copies of main memory blocks to speed up accesses to frequently needed data. The next lower level of the memory hierarchy is the main memory which is large but also comparatively slow. External memory such as hard disk drives or remote memory components in a distributed computing environment represent the lower end of any common hierarchical memory design.

Efficient program execution can only be expected, if the codes respect the underlying hierarchical memory design. Unfortunately, today's compilers cannot introduce highly sophisticated cache-based transformations and, consequently, much of this optimization effort is left to the programmer.

This is particularly true for numerically intensive codes. Such codes occur in almost all science and engineering disciplines; e.g., computational fluid dynamics, computational physics, and mechanical engineering. They are characterized both by a large portion of floating-point operations as well as by the fact that most of their execution time is spent in small computational kernels based on loop nests. Thus, instruction cache misses have no significant impact on execution performance. However, the underlying data sets are typically by far too large to be kept in a higher level of the memory hierarchy; i.e., in cache.

We focus on numerical methods for partial differential equations, particularly multigrid algorithms, which have been shown to be among the fastest algorithms for the solution of elliptic boundary value problems, for example. Our work covers data layout transformations as well as data access transformations to enhance the cache performance of multigrid codes. These transformations do not change the numerical properties (e.g., convergence speed) of the algorithm and yield results that are identical to the original code. Furthermore, we are currently investigating new multigrid algorithms which inherently contain a higher potential for data locality due to their non-deterministic formulation. Our work presently focuses on the Fully Adaptive Multigrid Method which has been introduced by U. Rüde.

Recent research has shown that our techniques can also be applied in order to speed up the performance of the lattice Boltzmann method (LBM). The LBM is a particle-oriented technique which is mainly used in computational fluid dynamics (CFD). It is based on a microscopic model of the moving fluid particles. From an abstract point of view, the LBM is similar to Jacobi-like iterative linear solvers on regular mesh structures, where the iteration loop of the linear solver corresponds to the time loop in the LBM.
Last Modified: 16 July 2009
Valid HTML 4.01! Powered by vim