Performance Optimization for Future Hardware
Rapid advances in microprocessor development enable scientists to solve their problems faster.
The increasing packing density, with both more computational logic per die and more commercial
off-the-shelf components per supercomputer, makes large computations possible but also raises the
demand for new programming techniques that consider the different levels of parallelism in modern systems.
One of the objectives of this chair is to develop programming techniques that are easy to
understand, apply, and teach, and that help exploit the capabilities of modern systems.
For a long time, single-core performance was improved through higher clock frequencies and better
exploitation of instruction-level parallelism by means of increasingly complex logic, but these
approaches became less and less effective and have apparently passed the point of diminishing returns.
Now the higher packing density is primarily used to place several cores on a single die to
increase the capabilities of a single machine.
However, single-threaded programs hardly benefit from newer CPU generations, as real parallelism
is necessary to make use of the independent cores.
At present there is no notable progress in auto-parallelizing compilers that would take
the burden of code parallelization from the programmer. Minor adaptations and
recompilation are therefore not sufficient to profit from multi-core CPUs; scientists will have to
rethink and rewrite their programs. This issue will grow in relevance, as the
transition to many-core processors (more than a hundred cores) seems to be only a few years away.
The road maps of the major manufacturers show what can be expected next:
different multiprocessing and multithreading capabilities will evolve,
capable of executing (and requiring) many independent threads or processes at the same time.
Heterogeneous processors are also being explored, which provide cores of different types
optimized for certain tasks and impose further requirements on parallelization.
It is even conceivable that systems will have limited reconfiguration abilities in order
to optimize their structure for the task at hand.
Sophisticated memory architectures need to be developed to feed the rising number of cores.
There is a trend toward interleaved memory and multiple memory channels, which
often also exhibit non-uniform memory access behavior.
The memory hierarchy is likely to change, too, possibly with more stages and new topologies
such as shared caches. Local memories in particular could alleviate the expensive data
coherence management.
We therefore investigate hardware that has properties we expect to see in future systems.
This includes, but is not limited to:
- Emerging multi-core systems
While being the most conservative approach, homogeneous multi-core systems will
probably continue to dominate the market. For both manufacturers
and programmers, this slow evolution based on accustomed designs is the most
convenient path. Prominent examples are the Intel Core architecture and the Niagara
processor by Sun. The latter already features eight independent cores
that altogether are able to execute up to 64 threads in hardware.
- Cell Broadband Engine Architecture (CBEA)
The Sony-Toshiba-IBM Cell Broadband Engine, which is also the heart of Sony's
PlayStation 3, is a heterogeneous multi-core processor.
One PowerPC core is mainly intended for flow control and execution of
the operating system.
The computational power lies in the so-called Synergistic Processor Elements.
These cores have been optimized for SIMD vector operations and high memory
bandwidth: they have their own instruction set and do not use caches, but copy data
between main memory and fast, private local stores by means of DMA transfers.
The Cell processor is not only an interesting opportunity to investigate the
challenges of heterogeneous multi-core systems, but also to study
the potential of heterogeneous clusters built upon different CPU architectures.
- Graphical Processing Units (GPUs)
Since the performance demands of modern games and visualization tasks keep growing, graphics cards
feature computing power that outpaces most modern general-purpose CPUs. Due to their original
field of application in graphics, the arithmetic units of a GPU are arranged in a highly parallel and
independent manner. Disregarding some limitations, a GPU can thereby process hundreds of
operations in parallel and thus already represents a sort of many-core architecture today.
With the Compute Unified Device Architecture (CUDA), Nvidia made the first attempt to make these powerful
processors available to applications other than graphics. First studies have shown that the performance
of scientific applications on GPUs no longer depends on the memory bandwidth but on the parallelizability
of the algorithms.
- Field Programmable Gate Arrays (FPGAs)
The history of high-performance computing has always been accompanied by the hassle of adapting the scientific
problem to the available architecture. Programmers have to be aware of the limitations of the computing
resource they intend to use and develop strategies to evade them in order to maximize the performance of
their code. This is the reason why there are calls for using FPGAs to solve scientific problems. FPGAs
are devices, originally used for rapid prototyping by IC developers, that are able to simulate the
behavior of an integrated circuit by executing boolean functions that are stored in lookup tables and connected
by a configurable network.
Today there are advanced devices that feature special computing resources which can be incorporated into
the freely configurable network of lookup tables and interconnections. This makes FPGAs an interesting
candidate for high-performance computing, because they enable scientists to build hardware that exactly
fits the needs of the algorithm.
The drawback of this idea is that broad knowledge of microelectronic design is required to build
a perfectly fitting architecture, which makes FPGAs harder to use for the average scientist or programmer.
Generally, new techniques are required to address multi-threading and fine-grained parallelism,
the distribution of work in homogeneous and heterogeneous environments, and the optimization of
data movement between main memory, caches, and/or local stores. This includes models that
abstract from the complex structure of the hardware and provide a unified interface.
Concluding, the major problems concerning future hardware will be:
- Sequentiality of algorithms
Many problems only provide a limited degree of parallelism or are even inherently sequential.
As real parallelism was an issue only in the HPC sector for years,
algorithms were developed and optimized particularly for serial execution.
- Synchronization overhead
The increasing need for synchronization will become a problem as more and more agents
cooperate on a problem. The actual overhead depends on the degree of synchronization
and on hardware support.
- Memory bandwidth bottleneck
With the growing number of cores per socket, the demand for data from memory will increase, too.
A direct approach is to extend the cache hierarchy with more and larger caches, but
the related data coherence management can counteract this, at least for fully coherent systems.
- Programming models
Featuring instruction-level parallelism, multi-threading, and multi-processing
capabilities, future hardware will be highly complex.
As soon as the number of cores grows to a few hundred, this complexity can no
longer be fully grasped by programmers.
Therefore, programming concepts or even languages are necessary
that enable exploiting the different levels of parallelism with
as few demands on the programmer as possible.
Especially for FPGA-based applications, easy-to-handle frameworks are required that
assist programmers unversed in microelectronics.
In particular, the goals of this project are:
- to identify the factors that will impact performance on future hardware,
- to develop programming techniques that help evade these bottlenecks, and
- to derive simple guidelines that can be easily applied by others.
This project is divided into the following subprojects, which are treated separately and address different current architectures, but pursue the same goal:
- Stürmer, M.; Köstler, H.: A fast full multigrid solver for applications in image processing. In: Numerical Linear Algebra with Applications 15, pp. 187-200, 2008.
- Köstler, H.; Stürmer, M.; Freundl, C.; Rüde, U.: PDE based Video Compression in Real-Time. Technical Report 07-11, 2007.
- Stürmer, M.; Götz, J.; Richter, G.; Rüde, U.: Blood Flow Simulation on the Cell Broadband Engine Using the Lattice Boltzmann Method. Technical Report 07-9, 2007.
- Stürmer, M.: Optimierung des Mehrgitteralgorithmus auf IA64 Rechnerarchitekturen. Diploma thesis, 2006.
- Donath, S.: On Optimized Implementations of the Lattice Boltzmann Method on Contemporary Architectures. Bachelor's thesis, 2004.
- Xilinx Corporation: Many thanks for providing FPGA hardware and software free of charge within the Xilinx University Program (XUP).