Hardware–oriented locality optimizations for iterative methods

Markus Kowarschik
markus.kowarschik@informatik.uni-erlangen.de

Lehrstuhl für Systemsimulation (Prof. Dr. U. Rüde)
Institut für Informatik
Friedrich–Alexander–Universität Erlangen–Nürnberg
Germany

This project is being supported by the DFG:
DiME — data–local iterative methods
(together with LRR–TUM, Prof. Dr. A. Bode)
Outline

• Motivating example

• Cache design aspects

• Code profiling — hardware support and tools

• Techniques to enhance cache utilization (preserving numerical properties)
  – Data layout optimizations
  – Data access optimizations

• Performance results

• Cache-oriented “non-standard” algorithms

• Conclusions
Motivating (frustrating?) example

*Theoretically ...*
modern workstations based on superscalar CPUs can do up to 4000 MFLOPS per CPU (i.e., $4 \cdot 10^9$ floating–point operations per second)

*In practice ...*
we often obtain disappointing results

*Example:* 3D Gauss–Seidel on a Digital PWS 500au (1000 MFLOPS peak):

<table>
<thead>
<tr>
<th>grid size</th>
<th># unknowns</th>
<th>MFLOPS</th>
</tr>
</thead>
<tbody>
<tr>
<td>16</td>
<td>4096</td>
<td>415</td>
</tr>
<tr>
<td>32</td>
<td>32768</td>
<td>194</td>
</tr>
<tr>
<td>64</td>
<td>262144</td>
<td>76</td>
</tr>
<tr>
<td>128</td>
<td>$\approx 2.1 \cdot 10^6$</td>
<td>73</td>
</tr>
</tbody>
</table>

⇒ Need to understand cache effects!
Cache design — a memory hierarchy example

Memory architecture of an Intel Itanium2 machine (1 GHz):

<table>
<thead>
<tr>
<th>Capacity</th>
<th>Bandwidth</th>
<th>Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td>16 KB</td>
<td>16 GB/s</td>
<td>1 cycle</td>
</tr>
<tr>
<td>256 KB</td>
<td>32 GB/s</td>
<td>5+ cycles</td>
</tr>
<tr>
<td>1.5/3 MB</td>
<td>32 GB/s</td>
<td>12+ cycles</td>
</tr>
<tr>
<td>2 GB</td>
<td>6.4 GB/s</td>
<td>&gt;100 cycles</td>
</tr>
</tbody>
</table>

Inevitable: exploit the cache architecture as efficiently as possible!
Cache design — associativity

Denotes the number of cache lines where a main memory block may be copied to

- Fully associative
- Direct mapped
- N-way set associative (here: n=2)

Low associativity (n small) $\Rightarrow$ High potential for cache conflict misses, cache thrashing
Hardware performance counters (= special CPU registers) are used to count various events:

- Data cache misses (for different cache levels)
- TLB (translation lookaside buffer) misses
- Branch mispredictions
- Floating-point and/or integer operations
- etc.

Profiling tools (freely available):

- PAPI: Performance Application Programming Interface, Univ. of Tennessee
- PCL: Performance Counter Library, FZ Jülich
- DCPI: Digital Continuous Profiling Infrastructure, Digital/Compaq only
- etc.
We use a Digital PWS 500au machine for demonstration purposes: Alpha A21164 CPU, 500 MHz, 1000 MFLOPS peak, 3 on–chip performance counters.

Hardware performance monitor based on PCL: hpm

Example: 2D “vanilla” multigrid code, structured grids

% hpm --events PCL_CYCLES,PCL_MFLOPS ./mg
hpm: elapsed time: 5.172 s
hpm: counter 0 : 2564941490 PCL_CYCLES
hpm: counter 1 : 19.635955 PCL_MFLOPS

This is < 2% peak!
% dcpiprof ./mg

<table>
<thead>
<tr>
<th>Column</th>
<th>Total</th>
<th>Period (for events)</th>
</tr>
</thead>
<tbody>
<tr>
<td>dmiss</td>
<td>45745</td>
<td>4096</td>
</tr>
</tbody>
</table>

The numbers given below are the number of samples for each listed event type or, for the ratio of two event types, the ratio of the number of samples for the two event types.

<table>
<thead>
<tr>
<th></th>
<th>dmiss</th>
<th>%</th>
<th>cum%</th>
<th>procedure</th>
<th>image</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>33320</td>
<td>72.84%</td>
<td>72.84%</td>
<td>mgSmooth</td>
<td>./mg</td>
</tr>
<tr>
<td></td>
<td>10008</td>
<td>21.88%</td>
<td>94.72%</td>
<td>mgRestriction</td>
<td>./mg</td>
</tr>
<tr>
<td></td>
<td>2411</td>
<td>5.27%</td>
<td>99.99%</td>
<td>mgProlongation</td>
<td>./mg</td>
</tr>
<tr>
<td></td>
<td>3</td>
<td>0.01%</td>
<td>99.99%</td>
<td>mgInitGrid</td>
<td>./mg</td>
</tr>
<tr>
<td></td>
<td>2</td>
<td>0.00%</td>
<td>100.00%</td>
<td>mgDirectSolve</td>
<td>./mg</td>
</tr>
<tr>
<td></td>
<td>1</td>
<td>0.00%</td>
<td>100.00%</td>
<td>mgVcycle</td>
<td>./mg</td>
</tr>
</tbody>
</table>
### Code profiling — DCPI (2)

<table>
<thead>
<tr>
<th>Category</th>
<th>Percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td>I-cache (not ITB)</td>
<td>0.1% to 7.4%</td>
</tr>
<tr>
<td>ITB/I-cache miss</td>
<td>0.0% to 0.0%</td>
</tr>
<tr>
<td>D-cache miss</td>
<td>24.2% to 27.6%</td>
</tr>
<tr>
<td>DTB miss</td>
<td>53.3% to 57.7%</td>
</tr>
<tr>
<td>Write buffer</td>
<td>0.0% to 0.3%</td>
</tr>
<tr>
<td>Synchronization</td>
<td>0.0% to 0.0%</td>
</tr>
<tr>
<td>Branch mispredict</td>
<td>0.0% to 0.0%</td>
</tr>
<tr>
<td>IMUL busy</td>
<td>0.0% to 0.0%</td>
</tr>
<tr>
<td>FDIV busy</td>
<td>0.0% to 0.5%</td>
</tr>
<tr>
<td>Other</td>
<td>0.0% to 0.0%</td>
</tr>
<tr>
<td>Unexplained stall</td>
<td>0.4% to 0.4%</td>
</tr>
<tr>
<td>Unexplained gain</td>
<td>-0.7% to -0.7%</td>
</tr>
<tr>
<td><strong>Subtotal dynamic</strong></td>
<td><strong>85.1%</strong></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Category</th>
<th>Percentage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Slotting</td>
<td>0.5%</td>
</tr>
<tr>
<td>Ra dependency</td>
<td>3.0%</td>
</tr>
<tr>
<td>Rb dependency</td>
<td>1.6%</td>
</tr>
<tr>
<td>Rc dependency</td>
<td>0.0%</td>
</tr>
<tr>
<td>FU dependency</td>
<td>0.5%</td>
</tr>
<tr>
<td><strong>Subtotal static</strong></td>
<td><strong>5.6%</strong></td>
</tr>
<tr>
<td><strong>Total stall</strong></td>
<td><strong>90.7%</strong></td>
</tr>
<tr>
<td>Useful</td>
<td>7.9%</td>
</tr>
<tr>
<td>Nops</td>
<td>1.3%</td>
</tr>
<tr>
<td><strong>Total execution</strong></td>
<td><strong>9.3%</strong></td>
</tr>
</tbody>
</table>

Jan 13, 2003
Techniques to enhance cache utilization

- **Data layout optimizations:**
  Address data storage schemes in memory

- **Data access optimizations:**
  Address the order in which the data are accessed
Data layout optimizations — cache–aware data structures

Idea: Merge data which are needed together to increase spatial locality: cache lines contain several data items

Example: Gauss–Seidel on $Au = f$, 2D, 5–point stencils:

$$u_i^{(k+1)} = a_{i,i}^{-1} \left( f_i - \sum_{j<i} a_{i,j} u_j^{(k+1)} - \sum_{j>i} a_{i,j} u_j^{(k)} \right), \quad i = 1, \ldots, N$$

typedef struct {
  double f;
  double cCenter, cNorth, cEast, cSouth, cWest;
} eqnData;

double u[N][N]; // Solution vector
eqnData rhsAndCoeff[N][N]; // Right–hand side and coefficients
Data layout optimizations — array padding (1)

Idea: Increase array dimensions to change relative distances between elements ⇒ Avoid severe cache conflict misses; e.g., in stencil computations

Example: 3D arrays

Standard padding in FORTRAN77:

double precision u(xdim + xpad, ydim + ypad, zdim)
Padding approaches:

- Analytic/Algebraic techniques (Rivera/Tseng)
  - Block size (tile size) and paddings depend on array size and cache capacity
  - Often not general enough for realistic problems where several arrays are involved; e.g., CFD: pressure, velocity field, temperature, concentrations of chemical species, etc.

- Exhaustive parameter search
  - AEOS paradigm: Automated Empirical Optimization of Software; e.g.,
    * ATLAS (Automatically Tuned Linear Algebra Software)
    * FFTW (The Fastest Fourier Transform in the West)
  - Searching the parameter space is time-consuming, but currently the most promising cache tuning approach!
Data access optimizations — loop blocking (loop tiling) (1)

Idea: Divide the iteration space into blocks and perform as much work as possible on the data in cache (i.e., on the current block) before switching to the next block ⇒ Enhance spatial and/or temporal locality while respecting data dependencies

Popular textbook example: Matrix multiplication

Before loop blocking:

```plaintext
do J = 1,N
do K = 1,N
do I = 1,N
    C(I,J) = C(I,J) + A(I,K)*B(K,J)
enddo
enddo
enddo
```

After loop blocking:

```plaintext
do KK = 1,N,W // W = tile width
do II = 1,N,H // H = tile height
do J = 1,N
    do K = KK,min(KK+W-1,N)
    do I = II,min(II+H-1,N)
        C(I,J) = C(I,J) + A(I,K)*B(K,J)
    enddo
    enddo
enddo
```
Data access optimizations — loop blocking (2)

Blocking the iteration loop of a linear stationary method means merging successive iterations into a single pass through the data set:
\[ x^{(k+1)} = M x^{(k)} + d, \quad x^{(k+2)} = M(M x^{(k)} + d) + d, \ldots \]

In addition, loops along spatial dimensions can be blocked

*Example:* Red/black Gauss–Seidel (e.g., as a multigrid smoother)
Data access optimizations — other techniques

There is a variety of other data access optimizations

- *Loop interchange*: lessen the impact of non–unit stride accesses

- *Loop fusion*: reduce the number of sweeps through the data set ⇒ Increase temporal locality

- *Data copying*: copy non–contiguous data to contiguous memory locations ⇒ Reduce cache conflicts and/or drops in performance due to limited TLB capacity

- etc.
**DiME project**: data–local iterative methods for the efficient solution of PDEs: [http://www10.informatik.uni-erlangen.de/dime](http://www10.informatik.uni-erlangen.de/dime)

Speedups for Alpha A21164 machine, 500 MHz, 1000 MFLOPS peak:

Left: Gauss–Seidel, 2D, constant 5p stencil (C. Weiβ, formerly LRR–TUM)

Right: V(2,2) MG cycles, 3D, variable 7p stencils
Performance results (2)

Same codes on different architectures: $V(2,2)$ MG cycles, 3D, variable 7p stencils:

left: Intel Itanium2 based HP zx6000, 900 MHz, 1.5 MB L3, 3.6 GFLOPS peak

right: Pentium 4–based Linux PC, 2.4 GHz, 4.8 GFLOPS peak

Preliminary results!!!

MFLOPS rates and speedups in 3D are often disappointing, more work to be done!

Impacts: TLB capacity, dynamic branch prediction, etc.
Performance results (3)

These data layout optimizations and data access optimizations can also be applied to *Lattice Boltzmann Methods*: particle–oriented CFD simulations.

Successive passes through the highly structured data set, Jacobi–like update operation ("stream and collide") of grid cells (*lattice sites*) in each time step.

Code efficiency measured in "million lattice site updates / second"

---

left: 2D LB code, AMD Athlon machine, 700 MHz  
right: 2D LB code, Alpha A21164 machine, 500 MHz
So far: optimizations techniques that do not change numerical properties
(but: might trigger aggressive compiler optimizations, finite precision arithmetic)

One step further: (multilevel) algorithms with different numerical properties;
e.g., Fully Adaptive Multigrid (U. Rüde)

Essential component: adaptive relaxation on \( Ax^* = b \),
\( A = (a_{i,j}) : \) symmetric positive definite

**Definition:** \( \theta_i(x) := a_{i,i}^{-1} e_i^T (b - Ax) \) (scaled residual)

**Theorem:** error reduction for one elementary relaxation step \( x \leftarrow x + \theta_i(x)e_i \) can be
written as follows:

\[
||x^{\text{old}} - x^*||_E^2 - ||x^{\text{new}} - x^*||_E^2 = a_{i,i} \theta_i(x^{\text{old}})^2
\]
Cache–oriented “non–standard” algorithms (2)

\( x \): approximation of \( x^* \),

**ActiveSet:** set of indices of nodes with “large” scaled residuals

**Algorithm:** adaptive relaxation

1. \( \textbf{while} \) ActiveSet \( \neq 0 \) \( \textbf{do} \)
2. \( \textbf{pick} \ i \in \text{ActiveSet} \) // Non–determinism: freedom of choice!
3. \( \textbf{if} \ |\theta_i(x)| > \theta \ \textbf{then} \)
4. \( x \leftarrow x + \theta_i(x)e_i \)
5. \( \text{ActiveSet} \leftarrow \text{ActiveSet} \cup \text{Conn}(i) \)
6. \( \textbf{end if} \)
7. \( \textbf{end while} \)

**Fully Adaptive Multigrid:** adaptive relaxation on the extended positive semidefinite system \( \hat{A}\hat{x}^* = \hat{b} \) which represents the complete grid hierarchy (M. Griebel)

Combine multigrid efficiency with the freedom to choose any node to be updated from the active set (data locality \( \Rightarrow \) higher cache utilization)

Overhead to maintain the active set (\( \approx 25\% \) runtime) motivates *patch–based* instead of *point–based* processing strategies (H. Lötzbeyer)
Conclusions

Gap between CPU speed and main memory performance will continue to increase

“Compiler–oriented” techniques to enhance cache performance:

- Data layout optimizations
- Data access optimizations

Introducing cache optimizations is tedious and error–prone ⇒ source–to–source compilers to automize the introduction of such transformations (e.g., ROSE, LLNL)

Fully Adaptive Multigrid as a “non–standard” multilevel algorithm that provides more flexibility w.r.t. the order of data accesses (i.e., the implementation of the active set)

“Turing machines are not enough”, counting flops is no longer an adequate measure of efficiency! Need for complexity models and locality metrics for hierarchical memories (e.g., Memtropy: measure to quantify the regularity of memory references (D. Keyes))