Hiroshi Okuda, University of Tokyo
Henry Tufo, University of Colorado-Boulder
Obtaining good performance on contemporary supercomputing systems requires extracting unprecedented parallelism and confronting a variety of system imbalances. For applications with high byte-per-flop (BPF) requirements, such as low-order finite element methods that make extensive use of indirect addressing, it is imperative to extract as much as possible from every byte transferred. For example, one often resorts to reordering and blocking to improve cache utilization. The objective of this mini symposium is to explore the variety of techniques employed in practice by the FEM community to increase single-thread performance, with special emphasis on applications that do so at scale (>10k threads) and/or utilize accelerators (e.g., GPUs/FPGAs).