



## Introduction to Tohoku University's New Supercomputer System and R&D Activities in High-Performance Computing

#### Hiroaki Kobayashi

Director and Professor, Cyberscience Center, Tohoku University, koba@cc.tohoku.ac.jp

PC Cluster Workshop in Sendai Feb. 19, 2016



## Missions of Cyberscience Center As a National Supercomputer Center



#### High-Performance Computing Center founded in 1969



- 24/7 operations of large-scale vector-parallel and scalar-parallel systems
- T500 users registered in AY 2014

User support

- Benchmarking, analyzing, and tuning users' programs
- Holding seminars and lectures
- Supercomputing R&D in collaboration with NEC
  - Designing next-generation high-performance computing systems and their applications for highly productive supercomputing
  - 57-year history of collaboration between Tohoku University and NEC on high-performance computing

Education

Teaching and supervising BS, MS, and Ph.D. students as a cooperative laboratory of the Graduate School of Information Sciences, Tohoku University





[Figure: history of the center's systems, with labels 1969, 1982, SX-7 in 2003, and SX-9 in 2008]



#### Tohoku Univ.'s New Supercomputer System (2015.2.20~)








#### Organization of Tohoku Univ. SX-ACE System





#### Features of the SX-ACE Vector Processor

- 4 Core Configuration, each with High-Performance Vector-Processing Unit and Scalar Processing Unit
  - 272Gflop/s of VPU + 4Gflop/s of SPU per socket
    - 68Gflop/s + 1Gflop/s per core
  - 1MB private ADB per core (4MB per socket)
    - Software-controlled on-chip memory for vector load/store
    - 4x compared with SX-9
    - 4-way set-associative
    - MSHR with 512 entries (address+data)
    - 256GB/s to/from Vec. Reg.
      - 4B/F for Multiply-Add operations
  - 256 GB/s memory bandwidth, Shared with 4 cores
    - 1B/F in 4-core Multiply-Add operations
      - ~ 4B/F in 1-core Multiply-Add operations
    - 128 memory banks per socket
- Other improvements and new mechanisms to enhance vector processing capability, especially for efficient handling of short-vector operations and indirect memory accesses
  - Out-of-order execution for vector load/store operations
  - Advanced data forwarding in vector pipe chaining
  - Shorter memory latency than SX-9
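
The B/F (bytes per flop) figures in the list above can be checked by dividing memory bandwidth by peak multiply-add performance. The short derivation below is a sketch that assumes the 64 Gflop/s per-core multiply-add peak listed in the specifications table of this deck; with that assumption it reproduces the 4 B/F and 1 B/F values quoted for one core and four cores.

$$
\mathrm{B/F}_{1\ \text{core}} = \frac{256\ \text{GB/s}}{64\ \text{Gflop/s}} = 4,
\qquad
\mathrm{B/F}_{4\ \text{cores}} = \frac{256\ \text{GB/s}}{4 \times 64\ \text{Gflop/s}} = 1
$$

The same 256 GB/s is shared by all four cores, so a single active core sees the full bandwidth while four active cores each see a quarter of it.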

#### SX-ACE Processor Architecture



Source: NEC



#### High Demands for Vector Systems in Memory-Intensive, Science and Engineering Applications







# **Performance Evaluations of SX-ACE**



#### Specifications of Modern High End Systems

| System               | No. of Sockets/Node | Perf./Socket (Gflop/s) | No. of Cores | Perf./Core (Gflop/s) | Mem. BW (GB/s) | On-chip memory                | NW BW (GB/s)          | Sys. B/F |
|----------------------|---------------------|------------------------|--------------|----------------------|----------------|-------------------------------|-----------------------|----------|
| SX-ACE               |                     | 256                    | 4            | 64                   | 256            | 1MB ADB/core                  | 2 x 4 IXS             | 1.0      |
| SX-9                 | 16                  | 102.4                  | 1            | 102.4                | 256            | 256KB ADB/core                | 2 x 128 IXS           | 2.5      |
| ES2                  | 8                   | 102.4                  | 1            | 102.4                | 256            | 256KB ADB/core                | 2 x 64 IXS            | 2.5      |
| LX 406 (Ivy Bridge)  | 2                   | 230.4                  | 12           | 19.2                 | 59.7           | 256KB L2/core, 30MB shared L3 | 5 IB                  | 0.26     |
| FX10 (SPARC64 IXfx)  | 1                   | 236.5                  | 16           | 14.78                | 85             | 12MB shared L2                | 5 - 50 Tofu NW        | 0.36     |
| K (SPARC64 VIIIfx)   | 1                   | 128                    | 8            | 16                   | 64             | 6MB shared L2                 | 5 - 50 Tofu NW        | 0.5      |
| SR16K M1 (Power7)    | 4                   | 245.1                  | 8            | 30.6                 | 128            | 256KB L2/core, 32MB shared L3 | 2 x 24 - 96 custom NW | 0.52     |

Remark: The listed performance figures are based on the total multiply-add performance of each system.
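
The Sys. B/F column follows directly from the per-socket peak performance and memory bandwidth columns. The sketch below recomputes it from the values in the table; the dictionary layout and function names are illustrative choices, not part of the original slides.

```python
# Per-socket figures copied from the specifications table:
# (peak performance in Gflop/s, memory bandwidth in GB/s).
SYSTEMS = {
    "SX-ACE": (256.0, 256.0),
    "SX-9": (102.4, 256.0),
    "ES2": (102.4, 256.0),
    "LX 406": (230.4, 59.7),
    "FX10": (236.5, 85.0),
    "K": (128.0, 64.0),
    "SR16K M1": (245.1, 128.0),
}


def bytes_per_flop(peak_gflops, mem_bw_gbs):
    """Memory bandwidth available per peak floating-point operation."""
    return mem_bw_gbs / peak_gflops


def system_bf_table(systems=SYSTEMS):
    """Return {system name: B/F rounded to two decimals}."""
    return {
        name: round(bytes_per_flop(peak, bw), 2)
        for name, (peak, bw) in systems.items()
    }


if __name__ == "__main__":
    for name, bf in system_bf_table().items():
        print(f"{name:10s} {bf:.2f}")
```

The rounded results match the table: the vector systems sit at 1.0 to 2.5 B/F, while the scalar systems range from 0.26 to 0.52 B/F.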




#### Applications Used for Evaluation

| Application      | Field                                  | Method                   | Memory access characteristics             | Mesh size                                                        | Code B/F | Actual B/F on SX-ACE |
|------------------|----------------------------------------|--------------------------|-------------------------------------------|------------------------------------------------------------------|----------|----------------------|
| QSFDM GLOBE      | Seismology                             | Spherical 2.5D FDM       | Stencil with sequential memory accesses   | 4.3 x 10<sup>7</sup> grids                                       | 2.16     | 0.78                 |
| Barotropic ocean | OGCM (Ocean General Circulation Model) | Shallow water model      | Stencil with sequential memory accesses   | 4322 x 216                                                       | 1.97     | 1.11                 |
| MHD (FDM)        | MHD                                    | Finite Difference Method | Stencil with sequential memory accesses   | 200 x 1920 x 32                                                  | 3.04     | 1.41                 |
| Seism 3D         | Seismology                             | Finite Difference Model  | Stencil with sequential memory accesses   | 1024 x 512 x 512 <sup>†</sup><br>4096 x 2048 x 2048 <sup>‡</sup> | 2.15     | 1.68                 |
| MHD (Spectral)   | MHD                                    | Pseudo-spectral Method   | Strided memory access                     | 900 x 768 x 96 <sup>†</sup><br>3600 x 3072 x 2048 <sup>‡</sup>   | 2.21     | 2.18                 |
| TURBINE          | CFD                                    | DNS                      | Indirect memory access with short vectors | 91 x 91 x 91 x 13                                                | 1.78     | 5.47                 |
| BCM              | CFD                                    | Navier-Stokes Equation   | Stencil and indirect memory access        | (128 x 128 x 128 cells) x 64 cubes                               | 7.01     | 5.86                 |

<sup>†</sup> for single-node evaluation; <sup>‡</sup> for multi-node evaluation
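
One way to read the Code B/F column is through a simple bandwidth-bound estimate: if a code needs more bytes per flop than the system can supply, the sustained fraction of peak is capped at roughly system B/F divided by code B/F. The sketch below applies that estimate. The model itself is an assumption brought in for illustration, not something stated in the slides, and it ignores on-chip reuse in the ADB or caches, which is exactly what the Actual B/F column captures.

```python
def attainable_fraction(system_bf, code_bf):
    """Upper bound on the fraction of peak performance when memory
    bandwidth is the only limit (a roofline-style assumption)."""
    return min(1.0, system_bf / code_bf)


# Code B/F values copied from the applications table.
CODE_BF = {
    "QSFDM GLOBE": 2.16,
    "Barotropic ocean": 1.97,
    "MHD (FDM)": 3.04,
    "Seism 3D": 2.15,
    "MHD (Spectral)": 2.21,
    "TURBINE": 1.78,
    "BCM": 7.01,
}

if __name__ == "__main__":
    # System B/F values taken from the specifications table in this deck.
    for system, system_bf in (("SX-ACE", 1.0), ("LX 406", 0.26)):
        for app, code_bf in CODE_BF.items():
            bound = attainable_fraction(system_bf, code_bf)
            print(f"{system:7s} {app:17s} <= {bound:.0%} of peak")
```

Under this estimate every listed application is memory-bound on both systems, and the bound on SX-ACE is about four times higher than on LX 406 because of the difference in system B/F.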



#### Sustained Memory Bandwidth

- STREAM (TRIAD)
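
For readers unfamiliar with the benchmark, the TRIAD kernel computes `a[i] = b[i] + q * c[i]` and counts two loads and one store per element. The sketch below shows the kernel and the bandwidth accounting in plain Python. It is only meant to illustrate how the number is derived: interpreted Python will not come close to hardware bandwidth, and the 8-byte word size and array length are assumptions.

```python
import time


def triad(a, b, c, q):
    """STREAM TRIAD kernel: a[i] = b[i] + q * c[i]."""
    for i in range(len(a)):
        a[i] = b[i] + q * c[i]


def triad_bandwidth_gbs(n=200_000, q=3.0, repeats=3, bytes_per_word=8):
    """Run TRIAD several times and report the best-case GB/s plus the result.

    Traffic is counted as three words per element (load b, load c, store a).
    """
    a = [0.0] * n
    b = [1.0] * n
    c = [2.0] * n
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        triad(a, b, c, q)
        best = min(best, time.perf_counter() - start)
    bytes_moved = 3 * bytes_per_word * n
    return bytes_moved / best / 1e9, a


if __name__ == "__main__":
    bandwidth, _ = triad_bandwidth_gbs()
    print(f"TRIAD (pure Python, illustrative only): {bandwidth:.3f} GB/s")
```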





#### Sustained Single CPU Performance




#### Performance of Indirect Memory Accesses in TURBINE





QLL(I,J,K,M)=DQL0+COEFB\*DPQM0+COEFC\*DMQN0

QRR(I,J,K,M)=DQR0-COEFB\*DMQP0-COEFC\*DPQN0

200 CONTINUE



#### Performance of Indirect Memory Accesses in TURBINE on Modern HPC Processors





#### Performance of Short-Vector Processing in TURBINE (1/2)





10 continue



#### Performance of Short-Vector Processing in TURBINE (2/2)





#### Performance of Short-Vector Processing in TURBINE on Modern HPC Processors





#### Sustained Performance of Barotropic Ocean Model on Multi-Node Systems





#### Performance Evaluation of SX-ACE by using the HPCG Benchmark

- HPCG (High Performance Conjugate Gradients) is designed to exercise computational and data access patterns that more closely match a broad set of important applications.
  - HPL, used for the TOP500 list, is increasingly unreliable as a true measure of system performance for a growing collection of important science and engineering applications.
- HPCG is a complete, stand-alone code that measures the performance of basic operations in a unified code:
  - Sparse matrix-vector multiplication.
  - Sparse triangular solve.
  - Vector updates.
  - Global dot products.
  - Local symmetric Gauss-Seidel smoother.
  - Driven by a multigrid-preconditioned conjugate gradient algorithm that exercises the key kernels on a nested set of coarse grids.
  - Reference implementation is written in C++ with MPI and OpenMP support.
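
To make the kernel list concrete, the sketch below runs a plain conjugate gradient solve built from three of the operations named above: sparse matrix-vector multiplication (CSR format), dot products, and vector updates. It is not the HPCG reference code: there is no multigrid preconditioner and no Gauss-Seidel smoother, the test matrix is a small 1D Laplacian chosen for brevity, and all function names are illustrative.

```python
import math


def laplacian_1d_csr(n):
    """CSR arrays (values, col_idx, row_ptr) for tridiag(-1, 2, -1)."""
    vals, cols, ptr = [], [], [0]
    for i in range(n):
        for j, v in ((i - 1, -1.0), (i, 2.0), (i + 1, -1.0)):
            if 0 <= j < n:
                vals.append(v)
                cols.append(j)
        ptr.append(len(vals))
    return vals, cols, ptr


def spmv(csr, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    vals, cols, ptr = csr
    return [
        sum(vals[k] * x[cols[k]] for k in range(ptr[i], ptr[i + 1]))
        for i in range(len(ptr) - 1)
    ]


def dot(x, y):
    """Dot product (a global reduction in a distributed run)."""
    return sum(a * b for a, b in zip(x, y))


def axpy(alpha, x, y):
    """Vector update: return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(csr, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)            # residual for the zero initial guess
    p = list(r)            # initial search direction
    rr = dot(r, r)
    for iteration in range(max_iter):
        if math.sqrt(rr) < tol:
            return x, iteration
        ap = spmv(csr, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, ap, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)   # p = r + beta * p
        rr = rr_new
    return x, max_iter


if __name__ == "__main__":
    A = laplacian_1d_csr(20)
    rhs = spmv(A, [1.0] * 20)
    solution, iterations = conjugate_gradient(A, rhs)
    print(iterations, max(abs(v - 1.0) for v in solution))
```

Every step of the loop is dominated by streaming through the matrix and vectors once, which is why the benchmark rewards memory bandwidth rather than peak flops.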



#### Optimizations of the HPCG Benchmark for SX-ACE (Komatsu et al., SC15)

- Data packing for vector-friendly matrix memory allocation of sparse matrices
- Parallelization by using coloring and hyperplane methods
- Selective data caching and blocking for effective use of ADB
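
The slides name coloring as one of the parallelization methods but do not show it. The sketch below illustrates the general idea under stated assumptions: rows that share no matrix entry receive the same color, so all rows of one color can be updated together in a Gauss-Seidel sweep, which yields long, dependency-free vector loops. The greedy scheme, the 5-point grid (HPCG itself uses a 27-point stencil), and the helper names are all illustrative and are not taken from the cited work.

```python
def grid_adjacency(nx, ny):
    """Neighbors under a 5-point stencil; rows are numbered j * nx + i."""
    adjacency = {}
    for j in range(ny):
        for i in range(nx):
            neighbors = []
            if i > 0:
                neighbors.append(j * nx + i - 1)
            if i < nx - 1:
                neighbors.append(j * nx + i + 1)
            if j > 0:
                neighbors.append((j - 1) * nx + i)
            if j < ny - 1:
                neighbors.append((j + 1) * nx + i)
            adjacency[j * nx + i] = neighbors
    return adjacency


def greedy_coloring(adjacency):
    """Give each row the smallest color unused by its colored neighbors."""
    colors = {}
    for row in sorted(adjacency):
        taken = {colors[n] for n in adjacency[row] if n in colors}
        color = 0
        while color in taken:
            color += 1
        colors[row] = color
    return colors


def rows_by_color(colors):
    """Group rows by color; each group can be processed as one vector loop."""
    groups = {}
    for row, color in colors.items():
        groups.setdefault(color, []).append(row)
    return [groups[c] for c in sorted(groups)]


if __name__ == "__main__":
    groups = rows_by_color(greedy_coloring(grid_adjacency(4, 3)))
    for color, rows in enumerate(groups):
        print(color, rows)
```

For this small grid the greedy pass produces the familiar two-color (red-black) ordering.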









#### Efficiency Comparison in the HPCG Performance (1/2)





#### Peak Performance DOES NOT Track Observed Performance!







## R&D of A Real-Time Tsunami Inundation Forecasting System on SX-ACE





## Background: 2011 Great East Japan Earthquake

- Main shock at 2:46 pm on March 11, followed by a huge tsunami 30 minutes later
- Magnitude 9.0, the largest in Japan and the 5th largest in the world
- Around 20,000 victims (dead or missing), mainly due to the tsunami; 100,000 people evacuated to shelters in the first several months
- A huge amount of debris from houses, cars, and buildings remained in the coastal area of Tohoku for over a year
- Important infrastructure such as gas, water, electricity, trains, and roads was destroyed or out of service for one month in Sendai City











## Motivation: Serious Damage to Sendai Area Due to 2011 Tsunami Inundation





## It Is Not Over: High Probability of Big Earthquakes in Japan

Japan may be hit by severe earthquakes and large tsunamis in the next 30 years





#### Objective of Our Work

Make HPC Available as a Social Infrastructure for Homeland Safety in Japan!

Prompt disaster response reduces damage, for example by issuing evacuation warnings for dangerous zones and rescuing survivors as soon as possible.

Detailed and highly accurate analysis and forecasting of tsunami inundation soon after a big earthquake is mandatory.

Precise simulation using HPC can meet these demands and enhance social resilience against natural disasters.



#### Design and Development of A Real-Time Tsunami Inundation Forecasting System



[Figure: workflow of the forecasting system. Stages shown: GPS observation; fault estimation based on GPS data; simulation on SX-ACE using 10-m mesh models of coastal cities; information delivery, with just-in-time access to visualized information by local governments. Time labels shown: < 4 min, < 8 min, < 8 min, and < 20 min.]



#### System Organization





#### Emergency Job Handling by NQSII of SX-ACE








## Simulation: Target Code & Areas

#### Target Code TUNAMI: Tohoku University's Numerical Analysis Model for Investigating Tsunami

- Developed by Prof. Koshimura of Tohoku University
- Authorized by UNESCO and the Japanese Government
- Governing equations
  - Nonlinear shallow water equations
- Numerical scheme
  - Staggered leap-frog finite difference method
- Memory-intensive application
  - B/F = 1.82 (single precision)
- Target areas: Miyagi, Shizuoka, and Kochi






#### **Computation Domain**

Hierarchical multi-level grid models

Computation domain of Kochi City:

- 1244 km x 826 km
- 5 nested grids
- 6 hours of tsunami inundation
- $\Delta t = 0.1$ s

| Region | Grid size (m) | No. of grid points (X) | No. of grid points (Y) |
|--------|---------------------|--------------------|--------------------|
| 1      | 810                 | 1536               | 1020               |
| 2      | 270                 | 1680               | 990                |
| 3      | 90                  | 2292               | 1260               |
| 4      | 30                  | 1782               | 1188               |
| 5      | 10                  | 3504               | 2364               |
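
The table and the parameters above fix the size of the problem. The sketch below works out the outer domain extent, the number of grid points per region, and the number of time steps for the 6-hour simulation. It assumes that every region advances with the same $\Delta t = 0.1$ s, which the slides do not state explicitly; the names are illustrative.

```python
# (grid size in m, grid points in X, grid points in Y), copied from the table.
REGIONS = {
    1: (810, 1536, 1020),
    2: (270, 1680, 990),
    3: (90, 2292, 1260),
    4: (30, 1782, 1188),
    5: (10, 3504, 2364),
}
DT_SECONDS = 0.1       # time step from the slide
SIMULATED_HOURS = 6    # simulated inundation period from the slide


def grid_points(region):
    """Number of grid points in one region."""
    _, nx, ny = REGIONS[region]
    return nx * ny


def extent_km(region):
    """Physical extent (x, y) of a region in kilometers."""
    size_m, nx, ny = REGIONS[region]
    return size_m * nx / 1000, size_m * ny / 1000


def time_steps():
    """Number of time steps, assuming one shared time step for all regions."""
    return round(SIMULATED_HOURS * 3600 / DT_SECONDS)


if __name__ == "__main__":
    print("outer domain (km):", extent_km(1))
    print("total grid points:", sum(grid_points(r) for r in REGIONS))
    print("time steps:", time_steps())
```

Region 1 spans about 1244 km by 826 km, matching the stated domain, and each finer region uses a grid size one third of its parent. In total there are roughly 16.5 million grid points and 216,000 time steps.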





#### Program Structure



#### Doubly nested loops

    DO J=2, JF                                    ! latitude
      DO I=2, IF                                  ! longitude, vectorized

        ZZ = Z(I,J,1) - RX*(M(I,J,1)-M(I-1,J,1))
                      - RY*(N(I,J,1)-N(I,J-1,1))  ! stencil

      END DO
    END DO
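
The loop nest above updates the water level from differences of the fluxes M and N between neighboring grid points. The sketch below restates that update in Python for readers who want to experiment with it. It assumes that `RX` and `RY` are the ratios of time step to grid spacing, that the trailing array index selects a time level, and that boundary rows and columns are left untouched; none of this is confirmed by the slide, and the function name is illustrative.

```python
def update_water_level(z, m, n, rx, ry):
    """Return a new water-level array from a backward-difference stencil.

    z, m, n are lists of rows indexed [j][i] (j: latitude, i: longitude).
    Row 0 and column 0 are copied unchanged, mirroring loops that start at 2
    in the 1-based original.
    """
    rows = len(z)
    cols = len(z[0])
    zz = [row[:] for row in z]
    for j in range(1, rows):
        for i in range(1, cols):   # inner loop: the vectorized direction
            zz[j][i] = (
                z[j][i]
                - rx * (m[j][i] - m[j][i - 1])
                - ry * (n[j][i] - n[j - 1][i])
            )
    return zz


if __name__ == "__main__":
    z = [[0.0] * 4 for _ in range(3)]
    m = [[float(i) for i in range(4)] for _ in range(3)]   # dM/dx = 1
    n = [[0.0] * 4 for _ in range(3)]
    for row in update_water_level(z, m, n, rx=0.5, ry=0.5):
        print(row)
```

Each update reads three neighboring values per flux array and performs only a handful of flops, which is consistent with the memory-intensive character noted for the code.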

#### Tuning

- Inlining subroutines
- Optimization of I/O routines
- Vectorization & parallelization
  - Vect. ratio = 99.6%, vect. length = 235
- ADB Tuning of Stencil kernels




#### Visualization of Simulation Results by Delivery Server



The information is delivered to Local Governments through the Web




#### **Real-Time Tsunami Inundation Forecasting**


#### Rapid coseismic fault determination system for real-time tsunami inundation forecasting [This is a test based on expected events]



Japan Asia Group




#### **Demo: Visualization of Simulation Results**

#### Simulation Results of Inundation of Kochi City Caused by Nankai Trough Earthquake


#### Performance of TUNAMI Code on SX-ACE

#### 5.5x performance improvement over LX (peak performance ratio is only 3x)



| System    | Perf./Socket (Gflop/s) | No. of Cores | Perf./Core (Gflop/s) | Mem. BW (GB/s) | Socket B/F | Core B/F |
|-----------|--------------------------------|-----------------|---------------------------|----------------------|---------------|-------------|
| SX-ACE    | 256                            | 4               | 64                        | 256                  | 1             | 4           |
| SX-9      | 102.4                          | 1               | 102.4                     | 256                  | 2.5           | 2.5         |
| LX406Re-2 | 230.4                          | 12              | 19.2                      | 59.7                 | 0.26          | N/A         |
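
The table makes it possible to set the reported 5.5x speedup against the hardware ratios between SX-ACE and LX. The sketch below computes those ratios from the table values. The slide does not say which figure the "3x" peak ratio refers to; the per-core ratio of about 3.3 is the closest match, and reading it that way is an assumption.

```python
# Values copied from the table (Gflop/s and GB/s).
SX_ACE = {"core_gflops": 64.0, "socket_gflops": 256.0, "mem_bw": 256.0}
LX = {"core_gflops": 19.2, "socket_gflops": 230.4, "mem_bw": 59.7}


def ratio(key):
    """SX-ACE value divided by the LX value for one table column."""
    return SX_ACE[key] / LX[key]


if __name__ == "__main__":
    print(f"per-core peak ratio:    {ratio('core_gflops'):.2f}")
    print(f"per-socket peak ratio:  {ratio('socket_gflops'):.2f}")
    print(f"memory bandwidth ratio: {ratio('mem_bw'):.2f}")
```

The memory bandwidth ratio (about 4.3) is larger than either peak ratio, which fits the deck's argument that sustained performance of a memory-intensive code follows bandwidth more closely than peak flops.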




#### Scalability of TUNAMI Code






#### Total Execution Time of Tsunami Forecasting Workflow (Kochi-City Case)





#### Summary

- SX-ACE serves as social infrastructure for tsunami inundation forecasting, much like a weather forecasting system, in addition to its role as research infrastructure for computational science and engineering in Japan
- Current work
  - Target area extension: full coverage of Japan
  - System extension: complementary operation of multiple systems



#### Summary

- SX-ACE shows high sustained performance compared with SX-9, in particular a significant improvement in short-vector processing and indirect memory accesses
  - Achieved the same single-core performance in practical applications with only about 60% of the peak performance
  - No. 1 in computing efficiency and power efficiency in the HPCG Benchmark ranking
  - Paves the way to a new social infrastructure for homeland safety in Japan
- Well-balanced HEC systems with respect to memory performance are the key to realizing high productivity in science and engineering simulations
  - Demand for "supercomputers for the rest of us," especially for 2020 and beyond!
  - From brute force to smart force in HPC design
  - Quality, not quantity!



#### WSSP Announcement

- 23rd Workshop on Sustained Simulation Performance
  - Held on March 16-17, 2016 at Tohoku University, Sendai Japan
  - Organized by Tohoku University, HLRS (Stuttgart), JAMSTEC, and NEC
  - International researchers and engineers get together to discuss and exchange ideas, experience and perspectives on current and future HPC technologies
  - Confirmed invited Speakers
    - Michael Resch (HLRS)
    - Sabine Roller (University of Siegen)
    - Vladimir Voevodin (Moscow State University)
    - Toshimitsu Yokobori (Tohoku Univ)
    - Mitsuo Yokokawa (Kobe Univ)
    - Akiko Matsuo (Keio Univ)
    - Ken-ichi Itakura (JAMSTEC)
    - and more!
  - https://www.sc.cc.tohoku.ac.jp/wssp23/index.ja.html




21st WSSP in Sendai


