【A. Baseline: SCALE-Sim】 As the computational demands of Deep Neural Network (DNN) workloads continue to grow, research on Domain-Specific Architectures built around systolic arrays has become increasingly active. SCALE-Sim, presented in the ISPASS 2020 paper "A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim", offers a cycle-accurate simulation framework for evaluating the performance of such DNN accelerators under various architectural and dataflow configurations.

  1. Summary of key items

  1.1 Objective and contributions
  SCALE-Sim aims to analyze and compare scaling strategies for DNN accelerators: Scale-Up (increasing the size of a single systolic array) and Scale-Out (partitioning the array into smaller, parallel units). The simulator enables architectural exploration by modeling compute cycles, memory accesses, and bandwidth requirements. Key contributions include:
  • SCALE-Sim: a cycle-accurate simulator for systolic array-based DNN accelerators.
  • Analytical model: a first-order approximate runtime model that enables rapid design space exploration.
  Although the scale-out architecture can offer higher efficiency for large systolic arrays, it also leads to a rapid increase in DRAM bandwidth requirements. SCALE-Sim enables cycle-accurate analysis to identify the optimal number of partitions.

  1.2 Architecture and Components
  - Systolic Array: a 2D mesh of multiply-accumulate (MAC) units that propagate operands in a skewed, diagonal flow.
  - Double-Buffered SRAM Buffers: used for the IFMAP, Filter, and OFMAP matrices.
  - Supported Dataflows: Output Stationary (OS), Weight Stationary (WS), and Input Stationary (IS).
  - Input Interface: accepts hardware configuration files and a DNN topology description (see the sketch after Section 1.3).

  1.3 Output and Metrics
  • Cycle-accurate traces for SRAM and DRAM accesses.
  • Utilization, runtime, bandwidth usage, and memory request patterns.
  • Insight into the trade-offs between scaling configurations with respect to performance and energy.
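  To make the input interface concrete, below is a minimal sketch of the two kinds of inputs listed in Section 1.2 (a hardware configuration and one layer of the DNN topology) expressed as plain Python structures. The field names are illustrative assumptions for this write-up, not SCALE-Sim's actual configuration-file keys; the GEMM view of a layer (SR, SC, T) is the same one the runtime model in Section 2.2 operates on.

```python
from dataclasses import dataclass

@dataclass
class ArchConfig:
    """Illustrative architecture parameters (field names are assumptions)."""
    array_rows: int      # R: systolic array height (rows of MAC units)
    array_cols: int      # C: systolic array width (columns of MAC units)
    ifmap_sram_kb: int   # double-buffered IFMAP SRAM size
    filter_sram_kb: int  # double-buffered Filter SRAM size
    ofmap_sram_kb: int   # double-buffered OFMAP SRAM size
    dataflow: str        # "os", "ws", or "is"

@dataclass
class GemmWorkload:
    """One layer expressed as a GEMM of size (SR x T) x (T x SC)."""
    sr: int              # spatial rows of the output
    sc: int              # spatial columns of the output
    t: int               # temporal reduction dimension

# Example: a 32 x 32 OS array evaluating a 1024 x 1024 x 512 GEMM
arch = ArchConfig(32, 32, 64, 64, 64, "os")
layer = GemmWorkload(sr=1024, sc=1024, t=512)
```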

  2. Cycle computation methodology

  2.1 General Approach
  SCALE-Sim models systolic array behavior under the assumption of compute-bound execution. The simulation workflow proceeds as follows:
  (1) SCALE-Sim generates cycle-accurate read addresses for feeding operands at the top and left edges of the systolic array, ensuring no stalls in the PE array.
  (2) Using these traces, it calculates SRAM read/write traffic and determines the total runtime from the last output write event.
  (3) It tracks the number of active rows and columns per cycle to compute array utilization according to the dataflow.
  (4) Input and output matrices are managed through double-buffered SRAM, enabling the generation of prefetch-based DRAM traces.
  (5) By analyzing the SRAM and DRAM traces, SCALE-Sim estimates interface bandwidth requirements, on-chip/off-chip memory requests, and compute efficiency.

  2.2 Runtime Model for Scale-Up
  For a matrix multiplication between operands of size SR × T and T × SC, SCALE-Sim models the computation across three dimensions: spatial rows (SR), spatial columns (SC), and the temporal reduction dimension (T). If the systolic array has dimensions R × C, the number of required folds and the number of cycles per fold are:

    Folds = ⌈SR / R⌉ × ⌈SC / C⌉,    Cycles per fold = 2R + C + T − 2

    Then, the total cycle count is computed as:

    Total cycles τ = (2R + C + T − 2) × ⌈SR / R⌉ × ⌈SC / C⌉
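    To make the fold-based count concrete, here is a minimal Python sketch of the closed-form model above; the function name and the output-stationary-style mapping assumption are mine, not SCALE-Sim code.

```python
import math

def scale_up_cycles(sr: int, sc: int, t: int, r: int, c: int) -> int:
    """First-order runtime model for one R x C systolic array.

    Assumes the fold-based approximation above: the workload is tiled into
    ceil(SR/R) x ceil(SC/C) folds, and each fold pays the skewed fill,
    compute, and drain latency of 2R + C + T - 2 cycles.
    """
    folds = math.ceil(sr / r) * math.ceil(sc / c)
    cycles_per_fold = 2 * r + c + t - 2
    return folds * cycles_per_fold

# Example: a 1024 x 512 by 512 x 1024 GEMM on a 32 x 32 array
print(scale_up_cycles(sr=1024, sc=1024, t=512, r=32, c=32))  # 1024 folds x 606 cycles = 620544
```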

    The closed-form total accounts for:
    • Skewed data movement across the array (due to pipelined propagation).
    • Input feeding time from the array edges.
    • Compute latency for the MAC operations.
    • Output draining time from the array.

    2.3 Optimal Partitioning for Scale-Out
    In Scale-Out scenarios, the array is partitioned into PR × PC smaller arrays. The effective workload per partition becomes:

    SR′ = ⌈SR / PR⌉,    SC′ = ⌈SC / PC⌉,    T′ = T

    Cycle computation per partition follows the same fold-based model, applied to each (R / PR) × (C / PC) sub-array, and the overall runtime is determined by the slowest partition:

    τ_total = max over all PR × PC partitions p of τ_p
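    A matching sketch of the scale-out model, continuing the snippet above (it reuses scale_up_cycles; the even-partitioning assumption and the names are again mine):

```python
import math  # scale_up_cycles comes from the previous sketch

def scale_out_cycles(sr: int, sc: int, t: int, r: int, c: int,
                     pr: int, pc: int) -> int:
    """Runtime when the R x C array is split into PR x PC independent sub-arrays.

    Each sub-array of size (R/PR) x (C/PC) computes a ceil(SR/PR) x ceil(SC/PC)
    slice of the output over the full reduction dimension T. With this even
    partitioning every partition gets an identical tile, so the max over the
    slowest partition reduces to evaluating one representative tile.
    """
    sub_r, sub_c = r // pr, c // pc    # sub-array dimensions
    sr_p = math.ceil(sr / pr)          # output rows handled per partition
    sc_p = math.ceil(sc / pc)          # output columns handled per partition
    return scale_up_cycles(sr_p, sc_p, t, sub_r, sub_c)
```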

    This model allows SCALE-Sim to estimate performance under various array configurations and helps identify the sweet spot between performance, bandwidth, and energy (a small illustrative sweep is sketched below).

    SCALE-Sim provides a detailed and accurate methodology for analyzing DNN accelerator scalability. Its cycle-accurate modeling of matrix multiplication on systolic arrays, along with analytical approximations, enables effective exploration of architectural trade-offs between scale-up and scale-out strategies.
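    As an illustration of the sweet-spot analysis mentioned above, the short sweep below evaluates the sketched model for several partitionings of a hypothetical 128 × 128 array on one GEMM. It only captures the compute-runtime side; the DRAM bandwidth and energy terms that penalize aggressive scale-out come from SCALE-Sim's trace-based simulation, not from this first-order model.

```python
# Sweep partition counts for a fixed 128 x 128 array (runtime side only).
workload = dict(sr=2048, sc=2048, t=1024)
for pr, pc in [(1, 1), (2, 2), (4, 4), (8, 8)]:
    cycles = scale_out_cycles(r=128, c=128, pr=pr, pc=pc, **workload)
    print(f"{pr}x{pc} partitions: {cycles:,} cycles")
```

    In this first-order model the runtime keeps improving as partitions get smaller (the fill and drain skew shrinks), while, as noted in Section 1.1, DRAM bandwidth requirements grow rapidly at the same time, which is why the cycle-accurate traces are needed to pick the optimal number of partitions.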