1p Let me begin the presentation. Our project focuses on accelerating simulation for DNN accelerators.
2p Modern accelerators are moving toward scale-out architectures to achieve better scalability. The goal of this transition is to improve PE utilization and energy efficiency. However, scale-out architectures introduce a two-level design space. This space includes conventional parameters such as dataflow schemes, array size, and buffer size, as well as inter-pod design considerations. In addition, state-of-the-art DNN models have significantly larger parameter counts than conventional CNN-based models.
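To make the two-level design space concrete, here is a minimal Python sketch; the field names are illustrative assumptions, not parameters taken from any specific simulator.

```python
from dataclasses import dataclass

@dataclass
class IntraPodConfig:
    # Conventional per-pod parameters mentioned above (names are illustrative).
    dataflow: str    # e.g. "os", "ws", "is"
    array_rows: int  # PE array height
    array_cols: int  # PE array width
    sram_kb: int     # on-chip buffer size

@dataclass
class ScaleOutConfig:
    # Second-level, inter-pod parameters layered on top of the per-pod design.
    pod: IntraPodConfig
    pod_rows: int
    pod_cols: int

# Example point in the two-level space.
design = ScaleOutConfig(pod=IntraPodConfig("os", 32, 32, 512), pod_rows=4, pod_cols=4)
```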
3p Here is the outline of the presentation. In the next section, Motivation, I will define the problem in more detail and introduce our key ideas. After that, we will briefly cover the solution, evaluation, and conclusion.
4p Based on the background we discussed earlier, we defined the problem. The key issue is that simulating DNN models with existing simulators takes an excessive amount of time. As many of you know, prior simulators such as SCALE-Sim offer simplified analytical modeling, but their simulation time is still non-trivial.
5p To illustrate the severity of this issue, we ran an experiment to measure the required simulation time. This table shows the experimental setup, which includes several Transformer-based models.
6p Here are the results. As you can see, simulation time increases rapidly with model size. Large models like GPT-3 and PaLM take much longer to simulate than BERT-Large, requiring hundreds of times more simulation time.
7p We now introduce three ideas to address this challenge. The first is a layer-level optimization. Since Transformer-based models consist of many identical layers, simulators repeatedly perform the same simulation routines. If we can skip redundant computations across identical layers, we can reduce simulation time.
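As a rough illustration of the layer-level idea, the sketch below memoizes per-layer simulation results keyed by the layer's workload shape; LayerConfig, cycle_accurate_sim, and the toy cost model are hypothetical stand-ins, not the actual simulator code.

```python
from functools import lru_cache
from typing import NamedTuple

class LayerConfig(NamedTuple):
    """Hypothetical key describing one layer's GEMM workload and mapping."""
    m: int         # output rows
    k: int         # reduction dimension
    n: int         # output columns
    dataflow: str  # e.g. "os", "ws"

def cycle_accurate_sim(cfg: LayerConfig) -> int:
    """Stand-in for the expensive per-layer simulation routine (toy cost model)."""
    return (cfg.m * cfg.k * cfg.n) // 128

@lru_cache(maxsize=None)
def simulate_layer(cfg: LayerConfig) -> int:
    """Identical Transformer layers produce identical LayerConfig keys,
    so repeated layers hit the cache instead of being re-simulated."""
    return cycle_accurate_sim(cfg)

def simulate_model(layers):
    # Sum per-layer cycles; only unique configs trigger a real simulation.
    return sum(simulate_layer(cfg) for cfg in layers)

# A 24-layer model built from identical blocks simulates each unique shape once.
blocks = [LayerConfig(512, 1024, 1024, "os")] * 24
total_cycles = simulate_model(blocks)
```

Because a Transformer stack maps to only a handful of unique workload shapes, only those shapes ever reach the expensive simulation routine.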
8p However, even with layer-level optimization, simulating large language models still takes a long time and limits iterative experimentation.
9p So the second idea targets redundant computations 'between pods'. In scale-out DNN accelerators, workloads are typically divided evenly across pods to ensure balanced execution time. As a result, multiple pods often process matrix partitions with the same dimensions. For example, in the figure shown, pods with the same color perform identical computations and produce the same results. By skipping redundant operations across pods, we can significantly reduce simulation time.
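Here is a minimal sketch of the pod-level deduplication, assuming an even tiling of the output matrix across a pod grid; partition_workload and simulate_tile are hypothetical names introduced for illustration.

```python
from collections import Counter

def partition_workload(m, n, pod_rows, pod_cols):
    """Split an M x N output evenly across a pod_rows x pod_cols grid of pods.
    Returns one (tile_m, tile_n) shape per pod; edge pods absorb any remainder."""
    shapes = []
    for r in range(pod_rows):
        tile_m = m // pod_rows + (1 if r < m % pod_rows else 0)
        for c in range(pod_cols):
            tile_n = n // pod_cols + (1 if c < n % pod_cols else 0)
            shapes.append((tile_m, tile_n))
    return shapes

def simulate_pods(m, n, k, pod_rows, pod_cols, simulate_tile):
    """Simulate one pod per unique tile shape; identically shaped pods reuse the result."""
    shape_counts = Counter(partition_workload(m, n, pod_rows, pod_cols))
    per_shape_cycles = {s: simulate_tile(s[0], k, s[1]) for s in shape_counts}
    # Pods run in parallel, so overall latency is set by the slowest pod.
    return max(per_shape_cycles.values())

# With a divisible workload, all 16 pods share one tile shape, so only one is simulated.
latency = simulate_pods(4096, 4096, 1024, pod_rows=4, pod_cols=4,
                        simulate_tile=lambda tm, tk, tn: tm * tk * tn // 128)
```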
10p Next, we shift our attention to 'intra-pod' simulation. The computation time can be estimated using existing analytical models, and the overhead from file I/O is negligible. The dominant bottleneck in per-pod simulation is therefore memory access tracking. This is the third insight we identified.
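For the compute portion, an analytical estimate in the spirit of SCALE-Sim-style models might look like the sketch below; the fill/drain formula is an assumed approximation for an output-stationary systolic array, not a quoted result from any particular simulator.

```python
import math

def os_gemm_cycles(m, k, n, rows, cols):
    """Rough output-stationary estimate: each (rows x cols) fold of the M x N output
    takes about 2*rows + cols + k - 2 cycles to fill, accumulate over k, and drain.
    This is an assumed approximation, not the exact model of a specific simulator."""
    folds = math.ceil(m / rows) * math.ceil(n / cols)
    return folds * (2 * rows + cols + k - 2)

# Example: one pod's GEMM tile mapped onto a 32x32 PE array.
cycles = os_gemm_cycles(m=512, k=1024, n=512, rows=32, cols=32)
```

Because such closed-form estimates are cheap to evaluate, the remaining cost in per-pod simulation comes mainly from generating and tracking memory accesses.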