Abstract
High-performance neural network accelerator architectures rely on large external memory bandwidth and/or sparse computation paradigms, both of which scale down unfavorably. Current state-of-the-art architectures also include on-chip SRAMs of often more than 150 kB, establishing a lower bound on silicon area from memory alone. This article presents an architecture that exploits programmable dataflow in combination with sparsity to make more efficient use of small on-chip memories. Its control logic supports an enlarged map space through uneven mappings, which a cost-model-driven compiler searches for points that prioritize energy consumption and memory access over throughput. The problem of supporting sparse processing under flexible dataflows is circumvented by an encoding scheme that provides sparsity metadata while still allowing random read and write access. Altogether, the system reduces external memory accesses while requiring only 51 kB of on-chip SRAM and a small silicon area (0.5 mm²). The design achieves an average energy efficiency of 4.4 TOPS/W and 9.7 inferences/s on a sparse AlexNet workload.
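The abstract does not specify the encoding scheme itself. As a hedged illustration of the property it claims, the sketch below shows one way a bitmap-based block format can carry sparsity metadata while still permitting O(1) random reads and in-place writes, unlike stream-oriented formats such as run-length encodings that require sequential decoding. All names (SparseBlock, sparse_read, sparse_write) and the block size are assumptions for illustration, not the paper's design.

#include <stdint.h>
#include <stdbool.h>

/* Illustrative sketch only; the paper's actual encoding is not described
 * in the abstract. A bitmask marks nonzero positions, and nonzero values
 * are packed densely in index order. A popcount over the mask bits below
 * an index yields that element's packed offset, enabling random access. */

#define BLOCK 32                 /* values per encoded block (assumed) */

typedef struct {
    uint32_t mask;               /* bit i set => element i is nonzero */
    int8_t   data[BLOCK];        /* nonzero values, packed in index order */
} SparseBlock;

/* Number of nonzeros strictly before position idx (its packed offset). */
static int packed_offset(uint32_t mask, int idx) {
    uint32_t below = (idx == 0) ? 0u : (mask & (0xFFFFFFFFu >> (32 - idx)));
    return __builtin_popcount(below);   /* GCC/Clang builtin */
}

/* Random read: constant time via popcount, no sequential decode. */
int8_t sparse_read(const SparseBlock *b, int idx) {
    if (!(b->mask & (1u << idx))) return 0;        /* implicit zero */
    return b->data[packed_offset(b->mask, idx)];
}

/* Random write: insert, overwrite, or remove, shifting the packed tail. */
void sparse_write(SparseBlock *b, int idx, int8_t v) {
    int  off     = packed_offset(b->mask, idx);
    int  n       = __builtin_popcount(b->mask);    /* current nonzero count */
    bool present = (b->mask >> idx) & 1u;

    if (v != 0 && !present) {                      /* insert new nonzero */
        for (int i = n; i > off; i--) b->data[i] = b->data[i - 1];
        b->data[off] = v;
        b->mask |= 1u << idx;
    } else if (v != 0 && present) {                /* overwrite in place */
        b->data[off] = v;
    } else if (v == 0 && present) {                /* remove, restore zero */
        for (int i = off; i < n - 1; i++) b->data[i] = b->data[i + 1];
        b->mask &= ~(1u << idx);
    }
}

A format of this kind keeps the metadata overhead at one bit per element while leaving every element individually addressable, which is the combination the abstract identifies as necessary for sparse processing under flexible (rather than fixed) dataflows.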
Keywords
Bandwidth, Convolution, DNN accelerator, Encoding, Energy efficiency, Logic, Memory management, System-on-chip, edge AI, flexible dataflow, map space exploration, sparse processing