Hardware Acceleration for Convolutional Neural Networks on Edge Devices: A Survey of FPGA-based Designs, Quantization, and Lightweight Architectures for Microcontroller Integration
Keywords:
3×3 convolutions, Convolution engine, Convolutional neural networks, DMA, Edge devices, INT8, INT32
Abstract
This literature survey investigates hardware acceleration techniques for convolutional neural networks (CNNs) targeting edge devices, with a specific emphasis on field-programmable gate array (FPGA)-based designs. The increasing computational complexity and memory demands of modern CNNs make deployment on resource-constrained edge platforms difficult and motivate specialized acceleration methods. This study systematically reviews recent academic contributions (2018–2024) on optimization strategies, including INT8 quantization, architectural pruning, and the development of lightweight models (e.g., MobileNet, Tiny-YOLO). It also examines the efficiency of 3×3 convolution optimization, memory access bottlenecks, latency constraints, and the role of direct memory access (DMA)-based data transfer in mitigating these issues. A comparative analysis of hardware platforms (GPUs, FPGAs, ASICs, and external accelerators) evaluates their performance, power consumption, flexibility, and cost in the context of edge AI. A critical review of existing FPGA-based convolution accelerators highlights their methodologies, advantages, and limitations, specifically addressing their applicability to low-cost microcontrollers. The identified research gaps include the high complexity and power consumption of current neural processing units (NPUs), their unsuitability for highly constrained microcontrollers, and inefficiencies in performing fundamental operations such as 3×3 convolutions. Based on this analysis, a simple, energy-efficient external convolution accelerator architecture is proposed. The architecture is dedicated to 3×3 convolutions, supports INT8 inputs with INT32 accumulation, employs parallel multiply-accumulate (MAC) units, and incorporates a DMA-based data transfer interface, all intended for integration with low-cost microcontrollers. The proposed design aims to offer lower latency, lower power consumption, and greater simplicity than full NPUs, making it suitable for a range of edge AI applications.
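The arithmetic named in the abstract (a 3×3 convolution over INT8 operands with INT32 accumulation) can be illustrated with a short software reference model. The sketch below is not the proposed hardware. The function name `conv3x3_int8`, the stride of 1, and the absence of padding are assumptions made for illustration, and the DMA interface is not modeled. As in most CNN frameworks, the kernel is applied without flipping (cross-correlation).

```python
# Software reference model (illustrative assumption, not the proposed RTL):
# 3x3 convolution with INT8 inputs and an INT32 accumulator.

INT8_MIN, INT8_MAX = -128, 127
INT32_MIN, INT32_MAX = -(2 ** 31), 2 ** 31 - 1


def conv3x3_int8(image, kernel):
    """Stride-1, no-padding 3x3 convolution over lists of INT8 values.

    Returns a (H-2) x (W-2) list of accumulator values that fit in INT32.
    """
    if len(kernel) != 3 or any(len(row) != 3 for row in kernel):
        raise ValueError("kernel must be 3x3")
    height = len(image)
    width = len(image[0]) if height else 0
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3")
    if any(len(row) != width for row in image):
        raise ValueError("image rows must have equal length")
    for row in list(image) + list(kernel):
        for value in row:
            if not INT8_MIN <= value <= INT8_MAX:
                raise ValueError("all operands must fit in INT8")

    output = []
    for r in range(height - 2):
        out_row = []
        for c in range(width - 2):
            # Nine multiply-accumulate steps per output pixel. In hardware
            # these products could be formed by parallel MAC units; here
            # they are computed sequentially.
            acc = 0
            for kr in range(3):
                for kc in range(3):
                    acc += image[r + kr][c + kc] * kernel[kr][kc]
            # Worst case is 9 * (-128) * (-128) = 147456, far inside INT32.
            assert INT32_MIN <= acc <= INT32_MAX
            out_row.append(acc)
        output.append(out_row)
    return output


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    print(conv3x3_int8(image, ones))
```

A single 3×3 window cannot overflow INT32, since its magnitude is bounded by 9 × 128 × 128 = 147,456. The wide accumulator matters when partial sums from many input channels are added together, which this single-channel sketch does not show.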