Developing an Accent Recognition System for Nigerian Native Languages Using Convolutional Neural Networks and Long Short-term Memory Models
Keywords:
Automatic speech recognition (ASR), Convolutional neural network (CNN), Cybersecurity, Deep learning techniques, Hybrid deep learning, Internet of Things (IoT), Long short-term memory (LSTM)
Abstract
Automatic speech recognition (ASR) is a core component of human-computer interaction, enabling voice assistants, transcription services, and smart communication systems. Yet limited recognition of, and accommodation to, accent variation remains one of the greatest challenges facing ASR systems. The issue is especially evident in a multilingual nation such as Nigeria, whose linguistic diversity has produced numerous native accents that existing systems struggle to classify correctly. These limitations make speech technologies less inclusive and less useful, particularly for underrepresented populations.
This study addresses that gap by constructing an accent recognition system for Nigeria's three main indigenous languages (Yoruba, Hausa, and Igbo) using deep learning methods.
Three models were proposed and compared: a convolutional neural network (CNN), a long short-term memory network (LSTM), and a CNN-LSTM hybrid architecture. All three were trained and evaluated on the SautiDB-Naija dataset of native-speaker audio recordings. The experimental findings revealed that the CNN and LSTM performed comparably, whereas the CNN-LSTM hybrid achieved higher performance, with a validation accuracy of 92.3% and an F1-score of 91.3%, indicating the benefit of using convolutional layers to extract spectral features and recurrent layers to model temporal dynamics. This study contributes to the development of inclusive speech technology and sets the stage for future work supporting more African accents in ASR systems.
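The hybrid architecture the abstract describes, convolutional layers extracting spectral features from the input followed by recurrent layers modelling their temporal evolution, can be sketched as a minimal NumPy forward pass. All dimensions below (40 MFCC coefficients, a kernel of 5 frames, 16 convolutional filters, 32 LSTM units) and the random weights are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_relu(x, w, b):
    """1D convolution over time with ReLU.
    x: (time, feat), w: (kernel, feat, out), b: (out,)"""
    k = w.shape[0]
    t_out = x.shape[0] - k + 1
    y = np.zeros((t_out, w.shape[2]))
    for t in range(t_out):
        # Contract the kernel window against the filter bank.
        y[t] = np.tensordot(x[t:t + k], w, axes=([0, 1], [0, 1])) + b
    return np.maximum(y, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_last_hidden(seq, Wx, Wh, b, hidden):
    """Run a single LSTM layer over the sequence; return the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for xt in seq:
        gates = xt @ Wx + h @ Wh + b
        i, f, g, o = np.split(gates, 4)        # input, forget, cell, output gates
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical shapes: 100 frames of 40 MFCCs per utterance, 3 accent classes.
T, F, CONV_OUT, HIDDEN, CLASSES = 100, 40, 16, 32, 3
x = rng.standard_normal((T, F))                       # one utterance's MFCC matrix
w_conv = rng.standard_normal((5, F, CONV_OUT)) * 0.1  # untrained, illustrative weights
b_conv = np.zeros(CONV_OUT)
Wx = rng.standard_normal((CONV_OUT, 4 * HIDDEN)) * 0.1
Wh = rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.1
b_lstm = np.zeros(4 * HIDDEN)
Wo = rng.standard_normal((HIDDEN, CLASSES)) * 0.1

feats = conv1d_relu(x, w_conv, b_conv)                # CNN stage: spectral features
h = lstm_last_hidden(feats, Wx, Wh, b_lstm, HIDDEN)   # LSTM stage: temporal modelling
probs = softmax(h @ Wo)                               # e.g. Yoruba / Hausa / Igbo
```

A trained system would learn these weights by gradient descent in a deep learning framework; the sketch only shows how the two stages compose, with the convolution shortening the time axis before the recurrence summarizes it.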