AI-Generated Speech Detection Using Convolutional Neural Network

Authors

  • Kaushik Sinha
  • Debalina Sinha Jana

Keywords

Anti-spoofing, ASVspoof dataset, Audio forensics, Cloned speech, Convolutional Neural Networks (CNN), Deep learning in speech, Deepfake audio, Fake voice dataset, Speaker verification, Spectral features, Speech synthesis, Synthetic voice detection, Voice biometrics security, Voice cloning detection, Voice spoofing

Abstract

Recent advancements in deep learning and speech synthesis have led to the proliferation of highly realistic synthetic and cloned voices. While these technologies have beneficial applications in assistive tools and media production, they pose serious security and ethical concerns, particularly through impersonation attacks, misinformation, and fraud. This paper provides a comprehensive survey of state-of-the-art techniques for detecting synthetic and cloned voices. We evaluate signal-level, spectral, and deep-learning-based methods, and propose an ensemble framework combining Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and spectral feature engineering. We also benchmark multiple detection models on the ASVspoof and FakeVoice datasets and analyze their robustness against adversarial attacks. Experimental results show that our ensemble model significantly improves detection accuracy under both clean and noisy conditions. We conclude with an outlook on future research directions in robust voice spoofing detection.
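
The abstract outlines a CNN-based detector driven by spectral features; the full architecture is described only in the paper itself. For orientation, the snippet below is a minimal, purely illustrative sketch, assuming librosa and PyTorch, of how a log-mel spectrogram could be fed to a small binary spoofing classifier. Every layer size, hyperparameter, and function name here is an assumption for illustration, not the authors' model.

```python
# Illustrative sketch only -- not the authors' architecture. Assumes librosa and
# PyTorch are available; all layer sizes and hyperparameters are placeholder choices.
import librosa
import numpy as np
import torch
import torch.nn as nn


def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 80) -> torch.Tensor:
    """Load an audio file and return a (1, n_mels, frames) log-mel spectrogram tensor."""
    waveform, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    return torch.from_numpy(log_mel).float().unsqueeze(0)  # add a channel dimension


class SpoofCNN(nn.Module):
    """Minimal CNN mapping a log-mel spectrogram to a single bona fide / spoofed logit."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),  # tolerate variable-length utterances
        )
        self.classifier = nn.Linear(32 * 4 * 4, 1)  # single logit: spoofed vs. bona fide

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))


if __name__ == "__main__":
    model = SpoofCNN()
    # Dummy batch standing in for a real utterance, e.g. from an ASVspoof partition.
    dummy = torch.randn(1, 1, 80, 300)  # (batch, channel, n_mels, frames)
    prob_spoof = torch.sigmoid(model(dummy)).item()
    print(f"P(spoofed) = {prob_spoof:.3f}")
```

In a setup like the one the abstract describes, such a CNN branch would be trained on labeled bona fide and spoofed utterances and could be fused with RNN branches and engineered spectral features to form the ensemble.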

Published

2025-06-20

How to Cite

Sinha, K., & Sinha Jana, D. (2025). AI-Generated Speech Detection Using Convolutional Neural Network. International Journal of Computer Science, Algorithms and Programming Languages, 1(1), 47–54. Retrieved from https://matjournals.net/engineering/index.php/IJCSAPL/article/view/2051
