Multimodal Emotion Recognition Through Convolutional Neural Networks and Natural Language Processing
Keywords:
Affective computing, Convolutional neural network, Facial emotion recognition, Human-computer interaction, Multimodal analysis, Natural language processing
Abstract
Discerning human affective states from facial expressions, speech, and written text is a central challenge in computer vision and natural language processing. This paper presents an integrated, lightweight, web-based multimodal emotion recognition system that handles five input pathways: typed text, recorded audio, still images, video files, and live webcam feeds. For visual streams, the system uses a Convolutional Neural Network (CNN) trained on the FER-2013 dataset, paired with a Viola-Jones Haar Cascade classifier for rapid face localization. For linguistic streams, it applies a lexicon-based natural language processing approach built on the NRCLex engine and supplemented with a custom emotion vocabulary. Recorded audio is first transcribed with the Google Speech Recognition API, and the transcript is then passed through the same lexicon matching. Served by a Flask web server, the platform runs entirely on commodity CPUs with a memory footprint under 1 GB, so no GPU hardware is required. Experimental results show that the CNN model achieves a validation accuracy of 65% to 68% on the noisy FER-2013 benchmark, which is in line with the human accuracy reported for this dataset, while the webcam stream sustains 8 to 12 FPS through a frame-caching technique. The interface displays annotated visual outputs with styled bounding boxes and confidence scores, giving an explainable and responsive affective computing overlay.
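The abstract does not spell out how the frame-caching technique works. One plausible reading is that the expensive face detection and CNN inference run only on every N-th webcam frame, and the most recent result is redrawn on the frames in between. The sketch below illustrates that reading in plain Python; the names `cached_predictions`, `analyze`, and `every_n`, as well as the box tuple layout, are hypothetical and are not taken from the system described here.

```python
from typing import Callable, Iterable, Iterator, List, Tuple, TypeVar

Frame = TypeVar("Frame")
# Hypothetical annotation layout: (x, y, width, height, emotion label, confidence)
Box = Tuple[int, int, int, int, str, float]


def cached_predictions(
    frames: Iterable[Frame],
    analyze: Callable[[Frame], List[Box]],
    every_n: int = 3,
) -> Iterator[Tuple[Frame, List[Box]]]:
    """Yield (frame, boxes) pairs, calling `analyze` only on every `every_n`-th frame.

    `analyze` stands in for the costly step (face detection plus emotion
    classification). Frames in between reuse the most recent result, so the
    overlay stays visible while the per-frame cost drops.
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")

    cached: List[Box] = []
    for index, frame in enumerate(frames):
        if index % every_n == 0:
            # Refresh the cache on frames 0, every_n, 2 * every_n, ...
            cached = analyze(frame)
        yield frame, cached
```

With `every_n=3`, a stream of seven frames triggers the analysis step on frames 0, 3, and 6 only. The trade-off is that annotations can lag a moving face by up to `every_n - 1` frames, so the interval would need tuning against the target frame rate.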