An Intelligent Virtual World Framework Using Computer Vision and Speech Control

Authors

  • Rituja S. Bardapure
  • Shravani J. Wagh
  • Anam K. Kazi
  • Asavari S. Patil
  • Z. M. Patwekar

Keywords

Computer vision, Gesture-based interaction, Human–computer interaction (HCI), Speech recognition, Virtual environment

Abstract

Traditional computer interaction relies mainly on keyboards, mice, or external controllers, which limits natural and immersive human–computer interaction. These input methods are therefore poorly suited to modern applications such as virtual environments, training systems, and assistive technologies, and there is a need for an interactive system that lets users control virtual environments through natural actions such as gestures and voice commands. This research presents a multimodal virtual world system that integrates computer vision and speech recognition to enable hands-free interaction within a digital environment. The proposed system captures real-time video through a camera, detects hand gestures using computer vision techniques, and processes voice commands through speech recognition. These inputs are mapped to predefined actions within the virtual environment, enabling smooth and intuitive control. The system is developed in Python, using OpenCV for real-time image processing and speech recognition libraries to identify and process voice commands. Combining gesture-based and voice-based interaction improves accessibility and enhances the overall user experience. Experimental results show that the system runs efficiently in real time and recognizes gestures and voice commands accurately under normal conditions. The developed system is affordable, user-friendly, and applicable to domains such as education, gaming, virtual training, and human–computer interaction.
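The abstract states that recognized gestures and voice commands are mapped to predefined actions in the virtual environment. The sketch below illustrates one way such a mapping could be organized in Python. It is not the authors' implementation: the `Avatar` class, the gesture and voice labels, and the `build_action_map` and `dispatch` helpers are all hypothetical, and the recognizers themselves (camera capture, gesture detection, speech recognition) are assumed to exist upstream and to emit plain string labels.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


@dataclass
class Avatar:
    """Toy stand-in for the virtual environment state (hypothetical)."""
    x: int = 0
    y: int = 0
    log: list = field(default_factory=list)

    def move(self, dx: int, dy: int) -> None:
        self.x += dx
        self.y += dy
        self.log.append(("move", dx, dy))

    def jump(self) -> None:
        self.log.append(("jump",))


ActionMap = Dict[Tuple[str, str], Callable[[], None]]


def build_action_map(avatar: Avatar) -> ActionMap:
    # Keys are (modality, label). Every label here is an illustrative
    # assumption, not a gesture or command named in the paper.
    return {
        ("gesture", "open_palm"): lambda: avatar.move(0, 1),
        ("gesture", "fist"): lambda: avatar.move(0, -1),
        ("voice", "left"): lambda: avatar.move(-1, 0),
        ("voice", "right"): lambda: avatar.move(1, 0),
        ("voice", "jump"): avatar.jump,
    }


def dispatch(action_map: ActionMap, modality: str, label: Optional[str]) -> bool:
    """Run the action for a recognized input; return False if unmapped."""
    if label is None:  # recognizer produced nothing for this frame/utterance
        return False
    action = action_map.get((modality, label.strip().lower()))
    if action is None:
        return False
    action()
    return True
```

In a full system, `dispatch` would be called once per recognized gesture or utterance. Keeping the mapping in a single dictionary means both input modalities share one lookup path, and unrecognized inputs are ignored rather than raising errors.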

Published

2026-03-26

How to Cite

Bardapure, R. S., Wagh, S. J., Kazi, A. K., Patil, A. S., & Patwekar, Z. M. (2026). An Intelligent Virtual World Framework Using Computer Vision and Speech Control. Journal of Data Engineering and Knowledge Discovery, 3(1), 27–36. Retrieved from https://matjournals.net/engineering/index.php/JoDEKD/article/view/3278