Implementation of an AI-Based Architectural Image Captioning System for Visual Accessibility
Keywords:
Architectural image captioning, Assistive technology, Computer vision, Deep learning, InceptionV3, LSTM, Multilingual text-to-speech, Visual accessibility
Abstract
Architectural imagery is widely available across digital platforms, yet automated description tools for it are lacking, which creates a significant accessibility barrier for visually impaired users and multilingual communities. Architectural images often contain complex structural details that are difficult to interpret without domain-specific knowledge, making manual captioning impractical at scale. This paper presents an AI-based architectural image captioning system designed to bridge the gap between architectural visual content and accessible language. The proposed system integrates an InceptionV3-based Convolutional Neural Network (CNN) for visual feature extraction with a custom Long Short-Term Memory (LSTM) decoder, trained on domain-specific architectural datasets, to generate accurate, context-aware captions. To enhance accessibility, a multilingual Text-to-Speech (TTS) module supporting English, Hindi, and Kannada lets users receive audio descriptions in their preferred language. Additionally, a graphical interface built with Tkinter supports real-time image upload, caption generation, and audio playback. The implementation demonstrates that combining domain-specific deep learning techniques with multilingual audio output can transform architectural images into meaningful spoken descriptions. The system is lightweight and scalable, and it runs efficiently on standard hardware without requiring cloud-based infrastructure. This work highlights the potential of artificial intelligence for improving accessibility, supporting architectural education, and enabling inclusive interaction with visual content for diverse user groups.
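The caption-generation step described above can be illustrated with a minimal sketch. It is not the system's actual code: it assumes greedy word-by-word decoding (the abstract does not state the decoding strategy), uses hypothetical `<start>` and `<end>` boundary tokens, and replaces the trained InceptionV3 encoder and LSTM decoder with a toy scoring function so that the loop runs on its own. The language codes are the ISO 639-1 codes for the three supported languages.

```python
from typing import Callable, Dict, List, Sequence

# ISO 639-1 codes for the three output languages named in the abstract.
TTS_LANGUAGE_CODES: Dict[str, str] = {"English": "en", "Hindi": "hi", "Kannada": "kn"}

# Hypothetical sentence-boundary tokens (an assumption, not from the paper).
START, END = "<start>", "<end>"

# A scorer maps (image features, words so far) to one score per candidate word.
# In the described system this role is played by the LSTM decoder.
NextWordScorer = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_caption(features: Sequence[float],
                   score_next: NextWordScorer,
                   max_len: int = 20) -> str:
    """Build a caption by repeatedly appending the highest-scoring next word."""
    words = [START]
    for _ in range(max_len):
        scores = score_next(features, words)
        best = max(scores, key=scores.get)
        if best == END:
            break
        words.append(best)
    return " ".join(words[1:])  # drop the start token


def tts_language_code(language: str) -> str:
    """Return the language code to pass to a TTS engine, or fail loudly."""
    try:
        return TTS_LANGUAGE_CODES[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language}") from None


def toy_scorer(features: Sequence[float], words: List[str]) -> Dict[str, float]:
    """Stand-in for a trained decoder: replays a fixed caption word by word."""
    script = ["a", "stone", "arch", "bridge", END]
    target = script[min(len(words) - 1, len(script) - 1)]
    return {w: (1.0 if w == target else 0.0) for w in script}


if __name__ == "__main__":
    caption = greedy_caption([0.0], toy_scorer)
    print(caption, "->", tts_language_code("Kannada"))
```

In a full pipeline, `features` would be the vector produced by the CNN encoder for an uploaded image, and the returned caption would be translated if needed and handed to the TTS module together with the selected language code.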