Developing a Self-Healing Software Architecture Using AI for Fault Detection and Recovery

Authors

  • Manas Kumar Majhi, Research Scholar, Amity School of Engineering & Technology, Amity University Chhattisgarh, Raipur, Chhattisgarh, India
  • Neha, Research Scholar, Amity School of Engineering & Technology, Amity University Chhattisgarh, Raipur, Chhattisgarh, India
  • Nayan Pandey, Research Scholar, Amity School of Engineering & Technology, Amity University Chhattisgarh, Raipur, Chhattisgarh, India

Keywords:

AI, AIOps, Autonomous system, Cloud computing, Fault detection, Intelligent automation, Machine learning, Predictive analytics, Self-healing software, System resilience

Abstract

As software development and deployment continue to advance quickly, maintaining system stability and reducing downtime have become increasingly important. Traditional software systems often require manual intervention to detect, diagnose, and resolve faults, which can lead to delays, increased operational costs, and a compromised user experience. To address these limitations, AI-based self-healing software systems are emerging as a transformative solution. Such systems operate independently to detect irregularities, forecast potential issues, and carry out recovery steps without manual input. By integrating artificial intelligence, machine learning, and observability tools, self-healing systems continuously monitor software behavior and infrastructure metrics in real time. Leveraging historical data and predictive analytics, the system learns to detect deviations from normal operations and automatically applies remediation techniques such as restarting services, scaling resources, applying patches, or rerouting traffic. This intelligent automation not only enhances system resilience and availability but also reduces the mean time to repair (MTTR), thereby improving overall efficiency and user satisfaction. Furthermore, AI-driven self-healing mechanisms can adapt and evolve with changing system conditions, which supports scalability and robustness. They are especially beneficial in complex distributed environments such as cloud-native applications, microservices architectures, and IoT ecosystems, where manual troubleshooting can be error-prone and time-consuming.

This study examines the structure, key elements, and techniques used in AI-powered self-healing systems, with a focus on their practical uses and advantages. It also addresses key challenges such as false positives, data privacy, and system transparency. The proposed approach aims to move towards autonomous IT operations (AIOps), where software systems become increasingly self-aware and self-reliant, marking a significant step toward the future of intelligent automation in software engineering.
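
The detect-and-recover loop outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SelfHealingMonitor`, the z-score rule, the window and threshold values, and the placeholder `restart_service` action are all assumptions chosen for illustration. A reading is compared against recent history, and a remediation callback runs when the reading deviates strongly from normal behavior.

```python
from collections import deque
from statistics import mean, stdev


class SelfHealingMonitor:
    """Toy detect-and-recover loop; every name here is hypothetical."""

    def __init__(self, window=20, threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # recent readings treated as normal
        self.threshold = threshold           # z-score that counts as a deviation
        self.min_samples = min_samples       # readings needed before judging
        self.actions = []                    # log of remediation steps taken

    def is_anomalous(self, value):
        # Too little history to define "normal" yet.
        if len(self.history) < self.min_samples:
            return False
        mu = mean(self.history)
        sigma = stdev(self.history)
        if sigma == 0:
            return value != mu
        return abs(value - mu) / sigma > self.threshold

    def observe(self, value, remediate):
        # On a deviation, run the remediation and keep the outlier
        # out of the baseline; otherwise learn from the reading.
        if self.is_anomalous(value):
            self.actions.append(remediate(value))
            return True
        self.history.append(value)
        return False


def restart_service(value):
    # Placeholder: a real system would call an orchestrator or service manager.
    return f"restart_service (latency={value} ms)"


monitor = SelfHealingMonitor()
readings = [100, 102, 98, 101, 99, 103, 97, 450, 100]  # made-up latencies in ms
flags = [monitor.observe(r, restart_service) for r in readings]
```

In this made-up series only the 450 ms reading is flagged. A fixed z-score threshold is the simplest possible detector; the learned, predictive models the abstract refers to would replace `is_anomalous`, and a threshold set too low would produce the false positives the study lists as a challenge.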

Published

2025-08-11