A Fault-Tolerant Framework for Improving Reliability in Distributed Cloud Systems

Authors

  • Darika S
  • P. Vanitha

Keywords

Checkpointing, Cloud computing, Distributed cloud systems, Failure detection, Fault tolerance, Lightweight framework, Recovery mechanism, Reliability, Replication, Task scheduling

Abstract

Distributed cloud systems, which offer scalable, on-demand services to consumers worldwide, have become the foundation of contemporary computing infrastructure. However, these systems are highly susceptible to hardware malfunctions, network outages, and node failures, which can severely degrade service availability and dependability. Traditional fault-tolerance techniques such as replication and checkpointing increase system resilience, but they frequently incur higher computing overhead, storage costs, and system complexity. This paper therefore presents a lightweight fault-tolerant framework that reduces resource overhead and enhances dependability in distributed cloud settings. To ensure service continuity when nodes fail, the proposed model integrates failure detection, adaptive task reallocation, and efficient recovery mechanisms. A simulation-based model was developed to assess system performance under failure conditions. Compared with non-fault-tolerant methods, the experimental results show a higher task completion rate, shorter recovery time, and improved system availability. The study emphasises the importance of economical and effective fault-tolerance strategies for next-generation distributed cloud systems.

Published

2026-04-04

How to Cite

S, D., & Vanitha, P. (2026). A Fault-Tolerant Framework for Improving Reliability in Distributed Cloud Systems. Journal of Security in Computer Networks and Distributed Systems, 3(1), 9–18. Retrieved from https://matjournals.net/engineering/index.php/JoSCNDS/article/view/3367