Integrating Machine Learning and Statistical Approaches for Predicting Breast Cancer Survival

Authors

  • A. Touseef Ahmed
  • B. Towfeeq Ahmed

Keywords

Artificial Neural Network (ANN), Big Data, Learning Algorithm, Logistic Regression, Machine Learning (ML) algorithm, Receiver Operating Characteristic (ROC)

Abstract

This study addresses the challenge of predicting 5-year breast cancer survival using ML techniques. It differs from previous research by using a larger training dataset, addressing the imbalance between the minority and majority classes, and applying improved data-cleaning processes. Of particular interest were strategies to improve recall for the minority class, because misclassifying these cases is especially costly. The main contribution of this work is the finding that logistic regression with properly set class weights gives the best balance of precision and recall for the minority class. Furthermore, the paper includes comprehensive algorithms and code for class membership determination and for implementing competing methods, enabling other researchers to reproduce and build upon this work.
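To make the idea of class weighting concrete, the following is a minimal sketch and not the authors' code. It assumes binary labels with 1 marking the minority class, uses the common "balanced" heuristic (weight = n_samples / (n_classes * class_count)) as one possible weight setting, and normalizes the weighted loss by the sum of weights. All function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Assumed heuristic: w_c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_log_loss(labels, probs, weights):
    """Class-weighted binary cross-entropy; probs are P(y = 1).

    Each sample's loss is scaled by the weight of its true class, so
    errors on the minority class cost more during fitting.
    """
    eps = 1e-12  # keep log() away from 0
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -weights[y] * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / sum(weights[y] for y in labels)


def precision_recall(labels, preds, positive=1):
    """Precision and recall for the class treated as positive."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, yh in pairs if y == positive and yh == positive)
    fp = sum(1 for y, yh in pairs if y != positive and yh == positive)
    fn = sum(1 for y, yh in pairs if y == positive and yh != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (hypothetical): 8 majority-class and 2 minority-class cases.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 7 + [1, 1, 0]  # one false positive, one hit, one miss

weights = balanced_class_weights(y_true)
print(weights)
print(precision_recall(y_true, y_pred))
```

With these toy labels the minority class receives four times the weight of the majority class, which is the mechanism that pushes a fitted model toward higher minority-class recall, usually at some cost in precision.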

Published

2025-01-31

How to Cite

A. Touseef Ahmed, & B. Towfeeq Ahmed. (2025). Integrating Machine Learning and Statistical Approaches for Predicting Breast Cancer Survival. Journal of Statistics and Mathematical Engineering, 11(1), 17–22. Retrieved from https://matjournals.net/engineering/index.php/JOSME/article/view/1370

Section

Articles