Correlation-SVM: A Multicollinearity-Aware Feature Selection Framework for SVM-Based Medical Diagnosis
DOI:
https://doi.org/10.46610/RTAIA.2026.v05i02.001
Keywords:
Correlation analysis, Feature selection, Medical diagnosis, Multicollinearity, Support vector machine, Variance inflation factor
Abstract
Medical datasets often contain redundant or highly correlated features. The resulting multicollinearity degrades Support Vector Machine (SVM) classifiers by causing unstable decision boundaries, inflated coefficient variances, reduced interpretability, and poorer generalization, yet traditional feature selection methods do not adequately address it. This paper proposes Correlation-SVM, a novel multicollinearity-aware feature selection framework that integrates Pearson correlation analysis and Variance Inflation Factor (VIF) computation within a hierarchical elimination process designed for SVM-based medical diagnosis. The framework operates in four stages: (1) Pearson correlation analysis to identify highly correlated feature pairs; (2) VIF computation to quantify multicollinearity severity; (3) hierarchical feature elimination, which iteratively removes redundant features and recomputes VIF after each removal; and (4) SVM training with cross-validation evaluation. The framework was evaluated on four benchmark medical datasets (Wisconsin Breast Cancer, PIMA Indian Diabetes, Hepatitis, and Mammographic Mass) and compared against six state-of-the-art methods (CFS, FCBF, mRMR, SVM-RFE, LASSO, and GA-SVM) using 10-fold cross-validation with five repeats. Correlation-SVM achieved 97.42% accuracy on the Wisconsin dataset with only 5 features (a 44.4% reduction), outperforming all comparison methods. Multicollinearity was substantially reduced: the maximum VIF decreased from 8.3 to 2.3 (72.3% reduction) on Wisconsin, from 12.5 to 2.1 (83.2% reduction) on Hepatitis, and from 4.2 to 1.6 (61.9% reduction) on PIMA, bringing all three below the 2.5 threshold treated as acceptable. The framework requires only 38.7 seconds of computation, making it 84% faster than GA-SVM and 79% faster than SVM-RFE, and thus combines wrapper-like performance with filter-like speed.
Additionally, the selected feature subsets align with established medical knowledge across all four datasets, enhancing clinical interpretability and trust. Correlation-SVM therefore provides an effective, computationally efficient framework for multicollinearity-aware feature selection in SVM-based medical diagnosis, achieving substantial feature reduction, markedly lower multicollinearity, and improved classification accuracy while maintaining interpretability.
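The first three stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 0.8 correlation cutoff, the rule of dropping the feature with the highest VIF, and the synthetic data are all assumptions made for illustration. Only the VIF threshold of 2.5 and the recomputation of VIF after each removal come from the abstract. VIF is computed through the standard identity that the VIF of feature j is the j-th diagonal entry of the inverse Pearson correlation matrix, which equals 1 / (1 - R_j^2).

```python
import numpy as np


def vif_scores(X):
    """Return the VIF of every column of X (needs at least two columns).

    Uses the identity VIF_j = [R^-1]_jj, where R is the Pearson correlation
    matrix; this equals 1 / (1 - R_j^2) from regressing column j on the rest.
    """
    corr = np.corrcoef(X, rowvar=False)
    # Ill-conditioned (or singular) if some features are exactly collinear.
    return np.diag(np.linalg.inv(corr))


def correlated_pairs(X, names, corr_threshold):
    """Stage 1: list feature pairs whose absolute Pearson r exceeds the cutoff."""
    corr = np.corrcoef(X, rowvar=False)
    n = len(names)
    return [(names[i], names[j])
            for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) > corr_threshold]


def eliminate_by_vif(X, names, vif_threshold=2.5):
    """Stages 2-3: drop the highest-VIF feature until max VIF <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        vif = vif_scores(X[:, keep])   # recomputed after every removal
        worst = int(np.argmax(vif))
        if vif[worst] <= vif_threshold:
            break
        keep.pop(worst)                # assumed rule: remove the worst offender
    return [names[i] for i in keep]


# Synthetic data (an assumption, not one of the paper's datasets):
# three independent features plus "d", a noisy copy of "a".
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 500))
d = a + 0.1 * rng.normal(size=500)
X = np.column_stack([a, b, c, d])
names = ["a", "b", "c", "d"]

pairs = correlated_pairs(X, names, corr_threshold=0.8)  # 0.8 is a hypothetical cutoff
kept = eliminate_by_vif(X, names)
print("highly correlated pairs:", pairs)
print("features kept:", kept)
```

For the fourth stage, the columns of the retained features would be passed to an SVM under the stated evaluation protocol. With scikit-learn, one option is `SVC` scored by `cross_val_score` with `RepeatedStratifiedKFold(n_splits=10, n_repeats=5)`, which matches the 10-fold, five-repeat setup. The kernel and hyperparameters are not given in the abstract and would have to be chosen separately.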