SVM-based Medical Diagnosis: A Review of Multicollinearity-aware Feature Selection

Satish Kumar Kalagotla; Thoudam Basanta; Mutum Bidyarani Devi

Abstract

Feature selection is essential in medical diagnosis applications, where datasets often contain redundant or highly correlated features. Multicollinearity, which occurs when predictor variables are strongly correlated, creates challenges for support vector machine (SVM) classifiers, including unstable decision boundaries and reduced generalization performance. This study surveys the literature on multicollinearity-aware feature selection frameworks for SVM-based medical diagnosis, focusing on correlation-SVM approaches that integrate Pearson correlation analysis and variance inflation factor (VIF) computation. It examines the theoretical foundations of multicollinearity, its impact on SVM performance, and existing feature selection methodologies, including filter, wrapper, and embedded methods. The survey finds that correlation-guided feature selection with iterative VIF recomputation consistently achieves 40–63% feature reduction while improving classification accuracy by 2–8% on benchmark medical datasets. It also explores recent advancements and identifies promising directions for future research, including multi-objective optimization and explainable AI integration.
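To make the idea of iterative VIF recomputation concrete, the following is a minimal sketch and not the procedure of any specific surveyed study. It assumes a purely numeric feature matrix with no constant columns. The function names (`high_correlation_pairs`, `vif_scores`, `iterative_vif_prune`), the correlation cutoff of 0.9, the VIF threshold of 10 (a common rule of thumb, not a value taken from the abstract), and the synthetic data are all illustrative assumptions.

```python
import numpy as np


def high_correlation_pairs(X, names, cutoff=0.9):
    """List feature pairs whose absolute Pearson correlation exceeds `cutoff`."""
    corr = np.corrcoef(X, rowvar=False)
    p = corr.shape[0]
    return [(names[i], names[j])
            for i in range(p) for j in range(i + 1, p)
            if abs(corr[i, j]) > cutoff]


def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all other columns (with an intercept)."""
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        y = X[:, j]
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())  # assumes a non-constant column
        r2 = 1.0 - ss_res / ss_tot
        scores[j] = np.inf if r2 >= 1.0 - 1e-12 else 1.0 / (1.0 - r2)
    return scores


def iterative_vif_prune(X, names, threshold=10.0):
    """Drop the feature with the largest VIF, recompute all VIFs on the
    remaining features, and repeat until every VIF is <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        scores = vif_scores(X[:, keep])
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        del keep[worst]
    return [names[i] for i in keep]


# Synthetic illustration (assumed data): d is almost exactly a + b.
rng = np.random.default_rng(0)
a, b, c = rng.normal(size=(3, 200))
d = a + b + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c, d])
names = ["a", "b", "c", "d"]

pairs = high_correlation_pairs(X, names)   # pairwise screen at |r| > 0.9
selected = iterative_vif_prune(X, names)   # VIF-based pruning
print(pairs, selected)
```

In this synthetic case no single pair of features is correlated above the cutoff, yet one feature is nearly a linear combination of two others. The pairwise Pearson screen alone misses this, while the VIF step detects it, which is one motivation for combining the two checks. The retained columns could then be passed to an SVM classifier; that step is omitted here.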
