DT-SVM: A Novel Hybrid Decision Tree-Support Vector Machine Framework for Robust Classification with Missing Medical Data
Keywords:
Decision tree, DT-SVM, Hybrid classifier, Medical data analysis, Missing value imputation, Robust classification, Surrogate splits, Support vector machine

Abstract
Background: Missing values represent one of the most pervasive challenges in medical data analysis, affecting 5–20% of clinical datasets and significantly degrading the performance of machine learning classifiers. Traditional imputation methods, such as mean, median, or k-nearest neighbor imputation, often underestimate variance, introduce bias, or fail to leverage local data structures. Support vector machines (SVM), despite their superior classification capabilities, cannot directly handle missing values, necessitating integrated approaches that combine imputation with classification.
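The variance underestimation caused by mean imputation can be demonstrated in a few lines (a synthetic sketch; the distribution and missing rate are illustrative and not taken from this study):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100.0, scale=15.0, size=1000)  # synthetic lab measurement

# Delete 20% of values completely at random (MCAR), then mean-impute them.
mask = rng.random(x.size) < 0.20
x_imputed = x.copy()
x_imputed[mask] = x[~mask].mean()

# Every imputed value sits exactly at the mean, so the completed series
# is artificially less dispersed than the observed data.
print(f"observed variance:     {x[~mask].var():.1f}")
print(f"after mean imputation: {x_imputed.var():.1f}")
```

Because each imputed value contributes zero deviation from the mean, the variance of the completed series shrinks in proportion to the observed fraction, biasing any downstream estimate that depends on spread.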
Objective: This paper proposes DT-SVM, a novel hybrid framework that integrates decision trees and support vector machines to address missing value problems in medical data classification. The framework leverages decision trees’ inherent ability to handle missing values through surrogate splits while utilizing SVM’s superior classification performance.
Methods: The proposed DT-SVM framework operates in two stages: (1) a decision tree trained on complete cases performs missing value imputation using surrogate splits that leverage attribute correlations within homogeneous data segments; (2) a support vector machine with radial basis function (RBF) kernel performs final classification on the imputed dataset, incorporating decision tree-derived feature importance weights. The framework was evaluated on four benchmark medical datasets (Wisconsin Breast Cancer, PIMA Indian Diabetes, Hepatitis, and Mammographic Mass) under three missing mechanisms (MCAR, MAR, MNAR) with missing rates ranging from 5% to 30%.
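The two-stage pipeline can be sketched as follows (a minimal illustration using scikit-learn; since scikit-learn's CART does not implement surrogate splits, stage 1 approximates them with per-feature regression trees trained on complete cases, and predictor gaps are mean-filled at prediction time; all parameter choices here are illustrative, not the paper's):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


def tree_impute(X):
    """Stage 1: impute each column with a decision tree fitted on complete cases.

    Predictor gaps in incomplete rows are filled with observed column means
    at prediction time only; a simplification of the surrogate-split mechanism.
    """
    X = X.copy()
    complete = ~np.isnan(X).any(axis=1)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        others = np.delete(np.arange(X.shape[1]), j)
        tree = DecisionTreeRegressor(max_depth=4, random_state=0)
        tree.fit(X[np.ix_(complete, others)], X[complete, j])
        Xp = X[np.ix_(miss, others)]
        Xp = np.where(np.isnan(Xp), col_means[others], Xp)
        X[miss, j] = tree.predict(Xp)
    return X


# Demo on the Wisconsin breast-cancer data with ~10% MCAR missingness.
X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(42)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan

X_imp = tree_impute(X_miss)

# Stage 2: weight features by tree-derived importance, then classify with an RBF-SVM.
w = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_imp, y).feature_importances_
X_w = StandardScaler().fit_transform(X_imp) * (w + 1e-3)  # small floor keeps every feature

Xtr, Xte, ytr, yte = train_test_split(X_w, y, test_size=0.3, random_state=0, stratify=y)
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(Xtr, ytr)
print(f"test accuracy: {clf.score(Xte, yte):.3f}")
```

Scaling the standardized features by the tree's importance vector is one simple way to pass decision-tree feature relevance into the RBF kernel, since the kernel's distance computation then emphasizes the heavily weighted dimensions.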
Results: DT-SVM achieved 96.12% accuracy (95% CI: [0.9587, 0.9637]) on the Wisconsin dataset with 10% missing values, significantly outperforming mean imputation (94.12%, 95% CI: [0.9389, 0.9435], p < 0.001, Cohen’s d = 1.24), kNN imputation (94.87%, 95% CI: [0.9465, 0.9509], p < 0.001, d = 0.89), and MICE (95.23%, 95% CI: [0.9501, 0.9545], p < 0.01, d = 0.52). The framework remained robust across all three missing mechanisms, with a performance degradation of only 2.71% at 30% missingness compared to 7.8% for mean imputation. Cross-dataset validation showed consistent improvements across all four datasets. In high-dimensional experiments (500 features, 10% missing), DT-SVM maintained 91.3% accuracy versus 87.2% for mean imputation (+4.1 percentage points, p < 0.01), with linear computational scaling O(d·n).
Conclusion: The DT-SVM framework provides a practical solution for developing reliable diagnostic systems capable of operating effectively with real-world clinical data containing missing values, making it particularly suitable for medical applications where data quality issues are common and prediction accuracy is critical.