DT-SVM and Hybrid Approaches for Missing Data Imputation and Classification: A Comprehensive Survey

Satish Kumar Kalagotla; Thoudam Basanta; Mutum Bidyarani Devi

Keywords:

Data preprocessing, Decision tree, DT-SVM, Ensemble methods, Hybrid classifier, Medical data analysis, Missing value imputation, Support vector machine

Abstract

Missing data is a pervasive problem in real-world datasets, particularly in medical research and clinical applications, where it can substantially degrade the performance of machine learning classifiers and compromise the validity of analytical conclusions. This survey systematically examines hybrid approaches that integrate decision trees (DT) and support vector machines (SVM) for missing value imputation and subsequent classification, with particular emphasis on the DT-SVM framework and its algorithmic variants. It reviews missing data mechanisms, evaluates traditional and machine-learning-based imputation techniques, and outlines the theoretical foundations of decision trees and support vector machines. Critical analysis of existing hybrid methodologies, together with comparative evaluation against conventional approaches, indicates that DT-based imputation, which exploits the stronger attribute correlations found within homogeneous data segments identified by recursive partitioning, consistently outperforms simple imputation methods when combined with SVM classification. The survey further examines recent advancements, including approximated k-nearest neighbor (A-kNN) variants that address computational efficiency concerns while maintaining classification accuracy. Key research gaps are identified, including high-dimensional settings, missing-not-at-random (MNAR) mechanisms, and integration with deep learning architectures. The findings suggest that integrated frameworks such as DT-SVM are a promising direction for robust classification in the presence of missing data, particularly in medical diagnosis, where data quality issues are prevalent and prediction accuracy is paramount.
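The idea of imputing within homogeneous segments found by recursive partitioning can be illustrated with a minimal sketch. It is not the DT-SVM algorithm from the surveyed literature: it assumes a single fully observed numeric attribute to split on, uses variance reduction as the split criterion, and fills each gap with the segment mean. The function names (`sse`, `best_split`, `impute_by_partition`) and the toy `age`/`bp` records are hypothetical. In a full pipeline the completed records would then be passed to an SVM classifier, a step omitted here.

```python
from statistics import mean


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, feature, target):
    """Threshold on `feature` that most reduces the SSE of `target`
    among rows where `target` is observed; None if no split helps."""
    observed = [r for r in rows if r[target] is not None]
    xs = sorted({r[feature] for r in observed})
    best, best_score = None, sse([r[target] for r in observed])
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2  # candidate threshold between neighbouring values
        left = [r[target] for r in observed if r[feature] <= t]
        right = [r[target] for r in observed if r[feature] > t]
        score = sse(left) + sse(right)
        if score < best_score:
            best, best_score = t, score
    return best


def impute_by_partition(rows, feature, target, depth=2, min_observed=2):
    """Recursively partition on `feature`, then fill missing `target`
    values in place with the mean of the observed values in the segment."""
    observed = [r[target] for r in rows if r[target] is not None]
    t = None
    if depth > 0 and len(observed) >= 2 * min_observed:
        t = best_split(rows, feature, target)
    if t is not None:
        left = [r for r in rows if r[feature] <= t]
        right = [r for r in rows if r[feature] > t]
        # Accept the split only if both sides keep enough observed values.
        if all(sum(r[target] is not None for r in side) >= min_observed
               for side in (left, right)):
            impute_by_partition(left, feature, target, depth - 1, min_observed)
            impute_by_partition(right, feature, target, depth - 1, min_observed)
            return rows
    if observed:  # leaf segment: impute with the local mean
        fill = mean(observed)
        for r in rows:
            if r[target] is None:
                r[target] = fill
    return rows


# Toy records (made up for illustration): two age groups with different bp levels.
rows = [
    {"age": 25, "bp": 110.0}, {"age": 30, "bp": 114.0}, {"age": 28, "bp": None},
    {"age": 60, "bp": 140.0}, {"age": 65, "bp": 146.0}, {"age": 62, "bp": None},
]
impute_by_partition(rows, "age", "bp")
print([r["bp"] for r in rows])
```

On this toy data the partition separates the younger and older groups, so each missing value is filled from its own group (112.0 and 143.0) rather than from the global mean of 127.5, which is the intuition behind preferring segment-wise imputation over simple mean imputation.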

