A comparison of strategies for missing values in data on machine learning classification algorithms

Makaba, Tebogo; Dogo, Eustace

Please use this identifier to cite or link to this item: http://ir.futminna.edu.ng:8080/jspui/handle/123456789/6714

Full metadata record

DC Field	Value	Language
dc.contributor.author	Makaba, Tebogo	-
dc.contributor.author	Dogo, Eustace	-
dc.date.accessioned	2021-07-06T10:11:21Z	-
dc.date.available	2021-07-06T10:11:21Z	-
dc.date.issued	2019	-
dc.identifier.other	doi: 10.1109/IMITEC45504.2019.9015889.	-
dc.identifier.uri	http://repository.futminna.edu.ng:8080/jspui/handle/123456789/6714	-
dc.description.abstract	Dealing with missing values in data is an important feature engineering task in data science, needed to avoid degrading the predictive accuracy of machine learning classification models. However, the underlying cause of the missing values in real-life data, that is, the missing data mechanism behind the missingness, is often unclear. Thus, it becomes necessary to evaluate several missing data approaches for a given dataset. In this paper, we perform a comparative study of several approaches for handling missing values in data, namely listwise deletion, mean, mode, k-nearest neighbors, expectation-maximization, and multiple imputation by chained equations. The comparison is performed on two real-world datasets, using the following evaluation metrics: accuracy, root mean squared error, receiver operating characteristic, and the F1 score. Most classifiers performed well across the missing data strategies. However, based on the results obtained, the support vector classifier performed marginally better overall for the numerical data, and the naïve Bayes classifier for the categorical data, when compared to the other evaluated methods.	en_US
dc.description.sponsorship	University of Johannesburg, South Africa	en_US
dc.language.iso	en	en_US
dc.publisher	IEEE	en_US
dc.subject	missing data	en_US
dc.subject	imputation methods	en_US
dc.subject	performance metric	en_US
dc.subject	machine learning	en_US
dc.subject	classification	en_US
dc.title	A comparison of strategies for missing values in data on machine learning classification algorithms	en_US
dc.type	Other	en_US
Appears in Collections:	Computer Engineering

Files in This Item:

File	Description	Size	Format
Comparison of missing Values_ML.pdf		765.75 kB	Adobe PDF
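
As a rough illustration of three of the simpler strategies named in the abstract (listwise deletion, mean imputation, and mode imputation), the following is a minimal sketch in plain Python. It is not the authors' code: the toy records, the column names, and the helper functions `listwise_deletion` and `impute` are all hypothetical and exist only for this example. The k-nearest neighbors, expectation-maximization, and chained-equations approaches are not shown.

```python
from statistics import mean, mode

# Hypothetical toy data (not from the paper); None marks a missing value.
rows = [
    {"age": 25, "income": 50.0, "city": "A"},
    {"age": None, "income": 62.0, "city": "B"},
    {"age": 31, "income": None, "city": "A"},
    {"age": 40, "income": 58.0, "city": None},
]


def listwise_deletion(rows):
    """Drop every row that has at least one missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, column, strategy):
    """Fill missing entries of one column with the mean or mode of its observed values.

    Returns new row dicts; the input rows are left unchanged.
    """
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed) if strategy == "mean" else mode(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Strategy 1: keep only fully observed rows.
complete = listwise_deletion(rows)

# Strategies 2 and 3: mean for the numerical columns, mode for the categorical one.
filled = impute(rows, "age", "mean")
filled = impute(filled, "income", "mean")
filled = impute(filled, "city", "mode")

print(len(rows), "rows before deletion,", len(complete), "after")
for r in filled:
    print(r)
```

Listwise deletion shrinks the dataset, while the two imputation strategies keep every row at the cost of replacing missing entries with a single summary value. This is the kind of trade-off a comparison across strategies has to weigh.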
