Imbalanced data ensemble classification based on cluster-based under-sampling algorithm

WU Sen; LIU Lu; LU Dan

doi:10.13374/j.issn2095-9389.2017.08.015

Volume 39 Issue 8

Aug. 2017

WU Sen, LIU Lu, LU Dan. Imbalanced data ensemble classification based on cluster-based under-sampling algorithm[J]. Chinese Journal of Engineering, 2017, 39(8): 1244-1253. doi: 10.13374/j.issn2095-9389.2017.08.015

Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

Received Date: 2016-12-30

Abstract

Most traditional classification algorithms assume that the data set is well balanced and aim to maximize overall classification accuracy. However, real data sets are usually imbalanced, so traditional classifiers tend to misclassify minority class samples. There are two main ways to improve classification performance on imbalanced data. The first is to modify the data set, either by over-sampling to increase the number of minority class samples or by under-sampling to decrease the number of majority class samples. The second is to improve the algorithm itself. This paper proposes an approach for classifying imbalanced data that combines cluster-based under-sampling with ensemble classification. First, in the data processing stage, the cluster-based under-sampling method is used to build a balanced data set; an AdaBoost ensemble is then trained on this new data set. During ensemble learning, when the error rate is calculated, the algorithm uses weights to distinguish minority class data from majority class data. This makes the algorithm focus more on the minority class, thereby improving the classification accuracy of minority class data.
Keywords:
- imbalanced data
- under-sampling
- classification
- ensemble learning
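
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the clustering method, the sample-selection rule, or the weighting formula, so the sketch assumes k-means on the majority class with one retained sample per cluster (the one nearest each centroid) and a simple per-class cost multiplier in the error rate. The names `kmeans`, `cluster_undersample`, `weighted_error`, and `class_w` are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns k centroids (assumed clustering method)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # requires k <= len(points)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Keep one majority sample per cluster so both classes have equal size.

    Assumption: the number of clusters equals the minority class size, and
    the retained sample is the real point closest to each centroid.
    """
    centroids = kmeans(majority, len(minority), seed=seed)
    remaining = list(majority)
    kept = []
    for c in centroids:
        nearest = min(remaining, key=lambda p: sq_dist(p, c))
        kept.append(nearest)
        remaining.remove(nearest)  # never pick the same sample twice
    return kept


def weighted_error(y_true, y_pred, sample_w, class_w):
    """Error rate in which each sample's boosting weight is scaled by a
    per-class cost (assumed form), so minority mistakes count for more."""
    num = sum(w * class_w[t]
              for t, p, w in zip(y_true, y_pred, sample_w) if t != p)
    den = sum(w * class_w[t] for t, w in zip(y_true, sample_w))
    return num / den


if __name__ == "__main__":
    majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0),
                (5.1, 4.9), (9.0, 0.0), (9.2, 0.1), (8.9, 0.3)]
    minority = [(2.0, 2.0), (2.1, 2.2), (1.9, 2.1)]
    balanced_majority = cluster_undersample(majority, minority)
    print(len(balanced_majority), len(minority))

    # One misclassified minority sample (label 1) out of four samples.
    err = weighted_error([1, 0, 0, 0], [0, 0, 0, 0],
                         [0.25] * 4, {1: 3.0, 0: 1.0})
    print(err)
```

With uniform class weights, the single error in the usage example would give an error rate of 0.25; the assumed minority cost of 3.0 raises it to 0.5, which is the sense in which a weighted error rate makes boosting pay more attention to the minority class.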
