Volume 42, Issue 9, Sep. 2020
Citation: WU Sen, WANG Yu-zhi, GAO Xiao-nan. Clustering algorithm for imbalanced data based on nearest neighbor[J]. Chinese Journal of Engineering, 2020, 42(9): 1209-1219. doi: 10.13374/j.issn2095-9389.2019.10.09.003

Clustering algorithm for imbalanced data based on nearest neighbor

doi: 10.13374/j.issn2095-9389.2019.10.09.003
Clustering is an important task in the field of data mining. Most clustering algorithms can effectively deal with balanced datasets, but their processing ability is weak for imbalanced datasets. For example, K–means, a classical partitioning clustering algorithm, tends to produce a “uniform effect” when dealing with imbalanced datasets: it often produces clusters of relatively uniform size, with small clusters “swallowing” part of the data objects that belong to large clusters. As a result, the number and density of data objects in different clusters tend to become the same. To solve the “uniform effect” produced by the classical K–means algorithm when clustering imbalanced data, a clustering algorithm based on nearest neighbor (CABON) is proposed for imbalanced data. First, an initial clustering of the data objects is performed to obtain the undetermined-cluster set, defined as the set of data objects whose cluster membership must be checked further. Then, proceeding from the edge of the set toward its center, the nearest-neighbor method is used to reassign the data objects in the undetermined-cluster set to the clusters of their nearest neighbors, while the undetermined-cluster set is dynamically adjusted; this yields the final clustering result and prevents the “uniform effect” from influencing it. The clustering results of the proposed algorithm are compared with those of K–means, the imbalanced K–means clustering method with multiple centers (MC_IK), and the coefficient of variation clustering for non-uniform data (CVCN) on synthetic and real datasets. The experimental results reveal that the CABON algorithm effectively reduces the “uniform effect” generated by the K–means algorithm on imbalanced data, and its clustering results are superior to those of the K–means, MC_IK, and CVCN algorithms.
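
The Python sketch below illustrates this workflow under stated assumptions only: K–means is used for the initial clustering, the undetermined-cluster set is approximated by the points farthest from their own cluster centers, and the edge-to-center order is taken as distance from the set's mean. The paper's exact construction of the undetermined-cluster set may differ; this is not the authors' implementation.

    # Illustrative sketch of the CABON idea described in the abstract.
    # The construction of the undetermined-cluster set and the edge-to-center
    # ordering are assumptions made for this example, not the paper's exact rules.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.neighbors import NearestNeighbors

    def cabon_sketch(X, k, boundary_fraction=0.2):
        # Step 1: initial clustering (K-means as a stand-in).
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        labels = km.labels_.copy()

        # Step 2 (assumption): take the points farthest from their own cluster
        # center as the undetermined-cluster set, i.e. the objects whose
        # cluster membership must be checked further.
        dist_to_center = np.linalg.norm(X - km.cluster_centers_[labels], axis=1)
        cutoff = np.quantile(dist_to_center, 1 - boundary_fraction)
        undetermined = set(np.where(dist_to_center >= cutoff)[0])

        # Step 3: from the edge of the set toward its center, reassign each
        # undetermined object to the cluster of its nearest determined neighbor,
        # shrinking the set after every reassignment (its dynamic adjustment).
        set_center = X[list(undetermined)].mean(axis=0)
        order = sorted(undetermined,
                       key=lambda i: -np.linalg.norm(X[i] - set_center))
        for i in order:
            determined = [j for j in range(len(X)) if j not in undetermined]
            nn = NearestNeighbors(n_neighbors=1).fit(X[determined])
            _, idx = nn.kneighbors(X[i].reshape(1, -1))
            labels[i] = labels[determined[idx[0, 0]]]
            undetermined.discard(i)
        return labels

Refitting the nearest-neighbor search inside the loop keeps the sketch faithful to the dynamic adjustment of the set described above, at the cost of efficiency.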

     

References

[1] Wu S, Feng X D, Zhou W J. Spectral clustering of high-dimensional data exploiting sparse representation vectors. Neurocomputing, 2014, 135: 229. doi: 10.1016/j.neucom.2013.12.027
[2] Wilson J, Chaudhury S, Lall B. Clustering short temporal behaviour sequences for customer segmentation using LDA. Expert Syst, 2018, 35(3): e12250. doi: 10.1111/exsy.12250
[3] Zhao L B, Shi G Y. A trajectory clustering method based on Douglas-Peucker compression and density for marine traffic pattern recognition. Ocean Eng, 2019, 172: 456. doi: 10.1016/j.oceaneng.2018.12.019
[4] Al-Shammari A, Zhou R, Naseriparsaa M, et al. An effective density-based clustering and dynamic maintenance framework for evolving medical data streams. Int J Med Inform, 2019, 126: 176. doi: 10.1016/j.ijmedinf.2019.03.016
[5] Hu Y, Li H, Chen M. Taxi abnormal trajectory detection based on density clustering. Comput Modernization, 2019(6): 49. doi: 10.3969/j.issn.1006-2475.2019.06.008
[6] Han W H, Huang Z Z, Li S D, et al. Distribution-sensitive unbalanced data oversampling method for medical diagnosis. J Med Syst, 2019, 43(2): 39. doi: 10.1007/s10916-018-1154-8
[7] Chen L T, Xu G H, Zhang Q, et al. Learning deep representation of imbalanced SCADA data for fault detection of wind turbines. Measurement, 2019, 139: 370. doi: 10.1016/j.measurement.2019.03.029
[8] Xiong H, Wu J J, Chen J. K–means clustering versus validation measures: A data-distribution perspective. IEEE Trans Syst Man Cybern Part B (Cybern), 2009, 39(2): 318. doi: 10.1109/TSMCB.2008.2004559
[9] Luo Z C, Jin S, Qiu X F. Spectral clustering based oversampling: oversampling taking within class imbalance into consideration. Comput Eng Appl, 2014, 50(11): 120. doi: 10.3778/j.issn.1002-8331.1312-0148
[10] Kumar N S, Rao K N, Govardhan A, et al. Undersampled K–means approach for handling imbalanced distributed data. Prog Artif Intell, 2014, 3(1): 29. doi: 10.1007/s13748-014-0045-6
[11] Wu S, Liu L, Lu D. Imbalanced data ensemble classification based on cluster-based under-sampling algorithm. Chin J Eng, 2017, 39(8): 1244
[12] Lin W C, Tsai C F, Hu Y H, et al. Clustering-based undersampling in class-imbalanced data. Inform Sci, 2017, 409-410: 17. doi: 10.1016/j.ins.2017.05.008
[13] Liang J Y, Bai L, Dang C Y, et al. The K–means-type algorithms versus imbalanced data distributions. IEEE Trans Fuzzy Syst, 2012, 20(4): 728. doi: 10.1109/TFUZZ.2011.2182354
[14] Qi H. Imbalanced K–means clustering method with multiple centers. J North Univ China Nat Sci, 2015, 36(4): 453
[15] Yang T P, Xu K P, Chen L F. Coefficient of variation clustering algorithm for non-uniform data. J Shandong Univ Eng Sci, 2018, 48(3): 140
[16] Liu H, Hu D M. Research on Chi-square clustering algorithm for unbalanced data. Comput Eng Software, 2019, 40(4): 7. doi: 10.3969/j.issn.1003-6970.2019.04.002
[17] Jiang P. The Research of Multi-clusters IB Algorithm for Imbalanced Data Set [Dissertation]. Zhengzhou: Zhengzhou University, 2015
[18] Bai L. Theoretical Analysis and Effective Algorithms of Cluster Learning [Dissertation]. Taiyuan: Shanxi University, 2012
[19] Gionis A, Mannila H, Tsaparas P. Clustering aggregation. ACM Trans Knowledge Discovery Data, 2007, 1(1): 1. doi: 10.1145/1217299.1217300
[20] Chen M, Li L J, Wang B, et al. Effectively clustering by finding density backbone based-on kNN. Pattern Recognit, 2016, 60: 486. doi: 10.1016/j.patcog.2016.04.018
[21] Li T, Geng H W, Su S Z. Density peaks clustering based on density adaptive distance. J Chin Comput Syst, 2017, 38(6): 1347. doi: 10.3969/j.issn.1000-1220.2017.06.032
[22] Forina M. Wine Data Set [EB/OL]. UCI Machine Learning Repository (1991-07-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Wine
[23] Quinlan J R. Thyroid Disease Data Set [EB/OL]. UCI Machine Learning Repository (1987-01-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease
[24] Sigillito V G. Ionosphere Data Set [EB/OL]. UCI Machine Learning Repository (1989-01-01) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Ionosphere
[25] Dua D, Graff C. Statlog (Heart) Data Set [EB/OL]. UCI Machine Learning Repository (1993-02-13) [2019-10-09]. http://archive.ics.uci.edu/ml/datasets/Statlog+%28Heart%29
[26] Wu P P. Research on Initial Cluster Centers Choice Algorithm and Clustering for Imbalanced Data [Dissertation]. Taiyuan: Shanxi University, 2015
[27] Fu L W, Wu S. A new internal clustering validation index for categorical data based on concentration of attribute values. Chin J Eng, 2019, 41(5): 682
[28] Hussain S F, Haris M. A K–means based co-clustering (kCC) algorithm for sparse, high dimensional data. Expert Syst Appl, 2019, 118: 20. doi: 10.1016/j.eswa.2018.09.006
[29] Yeh C C, Yang M S. Evaluation measures for cluster ensembles based on a fuzzy generalized Rand index. Appl Soft Comput, 2017, 57: 225. doi: 10.1016/j.asoc.2017.03.030
[30] Qannari E M, Courcoux P, Faye P. Significance test of the adjusted Rand index. Application to the free sorting task. Food Qual Preference, 2014, 32: 93. doi: 10.1016/j.foodqual.2013.05.005