A new internal clustering validation index for categorical data based on concentration of attribute values

FU Li-wei; WU Sen

doi:10.13374/j.issn2095-9389.2019.05.015

Volume 41 Issue 5

May 2019

FU Li-wei, WU Sen. A new internal clustering validation index for categorical data based on concentration of attribute values[J]. Chinese Journal of Engineering, 2019, 41(5): 682-693. doi: 10.13374/j.issn2095-9389.2019.05.015

Donlinks School of Economics and Management, University of Science and Technology Beijing, Beijing 100083, China

Corresponding author: WU Sen, E-mail: wusen@manage.ustb.edu.cn
Received Date: 2018-04-18
Publish Date: 2019-05-01

Abstract

Clustering is a central task in data mining; its purpose is to identify natural structures in a dataset. The results of cluster analysis depend not only on the nature of the data but also on a priori conditions such as the clustering algorithm, the similarity/dissimilarity measure, and the parameters. For data without a clustering structure, clustering results need to be evaluated. For data with a clustering structure, the different results obtained under different algorithms and parameters also need to be further optimized by clustering validation. Moreover, clustering validation is vital to clustering applications, especially when external information is not available. It is applied in algorithm selection, parameter determination, and determination of the number of clusters. Most traditional internal clustering validation indices are designed for numerical data and fail to measure categorical data. Categorical data is a common data type whose attribute values are discrete and cannot be ordered. For categorical data, the existing measures have limitations in different application circumstances. In this paper, a new similarity based on the concentration ratio of every attribute value, called CONC, was defined to evaluate the similarity of objects in a cluster. Similarly, a new dissimilarity based on the discrepancy of characteristic attribute values, called DCRP, was defined to evaluate the dissimilarity between two clusters. A new internal clustering validation index, called CVC, was then proposed on the basis of CONC and DCRP.
Compared to other indices, CVC has three characteristics: (1) it evaluates the compactness of a cluster based on the information of the whole dataset and not only that of the cluster; (2) it evaluates the separation between two clusters by several characteristic attribute values, so that clustering information is not lost and the negative effects caused by noise are eliminated; (3) it evaluates compactness and separation without influence from the number of objects. Furthermore, UCI benchmark datasets were used to compare the proposed index with other internal clustering validation indices (CU, CDCS, and IE). An external index, normalized mutual information (NMI), was used to evaluate the effectiveness of these internal indices. According to the experimental results, CVC is more effective than the other internal clustering validation indices. In addition, CVC, as an internal index, is more widely applicable than the external index NMI, because it can evaluate clustering results without external information.
- cluster analysis,
- internal clustering validation index,
- categorical data,
- high dimensional data,
- similarity,
- dissimilarity
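
For readers unfamiliar with the external index mentioned in the abstract, the following is a minimal sketch of how NMI can be computed between a clustering result and known class labels. It is not taken from the paper: the function names are hypothetical, and the normalization by the arithmetic mean of the two entropies is an assumption, since the abstract does not state which NMI variant the authors used.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two label assignments of the same objects."""
    n = len(a)
    joint = Counter(zip(a, b))          # co-occurrence counts n_ij
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (n_ij / n) * log(n_ij * n / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumed variant)."""
    if len(a) != len(b):
        raise ValueError("label sequences must have the same length")
    total_entropy = entropy(a) + entropy(b)
    if total_entropy == 0:
        # Both assignments place every object in one group; treating them
        # as identical is a convention chosen for this sketch.
        return 1.0
    return 2.0 * mutual_information(a, b) / total_entropy


if __name__ == "__main__":
    truth = ["a", "a", "b", "b"]
    print(nmi(truth, [1, 1, 0, 0]))  # same partition, different cluster names
    print(nmi(truth, [0, 1, 0, 1]))  # partition independent of the classes
```

Because NMI requires the true class labels, it can only serve as a reference for judging internal indices on benchmark data, which is the role the abstract assigns to it.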
