Cluster analysis aims at identifying groups of similar objects and therefore helps to discover the distribution of patterns and interesting correlations in large data sets. It has been the subject of wide research since it arises in many application domains in engineering, business, and the social sciences. In recent years especially, the availability of huge transactional and experimental data sets and the growing requirements of data mining have created the need for clustering algorithms that scale and can be applied in diverse domains. This paper introduces the fundamental concepts of clustering and surveys the widely known clustering algorithms in a comparative way. Moreover, it addresses an important issue of the clustering process: the quality assessment of clustering results, which is also related to the inherent features of the data set under study. A review of clustering validity measures and approaches available in the literature is presented. Furthermore, the paper highlights the issues that remain under-addressed by recent algorithms and outlines trends in the clustering process.
Although the goal of clustering is intuitively compelling and its notion arises in many fields, it is difficult to define a unified approach to the clustering problem, and thus diverse clustering algorithms abound in the research community. These algorithms, under different clustering assumptions, often lead to qualitatively different results. As a consequence, the results of clustering algorithms (i.e. data set partitionings) need to be evaluated as regards their validity, based on widely accepted criteria. In this paper a cluster validity index, CDbw, is proposed, which assesses the compactness and separation of clusters defined by a clustering algorithm. Given a data set and a set of clustering algorithms, the index enables: i) the selection of the input parameter values that lead an algorithm to the best possible partitioning of the data set, and ii) the selection of the algorithm that provides the best partitioning of the data set. CDbw handles arbitrarily shaped clusters efficiently by representing each cluster with a number of points rather than by a single representative point. A full implementation and experimental results confirm the reliability of the validity index, and show that its performance compares favourably with that of several alternatives.
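To make the multi-representative idea concrete, the sketch below scores a partition by picking a few spread-out representative points per cluster and comparing the separation between representatives of different clusters with the compactness of members around their own representatives. It is a simplified illustration of the compactness/separation principle, not the actual CDbw formula; the function name, the farthest-point sampling step, and the final ratio are illustrative choices.

```python
import numpy as np

def multi_rep_validity(X, labels, n_reps=4, rng=None):
    """Crude compactness/separation score using several representative
    points per cluster, in the spirit of CDbw. Higher is better.
    A simplified sketch, not the published CDbw definition."""
    rng = np.random.default_rng(rng)
    reps, compact = {}, []
    for c in np.unique(labels):
        P = X[labels == c]
        # farthest-point sampling picks spread-out representatives
        idx = [rng.integers(len(P))]
        while len(idx) < min(n_reps, len(P)):
            d = np.min(np.linalg.norm(P[:, None] - P[idx], axis=2), axis=1)
            idx.append(int(np.argmax(d)))
        reps[c] = P[idx]
        # compactness: mean distance of members to their nearest representative
        compact.append(np.mean(np.min(
            np.linalg.norm(P[:, None] - reps[c], axis=2), axis=1)))
    # separation: minimum distance between representatives of different clusters
    cs = list(reps)
    sep = min(np.min(np.linalg.norm(reps[a][:, None] - reps[b], axis=2))
              for i, a in enumerate(cs) for b in cs[i + 1:])
    return sep / (np.mean(compact) + 1e-12)
```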
The validation of the results obtained by clustering algorithms is a fundamental part of the clustering process. The most widely used approaches for cluster validation are based on internal cluster validity indices. Although many indices have been proposed, there has been no recent extensive comparative study of their performance. In this paper we present the results of an experimental study that compares 30 cluster validity indices in many environments with different characteristics. These results can serve as a guideline for selecting the most suitable index for each application and provide deep insight into the performance differences between the currently available indices.
Organizing data into sensible groupings is one of the most fundamental modes of understanding and learning. As an example, a common scheme of scientific classification puts organisms into a system of ranked taxa: domain, kingdom, phylum, class, and so on. Cluster analysis is the formal study of methods and algorithms for grouping, or clustering, objects according to measured or perceived intrinsic characteristics or similarity. Cluster analysis does not use category labels that tag objects with prior identifiers, i.e., class labels. The absence of category information distinguishes data clustering (unsupervised learning) from classification or discriminant analysis (supervised learning). The aim of clustering is to find structure in data; it is therefore exploratory in nature. Clustering has a long and rich history in a variety of scientific fields. One of the most popular and simple clustering algorithms, K-means, was first published in 1955. Even though K-means was proposed over 50 years ago and thousands of clustering algorithms have been published since then, it is still widely used. This speaks to the difficulty of designing a general-purpose clustering algorithm and to the ill-posed nature of the clustering problem. We provide a brief overview of clustering, summarize well-known clustering methods, discuss the major challenges and key issues in designing clustering algorithms, and point out some of the emerging and useful research directions, including semi-supervised clustering, ensemble clustering, simultaneous feature selection during clustering, and large-scale data clustering.
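For reference, the K-means procedure discussed above is typically implemented as Lloyd's alternating scheme: assign each point to its nearest centroid, then recompute centroids as cluster means. A minimal NumPy sketch (the random initialization and convergence test are simplified choices):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, None] - centers, axis=2), axis=1)
        # recompute centroids; keep the old one if a cluster goes empty
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers
```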
This paper presents a survey of evolutionary algorithms designed for clustering tasks. It tries to reflect the profile of this area by focusing on the subjects that have received the most attention in the literature. In this context, most of the paper is devoted to partitional algorithms that look for hard clusterings of data, though overlapping (i.e., soft and fuzzy) approaches are also covered. The paper is original in two main respects. First, it provides an up-to-date overview that is fully devoted to evolutionary algorithms for clustering, is not limited to any particular kind of evolutionary approach, and comprises advanced topics such as multiobjective and ensemble-based evolutionary clustering. Second, it provides a taxonomy that highlights some very important aspects of evolutionary data clustering, namely: fixed or variable number of clusters; cluster-oriented or non-oriented operators; context-sensitive or context-insensitive operators; guided or unguided operators; binary, integer, or real encodings; and centroid-based, medoid-based, label-based, tree-based, or graph-based representations, among others. A number of references describe applications of evolutionary algorithms for clustering in different domains, such as image processing, computer security, and bioinformatics. The paper ends by addressing some important issues and open questions for future research.
Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters). The clustering problem has been addressed in many contexts and by researchers in many disciplines; this reflects its broad appeal and usefulness as one of the steps in exploratory data analysis. However, clustering is a combinatorially difficult problem, and differences in assumptions and contexts across communities have made the transfer of useful generic concepts and methodologies slow to occur. This paper presents an overview of pattern clustering methods from a statistical pattern recognition perspective, with the goal of providing useful advice and references to fundamental concepts accessible to the broad community of clustering practitioners. We present a taxonomy of clustering techniques and identify cross-cutting themes and recent advances. We also describe some important applications of clustering algorithms, such as image segmentation, object recognition, and information retrieval.
One of the most challenging aspects of clustering is validation, the objective and quantitative assessment of clustering results. A number of relative validity criteria have been proposed for the validation of globular clusters. Not all data, however, are composed of globular clusters. Density-based clustering algorithms seek partitions with high-density areas of points (clusters, not necessarily globular) separated by low-density areas, possibly containing noise objects. In these cases, relative validity indices proposed for globular cluster validation may fail. In this paper we propose a relative validation index for density-based, arbitrarily shaped clusters. The index assesses clustering quality based on the relative density connection between pairs of objects. It is formulated on the basis of a new kernel density function, which is used to compute the density of objects and to evaluate the within- and between-cluster density connectedness of clustering results. Experiments on synthetic and real-world data show the effectiveness of our approach for the evaluation and selection of clustering algorithms and their appropriate parameters.
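A crude sketch of the within-/between-density contrast such an index is built on appears below. The paper's actual index uses a purpose-built kernel density function and pairwise density connectedness, not this plain Gaussian average; the bandwidth h and the final ratio are illustrative assumptions.

```python
import numpy as np

def density_validity(X, labels, h=0.5):
    """Toy relative index contrasting within- and between-cluster
    kernel density. Higher means denser inside clusters than between
    them. A crude sketch of the idea only, not the paper's index."""
    D2 = np.sum((X[:, None] - X[None]) ** 2, axis=2)
    K = np.exp(-D2 / (2 * h ** 2))            # Gaussian kernel matrix
    same = labels[:, None] == labels[None]
    np.fill_diagonal(same, False)
    diag = np.eye(len(X), dtype=bool)
    within = K[same].mean()                   # density among co-members
    between = K[~same & ~diag].mean()         # density across clusters
    return (within - between) / max(within, between)
```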
Determining the number of clusters is an important part of cluster validity that has been widely studied in cluster analysis. Sum-of-squares based indices show promising properties for determining the number of clusters. However, knee-point detection is often required because most indices vary monotonically with an increasing number of clusters; indices with a clear minimum or maximum are therefore preferred. The aim of this paper is to revisit a sum-of-squares based index, the WB-index, whose minimum indicates the determined number of clusters. We shed light on the relation between the WB-index and two popular indices, the Calinski-Harabasz index and the Xu-index. A theoretical comparison shows that the Calinski-Harabasz index is affected by the data size and the level of data overlap. The Xu-index is theoretically close to the WB-index; however, it does not work well when the dimension of the data is greater than two. Here, we conduct a more thorough comparison of 12 internal indices and summarize their experimental performance. Furthermore, we introduce the sum-of-squares based indices into automatic keyword categorization, where the indices are specially defined for determining the number of clusters.
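The two sum-of-squares indices under comparison are straightforward to compute from the within- and between-cluster sums of squares; a sketch using the standard definitions (Calinski-Harabasz is maximized over k, the WB-index is minimized):

```python
import numpy as np

def ssw_ssb(X, labels):
    """Within- and between-cluster sums of squares."""
    g = X.mean(axis=0)
    ssw = ssb = 0.0
    for c in np.unique(labels):
        P = X[labels == c]
        m = P.mean(axis=0)
        ssw += np.sum((P - m) ** 2)
        ssb += len(P) * np.sum((m - g) ** 2)
    return ssw, ssb

def calinski_harabasz(X, labels):
    n, k = len(X), len(np.unique(labels))
    ssw, ssb = ssw_ssb(X, labels)
    return (ssb / (k - 1)) / (ssw / (n - k))   # maximize over k

def wb_index(X, labels):
    k = len(np.unique(labels))
    ssw, ssb = ssw_ssb(X, labels)
    return k * ssw / ssb                        # minimize over k
```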
Data with mixed-type (metric-ordinal-nominal) variables are typical for social stratification, i.e. partitioning a population into social classes. Approaches to cluster such data are compared, namely a latent class mixture model assuming local independence and dissimilarity-based methods such as k-medoids. The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the Bayesian information criterion with dissimilarity-based criteria. The comparison is based on a philosophy of cluster analysis that connects the choice of a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. The application of this philosophy to economic data from the 2007 US Survey of Consumer Finances demonstrates the techniques and decisions required to obtain an interpretable clustering. The clustering is shown to be significantly more structured than a suitable null model. One result is that the data-based strata are not as strongly connected to occupation categories as is often assumed in the literature.
Data analysis plays an indispensable role in understanding various phenomena. Cluster analysis, primitive exploration with little or no prior knowledge, consists of research developed across a wide variety of communities. This diversity, on the one hand, equips us with many tools; on the other hand, the profusion of options causes confusion. We survey clustering algorithms for data sets appearing in statistics, computer science, and machine learning, and illustrate their applications on some benchmark data sets, the traveling salesman problem, and bioinformatics, a new field attracting intensive effort. Several tightly related topics, such as proximity measures and cluster validation, are also discussed.
We present V-measure, an external entropy-based cluster evaluation measure. V-measure provides an elegant solution to many problems that affect previously defined cluster evaluation measures, including 1) dependence on the clustering algorithm or data set, 2) the "problem of matching", where the clustering of only a portion of the data points is evaluated, and 3) accurate evaluation and combination of two desirable aspects of clustering, homogeneity and completeness. We compare V-measure to a number of popular cluster evaluation measures and demonstrate, using simulated clustering results, that it satisfies several desirable properties of clustering solutions. Finally, we use V-measure to evaluate two clustering tasks: document clustering and pitch accent type clustering.
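V-measure can be computed directly from the class-cluster contingency table; a sketch following the standard entropy definitions (scikit-learn's v_measure_score computes the same quantity):

```python
import numpy as np

def v_measure(true, pred, beta=1.0):
    """Homogeneity/completeness via conditional entropies of the
    class-cluster contingency table; V is their harmonic mean."""
    true, pred = np.asarray(true), np.asarray(pred)
    n = len(true)
    cls, clu = np.unique(true), np.unique(pred)
    A = np.array([[np.sum((true == c) & (pred == k)) for k in clu]
                  for c in cls]) / n                            # joint p(c, k)
    pc, pk = A.sum(1), A.sum(0)                                 # marginals
    H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
    Hc_k = -np.sum(A[A > 0] * np.log((A / pk)[A > 0]))          # H(C|K)
    Hk_c = -np.sum(A[A > 0] * np.log((A / pc[:, None])[A > 0])) # H(K|C)
    h = 1 - Hc_k / H(pc) if H(pc) else 1.0                      # homogeneity
    c = 1 - Hk_c / H(pk) if H(pk) else 1.0                      # completeness
    return (1 + beta) * h * c / (beta * h + c) if h + c else 0.0
```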
Clustering validation is a long-standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about clustering performance, and the most suitable measures to use in practice remain unknown. This paper fills this void by presenting an organized study of 16 external validation measures for K-means clustering. Specifically, we first introduce the importance of measure normalization in evaluating clustering performance on data with imbalanced class distributions, and we provide normalization solutions for several measures. In addition, we summarize the major properties of these external measures; these properties can serve as guidance for selecting validation measures in different application scenarios. Finally, we reveal the interrelationships among the measures: by mathematical transformation, we show that some validation measures are equivalent and that some have consistent validation behavior. Most importantly, we provide a guideline for selecting the most suitable validation measures for K-means clustering.
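One small illustration of why normalization against chance matters on imbalanced data: a degenerate one-cluster solution can look good under the raw Rand index but not under its chance-corrected form. The 90/10 split below is an invented example, with the adjusted Rand index standing in for a chance-corrected measure:

```python
import numpy as np
from sklearn.metrics import rand_score, adjusted_rand_score

true = np.array([0] * 90 + [1] * 10)    # imbalanced ground-truth classes
pred = np.zeros(100, dtype=int)         # degenerate one-cluster "solution"

print(rand_score(true, pred))           # ~0.82: deceptively high
print(adjusted_rand_score(true, pred))  # 0.0: corrected for chance
```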
Clustering is a division of data into groups of similar objects. Representing the data by fewer clusters necessarily loses certain fine details but achieves simplification: it models the data by its clusters. Data modeling puts clustering in a historical perspective rooted in mathematics, statistics, and numerical analysis. From a machine learning perspective, clusters correspond to hidden patterns, the search for clusters is unsupervised learning, and the resulting system represents a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, spatial database applications, Web analysis, CRM, marketing, medical diagnostics, computational biology, and many others. Clustering is the subject of active research in several fields, including statistics, pattern recognition, and machine learning. This survey focuses on clustering in data mining. Data mining adds to clustering the complications of very large data sets with very many attributes of different types, which imposes unique computational requirements on clustering algorithms. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems; they are the subject of this survey.
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. In this paper, we propose a framework to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute A_i can be determined by the way in which the values of the other attributes A_j are distributed in the data set objects: if they are similarly distributed across the groups of objects corresponding to the distinct values of A_i, a low distance is obtained. We also propose a solution to the critical issue of choosing the attributes A_j. We validate our approach by embedding our distance learning framework in a hierarchical clustering algorithm, applied to various real-world and synthetic data sets, both low- and high-dimensional. Experimental results show that our method is competitive with the state of the art in categorical data clustering. We also show that our approach is scalable and has a low impact on the overall computational time of a clustering task.
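The key intuition can be sketched as follows: measure the distance between two values of a target attribute as the total variation distance between the conditional distributions they induce on the other attributes, averaged over those attributes. This is a simplified rendering of the idea only; the paper additionally learns which context attributes A_j to use, whereas the sketch takes all of them, and the function and parameter names are hypothetical.

```python
import numpy as np
import pandas as pd

def context_distance(df, target, v1, v2):
    """Sketch of a context-based distance between two values of a
    categorical attribute: average total variation distance between
    the conditional distributions of every other attribute."""
    others = [c for c in df.columns if c != target]
    d = 0.0
    for a in others:
        # conditional distributions P(a | target == v1) and P(a | target == v2)
        p1 = df.loc[df[target] == v1, a].value_counts(normalize=True)
        p2 = df.loc[df[target] == v2, a].value_counts(normalize=True)
        vals = p1.index.union(p2.index)
        d += 0.5 * np.abs(p1.reindex(vals, fill_value=0)
                          - p2.reindex(vals, fill_value=0)).sum()
    return d / len(others)
```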
We investigate how random projection can best be used for clustering high dimensional data. Random projection has been shown to have promising theoretical properties. In practice, however, we find that it results in highly unstable clustering performance. Our solution is to use random projection in a cluster ensemble approach. Empirical results show that the proposed approach achieves better and more robust clustering performance compared to not only single runs of random projection/clustering but also clustering with PCA, a traditional data reduction method for high dimensional data. To gain insights into the performance improvement obtained by our ensemble method, we analyze and identify the influence of the quality and the diversity of the individual clustering solutions on the final ensemble performance.
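A minimal version of such an ensemble scheme might look like the following: cluster several random projections, accumulate a co-association matrix over the runs, and cluster that matrix. This sketch uses K-means as the base clusterer and average-linkage on the co-association distances, which are illustrative substitutions rather than the paper's exact setup.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.cluster import KMeans

def rp_ensemble(X, k, n_runs=30, target_dim=10, seed=0):
    """Cluster several Gaussian random projections, aggregate the runs
    in a co-association matrix, then cluster that matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    co = np.zeros((n, n))
    for _ in range(n_runs):
        R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
        labels = KMeans(n_clusters=k, n_init=5).fit_predict(X @ R)
        co += labels[:, None] == labels[None]    # co-membership counts
    dist = 1 - co / n_runs                       # ensemble dissimilarity
    np.fill_diagonal(dist, 0)
    Z = linkage(squareform(dist, checks=False), method='average')
    return fcluster(Z, t=k, criterion='maxclust')
```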
Hierarchical clustering is a class of algorithms that aims to build a hierarchy of clusters. It has been the dominant approach to constructing embedded classification schemes because it outputs a dendrogram that captures the hierarchical relationships among members at all levels of granularity simultaneously. Greedy in the algorithmic sense, hierarchical clustering partitions the data at every step based solely on a similarity/dissimilarity measure. The clustering result often depends not only on the distribution of the underlying data but also on the choice of dissimilarity measure and clustering algorithm. In this paper, we propose a method to incorporate prior domain knowledge about entity relationships into hierarchical clustering. Specifically, we use a distance function to encode external ontological information. We show that linkage-based algorithms can faithfully recover the encoded structure. Analogously to some regularized machine learning techniques, we add this distance as a penalty term to the original pairwise distances to regulate the final structure of the dendrogram. As a case study, we apply this method to real data to build a customer-behavior-based product taxonomy for an Amazon service, leveraging information from the broader Amazon-wide browse structure. The method is useful when one wants to exploit relational information from external sources, or when the data used to generate the distance matrix is noisy and sparse. Our work falls into the category of semi-supervised or constrained clustering.
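The penalty-term construction is simple to express; a sketch assuming the prior knowledge is already available as a symmetric, zero-diagonal dissimilarity matrix (the names onto_dist and lam and the average-linkage choice are illustrative, not the paper's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist, squareform

def regularized_linkage(X, onto_dist, lam=0.5, method='average'):
    """Add a penalty distance encoding prior (ontology) knowledge to
    the data-driven pairwise distances before linkage clustering.
    `onto_dist` is an n-by-n symmetric, zero-diagonal prior matrix;
    `lam` trades data fit against the prior structure."""
    data_dist = squareform(pdist(X))         # data-driven distances
    combined = data_dist + lam * onto_dist   # penalized distances
    return linkage(squareform(combined, checks=False), method=method)

# usage sketch: Z = regularized_linkage(X, onto, lam=0.3); dendrogram(Z)
```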
Use of traditional k-means-type algorithms is limited to numeric data. This paper presents a clustering algorithm based on the k-means paradigm that works well for data with mixed numeric and categorical features. We propose a new cost function and distance measure based on the co-occurrence of values. The measures also take into account the significance of an attribute towards the clustering process. We present a modified description of the cluster center to overcome the numeric-data-only limitation of the k-means algorithm and to provide a better characterization of clusters. The performance of this algorithm has been studied on real-world data sets. Comparisons with other clustering algorithms illustrate the effectiveness of this approach.
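For contrast with the co-occurrence-based measure proposed here, the baseline k-means extension to mixed data can be sketched as the classic k-prototypes scheme (Huang): squared Euclidean cost on numeric features plus a weighted mismatch count on categorical ones, with means and modes as centers. This is the simpler relative such papers improve on, not this paper's algorithm:

```python
import numpy as np

def k_prototypes(Xnum, Xcat, k, gamma=1.0, n_iter=50, seed=0):
    """k-means-style clustering of mixed data (k-prototypes sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(Xnum), k, replace=False)
    cn, cc = Xnum[idx].copy(), Xcat[idx].copy()
    for _ in range(n_iter):
        # numeric squared distance plus gamma-weighted categorical mismatches
        d = (np.sum((Xnum[:, None] - cn) ** 2, axis=2)
             + gamma * np.sum(Xcat[:, None] != cc, axis=2))
        labels = d.argmin(axis=1)
        for j in range(k):
            m = labels == j
            if m.any():
                cn[j] = Xnum[m].mean(axis=0)          # mean for numeric
                cc[j] = [max(set(col), key=list(col).count)
                         for col in Xcat[m].T]        # mode for categorical
    return labels
```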
Time series clustering has been shown to be effective in providing useful information in various domains, and there is increasing interest in it as part of temporal data mining research. To provide an overview, this paper surveys and summarizes previous work on clustering time series data in various application domains. The basics of time series clustering are presented, including the general-purpose clustering algorithms commonly used in time series clustering studies, the criteria for evaluating the quality of clustering results, and the measures for determining the similarity/dissimilarity between two time series, whether in the form of raw data, extracted features, or model parameters. Past research is organized into three groups depending on whether it works directly with the raw data (in the time or frequency domain), indirectly with features extracted from the raw data, or indirectly with models built from the raw data. The uniqueness and limitations of previous research are discussed and several possible topics for future research are identified. Moreover, the areas to which time series clustering has been applied are summarized, including the sources of the data used. It is hoped that this review will serve as a stepping stone for those interested in advancing this area of research.
Time series clustering is the process of grouping time series according to their similarity or characteristics. Previous approaches usually combine a general-purpose distance measure for time series with a standard clustering method. However, these approaches do not consider the similarity of the different subsequences of each time series, which could be used to compare the time series objects of a data set more effectively. In this paper, we propose a new technique for time series clustering based on two clustering stages. In the first step, a least squares polynomial segmentation procedure is applied to each time series, based on a growing-window technique that returns segments of different lengths. All segments are then projected into the same dimensional space, based on the coefficients of the model that approximates each segment together with a set of statistical features. After the mapping, a first hierarchical clustering stage is applied to all mapped segments, returning groups of segments for each time series. After defining another specific mapping procedure, these clusters are used to represent all time series in the same dimensional space. In the second and final clustering stage, all time series objects are grouped. We use internal clustering quality to automatically tune the main parameter of the algorithm, an error threshold for the segmentation. The results obtained on 84 data sets from the UCR Time Series Classification Archive have been compared with those of two state-of-the-art methods, showing that the performance of the proposed approach is very promising.
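The first-stage segmentation can be sketched as a window that grows until a least squares polynomial no longer fits well, at which point a segment is cut. The polynomial degree, the RMS error rule, and the parameter names below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def growing_window_segments(y, degree=3, max_err=0.05):
    """Grow a window over the series, fit a least squares polynomial,
    and cut a segment when the RMS residual exceeds `max_err`.
    Returns (start, stop, coeffs) triples with `stop` exclusive;
    a trailing stretch too short to fit is dropped in this sketch."""
    segments, start = [], 0
    while start < len(y) - degree:
        end = start + degree + 1                  # minimal fittable window
        coeffs = None
        while end <= len(y):
            t = np.arange(start, end)
            c = np.polyfit(t, y[start:end], degree)
            rms = np.sqrt(np.mean((np.polyval(c, t) - y[start:end]) ** 2))
            if rms > max_err and coeffs is not None:
                break                             # keep last accepted fit
            coeffs, end = c, end + 1
        segments.append((start, end - 1, coeffs))
        start = end - 1                           # next segment starts here
    return segments
```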
A Monte Carlo evaluation of 30 procedures for determining the number of clusters was conducted on artificial data sets which contained either 2, 3, 4, or 5 distinct nonoverlapping clusters. To provide a variety of clustering solutions, the data sets were analyzed by four hierarchical clustering methods. External criterion measures indicated excellent recovery of the true cluster structure by the methods at the correct hierarchy level. Thus, the clustering present in the data was quite strong. The simulation results for the stopping rules revealed a wide range in their ability to determine the correct number of clusters in the data. Several procedures worked fairly well, whereas others performed rather poorly. Thus, the latter group of rules would appear to have little validity, particularly for data sets containing distinct clusters. Applied researchers are urged to select one or more of the better criteria. However, users are cautioned that the performance of some of the criteria may be data dependent.