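We propose and study a novel graph clustering method for data with an intrinsic network structure. Similar to spectral clustering, we exploit the intrinsic network structure of the data to construct Euclidean feature vectors. These feature vectors can then be fed into basic clustering methods such as k-means or soft clustering based on a Gaussian mixture model (GMM). What sets our approach apart from spectral clustering is that we do not use the eigenvectors of a graph Laplacian to construct the feature vectors. Instead, we use the solutions of total variation minimization problems to construct feature vectors that reflect the connectivity between data points. Our motivation is that the solutions of total variation minimization are piecewise constant around a given set of seed nodes. These seed nodes can be obtained from domain knowledge or from simple heuristics based on the network structure of the data. Our results indicate that our clustering methods can cope with certain graph structures that are challenging for spectral clustering methods.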
translated by 谷歌翻译
We develop the theory and algorithmic toolbox for networked federated learning in decentralized collections of local datasets with an intrinsic network structure. This network structure arises from domain-specific notions of similarity between local datasets. Different notions of similarity are induced by spatio-temporal proximity, statistical dependencies or functional relations. Our main conceptual contribution is to formulate networked federated learning using a generalized total variation minimization. This formulation unifies and considerably extends existing federated multi-task learning methods. It is highly flexible and can be combined with a broad range of parametric models including Lasso or deep neural networks. Our main algorithmic contribution is a novel networked federated learning algorithm which is well suited for distributed computing environments such as edge computing over wireless networks. This algorithm is robust against inexact computations arising from limited computational resources including processing time or bandwidth. For local models resulting in convex problems, we derive precise conditions on the local models and their network structure such that our algorithm learns nearly optimal local models. Our analysis reveals an interesting interplay between the convex geometry of local models and the (cluster-) geometry of their network structure.
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
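As a rough illustration of the basic idea behind spectral clustering mentioned in this abstract, the sketch below builds the unnormalized graph Laplacian $L = D - A$ of a small graph and splits its nodes by the sign of the eigenvector belonging to the second-smallest eigenvalue (the Fiedler vector). It is a minimal, dependency-free sketch and not the tutorial's own code: the function names, the toy graph, and the use of plain power iteration in place of a library eigensolver are all assumptions made for illustration.

```python
def laplacian(n, edges):
    """Unnormalized graph Laplacian L = D - A as a dense list of lists."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        L[i][j] -= 1.0
        L[j][i] -= 1.0
        L[i][i] += 1.0
        L[j][j] += 1.0
    return L


def fiedler_vector(L, iters=500):
    """Approximate the eigenvector of the second-smallest eigenvalue of L.

    Power iteration on (c*I - L), with the constant vector projected out,
    converges to the Fiedler vector for a connected graph (assuming the
    start vector is not orthogonal to it).
    """
    n = len(L)
    c = 2.0 * max(L[i][i] for i in range(n))  # 2 * max degree bounds the spectrum of L
    x = [float(i) for i in range(n)]          # deterministic start vector (assumption)
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]             # remove the constant component
        y = [c * x[i] - sum(L[i][j] * x[j] for j in range(n)) for i in range(n)]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x


def two_way_spectral_cut(n, edges):
    """Split nodes into two clusters by the sign of the Fiedler vector."""
    f = fiedler_vector(laplacian(n, edges))
    return [0 if v < 0 else 1 for v in f]


# Toy graph (hypothetical): two triangles joined by the single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
labels = two_way_spectral_cut(6, edges)
```

On this toy graph the two triangles end up in different clusters. For more than two clusters, the usual approach is to embed each node using several eigenvectors and run k-means on the embedded points, which a production implementation would do with a sparse eigensolver instead of power iteration.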
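Hypergraphs allow modeling problems with multi-way, high-order relationships. However, the computational cost of most existing hypergraph-based algorithms can depend heavily on the size of the input hypergraph. To address the ever-increasing computational challenges, graph coarsening can be applied to preprocess a given hypergraph by aggressively aggregating its vertices (nodes). However, state-of-the-art hypergraph partitioning (clustering) methods that incorporate heuristic graph coarsening techniques are not optimized for preserving the structural (global) properties of hypergraphs. In this work, we propose an efficient spectral hypergraph coarsening scheme (HyperSF) that preserves the original spectral (structural) properties of hypergraphs. Our approach leverages a recent strongly local max-flow-based clustering algorithm to detect sets of hypergraph vertices that minimize the ratio cut. To further improve the algorithm's efficiency, we propose a divide-and-conquer scheme that leverages spectral clustering of the bipartite graphs corresponding to the original hypergraphs. Our experimental results on a variety of hypergraphs extracted from real-world VLSI design benchmarks show that the proposed hypergraph coarsening algorithm can significantly improve the multi-way conductance of hypergraph clustering as well as runtime efficiency when compared with existing state-of-the-art algorithms.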
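The nonlinear reformulation of the spectral clustering method has recently attracted considerable attention because of its improved numerical benefits and its solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the $p$-norm, for $p \in (1, 2]$. The problem of computing multiple eigenvectors of the graph $p$-Laplacian, a nonlinear generalization of the standard graph Laplacian, is recast as an unconstrained minimization problem on a Grassmann manifold. The value of $p$ is reduced in a pseudocontinuous manner, promoting sparser solution vectors that correspond to optimal graph cuts as $p$ approaches 1. Monitoring the monotonic decrease of the balanced graph cuts guarantees that we obtain the best available solution from the $p$-levels considered. We demonstrate the effectiveness and accuracy of our algorithm on various artificial test cases. Our numerical examples and comparative results against various state-of-the-art clustering methods indicate that the proposed method obtains high-quality clusters both in terms of balanced graph cut metrics and in terms of the accuracy of the label assignment. Furthermore, we conduct studies on the classification of facial images and handwritten characters to demonstrate the applicability to real-world datasets.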
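Large graphs commonly appear in social networks, knowledge graphs, recommender systems, the life sciences, and decision-making problems. Summarizing large graphs by their high-level properties helps in solving problems in these settings. In spectral clustering, we aim to identify clusters of nodes such that most edges fall within clusters and only a few edges fall between clusters. This task is important for many downstream applications and for exploratory analysis. A core step of spectral clustering is performing an eigendecomposition of the corresponding graph Laplacian matrix (or, equivalently, a singular value decomposition, SVD). The convergence of iterative singular value decomposition approaches depends on the eigengaps of the spectrum of the given matrix, i.e., the differences between consecutive eigenvalues. For the graph Laplacian of a well-clustered graph, the eigenvalues are non-negative but very small (less than $1$), which slows convergence. This paper introduces a practical approach to dilating the spectrum in order to accelerate SVD solvers and, in turn, spectral clustering. This is accomplished via polynomial approximations of matrix operations that favorably transform the spectrum of a matrix without changing its eigenvectors. Experiments demonstrate that this approach significantly accelerates convergence, and we explain how the transformation can be parallelized and stochastically approximated to scale with the available computation.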
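The key property this abstract relies on is that a polynomial of a symmetric matrix keeps the eigenvectors and only remaps the eigenvalues: if $Lv = \lambda v$, then $p(L)v = p(\lambda)v$. The sketch below checks this on a 3-node path graph. The abstract does not state which polynomial the paper uses, so $p(t) = 1 - (1 - t/c)^k$, the constants $c$ and $k$, and all function names are assumptions chosen only for illustration.

```python
def matvec(M, x):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


def apply_poly(L, x, c, k):
    """Return p(L) x for p(t) = 1 - (1 - t/c)**k using only matrix-vector products."""
    y = list(x)
    for _ in range(k):
        Ly = matvec(L, y)
        y = [y[i] - Ly[i] / c for i in range(len(y))]  # y <- (I - L/c) y
    return [x[i] - y[i] for i in range(len(x))]


# Laplacian of the path graph 0 - 1 - 2; its eigenvalues are 0, 1 and 3.
L = [[1.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
v = [1.0, 0.0, -1.0]                 # eigenvector of L for eigenvalue 1
result = apply_poly(L, v, c=4.0, k=3)
p_of_lambda = 1.0 - (1.0 - 1.0 / 4.0) ** 3   # p(1) = 1 - 0.75**3 = 0.578125
```

Here `result` equals `p_of_lambda * v`, so `v` is still an eigenvector after the transformation. With this particular polynomial the eigenvalue 1 moves from one third of the way up the original spectrum $[0, 3]$ to more than half of the way up the transformed one, which is the kind of gap widening near zero that can help an iterative solver.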
This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experiment results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over $70\times$ runtime speedups over hMetis and $20\times$ speedups over HyperSF.
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-world datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
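We propose a novel robust decentralized graph clustering algorithm that is equivalent to the popular spectral clustering approach. Our proposed method uses the existing wave equation clustering algorithm, which is based on propagating waves through the graph. However, instead of using a fast Fourier transform (FFT) computation at every node, our proposed approach exploits the Koopman operator framework. Specifically, we show that propagating waves in the graph, followed by a local dynamic mode decomposition (DMD) computation at every node, is capable of retrieving the eigenvalues and the local eigenvector components of the graph Laplacian, thereby providing local cluster assignments for all nodes. We demonstrate that the DMD computation is more robust than the existing FFT-based approach, requires 20 times fewer steps of the wave equation to accurately recover the cluster information, and reduces the relative error by orders of magnitude. We demonstrate the decentralized approach on a range of graph clustering problems.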
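This is a tutorial and survey paper on nonlinear dimensionality reduction and feature extraction methods that are based on the Laplacian of the data graph. We first introduce the adjacency matrix, the definition of the Laplacian matrix, and the interpretation of the Laplacian. Then, we cover graph cuts and spectral clustering, which applies clustering in a subspace of the data. Different optimization variants of the Laplacian eigenmap and its out-of-sample extension are explained. Thereafter, we introduce locality preserving projection and its kernel variant as linear special cases of the Laplacian eigenmap. Versions of graph embedding are then explained, which are generalized versions of the Laplacian eigenmap and locality preserving projection. Finally, the diffusion map is introduced, which is a method based on the Laplacian of the data graph and random walks on the graph.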
Research in Graph Signal Processing (GSP) aims to develop tools for processing data defined on irregular graph domains. In this paper we first provide an overview of core ideas in GSP and their connection to conventional digital signal processing, along with a brief historical perspective to highlight how concepts recently developed in GSP build on top of prior research in other areas. We then summarize recent advances in developing basic GSP tools, including methods for sampling, filtering or graph learning. Next, we review progress in several application areas using GSP, including processing and analysis of sensor network data, biological data, and applications to image processing and machine learning.
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
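This work studies the classical spectral clustering algorithm, which embeds the vertices of some graph $G = (V_G, E_G)$ into $\mathbb{R}^k$ using $k$ eigenvectors of some matrix of $G$, and then partitions $V_G$ into $k$ clusters. Our first result is a tighter analysis of the performance of spectral clustering, which explains why it works under conditions that are much weaker than the ones studied in the literature. For our second result, we show that, by using fewer than $k$ eigenvectors to construct the embedding, spectral clustering is able to produce better output in many practical instances; this result is the first of its kind in spectral clustering. Besides its conceptual and theoretical significance, the practical impact of our work is demonstrated by an empirical analysis on both synthetic and real-world datasets, in which spectral clustering produces comparable or better results with fewer than $k$ eigenvectors.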
There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where in each cluster, each protected group is represented in the same proportion as in the entire dataset. While the FairSC algorithm (Kleindessner et al., 2019) is able to find a fairer clustering, it incurs high costs because its computational kernels explicitly compute nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, only involves sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clusterings. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that its computational costs increase only marginally compared to SC without fairness constraints.
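Hotelling's deflation, which this abstract names as one ingredient, replaces a symmetric matrix $A$ by $A - \lambda v v^\top$ for a known unit eigenvector $v$ with eigenvalue $\lambda$, which moves that eigenvalue to zero and leaves the other eigenpairs unchanged. The sketch below applies the deflated operator using only matrix-vector products, without ever forming the deflated matrix. It illustrates the general technique only: how s-FairSC combines deflation with nullspace projection is not described in the abstract, and the function names and the 2x2 example are assumptions.

```python
import math


def matvec(A, x):
    """Matrix-vector product; a sparse format would be used in practice."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def deflated_matvec(A, lam, v, x):
    """Compute (A - lam * v v^T) x without forming the deflated matrix."""
    Ax = matvec(A, x)
    coeff = lam * sum(vi * xi for vi, xi in zip(v, x))  # lam * (v . x)
    return [Ax[i] - coeff * v[i] for i in range(len(x))]


# Hypothetical example: A has eigenvalue 3 (eigenvector v) and eigenvalue 1 (eigenvector w).
A = [[2.0, 1.0], [1.0, 2.0]]
s = 1.0 / math.sqrt(2.0)
v = [s, s]
w = [s, -s]
deflated_v = deflated_matvec(A, 3.0, v, v)  # approximately the zero vector
deflated_w = deflated_matvec(A, 3.0, v, w)  # approximately 1.0 * w
```

Because the rank-one correction is applied as an inner product followed by a vector update, the cost per product stays proportional to the number of nonzeros in $A$, which is what makes this kind of formulation compatible with sparse iterative eigensolvers.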
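Spectral clustering is popular among practitioners and theoreticians alike. While performance guarantees for spectral clustering are well understood, recent studies have focused on enforcing "fairness" in clusters, requiring them to be "balanced" with respect to a categorical sensitive node attribute (e.g., matching the race distribution in the population). In this paper, we consider a setting in which sensitive attributes manifest indirectly in an auxiliary \textit{representation graph} rather than being directly observed. This graph specifies pairs of nodes that can represent each other with respect to the sensitive attributes and is observed in addition to the usual \textit{similarity graph}. Our goal is to find clusters in the similarity graph while respecting a new individual-level fairness constraint encoded by the representation graph. We develop variants of unnormalized and normalized spectral clustering for this task and analyze their performance under a planted partition model induced by the representation graph. This model uses both the cluster membership of the nodes and the structure of the representation graph to generate random similarity graphs. To the best of our knowledge, these are the first consistency results for constrained spectral clustering under an individual-level fairness constraint. Numerical results corroborate our theoretical findings.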
In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when this one is assumed to be obtained by sampling a distribution on a union of manifolds $\mathcal{M} = \mathcal{M}_1 \cup\dots \cup \mathcal{M}_N$ that may intersect with each other and that may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\mathcal{M}$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.
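Network models provide a powerful and flexible framework for analyzing a wide range of structured data sources. In many situations of interest, however, multiple networks can be constructed to capture different aspects of an underlying phenomenon or to capture changing behavior over time. In such settings, it is often useful to cluster related networks together in order to identify patterns of common structure. In this paper, we propose a convex approach to the task of network clustering. Our approach uses a convex fusion penalty to induce a smoothly varying, tree-like cluster structure, eliminating the need to select the number of clusters in advance. We provide an efficient algorithm for convex network clustering and demonstrate its effectiveness on synthetic examples.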
We review clustering as an analysis tool and the underlying concepts from an introductory perspective. What is clustering and how can clusterings be realised programmatically? How can data be represented and prepared for a clustering task? And how can clustering results be validated? Connectivity-based versus prototype-based approaches are reflected in the context of several popular methods: single-linkage, spectral embedding, k-means, and Gaussian mixtures are discussed as well as the density-based protocols (H)DBSCAN, Jarvis-Patrick, CommonNN, and density-peaks.
Selecting subsets of features that differentiate between two conditions is a key task in a broad range of scientific domains. In many applications, the features of interest form clusters with similar effects on the data at hand. To recover such clusters we develop DiSC, a data-driven approach for detecting groups of features that differentiate between conditions. For each condition, we construct a graph whose nodes correspond to the features and whose weights are functions of the similarity between them for that condition. We then apply a spectral approach to compute subsets of nodes whose connectivity differs significantly between the condition-specific feature graphs. On the theoretical front, we analyze our approach with a toy example based on the stochastic block model. We evaluate DiSC on a variety of datasets, including MNIST, hyperspectral imaging, simulated scRNA-seq and task fMRI, and demonstrate that DiSC uncovers features that better differentiate between conditions compared to competing methods.
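In this paper, we propose a new method to detect clusters in undirected graphs with attributed vertices. The aim is to group vertices that are similar not only in terms of structural connectivity but also in terms of attribute values. We incorporate structural and attribute similarities between the vertices in an augmented graph by creating additional vertices and edges, as proposed in [6, 38]. The augmented graph is then embedded in a Euclidean space associated with its Laplacian, where a modified K-means algorithm is applied to identify clusters. The modified K-means relies on a vector distance measure in which we assign to each original vertex a suitable vector-valued set of coordinates that depends on both structural connectivity and attribute similarities, so that each original graph vertex is regarded as a representative of $m+1$ vertices of the augmented graph, where $m$ is the number of vertex attributes. To define the coordinate vectors, we employ our recently proposed algorithm based on an adaptive AMG (algebraic multigrid) method, which identifies the coordinate directions in the embedding Euclidean space in terms of vectors that are algebraically smooth with respect to the augmented graph Laplacian, thereby extending our previous results for graphs without attributes. We analyze the effectiveness of our proposed clustering method by comparing it with some well-known methods for which software implementations are freely available, and with results reported in the literature, on two different types of widely used synthetic graphs and on some real-world attributed graphs.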