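We study p-Laplacians and spectral clustering for a recently proposed hypergraph model that incorporates edge-dependent vertex weights (EDVW). These weights can reflect the different importance of vertices within a hyperedge, giving the hypergraph model higher expressivity and flexibility. By constructing submodular EDVW-based splitting functions, we convert hypergraphs with EDVW into submodular hypergraphs, for which the spectral theory is better developed. In this way, existing concepts and theorems proposed in the submodular hypergraph setting, such as p-Laplacians and Cheeger inequalities, can be directly extended to hypergraphs with EDVW. For submodular hypergraphs with EDVW-based splitting functions, we propose an efficient algorithm to compute the eigenvector associated with the second smallest eigenvalue of the 1-Laplacian. We then use this eigenvector to cluster the vertices, achieving higher clustering accuracy than traditional spectral clustering based on the 2-Laplacian. More broadly, the proposed algorithm works for all graph-reducible submodular hypergraphs. Numerical experiments on real-world data demonstrate the effectiveness of combining spectral clustering based on the 1-Laplacian with EDVW.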
translated by 谷歌翻译
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-world datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
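Hypergraphs allow modeling problems with multi-way, high-order relationships. However, the computational cost of most existing hypergraph-based algorithms can depend heavily on the size of the input hypergraph. To address these ever-increasing computational challenges, graph coarsening can be applied to preprocess a given hypergraph by aggressively aggregating its vertices (nodes). However, state-of-the-art hypergraph partitioning (clustering) methods that incorporate heuristic graph coarsening techniques are not optimized for preserving the structural (global) properties of hypergraphs. In this work, we propose an efficient spectral hypergraph coarsening scheme (HyperSF) that preserves the original spectral (structural) properties of hypergraphs. Our approach leverages a recent strongly local max-flow-based clustering algorithm to detect the sets of hypergraph vertices that minimize the ratio cut. To further improve the efficiency of the algorithm, we propose a divide-and-conquer scheme that leverages spectral clustering of the bipartite graphs corresponding to the original hypergraphs. Our experimental results on a variety of hypergraphs extracted from real-world VLSI design benchmarks show that, compared with existing state-of-the-art algorithms, the proposed hypergraph coarsening algorithm can significantly improve both the multi-way conductance of hypergraph clustering and the runtime efficiency.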
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
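As a powerful tool for modeling complex relationships, hypergraphs are gaining popularity in the graph learning community. However, commonly used frameworks in deep hypergraph learning focus on hypergraphs with edge-independent vertex weights (EIVWs), without considering hypergraphs with edge-dependent vertex weights (EDVWs), which have more modeling power. To compensate for this, we present General Hypergraph Spectral Convolution (GHSC), a general learning framework that not only handles both EDVW and EIVW hypergraphs but, more importantly, makes it theoretically possible to use existing powerful graph convolutional neural networks (GCNNs) explicitly, which greatly eases the design of hypergraph neural networks. In this framework, the graph Laplacian of a given undirected GCNN is replaced with a unified hypergraph Laplacian that incorporates vertex weight information from a random walk perspective, by equating our generalized hypergraphs with simple undirected graphs. Extensive experiments from various domains, including social network analysis, visual object classification, and protein learning, demonstrate the state-of-the-art performance of the proposed framework.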
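We introduce a novel harmonic analysis for functions defined on directed graphs, of which the random walk operator is the cornerstone. As a first step, we consider the set of eigenvectors of the random walk operator as a non-orthogonal Fourier-type basis for functions over directed graphs. We find a frequency interpretation by linking the variation of the eigenvectors of the random walk operator, obtained from their Dirichlet energy, to the real part of their associated eigenvalues. From this Fourier basis, we can proceed further and build multi-scale analyses on directed graphs. By extending the framework of Coifman and Maggioni to directed graphs, we propose both a redundant wavelet transform and a decimated wavelet transform. The development of our harmonic analysis on directed graphs thus leads us to consider both semi-supervised learning problems and signal modeling problems on directed graphs, which highlight the efficiency of our framework.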
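Social networks are often modeled using signed graphs, where vertices correspond to users and edges carry a sign that indicates the nature of the interaction between users. The resulting signed graphs typically contain a clear community structure, in the sense that the graph can be partitioned into a small number of polarized communities, each of which defines a sparse cut and cannot be divided into smaller polarized sub-communities. For signed graphs with such a clear community structure, we provide a local clustering oracle that answers membership queries by reading only a small portion of the graph. Formally, when the graph has bounded maximum degree and the number of communities is at most $O(\log n)$, then with $\tilde{O}(\sqrt{n} \operatorname{poly}(1/\varepsilon))$ preprocessing time, our oracle can answer each membership query in $\tilde{O}(\sqrt{n} \operatorname{poly}(1/\varepsilon))$ time, and it correctly classifies a $(1-\varepsilon)$-fraction of the vertices w.r.t. a set of hidden planted ground-truth communities. Our oracle is desirable in applications where clustering information is needed for only a small number of vertices. Previously, such local clustering oracles were known only for unsigned graphs. Our generalization to signed graphs requires a number of new ideas and gives a novel spectral analysis of the behavior of random walks. We evaluate our algorithm for constructing such an oracle and answering membership queries on both synthetic and real-world datasets, validating its performance in practice.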
Since the invention of word2vec [28,29], the skip-gram model has significantly advanced the research of network embedding, such as the recent emergence of the DeepWalk, LINE, PTE, and node2vec approaches. In this work, we show that all of the aforementioned models with negative sampling can be unified into the matrix factorization framework with closed forms. Our analysis and proofs reveal that: (1) DeepWalk [31] empirically produces a low-rank transformation of a network's normalized Laplacian matrix; (2) LINE [37], in theory, is a special case of DeepWalk when the size of vertices' context is set to one; (3) As an extension of LINE, PTE [36] can be viewed as the joint factorization of multiple networks' Laplacians; (4) node2vec [16] is factorizing a matrix related to the stationary distribution and transition probability tensor of a 2nd-order random walk. We further provide the theoretical connections between skip-gram based network embedding algorithms and the theory of graph Laplacian. Finally, we present the NetMF method as well as its approximation algorithm for computing network embedding. Our method offers significant improvements over DeepWalk and LINE for conventional network mining tasks. This work lays the theoretical foundation for skip-gram based network embedding methods, leading to a better understanding of latent network representation learning.
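Markov chains are a class of probabilistic models that have achieved widespread application in the quantitative sciences. This is in part due to their versatility, and it is compounded by the ease with which they can be probed analytically. This tutorial provides an in-depth introduction to Markov chains and explores their connection to graphs and random walks. We use tools from linear algebra and graph theory to describe the transition matrices of different types of Markov chains, with a particular focus on exploring the properties of the eigenvalues and eigenvectors corresponding to these matrices. The results presented are relevant to a number of methods in machine learning and data mining, which we describe at various stages. Rather than being a novel academic study in its own right, this text presents a collection of known results, together with some new concepts. Moreover, the tutorial focuses on offering intuition to readers rather than formal understanding, and it assumes only basic exposure to concepts from linear algebra and probability theory. It is therefore accessible to students and researchers from a wide variety of disciplines.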
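The link between transition matrices, random walks on graphs, and eigenvectors mentioned in this abstract can be illustrated with a small sketch. It is not taken from the tutorial itself: the toy graph and the helper names `random_walk_matrix` and `stationary_distribution` are assumptions made for illustration. For a connected undirected graph, the left eigenvector of the transition matrix for eigenvalue 1 should be proportional to the vertex degrees.

```python
import numpy as np

def random_walk_matrix(adj):
    """Row-stochastic transition matrix P = D^{-1} A of the random walk on a graph."""
    return adj / adj.sum(axis=1, keepdims=True)

def stationary_distribution(P):
    """Left eigenvector of P for the eigenvalue closest to 1, scaled to sum to 1."""
    vals, vecs = np.linalg.eig(P.T)          # left eigenvectors of P
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

# Hypothetical toy graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

P = random_walk_matrix(adj)
pi = stationary_distribution(P)
deg = adj.sum(axis=1)
print(pi)               # stationary distribution of the walk
print(deg / deg.sum())  # degree-proportional distribution, for comparison
```

The dense `np.linalg.eig` call is only reasonable for small examples; a large chain would call for an iterative or sparse eigensolver.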
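We propose a theoretical framework of multi-way similarity for modeling real-valued data as hypergraphs that are clustered via spectral embedding. In graph-cut-based spectral clustering, it is common to model real-valued data as a graph by modeling pairwise similarities with a kernel function, because the kernel function has a theoretical connection to the graph cut. For problems where multi-way similarities are more suitable than pairwise ones, it is natural to model the data as a hypergraph, which is a generalization of a graph. However, although the hypergraph cut is well studied, a hypergraph-cut-based framework for modeling multi-way similarity has not yet been established. In this paper, we formulate multi-way similarities by exploiting the theoretical foundation of the kernel function. We show a theoretical connection between our formulation and the hypergraph cut in two ways, generalizing both weighted kernel $k$-means and the heat kernel, by which we justify our formulation. We also provide a fast algorithm for spectral clustering. Empirically, our algorithm shows better performance than existing graph-based and other heuristic modeling methods.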
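Nonlinear reformulations of the spectral clustering method have recently gained a lot of attention due to their increased numerical benefits and their solid mathematical background. We present a novel direct multiway spectral clustering algorithm in the $p$-norm, for $p \in (1, 2]$. The problem of computing multiple eigenvectors of the graph $p$-Laplacian, a nonlinear generalization of the standard graph Laplacian, is recast as an unconstrained minimization problem on a Grassmann manifold. The value of $p$ is reduced in a pseudocontinuous manner, promoting sparser solution vectors that correspond to optimal graph cuts as $p$ approaches one. Monitoring the monotonic decrease of the balanced graph cuts guarantees that we obtain the best available solution from the $p$-levels considered. We demonstrate the effectiveness and accuracy of our algorithm in various artificial test cases. Our numerical examples and comparative results with various state-of-the-art clustering methods indicate that the proposed method obtains high-quality clusters, both in terms of balanced graph cut metrics and in terms of the accuracy of the label assignment. Furthermore, we conduct studies on the classification of facial images and handwritten characters to demonstrate the applicability to real-world datasets.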
We study the problem of graph clustering under a broad class of objectives in which the quality of a cluster is defined based on the ratio between the number of edges in the cluster, and the total weight of vertices in the cluster. We show that our definition is closely related to popular clustering measures, namely normalized associations, which is a dual of the normalized cut objective, and normalized modularity. We give a linear time constant-approximate algorithm for our objective, which implies the first constant-factor approximation algorithms for normalized modularity and normalized associations.
This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experiment results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over $70\times$ runtime speedups over hMetis and $20\times$ speedups over HyperSF.
The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and provides generally a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences. This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds. The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed.
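The stochastic block model (SBM) is a random graph model in which different groups of vertices connect differently. It is widely employed as a canonical model to study clustering and community detection, and it provides a fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and, more generally, data science. This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational regimes, and for various recovery requirements such as exact, partial and weak recovery. The main results discussed are the phase transition for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal SNR-mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds. The monograph gives a principled derivation of the main algorithms developed in the quest of achieving the limits, in particular via graph-splitting, semi-definite programming, (linearized) belief propagation, classical/nonbacktracking spectral methods and graph powering. Extensions to other block models, such as geometric block models, and a few open problems are also discussed.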
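This work studies the classical spectral clustering algorithm, which embeds the vertices of some graph $G=(V_G, E_G)$ into $\mathbb{R}^k$ using $k$ eigenvectors of some matrix of $G$, and applies $k$-means to partition $V_G$ into $k$ clusters. Our first result is a tighter analysis of the performance of spectral clustering, which explains why it works under conditions much weaker than the ones studied in the literature. For the second result, we show that, by using fewer than $k$ eigenvectors to construct the embedding, spectral clustering is able to produce better output in many practical instances; this result is the first of its kind in spectral clustering. Besides its conceptual and theoretical significance, the practical impact of our work is demonstrated by an empirical analysis on both synthetic and real-world datasets, in which spectral clustering produces comparable or better results with fewer than $k$ eigenvectors.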
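The classical pipeline described in this abstract (embed the vertices with $k$ eigenvectors, then run $k$-means) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the choice of the normalized Laplacian, the deterministic farthest-point initialization, and the helper names `spectral_embedding`, `kmeans` and `two_cliques` are all assumptions.

```python
import numpy as np

def spectral_embedding(adj, k):
    """Embed each vertex using the k bottom eigenvectors of the normalized Laplacian."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)          # assumes no isolated vertices
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(lap)            # eigenvalues come back in ascending order
    return vecs[:, :k]

def kmeans(points, k, iters=50):
    """Plain Lloyd iterations with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        dists = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        sq = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for j in range(k):
            if np.any(labels == j):          # keep the old center if a cluster empties
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def two_cliques(size):
    """Hypothetical toy graph: two cliques joined by a single edge."""
    n = 2 * size
    adj = np.zeros((n, n))
    adj[:size, :size] = 1.0
    adj[size:, size:] = 1.0
    np.fill_diagonal(adj, 0.0)
    adj[size - 1, size] = adj[size, size - 1] = 1.0
    return adj

adj = two_cliques(5)
labels = kmeans(spectral_embedding(adj, k=2), k=2)
print(labels)  # vertices 0-4 share one label, vertices 5-9 the other
```

Passing a smaller `k` to `spectral_embedding` than to `kmeans` gives a simple way to experiment with the fewer-eigenvectors variant the abstract mentions.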
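In this paper we propose a new method to detect clusters in undirected graphs with attributed vertices. The aim is to group vertices that are similar not only in terms of structural connectivity but also in terms of attribute values. We incorporate structural and attribute similarities between the vertices into an augmented graph by creating additional vertices and edges, as proposed in [6, 38]. The augmented graph is then embedded in a Euclidean space associated with its Laplacian, where a modified K-means algorithm is applied to identify clusters. The modified K-means relies on a vector distance measure in which each original vertex is assigned a suitable vector-valued set of coordinates that depends on both structural connectivity and attribute similarities, so that each original graph vertex is regarded as representative of $m+1$ vertices of the augmented graph, where $m$ is the number of vertex attributes. To define the coordinate vectors we employ our recently proposed algorithm based on an adaptive AMG (Algebraic MultiGrid) method, which identifies the coordinate directions in the embedding Euclidean space in terms of vectors that are algebraically smooth with respect to the augmented graph Laplacian, thus extending our previous result for graphs without attributes. We analyze the effectiveness of our proposed clustering method by comparison with some well-known methods whose software implementations are freely available, and with results reported in the literature, on two different types of widely used synthetic graphs and on some real-world attributed graphs.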
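Large graphs commonly appear in social networks, knowledge graphs, recommender systems, the life sciences, and decision-making problems. Summarizing large graphs by their high-level properties is helpful in solving problems in these settings. In spectral clustering, we aim to identify clusters of nodes such that most edges fall within clusters and only a few edges fall between clusters. This task is important for many downstream applications and for exploratory analysis. A core step of spectral clustering is performing an eigendecomposition of the corresponding graph Laplacian matrix (or, equivalently, a singular value decomposition, SVD). The convergence of iterative singular value decomposition approaches depends on the eigengaps of the spectrum of the given matrix, i.e., the differences between consecutive eigenvalues. For a graph Laplacian corresponding to a well-clustered graph, the eigenvalues are non-negative but very small (less than $1$), which slows convergence. This paper introduces a practical approach for dilating the spectrum in order to accelerate SVD solvers and, in turn, spectral clustering. This is accomplished via polynomial approximations to matrix operations that favorably transform the spectrum of a matrix without changing its eigenvectors. Experiments demonstrate that this approach significantly accelerates convergence, and we explain how the transformation can be parallelized and stochastically approximated to scale with the available computation.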
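The property this abstract relies on, that a matrix polynomial changes the eigenvalues but not the eigenvectors, can be checked numerically. The sketch below is an illustration under stated assumptions and not the transformation used in the paper: the polynomial $p(x) = 1 - (1 - x)^3 = 3x - 3x^2 + x^3$ is chosen here only because it is increasing and roughly triples small eigenvalues, and the toy graph and the helper `apply_polynomial` are hypothetical.

```python
import numpy as np

def apply_polynomial(coeffs, mat):
    """Evaluate p(mat) = sum_j coeffs[j] * mat^j with Horner's rule."""
    identity = np.eye(mat.shape[0])
    result = np.zeros_like(mat)
    for c in reversed(coeffs):
        result = result @ mat + c * identity
    return result

# Hypothetical toy graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
adj = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    adj[u, v] = adj[v, u] = 1.0

d_inv_sqrt = 1.0 / np.sqrt(adj.sum(axis=1))
lap = np.eye(6) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]

coeffs = [0.0, 3.0, -3.0, 1.0]               # p(x) = 3x - 3x^2 + x^3 (an assumption)
dilated = apply_polynomial(coeffs, lap)

vals, vecs = np.linalg.eigh(lap)
new_vals = np.linalg.eigvalsh(dilated)

gap_before = vals[1] - vals[0]               # gap between the two smallest eigenvalues
gap_after = new_vals[1] - new_vals[0]
print(gap_before, gap_after)
```

Because this `p` is increasing on the spectrum of the normalized Laplacian, the ordering of the eigenvalues is preserved, so the bottom eigenvectors of `dilated` span the same subspace as those of `lap`.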
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence, in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.