This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This framework improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.
translated by 谷歌翻译
社区检测和正交组同步是科学和工程中各种重要应用的基本问题。在这项工作中,我们考虑了社区检测和正交组同步的联合问题,旨在恢复社区并同时执行同步。为此,我们提出了一种简单的算法,该算法由频谱分解步骤组成,然后是彼此枢转的QR分解(CPQR)。所提出的算法与数据点数线性有效且缩放。我们还利用最近开发的“休闲一淘汰”技术来建立近乎最佳保证,以确切地恢复集群成员资格,并稳定地恢复正交变换。数值实验证明了我们算法的效率和功效,并确认了我们的理论表征。
translated by 谷歌翻译
随机奇异值分解(RSVD)是用于计算大型数据矩阵截断的SVD的一类计算算法。给定A $ n \ times n $对称矩阵$ \ mathbf {m} $,原型RSVD算法输出通过计算$ \ mathbf {m mathbf {m} $的$ k $引导singular vectors的近似m}^{g} \ mathbf {g} $;这里$ g \ geq 1 $是一个整数,$ \ mathbf {g} \ in \ mathbb {r}^{n \ times k} $是一个随机的高斯素描矩阵。在本文中,我们研究了一般的“信号加上噪声”框架下的RSVD的统计特性,即,观察到的矩阵$ \ hat {\ mathbf {m}} $被认为是某种真实但未知的加法扰动信号矩阵$ \ mathbf {m} $。我们首先得出$ \ ell_2 $(频谱规范)和$ \ ell_ {2 \ to \ infty} $(最大行行列$ \ ell_2 $ norm)$ \ hat {\ hat {\ Mathbf {M}} $和信号矩阵$ \ Mathbf {M} $的真实单数向量。这些上限取决于信噪比(SNR)和功率迭代$ g $的数量。观察到一个相变现象,其中较小的SNR需要较大的$ g $值以保证$ \ ell_2 $和$ \ ell_ {2 \ to \ fo \ infty} $ distances的收敛。我们还表明,每当噪声矩阵满足一定的痕量生长条件时,这些相变发生的$ g $的阈值都会很清晰。最后,我们得出了近似奇异向量的行波和近似矩阵的进入波动的正常近似。我们通过将RSVD的几乎最佳性能保证在应用于三个统计推断问题的情况下,即社区检测,矩阵完成和主要的组件分析,并使用缺失的数据来说明我们的理论结果。
translated by 谷歌翻译
我们提出了对学度校正随机块模型(DCSBM)的合适性测试。该测试基于调整后的卡方统计量,用于测量$ n $多项式分布的组之间的平等性,该分布具有$ d_1,\ dots,d_n $观测值。在网络模型的背景下,多项式的数量($ n $)的数量比观测值数量($ d_i $)快得多,与节点$ i $的度相对应,因此设置偏离了经典的渐近学。我们表明,只要$ \ {d_i \} $的谐波平均值生长到无穷大,就可以使统计量在NULL下分配。顺序应用时,该测试也可以用于确定社区数量。该测试在邻接矩阵的压缩版本上进行操作,因此在学位上有条件,因此对大型稀疏网络具有高度可扩展性。我们结合了一个新颖的想法,即在测试$ K $社区时根据$(k+1)$ - 社区分配来压缩行。这种方法在不牺牲计算效率的情况下增加了顺序应用中的力量,我们证明了它在恢复社区数量方面的一致性。由于测试统计量不依赖于特定的替代方案,因此其效用超出了顺序测试,可用于同时测试DCSBM家族以外的各种替代方案。特别是,我们证明该测试与具有社区结构的潜在可变性网络模型的一般家庭一致。
translated by 谷歌翻译
We consider an approach for community detection in time-varying networks. At its core, this approach maintains a small sketch graph to capture the essential community structure found in each snapshot of the full network. We demonstrate how the sketch can be used to explicitly identify six key community events which typically occur during network evolution: growth, shrinkage, merging, splitting, birth and death. Based on these detection techniques, we formulate a community detection algorithm which can process a network concurrently exhibiting all processes. One advantage afforded by the sketch-based algorithm is the efficient handling of large networks. Whereas detecting events in the full graph may be computationally expensive, the small size of the sketch allows changes to be quickly assessed. A second advantage occurs in networks containing clusters of disproportionate size. The sketch is constructed such that there is equal representation of each cluster, thus reducing the possibility that the small clusters are lost in the estimate. We present a new standardized benchmark based on the stochastic block model which models the addition and deletion of nodes, as well as the birth and death of communities. When coupled with existing benchmarks, this new benchmark provides a comprehensive suite of tests encompassing all six community events. We provide analysis and a set of numerical results demonstrating the advantages of our approach both in run time and in the handling of small clusters.
translated by 谷歌翻译
光谱聚类是网络中广泛使用的社区检测方法之一。然而,大型网络为其中的特征值分解带来了计算挑战。在本文中,我们研究了从统计角度使用随机草图算法的光谱聚类,在那里我们通常假设网络数据是从随机块模型生成的,这些模型不一定是完整等级的。为此,我们首先使用最近开发的草图算法来获得两个随机谱聚类算法,即基于随机投影和基于随机采样的光谱聚类。然后,我们在群体邻接矩阵的近似误差,错误分类误差和链路概率矩阵的估计误差方面研究得到的算法的理论界限。事实证明,在温和条件下,随机谱聚类算法导致与原始光谱聚类算法相同的理论界。我们还将结果扩展到校正的程度校正的随机块模型。数值实验支持我们的理论发现并显示随机化方法的效率。一个名为rclusct的新R包是开发的,并提供给公众。
translated by 谷歌翻译
随机块模型(SBM)是一个随机图模型,其连接不同的顶点组不同。它被广泛用作研究聚类和社区检测的规范模型,并提供了肥沃的基础来研究组合统计和更普遍的数据科学中出现的信息理论和计算权衡。该专着调查了最近在SBM中建立社区检测的基本限制的最新发展,无论是在信息理论和计算方案方面,以及各种恢复要求,例如精确,部分和弱恢复。讨论的主要结果是在Chernoff-Hellinger阈值中进行精确恢复的相转换,Kesten-Stigum阈值弱恢复的相变,最佳的SNR - 单位信息折衷的部分恢复以及信息理论和信息理论之间的差距计算阈值。该专着给出了在寻求限制时开发的主要算法的原则推导,特别是通过绘制绘制,半定义编程,(线性化)信念传播,经典/非背带频谱和图形供电。还讨论了其他块模型的扩展,例如几何模型和一些开放问题。
translated by 谷歌翻译
本文研究了一般D-均匀的HyperGraph随机块模型(D-HSBM)中精确恢复的基本限制,其中n个节点被分配到具有相对大小的k差异群落中(p1,...,pk)。具有基数d的节点的每个子集都是独立生成的,作为订单-D超边,其一定概率取决于D节点所属的地面真相群落。目标是根据观察到的超图准确地恢复K隐藏的社区。我们表明存在一个尖锐的阈值,因此可以在阈值之上实现精确的恢复,而不可能在阈值以下(除了将精确指定的小参数制度之外)。该阈值是根据我们称为社区之间普遍的Chernoff-Hellinger分歧的数量来表示的。我们对该通用模型的结果恢复了标准SBM和D-HSBM的先前结果,其中两个对称群落作为特殊情况。在证明我们的可实现结果的途径中,我们开发了一种符合阈值的多项式两阶段算法。第一阶段采用某种超图光谱聚类方法来获得社区的粗略估计,第二阶段通过局部细化步骤单独完善每个节点,以确保精确恢复。
translated by 谷歌翻译
为了捕获许多社区检测问题的固有几何特征,我们建议使用一个新的社区随机图模型,我们称之为\ emph {几何块模型}。几何模型建立在\ emph {随机几何图}(Gilbert,1961)上,这是空间网络的随机图的基本模型之一,就像在ERD \ H上建立的良好的随机块模型一样{o} s-r \'{en} yi随机图。它也是受到社区发现中最新的理论和实际进步启发的随机社区模型的自然扩展。为了分析几何模型,我们首先为\ emph {Random Annulus图}提供新的连接结果,这是随机几何图的概括。自引入以来,已经研究了几何图的连通性特性,并且由于相关的边缘形成而很难分析它们。然后,我们使用随机环形图的连接结果来提供必要的条件,以有效地为几何块模型恢复社区。我们表明,一种简单的三角计数算法来检测几何模型中的社区几乎是最佳的。为此,我们考虑了两个图密度方案。在图表的平均程度随着顶点的对数增长的状态中,我们表明我们的算法在理论上和实际上都表现出色。相比之下,三角计数算法对于对数学度方案中随机块模型远非最佳。我们还查看了图表的平均度与顶点$ n $的数量线性增长的状态,因此要存储一个需要$ \ theta(n^2)$内存的图表。我们表明,我们的算法需要在此制度中仅存储$ o(n \ log n)$边缘以恢复潜在社区。
translated by 谷歌翻译
The stochastic block model (SBM) is a fundamental model for studying graph clustering or community detection in networks. It has received great attention in the last decade and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of SBM with unbalanced communities (arguably, more relevant in practice) is still very limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013] by removing the assumption that there is a large interval such that the sizes of clusters do not fall in. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to polylogarithmic factors) when the probability parameters are constant. As a byproduct, we obtain a polynomial-time algorithm with sublinear query complexity for a clustering problem with a faulty oracle, which finds all clusters of size larger than $\tilde{\Omega}({\sqrt{n}})$ even if $\Omega(n)$ small clusters co-exist in the graph. In contrast, all the previous efficient algorithms that makes sublinear number of queries cannot recover any large cluster, if there are more than $\tilde{\Omega}(n^{2/5})$ small clusters.
translated by 谷歌翻译
Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency -- given $n$ input points, most kernel-based algorithms need to materialize the full $n \times n$ kernel matrix before performing any subsequent computation, thus incurring $\Omega(n^2)$ runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain $\textit{subquadratic}$ time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving linear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recent Kernel Density Estimation framework, which (after preprocessing in time subquadratic in $n$) can return estimates of row/column sums of the kernel matrix. In particular, we develop efficient reductions from $\textit{weighted vertex}$ and $\textit{weighted edge sampling}$ on kernel graphs, $\textit{simulating random walks}$ on kernel graphs, and $\textit{importance sampling}$ on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in $\textit{sublinear}$ (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsification, where we observe a $\textbf{9x}$ decrease in the number of kernel evaluations over baselines for LRA and a $\textbf{41x}$ reduction in the graph size for spectral sparsification.
translated by 谷歌翻译
我们考虑一个矩阵完成问题,用于将社交或项目相似性图形作为侧面信息。我们开发了一种普遍的,无参数和计算的有效算法,该算法以分层图形聚类开始,然后迭代地改进图形聚类和矩阵额定值。在一个层次的随机块模型,尊重实际相关的社交图和低秩评级矩阵模型(要详细),我们证明了我们的算法实现了观察到的矩阵条目数量的信息 - 理论限制(即,最佳通过与较低的不可能结果一起导出的样本复杂性)通过最大似然估计。该结果的一个结果是利用社交图的层次结构,相对于简单地识别不同组的情况,在不诉诸于它们的情况下,可以产生相对于不同组的样本复杂性的大量增益。我们对合成和现实世界数据集进行了广泛的实验,以证实我们的理论结果,并展示了利用图形侧信息的其他矩阵完成算法的显着性能改进。
translated by 谷歌翻译
The stochastic block model (SBM) is a random graph model with planted clusters. It is widely employed as a canonical model to study clustering and community detection, and provides generally a fertile ground to study the statistical and computational tradeoffs that arise in network and data sciences.This note surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational thresholds, and for various recovery requirements such as exact, partial and weak recovery (a.k.a., detection). The main results discussed are the phase transitions for exact recovery at the Chernoff-Hellinger threshold, the phase transition for weak recovery at the Kesten-Stigum threshold, the optimal distortion-SNR tradeoff for partial recovery, the learning of the SBM parameters and the gap between information-theoretic and computational thresholds.The note also covers some of the algorithms developed in the quest of achieving the limits, in particular two-round algorithms via graph-splitting, semi-definite programming, linearized belief propagation, classical and nonbacktracking spectral methods. A few open problems are also discussed.
translated by 谷歌翻译
This paper is about a curious phenomenon. Suppose we have a data matrix, which is the superposition of a low-rank component and a sparse component. Can we recover each component individually? We prove that under some suitable assumptions, it is possible to recover both the low-rank and the sparse components exactly by solving a very convenient convex program called Principal Component Pursuit; among all feasible decompositions, simply minimize a weighted combination of the nuclear norm and of the 1 norm. This suggests the possibility of a principled approach to robust principal component analysis since our methodology and results assert that one can recover the principal components of a data matrix even though a positive fraction of its entries are arbitrarily corrupted. This extends to the situation where a fraction of the entries are missing as well. We discuss an algorithm for solving this optimization problem, and present applications in the area of video surveillance, where our methodology allows for the detection of objects in a cluttered background, and in the area of face recognition, where it offers a principled way of removing shadows and specularities in images of faces.
translated by 谷歌翻译
We consider the nonlinear inverse problem of learning a transition operator $\mathbf{A}$ from partial observations at different times, in particular from sparse observations of entries of its powers $\mathbf{A},\mathbf{A}^2,\cdots,\mathbf{A}^{T}$. This Spatio-Temporal Transition Operator Recovery problem is motivated by the recent interest in learning time-varying graph signals that are driven by graph operators depending on the underlying graph topology. We address the nonlinearity of the problem by embedding it into a higher-dimensional space of suitable block-Hankel matrices, where it becomes a low-rank matrix completion problem, even if $\mathbf{A}$ is of full rank. For both a uniform and an adaptive random space-time sampling model, we quantify the recoverability of the transition operator via suitable measures of incoherence of these block-Hankel embedding matrices. For graph transition operators these measures of incoherence depend on the interplay between the dynamics and the graph topology. We develop a suitable non-convex iterative reweighted least squares (IRLS) algorithm, establish its quadratic local convergence, and show that, in optimal scenarios, no more than $\mathcal{O}(rn \log(nT))$ space-time samples are sufficient to ensure accurate recovery of a rank-$r$ operator $\mathbf{A}$ of size $n \times n$. This establishes that spatial samples can be substituted by a comparable number of space-time samples. We provide an efficient implementation of the proposed IRLS algorithm with space complexity of order $O(r n T)$ and per-iteration time complexity linear in $n$. Numerical experiments for transition operators based on several graph models confirm that the theoretical findings accurately track empirical phase transitions, and illustrate the applicability and scalability of the proposed algorithm.
translated by 谷歌翻译
Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for performing low-rank matrix approximation. These techniques exploit modern computational architectures more fully than classical methods and open the possibility of dealing with truly massive data sets.This paper presents a modular framework for constructing randomized algorithms that compute partial matrix decompositions. These methods use random sampling to identify a subspace that captures most of the action of a matrix. The input matrix is then compressed-either explicitly or implicitly-to this subspace, and the reduced matrix is manipulated deterministically to obtain the desired low-rank factorization. In many cases, this approach beats its classical competitors in terms of accuracy, speed, and robustness. These claims are supported by extensive numerical experiments and a detailed error analysis.The specific benefits of randomized techniques depend on the computational environment. Consider the model problem of finding the k dominant components of the singular value decomposition of an m × n matrix. (i) For a dense input matrix, randomized algorithms require O(mn log(k)) floating-point operations (flops) in contrast with O(mnk) for classical algorithms. (ii) For a sparse input matrix, the flop count matches classical Krylov subspace methods, but the randomized approach is more robust and can easily be reorganized to exploit multi-processor architectures. (iii) For a matrix that is too large to fit in fast memory, the randomized techniques require only a constant number of passes over the data, as opposed to O(k) passes for classical algorithms. In fact, it is sometimes possible to perform matrix approximation with a single pass over the data.
translated by 谷歌翻译
本文研究了聚类基质值观测值的计算和统计限制。我们提出了一个低级别的混合模型(LRMM),该模型适用于经典的高斯混合模型(GMM)来处理基质值观测值,该观测值假设人口中心矩阵的低级别。通过集成Lloyd算法和低级近似值设计了一种计算有效的聚类方法。一旦定位良好,该算法将快速收敛并达到最小值最佳的指数型聚类错误率。同时,我们表明一种基于张量的光谱方法可提供良好的初始聚类。与GMM相当,最小值最佳聚类错误率是由分离强度(即种群中心矩阵之间的最小距离)决定的。通过利用低级度,提出的算法对分离强度的要求较弱。但是,与GMM不同,LRMM的统计难度和计算难度的特征是信号强度,即最小的人口中心矩阵的非零奇异值。提供了证据表明,即使信号强度不够强,即使分离强度很强,也没有多项式时间算法是一致的。在高斯以下噪声下进一步证明了我们低级劳埃德算法的性能。讨论了LRMM下估计和聚类之间的有趣差异。通过全面的仿真实验证实了低级劳埃德算法的优点。最后,我们的方法在现实世界数据集的文献中优于其他方法。
translated by 谷歌翻译
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
translated by 谷歌翻译
现实世界网络经常具有侧面信息,可以帮助提高网络分析任务等群集的性能。尽管在过去十年中对网络聚类方法进行了大量的实证和理论研究,但侧面信息的附加值和用于在聚类算法中最佳地结合的方法的附加值相对较少理解。我们向群集网络提出了一种新的迭代算法,其中包含节点的侧面信息(以协调因子的形式)提出并表明我们的算法在上下文对称随机块模型下是最佳的。我们的算法可以应用于一般上下文随机块模型,并避免与先前提出的方法相比,避免了HyperParameter调整。我们在综合数据实验中确认我们的理论结果,其中我们的算法显着优于其他方法,并表明它也可以应用于签名的图表。最后,我们展示了我们对实际数据方法的实际兴趣。
translated by 谷歌翻译
我们考虑了从节点观测值估算多个网络拓扑的问题,其中假定这些网络是从相同(未知)随机图模型中绘制的。我们采用图形作为我们的随机图模型,这是一个非参数模型,可以从中绘制出潜在不同大小的图形。图形子的多功能性使我们能够解决关节推理问题,即使对于要恢复的图形包含不同数量的节点并且缺乏整个图形的精确比对的情况。我们的解决方案是基于将最大似然惩罚与Graphon估计方案结合在一起,可用于增强现有网络推理方法。通过引入嘈杂图抽样信息的强大方法,进一步增强了所提出的联合网络和图形估计。我们通过将其性能与合成和实际数据集中的竞争方法进行比较来验证我们提出的方法。
translated by 谷歌翻译