In this work we study statistical properties of graph-based algorithms for multi-manifold clustering (MMC). In MMC the goal is to retrieve the multi-manifold structure underlying a given Euclidean data set when the data are assumed to be sampled from a distribution supported on a union of manifolds $\mathcal{M} = \mathcal{M}_1 \cup\dots \cup \mathcal{M}_N$ that may intersect and may have different dimensions. We investigate sufficient conditions that similarity graphs on data sets must satisfy in order for their corresponding graph Laplacians to capture the right geometric information to solve the MMC problem. Precisely, we provide high probability error bounds for the spectral approximation of a tensorized Laplacian on $\mathcal{M}$ with a suitable graph Laplacian built from the observations; the recovered tensorized Laplacian contains all the geometric information of all the individual underlying manifolds. We provide an example of a family of similarity graphs, which we call annular proximity graphs with angle constraints, satisfying these sufficient conditions. We contrast our family of graphs with other constructions in the literature based on the alignment of tangent planes. Extensive numerical experiments expand the insights that our theory provides on the MMC problem.
In this paper we study the statistical properties of principal components regression with Laplacian eigenmaps (PCR-LE), a method for nonparametric regression based on Laplacian eigenmaps (LE). PCR-LE works by projecting a vector of observed responses ${\bf y} = (y_1,\ldots,y_n)$ onto a subspace spanned by certain eigenvectors of a neighborhood graph Laplacian. We show that PCR-LE achieves minimax rates of convergence for random design regression over Sobolev spaces. Under sufficient smoothness conditions on the design density $p$, PCR-LE achieves the optimal rates for both estimation (where the optimal rate in squared $L^2$ norm is known to be $n^{-2s/(2s+d)}$) and goodness-of-fit testing ($n^{-4s/(4s+d)}$). We also show that PCR-LE is \emph{manifold adaptive}: that is, we consider the situation where the design is supported on a manifold of small intrinsic dimension $m$, and give upper bounds establishing that PCR-LE achieves the faster minimax estimation ($n^{-2s/(2s+m)}$) and testing ($n^{-4s/(4s+m)}$) rates of convergence. Interestingly, these rates are almost always faster than the known rates of convergence of graph Laplacian eigenvectors; in other words, for this problem regression with estimated features appears to be easier, statistically speaking, than estimating the features themselves. We support these theoretical results with empirical evidence.
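A minimal sketch of the projection step described above: build a neighborhood graph Laplacian on the design points and project the responses onto its leading eigenvectors. The $\epsilon$-graph construction, the bandwidth, and the number of eigenvectors used here are illustrative choices, not the paper's tuned settings.

```python
import numpy as np

def pcr_le(X, y, n_eigs, eps):
    """PCR with Laplacian eigenmaps (sketch): project responses y onto the
    subspace spanned by the first n_eigs eigenvectors of an epsilon-graph
    Laplacian built on the design points X."""
    # Pairwise squared distances and epsilon-neighborhood adjacency.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = (d2 <= eps ** 2).astype(float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W            # unnormalized graph Laplacian
    # Eigenvectors of the smallest eigenvalues span the "smooth" subspace.
    _, V = np.linalg.eigh(L)
    Vk = V[:, :n_eigs]
    return Vk @ (Vk.T @ y)               # orthogonal projection of y

# Hypothetical usage: smooth signal on [0, 1] plus noise.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 200))[:, None]
g = np.sin(2 * np.pi * x[:, 0])
y = g + 0.5 * rng.normal(size=200)
u = pcr_le(x, y, n_eigs=10, eps=0.05)
```

Because the projection keeps only a few smooth eigenvectors, most of the noise is filtered out while a low-frequency signal survives.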
We investigate the problem of detecting the boundary of a domain from sample points in the domain. We introduce new estimators for the normal vector to the boundary, the distance of a point to the boundary, and a test for whether a point lies within a boundary strip. The estimators can be computed efficiently and are more accurate than those present in the literature. We provide rigorous error estimates for the estimators. Furthermore, we use the detected boundary points to solve boundary-value problems for PDEs on point clouds. We prove error estimates for the Laplace and eikonal equations on point clouds. Finally, we provide a range of numerical experiments illustrating the performance of our boundary estimators, applications to PDEs on point clouds, and tests on image data sets.
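One simple estimator in this spirit (a first-order construction for illustration, not the paper's exact estimator): at an interior point the neighbors are balanced, so the mean offset to neighbors is near zero; near the boundary the neighborhood is one-sided, the mean offset points inward, and its negation estimates the outward normal while its magnitude flags boundary points.

```python
import numpy as np

def boundary_statistic(X, r):
    """For each point, return (|mean offset to r-neighbors|, unit estimate of
    the outward normal).  Large statistic values indicate boundary points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    stats, normals = [], []
    for i in range(len(X)):
        nbrs = X[(d2[i] <= r ** 2) & (d2[i] > 0)]
        m = nbrs.mean(0) - X[i]              # points inward near the boundary
        stats.append(np.linalg.norm(m))
        normals.append(-m / (np.linalg.norm(m) + 1e-12))
    return np.array(stats), np.array(normals)

# Hypothetical usage: a uniform grid on the unit square.
g = np.linspace(0, 1, 20)
X = np.array([[a, b] for a in g for b in g])
stat, nrm = boundary_statistic(X, r=0.12)
```

On the grid, points on the left edge get a large statistic and an estimated normal close to $(-1, 0)$, while interior points get a statistic near zero.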
Graph clustering is a fundamental problem in unsupervised learning, with numerous applications in computer science and in analysing real-world data. In many real-world applications, we find that the clusters have a significant high-level structure. This is often overlooked in the design and analysis of graph clustering algorithms, which make strong simplifying assumptions about the structure of the graph. This thesis addresses the natural question of whether the structure of clusters can be learned efficiently and describes four new algorithmic results for learning such structure in graphs and hypergraphs. All of the presented theoretical results are extensively evaluated on both synthetic and real-world datasets of different domains, including image classification and segmentation, migration networks, co-authorship networks, and natural language processing. These experimental results demonstrate that the newly developed algorithms are practical, effective, and immediately applicable for learning the structure of clusters in real-world data.
We consider the problem of estimating a multivariate function $f_0$ of bounded variation (BV), from noisy observations $y_i = f_0(x_i) + z_i$ made at random design points $x_i \in \mathbb{R}^d$, $i=1,\ldots,n$. We study an estimator that forms the Voronoi diagram of the design points, and then solves an optimization problem that regularizes according to a certain discrete notion of total variation (TV): the sum of weighted absolute differences of parameters $\theta_i,\theta_j$ (which estimate the function values $f_0(x_i),f_0(x_j)$) at all neighboring cells $i,j$ in the Voronoi diagram. This is seen to be equivalent to a variational optimization problem that regularizes according to the usual continuum (measure-theoretic) notion of TV, once we restrict the domain to functions that are piecewise constant over the Voronoi diagram. The regression estimator under consideration hence performs (shrunken) local averaging over adaptively formed unions of Voronoi cells, and we refer to it as the Voronoigram, following the ideas in Koenker (2005), and drawing inspiration from Tukey's regressogram (Tukey, 1961). Our contributions in this paper span both the conceptual and theoretical frontiers: we discuss some of the unique properties of the Voronoigram in comparison to TV-regularized estimators that use other graph-based discretizations; we derive the asymptotic limit of the Voronoi TV functional; and we prove that the Voronoigram is minimax rate optimal (up to log factors) for estimating BV functions that are essentially bounded.
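The discrete TV penalty at the heart of the Voronoigram can be sketched directly with `scipy.spatial.Voronoi`: sum the weighted absolute differences of parameters over neighboring Voronoi cells, with weights given by the length of the shared (finite) Voronoi edge in 2D. This is only the penalty evaluation, not the full regularized estimator, and unbounded ridges are skipped for brevity.

```python
import numpy as np
from scipy.spatial import Voronoi

def voronoi_tv(points, theta):
    """Discrete TV penalty over a 2D Voronoi diagram: the sum of
    |theta_i - theta_j| over neighboring cells i, j, weighted by the
    length of their shared (finite) Voronoi edge."""
    vor = Voronoi(points)
    tv = 0.0
    for (i, j), ridge in zip(vor.ridge_points, vor.ridge_vertices):
        if -1 in ridge:                      # unbounded ridge: skip
            continue
        v0, v1 = vor.vertices[ridge[0]], vor.vertices[ridge[1]]
        tv += np.linalg.norm(v1 - v0) * abs(theta[i] - theta[j])
    return tv

# Hypothetical usage: the penalty vanishes on constant parameter vectors.
rng = np.random.default_rng(1)
pts = rng.uniform(size=(50, 2))
theta_const = np.ones(50)
theta_rand = rng.normal(size=50)
```

As expected for a total-variation functional, constants incur zero penalty while rough parameter vectors are penalized.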
In recent years, spectral clustering has become one of the most popular modern clustering algorithms. It is simple to implement, can be solved efficiently by standard linear algebra software, and very often outperforms traditional clustering algorithms such as the k-means algorithm. At first glance spectral clustering appears slightly mysterious, and it is not obvious why it works at all and what it really does. The goal of this tutorial is to give some intuition on those questions. We describe different graph Laplacians and their basic properties, present the most common spectral clustering algorithms, and derive those algorithms from scratch by several different approaches. Advantages and disadvantages of the different spectral clustering algorithms are discussed.
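A minimal version of unnormalized spectral clustering as presented in tutorials of this kind: embed the vertices by the first $k$ eigenvectors of $L = D - W$ and run k-means on the rows. The affinity matrix below is a synthetic two-block example; the seed and `kmeans2` settings are illustrative choices.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, k, seed=0):
    """Unnormalized spectral clustering: eigen-embed with the first k
    eigenvectors of L = D - W, then cluster the rows with k-means."""
    L = np.diag(W.sum(1)) - W
    _, V = np.linalg.eigh(L)                 # eigenvalues in ascending order
    _, labels = kmeans2(V[:, :k], k, seed=seed, minit='++')
    return labels

# Hypothetical usage: a noisy two-block affinity matrix.
rng = np.random.default_rng(0)
n = 40
W = rng.uniform(0, 0.05, size=(n, n))
W[:20, :20] += 1.0                           # strong within-cluster affinity
W[20:, 20:] += 1.0
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
labels = spectral_clustering(W, 2)
```

The second eigenvector of $L$ (the Fiedler vector) carries the split, which is why k-means on the embedding recovers the two blocks.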
In this memoir, we develop a general framework that allows for the simultaneous study of labeled and unlabeled near-alignment data problems in $\mathbb{R}^D$ and the Whitney near-isometry extension problem for discrete and non-discrete subsets of $\mathbb{R}^D$ with certain geometries. In addition, we survey related work on clustering, dimension reduction, manifold learning, and vision, as well as minimal energy partitions, discrepancy, and min-max optimization. Numerous open problems in harmonic analysis, computer vision, manifold learning, and signal processing connected to our work are given. Part of the work in this memoir is based on joint research with Charles Fefferman in the papers [48], [49], [50], [51].
This work studies the spectral convergence of graph Laplacians to the Laplace–Beltrami operator when the graph affinity matrix is constructed from $n$ random samples on a $d$-dimensional manifold. By analyzing Dirichlet form convergence and constructing candidate approximate eigenfunctions via convolution with the manifold heat kernel, we prove that, with Gaussian kernels, one can set the kernel bandwidth parameter $\epsilon \sim (\log n / n)^{1/(d/2+2)}$ such that the eigenvalue convergence rate is $n^{-1/(d/2+2)}$ and the eigenvector convergence rate in 2-norm is $n^{-1/(d+4)}$; when $\epsilon \sim (\log n/n)^{1/(d/2+3)}$, both the eigenvalue and eigenvector rates are $n^{-1/(d/2+3)}$. These rates hold up to $\log n$ factors and are proved for finitely many low-lying eigenvalues. The results apply to the unnormalized and random-walk graph Laplacians when data are uniformly sampled on the manifold, as well as to the density-corrected graph Laplacian (where the degree matrix normalizes on both sides) with non-uniformly sampled data. As an intermediate result, we prove new pointwise and Dirichlet form convergence rates for the density-corrected graph Laplacian. Numerical results are provided to verify the theory.
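The bandwidth scaling above can be illustrated numerically on the unit circle ($d = 1$), where the first nontrivial Laplace–Beltrami eigenspace is $\mathrm{span}\{\cos t, \sin t\}$: build a Gaussian-kernel graph Laplacian with $\epsilon \sim (\log n/n)^{1/(d/2+2)}$ and check that $\cos t$ is nearly captured by the first nontrivial eigenvectors. The kernel normalization and the absence of constants in the bandwidth are arbitrary illustration choices, not the paper's.

```python
import numpy as np

# Sketch: Gaussian-kernel graph Laplacian on n random samples of the unit
# circle, with bandwidth eps ~ (log n / n)^{1/(d/2+2)} as in the text
# (constants chosen arbitrarily for illustration).
rng = np.random.default_rng(0)
n, d = 400, 1
t = rng.uniform(0, 2 * np.pi, n)
X = np.c_[np.cos(t), np.sin(t)]
eps = (np.log(n) / n) ** (1 / (d / 2 + 2))
W = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / (4 * eps))
L = np.diag(W.sum(1)) - W
_, V = np.linalg.eigh(L)                  # ascending eigenvalues
# Fraction of cos(t) captured by the first nontrivial eigenpair (indices 1, 2).
f = np.cos(t) / np.linalg.norm(np.cos(t))
captured = np.linalg.norm(V[:, 1:3].T @ f)
```

The eigenvalue-0 eigenvector is constant, so the near-degenerate pair at indices 1 and 2 should approximately span the first harmonics.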
Lipschitz learning is a graph-based semi-supervised learning method in which one extends labels from a labeled to an unlabeled data set by solving the infinity Laplace equation on a weighted graph. In this work we prove uniform convergence rates for solutions of the graph infinity Laplace equation as the number of vertices grows to infinity. Their continuum limits are absolutely minimizing Lipschitz extensions with respect to the geodesic metric of the domain from which the graph vertices are sampled. We work under very general assumptions on the graph weights, the set of labeled vertices, and the continuum domain. Our main contribution is that we obtain quantitative convergence rates even for very sparsely connected graphs, as they typically appear in applications like semi-supervised learning. In particular, our framework allows for graph bandwidths down to the connectivity radius. For the proof, we first show a quantitative convergence statement for graph distance functions to geodesic distance functions in the continuum. Using a "comparison with distance functions" principle, we can pass these convergence statements to infinity harmonic functions and absolutely minimizing Lipschitz extensions.
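For an unweighted graph, the infinity Laplace equation reduces to the fixed-point condition $u(x) = \tfrac12(\max_{y\sim x} u(y) + \min_{y\sim x} u(y))$ at unlabeled vertices, which can be solved by simple value iteration. The sketch below (a basic iterative scheme, not the paper's analysis machinery) recovers the expected linear interpolation on a path graph.

```python
import numpy as np

def lipschitz_learning(adj, labels, n_iter=2000):
    """Solve the unweighted graph infinity Laplace equation by iterating
    u(x) <- (max_nbr u + min_nbr u) / 2 at unlabeled vertices.

    adj: list of neighbor index arrays; labels: dict {vertex: value}.
    """
    n = len(adj)
    u = np.zeros(n)
    for v, val in labels.items():
        u[v] = val
    for _ in range(n_iter):
        for v in range(n):
            if v not in labels:
                nbr_vals = u[adj[v]]
                u[v] = 0.5 * (nbr_vals.max() + nbr_vals.min())
    return u

# Hypothetical usage: path graph with endpoints labeled 0 and 1; the
# absolutely minimizing Lipschitz extension is linear interpolation.
n = 11
adj = [np.array([j for j in (i - 1, i + 1) if 0 <= j < n]) for i in range(n)]
u = lipschitz_learning(adj, {0: 0.0, n - 1: 1.0})
```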
Invoking the manifold assumption in machine learning requires knowledge of the manifold's geometry and dimension, and theory dictates how many samples are required. However, in application data, the sampling may be non-uniform, the manifold properties are unknown, and (possibly) non-pure; this implies that neighborhoods must adapt to the local structure. We introduce an algorithm for inferring adaptive neighborhoods for data given by a similarity kernel. Starting from a locally conservative neighborhood (Gabriel) graph, we sparsify it iteratively according to a weighted counterpart. At each step, a linear program yields minimal neighborhoods globally, and a volumetric statistic reveals neighbor outliers likely to violate the manifold geometry. We apply our adaptive neighborhoods to non-linear dimensionality reduction, geodesic computation, and dimension estimation. A comparison against standard algorithms, for example those using k-nearest neighbors, demonstrates their usefulness.
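The locally conservative starting point mentioned above, the Gabriel graph, has a simple definition: connect $x_i$ and $x_j$ iff the open ball whose diameter is the segment $[x_i, x_j]$ contains no other sample point. A brute-force $O(n^3)$ sketch (the paper's subsequent sparsification steps are not reproduced here):

```python
import numpy as np

def gabriel_graph(X):
    """Gabriel graph: edge (i, j) iff no third point lies strictly inside
    the ball whose diameter is the segment [x_i, x_j]."""
    n = len(X)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            mid = (X[i] + X[j]) / 2
            r2 = ((X[i] - X[j]) ** 2).sum() / 4
            ok = True
            for k in range(n):
                if k not in (i, j) and ((X[k] - mid) ** 2).sum() < r2 - 1e-12:
                    ok = False
                    break
            if ok:
                edges.add((i, j))
    return edges

# Hypothetical usage: four collinear, equally spaced points yield a path.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
E = gabriel_graph(X)
```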
We extend the methods of classical scaling and ISOMAP, originally developed in the areas of multidimensional scaling and dimensionality reduction of multivariate data, to a functional setting. We focus on classical scaling and ISOMAP, prototypical methods that have played important roles in these areas, and showcase their use in the context of functional data analysis. In the process, we highlight the critical role played by the ambient metric.
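The finite-dimensional core of classical scaling, which the functional extension builds on, is double-centering the squared distance matrix and eigendecomposing; in the functional setting the distances would instead be computed under a chosen ambient metric between curves. A minimal sketch:

```python
import numpy as np

def classical_scaling(D, k):
    """Classical (Torgerson) scaling: embed points in R^k from a matrix D of
    pairwise distances via double centering and eigendecomposition."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J          # Gram matrix of the centered points
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]        # top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))

# Hypothetical usage: exact Euclidean distances from a planar configuration
# are reproduced exactly (up to rigid motion) by a 2D embedding.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_scaling(D, 2)
D2 = np.linalg.norm(Y[:, None] - Y[None, :], axis=-1)
```

ISOMAP replaces $D$ with graph-geodesic distances before applying the same scaling step.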
We study a regression problem on a compact manifold $M$. In order to take advantage of the underlying geometry and topology of the data, the regression task is performed on the basis of the first several eigenfunctions of the Laplace–Beltrami operator of the manifold, which are regularized with topological penalties. The proposed penalties are based on the topology of the sub-level sets of either the eigenfunctions or the estimated function. The overall approach is shown to yield promising and competitive performance on various applications to both synthetic and real data sets. We also provide theoretical guarantees on the regression function estimates, in terms of both their prediction errors and their smoothness (in a topological sense). Taken together, these results support the relevance of our approach when the targeted function is "topologically smooth".
We propose Langevin diffusion-based algorithms for non-convex optimization and sampling on a product manifold of spheres. Under a logarithmic Sobolev inequality, we establish a guarantee of finite-iteration convergence to the Gibbs distribution in terms of the Kullback–Leibler divergence. We show that with an appropriate temperature choice, the suboptimality gap to the global minimum is guaranteed to be small with high probability. As an application, we consider the Burer–Monteiro approach for solving semidefinite programs (SDPs) with diagonal constraints, and analyze the proposed Langevin algorithm for optimizing the non-convex objective. In particular, we establish a logarithmic Sobolev inequality for the Burer–Monteiro problem when there are no spurious local minima but saddle points are present. Combining these results, we provide global optimality guarantees for the SDP and the Max-Cut problem. More precisely, we show that the Langevin algorithm achieves $\epsilon$-accuracy with high probability in $\widetilde{\Omega}(\epsilon^{-5})$ iterations.
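A sketch of one Langevin step on a product of spheres, applied to a Burer–Monteiro objective $\mathrm{tr}(AYY^\top)$ with each row of $Y$ constrained to the unit sphere. The projection-by-renormalization discretization and all parameter values below are illustrative assumptions; the paper's scheme may use a different (e.g., geodesic) discretization.

```python
import numpy as np

def langevin_step_spheres(Y, grad, step, beta, rng):
    """One projected-Langevin step on a product of unit spheres: Euclidean
    gradient-plus-noise step, then renormalize each row back to its sphere.

    Y: (n, k) array, each row a point on S^{k-1}; beta: inverse temperature.
    """
    noise = rng.normal(size=Y.shape)
    Y = Y - step * grad(Y) + np.sqrt(2 * step / beta) * noise
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

# Hypothetical usage: Burer-Monteiro objective tr(A Y Y^T) for a random
# symmetric A (as arises for Max-Cut-type SDPs with diagonal constraints).
rng = np.random.default_rng(0)
n, k = 20, 4
A = rng.normal(size=(n, n)); A = (A + A.T) / 2
grad = lambda Y: 2 * A @ Y               # gradient of tr(A Y Y^T)
Y = rng.normal(size=(n, k))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
for _ in range(100):
    Y = langevin_step_spheres(Y, grad, step=1e-3, beta=50.0, rng=rng)
```

The renormalization enforces the diagonal constraint $\mathrm{diag}(YY^\top) = \mathbf{1}$ of the SDP at every iterate.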
The stochastic block model (SBM) is a random graph model in which different groups of vertices connect differently. It is widely employed as a canonical model to study clustering and community detection, and provides fertile ground to study the information-theoretic and computational tradeoffs that arise in combinatorial statistics and, more generally, data science. This monograph surveys the recent developments that establish the fundamental limits for community detection in the SBM, both with respect to information-theoretic and computational tradeoffs, and for various recovery requirements such as exact, partial, and weak recovery. The main results discussed are the phase transition for exact recovery at the Chernoff–Hellinger threshold, the phase transition for weak recovery at the Kesten–Stigum threshold, the optimal SNR–mutual information tradeoff for partial recovery, and the gap between information-theoretic and computational thresholds. The monograph gives a principled derivation of the main algorithms developed in the quest of achieving the limits, in particular via graph-splitting, semidefinite programming, (linearized) belief propagation, classical/nonbacktracking spectral methods, and graph powering. Extensions to other block models, such as geometric block models, and a few open problems are also discussed.
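A small illustration of the model and of one of the classical spectral methods the monograph surveys: sample a two-community SBM and recover the communities from the sign of the second-largest adjacency eigenvector. The parameters below are chosen well above any recovery threshold so that recovery is easy.

```python
import numpy as np

def sample_sbm(n, p, q, rng):
    """Two-community SBM: within-block edge probability p, between-block q."""
    z = np.repeat([0, 1], n // 2)
    P = np.where(z[:, None] == z[None, :], p, q)
    U = rng.uniform(size=(n, n))
    A = np.triu((U < P).astype(float), 1)
    return A + A.T, z

rng = np.random.default_rng(0)
A, z = sample_sbm(200, 0.5, 0.05, rng)
w, V = np.linalg.eigh(A)
guess = (V[:, -2] > 0).astype(int)        # sign of second-largest eigenvector
acc = max(np.mean(guess == z), np.mean(guess != z))   # accuracy up to flip
```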
We study a multivariate version of trend filtering, called Kronecker trend filtering or KTF, for the case in which the design points form a lattice in $d$ dimensions. KTF is a natural extension of univariate trend filtering (Steidl et al., 2006; Kim et al., 2009; Tibshirani, 2014), and is defined by minimizing a penalized least squares problem whose penalty term sums the absolute (higher-order) differences of the parameter to be estimated along each of the coordinate directions. The corresponding penalty operator can be written in terms of Kronecker products of univariate trend filtering penalty operators, hence the name Kronecker trend filtering. Equivalently, one can view KTF as an $\ell_1$-penalized basis regression problem, where the basis functions are tensor products of falling factorial functions, a piecewise polynomial (discrete spline) basis that underlies univariate trend filtering. This paper is a unification and extension of the results in Sadhanala et al. (2016, 2017). We develop a complete set of theoretical results that describe the behavior of $k^{\mathrm{th}}$ order Kronecker trend filtering in $d$ dimensions, for every $k \geq 0$ and $d \geq 1$. This reveals a number of interesting phenomena, including the dominance of KTF over linear smoothers in estimating heterogeneously smooth functions, and a phase transition at $d = 2(k+1)$, a boundary past which (on the high dimension-to-smoothness side) linear smoothers fail to be consistent entirely. We also leverage recent results on discrete splines from Tibshirani (2020), in particular, discrete spline interpolation results that enable us to extend the KTF estimate to any off-lattice location in constant time (independent of the size of the lattice).
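The Kronecker structure of the penalty operator is easy to write down explicitly for $d = 2$: stack the $(k+1)$-st univariate difference operators applied along each coordinate, each expanded by a Kronecker product with an identity. A sketch, with a sanity check that for $k = 1$ the penalty annihilates functions linear in each coordinate:

```python
import numpy as np

def diff_op(n, order):
    """Univariate discrete difference operator of the given order, (n-order) x n."""
    D = np.eye(n)
    for _ in range(order):
        D = D[1:] - D[:-1]
    return D

def ktf_penalty_operator(n1, n2, k):
    """KTF penalty operator on an n1 x n2 lattice (d = 2): (k+1)-st
    differences along each coordinate, stacked.  Assumes the parameter is
    the lattice unraveled in row-major order (second coordinate fastest)."""
    D1 = diff_op(n1, k + 1)
    D2 = diff_op(n2, k + 1)
    return np.vstack([np.kron(D1, np.eye(n2)),   # differences along coord 1
                      np.kron(np.eye(n1), D2)])  # differences along coord 2

# For k = 1 the penalty vanishes on functions linear in each coordinate.
n1, n2 = 6, 5
x1, x2 = np.meshgrid(np.arange(n1), np.arange(n2), indexing="ij")
theta = (2.0 * x1 + 3.0 * x2 + 1.0).ravel()
P = ktf_penalty_operator(n1, n2, 1)
```

The KTF estimate then minimizes $\tfrac12\|y-\theta\|_2^2 + \lambda \|P\theta\|_1$ over $\theta$.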
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning a Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provide rates of convergence, in terms of the latent parameters, which include the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
In the (special) smoothing spline problem, one considers a variational problem with a quadratic data fidelity penalty and Laplacian regularization. Higher regularity can be obtained by replacing the Laplacian regularizer with a poly-Laplacian regularizer. The methodology is readily adapted to graphs, and here we consider graph poly-Laplacian regularization in a fully supervised, non-parametric, noise-corrupted regression problem. In particular, given a dataset $\{x_i\}_{i=1}^n$ and a set of noisy labels $\{y_i\}_{i=1}^n \subset \mathbb{R}$, we let $u_n:\{x_i\}_{i=1}^n \to \mathbb{R}$ be the minimizer of an energy consisting of a data fidelity term and an appropriately scaled graph poly-Laplacian term. When $y_i = g(x_i)+\xi_i$, for iid noise $\xi_i$, and using the geometric random graph, we identify (with high probability) the rate of convergence of $u_n$ to $g$ in the large data limit $n \to \infty$. Furthermore, our rate (up to logarithms) coincides with the known rate of convergence in the usual smoothing spline model.
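The estimator described above has a simple closed form: minimizing $\|u - y\|^2 + \mu\, u^\top L^s u$ over $u$ gives $u = (I + \mu L^s)^{-1} y$. A small dense sketch on a path graph (the graph, $\mu$, and $s$ here are illustrative choices, not the paper's geometric random graph setup):

```python
import numpy as np

def poly_laplacian_regression(W, y, s, mu):
    """Graph poly-Laplacian regression: minimize |u - y|^2 + mu * u^T L^s u,
    with closed-form solution u = (I + mu L^s)^{-1} y."""
    L = np.diag(W.sum(1)) - W
    n = len(y)
    return np.linalg.solve(np.eye(n) + mu * np.linalg.matrix_power(L, s), y)

# Hypothetical usage: denoise a smooth signal on a path graph.
rng = np.random.default_rng(0)
n = 100
W = np.zeros((n, n))
idx = np.arange(n - 1)
W[idx, idx + 1] = W[idx + 1, idx] = 1.0   # path graph adjacency
g = np.sin(np.linspace(0, np.pi, n))
y = g + 0.3 * rng.normal(size=n)
u = poly_laplacian_regression(W, y, s=2, mu=10.0)
```

In the spectral domain this applies the filter $1/(1 + \mu\lambda^s)$ to each graph frequency $\lambda$, passing smooth components and attenuating noise.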
Experimental sciences have come to depend heavily on our ability to organize, interpret and analyze high-dimensional datasets produced from observations of a large number of variables governed by natural processes. Natural laws, conservation principles, and dynamical structure introduce intricate inter-dependencies among these observed variables, which in turn yield geometric structure, with fewer degrees of freedom, on the dataset. We show how fine-scale features of this structure in data can be extracted from \emph{discrete} approximations to quantum mechanical processes given by data-driven graph Laplacians and localized wavepackets. This data-driven quantization procedure leads to a novel, yet natural uncertainty principle for data analysis induced by limited data. We illustrate the new approach with algorithms and several applications to real-world data, including the learning of patterns and anomalies in social distancing and mobility behavior during the COVID-19 pandemic.
We propose a spectral clustering algorithm for analyzing the dependence structure of multivariate extremes. More specifically, we focus on the asymptotic dependence of multivariate extremes, characterized by the angular or spectral measure in extreme value theory. Our work studies the theoretical performance of spectral clustering based on a random graph constructed from an extremal sample, i.e., the angular parts of the random vectors whose radius exceeds a large threshold. In particular, we derive the asymptotic distribution of extremes arising from a linear factor model and prove that, under certain conditions, spectral clustering can consistently identify the clusters of extremes arising in this model. Leveraging this result, we propose a simple consistent estimation strategy for learning the angular measure. Our theoretical findings are complemented by numerical experiments illustrating the finite-sample performance of our method.
We study the Langevin dynamics of a physical system with manifold structure $\mathcal{M} \subset \mathbb{R}^p$ based on collected sample points $\{\mathsf{x}_i\}_{i=1}^n \subset \mathcal{M}$ that probe the unknown manifold $\mathcal{M}$. Through the diffusion map, we first learn the reaction coordinates $\{\mathsf{y}_i\}_{i=1}^n \subset \mathcal{N}$ corresponding to $\{\mathsf{x}_i\}_{i=1}^n$, where $\mathcal{N}$ is a manifold diffeomorphic to $\mathcal{M}$ and isometrically embedded in $\mathbb{R}^\ell$ with $\ell \ll p$. The induced Langevin dynamics on $\mathcal{N}$ in terms of the reaction coordinates captures the slow time-scale dynamics, such as conformational changes in biochemical reactions. To construct an efficient and stable approximation for the Langevin dynamics on $\mathcal{N}$, we leverage the corresponding Fokker–Planck equation on the manifold $\mathcal{N}$ in terms of the reaction coordinates $\mathsf{y}$. We propose an implementable, unconditionally stable, data-driven finite volume scheme for this Fokker–Planck equation, which automatically incorporates the manifold structure of $\mathcal{N}$. Furthermore, we provide a weighted $L^2$ convergence analysis of the finite volume scheme on $\mathcal{N}$. The proposed finite volume scheme leads to a Markov chain on $\{\mathsf{y}_i\}_{i=1}^n$ with an approximated transition probability and jump rate between nearest neighbor points. After an unconditionally stable explicit time discretization, the data-driven finite volume scheme gives an approximated Markov process for the Langevin dynamics on $\mathcal{N}$, and the approximated Markov process enjoys detailed balance, ergodicity, and other good properties.