When observations are truncated, we are limited to an incomplete picture of the dataset. Recent methods address truncated density estimation by turning to score matching, which does not require access to the intractable normalizing constant. We provide a novel extension of score matching to truncated densities on Riemannian manifolds. Applications are presented for the von Mises-Fisher and Kent distributions on a two-dimensional domain in $\mathbb{R}^3$, together with a real-world application to extreme storm observations in the USA. In simulated-data experiments, our score matching estimator approximates the true parameter values with low estimation error and shows improvements over a maximum likelihood estimator.
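As a rough sketch of the kind of objective involved (stated in the Euclidean case; the exact weighting scheme is an assumption here, not a quotation of the abstract), truncated score matching typically weights the score matching integrand by a function $g \ge 0$ that vanishes on the truncation boundary, so that the boundary terms arising from integration by parts disappear:

$$J(\theta) \;=\; \mathbb{E}_{x\sim p_{\mathrm{data}}}\Big[\, g(x)\Big(\tfrac{1}{2}\,\lVert \nabla_x \log p_\theta(x)\rVert^2 + \Delta_x \log p_\theta(x)\Big) + \nabla g(x)^{\top} \nabla_x \log p_\theta(x) \Big].$$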
The problem of sampling constrained continuous distributions arises frequently in many machine/statistical learning models. Many Markov chain Monte Carlo (MCMC) sampling methods have been adapted to handle different types of constraints on random variables. Among these methods, Hamiltonian Monte Carlo (HMC) and related approaches have significant advantages in computational efficiency over other counterparts. In this paper, we first review HMC and some extended sampling methods, and then concretely explain three constrained HMC-based sampling methods: reflection, reformulation, and spherical HMC. For illustration, we apply these methods to three well-known constrained sampling problems: truncated multivariate normal distributions, Bayesian regularized regression, and nonparametric density estimation. In this review, we also connect constrained sampling with a similar problem in the statistical design of experiments with constrained design spaces.
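As an illustrative and deliberately simplified sketch of one of the three reviewed strategies, the snippet below implements reflection inside the leapfrog integrator for a box-truncated Gaussian target; the target, bounds, and step sizes are made-up toy choices, not the review's experiments.

```python
import numpy as np

def leapfrog_reflect(q, p, grad_logp, eps, L, lower, upper):
    """Leapfrog steps with reflection at the box constraints [lower, upper]."""
    p = p + 0.5 * eps * grad_logp(q)
    for _ in range(L):
        q = q + eps * p
        # Reflect the position and flip the momentum whenever a coordinate exits the box.
        for d in range(q.size):
            while q[d] < lower[d] or q[d] > upper[d]:
                if q[d] < lower[d]:
                    q[d] = 2 * lower[d] - q[d]
                else:
                    q[d] = 2 * upper[d] - q[d]
                p[d] = -p[d]
        p = p + eps * grad_logp(q)
    p = p - 0.5 * eps * grad_logp(q)
    return q, p

def reflective_hmc(logp, grad_logp, q0, lower, upper, eps=0.1, L=20, iters=1000):
    """Very small reflective-HMC sampler for a truncated continuous target."""
    rng = np.random.default_rng(0)
    q, samples = q0.copy(), []
    for _ in range(iters):
        p0 = rng.standard_normal(q.size)
        q_new, p_new = leapfrog_reflect(q.copy(), p0.copy(), grad_logp, eps, L, lower, upper)
        # Metropolis correction using the total (potential + kinetic) energy difference.
        log_accept = (logp(q_new) - 0.5 * p_new @ p_new) - (logp(q) - 0.5 * p0 @ p0)
        if np.log(rng.uniform()) < log_accept:
            q = q_new
        samples.append(q.copy())
    return np.array(samples)

# Toy example: a standard bivariate normal truncated to the box [0, 2]^2.
lower, upper = np.zeros(2), 2.0 * np.ones(2)
logp = lambda q: -0.5 * q @ q
grad_logp = lambda q: -q
draws = reflective_hmc(logp, grad_logp, q0=np.ones(2), lower=lower, upper=upper)
```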
Diffusion models are the current state of the art for image generation and likelihood estimation. In this work, we generalize continuous-time diffusion models to arbitrary Riemannian manifolds and derive a variational framework for likelihood estimation. Computationally, we propose new methods for computing the Riemannian divergence required in the likelihood estimate. Moreover, generalizing the Euclidean case, we prove that maximizing this variational lower bound is equivalent to Riemannian score matching. Empirically, we demonstrate the expressive power of Riemannian diffusion models on a variety of smooth manifolds such as spheres, tori, hyperbolic manifolds, and orthogonal groups. Our proposed method achieves new state-of-the-art likelihoods on all benchmarks.
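Schematically, and only as an assumption about the usual form of such bounds rather than a quotation of this paper, the Euclidean analogue of the variational objective reduces (up to additive constants) to a time-weighted denoising score matching loss, with the Riemannian version replacing the gradient and norm by their manifold counterparts:

$$\mathcal{L}(\theta) \;\propto\; \int_0^T \lambda(t)\, \mathbb{E}_{x_0,\, x_t}\Big[\big\lVert s_\theta(x_t, t) - \nabla_{x_t} \log p_{t\mid 0}(x_t \mid x_0) \big\rVert^2\Big]\, \mathrm{d}t \;+\; \mathrm{const}.$$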
Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood $\log p(x)$ for the training data, we fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincar\'e, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
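For context, the Poincar\'e constant mentioned above is the smallest $C_P$ such that, for all sufficiently smooth test functions $f$,

$$\operatorname{Var}_{p}\!\big(f(X)\big) \;\le\; C_P\, \mathbb{E}_{p}\big[\lVert \nabla f(X) \rVert^2\big],$$

and a small $C_P$ corresponds to fast mixing of Langevin dynamics, which is the regime in which the abstract states that score matching is statistically comparable to maximum likelihood.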
Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when applying them to multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that alleviates it. In the context of density estimation, we illustrate our proposed divergences and report improved performance compared with traditional approaches.
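For concreteness (a standard definition rather than one quoted from this work), the canonical score-based divergence is the Fisher divergence,

$$\mathrm{F}(p \,\Vert\, q) \;=\; \mathbb{E}_{x\sim p}\big[\lVert \nabla_x \log p(x) - \nabla_x \log q(x) \rVert^2\big],$$

which sees densities only through their local log-gradients; when modes are well separated, the relative mixture weights have almost no effect on the score in regions that carry mass, which is one way the blindness issue can be understood.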
It is common for data structures such as images and shapes of 2D objects to be represented as points on a manifold. The utility of a mechanism that produces sanitized, differentially private estimates from such data is intimately tied to how compatible it is with the underlying structure and geometry of the space. In particular, as recently shown, the utility of the Laplace mechanism on positively curved manifolds, such as Kendall's 2D shape space, is significantly influenced by the curvature. Focusing on the problem of privatizing the Fr\'echet mean of a sample of points on a manifold, we exploit the characterization of the mean as the minimizer of an objective function comprised of a sum of squared distances, and develop a K-norm gradient mechanism on Riemannian manifolds that favors values producing gradients of the objective function close to zero. For the case of positively curved manifolds, we describe how using the gradient of the squared-distance function offers better control over sensitivity than the Laplace mechanism, and demonstrate this numerically on a dataset of shapes of corpora callosa. Further illustrations of the mechanism's utility on the sphere and on the manifold of symmetric positive definite matrices are also presented.
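As a minimal, non-private sketch of the underlying optimization (the calibrated noise of the K-norm gradient mechanism is deliberately omitted, and the step-size and iteration settings are illustrative choices, not the paper's), the Fr\'echet mean of points on the unit sphere can be computed by Riemannian gradient descent on the sum of squared geodesic distances:

```python
import numpy as np

def log_map(p, x):
    """Riemannian log map on the unit sphere: tangent vector at p pointing towards x."""
    cos_theta = np.clip(p @ x, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-12:
        return np.zeros_like(p)
    return theta * (x - cos_theta * p) / np.sin(theta)

def exp_map(p, v):
    """Riemannian exp map on the unit sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v

def frechet_mean_sphere(points, step=1.0, iters=100):
    """Gradient descent for the Frechet mean: minimize the mean squared geodesic distance."""
    mu = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        # Riemannian gradient of 0.5 * mean_i d(mu, x_i)^2 is -mean_i Log_mu(x_i).
        grad = -sum(log_map(mu, x) for x in points) / len(points)
        mu = exp_map(mu, -step * grad)
    return mu

# Toy example: a few points clustered around the north pole of S^2.
rng = np.random.default_rng(0)
pts = rng.normal([0.0, 0.0, 1.0], 0.1, size=(20, 3))
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
print(frechet_mean_sphere(pts))
```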
One often wants to estimate statistical models where the probability density function is known only up to a multiplicative normalization constant. Typically, one then has to resort to Markov Chain Monte Carlo methods, or approximations of the normalization constant. Here, we propose that such models can be estimated by minimizing the expected squared distance between the gradient of the log-density given by the model and the gradient of the log-density of the observed data. While the estimation of the gradient of log-density function is, in principle, a very difficult non-parametric problem, we prove a surprising result that gives a simple formula for this objective function. The density function of the observed data does not appear in this formula, which simplifies to a sample average of a sum of some derivatives of the log-density given by the model. The validity of the method is demonstrated on multivariate Gaussian and independent component analysis models, and by estimating an overcomplete filter set for natural image data.
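For reference, the resulting objective (in its standard published form, with $\psi(x;\theta)=\nabla_x \log p_\theta(x)$ and under smoothness and decay assumptions) is a sample average of derivatives of the model log-density:

$$J(\theta) \;=\; \mathbb{E}_{x\sim p_{\mathrm{data}}}\Bigg[\sum_{i=1}^{d}\Big(\partial_i \psi_i(x;\theta) + \tfrac{1}{2}\,\psi_i(x;\theta)^2\Big)\Bigg] \;+\; \mathrm{const}.$$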
We are interested in learning generative models for complex geometries described via manifolds, such as spheres, tori, and other implicit surfaces. Current extensions of existing (Euclidean) generative models are restricted to specific geometries and typically suffer from high computational cost. We introduce Moser Flow (MF), a new class of generative models within the family of continuous normalizing flows (CNFs). MF likewise yields a CNF; however, unlike other CNF methods, its model (learned) density is parameterized as the source (prior) density minus the divergence of a neural network (NN). The divergence is a local, linear differential operator, easy to approximate and compute on manifolds. Therefore, unlike other CNFs, MF does not require invoking or backpropagating through an ODE solver during training. Furthermore, representing the model density explicitly as the divergence of an NN rather than as a solution of an ODE facilitates learning high-fidelity densities. Theoretically, we prove that MF constitutes a universal density approximator under suitable assumptions. Empirically, we demonstrate for the first time the use of flow models for sampling from general curved surfaces, and achieve significant improvements in density estimation, sample quality, and training complexity over existing CNFs on challenging geometries and real-world benchmarks from earth and climate science.
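In symbols (a schematic restatement of the parameterization described above, with notation chosen here), the learned density is the prior density minus a divergence,

$$\mu_\theta(x) \;=\; \nu(x) \;-\; \operatorname{div}\, u_\theta(x),$$

where $\nu$ is the source (prior) density and $u_\theta$ is a neural-network vector field; training can then fit $\mu_\theta$ to data directly (with positivity handled, e.g., by a penalty) without ever calling an ODE solver.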
Discrete state spaces represent a major computational challenge for statistical inference, since the computation of normalizing constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge by developing a novel Bayesian inference procedure suitable for discrete intractable likelihoods. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence in lieu of the problematic intractable likelihood. The result is a generalized posterior that can be sampled using standard computational tools, such as Markov chain Monte Carlo, thereby circumventing the intractable normalizing constant. The statistical properties of the generalized posterior are analyzed, with sufficient conditions for posterior consistency and asymptotic normality established. In addition, a novel and general approach to calibration of generalized posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalized Bayesian inference at low computational cost.
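The construction follows the generic generalized- (Gibbs-) posterior template; schematically, and with notation assumed here,

$$\pi_\beta(\theta \mid x_{1:n}) \;\propto\; \pi(\theta)\, \exp\!\big(-\beta\, n\, \widehat{L}_n(\theta)\big),$$

where $\widehat{L}_n$ is an empirical discrepancy between model and data that does not involve the normalizing constant (here, the discrete Fisher divergence) and $\beta>0$ is the weight addressed by the calibration procedure.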
Many variants of the Wasserstein distance have been introduced to alleviate its original computational burden. In particular, the Sliced-Wasserstein distance (SW) exploits one-dimensional projections, for which a closed-form solution of the Wasserstein distance is available. Yet it is restricted to data living in Euclidean spaces, while the Wasserstein distance has been studied and used recently on manifolds. We focus more specifically on the sphere, for which we define a novel SW discrepancy, which we call spherical Sliced-Wasserstein, making a first step towards defining SW discrepancies on manifolds. Our construction notably relies on the closed-form solution of the Wasserstein distance on the circle, together with a new spherical Radon transform. Along with efficient algorithms and the corresponding implementations, we illustrate its properties in several machine learning use cases where spherical representations of data are at stake: density estimation on the sphere, variational inference, and hyperspherical auto-encoders.
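To fix ideas, here is a minimal sketch of the Euclidean Sliced-Wasserstein estimator that the spherical construction generalizes; the spherical variant replaces linear projections by a spherical Radon transform and uses the circle's closed-form Wasserstein distance, neither of which is shown here.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    """Monte Carlo Sliced-Wasserstein distance between two point clouds in R^d.

    Uses the closed form of the 1D Wasserstein distance between empirical
    distributions with the same number of points: sort both projections and
    average the p-th power of the coordinate-wise differences.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random projection directions on the unit sphere of R^d.
    thetas = rng.standard_normal((n_projections, d))
    thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)
    sw = 0.0
    for theta in thetas:
        x_proj = np.sort(X @ theta)
        y_proj = np.sort(Y @ theta)
        sw += np.mean(np.abs(x_proj - y_proj) ** p)
    return (sw / n_projections) ** (1.0 / p)

# Toy usage with two Gaussian samples of the same size.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(500, 3))
Y = rng.normal(0.5, 1.0, size=(500, 3))
print(sliced_wasserstein(X, Y))
```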
The Jeffreys divergence is a renowned symmetrization of the oriented Kullback-Leibler divergence, broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with their pros and cons have been proposed in the literature to estimate, approximate, or bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with an arbitrary number of components. Our heuristic relies on converting the mixtures into dually parameterized probability densities belonging to an exponential family. In particular, we consider the versatile polynomial exponential family densities, and design a goodness-of-fit divergence, in closed form, quantifying how well a Gaussian mixture is approximated by its polynomial exponential density. This goodness-of-fit divergence is a generalization of the Hyv\"arinen divergence used to estimate models with computationally intractable normalizers. It allows us to perform model selection by choosing the orders of the polynomial exponential densities used to approximate the mixtures. We show experimentally that our heuristic for approximating the Jeffreys divergence improves the computation time over stochastic Monte Carlo estimation by several orders of magnitude, while approximating the Jeffreys divergence reasonably well, especially when the mixtures have a very small number of modes. In addition, our mixture-to-exponential-family conversion technique may prove useful in other settings.
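For reference, the Jeffreys divergence between densities $p$ and $q$ is the symmetrized Kullback-Leibler divergence,

$$J(p, q) \;=\; \mathrm{KL}(p \,\Vert\, q) + \mathrm{KL}(q \,\Vert\, p) \;=\; \int \big(p(x) - q(x)\big)\, \log\frac{p(x)}{q(x)}\, \mathrm{d}x.$$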
This paper studies the statistical model of the non-centered mixture of scaled Gaussian distributions (NC-MSG). Using the Fisher-Rao information geometry associated with this distribution, we derive a Riemannian gradient descent algorithm. The algorithm is used for two minimization problems. The first is the minimization of a regularized negative log-likelihood (NLL), which realizes a trade-off between a white Gaussian distribution and the NC-MSG. Conditions on the regularization are given so that the existence of a minimum is guaranteed without assumptions on the samples. Then, the Kullback-Leibler (KL) divergence between two NC-MSGs is derived. This divergence enables us to define a minimization problem to compute centers of mass of several NC-MSGs, and the proposed Riemannian gradient descent algorithm is leveraged to solve this second minimization problem. Numerical experiments show the good performance and the speed of the Riemannian gradient descent on both problems. Finally, a nearest centroid classifier is implemented, leveraging the KL divergence and its associated center of mass. Applied to the large-scale dataset Breizhcrops, this classifier shows good accuracy as well as robustness to rigid transformations of the test set.
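Schematically (notation assumed here), each iterate of such a Riemannian gradient descent takes a step along the negative Riemannian gradient and maps it back to the parameter manifold through a retraction $R$:

$$\theta_{k+1} \;=\; R_{\theta_k}\!\big(-\alpha_k\, \operatorname{grad} \mathcal{L}(\theta_k)\big),$$

where $\operatorname{grad}\mathcal{L}$ is the gradient with respect to the Fisher-Rao metric and $\alpha_k > 0$ is a step size.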
Variational Bayesian inference is an important machine learning tool that finds applications from statistics to robotics. The goal is to find an approximate probability density function (PDF) from a chosen family that is, in some sense, closest to the full Bayesian posterior. Closeness is typically defined through the selection of an appropriate loss functional such as the Kullback-Leibler (KL) divergence. In this paper, we explore a new formulation of variational inference by exploiting the fact that (most) PDFs are members of a Bayesian Hilbert space under careful definitions of vector addition, scalar multiplication, and an inner product. We show that, under the right conditions, variational inference based on the KL divergence can amount to iterative projection, in the Euclidean sense, of the Bayesian posterior onto a subspace corresponding to the selected approximation family. We work through the details of this general framework for the specific case of the Gaussian approximation family and show the equivalence to another Gaussian variational inference approach. Furthermore, we discuss the implications for systems that exhibit sparsity, which is handled naturally in Bayesian space, and give an example of a high-dimensional robotic state estimation problem that can consequently be solved. We provide some preliminary examples of how the approach could be applied to non-Gaussian inference, and discuss the limitations of the approach in detail to encourage follow-on work along these lines.
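As a reminder of the baseline formulation being recast (standard notation, not specific to this paper), KL-based variational inference seeks

$$q^\star \;=\; \operatorname*{arg\,min}_{q \in \mathcal{Q}}\; \mathrm{KL}\big(q(x) \,\Vert\, p(x \mid z)\big),$$

where $\mathcal{Q}$ is the chosen approximation family, $x$ the latent state, and $z$ the data; the paper's point is that, viewed in a Bayesian Hilbert space, this minimization can be interpreted as iterative Euclidean projection of the posterior onto the subspace associated with $\mathcal{Q}$.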
Continuous normalizing flows (CNFs) are a class of generative models that transform a prior distribution into a model distribution by solving an ordinary differential equation (ODE). We propose to train CNFs by minimizing probability path divergences (PPDs), a new family of divergences between the probability density path generated by the CNF and a target probability density path. PPDs are formulated using a log-mass-conservation formula, a linear first-order partial differential equation relating the log target probability and the CNF's defining vector field. PPDs have several key benefits over existing methods: they avoid the need to solve an ODE per iteration, they readily apply to manifold data, they scale to high dimensions, and they are compatible with a large family of target paths interpolating between pure noise and data in finite time. Theoretically, PPDs are shown to bound classical probability divergences. Empirically, we show that CNFs obtained by minimizing PPDs achieve state-of-the-art likelihoods and sample quality on existing low-dimensional manifold benchmarks, and are the first example of generative models scaling to moderately high-dimensional manifolds.
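One standard way to write such a log-mass-conservation relation (in the Euclidean case, with notation assumed here) is the continuity equation expressed in terms of the log-density, which is linear and first-order in $\log p_t$:

$$\partial_t \log p_t(x) \;=\; -\,\nabla \!\cdot\! v_t(x) \;-\; v_t(x)^{\top} \nabla_x \log p_t(x),$$

where $v_t$ is the CNF's defining vector field.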
In the classical setting of self-selection, the goal is to learn $k$ models, simultaneously from observations $(x^{(i)}, y^{(i)})$ where $y^{(i)}$ is the output of one of $k$ underlying models on input $x^{(i)}$. In contrast to mixture models, where we observe the output of a randomly selected model, here the observed model depends on the outputs themselves, and is determined by some known selection criterion. For example, we might observe the highest output, the smallest output, or the median output of the $k$ models. In known-index self-selection, the identity of the observed model output is observable; in unknown-index self-selection, it is not. Self-selection has a long history in Econometrics and applications in various theoretical and applied fields, including treatment effect estimation, imitation learning, learning from strategically reported data, and learning from markets at disequilibrium. In this work, we present the first computationally and statistically efficient estimation algorithms for the most standard setting of this problem where the models are linear. In the known-index case, we require poly$(1/\varepsilon, k, d)$ sample and time complexity to estimate all model parameters to accuracy $\varepsilon$ in $d$ dimensions, and can accommodate quite general selection criteria. In the more challenging unknown-index case, even the identifiability of the linear models (from infinitely many samples) was not known. We show three results in this case for the commonly studied $\max$ self-selection criterion: (1) we show that the linear models are indeed identifiable, (2) for general $k$ we provide an algorithm with poly$(d) \exp(\text{poly}(k))$ sample and time complexity to estimate the regression parameters up to error $1/\text{poly}(k)$, and (3) for $k = 2$ we provide an algorithm for any error $\varepsilon$ and poly$(d, 1/\varepsilon)$ sample and time complexity.
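Concretely, in the linear setting with the $\max$ criterion (the notation below is ours, used only to fix ideas), each observation can be thought of as

$$y^{(i)} \;=\; \max_{j \in [k]} \Big( \langle w_j,\, x^{(i)} \rangle + \eta^{(i)}_j \Big),$$

with unknown regression vectors $w_1,\dots,w_k$ and noise terms $\eta^{(i)}_j$; in the known-index case the maximizing index is also observed, while in the unknown-index case it is not.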
With the rapid development of data collection techniques, complex data objects that are not in Euclidean space are frequently encountered in new statistical applications. The Fr\'echet regression model (Peterson & M\"uller 2019) provides a promising framework for regression analysis with metric space-valued responses. In this paper, we introduce a flexible sufficient dimension reduction (SDR) method for Fr\'echet regression to achieve two purposes: to mitigate the curse of dimensionality caused by high-dimensional predictors, and to provide a visual inspection tool for Fr\'echet regression. Our approach is flexible enough to turn any existing SDR method for Euclidean (X,Y) into one for Euclidean X and metric space-valued Y. The basic idea is to first map the metric space-valued random object $Y$ to a real-valued random variable $f(Y)$ using a class of functions, and then perform classical SDR on the transformed data. If the class of functions is sufficiently rich, then we are guaranteed to uncover the Fr\'echet SDR space. We show that such a class, which we call an ensemble, can be generated by a universal kernel. We establish the consistency and asymptotic convergence rate of the proposed methods. The finite-sample performance of the proposed methods is illustrated through simulation studies for several commonly encountered metric spaces, including the Wasserstein space, the space of symmetric positive definite matrices, and the sphere. We illustrate the data visualization aspect of our method by exploring the human mortality distribution data across countries and by studying the distribution of hematoma density.
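Roughly, and as our own symbolic paraphrase rather than the paper's exact statement, the ensemble idea is that the Fr\'echet SDR space is recovered from the classical central subspaces of the transformed responses,

$$\mathcal{S}^{\mathrm{Fr}}_{Y \mid X} \;=\; \operatorname{span}\big\{ \mathcal{S}_{f(Y)\mid X} \,:\, f \in \mathcal{F} \big\},$$

provided the family $\mathcal{F}$ (the ensemble, e.g. generated by a universal kernel) is sufficiently rich.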
We present a new estimation principle for parameterized statistical models. The idea is to perform nonlinear logistic regression to discriminate between the observed data and some artificially generated noise, using the model log-density function in the regression nonlinearity. We show that this leads to a consistent (convergent) estimator of the parameters, and analyze the asymptotic variance. In particular, the method is shown to directly work for unnormalized models, i.e. models where the density function does not integrate to one. The normalization constant can be estimated just like any other parameter. For a tractable ICA model, we compare the method with other estimation methods that can be used to learn unnormalized models, including score matching, contrastive divergence, and maximum-likelihood where the normalization constant is estimated with importance sampling. Simulations show that noise-contrastive estimation offers the best trade-off between computational and statistical efficiency. The method is then applied to the modeling of natural images: We show that the method can successfully estimate a large-scale two-layer model and a Markov random field.
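In its simplest form (equal numbers of data and noise samples, standard notation), the objective is a logistic regression whose nonlinearity is the log-density ratio between the unnormalized model $p_\theta$ (with the normalization constant treated as a free parameter) and the noise density $p_n$:

$$J(\theta) \;=\; \mathbb{E}_{x\sim p_{\mathrm{data}}}\big[\log \sigma\big(G(x;\theta)\big)\big] + \mathbb{E}_{y\sim p_n}\big[\log\big(1 - \sigma\big(G(y;\theta)\big)\big)\big], \qquad G(u;\theta) = \log p_\theta(u) - \log p_n(u),$$

where $\sigma$ is the logistic sigmoid.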
For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).
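One concrete way to read this (our shorthand, consistent with the description above) is that both distributions are "spread" by a common noise kernel $K$ before the divergence is taken:

$$\tilde p(y) = \int K(y \mid x)\, \mathrm{d}\mathbb{P}(x), \qquad \tilde q(y) = \int K(y \mid x)\, \mathrm{d}\mathbb{Q}(x), \qquad \tilde{\mathrm{D}}(\mathbb{P} \,\Vert\, \mathbb{Q}) \;=\; \mathrm{D}(\tilde{\mathbb{P}} \,\Vert\, \tilde{\mathbb{Q}}),$$

so that, under suitable conditions on $K$ (in particular injectivity of the spreading), $\tilde{\mathrm{D}}$ exists even when $\mathbb{P}$ and $\mathbb{Q}$ have disjoint supports and still vanishes only when $\mathbb{P} = \mathbb{Q}$.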
Exponential families are widely used in machine learning, including many distributions in continuous and discrete domains (e.g., via the softmax transformation, Gaussian, Dirichlet, Poisson, and categorical distributions). Distributions in each of these families have fixed support. In contrast, for finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax, $\alpha$-entmax, and fusedmax) has led to distributions with varying support. This paper develops sparse alternatives for continuous distributions, based on several technical contributions. First, we define $\Omega$-regularized prediction maps and Fenchel-Young losses for arbitrary domains (possibly infinite or continuous). For linearly parameterized families, we show that minimization of Fenchel-Young losses is equivalent to moment matching of the statistics, generalizing a fundamental property of exponential families. When $\Omega$ is a Tsallis negentropy with parameter $\alpha$, we obtain "deformed exponential families," which include $\alpha$-entmax and sparsemax ($\alpha = 2$) as particular cases. For quadratic energy functions, the resulting densities are $\beta$-Gaussians, an instance of elliptical distributions that contain as particular cases the Gaussian, biweight, triweight, and Epanechnikov densities, and for which we derive closed-form expressions for the variance, the Tsallis entropy, and the Fenchel-Young loss. When $\Omega$ is a total variation or Sobolev regularizer, we obtain continuous versions of fusedmax. Finally, we introduce continuous attention mechanisms, deriving efficient gradient backpropagation algorithms for $\alpha \in \{1, 4/3, 3/2, 2\}$. Using these algorithms, we demonstrate our sparse continuous distributions for attention-based audio classification and visual question answering, showing that they allow attending to time intervals and compact regions.
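For orientation, the two central objects (standard definitions restated here, not quoted from the paper) are the $\Omega$-regularized prediction map and the associated Fenchel-Young loss,

$$\hat y_\Omega(\theta) \;=\; \operatorname*{arg\,max}_{\mu \in \mathcal{M}} \big\{ \langle \theta, \mu \rangle - \Omega(\mu) \big\}, \qquad L_\Omega(\theta;\, y) \;=\; \Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle,$$

where $\mathcal{M}$ is the probability domain (the simplex in the finite case, densities in the continuous case) and $\Omega^*$ is the convex conjugate of $\Omega$; the Shannon negentropy recovers softmax and the logistic loss, while Tsallis negentropies with $\alpha > 1$ yield sparse maps such as sparsemax ($\alpha = 2$).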
We extend to the functional setting methods originally developed in the fields of multidimensional scaling and dimensionality reduction for multivariate data. We focus on classical scaling and Isomap, prototypical methods that play important roles in these fields, and showcase their use in the context of functional data analysis. In doing so, we highlight the crucial role played by the ambient metric.
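As a reference point for the multivariate case that the paper lifts to functional data, here is a minimal classical (Torgerson) MDS sketch operating on a pairwise distance matrix; for functional data the distances would instead be computed under the chosen ambient metric.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical scaling: embed points in R^k from a pairwise distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest eigenvalues
    scale = np.sqrt(np.maximum(eigvals[order], 0.0))
    return eigvecs[:, order] * scale             # n x k embedding coordinates

# Toy usage: recover a 2D configuration from its Euclidean distance matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
print(classical_mds(D, k=2))
```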