Smart Paper Notes

Blindness of score-based methods to isolated components and mixing proportions

Li K. Wenliang , Heishiro Kanagawa

Categories: Machine Learning (Statistics) | Machine Learning

2020-08-23

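Statistical tasks such as density estimation and approximate Bayesian inference often involve densities with unknown normalising constants. Score-based methods, including score matching, are popular techniques because they are free of normalising constants. Although these methods enjoy theoretical guarantees, a little-known fact is that they exhibit practical failure modes when the unnormalised distribution of interest has isolated components: they cannot discover isolated components or identify the correct mixing proportions between components. We demonstrate these findings using simple distributions and present heuristic attempts to address these issues. We hope to bring these issues to the attention of theoreticians and practitioners when developing new algorithms and applications.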
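The mixing-proportion failure mode can be sketched numerically. The example below is an illustration under stated assumptions, not the authors' experiment: it takes a hypothetical two-component mixture `w * N(-sep, 1) + (1 - w) * N(sep, 1)` and integrates the Fisher divergence (with the common 1/2 factor) between a mixture with `w = 0.5` and one with `w = 0.9`. The function names, grid limits, and parameter values are all choices made here for illustration.

```python
import math


def normal_pdf(x, mean):
    """Density of N(mean, 1) at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def mixture_pdf(x, w, sep):
    """Density of w * N(-sep, 1) + (1 - w) * N(sep, 1)."""
    return w * normal_pdf(x, -sep) + (1.0 - w) * normal_pdf(x, sep)


def mixture_score(x, w, sep):
    """d/dx log p(x) for the mixture, computed via component responsibilities."""
    log_a = math.log(w) - 0.5 * (x + sep) ** 2
    log_b = math.log(1.0 - w) - 0.5 * (x - sep) ** 2
    top = max(log_a, log_b)  # subtract the max for numerical stability
    a = math.exp(log_a - top)
    b = math.exp(log_b - top)
    resp_left = a / (a + b)  # posterior weight of the left component at x
    return resp_left * (-(x + sep)) + (1.0 - resp_left) * (-(x - sep))


def fisher_divergence(w_p, w_q, sep, lo=-30.0, hi=30.0, n=20001):
    """0.5 * E_p[(score_p - score_q)^2], by the trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        diff = mixture_score(x, w_p, sep) - mixture_score(x, w_q, sep)
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * mixture_pdf(x, w_p, sep) * diff * diff
    return 0.5 * total * h


if __name__ == "__main__":
    # Overlapping components versus well-separated components.
    for sep in (0.5, 8.0):
        print(sep, fisher_divergence(0.5, 0.9, sep))
```

With overlapping components the divergence is clearly positive, while for well-separated components it becomes vanishingly small even though the mixing proportions differ substantially. The two scores disagree only in the low-density region between the components, which is the behaviour the abstract describes.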

Translated by Google Translate

Controlling Moments with Kernel Stein Discrepancies

Heishiro Kanagawa , Arthur Gretton , Lester Mackey

Categories: Machine Learning (Statistics) | Machine Learning

2022-11-10

Quantifying the deviation of a probability distribution is challenging when the target distribution is defined by a density with an intractable normalizing constant. The kernel Stein discrepancy (KSD) was proposed to address this problem and has been applied to various tasks including diagnosing approximate MCMC samplers and goodness-of-fit testing for unnormalized statistical models. This article investigates a convergence control property of the diffusion kernel Stein discrepancy (DKSD), an instance of the KSD proposed by Barp et al. (2019). We extend the result of Gorham and Mackey (2017), which showed that the KSD controls the bounded-Lipschitz metric, to functions of polynomial growth. Specifically, we prove that the DKSD controls the integral probability metric defined by a class of pseudo-Lipschitz functions, a polynomial generalization of Lipschitz functions. We also provide practical sufficient conditions on the reproducing kernel for the stated property to hold. In particular, we show that the DKSD detects non-convergence in moments with an appropriate kernel.

Robust Generalised Bayesian Inference for Intractable Likelihoods

Takuo Matsubara , Jeremias Knoblauch , François-Xavier Briol , Chris J. Oates

Categories: Machine Learning (Statistics)

2021-04-15

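Generalised Bayesian inference updates prior beliefs using a loss function rather than a likelihood, and can therefore be used to confer robustness against possible mis-specification of the likelihood. Here we consider generalised Bayesian inference with a Stein discrepancy as the loss function, motivated by applications in which the likelihood contains an intractable normalisation constant. In this context, the Stein discrepancy circumvents evaluation of the normalisation constant and produces generalised posteriors that are either available in closed form or accessible using standard Markov chain Monte Carlo. On a theoretical level, we show consistency, asymptotic normality, and bias-robustness, highlighting how these properties are affected by the choice of Stein discrepancy. We then provide numerical experiments on a range of intractable distributions, including applications to kernel-based exponential family models and non-Gaussian graphical models.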

Towards Healing the Blindness of Score Matching

Mingtian Zhang , Oscar Key , Peter Hayes , David Barber , Brooks Paige , François-Xavier Briol

Categories: Machine Learning (Statistics) | Machine Learning

2022-09-15

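Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when they are used for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate it. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.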

Statistical Efficiency of Score Matching: The View from Isoperimetry

Frederic Koehler , Alexander Heckett , Andrej Risteski

Categories: Machine Learning | Machine Learning (Statistics)

2022-10-03

Deep generative models parametrized up to a normalizing constant (e.g. energy-based models) are difficult to train by maximizing the likelihood of the data because the likelihood and/or gradients thereof cannot be explicitly or efficiently written down. Score matching is a training method whereby, instead of fitting the likelihood $\log p(x)$ for the training data, we fit the score function $\nabla_x \log p(x)$ -- obviating the need to evaluate the partition function. Though this estimator is known to be consistent, it is unclear whether (and when) its statistical efficiency is comparable to that of maximum likelihood -- which is known to be (asymptotically) optimal. We initiate this line of inquiry in this paper, and show a tight connection between statistical efficiency of score matching and the isoperimetric properties of the distribution being estimated -- i.e. the Poincaré, log-Sobolev and isoperimetric constant -- quantities which govern the mixing time of Markov processes like Langevin dynamics. Roughly, we show that the score matching estimator is statistically comparable to the maximum likelihood when the distribution has a small isoperimetric constant. Conversely, if the distribution has a large isoperimetric constant -- even for simple families of distributions like exponential families with rich enough sufficient statistics -- score matching will be substantially less efficient than maximum likelihood. We suitably formalize these results both in the finite sample regime, and in the asymptotic regime. Finally, we identify a direct parallel in the discrete setting, where we connect the statistical properties of pseudolikelihood estimation with approximate tensorization of entropy and the Glauber dynamics.
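To make "fitting the score" concrete, here is a minimal sketch for a one-dimensional Gaussian model, chosen here purely as an illustration (it is not an example taken from the paper). With model score $s(x) = -\lambda (x - \mu)$, the standard integration-by-parts form of the score matching objective is the average of $\tfrac{1}{2} s(x)^2 + s'(x)$ over the data, which involves no normalizing constant. The function names are hypothetical.

```python
def score_matching_objective(data, mu, lam):
    """Empirical mean of 0.5 * s(x)^2 + s'(x), where s(x) = -lam * (x - mu)."""
    n = len(data)
    return sum(0.5 * (lam * (x - mu)) ** 2 - lam for x in data) / n


def fit_gaussian_by_score_matching(data):
    """Closed-form minimiser of the objective above.

    Setting the gradient to zero gives mu = sample mean and
    lam = 1 / (biased) sample variance.
    """
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, 1.0 / var


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    mu_hat, lam_hat = fit_gaussian_by_score_matching(data)
    print(mu_hat, lam_hat, score_matching_objective(data, mu_hat, lam_hat))
```

For this particular model the score matching estimate coincides with the maximum likelihood estimate; the efficiency gap discussed in the abstract concerns distributions with a large isoperimetric constant.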

Generalised Bayesian Inference for Discrete Intractable Likelihood

Takuo Matsubara , Jeremias Knoblauch , François-Xavier Briol , Chris J. Oates

Categories: Machine Learning (Statistics)

2022-06-16

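Discrete state spaces represent a major computational challenge to statistical inference, since computing normalisation constants requires summation over large or possibly infinite sets, which can be impractical. This paper addresses this computational challenge by developing a novel generalised Bayesian inference procedure suitable for discrete intractable likelihoods. Inspired by recent methodological advances for continuous data, the main idea is to update beliefs about model parameters using a discrete Fisher divergence in place of the problematic intractable likelihood. The result is a generalised posterior that can be sampled using standard computational tools, such as Markov chain Monte Carlo, circumventing the intractable normalising constant. The statistical properties of the generalised posterior are analysed, and sufficient conditions for posterior consistency and asymptotic normality are established. In addition, a novel and general approach to calibrating generalised posteriors is proposed. Applications are presented on lattice models for discrete spatial data and on multivariate models for count data, where in each case the methodology facilitates generalised Bayesian inference at low computational cost.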

Efficient MCMC Sampling with Dimension-Free Convergence Rate using ADMM-type Splitting

Maxime Vono , Daniel Paulin , Arnaud Doucet

Categories: Machine Learning (Statistics)

2019-05-23

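Performing exact Bayesian inference for complex models is computationally intractable. Markov chain Monte Carlo (MCMC) algorithms can provide reliable approximations of the posterior distribution but are expensive for large datasets and high-dimensional models. A standard approach to mitigating this complexity consists of using subsampling techniques or distributing the data across a cluster. However, these approaches are typically unreliable in high-dimensional scenarios. We focus here on a recent alternative class of MCMC schemes that exploit a splitting strategy akin to the one used by the celebrated alternating direction method of multipliers (ADMM) optimization algorithm. These methods appear to provide empirically state-of-the-art performance, but their theoretical behavior in high dimensions is currently unknown. In this paper, we present a detailed theoretical study of one of these algorithms, known as the split Gibbs sampler. Under regularity conditions, we establish explicit convergence rates for this scheme using Ricci curvature and coupling ideas. We support our theory with numerical illustrations.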

Neural Stein critics with staged $L^2$-regularization

Matthew Repasky , Xiuyuan Cheng , Yao Xie

Categories: Machine Learning (Statistics) | Machine Learning

2022-07-07

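Learning to differentiate model distributions from observed data is a fundamental problem in statistics and machine learning, and high-dimensional data remains a challenging setting for such problems. Metrics that quantify the disparity between probability distributions, such as the Stein discrepancy, play an important role in high-dimensional statistical testing. In this paper, we consider the setting where one wishes to distinguish between data sampled from an unknown probability distribution and a nominal model distribution. While recent studies have shown that the optimal $L^2$-regularized Stein critic equals the difference of the score functions of the two probability distributions up to a multiplicative constant, we investigate the role of $L^2$ regularization when training a neural network Stein discrepancy critic function. Motivated by the Neural Tangent Kernel theory of training neural networks, we develop a novel staging procedure for the regularization weight over training time. This leverages the advantages of highly regularized training at early times while also delaying overfitting. Theoretically, we relate the training dynamics with a large regularization weight to the kernel regression optimization of the "lazy training" regime at early training times. The benefit of staged $L^2$ regularization is demonstrated on simulated high-dimensional distribution drift data and in an application to evaluating generative models of image data.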

A Spectral Representation of Kernel Stein Discrepancy with Application to Goodness-of-Fit Tests for Measures on Infinite Dimensional Hilbert Spaces

George Wynne , Mikołaj Kasprzak , Andrew B. Duncan

Categories: Machine Learning (Statistics)

2022-06-09

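Kernel Stein discrepancy (KSD) is a widely used kernel-based non-parametric measure of discrepancy between probability measures. It is often employed when a user has a collection of samples from a candidate probability measure and wishes to compare them against a specified target probability measure. A useful property of KSD is that it can be computed from samples of the candidate measure alone, without knowledge of the normalising constant of the target measure. KSD has been employed in a range of settings including goodness-of-fit testing, parametric inference, MCMC output assessment and generative modelling. Two main issues with current KSD methodology are (i) the lack of applicability beyond the finite-dimensional Euclidean setting and (ii) a lack of clarity about what influences KSD performance. This paper provides a novel spectral representation of KSD that remedies both issues, making KSD applicable to Hilbert-valued data and revealing the impact of the choice of kernel and Stein operator on the KSD. We demonstrate the efficacy of the proposed methodology by performing goodness-of-fit tests for various Gaussian and non-Gaussian functional models in a number of synthetic data experiments.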

Smooth $p$-Wasserstein Distance: Structure, Empirical Approximation, and Statistical Applications

Sloan Nietert , Ziv Goldfeld , Kengo Kato

Categories: Machine Learning (Statistics)

2021-01-11

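Discrepancy measures between probability distributions, often termed statistical distances, are ubiquitous in probability theory, statistics and machine learning. To combat the curse of dimensionality when estimating these distances from data, recent work has proposed smoothing out local irregularities in the measured distributions via convolution with a Gaussian kernel. Motivated by the scalability of this framework to high dimensions, we investigate the structural and statistical behavior of the Gaussian-smoothed $p$-Wasserstein distance $\mathsf{W}_p^{(\sigma)}$, for arbitrary $p \geq 1$. After establishing basic metric and topological properties of $\mathsf{W}_p^{(\sigma)}$, we explore $\mathsf{W}_p^{(\sigma)}(\hat{\mu}_n, \mu)$, where $\hat{\mu}_n$ is the empirical distribution of $n$ independent observations from $\mu$. We prove that $\mathsf{W}_p^{(\sigma)}$ enjoys a parametric empirical convergence rate of $n^{-1/2}$, which contrasts with the $n^{-1/d}$ rate for unsmoothed $\mathsf{W}_p$ when $d \geq 3$. Our proof relies on controlling $\mathsf{W}_p^{(\sigma)}$ by a $p$th-order smooth Sobolev distance $\mathsf{d}_p^{(\sigma)}$ and deriving the limit distribution of $\sqrt{n}\,\mathsf{d}_p^{(\sigma)}(\hat{\mu}_n, \mu)$, for all dimensions $d$. As applications, we provide asymptotic guarantees for two-sample testing and minimum distance estimation using $\mathsf{W}_p^{(\sigma)}$, with experiments for $p = 2$ using $\mathsf{d}_2^{(\sigma)}$.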

A kernel two-sample test

Categories:

We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
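The quadratic-time computation mentioned above can be sketched as follows. This is a minimal illustration of the standard unbiased estimate of the squared MMD for scalar samples. The Gaussian kernel, the fixed bandwidth of 1.0, the sample sizes, and the function names are assumptions made here, not details taken from the abstract.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(xs, ys, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples xs and ys, in O((m + n)^2) time."""
    m, n = len(xs), len(ys)
    # Within-sample sums skip the diagonal (i == j), which makes the estimate unbiased.
    k_xx = sum(gaussian_kernel(xs[i], xs[j], bandwidth)
               for i in range(m) for j in range(m) if i != j)
    k_yy = sum(gaussian_kernel(ys[i], ys[j], bandwidth)
               for i in range(n) for j in range(n) if i != j)
    k_xy = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0.0, 1.0) for _ in range(200)]
    same = [rng.gauss(0.0, 1.0) for _ in range(200)]
    shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]
    print("same distribution:   ", mmd2_unbiased(xs, same))
    print("shifted distribution:", mmd2_unbiased(xs, shifted))
```

Because the estimate is unbiased, it can come out slightly negative when the two samples share a distribution. Turning it into a test still requires a rejection threshold, for example from the bounds or the asymptotic distribution the abstract mentions.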

On the representation and learning of monotone triangular transport maps

Ricardo Baptista , Youssef Marzouk , Olivier Zahm

Categories: Machine Learning (Statistics) | Machine Learning

2020-09-22

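Transportation of measure provides a versatile approach for modeling complex probability distributions, with applications in density estimation, Bayesian inference, generative modeling, and beyond. Monotone triangular transport maps, which approximate the Knothe-Rosenblatt (KR) rearrangement, are a canonical choice for these tasks. Yet the representation and parameterization of such maps have a significant impact on their generality and expressiveness, and on the properties of the optimization problem that arises when learning a map from data (e.g., via maximum likelihood estimation). We present a general framework for representing monotone triangular maps via invertible transformations of smooth functions. We establish conditions on the transformation such that the associated infinite-dimensional minimization problem has no spurious local minima, i.e., all local minima are global minima. We show that, for target distributions satisfying certain tail conditions, the unique global minimizer corresponds to the KR map. Given a sample from the target, we then propose an adaptive algorithm that estimates a sparse semi-parametric approximation of the underlying KR map. We demonstrate how this framework can be applied to joint and conditional density estimation, likelihood-free inference, and structure learning of directed graphical models, with stable generalization performance across a range of sample sizes.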

Bayesian imaging using Plug & Play priors: when Langevin meets Tweedie

Rémi Laumont , Valentin de Bortoli , Andrés Almansa , Julie Delon , Alain Durmus , Marcelo Pereyra

Categories: Computer Vision | Machine Learning (Statistics)

2021-03-08

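Since the seminal work of Venkatakrishnan et al. in 2013, Plug & Play (PnP) methods have become ubiquitous in Bayesian imaging. These methods derive Minimum Mean Square Error (MMSE) or Maximum A Posteriori (MAP) estimators for inverse problems in imaging by combining an explicit likelihood function with a prior that is implicitly defined by an image denoising algorithm. The PnP algorithms proposed in the literature differ mainly in the iterative schemes they use for optimisation or for sampling. In the case of optimisation schemes, some recent works guarantee convergence to a fixed point, albeit not necessarily a MAP estimate. In the case of sampling schemes, to the best of our knowledge, there is no known proof of convergence. There also remain important open questions regarding whether the underlying Bayesian models and estimators are well defined, well-posed, and have the basic regularity properties required to support these numerical schemes. To address these limitations, this paper develops theory, methods, and provably convergent algorithms for performing Bayesian inference with PnP priors. We introduce two algorithms: 1) PnP-ULA (Unadjusted Langevin Algorithm) for Monte Carlo sampling and MMSE inference; and 2) PnP-SGD (Stochastic Gradient Descent) for MAP inference. Using recent results on the quantitative convergence of Markov chains, we establish detailed convergence guarantees for these two algorithms under realistic assumptions on the denoising operators used, with special attention to denoisers based on deep neural networks. We also show that these algorithms approximately target a decision-theoretically optimal Bayesian model that is well-posed. The proposed algorithms are demonstrated on several canonical problems such as image deblurring, inpainting, and denoising, where they are used for point estimation as well as for uncertainty visualisation and quantification.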

Mean field Variational Inference via Wasserstein Gradient Flow

Rentian Yao , Yun Yang

Categories: Machine Learning (Statistics)

2022-07-17

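Variational inference (VI) provides an appealing alternative to traditional sampling-based approaches for implementing Bayesian inference, owing to its conceptual simplicity, statistical accuracy and computational scalability. However, common variational approximation schemes, such as the mean-field (MF) approximation, require certain conjugacy structure to facilitate efficient computation, which may add unnecessary restrictions to the viable family of prior distributions and impose further constraints on the variational approximation family. In this work, we develop a general computational framework for implementing MF-VI via Wasserstein gradient flow (WGF), a gradient flow over the space of probability measures. When specialized to Bayesian latent variable models, we analyze the algorithmic convergence of an alternating minimization scheme based on a time-discretized WGF for implementing the MF approximation. In particular, the proposed algorithm resembles a distributional version of the EM algorithm, consisting of an E-step that updates the variational distribution of the latent variables and an M-step that performs steepest descent over the variational distribution of the parameters. Our theoretical analysis relies on optimal transport theory and subdifferential calculus in the space of probability measures. We prove exponential convergence of the time-discretized WGF for minimizing a generic objective functional that is strictly convex along generalized geodesics. We also provide a new proof of the exponential contraction of the variational distribution obtained from the MF approximation, using the fixed-point equation of the time-discretized WGF. We apply our method and theory to two classic Bayesian latent variable models, the Gaussian mixture model and the mixture of regression model. Numerical experiments are also conducted to complement the theoretical findings under these two models.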

Spread Divergence

Mingtian Zhang , Peter Hayes , Tom Bird , Raza Habib , David Barber

Categories: Machine Learning (Statistics) | Machine Learning

2018-11-21

For distributions $\mathbb{P}$ and $\mathbb{Q}$ with different supports or undefined densities, the divergence $\textrm{D}(\mathbb{P}||\mathbb{Q})$ may not exist. We define a Spread Divergence $\tilde{\textrm{D}}(\mathbb{P}||\mathbb{Q})$ on modified $\mathbb{P}$ and $\mathbb{Q}$ and describe sufficient conditions for the existence of such a divergence. We demonstrate how to maximize the discriminatory power of a given divergence by parameterizing and learning the spread. We also give examples of using a Spread Divergence to train implicit generative models, including linear models (Independent Components Analysis) and non-linear models (Deep Generative Networks).

Generalized Energy Based Models

Michael Arbel , Liang Zhou , Arthur Gretton

Categories: Machine Learning (Statistics) | Machine Learning

2020-03-10

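We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high-dimensional space; and an energy function, which refines the probability mass on the learned support. The energy function and the base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of better quality than those from the learned generator alone, indicating that, all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks, returning performance comparable to direct maximum likelihood training of the same networks.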

Controlling Wasserstein distances by Kernel norms with application to Compressive Statistical Learning

Titouan Vayer , Rémi Gribonval

Categories: Machine Learning (Statistics) | Machine Learning

2021-12-01

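Comparing probability distributions is at the crux of many machine learning algorithms. Maximum Mean Discrepancies (MMD) and Optimal Transport (OT) distances are two classes of distances between probability measures that have attracted abundant attention in past years. This paper establishes some conditions under which the Wasserstein distance can be controlled by MMD norms. Our work is motivated by compressive statistical learning (CSL) theory, a general framework for resource-efficient large-scale learning in which the training data is summarized in a single vector (called a sketch) that captures the information relevant to the considered learning task. Inspired by existing results in CSL, we introduce the Hölder Lower Restricted Isometric Property (Hölder LRIP) and show that this property comes with interesting guarantees for compressive statistical learning. Based on the relations between the MMD and the Wasserstein distance, we provide guarantees for compressive statistical learning by introducing and studying the concept of Wasserstein learnability of the learning task, that is, when some task-specific metric between probability distributions can be bounded by a Wasserstein distance.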

Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm

Qiang Liu , Dilin Wang

Categories:

2016-08-16

We propose a general purpose variational inference algorithm that forms a natural counterpart of gradient descent for optimization. Our method iteratively transports a set of particles to match the target distribution, by applying a form of functional gradient descent that minimizes the KL divergence. Empirical studies are performed on various real world models and datasets, on which our method is competitive with existing state-of-the-art methods. The derivation of our method is based on a new theoretical result that connects the derivative of KL divergence under smooth transforms with Stein's identity and a recently proposed kernelized Stein discrepancy, which is of independent interest.
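The particle transport described above can be illustrated with a small one-dimensional sketch. It uses the commonly stated SVGD update direction, $\phi(x) = \frac{1}{n}\sum_j \left[k(x_j, x)\,\nabla_{x_j}\log p(x_j) + \nabla_{x_j} k(x_j, x)\right]$, with an RBF kernel. The fixed bandwidth, step size, particle count, the target $N(2, 1)$, and the function names are all assumptions chosen here for illustration, not the settings used by the authors.

```python
import math


def svgd_step(particles, score, step_size=0.1, bandwidth=1.0):
    """One SVGD update for scalar particles with an RBF kernel of fixed bandwidth."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = math.exp(-((x_i - x_j) ** 2) / (2.0 * bandwidth ** 2))
            # Derivative of k with respect to x_j: pushes x_i away from x_j.
            grad_k = k * (x_i - x_j) / bandwidth ** 2
            # Attraction towards high density plus repulsion between particles.
            phi += k * score(x_j) + grad_k
        updated.append(x_i + step_size * phi / n)
    return updated


def run_svgd(particles, score, n_steps=1000, step_size=0.1, bandwidth=1.0):
    """Apply svgd_step repeatedly and return the final particle positions."""
    for _ in range(n_steps):
        particles = svgd_step(particles, score, step_size, bandwidth)
    return particles


def target_score(x):
    """Score of the illustrative target N(2, 1): d/dx log p(x) = -(x - 2)."""
    return -(x - 2.0)


if __name__ == "__main__":
    # Deterministic initial particles placed far from the target mode.
    init = [-5.0 + 0.1 * i for i in range(20)]
    final = run_svgd(init, target_score)
    print("mean:", sum(final) / len(final))
    print("range:", min(final), max(final))
```

Only the score of the target enters the update, so the target density needs to be known only up to its normalizing constant. The kernel-gradient term keeps the particles from collapsing onto the mode.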

Approximate Bayesian Computation via Classification

Yuexi Wang , Tetsuya Kaji , Veronika Ročková

Categories: Machine Learning (Statistics)

2021-11-22

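Approximate Bayesian Computation (ABC) enables statistical inference in complex models whose likelihoods are difficult to calculate but easy to simulate from. ABC constructs a kernel-type approximation to the posterior distribution through an accept/reject mechanism that compares summary statistics of real and simulated data. To obviate the need for summary statistics, we directly compare empirical distributions using a Kullback-Leibler (KL) divergence estimator obtained via classification. In particular, we blend flexible machine learning classifiers within ABC to automate fake/real data comparisons. We consider the traditional accept/reject kernel as well as an exponential weighting scheme that does not require the ABC acceptance threshold. Our theoretical results show that the rate at which our ABC posterior distributions concentrate around the true parameter depends on the estimation error of the classifier. We derive limiting posterior shape results and find that, with a properly scaled exponential kernel, asymptotic normality holds. We demonstrate the usefulness of our approach on simulated examples as well as on real data in the context of stock volatility estimation.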

Strong identifiability and parameter learning in regression with heterogeneous response

Dat Do , Linh Do , XuanLong Nguyen

Categories: Machine Learning (Statistics)

2022-12-08

Mixtures of regression are a powerful class of models for regression learning with respect to a highly uncertain and heterogeneous response variable of interest. In addition to being a rich predictive model for the response given some covariates, the parameters in this model class provide useful information about the heterogeneity in the data population, which is represented by the conditional distributions for the response given the covariates associated with a number of distinct but latent subpopulations. In this paper, we investigate conditions of strong identifiability, rates of convergence for conditional density and parameter estimation, and the Bayesian posterior contraction behavior arising in finite mixture of regression models, under exact-fitted and over-fitted settings and when the number of components is unknown. This theory is applicable to common choices of link functions and families of conditional distributions employed by practitioners. We provide simulation studies and data illustrations, which shed some light on the parameter learning behavior found in several popular regression mixture models reported in the literature.
