Nearest neighbor methods are a popular class of nonparametric estimators with several desirable properties, such as adaptivity to different distance scales in different regions of space. Prior work on convergence rates for nearest neighbor classification has not fully reflected these subtle properties. We analyze the behavior of these estimators in metric spaces and provide finite-sample, distribution-dependent rates of convergence under minimal assumptions. As a by-product, we are able to establish the universal consistency of nearest neighbor in a broader range of data spaces than was previously known. We illustrate our upper and lower bounds by introducing smoothness classes that are customized for nearest neighbor classification.
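The plain k-nearest-neighbour classification rule that this line of work analyzes can be sketched in a few lines. This is only an illustrative NumPy version, not the authors' code; `knn_predict` and the toy data are ours:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict labels by majority vote among the k nearest training points."""
    preds = []
    for x in X_test:
        # Euclidean distances to every training point
        dists = np.linalg.norm(X_train - x, axis=1)
        votes = y_train[np.argsort(dists)[:k]]
        # majority vote (ties broken by the smallest label)
        labels, counts = np.unique(votes, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Two well-separated clusters: the 3-NN rule recovers the labels exactly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(3, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
assert (knn_predict(X, y, X, k=3) == y).all()
```

The distance computation is the only place the metric enters, which is why the analysis above extends naturally to general metric spaces.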
Consider the standard linear regression model Y = Xβ* + w, where Y ∈ R^n is an observation vector, X ∈ R^{n×d} is a design matrix, β* ∈ R^d is the unknown regression vector, and w ∼ N(0, σ²I) is additive Gaussian noise. This paper studies the minimax rates of convergence for estimation of β* in ℓ_p-losses and in the ℓ_2-prediction loss, assuming that β* belongs to an ℓ_q-ball B_q(R_q) for some q ∈ [0, 1]. We show that under suitable regularity conditions on the design matrix X, the minimax error in ℓ_2-loss and ℓ_2-prediction loss scales as R_q (log d / n)^{1−q/2}. In addition, we provide lower bounds on minimax risks in ℓ_p-norms, for all p ∈ [1, +∞], p ≠ q. Our proofs of the lower bounds are information-theoretic in nature, based on Fano's inequality and results on the metric entropy of the balls B_q(R_q), whereas our proofs of the upper bounds are direct and constructive, involving a direct analysis of least squares over ℓ_q-balls. For the special case q = 0, a comparison with ℓ_2-risks achieved by computationally efficient ℓ_1-relaxations reveals that although such methods can achieve the minimax rates up to constant factors, they require slightly stronger assumptions on the design matrix X than algorithms involving least squares over the ℓ_0-ball.
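The ℓ_1-relaxation mentioned for the q = 0 case can be illustrated with a small proximal-gradient (ISTA) lasso solver. This is only a sketch of the kind of computationally efficient method the comparison refers to; the regularization level and the toy instance are our illustrative choices:

```python
import numpy as np

def ista_lasso(X, y, lam, steps=500):
    """min_b 0.5*||y - X b||^2 + lam*||b||_1 via proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        z = b - X.T @ (X @ b - y) / L
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

rng = np.random.default_rng(1)
n, d, sigma = 200, 50, 0.1
beta_star = np.zeros(d); beta_star[:2] = 1.0     # q = 0 ball: s = 2 spikes
X = rng.normal(size=(n, d))
y = X @ beta_star + sigma * rng.normal(size=n)
lam = sigma * np.sqrt(2 * n * np.log(d))         # standard noise-level choice
beta_hat = ista_lasso(X, y, lam)
assert np.linalg.norm(beta_hat - beta_star) < 0.2
```

With this random Gaussian design the restricted-eigenvalue-type conditions the abstract alludes to hold with high probability, which is why the relaxation recovers β* accurately here.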
We study generalized density-based clustering in which sharply defined clusters such as clusters on lower-dimensional manifolds are allowed. We show that accurate clustering is possible even in high dimensions. We propose two data-based methods for choosing the bandwidth and we study the stability properties of density clusters. We show that a simple graph-based algorithm successfully approximates the high density clusters.
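A minimal version of a simple graph-based density-clustering algorithm of the kind referred to above might look as follows. This is our own illustrative sketch (kNN density estimate, threshold at a density level, then connected components of a neighbourhood graph among the surviving points); all parameter names are ours, not the paper's:

```python
import numpy as np
from itertools import combinations

def density_clusters(X, k=5, level=0.5, eps=1.0):
    """1. kNN density estimate (large k-th NN distance => low density),
    2. keep the `level` fraction of points with highest estimated density,
    3. clusters = connected components of the eps-graph on survivors."""
    D = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    knn_dist = np.sort(D, axis=1)[:, k]           # distance to k-th neighbour
    keep = np.where(knn_dist <= np.quantile(knn_dist, level))[0]
    parent = {i: i for i in keep}                 # union-find over survivors
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i, j in combinations(keep, 2):
        if D[i, j] <= eps:
            parent[find(i)] = find(j)
    roots = {}
    labels = [roots.setdefault(find(i), len(roots)) for i in keep]
    return labels, keep

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.1, (30, 2)), rng.normal(5, 0.1, (30, 2))])
labels, keep = density_clusters(X)
assert len(set(labels)) == 2                      # two high-density clusters
```

Thresholding before building the graph is what lets the procedure isolate sharply defined high-density regions, in the spirit of the result above.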
Sparse additive models are families of d-variate functions with the additive decomposition f* = ∑_{j∈S} f*_j, where S is an unknown subset of cardinality s ≪ d. In this paper, we consider the case where each univariate component function f*_j lies in a reproducing kernel Hilbert space (RKHS), and analyze a method for estimating the unknown function f* based on kernels combined with ℓ_1-type convex regularization. Working within a high-dimensional framework that allows both the dimension d and sparsity s to increase with n, we derive convergence rates in the L_2(P) and L_2(P_n) norms over the class F_{d,s,H} of sparse additive models with each univariate function f*_j in the unit ball of a univariate RKHS with bounded kernel function. We complement our upper bounds by deriving minimax lower bounds on the L_2(P) error, thereby showing the optimality of our method. Thus, we obtain optimal minimax rates for many interesting classes of sparse additive models, including polynomials, splines, and Sobolev classes. We also show that if, in contrast to our univariate conditions, the d-variate function class is assumed to be globally bounded, then much faster estimation rates are possible for any sparsity s = Ω(√n), showing that global boundedness is a significant restriction in the high-dimensional setting.
Performance bounds for criteria for model selection are developed using recent theory for sieves. The model selection criteria are based on an empirical loss or contrast function with an added penalty term motivated by empirical process theory and roughly proportional to the number of parameters needed to describe the model divided by the number of observations. Most of our examples involve density or regression estimation settings and we focus on the problem of estimating the unknown density or regression function. We show that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve. This accuracy index quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size. If we choose a list of models which exhibit good approximation properties with respect to different classes of smoothness, the estimator can be simultaneously minimax rate optimal in each of those classes. This is what is usually called adaptation. The type of classes of smoothness in which one gets adaptation depends heavily on the list of models. If too many models are involved in order to get accurate approximation of many wide classes of functions simultaneously, it may happen that the estimator is only approximately adaptive (typically up to a slowly varying function of the sample size). We shall provide various illustrations of our method such as penalized maximum likelihood, projection or least squares estimation. The models will involve commonly used finite dimensional expansions such as piecewise polynomials with fixed or variable knots, trigonometric polynomials, wavelets, neural nets and related nonlinear expansions defined by superposition of ridge functions.
A. Barron et al. Work supported in part by NSF grant ECS-9410760, URA CNRS 1321 "Statistique et modèles aléatoires", and URA CNRS 743 "Modélisation stochastique et Statistique".
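A penalized criterion of the kind described above, with penalty roughly proportional to (number of parameters)/(number of observations), can be illustrated on polynomial regression. This is a hedged sketch; the constant c and the monomial basis are our illustrative choices, not the paper's:

```python
import numpy as np

def select_degree(x, y, max_deg=10, c=2.0):
    """Pick the polynomial degree minimizing empirical risk + c * dim / n."""
    n = len(x)
    best = None
    for deg in range(max_deg + 1):
        A = np.vander(x, deg + 1)            # monomial design, dim = deg + 1
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        crit = np.mean((A @ coef - y) ** 2) + c * (deg + 1) / n
        if best is None or crit < best[1]:
            best = (deg, crit)
    return best[0]

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(-1, 1, n)
y = 1 + x + x**2 + 0.1 * rng.normal(size=n)  # quadratic truth, mild noise
assert select_degree(x, y) == 2
```

The penalty stops the criterion from rewarding the small in-sample risk reduction that each extra (useless) parameter buys, which is exactly the trade-off the accuracy index quantifies.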
The single-index model is a powerful yet simple model, widely used in statistics, machine learning, and other scientific fields. It models the regression function as g(⟨a, x⟩), where a is an unknown index vector and x are the features. This paper discusses a nonlinear generalization of this framework that allows regressors to use multiple index vectors, adapting to local variation in the responses. To do so, we exploit the conditional distribution over function-driven partitions, and use linear regression to estimate the index vectors locally. We then regress by applying a kNN-type estimator that uses a localized geodesic metric. We present theoretical guarantees for the estimation of local index vectors and for out-of-sample prediction, and demonstrate the experimental performance of our method on synthetic and real-world data sets, comparing it against state-of-the-art methods.
We consider a problem of multiclass classification, where the training sample S_n = {(X_i, Y_i)}_{i=1}^n is generated from the model P(Y = m | X = x) = η_m(x), 1 ≤ m ≤ M, and η_1 …
We present new excess risk bounds for general unbounded loss functions, including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to η-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric, as long as the learning rate η is set correctly. For general loss functions, our bounds rely on two separate conditions: the v-GRIP (generalized reverse information projection) conditions, which control the lower tail of the excess loss, and the newly introduced witness condition, which controls the upper tail. The parameter v in the v-GRIP conditions determines the achievable rate, and is akin to the exponent in the Tsybakov margin condition and to the Bernstein condition for bounded losses, both of which the v-GRIP conditions generalize; favorable v in combination with small model complexity leads to rates of order Õ(1/n). The witness condition allows us to connect the excess risk to an "annealed" version thereof, through which we generalize several previous results connecting Hellinger and Rényi divergences to KL divergence.
Past work has shown, somewhat surprisingly, that over-parameterization can help generalization in neural networks. To explain this phenomenon, we adopt a margin-based perspective. We establish: 1) for multi-layer feedforward networks, the global minimizer of a weakly regularized cross-entropy loss has the maximum normalized margin among all networks; 2) as a result, increasing over-parameterization improves the normalized margin and the generalization error bounds for two-layer networks. In particular, an infinite-size neural network enjoys the best generalization guarantees. The typical infinite-feature methods are kernel methods; we compare the neural net margin with kernel methods and construct natural instances where kernel methods have much weaker generalization guarantees. We empirically validate this gap between the two approaches. Finally, this infinite-neuron viewpoint is also fruitful for analyzing optimization. We show that a perturbed gradient flow on infinite-size networks finds a global optimizer in polynomial time.
We study the stochastic batched convex optimization problem, in which we use many parallel observations to optimize a convex function given limited rounds of interaction. In each of M rounds, an algorithm may query for information at n points, and after issuing all n queries, it receives un-biased noisy function and/or (sub)gradient evaluations at the n points. After M such rounds, the algorithm must output an estimator. We provide lower and upper bounds on the performance of such batched convex optimization algorithms in zeroth and first-order settings for Lipschitz convex and smooth strongly convex functions. Our rates of convergence (nearly) achieve the fully sequential rate once M = O(d log log n), where d is the problem dimension, but the rates may exponentially degrade as the dimension d increases, in distinction from fully sequential settings.
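A minimal example of a batched first-order algorithm in this model: in each of M rounds, issue n parallel noisy gradient queries at the current iterate, then take a single averaged step. This is our own illustrative sketch, not one of the paper's rate-optimal procedures:

```python
import numpy as np

def batched_gd(grad_oracle, x0, M=10, n=100, lr=0.3):
    """M rounds of interaction; each round issues n parallel noisy gradient
    queries at the current iterate and averages them before one descent step."""
    x = np.array(x0, dtype=float)
    for _ in range(M):
        g = np.mean([grad_oracle(x) for _ in range(n)], axis=0)
        x = x - lr * g
    return x

# Noisy first-order oracle for the smooth, strongly convex f(x) = 0.5*||x - x*||^2.
rng = np.random.default_rng(4)
x_star = np.array([1.0, -2.0])
oracle = lambda x: (x - x_star) + 0.5 * rng.normal(size=2)
x_hat = batched_gd(oracle, np.zeros(2))
assert np.linalg.norm(x_hat - x_star) < 0.2
```

Averaging within a round shrinks the gradient noise by a factor of √n, but all n queries of a round must share the same iterate, which is the key constraint separating batched from fully sequential optimization.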
We consider the estimation of two-sample integral functionals of the type that arise naturally, for example, when the object of interest is a divergence between unknown probability densities. Our first main result is that, in a broad sense, weighted nearest neighbour estimators are efficient, in the sense of achieving the local asymptotic minimax lower bound. Moreover, we also prove a corresponding central limit theorem, which facilitates the construction of asymptotically valid confidence intervals for the functional of minimal width. An interesting consequence of our results is the discovery that, for certain functionals, the worst-case performance of our estimator may improve on that of the natural "oracle" estimator, which is allowed access to the values of the unknown densities at the observations.
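For intuition, a basic (unweighted) 1-NN estimator of a two-sample functional, here the KL divergence, can be written as follows. The efficient estimators studied above are weighted refinements of this kind of construction; this sketch is only illustrative:

```python
import numpy as np

def kl_nn(X, Y):
    """1-NN estimate of KL(P||Q) from samples X ~ P (n x d) and Y ~ Q (m x d),
    built from nearest-neighbour distances within X and from X to Y."""
    n, d = X.shape
    m = Y.shape[0]
    DX = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    np.fill_diagonal(DX, np.inf)
    rho = DX.min(axis=1)                                  # NN distance within X
    nu = np.linalg.norm(X[:, None] - Y[None, :], axis=2).min(axis=1)  # to Y
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 1))
Y = rng.normal(size=(500, 1))
est = kl_nn(X, Y)          # identical distributions: true KL is 0
assert abs(est) < 0.3
```

The ratio ν_i/ρ_i compares local density of Q to local density of P at each observation, which is the sense in which such estimators adapt to the unknown densities.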
Most high-dimensional estimation and prediction methods propose to minimize a cost function (empirical risk) that is written as a sum of losses associated to each data point (each example). In this paper we focus on the case of non-convex losses, which is practically important but still poorly understood. Classical empirical process theory implies uniform convergence of the empirical (or sample) risk to the population risk. While uniform convergence, under additional assumptions, implies consistency of the resulting M-estimator, it does not ensure that the latter can be computed efficiently. In order to capture the complexity of computing M-estimators, we propose to study the landscape of the empirical risk, namely its stationary points and their properties. We establish uniform convergence of the gradient and Hessian of the empirical risk to their population counterparts, as soon as the number of samples becomes larger than the number of unknown parameters (modulo logarithmic factors). Consequently, good properties of the population risk can be carried over to the empirical risk, and we are able to establish a one-to-one correspondence of their stationary points. We demonstrate that in several problems such as non-convex binary classification, robust regression, and Gaussian mixture models, this result implies a complete characterization of the landscape of the empirical risk, and of the convergence properties of descent algorithms. We extend our analysis to the very high-dimensional setting in which the number of parameters exceeds the number of samples, and provide a characterization of the empirical risk landscape under a nearly information-theoretically minimal condition. Namely, if the number of samples exceeds the sparsity of the unknown parameter vector (modulo logarithmic factors), then a suitable uniform convergence result takes place. We apply this result to non-convex binary classification and robust regression in very high dimensions.
The 'classical' random graph models, in particular G(n, p), are 'homogeneous', in the sense that the degrees (for example) tend to be concentrated around a typical value. Many graphs arising in the real world do not have this property, having, for example, power-law degree distributions. Thus there has been a lot of recent interest in defining and studying 'inhomogeneous' random graph models. One of the most studied properties of these new models is their 'robustness', or, equivalently, the 'phase transition' as an edge density parameter is varied. For G(n, p), p = c/n, the phase transition at c = 1 has been a central topic in the study of random graphs for well over 40 years. Many of the new inhomogeneous models are rather complicated; although there are exceptions, in most cases precise questions such as determining exactly the critical point of the phase transition are approachable only when there is independence between the edges. Fortunately, some models studied have this property already, and others can be approximated by models with independence. Here we introduce a very general model of an inhomogeneous random graph with (conditional) independence between the edges, which scales so that the number of edges is linear in the number of vertices. This scaling corresponds to the p = c/n scaling for G(n, p) used to study the phase transition; also, it seems to be a property of many large real-world graphs. Our model includes as special cases many models previously studied. We show that, under one very weak assumption (that the expected number of edges is 'what it should be'), many properties of the model can be determined, in particular the critical point of the phase transition, and the size of the giant component above the transition. We do this by relating our random graphs to branching processes, which are much easier to analyze.
We also consider other properties of the model, showing, for example, that when there is a giant component, it is 'stable': for a typical random graph, no matter how we add or delete o(n) edges, the size of the giant component does not change by more than o(n).
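The branching-process heuristic can be checked numerically for G(n, c/n): above the transition, the giant-component fraction should approach the survival probability ρ solving ρ = 1 − e^{−cρ}. An illustrative simulation (our code, not the paper's):

```python
import numpy as np

def giant_fraction(n, c, rng):
    """Size of the largest component of G(n, c/n), as a fraction of n."""
    parent = list(range(n))
    def find(i):                       # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    ii, jj = np.triu_indices(n, 1)
    mask = rng.random(len(ii)) < c / n          # each edge present w.p. c/n
    for i, j in zip(ii[mask], jj[mask]):
        parent[find(i)] = find(j)
    _, counts = np.unique([find(i) for i in range(n)], return_counts=True)
    return counts.max() / n

rng = np.random.default_rng(5)
c = 2.0
rho = 0.5                                       # solve rho = 1 - exp(-c*rho)
for _ in range(100):
    rho = 1 - np.exp(-c * rho)
frac = giant_fraction(2000, c, rng)
assert abs(frac - rho) < 0.05
```

The fixed-point equation is exactly the survival probability of a Poisson(c) branching process, which is the connection the abstract exploits.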
The increased availability of massive data sets provides a unique opportunity to discover subtle patterns in their distributions, but also imposes overwhelming computational challenges. To fully utilize the information contained in big data, we propose a two-step procedure: (i) estimate conditional quantile functions at different levels in a parallel computing environment; (ii) construct a conditional quantile regression process through projection based on these estimated quantile curves. Our general quantile regression framework covers both linear models with fixed or growing dimension and series approximation models. We prove that the proposed procedure does not sacrifice any statistical inferential accuracy provided that the number of distributed computing units and quantile levels are chosen properly. In particular, a sharp upper bound for the former and a sharp lower bound for the latter are derived to capture the minimal computational cost from a statistical perspective. As an important application, the statistical inference on conditional distribution functions is considered. Moreover, we propose computationally efficient approaches to conducting inference in the distributed estimation setting described above. Those approaches directly utilize the availability of estimators from sub-samples and can be carried out at almost no additional computational cost. Simulations confirm our statistical inferential theory.
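Step (i) of the procedure, estimating quantiles on distributed sub-samples and combining the results, can be sketched as follows. This is a toy one-dimensional illustration without covariates; the actual method fits conditional quantile regressions on each computing unit:

```python
import numpy as np

def distributed_quantiles(y_chunks, taus):
    """Each computing unit estimates the quantiles on its own chunk;
    the master then averages the per-chunk estimates."""
    per_chunk = np.array([[np.quantile(y, t) for t in taus] for y in y_chunks])
    return per_chunk.mean(axis=0)

rng = np.random.default_rng(6)
chunks = [rng.normal(size=5000) for _ in range(8)]   # 8 computing units
q = distributed_quantiles(chunks, [0.25, 0.5, 0.75])
assert abs(q[1]) < 0.05                  # median of N(0,1) is 0
assert abs(q[2] - 0.6745) < 0.05         # 0.75-quantile of N(0,1)
```

The statistical question studied above is how many units and how many quantile levels τ one can afford before this divide-and-combine scheme starts losing inferential accuracy relative to the full-sample fit.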
There is accumulating evidence in the literature that stability of learning algorithms is a key characteristic that permits a learning algorithm to generalize. Despite various insightful results in this direction, there seems to be an overlooked dichotomy in the stability-based generalization bounds available in the literature. On one hand, the literature seems to suggest that exponential generalization bounds for the estimated risk are optimal and can only be obtained through stringent, distribution-independent, and computationally intractable notions of stability such as uniform stability. On the other hand, it seems that weaker notions of stability such as hypothesis stability, although distribution-dependent and more amenable to computation, can only yield polynomial generalization bounds for the estimated risk, which are suboptimal. In this paper, we address the gap between these two regimes of results. In particular, the main question we address here is whether it is possible to derive exponential generalization bounds for the estimated risk using a notion of stability that is computationally tractable and distribution-dependent, but weaker than uniform stability. Using recent advances in concentration inequalities, and using a notion of stability that is weaker than uniform stability but distribution-dependent and amenable to computation, we derive an exponential tail bound for the concentration of the estimated risk of a hypothesis returned by a general learning rule, where the estimated risk is expressed in terms of either the resubstitution estimate (empirical error) or the deleted (or leave-one-out) estimate. As an illustration, we derive exponential tail bounds for ridge regression with unbounded responses, a setting where the uniform stability results of Bousquet and Elisseeff (2002) are not applicable.
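For the ridge-regression example mentioned at the end, the deleted (leave-one-out) estimate admits an exact closed-form shortcut via the hat matrix, which is what makes such distribution-dependent quantities computable in practice. This is the standard identity, in our own illustrative code:

```python
import numpy as np

def ridge_deleted_estimate(X, y, lam):
    """Deleted (leave-one-out) estimate of the squared-error risk for ridge
    regression, via the exact shortcut e_i / (1 - H_ii) with hat matrix H."""
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
    resid = y - H @ y
    loo_resid = resid / (1 - np.diag(H))
    return np.mean(loo_resid ** 2)

rng = np.random.default_rng(7)
n, d = 100, 5
X = rng.normal(size=(n, d))
beta = rng.normal(size=d)
y = X @ beta + 0.5 * rng.normal(size=n)
est = ridge_deleted_estimate(X, y, lam=1.0)
# close to the noise level sigma^2 = 0.25 for this well-specified model
assert 0.1 < est < 0.6
```

Note that the responses here are Gaussian and hence unbounded, exactly the regime where uniform-stability bounds break down.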
The last few years have witnessed important new developments in the theory and practice of pattern classification. We intend to survey some of the main new ideas that have led to these recent results. Mathematics Subject Classification. 62G08, 60E15, 68Q32.
We study contextual bandit learning with an abstract policy class and a continuous action space. We obtain two qualitatively different regret bounds: one competes with a smoothed version of the policy class under no continuity assumptions, while the other requires standard Lipschitz assumptions. Both bounds exhibit data-dependent "zooming" behavior and, with no tuning, yield improved guarantees for benign problems. We also study adapting to unknown smoothness parameters, establishing a price for adaptivity and deriving optimal adaptive algorithms that require no additional information.
This paper is concerned with ranking multivariate unlabeled observations according to their degree of abnormality, viewed as an unsupervised statistical learning task. In the 1-d situation, this problem is usually tackled through tail estimation techniques: univariate observations are viewed as all the more "abnormal" as they are located in the tails of the underlying probability distribution. It would also be desirable to define a scalar-valued "scoring" function so as to compare the degrees of abnormality of multivariate observations. Here we formulate the issue of scoring anomalies as an M-estimation problem by means of a novel functional performance criterion, referred to as the Mass Volume (MV) curve, whose optimal elements are, almost everywhere on the support, strictly increasing transforms of the density. We first study the statistical estimation of the MV curve of a given scoring function, and we provide a strategy for building confidence regions using a smoothed bootstrap approach. Optimization of this functional criterion over the set of piecewise constant scoring functions is tackled next. This boils down to estimating a sequence of empirical minimum volume sets whose levels are chosen adaptively from the data, so as to adjust to the variations of the optimal MV curve, while controlling the bias of its approximation by a stepwise curve. Generalization bounds are then established for the difference between the MV curve of the empirical scoring function thus obtained and the optimal MV curve.
Inspired by the success of AlphaGo Zero (AGZ), which utilizes Monte Carlo Tree Search (MCTS) with supervised learning via a neural network to learn the optimal policy and value function, in this work we focus on formally establishing that such an approach indeed finds the asymptotically optimal policy, and on establishing non-asymptotic guarantees along the way. We focus on infinite-horizon discounted Markov decision processes to establish the results. To start with, this requires establishing a property claimed for MCTS in the literature: for any given query state, MCTS provides an approximate value function for that state after sufficiently many simulated MDP steps. We provide a non-asymptotic analysis establishing this property by analyzing a non-stationary multi-armed bandit setup. Our proofs suggest that MCTS needs to utilize a polynomial rather than logarithmic "upper confidence bound" to establish its desired performance; interestingly, AGZ chooses such a polynomial bound. Using this as a building block, combined with nearest neighbor supervised learning, we argue that MCTS acts as a "policy improvement" operator: it has a natural "bootstrapping" property that iteratively improves the value function approximation for all states, due to the combination with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε-approximation of the value function with respect to the ℓ∞ norm, MCTS combined with nearest neighbor supervised learning requires a number of samples scaling as Õ(ε^{-(d+4)}), where d is the dimension of the state space. This is nearly optimal due to a minimax lower bound of Ω̃(ε^{-(d+2)}).
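The distinction between logarithmic and polynomial "upper confidence bounds" can be illustrated on a plain stochastic bandit. This is our toy sketch; the actual analysis concerns the non-stationary bandits arising inside the MCTS tree:

```python
import numpy as np

def ucb_bandit(means, T, bonus, rng):
    """Generic UCB: pull the arm maximizing empirical mean + bonus(t, pulls)."""
    k = len(means)
    pulls = np.zeros(k)
    sums = np.zeros(k)
    for t in range(1, T + 1):
        if t <= k:
            a = t - 1                                # initialize: each arm once
        else:
            a = int(np.argmax(sums / pulls + bonus(t, pulls)))
        pulls[a] += 1
        sums[a] += means[a] + rng.normal(scale=0.1)  # noisy reward
    return pulls

rng = np.random.default_rng(8)
# Polynomial exploration bonus t^{1/4}/sqrt(s) instead of the classical
# sqrt(2 log t / s); the analysis above argues MCTS needs the polynomial form.
poly_bonus = lambda t, s: t ** 0.25 / np.sqrt(s)
pulls = ucb_bandit([0.3, 0.7], T=5000, bonus=poly_bonus, rng=rng)
assert pulls[1] > pulls[0]               # the better arm is pulled most often
```

The polynomial bonus explores more aggressively than the logarithmic one, which is what makes the concentration argument go through when the arms' reward distributions drift over time.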
We study high-dimensional distribution learning in an agnostic setting where an adversary is allowed to arbitrarily corrupt an ε-fraction of the samples. Such questions have a rich history spanning statistics, machine learning and theoretical computer science. Even in the most basic settings, the only known approaches are either computationally inefficient or lose dimension-dependent factors in their error guarantees. This raises the following question: Is high-dimensional agnostic distribution learning even possible, algorithmically? In this work, we obtain the first computationally efficient algorithms with dimension-independent error guarantees for agnostically learning several fundamental classes of high-dimensional distributions: (1) a single Gaussian, (2) a product distribution on the hypercube, (3) mixtures of two product distributions (under a natural balancedness condition), and (4) mixtures of spherical Gaussians. Our algorithms achieve error that is independent of the dimension, and in many cases scales nearly-linearly with the fraction of adversarially corrupted samples. Moreover, we develop a general recipe for detecting and correcting corruptions in high-dimensions that may be applicable to many other problems.
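A much-simplified version of a spectral detect-and-correct recipe for robust mean estimation can be sketched as follows. This is our illustrative sketch, not the paper's algorithm: while the top eigenvalue of the empirical covariance looks inflated, remove the point with the largest projection onto the top eigenvector; the threshold constant is an illustrative choice:

```python
import numpy as np

def filter_mean(X, eps, sigma=1.0):
    """Iterative spectral filtering sketch: corruptions that shift the mean
    must also inflate the covariance in some direction, so prune along the
    top eigenvector until the spectrum looks consistent with N(mu, sigma^2 I)."""
    X = X.copy()
    while len(X) > 1:
        mu = X.mean(axis=0)
        vals, vecs = np.linalg.eigh(np.cov(X.T))
        if vals[-1] <= sigma**2 * (1 + 10 * eps):    # covariance looks clean
            return mu
        proj = np.abs((X - mu) @ vecs[:, -1])        # scores along top direction
        X = np.delete(X, np.argmax(proj), axis=0)
    return X.mean(axis=0)

rng = np.random.default_rng(9)
n, d, eps = 500, 10, 0.1
X = rng.normal(size=(n, d))                          # inliers: N(0, I)
X[: int(eps * n)] += 8.0                             # adversarially shifted points
mu_hat = filter_mean(X, eps)
assert np.linalg.norm(mu_hat) < 1.0                  # naive mean is far worse
```

The key idea, which the general recipe above formalizes, is that any ε-fraction of corruptions capable of moving the mean by much must leave a detectable spectral signature.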