We introduce a very general method for high dimensional classification, based on careful combination of the results of applying an arbitrary base classifier to random projections of the feature vectors into a lower dimensional space. In one special case that we study in detail, the random projections are divided into disjoint groups, and within each group we select the projection yielding the smallest estimate of the test error. Our random-projection ensemble classifier then aggregates the results of applying the base classifier on the selected projections, with a data-driven voting threshold to determine the final assignment. Our theoretical results elucidate the effect on performance of increasing the number of projections. Moreover, under a boundary condition that is implied by the sufficient dimension reduction assumption, we show that the test excess risk of the random-projection ensemble classifier can be controlled by terms that do not depend on the original data dimension and a term that becomes negligible as the number of projections increases. The classifier is also compared empirically with several other popular high dimensional classifiers via an extensive simulation study, which reveals its excellent finite sample performance.
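To make the procedure above concrete, here is a minimal sketch, not the authors' implementation: it draws B1 groups of B2 Gaussian projections, keeps the projection with the smallest cross-validated error estimate in each group (a k-nearest-neighbour base classifier is an arbitrary stand-in), and aggregates votes against a threshold; the empirical class-1 proportion is used as a crude placeholder for the paper's data-driven threshold.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def rp_ensemble_fit_predict(X, y, X_test, d=2, B1=50, B2=10, alpha=None, rng=None):
    # Illustrative sketch of the random-projection ensemble idea; labels assumed in {0, 1}.
    rng = np.random.default_rng(rng)
    n, p = X.shape
    selected = []  # (projection matrix, fitted base classifier) for each group
    for _ in range(B1):
        best_err, best = np.inf, None
        for _ in range(B2):
            A = rng.standard_normal((p, d)) / np.sqrt(d)   # random projection to dimension d
            clf = KNeighborsClassifier(n_neighbors=5)
            err = 1.0 - cross_val_score(clf, X @ A, y, cv=5).mean()  # test-error estimate
            if err < best_err:
                best_err, best = err, (A, clf.fit(X @ A, y))
        selected.append(best)
    # Fraction of selected classifiers voting for class 1, thresholded at alpha.
    votes = np.mean([clf.predict(X_test @ A) for A, clf in selected], axis=0)
    if alpha is None:
        alpha = y.mean()  # crude stand-in for the paper's data-driven threshold
    return (votes >= alpha).astype(int)
```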
The problem of learning linear-discriminant concepts can be solved by various mistake-driven update procedures, including the Winnow family of algorithms and the well-known Perceptron algorithm. In this paper we define the general class of "quasi-additive" algorithms, which includes Perceptron and Winnow as special cases. We give a single proof of convergence that covers a broad subset of algorithms in this class, including both Perceptron and Winnow, but also many new algorithms. Our proof hinges on analyzing a generic measure of progress construction that gives insight as to when and how such algorithms converge. Our measure of progress construction also permits us to obtain good mistake bounds for individual algorithms. We apply our unified analysis to new algorithms as well as existing algorithms. When applied to known algorithms, our method "automatically" produces close variants of existing proofs (recovering similar bounds), thus showing that, in a certain sense, these seemingly diverse results are fundamentally isomorphic. However, we also demonstrate that the unifying principles are more broadly applicable, and analyze a new class of algorithms that smoothly interpolate between the additive-update behavior of Perceptron and the multiplicative-update behavior of Winnow.
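As an illustration of the two endpoints of this class, the sketch below contrasts the additive Perceptron update with a normalized multiplicative (Winnow-style) update; it is a toy rendering of the mistake-driven template, not the paper's analysis.

```python
import numpy as np

def perceptron(X, y, epochs=10, eta=1.0):
    # y in {-1, +1}; additive update w <- w + eta * y_i * x_i on mistakes
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w += eta * y_i * x_i
    return w

def winnow_like(X, y, epochs=10, eta=1.0):
    # y in {-1, +1}, features assumed nonnegative; multiplicative (exponentiated)
    # update w_j <- w_j * exp(eta * y_i * x_ij) on mistakes, then renormalize
    w = np.ones(X.shape[1]) / X.shape[1]
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:
                w *= np.exp(eta * y_i * x_i)
                w /= w.sum()
    return w
```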
The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
We consider inference about a scalar parameter under a nonparametric model, based on one-step estimators computed as a plug-in estimator plus the empirical mean of an estimator of the parameter's influence function. We focus on a class of parameters whose influence functions depend on two infinite-dimensional nuisance functions and are such that the bias of the one-step estimator is the expectation of the product of the estimation errors of the two nuisance functions. Our class includes many important treatment-effect contrasts of interest in causal inference and econometrics, such as the ATE, the ATT, aggregated causal contrasts under continuous treatments, and the mean of an outcome missing not at random. We propose estimators of the target parameter that entertain approximately sparse regression models for the nuisance functions, allowing the number of potential confounders to be even larger than the sample size. By employing sample splitting, cross-fitting, and $\ell_1$-regularized regression estimators of the nuisance functions whose objective functions have directional derivatives that agree with those of the parameter's influence function, we obtain estimators of the target parameter with two desirable robustness properties: (1) they are rate doubly robust, in that they are root-n consistent and asymptotically normal when both nuisance functions follow approximately sparse models, even if one function has regression coefficients that are not very sparse, provided the other has sufficiently sparse regression coefficients; and (2) they are model doubly robust, in that they are root-n consistent and asymptotically normal even if one of the nuisance functions does not follow an approximately sparse model, provided the other follows an approximately sparse model with sufficiently sparse regression coefficients.
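For concreteness, the following sketch shows one familiar member of this family: a cross-fitted augmented-inverse-probability-weighted (AIPW) estimator of the ATE with $\ell_1$-regularized nuisance regressions. It is an illustrative stand-in, not the estimators proposed in the paper, and the propensity clipping and fold count are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.model_selection import KFold

def cross_fitted_aipw_ate(X, A, Y, n_splits=5, random_state=0):
    # X: covariates, A: binary treatment in {0, 1}, Y: outcome
    psi = np.zeros(len(Y))
    for train, test in KFold(n_splits, shuffle=True, random_state=random_state).split(X):
        # Propensity score e(x) = P(A=1 | X=x), fitted with an l1 penalty
        ps = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5)
        ps.fit(X[train], A[train])
        e = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
        # Outcome regressions m_a(x) = E[Y | A=a, X=x], fitted with the lasso
        m1 = LassoCV(cv=5).fit(X[train][A[train] == 1], Y[train][A[train] == 1])
        m0 = LassoCV(cv=5).fit(X[train][A[train] == 0], Y[train][A[train] == 0])
        mu1, mu0 = m1.predict(X[test]), m0.predict(X[test])
        # Influence-function (AIPW) score evaluated on the held-out fold
        psi[test] = (mu1 - mu0
                     + A[test] * (Y[test] - mu1) / e
                     - (1 - A[test]) * (Y[test] - mu0) / (1 - e))
    ate = psi.mean()
    se = psi.std(ddof=1) / np.sqrt(len(Y))
    return ate, se
```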
We present new excess risk bounds for general unbounded loss functions, including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to $\eta$-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification, in terms of a generalization of the Hellinger metric, as long as the learning rate $\eta$ is set correctly. For general loss functions, our bounds rely on two separate conditions: the $v$-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss, and the newly introduced witness condition, which controls the upper tail. The parameter $v$ in the $v$-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the $v$-GRIP conditions generalize; favorable $v$ combined with small model complexity leads to rates of order $\tilde{O}(1/n)$. The witness condition allows us to connect the excess risk to an "annealed" version thereof, through which we generalize several previous results connecting Hellinger and Rényi divergences to the KL divergence.
This paper considers Bayesian counterparts of the classical tests for goodness of fit and their use in judging the fit of a single Bayesian model to the observed data. We focus on posterior predictive assessment, in a framework that also includes conditioning on auxiliary statistics. The Bayesian formulation facilitates the construction and calculation of a meaningful reference distribution not only for any (classical) statistic, but also for any parameter-dependent "statistic" or discrepancy. The latter allows us to propose the realized discrepancy assessment of model fitness, which directly measures the true discrepancy between data and the posited model, for any aspect of the model which we want to explore. The computation required for the realized discrepancy assessment is a straightforward byproduct of the posterior simulation used for the original Bayesian analysis. We illustrate with three applied examples. The first example, which serves mainly to motivate the work, illustrates the difficulty of classical tests in assessing the fitness of a Poisson model to a positron emission tomography image that is constrained to be nonnegative. The second and third examples illustrate the details of the posterior predictive approach in two problems: estimation in a model with inequality constraints on the parameters, and estimation in a mixture model. In all three examples, standard test statistics (either a χ² or a likelihood ratio) are not pivotal: the difficulty is not just how to compute the reference distribution for the test, but that in the classical framework no such distribution exists, independent of the unknown model parameters.
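As a small illustration of how such a check reuses the posterior simulation, the sketch below computes a posterior predictive p-value for a hypothetical Poisson model with a chi-square discrepancy; the model and discrepancy here are assumptions made for the example, not the paper's applications.

```python
import numpy as np

def posterior_predictive_pvalue(y, theta_draws, rng=None):
    # y: observed counts; theta_draws: posterior draws of the Poisson rate
    rng = np.random.default_rng(rng)

    def discrepancy(data, theta):
        # chi-square discrepancy sum((y_i - E[y_i])^2 / Var(y_i)); for Poisson both equal theta
        return np.sum((data - theta) ** 2 / theta)

    d_obs, d_rep = [], []
    for theta in theta_draws:
        y_rep = rng.poisson(theta, size=len(y))   # replicated data from this draw
        d_obs.append(discrepancy(y, theta))       # realized discrepancy of the observed data
        d_rep.append(discrepancy(y_rep, theta))   # reference discrepancy of replicated data
    return np.mean(np.array(d_rep) >= np.array(d_obs))
```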
Data with mixed-type (metric-ordinal-nominal) variables are typical for social stratification, i.e. partitioning a population into social classes. Approaches to cluster such data are compared, namely a latent class mixture model assuming local independence and dissimilarity-based methods such as k-medoids. The design of an appropriate dissimilarity measure and the estimation of the number of clusters are discussed as well, comparing the Bayesian information criterion with dissimilarity-based criteria. The comparison is based on a philosophy of cluster analysis that connects the problem of a choice of a suitable clustering method closely to the application by considering direct interpretations of the implications of the methodology. The application of this philosophy to economic data from the 2007 US Survey of Consumer Finances demonstrates techniques and decisions required to obtain an interpretable clustering. The clustering is shown to be significantly more structured than a suitable null model. One result is that the data-based strata are not as strongly connected to occupation categories as is often assumed in the literature.
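The sketch below illustrates the dissimilarity-based side of this comparison for mixed-type data: a simple Gower-type dissimilarity (range-scaled for metric variables, simple matching for nominal ones) followed by a naive k-medoids pass. It is an illustrative toy, not the paper's methodology, and it ignores ordinal variables and the choice of the number of clusters.

```python
import numpy as np

def gower(X, is_nominal):
    # X: numeric matrix (nominal variables coded as numbers); is_nominal: bool per column
    n, p = X.shape
    D = np.zeros((n, n))
    for j in range(p):
        col = X[:, j]
        if is_nominal[j]:
            d = (col[:, None] != col[None, :]).astype(float)   # simple matching distance
        else:
            span = col.max() - col.min() or 1.0
            d = np.abs(col[:, None] - col[None, :]) / span     # range-scaled L1 distance
        D += d
    return D / p

def k_medoids(D, k, n_iter=50, rng=None):
    # Naive alternating (PAM-style) k-medoids on a precomputed dissimilarity matrix
    rng = np.random.default_rng(rng)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) > 0:
                within = D[np.ix_(members, members)].sum(axis=1)
                new[c] = members[np.argmin(within)]            # most central member
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return np.argmin(D[:, medoids], axis=1), medoids
```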
From autonomous vehicles and reversing robots to virtual assistants recommending our next visit to the hair salon or the restaurant where we dine, machine learning systems are becoming increasingly ubiquitous. The main reason for this is the extraordinary predictive power of these methods. However, most of these models remain black boxes, meaning that it is very challenging for humans to follow and understand their intricate inner workings. As a consequence, interpretability has suffered under this ever-increasing complexity of machine learning models. Especially in light of new regulations such as the General Data Protection Regulation (GDPR), the reasonableness and predictability of the decisions made by these black boxes have become indispensable requirements. Driven by the needs of industry and practice, the research community has recognized this interpretability problem and has focused on developing a growing number of so-called explanation methods over the past few years. These methods explain individual predictions made by black-box machine learning models and help to recover some of the lost interpretability. With the proliferation of these explanation methods, however, it is often unclear which explanation method offers higher explanation quality, or is better suited to the situation at hand. In this thesis we therefore propose an axiomatic framework that allows the quality of different explanation methods to be compared. Through experimental validation, we find that the developed framework is useful for assessing the explanation quality of different explanation methods, and that it yields conclusions consistent with independent research.
In many real-world classification problems, the labels of training examples are randomly corrupted. Most previous theoretical work on classification with label noise assumes that the two classes are separable, that the label noise is independent of the true class label, or that the noise proportions for each class are known. In this work, we give conditions that are necessary and sufficient for the true class-conditional distributions to be identifiable. These conditions are weaker than those analyzed previously, and allow for the classes to be nonseparable and the noise levels to be asymmetric and unknown. The conditions essentially state that a majority of the observed labels are correct and that the true class-conditional distributions are "mutually irreducible," a concept we introduce that limits the similarity of the two distributions. For any label noise problem, there is a unique pair of true class-conditional distributions satisfying the proposed conditions, and we argue that this pair corresponds in a certain sense to maximal denoising of the observed distributions. Our results are facilitated by a connection to "mixture proportion estimation," which is the problem of estimating the maximal proportion of one distribution that is present in another. We establish a novel rate of convergence result for mixture proportion estimation, and apply this to obtain consistency of a discrimination rule based on surrogate loss minimization. Experimental results on benchmark data and a nuclear particle classification problem demonstrate the efficacy of our approach.
The F-measure, which has originally been introduced in information retrieval, is nowadays routinely used as a performance metric for problems such as binary classification, multi-label classification, and structured output prediction. Optimizing this measure is a statistically and computationally challenging problem, since no closed-form solution exists. Adopting a decision-theoretic perspective, this article provides a formal and experimental analysis of different approaches for maximizing the F-measure. We start with a Bayes-risk analysis of related loss functions, such as Hamming loss and subset zero-one loss, showing that optimizing such losses as a surrogate of the F-measure leads to a high worst-case regret. Subsequently, we perform a similar type of analysis for F-measure maximizing algorithms, showing that such algorithms are approximate, while relying on additional assumptions regarding the statistical distribution of the binary response variables. Furthermore, we present a new algorithm which is not only computationally efficient but also Bayes-optimal, regardless of the underlying distribution. To this end, the algorithm requires only a quadratic (with respect to the number of binary responses) number of parameters of the joint distribution. We illustrate the practical performance of all analyzed methods by means of experiments with multi-label classification problems.
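For orientation, the snippet below shows one simple and widely used surrogate: tuning a decision threshold to maximize empirical F1 on held-out data. This is not the Bayes-optimal algorithm analyzed in the article; it only illustrates why F-measure optimization is treated separately from Hamming-type losses.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, scores, grid=None):
    # y_true: binary labels; scores: predicted probabilities or scores on held-out data
    grid = np.linspace(0.05, 0.95, 19) if grid is None else grid
    f1s = [f1_score(y_true, (scores >= t).astype(int)) for t in grid]
    return grid[int(np.argmax(f1s))], max(f1s)   # chosen threshold and its empirical F1
```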
A regularized boosting method is introduced, for which regularization is obtained through a penalization function. It is shown through oracle inequalities that this method is model adaptive. The rate of convergence of the probability of misclassification is investigated. It is shown that for quite a large class of distributions, the probability of error converges to the Bayes risk at a rate faster than $n^{-(V+2)/(4(V+1))}$, where V is the VC dimension of the "base" class whose elements are combined by boosting methods to obtain an aggregated classifier. The dimension-independent nature of the rates may partially explain the good behavior of these methods in practical problems. Under Tsybakov's noise condition the rate of convergence is even faster. We investigate the conditions necessary to obtain such rates for different base classes. The special case of boosting using decision stumps is studied in detail. We characterize the class of classifiers realizable by aggregating decision stumps. It is shown that some versions of boosting work especially well in high-dimensional logistic additive models. It appears that adding a limited amount of labelling noise to the training data may in certain cases improve the convergence, as has also been suggested by other authors.
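As a point of reference for the base class studied in detail, the snippet below fits a boosted ensemble of decision stumps with scikit-learn's AdaBoost; the paper's penalized variant is not implemented here, and shrinkage is used only as a crude proxy for regularization.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Depth-1 trees are decision stumps; in scikit-learn >= 1.2 the keyword is
# `estimator` (older releases use `base_estimator`).
stump_boosting = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,  # shrinkage, a crude stand-in for the paper's penalization
)
# Usage: stump_boosting.fit(X_train, y_train); stump_boosting.predict(X_test)
```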
We develop and analyze recursive, data-driven, stochastic subgradient methods for optimizing a new, versatile class of convex risk measures, termed here mean-semideviations. Their construction relies on the concept of a risk regularizer, a one-dimensional nonlinear mapping with certain properties, essentially generalizing the positive-part weighting function in the mean-upper-semideviation risk measure. After formally introducing mean-semideviations, we study their basic properties and present a fundamental constructive characterization result, demonstrating their generality. We then introduce and rigorously analyze the MESSAGEp algorithm, an efficient stochastic subgradient procedure for iteratively solving convex mean-semideviation risk-averse problems to optimality. The MESSAGEp algorithm may be derived as an application of the T-SCGD algorithm of (Yang et al., 2018). However, the generic theoretical framework of (Yang et al., 2018) is too narrow and structurally restrictive as far as optimization of mean-semideviations is concerned, including the classical mean-upper-semideviation risk measure. By exploiting problem structure, we propose a substantially weaker set of assumptions, under which we establish pathwise convergence of the MESSAGEp algorithm in the same strong sense as in (Yang et al., 2018). The new framework reveals a fundamental trade-off between the expansiveness of the random position function and the smoothness of the particular mean-semideviation risk measure under consideration. Further, we explicitly show that the class of mean-semideviation problems supported under our framework is strictly larger than the corresponding class of problems supported in (Yang et al., 2018). Thus, the applicability of compositional stochastic optimization is established for a strictly wider spectrum of mean-semideviation problems, justifying the purpose of this work.
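For orientation, the sketch below evaluates the classical mean-upper-semideviation risk of order p on a Monte Carlo sample of the loss, the special case whose positive-part weighting the risk regularizers above generalize; it illustrates the objective only, not the MESSAGEp algorithm.

```python
import numpy as np

def mean_upper_semideviation(z, c=0.5, p=2):
    # rho(Z) = E[Z] + c * ( E[ (Z - E[Z])_+^p ] )^(1/p), estimated from a sample z of Z
    z = np.asarray(z, dtype=float)
    mu = z.mean()
    dev = np.maximum(z - mu, 0.0)     # positive-part weighting of upside deviations
    return mu + c * (np.mean(dev ** p)) ** (1.0 / p)
```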
Positive definite kernels are an important tool in machine learning that enable solving hard or intractable problems by implicitly linearizing the problem geometry. In this paper we develop a set-theoretic interpretation of the Earth Mover's Distance (EMD) and propose the Earth Mover's Intersection (EMI), a positive definite analogue of EMD for sets of different sizes. We provide conditions under which EMD, or certain approximations to EMD, are negative definite. We also present a positive-definiteness-preserving transformation that can be applied to any kernel and can be used to derive positive definite EMD-based kernels, and show that the Jaccard index is simply the result of this transformation. Finally, we evaluate kernels based on EMI and the proposed transformation against EMD-based kernels on computer vision tasks, and show that EMD is generally inferior even when indefinite kernel techniques are used.
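To illustrate the starting point of this discussion, the sketch below forms an exponential kernel from pairwise one-dimensional earth mover's distances; as the abstract notes, such EMD-based kernels need not be positive definite in general, which is precisely what the paper's conditions and transformation address.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def emd_exponential_kernel(samples, gamma=1.0):
    # samples: list of 1-D arrays, each an empirical distribution
    n = len(samples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            # exp(-gamma * EMD) -- not guaranteed to be positive definite in general
            K[i, j] = np.exp(-gamma * wasserstein_distance(samples[i], samples[j]))
    return K
```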
Algorithmic predictions are increasingly used to aid, and in some cases supplant, human decision-making, and this development has placed new demands on the outputs of machine learning procedures. To facilitate human interaction, we desire that they output prediction functions that are in some fashion simple or interpretable. And because they influence consequential decisions, we also desire equitable prediction functions, ones whose allocations benefit (or at least do not harm) disadvantaged groups. We develop a formal model to explore the relationship between simplicity and equity. Although the two concepts appear to be motivated by qualitatively distinct goals, our main result shows a fundamental inconsistency between them. Specifically, we formulate a general framework for producing simple prediction functions, and within this framework we show that every simple prediction function is strictly improvable: there exists a more complex prediction function that is strictly more efficient and also strictly more equitable. Put another way, using a simple prediction function both reduces utility for disadvantaged groups and reduces overall welfare. Our results are not about algorithms as such, but about any process that produces simple models, and they therefore connect to the psychology of stereotypes and to the earlier economics literature on statistical discrimination.
We consider the problem of nonparametric regression under shape constraints. The main examples include isotonic regression (with respect to any partial order), unimodal/convex regression, additive shape-restricted regression, and constrained single index models. We review some of the theoretical properties of the least squares estimator (LSE) in these problems, emphasizing the adaptive nature of the LSE. In particular, we study the behavior of the risk of the LSE and its pointwise limiting distribution theory, with special emphasis on isotonic regression. We survey various methods for constructing pointwise confidence intervals around these shape-restricted functions. We also briefly discuss computation of the LSE and indicate some open research problems and future directions.
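As a minimal example of the least squares estimator under a shape constraint, the sketch below fits isotonic regression with scikit-learn's pool-adjacent-violators routine and adds a naive bootstrap pointwise band; the band is purely illustrative and is not one of the calibrated interval constructions discussed in the survey.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def isotonic_lse_with_bootstrap_band(x, y, x_eval, n_boot=200, rng=None):
    rng = np.random.default_rng(rng)
    fit = IsotonicRegression(out_of_bounds="clip").fit(x, y)   # monotone LSE via PAVA
    est = fit.predict(x_eval)
    boot = np.empty((n_boot, len(x_eval)))
    for b in range(n_boot):
        idx = rng.integers(0, len(x), len(x))                   # nonparametric bootstrap resample
        boot[b] = IsotonicRegression(out_of_bounds="clip").fit(x[idx], y[idx]).predict(x_eval)
    lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
    return est, lo, hi
```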
Searching for an effective dimension reduction space is an important problem in regression, especially for high dimensional data. We propose an adaptive approach based on semiparametric models, which we call the (conditional) minimum average variance estimation (MAVE) method, within quite a general setting. The MAVE method has the following advantages. Most existing methods must undersmooth the nonparametric link function estimator to achieve a faster rate of consistency for the estimator of the parameters (than for that of the nonparametric function). In contrast, a faster consistency rate can be achieved by the MAVE method even without undersmoothing the nonparametric link function estimator. The MAVE method is applicable to a wide range of models, with fewer restrictions on the distribution of the covariates, to the extent that even time series can be included. Because of the faster rate of consistency for the parameter estimators, it is possible for us to estimate the dimension of the space consistently. The relationship of the MAVE method with other methods is also investigated. In particular, a simple outer product gradient estimator is proposed as an initial estimator. In addition to theoretical results, we demonstrate the efficacy of the MAVE method for high dimensional data sets through simulation. Two real data sets are analysed by using the MAVE approach.
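The sketch below illustrates the outer product gradient (OPG) idea mentioned as an initial estimator: local linear fits provide gradient estimates whose averaged outer product is eigen-decomposed to obtain candidate directions. The Gaussian kernel, fixed bandwidth, and plain least squares are simplifying assumptions; the full MAVE iteration is not implemented.

```python
import numpy as np

def opg_directions(X, y, d, bandwidth=1.0):
    # Returns d candidate effective dimension reduction directions via OPG
    n, p = X.shape
    M = np.zeros((p, p))
    for i in range(n):
        diff = X - X[i]                                        # centre at X[i]
        w = np.exp(-np.sum(diff ** 2, axis=1) / (2 * bandwidth ** 2))
        sw = np.sqrt(w)[:, None]
        Z = np.hstack([np.ones((n, 1)), diff])                 # local linear design
        beta, *_ = np.linalg.lstsq(Z * sw, y[:, None] * sw, rcond=None)
        g = beta[1:, 0]                                        # estimated gradient at X[i]
        M += np.outer(g, g) / n
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:d]]                 # leading d eigenvectors
```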
This article is a survey of methods for measuring agreement among corpus annotators. It exposes the mathematics and underlying assumptions of agreement coefficients, covering Krippendorff's alpha as well as Scott's pi and Cohen's kappa; discusses the use of coefficients in several annotation tasks; and argues that weighted, alpha-like coefficients, traditionally less used than kappa-like measures in computational linguistics, may be more appropriate for many corpus annotation tasks, but that their use makes the interpretation of the value of the coefficient even harder.
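For reference, the snippet below computes Cohen's kappa for two annotators over nominal categories: observed agreement corrected by the agreement expected under the annotators' marginal label distributions. Weighted, alpha-like coefficients are not shown.

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    cats = np.union1d(labels_a, labels_b)
    p_o = np.mean(labels_a == labels_b)                        # observed agreement
    p_e = sum(np.mean(labels_a == c) * np.mean(labels_b == c)  # chance agreement
              for c in cats)
    return (p_o - p_e) / (1 - p_e)
```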